Customized stats command #113

Open · wants to merge 3 commits into base: develop
87 changes: 87 additions & 0 deletions src/corppa/poetry_detection/annotation/command_recipes.py
@@ -0,0 +1,87 @@
from collections import Counter, defaultdict
Collaborator:

Would be helpful to add a docstring either at the top or with the recipe explaining how you run this and showing some sample output.

Contributor Author:

Thanks, that's a good idea, although the styling itself is a bit outside of my understanding (I believe it's rendered by an external library).
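For illustration, a minimal sketch of the kind of module docstring being suggested; the invocation and sample wording below are assumptions, not part of this PR:

    """
    Custom Prodigy recipe reporting annotation progress for a poetry-detection dataset.

    Assuming this file is loaded via Prodigy's -F flag, the recipe could be run as:

        prodigy page-stats <dataset-id> -F command_recipes.py

    It prints an overall progress table (e.g. 100 pages with 1 annotation, 50 pages
    with 2 annotations, 200 annotations in total) and a per-session progress table.
    """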


from prodigy.components.db import connect
from prodigy.core import Arg, recipe
from prodigy.errors import RecipeError
from prodigy.util import SESSION_ID_ATTR, msg


@recipe(
    "page-stats",
    dataset=Arg(help="Prodigy dataset ID"),
)
def ppa_stats(dataset: str) -> None:
    # Load examples
    DB = connect()
    if dataset not in DB:
        raise RecipeError(f"Can't find dataset '{dataset}' in database")
    examples = DB.get_dataset_examples(dataset)
    n_examples = len(examples)
    msg.good(f"Loaded {n_examples} annotations from {dataset} dataset")

    # Get stats
    examples_by_page = Counter()
    examples_by_session = defaultdict(list)
    for ex in examples:
        # Skip examples without answer or (page) id
        if "answer" not in ex and "id" not in ex:
            # Ignore "unanswered" examples
            continue
        page_id = ex["id"]
        examples_by_page[page_id] += 1
        session_id = ex[SESSION_ID_ATTR]
        examples_by_session[session_id].append(page_id)
    # Get frequencies of page-level annotation counts
    count_freqs = Counter()
    total = 0
    for count in examples_by_page.values():
        count_freqs[count] += 1
        total += count
Comment on lines +34 to +39
Collaborator:

What you're doing here wasn't immediately obvious based on the variable names - am I understanding correctly that you're counting the number of examples/pages that have the same number of annotations, so you can report something like 100 examples have 1 annotation each, 50 have 2 annotations each, etc.?

You can let Counter do the aggregation for you by using Counter(examples_by_page.values()).

I think it would be more readable to tally this way:

Suggested change
-    # Get frequencies of page-level annotation counts
-    count_freqs = Counter()
-    total = 0
-    for count in examples_by_page.values():
-        count_freqs[count] += 1
-        total += count
+    # Get frequencies of page-level annotation counts
+    count_freqs = Counter(examples_by_page.values())
+    total = sum(examples_by_page.values())

Contributor Author:

These are the count frequencies, the frequency at which each page image has been annotated so far (like you described)

Good point on the Counter / sum usage. Not sure why I missed that.
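As a quick standalone illustration of the suggested Counter / sum usage (made-up data, not from this PR):

    from collections import Counter

    # hypothetical per-page annotation counts: three pages annotated once, two pages twice
    examples_by_page = Counter({"page-a": 1, "page-b": 1, "page-c": 1, "page-d": 2, "page-e": 2})

    count_freqs = Counter(examples_by_page.values())  # Counter({1: 3, 2: 2})
    total = sum(examples_by_page.values())  # 7 annotations in total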


    # Build overall table
    header = ["# Annotations"]
    row = ["# Pages"]
    for key, val in sorted(count_freqs.items()):
        header.append(f"{key}")
        row.append(val)
    header.append("Total")
    row.append(total)
    aligns = ["r", "r", "r", "r"]
    msg.table(
        [row],
        title="Overall Annotation Progress",
        header=header,
        aligns=aligns,
        divider=True,
    )

    # Build session table
    data = []
    total = 0
    for session, pages in sorted(examples_by_session.items()):
Collaborator:

Is this sorting so you display sessions in alpha order?

Contributor Author:

I guess so? I don't remember the reason for this; it might be residual code since the other commands sorted things.

        count = len(pages)
        unique = len(set(pages))
        total += count
Collaborator:

Is the total cumulative here? Maybe worth renaming the variable to clarify

Contributor Author:

Fair enough, this is also residual code in terms of naming conventions for total.

Contributor Author:

...but yes, it's the total number of annotations collected, as described here.

Contributor Author:

I can rename the variable to cumulative_total if it helps, but that seems too long for the reporting output itself.

Collaborator:

That makes sense. 👍 to renaming as cumulative_total
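A sketch of what the renamed loop might look like (illustrative only, not a committed change):

    # Build session table
    data = []
    cumulative_total = 0
    for session, pages in sorted(examples_by_session.items()):
        count = len(pages)
        unique = len(set(pages))
        cumulative_total += count
        data.append([session, count, unique, cumulative_total])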

        row = [session, count, unique, total]
        data.append(row)
    header = [
        "Session",
        "Count",
        "Unique",
        "Total",
    ]
    aligns = ["l", "r", "r", "r"]
    # info = {
    #     "Session": "Session name",
    #     "Count": "Completed annotations",
    #     "Unique": "Unique annotations (distinct pages)",
    #     "Total": "Total annotations collected",
    # }
    # msg.table(info, title="Legend")
Comment on lines +74 to +80
Collaborator:

leftover comments to be cleaned up?

Contributor Author:

No, it's a design choice. Uncommented, this prints out the legend for this table, but it's fairly verbose.

Collaborator:

ah, I see. Maybe add a comment about the comment, then, so someone else doesn't clean it up?

Contributor Author:

I mean, I'm happy to remove it, but this was more of a "do we want a legend" question that I had forgotten about.
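A sketch of the kind of explanatory note being suggested, so the commented-out block is not mistaken for leftover code (wording is illustrative):

    # NOTE: intentionally commented out; uncomment the block below to print a
    # legend for the session table (left off by default because it is verbose).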

    msg.table(
        data,
        title="Session Annotation Progress",
        header=header,
        aligns=aligns,
        divider=True,
    )
@@ -24,7 +24,7 @@
mock_prodigy_preprocess = MagicMock()
sys.modules["prodigy.components.preprocess"] = mock_prodigy_preprocess

-from corppa.poetry_detection.annotation.recipe import (
+from corppa.poetry_detection.annotation.annotation_recipes import (
ReviewStream,
add_image,
add_images,
@@ -57,7 +57,7 @@ def test_add_image():
assert example["image"] == f"prefix/{example['image_path']}"


@patch("corppa.poetry_detection.annotation.recipe.add_image")
@patch("corppa.poetry_detection.annotation.annotation_recipes.add_image")
def test_add_images(mock_add_image):
examples = [{"image_path": "a"}, {"image_path": "b"}]

@@ -74,7 +74,7 @@ def test_add_images(mock_add_image):
)


@patch("corppa.poetry_detection.annotation.recipe.add_image")
@patch("corppa.poetry_detection.annotation.annotation_recipes.add_image")
def test_remove_image_data(mock_add_image):
# Empty list (i.e. no examples)
assert remove_image_data([]) == []
@@ -113,7 +113,10 @@ def test_remove_image_data(mock_add_image):
assert mock_add_image.call_args == call(mixed_examples[-1], image_prefix="prefix")


@patch("corppa.poetry_detection.annotation.recipe.SESSION_ID_ATTR", "session_id")
@patch(
"corppa.poetry_detection.annotation.annotation_recipes.SESSION_ID_ATTR",
"session_id",
)
def test_get_session_name():
# Typical case: drop db prefix
example = {"session_id": "db-id-alice"}
@@ -136,8 +139,8 @@ def test_remove_label_prefix():
assert remove_label_prefix("no_prefix") == "no_prefix"


@patch("corppa.poetry_detection.annotation.recipe.get_session_name")
@patch("corppa.poetry_detection.annotation.recipe.add_label_prefix")
@patch("corppa.poetry_detection.annotation.annotation_recipes.get_session_name")
@patch("corppa.poetry_detection.annotation.annotation_recipes.add_label_prefix")
def test_add_session_prefix(mock_add_label_prefix, mock_get_session_name):
mock_get_session_name.return_value = "session"

@@ -157,7 +160,7 @@ def test_add_session_prefix(mock_add_label_prefix, mock_get_session_name):
mock_add_label_prefix.assert_has_calls([call("a", "session"), call("b", "session")])


@patch("corppa.poetry_detection.annotation.recipe.remove_label_prefix")
@patch("corppa.poetry_detection.annotation.annotation_recipes.remove_label_prefix")
def test_remove_session_prefix(mock_remove_label_prefix):
# example without spans
example = {"text": "some text..."}
@@ -172,7 +175,7 @@ def test_remove_session_prefix(mock_remove_label_prefix):
mock_remove_label_prefix.assert_has_calls([call("a"), call("b")])


@patch("corppa.poetry_detection.annotation.recipe.remove_label_prefix")
@patch("corppa.poetry_detection.annotation.annotation_recipes.remove_label_prefix")
def test_has_span_overlap(mock_remove_label_prefix):
# example without spans
ex_no_spans = {}
@@ -226,7 +229,7 @@ def test_has_span_overlap(mock_remove_label_prefix):
mock_remove_label_prefix.assert_not_called()


@patch("corppa.poetry_detection.annotation.recipe.has_span_overlap")
@patch("corppa.poetry_detection.annotation.annotation_recipes.has_span_overlap")
def test_validate_review_stream(mock_has_span_overlap):
mock_has_span_overlap.side_effect = [False, True, False]

@@ -263,8 +266,8 @@ def test_init(self, mock_get_data):
mock_get_data.assert_called_once()
assert mock_get_data.call_args == call(data, "pfx", "fetch_media")

@patch("corppa.poetry_detection.annotation.recipe.add_session_prefix")
@patch("corppa.poetry_detection.annotation.recipe.get_session_name")
@patch("corppa.poetry_detection.annotation.annotation_recipes.add_session_prefix")
@patch("corppa.poetry_detection.annotation.annotation_recipes.get_session_name")
def test_create_review_example(
self, mock_get_session_name, mock_add_session_prefix
):
@@ -329,7 +332,7 @@ def test_create_review_example(
}

@patch.object(ReviewStream, "create_review_example")
@patch("corppa.poetry_detection.annotation.recipe.add_image")
@patch("corppa.poetry_detection.annotation.annotation_recipes.add_image")
def test_get_data(self, mock_add_image, mock_create_review_example):
mock_prodigy_preprocess.fetch_media.reset_mock(
return_value=True, side_effect=True
@@ -371,8 +374,11 @@ def test_get_data(self, mock_add_image, mock_create_review_example):
mock_add_image.assert_has_calls([call("review", None) for _ in range(3)])


@patch("corppa.poetry_detection.annotation.recipe.INPUT_HASH_ATTR", "input_hash")
@patch("corppa.poetry_detection.annotation.recipe.ReviewStream")
@patch(
"corppa.poetry_detection.annotation.annotation_recipes.INPUT_HASH_ATTR",
"input_hash",
)
@patch("corppa.poetry_detection.annotation.annotation_recipes.ReviewStream")
def test_get_review_stream(mock_stream):
mock_prodigy.set_hashes.reset_mock(return_value=True, side_effect=True)
mock_prodigy.set_hashes.side_effect = lambda x, overwrite, input_keys, task_keys: x