Customized stats command #113
base: develop
@@ -0,0 +1,87 @@
from collections import Counter, defaultdict

from prodigy.components.db import connect
from prodigy.core import Arg, recipe
from prodigy.errors import RecipeError
from prodigy.util import SESSION_ID_ATTR, msg


@recipe(
    "page-stats",
    dataset=Arg(help="Prodigy dataset ID"),
)
def ppa_stats(dataset: str) -> None:
    # Load examples
    DB = connect()
    if dataset not in DB:
        raise RecipeError(f"Can't find dataset '{dataset}' in database")
    examples = DB.get_dataset_examples(dataset)
    n_examples = len(examples)
    msg.good(f"Loaded {n_examples} annotations from {dataset} dataset")

    # Get stats
    examples_by_page = Counter()
    examples_by_session = defaultdict(list)
    for ex in examples:
        # Skip examples without answer or (page) id
        if "answer" not in ex or "id" not in ex:
            # Ignore "unanswered" examples
            continue
        page_id = ex["id"]
        examples_by_page[page_id] += 1
        session_id = ex[SESSION_ID_ATTR]
        examples_by_session[session_id].append(page_id)
    # Get frequencies of page-level annotation counts
    count_freqs = Counter()
    total = 0
    for count in examples_by_page.values():
        count_freqs[count] += 1
        total += count
Comment on lines +34 to +39
What you're doing here wasn't immediately obvious based on the variable names. Am I understanding correctly that you're counting the number of examples/pages that have the same number of annotations, so you can report something like 100 examples have 1 annotation each, 50 have 2 annotations each, etc.? You can let I think it would be more readable to tally this way:
Suggested change
These are the count frequencies: the frequency at which each page image has been annotated so far (like you described). Good point on the
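The suggested change itself is not preserved in this thread. As a hedged illustration of the idea under discussion (letting `Counter` do the tallying), here is a minimal sketch on toy data. The page ids are made up, and this is one possible simplification, not necessarily what the reviewer proposed.

```python
from collections import Counter

# Toy stand-in for examples_by_page (hypothetical page ids):
# page id -> number of annotations collected for that page
examples_by_page = Counter({"page-a": 1, "page-b": 2, "page-c": 1})

# Frequency of each page-level annotation count:
# annotation count -> number of pages annotated that many times
count_freqs = Counter(examples_by_page.values())

# Total number of annotations across all pages
total = sum(examples_by_page.values())

print(sorted(count_freqs.items()))  # [(1, 2), (2, 1)]
print(total)  # 4
```

Here two pages have one annotation each and one page has two, so the table would report 2 pages under "1", 1 page under "2", and a total of 4.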

    # Build overall table
    header = ["# Annotations"]
    row = ["# Pages"]
    for key, val in sorted(count_freqs.items()):
        header.append(f"{key}")
        row.append(val)
    header.append("Total")
    row.append(total)
    # One alignment per column; the column count varies with the data
    aligns = ["r"] * len(header)
    msg.table(
        [row],
        title="Overall Annotation Progress",
        header=header,
        aligns=aligns,
        divider=True,
    )

    # Build session table
    data = []
    total = 0
    for session, pages in sorted(examples_by_session.items()):
Is this sorting so you display sessions in alpha order?
I guess so? I don't remember the reason for this; it might be residual code since the other commands sorted things.
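For reference, a minimal sketch of what the `sorted(...)` call does here, using made-up session names: sorting a dict's items compares the `(key, value)` tuples by key first, so rows come out ordered by session id. Note that Python compares strings by code point, so this is case-sensitive rather than strictly alphabetical.

```python
from collections import defaultdict

# Toy stand-in for examples_by_session (hypothetical session and page ids)
examples_by_session = defaultdict(list)
examples_by_session["stats-bob"].append("page-a")
examples_by_session["stats-alice"].extend(["page-b", "page-b"])

# Keys are unique, so the tuple comparison never falls through to the values
ordered = [session for session, pages in sorted(examples_by_session.items())]
print(ordered)  # ['stats-alice', 'stats-bob']
```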
        count = len(pages)
        unique = len(set(pages))
        total += count
Is the total cumulative here? Maybe worth renaming the variable to clarify.
Fair enough, this is also residual code in terms of naming conventions for total.
...but yes, it's the total annotations collected as described here
I can rename the variable to
That makes sense. 👍 to renaming as
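To make the cumulative behaviour concrete, here is a small sketch of the per-session rows on toy data. The session and page ids are made up, and `running_totals` is a hypothetical name chosen for illustration; the name agreed on in this thread is not preserved.

```python
from itertools import accumulate

# Toy stand-in for examples_by_session (hypothetical session and page ids)
examples_by_session = {
    "stats-alice": ["page-a", "page-b", "page-b"],
    "stats-bob": ["page-a"],
}

sessions = sorted(examples_by_session)
# Annotations per session
counts = [len(examples_by_session[s]) for s in sessions]
# Distinct pages per session
uniques = [len(set(examples_by_session[s])) for s in sessions]
# Cumulative annotations up to and including each session
running_totals = list(accumulate(counts))

rows = [list(r) for r in zip(sessions, counts, uniques, running_totals)]
print(rows)
# [['stats-alice', 3, 2, 3], ['stats-bob', 1, 1, 4]]
```

The last column grows row by row, which is why a name that signals "cumulative" reads more clearly than a bare `total`.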
        row = [session, count, unique, total]
        data.append(row)
    header = [
        "Session",
        "Count",
        "Unique",
        "Total",
    ]
    aligns = ["l", "r", "r", "r"]
    # info = {
    #     "Session": "Session name",
    #     "Count": "Completed annotations",
    #     "Unique": "Unique annotations (distinct pages)",
    #     "Total": "Total annotations collected",
    # }
    # msg.table(info, title="Legend")
Comment on lines +74 to +80
Leftover comments to be cleaned up?
No, it's a design choice. Uncommented, this prints out the legend for this table, but it's fairly verbose.
Ah, I see. Maybe add a comment about the comment, then, so someone else doesn't clean it up?
I mean, I'm happy to remove it, but this was more of a "do we want a legend" question that I had forgotten about.
    msg.table(
        data,
        title="Session Annotation Progress",
        header=header,
        aligns=aligns,
        divider=True,
    )
Would be helpful to add a docstring either at the top or with the recipe explaining how you run this and showing some sample output.
Thanks, that's a good idea. Although the styling itself is a bit outside of my understanding (I believe it's rendered by an external library)
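As a starting point for the requested docstring, here is a hedged sketch. The column descriptions are taken from the commented-out legend in the diff; the recipe file path is a placeholder, and the `@recipe` decorator and function body are omitted so the snippet runs without Prodigy installed. Sample output is left out because its exact rendering depends on the table-printing library.

```python
def ppa_stats(dataset: str) -> None:
    """Print annotation progress for a Prodigy dataset.

    Usage (the recipe file path is a placeholder):

        prodigy page-stats DATASET -F path/to/recipe.py

    Two tables are printed.

    Overall Annotation Progress:
        the number of pages for each page-level annotation count,
        plus the total number of annotations.

    Session Annotation Progress (one row per session):
        Session: session name
        Count:   completed annotations
        Unique:  unique annotations (distinct pages)
        Total:   total annotations collected
    """
    ...  # body omitted; see the diff above
```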