Customized stats command #113

Open
wants to merge 3 commits into
base: develop
Conversation

laurejt
Contributor

@laurejt laurejt commented Nov 1, 2024

ref: #112

Primary: Additional command recipe for displaying progress specialized to our annotation task

Secondary: Renaming recipe files to better describe the kinds of Prodigy recipes they will contain.

@laurejt laurejt self-assigned this Nov 1, 2024

codecov bot commented Nov 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.34%. Comparing base (4f9d0ca) to head (e35a19b).
Report is 24 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #113      +/-   ##
===========================================
+ Coverage    58.62%   62.34%   +3.71%     
===========================================
  Files            7        8       +1     
  Lines          568      725     +157     
===========================================
+ Hits           333      452     +119     
- Misses         235      273      +38     

Collaborator

@rlskoeser rlskoeser left a comment

This reporting recipe looks really great! Asked a few questions to clarify things and made a couple of small suggestions, but I think it's fine to merge whenever you're happy with it.

Question about structure, since you've moved things around a bit: what do you think about dropping the poetry_detection directory and just putting all of this in corppa.annotation? There's nothing specific to poetry detection in the annotation recipes that I can think of.

src/corppa/poetry_detection/annotation/command_recipes.py
Comment on lines +34 to +39
# Get frequencies of page-level annotation counts
count_freqs = Counter()
total = 0
for count in examples_by_page.values():
    count_freqs[count] += 1
    total += count
Collaborator

What you're doing here wasn't immediately obvious based on the variable names - am I understanding correctly that you're counting the number of examples/pages that have the same number of annotations, so you can report something like 100 examples have 1 annotation each, 50 have 2 annotations each, etc?

You can let Counter do the aggregation for you by using Counter(examples_by_page.values()).

I think it would be more readable to tally this way:

Suggested change
- # Get frequencies of page-level annotation counts
- count_freqs = Counter()
- total = 0
- for count in examples_by_page.values():
-     count_freqs[count] += 1
-     total += count
+ # Get frequencies of page-level annotation counts
+ count_freqs = Counter(examples_by_page.values())
+ total = sum(examples_by_page.values())

Contributor Author

These are the count frequencies: how many page images have received each number of annotations so far (like you described).

Good point on the Counter / sum usage. Not sure why I missed that.
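
A minimal sketch of the equivalence discussed in this thread, assuming `examples_by_page` maps a page identifier to the number of annotations collected for that page. The sample data and the `loop_*` names are hypothetical and not taken from the project.

```python
from collections import Counter

# Hypothetical sample: page id -> number of annotations collected for that page
examples_by_page = {"page-a": 1, "page-b": 2, "page-c": 1}

# Explicit loop, as in the original diff
loop_freqs = Counter()
loop_total = 0
for count in examples_by_page.values():
    loop_freqs[count] += 1
    loop_total += count

# Equivalent tally using Counter's constructor and the built-in sum
count_freqs = Counter(examples_by_page.values())
total = sum(examples_by_page.values())
```

Both versions produce the same frequencies and the same total; the second simply lets `Counter` do the per-value counting.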

# Build session table
data = []
total = 0
for session, pages in sorted(examples_by_session.items()):
Collaborator

Is this sorting so you display sessions in alpha order?

Contributor Author

I guess so? I don't remember the reason for this; it might be residual code since the other commands sorted things.
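
For context, a small sketch of what `sorted(examples_by_session.items())` does, assuming session names are plain strings (the names below are made up). Items are `(key, value)` tuples, and because dictionary keys are unique the order is decided by the session name alone. Note that this is plain string comparison, so uppercase names sort before lowercase ones.

```python
# Hypothetical session data: session name -> list of annotated page ids
examples_by_session = {
    "session-b": ["p1", "p2"],
    "session-a": ["p3"],
}

# Sorting the items orders them by session name
ordered_sessions = [
    session for session, pages in sorted(examples_by_session.items())
]
```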

Comment on lines +74 to +80
# info = {
#     "Session": "Session name",
#     "Count": "Completed annotations",
#     "Unique": "Unique annotations (distinct pages)",
#     "Total": "Total annotations collected",
# }
# msg.table(info, title="Legend")
Collaborator

leftover comments to be cleaned up?

Contributor Author

No, it's a design choice. Uncommented, this prints out the legend for this table, but it's fairly verbose.

Collaborator

ah, I see. Maybe add a comment about the comment, then, so someone else doesn't clean it up?

Contributor Author

I mean, I'm happy to remove it, but this was more of a "do we want a legend" question that I had forgotten about.
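
One way to settle the legend question without keeping commented-out code would be an opt-in flag. This is only a sketch: `legend_lines` and `show_legend` are hypothetical names, and plain string formatting stands in for the table rendering used in the recipe.

```python
# Legend text copied from the commented-out block in the diff
LEGEND = {
    "Session": "Session name",
    "Count": "Completed annotations",
    "Unique": "Unique annotations (distinct pages)",
    "Total": "Total annotations collected",
}


def legend_lines(show_legend: bool = False) -> list[str]:
    """Return formatted legend lines, or an empty list when the legend is off."""
    if not show_legend:
        return []
    # Pad column names to the longest one so the descriptions line up
    width = max(len(name) for name in LEGEND)
    return [
        f"{name.ljust(width)}  {description}"
        for name, description in LEGEND.items()
    ]
```

With the default, nothing extra is printed, which keeps the report compact; passing `show_legend=True` would restore the verbose legend.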

for session, pages in sorted(examples_by_session.items()):
    count = len(pages)
    unique = len(set(pages))
    total += count
Collaborator

Is the total cumulative here? Maybe worth renaming the variable to clarify

Contributor Author

Fair enough, this is also residual code in terms of naming conventions for total.

Contributor Author

...but yes, it's the total annotations collected as described here

Contributor Author

I can rename the variable to cumulative_total if it helps but that seems too long for the reporting output itself

Collaborator

That makes sense. 👍 to renaming as cumulative_total
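
A sketch of how the session loop might read after the rename agreed on in this thread, assuming `examples_by_session` maps a session name to the list of page ids annotated in that session. The sample data and the `rows` list are hypothetical; the real recipe builds its table differently.

```python
# Hypothetical session data: session name -> page ids annotated in that session
examples_by_session = {
    "session-a": ["p1", "p2", "p2"],
    "session-b": ["p2", "p3"],
}

rows = []
cumulative_total = 0
for session, pages in sorted(examples_by_session.items()):
    count = len(pages)         # annotations completed in this session
    unique = len(set(pages))   # distinct pages annotated in this session
    cumulative_total += count  # running total across the sessions seen so far
    rows.append((session, count, unique, cumulative_total))
```

The variable carries the longer, clearer name, while the column header shown in the report can stay short (for example "Total").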

@@ -0,0 +1,87 @@
from collections import Counter, defaultdict
Collaborator

Would be helpful to add a docstring either at the top or with the recipe explaining how you run this and showing some sample output.

Contributor Author

Thanks, that's a good idea. Although the styling itself is a bit outside of my understanding (I believe it's rendered by an external library)

@rlskoeser
Collaborator

As part of this refactor, I suggest moving the tested part of the prodigy/annotation code into some kind of shared utility file so that we can exclude the recipe code from code coverage.

Here's the syntax for excluding files from codecov reporting: https://docs.codecov.com/docs/ignoring-paths

Labels
None yet
Projects
Status: In Progress
2 participants