refactor(scrapers.tasks.update_from_text): reuse make_citation in update_from_text #4913

Merged on Jan 16, 2025 (12 commits)

grossir (Contributor) commented Jan 13, 2025

Solves #4903

  • Move make_citation from cl_scrape_opinions into scrapers.utils
  • Move citation_is_duplicated from cl_back_scrape_citations into scrapers.utils
  • Delete scraped_citation_object_is_valid; validation now relies on eyecite, which make_citation uses
  • Refactor test site to account for changes


sentry-io bot commented Jan 13, 2025

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: cl/scrapers/tasks.py

Function: update_document_from_text
Unhandled Issue: AttributeError: 'NoneType' object has no attribute 'items' (cl.scrapers.tasks.extract_d...)
Event Count: 122

Comment on lines +74 to +75:

    if not citation or citation_is_duplicated(citation, data):
        continue
Contributor

Would it make more sense to log the error here and not pass the court ID into make_citation?

Contributor Author

It's awkward to pass the court_id only for logging, but if we didn't log inside this function, we would have to make a separate logger.error call every time make_citation returns None.
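To make the trade-off above concrete, here is a minimal sketch of the "log inside the helper" design. It is not the code from this PR: `make_citation_sketch`, `collect_citations`, the dict-based citation, and the toy regex are all assumptions for illustration, and the regex merely stands in for the parsing that eyecite does in the real make_citation.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)

# Toy pattern standing in for eyecite's parsing (assumption for this sketch);
# real citations are far more varied than "<volume> <reporter> <page>".
CITE_RE = re.compile(
    r"^(?P<volume>\d+) (?P<reporter>[A-Za-z0-9. ]+?) (?P<page>\d+)$"
)


def make_citation_sketch(cite_str: str, court_id: str) -> Optional[dict]:
    """Parse a citation string; log and return None when parsing fails."""
    match = CITE_RE.match(cite_str.strip())
    if match is None:
        # court_id is only needed for this message. Logging here means
        # callers do not need their own logger.error call.
        logger.error(
            "Could not parse citation '%s' for court '%s'", cite_str, court_id
        )
        return None
    return match.groupdict()


def collect_citations(cite_strings, court_id, existing):
    """Keep only citations that parse and are not already known."""
    new = []
    for cite_str in cite_strings:
        citation = make_citation_sketch(cite_str, court_id)
        # Same shape as the reviewed lines: skip on failure or duplicate.
        if not citation or citation in existing or citation in new:
            continue
        new.append(citation)
    return new
```

With this shape the caller stays a two-line guard, at the cost of the helper taking a parameter it uses only for the log message.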

@grossir grossir self-assigned this Jan 13, 2025
…as not present

Also, fix dictionary modification while iterating on update_document_from_text
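For readers unfamiliar with this class of bug, the sketch below shows the general failure and the usual fix. The function names and the shape of the `changes` dictionary are hypothetical; this is not the code in update_document_from_text.

```python
def drop_empty_values_buggy(changes: dict) -> dict:
    """Deletes keys while iterating the live dict view."""
    for key, value in changes.items():
        if not value:
            # Python raises RuntimeError on the next iteration step:
            # "dictionary changed size during iteration".
            del changes[key]
    return changes


def drop_empty_values(changes: dict) -> dict:
    """Iterates over a snapshot, so deleting from the dict is safe."""
    for key, value in list(changes.items()):
        if not value:
            del changes[key]
    return changes
```

Copying the items with `list(...)` is the smallest change; building a new dictionary with a comprehension is an equally valid alternative.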
Contributor Author

grossir commented Jan 14, 2025

I fixed the tests and added a call to extract_doc_content in update_from_text for cases where text extraction had failed. I noticed the command wasn't covering this use case, and it's actually a very important one.
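The fallback described above could look roughly like the following sketch. Everything here is an assumption made for illustration: the dict-based opinion, the `plain_text` field name, and the `extract` callable that stands in for extract_doc_content. The real task's signature and behavior are not shown in this conversation.

```python
from typing import Callable


def text_for_update(opinion: dict, extract: Callable[[dict], None]) -> str:
    """Return the opinion text, re-running extraction once if it is missing."""
    # "plain_text" is a hypothetical field name for this sketch.
    if not opinion.get("plain_text"):
        # Stand-in for calling extract_doc_content on the document.
        extract(opinion)
    return opinion.get("plain_text") or ""


def fake_extract(opinion: dict) -> None:
    """Hypothetical extractor that fills in the missing text."""
    opinion["plain_text"] = "2024 PA Super 12 ..."
```

The point is only the ordering: try extraction first when the text is empty, then run the citation update on whatever text is available.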

Please give it another check @flooie

UPDATE: we have already completed all the steps needed before merging this PR:

  1. merge "fix(extract_from_text): now returns a plain citation string" (juriscraper#1298) in Juriscraper
  2. publish a new Juriscraper release
  3. update the poetry requirement in this PR to that new Juriscraper version

These steps ensure that merging doesn't break anything.
I tested it in my local env with the Juriscraper branch, and it's working!

# copy some PA Super cluster where citation extraction had failed; then

docker exec -it cl-django pip uninstall juriscraper
docker exec -it cl-django pip install https://github.com/freelawproject/juriscraper/archive/extract_from_text_citations.zip

docker exec -it cl-django python /opt/courtlistener/manage.py update_from_text --courts juriscraper.opinions.united_states.state.pasuperct --verbosity 3 --date-filed-gte=2024-12-01 --date-filed-lte=2026-01-01 --cluster-status=Published

@grossir grossir requested a review from flooie January 14, 2025 00:57
@grossir grossir assigned flooie and unassigned flooie and grossir Jan 14, 2025
Contributor

flooie commented Jan 16, 2025

@grossir this looks great, and it will be great to get rid of these errors.

@flooie flooie enabled auto-merge January 16, 2025 20:48
@flooie flooie merged commit 3dd887a into main Jan 16, 2025
15 checks passed
@flooie flooie deleted the extract_from_text_uses_eyecite branch January 16, 2025 20:59

sentry-io bot commented Jan 16, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ ValueError: The 'local_path' attribute has no file associated with it. (cl.lib.microservice_utils in microservice)
