Add in .csv version of some of the largest pdfs. #12

Open
prisonpolicy opened this issue Jun 29, 2013 · 4 comments

Comments

@prisonpolicy
Member

The SumOfUs and ColorofChange filings were 1000+ pages with ~50,000 comments. Those may have been too large to try to OCR as they aren't showing up in the search.

However, I have these in .csv format, and the ability to filter out the redundant comments from the ~5,000 original ones. That said, I'm not sure how to link that back to the individual pages of our system...
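For reference, filtering out the redundant petition comments could look something like the sketch below. It is only an illustration: the `comment` column name, the `normalize` rule, and the sample rows are assumptions, since the real .csv layout is not described in this thread.

```python
import csv
import io
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so identical form letters compare equal."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def unique_comments(rows, field="comment"):
    """Yield the first row seen for each distinct normalized comment text.

    `field` is a hypothetical column name; adjust it to the real export.
    """
    seen = set()
    for row in rows:
        key = normalize(row.get(field) or "")
        if key and key not in seen:
            seen.add(key)
            yield row


# Tiny in-memory sample standing in for a petition export (made-up rows).
sample = io.StringIO(
    "name,comment\n"
    "A,Please lower the rates.\n"
    "B,please  lower the rates\n"
    "C,My family cannot afford these calls.\n"
)
kept = list(unique_comments(csv.DictReader(sample)))
print(len(kept), "unique of 3")
```

Keeping the first row per normalized text preserves one copy of each form letter alongside the original comments. Linking rows back to individual pages would still need a page or filing identifier in the .csv, which this sketch does not assume.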

@prisonpolicy
Member Author

Hmmm. On reflection, one easier solution might be to just take the entries that we rejected from the OCR because they were too large and give them a manual check as to whether they would in fact OCR. I'd guess that most of the really large files are large because they are images, dirty faxes, etc. But a small minority would be the SumofUs, ColorofChange and CREDO petitions. (I don't have CREDO's .csv.) And then we could just let the OCR run on those huge files....
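A rough sketch of how the candidates for that manual check could be listed, largest first. The directory, the `.pdf` extension filter, and the size cutoff are all placeholders here, not details of our actual setup.

```python
import os
import tempfile


def oversized_files(directory, min_bytes):
    """Return (size, path) pairs for PDFs at or above min_bytes, largest first."""
    hits = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.lower().endswith(".pdf"):
            size = entry.stat().st_size
            if size >= min_bytes:
                hits.append((size, entry.path))
    return sorted(hits, reverse=True)


# Demo with throwaway files; real use would point at the filings directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, size in [("small.pdf", 10), ("huge.pdf", 5000), ("notes.txt", 9000)]:
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(b"x" * size)
    review_queue = [
        (size, os.path.basename(path)) for size, path in oversized_files(tmp, 1000)
    ]

print(review_queue)
```

File size alone does not say whether a file is a scanned image or a text-heavy petition, so the list is only a starting point for the manual look.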

@gyepisam
Member

gyepisam commented Jul 3, 2013

On Sat, Jun 29, 2013 at 08:29:12AM -0700, Peter Wagner wrote:

> The SumOfUs and ColorofChange filings were 1000+ pages with ~50,000 comments. Those may have been too large to try to OCR as they aren't showing up in the search.

There are about 80 documents that failed to OCR for various reasons. I'll need
to review them manually...

> However, I have these in .csv format, and the ability to filter out the redundant comments from the ~5,000 original ones. That said, I'm not sure how to link that back to the individual pages of our system...

Depending on its structure, we may be able to do it. Could you send me a copy
(or make it available somewhere for download)?

-Gyepi

@gyepisam
Member

gyepisam commented Jul 3, 2013

On Sat, Jun 29, 2013 at 08:45:00AM -0700, Peter Wagner wrote:

> Hmmm. On reflection, one easier solution might be to just take the entries that we rejected from the OCR because they were too large and give them a manual check as to whether they would in fact OCR. I'd guess that most of the really large files are large because they are images, dirty faxes, etc. But a small minority would be the SumofUs, ColorofChange and CREDO petitions. (I don't have CREDO's .csv.) And then we could just let the OCR run on those huge files....

Yes, that would work. Unfortunately, we're not rejecting any documents; they
just fail to process. So I need to determine why they failed and create new
circumstances that will help them succeed. According to the error logs, some
of the failures are internal to the PDF files, but I'm sure there's still
room for improvement.

-Gyepi

@prisonpolicy
Member Author

I'll send these files to Gyepi via email.
-p
