Add in .csv version of some of the largest pdfs. #12

prisonpolicy · 2013-06-29T15:29:11Z

The SumOfUs and ColorofChange filings were 1000+ pages with ~50,000 comments. Those may have been too large to try to OCR as they aren't showing up in the search.

However, I have these in .csv format, and the ability to filter out the redundant comments from the ~5,000 original ones. That said, I'm not sure how to link that back to the individual pages of our system...
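As a rough sketch of that filtering step: the layout of the petition .csv files isn't described in this thread, so the `comment` column name, the sample rows, and the `unique_comments` helper below are all assumptions made for illustration. The idea is simply to keep the first row for each distinct comment text, ignoring case and whitespace differences.

```python
import csv
import io


def unique_comments(rows, text_field="comment"):
    """Keep the first row for each distinct comment text.

    `text_field` is a hypothetical column name; adjust it to the real CSV header.
    """
    seen = set()
    kept = []
    for row in rows:
        # Normalize whitespace and case so trivially different copies match.
        key = " ".join(row[text_field].split()).casefold()
        if key and key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Made-up sample data standing in for a petition export.
sample = io.StringIO(
    "name,comment\n"
    "A,Please lower the rates.\n"
    "B,please  lower the rates.\n"
    "C,My family cannot afford these calls.\n"
)
rows = list(csv.DictReader(sample))
originals = unique_comments(rows)
```

This only catches exact repeats after normalization; petition form letters with small personal edits would need a fuzzier comparison.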

prisonpolicy · 2013-06-29T15:44:59Z

Hmmm. On reflection, one easier solution might be to just take the entries that we rejected from the OCR because they were too large and manually check whether they would in fact OCR. I'd guess that most of the really large files are large because they are images, dirty faxes, etc. But a small minority would be the SumOfUs, ColorofChange and CREDO petitions. (I don't have CREDO's .csv.) And then we could just let the OCR run on those huge files...

gyepisam · 2013-07-03T02:58:47Z

On Sat, Jun 29, 2013 at 08:29:12AM -0700, Peter Wagner wrote:

The SumOfUs and ColorofChange filings were 1000+ pages with ~50,000 comments. Those may have been too large to try to OCR as they aren't showing up in the search.

There are about 80 documents that failed to OCR for various reasons. I'll need
to review them manually...

However, I have these in .csv format, and the ability to filter out the redundant comments from the ~5,000 original ones. That said, I'm not sure how to link that back to the individual pages of our system...

Depending on its structure, we may be able to do it. Could you send me a copy
(or make it available somewhere for download)?

-Gyepi

gyepisam · 2013-07-03T03:04:24Z

On Sat, Jun 29, 2013 at 08:45:00AM -0700, Peter Wagner wrote:

Hmmm. On reflection, one easier solution might be to just take the entries that we rejected from the OCR because they were too large and give them a manual check as to whether they would in fact OCR. I'd guess that most of the really large files are large because they are images, dirty faxes, etc. But a small minority would be the SumOfUs, ColorofChange and CREDO petitions. (I don't have CREDO's .csv.) And then we could just let the OCR run on those huge files....

Yes, that would work. Unfortunately, we're not rejecting any documents; they
just fail to process. So I need to determine why they failed and set up
conditions that will help them succeed. According to the error logs, some
of the failures are internal to the PDF files, but I'm sure there's still
room for improvement.

-Gyepi

prisonpolicy · 2013-07-03T17:19:43Z

I'll send these files to Gyepi via email.
-p
