Disabling page auto-rotations #57
Comments
Adding |
@angelborroy-ks thanks for your quick reply. Unfortunately, -nopreproc is not disabling auto-rotation as expected. We've reached out to Tobias for more insight into the pdfsandwich options, so we hope we can resolve this with his assistance. If it turns out we need to explore other options, do you have any recommendations on other tools to use for a searchable image layer? As background, we're using tesseract for base OCR at the metadata document level, so the gap we need to fill is allowing users to search for words and jump to pages within a document based on matches on top of the page image layer. Many thanks in advance for your continued guidance! We're about 1-2 weeks from the first production launch of Alfresco, and this issue (plus another issue related to serious pixel loss after pdfsandwich runs) is causing showstopper concern for the project, so we're scrambling for ideas on how to resolve it.
Have you tried OCRmyPDF? It includes many different options for dealing with Tesseract parameters.
No, I haven't, but I was in fact just looking into it. Based on your experience, would you say OCRmyPDF is a more sophisticated tool and may be better suited to our needs given the problem at hand? We saw that you mentioned pdfsandwich first in the list in your write-up, so maybe we assumed incorrectly that it's the one you preferred.
Yes, I suggested OCRmyPDF for your use case because it's more customisable than pdfsandwich. We could say that pdfsandwich is a good basic tool (enough for many users), while OCRmyPDF is an expert tool (which requires more tuning and expertise). Let me know how it goes.
Thanks! Will do.
FYI, just an update that OCRmyPDF is working out much better with the additional options. Thanks again!
Hi, please see the attached image, which shows the output PDF getting more distorted each time the ocrmypdf command runs. FYI, we use the auto-rotate options (--rotate-pages --rotate-pages-threshold 1) only for the first version; for subsequent versions of the PDF, we do not use the auto-rotate options.
Could you please help me with this? NOTE: If I add the --oversample 600 option to the command for each version, it works fine, but the output PDF size increases.
Thanks.
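For readers following along, the version-dependent invocation described above could be sketched roughly as follows. This is only an illustration, not the code used in this project: the helper name `build_ocrmypdf_command`, the file names, and the assumption that the version label arrives as a string such as "1.0" are all hypothetical. Only the ocrmypdf flags themselves (--rotate-pages, --rotate-pages-threshold, --oversample) come from the comment above.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, version_label, oversample_dpi=None):
    """Build an ocrmypdf argument list for one document version.

    Hypothetical helper: how the version label is obtained from Alfresco
    is not shown here and is assumed to be handled by the caller.
    """
    cmd = ["ocrmypdf"]

    # Auto-rotate only the initial version, so that manual rotation fixes
    # saved in later versions are not overridden by the OCR step.
    if version_label == "1.0":
        cmd += ["--rotate-pages", "--rotate-pages-threshold", "1"]

    # Optional oversampling, as in the comment above (larger output files).
    if oversample_dpi is not None:
        cmd += ["--oversample", str(oversample_dpi)]

    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    # Print the command lines instead of running them.
    for label in ("1.0", "1.1"):
        print(shlex.join(build_ocrmypdf_command("in.pdf", "out.pdf", label)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where ocrmypdf is installed. Whether skipping rotation on later versions also avoids the distortion shown in the screenshot is not something this sketch addresses.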
I'm not an OCR expert. You'll probably get better answers from the OCRmyPDF project.
Is there a way to still run Alfresco Simple OCR (w/ pdfsandwich) on each new document version (so text can continue to be found on the pages) yet disable the auto-rotation portion of the process for versions of the document after 1.0? The business scenario here is to prevent manual page rotations (i.e. corrections to improper automatic orientation) from being repeatedly overridden by the automatic processing. Our idea for resolving this is to write programming logic that checks the version of the document in order to decide whether to apply auto-rotation. In other words, apply automatic page rotations to the very first version, 1.0, but not to any subsequent version, where manual changes/corrections could have been made. Of course, this depends on whether we're able to pass a command to Simple OCR and/or pdfsandwich to conditionally disable the auto-rotation portion of the process. Is this possible? If so, do you know the code or command we need to use to achieve this?
Stepping back, we're wondering if you've heard of this problem before and whether there are other approaches we may want to consider (instead of the idea described above) to overcome it.
Thank you!
Here's more background:
There are anomalies with some kinds of scanned documents being uploaded where the automation logic is not able to determine the page rotation correctly. Auto-rotation is based on what the process finds on the page and how it believes the text direction should flow. But there are times when pages have text flowing in conflicting directions (i.e. one block of text goes one way and another block goes a different way, not to mention times when the text is handwritten rather than computer-generated). So, when the auto-rotation ends up being incorrect for understandable reasons, the user manually rotates the page and saves the changes before adding annotations (via another third-party tool). This results in a new document version in Alfresco, which in turn triggers Simple OCR / pdfsandwich to run again against the new version. The automatic process then reverses the user's manual correction and auto-rotates the page back to the incorrect orientation. The next time a user views the document, they see the incorrect rotation again, plus an annotation layer that no longer corresponds to the proper coordinates of the page. At this point, manually rotating the page in the UI document viewer results in the annotation being rotated incorrectly and often illegibly. The problem is recursive in nature, and any annotations added (as they often will be) make it that much worse.