djvu tools: help + GraphicsMagick instead of ImageMagick for JP2 support #23
Comments
@diego here are some first quick notes:
|
Also, in my experience I found these types of book objects: |
@DiegoPino and @giancarlobi , how can I help move this along? I have no Dockerfile expertise myself, but could possibly get others at Born-Digital to help with getting the binaries into the docker images. |
@patdunlavey we are replacing djvu with pdftoalto. I have a testing Docker container. The issue is that Alpine cannot build the binary (damn Alpine and its lack of proper glibc), so I need to do some moving around in multi-stage builds. This will also change the SBR OCR processor a bit (look at Giancarlo's open pull request), and we should deprecate djvu completely after this. |
Also: the new container I built is PHP8, and let me tell you, making everything PHP8 is going to be fun... |
@patdunlavey I think this could be superseded by switching to ALTO instead of hOCR; in that case there is no more need for djvu, only the pdfalto binary, but @DiegoPino can answer better on this. |
Jinx! |
Ha! |
@patdunlavey my main issue right now with providing a new Docker container is that I found a core Docker bug (yes... I do find bugs too) that affects multi-source Docker builds (ones where you have multiple FROM statements). So I got a bit derailed. Specifically, what is your need (or Born-Digital's), so we can move this along faster before the next release, which is in May? |
Thanks for asking! Searchable PDF is on the list of functional requirements for a client we are currently building for (you are working for them too @DiegoPino!). Having to rasterize and OCR PDFs is not really acceptable. Not sure if not having this until May would be a problem. |
Ok, so yeah, we can deal with this sooner. I'm just scared about the PHP8 upgrade. I'm running PHP8 right now but will have to do a full phpstan evaluation because the number of requirements and deprecations is overwhelming. So maybe in the meantime I should build a PHP 7.4 container with the pdftoalto binary while I make that whole fix a larger mega-issue. Let's work on this a bit longer; your pull request plus Giancarlo's together is a great motivation to move faster. |
Also: I don't have much bandwidth today and tomorrow. I'm doing some core work on caching in SBF so I might be distracted, but I will review the code from both of you today somehow and see how we can advance faster by Friday. |
@giancarlobi this is an issue to ask you for some help with:
1.- Deciding which DJVU packages (and any Python tools) we need to run PDF to DJVU to HOCR extraction/pipelines
2.- Some simple ideas/instructions on how to install from source? I guess Alpine Linux has only the djvulibre libraries but not the extra tools
3.- Documenting your workflow.
The idea is to support these tools in our Docker esmero-php container so others can be as cool as you are (and so we can also extract layers of text from PDFs that are not image based)
Thanks!
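
For anyone following along, here is a rough sketch of what the PDF to DJVU to HOCR pipeline asked about above might look like when driven from Python. It is only an illustration under assumptions, not the workflow used in this project: it assumes the `pdf2djvu` and `djvu2hocr` command-line tools (the latter ships with ocrodjvu) are on the `PATH`, and the helper names `missing_tools`, `build_commands`, and `pdf_to_hocr` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: these are the two binaries the container would need.
# pdf2djvu converts a PDF to DjVu; djvu2hocr (from ocrodjvu) dumps the
# hidden text layer of a DjVu file as hOCR on standard output.
REQUIRED_TOOLS = ("pdf2djvu", "djvu2hocr")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def build_commands(pdf_path):
    """Build the two command lines for the PDF -> DjVu -> hOCR steps."""
    pdf = Path(pdf_path)
    djvu = pdf.with_suffix(".djvu")
    to_djvu = ["pdf2djvu", "-o", str(djvu), str(pdf)]
    to_hocr = ["djvu2hocr", str(djvu)]
    return to_djvu, to_hocr


def pdf_to_hocr(pdf_path):
    """Run both steps and return the hOCR markup as a string."""
    missing = missing_tools()
    if missing:
        raise RuntimeError("missing tools: " + ", ".join(missing))
    to_djvu, to_hocr = build_commands(pdf_path)
    subprocess.run(to_djvu, check=True)
    result = subprocess.run(to_hocr, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling `missing_tools()` inside the container is a quick way to confirm whether the image already provides both binaries before trying a real PDF.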