djvu tools: help + graphics magic instead of Imagemagic for JP2 support #23

Open
DiegoPino opened this issue Nov 24, 2020 · 12 comments

@DiegoPino
Member

@giancarlobi this is an issue to ask you for some help with:
1.- Deciding which DjVu packages (and any Python tools) we need to run PDF to DjVu to hOCR extraction pipelines
2.- Some simple ideas/instructions on how to install from source. I guess Alpine Linux only has the djvulibre libraries but not the extra tools
3.- Documenting your workflow.

The idea is to support these tools in our docker esmero-php container so others can be as cool as you are (and so we can also extract text layers from PDFs that are not image based)

Thanks!

@DiegoPino added the help wanted (Extra attention is needed) and esmero-php (All that goes into the PHP container) labels on Nov 24, 2020
@giancarlobi

@diego here are some first quick notes:

  • On Ubuntu there are packages for the pdf2djvu and djvu2hocr binaries I use. The source repos are here https://github.com/jwilk/pdf2djvu and here https://github.com/jwilk/ocrodjvu. djvu2hocr is included in the ocrodjvu package/repo. I have never installed them from source, so I have to check how to do that.
  • A simple pipeline I have also used in another repo, when I have a searchable PDF, is:
    A) Convert the whole PDF to DjVu with: pdf2djvu --no-metadata -j0 -o full.djv full.pdf. I have also used the --guess-dpi option when starting from a PDF made from TIFF with ABBYY; I think this is not the case here.
    B) Convert a single page to hOCR with: djvu2hocr -p 1 full.djv > page_1.html where the -p option is the page number (starting at 1).
    C) page_1.html contains the hOCR of page number 1, ready to be processed by a script that converts it to miniOCR.
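
Steps A and B above could be scripted roughly as follows. This is only a sketch: it uses the flags quoted in the comment (--no-metadata, -j0, --guess-dpi, -o for pdf2djvu and -p for djvu2hocr), while the function names, the output directory layout and the injectable run parameter are assumptions made for illustration, not part of anyone's actual workflow.

```python
import subprocess
from pathlib import Path


def pdf2djvu_cmd(pdf, djvu, guess_dpi=False):
    """Build the step A command: whole PDF to a single DjVu file."""
    cmd = ["pdf2djvu", "--no-metadata", "-j0"]
    if guess_dpi:
        # Only mentioned for PDFs made from TIFF with ABBYY.
        cmd.append("--guess-dpi")
    return cmd + ["-o", str(djvu), str(pdf)]


def djvu2hocr_cmd(djvu, page):
    """Build the step B command: one page (numbered from 1) to hOCR."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return ["djvu2hocr", "-p", str(page), str(djvu)]


def pdf_to_hocr_pages(pdf, out_dir, pages, run=subprocess.run):
    """Run step A once, then step B for every requested page.

    Writes page_<n>.html files into out_dir (hypothetical layout) and
    returns their paths. `run` is injectable so the function can be
    exercised without the binaries installed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    djvu = out_dir / "full.djv"
    run(pdf2djvu_cmd(pdf, djvu), check=True)
    outputs = []
    for page in pages:
        html = out_dir / f"page_{page}.html"
        # djvu2hocr prints hOCR to stdout, so redirect it to the file.
        with open(html, "wb") as fh:
            run(djvu2hocr_cmd(djvu, page), check=True, stdout=fh)
        outputs.append(html)
    return outputs
```

Calling pdf_to_hocr_pages("full.pdf", "out", [1, 2, 3]) would need both binaries on the PATH. The hOCR to miniOCR conversion from step C is left out because the thread does not describe it.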

@giancarlobi

Also, in my experience I have found these types of book objects:
A) only TIFF, without any OCR: handled with gs + tesseract
B) TIFF and PDF with searchable text made by ABBYY FineReader: handle with pdf2djvu + djvu2hocr instead of the resource-hungry tesseract
C) only PDF without searchable text: handle like TIFF, with gs + tesseract
D) only PDF with searchable text: you can use pdf2djvu + djvu2hocr
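
The four cases above reduce to one rule: if a PDF with searchable text is available, extract its text layer; otherwise fall back to OCR. A minimal sketch of that decision is below. The function name and its boolean inputs are hypothetical, and how to detect whether a PDF is searchable is deliberately left to the caller.

```python
def choose_pipeline(has_tiff, has_pdf, pdf_searchable):
    """Pick a toolchain for a book object, following cases A to D.

    Returns the tool pair as a plain string. The combination
    "TIFF plus a PDF without searchable text" is not listed in the
    thread; treating it like case A is an assumption.
    """
    if not (has_tiff or has_pdf):
        raise ValueError("object has neither TIFF nor PDF files")
    if has_pdf and pdf_searchable:
        # Cases B and D: reuse the existing text layer.
        return "pdf2djvu + djvu2hocr"
    # Cases A and C: rasterize if needed, then OCR.
    return "gs + tesseract"
```

For example, choose_pipeline(has_tiff=True, has_pdf=True, pdf_searchable=True) corresponds to case B.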

@DiegoPino changed the title from "djvu tools: help" to "djvu tools: help + graphics magic instead of Imagemagic for JP2 support" on Dec 21, 2020
@patdunlavey

@DiegoPino and @giancarlobi, how can I help move this along? I have no Dockerfile expertise myself, but could possibly get others at Born-Digital to help with getting the binaries into the docker images.

@DiegoPino
Member Author

DiegoPino commented Mar 9, 2022

@patdunlavey we are replacing djvu with pdftoalto. I have a testing docker container. The issue is that Alpine cannot build the binary (damn Alpine and its lack of proper glibc), so I need to do some moving around in multiple stages. This will also change the SBR OCR processor a bit (look at Giancarlo's open pull request), and we should deprecate djvu completely after this.

@DiegoPino
Member Author

Also: the new container I built is PHP8, and let me tell you, making everything PHP8 is going to be fun....

@giancarlobi

@patdunlavey I think this could be superseded by switching to ALTO instead of hOCR. In that case djvu is no longer needed, only the pdfalto binary, but @DiegoPino can answer this better.

@DiegoPino
Member Author

Jinx!

@patdunlavey

Ha!

@DiegoPino
Member Author

@patdunlavey my main issue right now with providing a new Docker container is that I found a core Docker bug (yes.. I do find bugs too) that affects multi-source docker builds (ones where you have multiple FROM statements). So I got a bit derailed. Specifically, what is your need (or Born-Digital's), so we can move this along somewhat faster before the next release, which is in May?

@patdunlavey

Thanks for asking! Searchable PDF is on the list of functional requirements for a client we are currently building for (you are working for them too @DiegoPino!). Having to rasterize and OCR PDFs is not really acceptable. Not sure if not having this until May would be a problem.

@DiegoPino
Member Author

Ok, so yeah, we can deal with this sooner. I'm just scared about the PHP8 upgrade. I'm running PHP8 right now but will have to do a full phpstan evaluation because the number of requirements and deprecations is overwhelming. So maybe in the meantime I should build a PHP7.4 container with the pdftoalto binary while I turn that whole fix into a larger mega-issue. Let's work on this a bit longer; your pull request together with Giancarlo's is great motivation to move faster.

@DiegoPino
Member Author

Also: I don't have much bandwidth today and tomorrow. I'm doing some core work on caching in SBF so I might be distracted, but I will review the code from both of you today somehow and see how we can advance faster by Friday.
