djvu tools: help + graphics magic instead of Imagemagic for JP2 support #23

DiegoPino · 2020-11-24T02:34:53Z

@giancarlobi, I'm opening this issue to ask for your help with:
1.- Deciding which DjVu packages (and any Python tools) we need to run the PDF to DjVu to hOCR extraction pipeline.
2.- Some simple ideas or instructions on how to install them from source. I guess Alpine Linux only has the djvulibre libraries but not the extra tools.
3.- Documenting your workflow.

The idea is to support these tools in our esmero-php Docker container so others can be as cool as you are (and so we can also extract text layers from PDFs that are not image-based).

Thanks!

giancarlobi · 2020-11-24T09:13:18Z

@diego here are some first quick notes:

On Ubuntu there are packages for the `pdf2djvu` and `djvu2hocr` binaries I use. The source repos are https://github.com/jwilk/pdf2djvu and https://github.com/jwilk/ocrodjvu; `djvu2hocr` is included in the ocrodjvu package/repo. I have never installed them from source, so I need to check how to do that.
A simple pipeline I have also used in other repos when I have a searchable PDF is:
A) Convert the whole PDF to DjVu with `pdf2djvu --no-metadata -j0 -o full.djv full.pdf`. I have also used the `--guess-dpi` option when starting from PDFs made from TIFFs with ABBYY, but I think that is not the case here.
B) Convert a single page to hOCR with `djvu2hocr -p 1 full.djv > page_1.html`, where the `-p` option is the page number (1-based).
C) `page_1.html` now contains the hOCR of page 1, ready to be converted to miniOCR by a script.
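
Steps A and B above could be scripted roughly as follows. This is a minimal sketch, not a tested integration: the function names, the `pages` parameter (the caller must supply the page count) and the `page_N.html` output naming are assumptions made for illustration. Only the `pdf2djvu` and `djvu2hocr` flags quoted in the steps above are used, and both binaries must already be on the PATH for `extract_hocr` to run.

```python
import subprocess
from pathlib import Path


def pdf2djvu_cmd(pdf: str, djvu: str, guess_dpi: bool = False) -> list[str]:
    """Build the pdf2djvu call from step A."""
    cmd = ["pdf2djvu", "--no-metadata", "-j0"]
    if guess_dpi:
        # Optional, mentioned above for PDFs produced from TIFFs.
        cmd.append("--guess-dpi")
    return cmd + ["-o", djvu, pdf]


def djvu2hocr_cmd(djvu: str, page: int) -> list[str]:
    """Build the djvu2hocr call from step B; page numbers are 1-based."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return ["djvu2hocr", "-p", str(page), djvu]


def extract_hocr(pdf: str, pages: int, out_dir: str = ".") -> list[Path]:
    """Run step A once, then step B per page, writing page_N.html files.

    `pages` is a hypothetical parameter: the page count is not discovered here.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    djvu = out / (Path(pdf).stem + ".djv")
    subprocess.run(pdf2djvu_cmd(pdf, str(djvu)), check=True)

    written = []
    for page in range(1, pages + 1):
        target = out / f"page_{page}.html"
        # djvu2hocr writes hOCR to stdout, so redirect it into the file.
        with target.open("wb") as fh:
            subprocess.run(djvu2hocr_cmd(str(djvu), page), stdout=fh, check=True)
        written.append(target)
    return written
```

Keeping command construction separate from execution makes the flags easy to review and check without having the binaries installed; something like `extract_hocr("full.pdf", pages=3, out_dir="hocr")` would then produce the per-page files for a later miniOCR conversion step.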

giancarlobi · 2020-11-24T09:32:12Z

Also, in my experience I have found these types of book objects:
A) Only TIFF, without any OCR: handle with gs + tesseract.
B) TIFF and PDF with searchable text made by ABBYY FineReader: handle with pdf2djvu + djvu2hocr instead of the resource-hungry tesseract.
C) Only PDF without searchable text: handle like TIFF, with gs + tesseract.
D) Only PDF with searchable text: handle with pdf2djvu + djvu2hocr.
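
The four cases above reduce to one question: is there a PDF with a searchable text layer? A small sketch of that decision follows. The `Pipeline` enum, the function name and its boolean parameters are hypothetical, and the handling of a TIFF plus a non-searchable PDF (a combination not listed above) is an assumption that falls back to OCR.

```python
from enum import Enum


class Pipeline(Enum):
    GS_TESSERACT = "gs + tesseract"
    PDF2DJVU_DJVU2HOCR = "pdf2djvu + djvu2hocr"


def choose_pipeline(has_tiff: bool, has_pdf: bool,
                    pdf_is_searchable: bool = False) -> Pipeline:
    """Pick the extraction route for a book object (cases A to D)."""
    if not (has_tiff or has_pdf):
        raise ValueError("object has neither TIFF nor PDF")
    if has_pdf and pdf_is_searchable:
        # Cases B and D: reuse the existing text layer, skip OCR.
        return Pipeline.PDF2DJVU_DJVU2HOCR
    # Cases A and C (and, by assumption, TIFF plus a non-searchable PDF).
    return Pipeline.GS_TESSERACT
```

How `pdf_is_searchable` gets determined is left open here; the thread does not say how the text layer is detected.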

patdunlavey · 2022-03-09T14:37:47Z

@DiegoPino and @giancarlobi, how can I help move this along? I have no Dockerfile expertise myself, but I could possibly get others at Born-Digital to help with getting the binaries into the Docker images.

DiegoPino · 2022-03-09T14:41:48Z

@patdunlavey we are replacing djvu with pdfalto. I have a testing Docker container. The issue is that Alpine cannot build the binary (damn Alpine and its lack of proper glibc), so I need to do some moving around in a multi-stage build. This will also change the SBR OCR processor a bit (look at Giancarlo's open pull request), and we should deprecate djvu completely after this.

DiegoPino · 2022-03-09T14:42:21Z

Also: the new container I built is PHP8, and let me tell you, making everything PHP8 is going to be fun....

giancarlobi · 2022-03-09T14:42:27Z

@patdunlavey I think this could be superseded by switching to ALTO instead of hOCR. In that case there is no more need for djvu, only the pdfalto binary, but @DiegoPino can answer better on this.

DiegoPino · 2022-03-09T14:42:52Z

Jinx!

patdunlavey · 2022-03-09T14:43:04Z

Ha!

DiegoPino · 2022-03-09T14:45:03Z

@patdunlavey my main issue right now with providing a new Docker container is that I found a core Docker bug (yes, I do find bugs too) that affects multi-source Docker builds (ones where you have multiple FROM statements), so I got a bit derailed. Specifically, what is your need, or Born-Digital's, so we can move this along faster before the next release, which is in May?

patdunlavey · 2022-03-09T14:55:24Z

Thanks for asking! Searchable PDF is on the list of functional requirements for a client we are currently building for (you are working for them too @DiegoPino!). Having to rasterize and OCR PDFs is not really acceptable. Not sure if not having this until May would be a problem.

DiegoPino · 2022-03-09T15:00:28Z

OK, so yeah, we can deal with this sooner. I'm just scared about the PHP8 upgrade. I'm running PHP8 right now but will have to do a full phpstan evaluation because the number of requirements and deprecations is overwhelming. So maybe in the meantime I should build a PHP 7.4 container with the pdfalto binary while I turn that whole fix into a larger mega-issue. Let's work on this a bit longer; your pull request and Giancarlo's together are a great motivation to move faster.

DiegoPino · 2022-03-09T15:01:26Z

Also: I don't have much bandwidth today and tomorrow. I'm doing some core work on caching in SBF, so I might be distracted, but I will review both of your code today somehow and see how we can advance faster by Friday.

DiegoPino assigned giancarlobi Nov 24, 2020

DiegoPino added the "help wanted" (Extra attention is needed) and "esmero-php" (All that goes into the PHP container) labels Nov 24, 2020

DiegoPino mentioned this issue Nov 24, 2020

ISSUE-21: esmero-php gets a proper Patch binary and WACZ #22

Merged

DiegoPino changed the title ~~djvu tools: help~~ djvu tools: help + graphics magic instead of Imagemagic for JP2 support Dec 21, 2020
