(Feature Request) PyMuPDF for pdf parsing #262

aditirao7 · 2023-11-11T02:15:36Z

We currently use PDFMiner to extract text from the PDF. Faster libraries such as PyMuPDF are worth considering; switching should also let the web app generate the report faster.

Reference:
https://github.com/py-pdf/benchmarks#pdf-library-benchmarks
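To check the speed difference on the project's own files rather than relying only on the linked benchmarks, a small timing harness along these lines could be used. This is a minimal sketch, not code from the repo: `extract_with_pymupdf`, `extract_with_pdfminer`, and `time_extractors` are hypothetical helper names, and it assumes "PDFMiner" here means the `pdfminer.six` package. Both libraries are imported lazily, so the harness itself only needs the standard library.

```python
import time
from typing import Callable, Dict, Iterable


def extract_with_pymupdf(path: str) -> str:
    """Extract text from every page with PyMuPDF (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so the harness runs without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdfminer(path: str) -> str:
    """Extract text with pdfminer.six (assumed to be the PDFMiner in use)."""
    from pdfminer.high_level import extract_text

    return extract_text(path)


def time_extractors(
    extractors: Dict[str, Callable[[str], str]],
    paths: Iterable[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best-of-`repeats` wall-clock seconds per extractor."""
    paths = list(paths)  # allow generators to be reused across repeats
    results: Dict[str, float] = {}
    for name, extract in extractors.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for path in paths:
                extract(path)
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results
```

With both libraries installed, something like `time_extractors({"pymupdf": extract_with_pymupdf, "pdfminer": extract_with_pdfminer}, ["sample.pdf"])` would give a rough comparison; `sample.pdf` is a placeholder path.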

marshalmiller · 2023-12-30T05:05:51Z

@aditirao7 I'm sorry it took so long for me to address this. I think this is a great idea. I appreciate the benchmarks and was hoping to find a way to speed up the web app too. The delay was on my end: I wanted to confirm that PyMuPDF's license is compatible with ours, and I'm now confident that it is. I think we should move forward with this.

aditirao7 · 2024-01-02T12:17:33Z

No worries. I've migrated to PyMuPDF and it works for the test PDFs in the repo. Do you have any other PDFs I can test on that tend to cause errors? Also, I'm not sure how to update the unit tests, but I'll give it a try.

marshalmiller · 2024-01-02T14:38:58Z

@aditirao7 There are some random PDFs in this folder. I don't think any of them cause more trouble, but they are a better representation of what we would get: https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig

Thank you so much for this work.
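A folder of sample PDFs like this could be smoke-tested in one pass before the migration is merged. The sketch below is only an illustration: `smoke_test_folder` is a hypothetical helper, and `extract` stands for whatever text-extraction function the project ends up using (for example, a PyMuPDF-based one). It records which files raise an error or yield no text.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def smoke_test_folder(
    folder: str, extract: Callable[[str], str]
) -> Tuple[List[str], Dict[str, str]]:
    """Run `extract` on every *.pdf in `folder`.

    Returns (names that produced text, {name: reason} for the rest).
    """
    ok: List[str] = []
    failed: Dict[str, str] = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            text = extract(str(pdf))
        except Exception as exc:  # broad on purpose: we want to log any failure
            failed[pdf.name] = repr(exc)
            continue
        if text.strip():
            ok.append(pdf.name)
        else:
            failed[pdf.name] = "no text extracted"
    return ok, failed
```

An empty result is reported separately from an exception because scanned, image-only PDFs typically extract without errors but contain no text layer.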

aditirao7 · 2024-01-03T06:16:07Z

@marshalmiller I was looking into the unit tests, and it seems many of them are not asserting correctly. Did these tests ever run successfully?

aditirao7 added the enhancement label Nov 11, 2023

marshalmiller added the help wanted, dependencies, and python labels Dec 30, 2023

aditirao7 mentioned this issue in Move to PyMuPDF from PDFMiner #282 (merged) Jan 2, 2024

marshalmiller closed this as completed in #282 Jan 3, 2024
