(Feature Request) PyMuPDF for pdf parsing #262

Closed
aditirao7 opened this issue Nov 11, 2023 · 4 comments · Fixed by #282
Labels
dependencies Pull requests that update a dependency file enhancement New feature or request help wanted Extra attention is needed python

Comments

@aditirao7
Contributor

We currently use PDFMiner to extract text from PDFs. Faster libraries such as PyMuPDF are worth considering. Switching would also help the web app generate its report faster.

Reference:
https://github.com/py-pdf/benchmarks#pdf-library-benchmarks
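For anyone who wants to compare the two libraries on their own files, here is a minimal sketch. It assumes `pdfminer.six` and `PyMuPDF` are both installed; the helper names (`extract_with_pdfminer`, `extract_with_pymupdf`, `timed`) are hypothetical and not taken from this repository.

```python
import time
from typing import Callable, Tuple


def extract_with_pdfminer(path: str) -> str:
    # pdfminer.six high-level API. Imported lazily so this module
    # still loads when the library is not installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_with_pymupdf(path: str) -> str:
    # PyMuPDF is imported under the name "fitz".
    import fitz
    with fitz.open(path) as doc:
        # get_text() with no arguments returns the page's plain text.
        return "\n".join(page.get_text() for page in doc)


def timed(extractor: Callable[[str], str], path: str) -> Tuple[str, float]:
    # Run one extractor on one file and return (text, elapsed seconds).
    start = time.perf_counter()
    text = extractor(path)
    return text, time.perf_counter() - start
```

Calling `timed(extract_with_pdfminer, path)` and `timed(extract_with_pymupdf, path)` on the same file gives a rough per-file comparison. A single run is noisy, so repeating it over several PDFs would give a fairer picture.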

@aditirao7 aditirao7 added the enhancement New feature or request label Nov 11, 2023
@marshalmiller marshalmiller added help wanted Extra attention is needed dependencies Pull requests that update a dependency file python labels Dec 30, 2023
@marshalmiller
Collaborator

@aditirao7 I'm sorry it took me so long to address this. I think this is a great idea. I appreciate the benchmarks, and I was also hoping to find a way to speed up the web app. The delay was because I wanted to confirm that PyMuPDF's license is compatible with ours, and I'm now confident that it is. I think we should move forward with this.

@aditirao7
Contributor Author

No worries. I've managed to migrate to PyMuPDF, and it works for the test PDFs in the repo. Do you have any other PDFs I can test on that tend to cause errors? Also, I'm not sure how to update the unit tests, but I'll try anyway.
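One way to try a larger set of PDFs is to run the extractor over a whole folder and record which files fail. The sketch below is only an illustration: `check_pdfs` is a hypothetical helper, and `extractor` stands for whatever function the project uses to turn a PDF path into text.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def check_pdfs(
    folder: str, extractor: Callable[[str], str]
) -> Tuple[List[str], Dict[str, str]]:
    """Run `extractor` on every *.pdf in `folder`.

    Returns (names that produced text, {name: reason} for the rest).
    """
    ok: List[str] = []
    failed: Dict[str, str] = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            text = extractor(str(pdf))
        except Exception as exc:
            # Broad on purpose: the goal is to list every problem file,
            # whatever error the parsing library raises.
            failed[pdf.name] = repr(exc)
            continue
        if text.strip():
            ok.append(pdf.name)
        else:
            # Scanned or image-only PDFs often yield no text at all.
            failed[pdf.name] = "no text extracted"
    return ok, failed
```

The `failed` dictionary then doubles as a list of candidate files for regression tests.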

@marshalmiller
Collaborator

@aditirao7 There are some random PDFs in this folder. I don't think any of them cause more trouble, but they are a better representation of what we would get. https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig

Thank you so much for this work.

@aditirao7
Contributor Author

@marshalmiller I was looking into the unit tests, and it seems that many of them are not asserting correctly. Did these tests ever run successfully?
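The thread does not show the tests in question, so the following is only a guess at one common way a test can fail to assert: the comparison is evaluated but never checked. `clean_text` is a hypothetical stand-in for the project's code, not a function from this repository.

```python
import unittest


def clean_text(raw: str) -> str:
    """Hypothetical helper: collapse whitespace runs in extracted text."""
    return " ".join(raw.split())


class CleanTextTests(unittest.TestCase):
    def test_without_assertion(self):
        # Passes no matter what clean_text returns, because the
        # result of the comparison is simply discarded.
        clean_text("a\n  b") == "a b"

    def test_with_assertion(self):
        # Fails whenever the output differs from the expected string.
        self.assertEqual(clean_text("a\n  b"), "a b")
```

Both tests pass with the helper as written, but only the second would catch a regression in `clean_text`.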

2 participants