Support parallel analysis of PDF documents #32
Comments
Amazingly enough, while on top of a mountain last week I had exactly the same thought, perhaps because PLAYA-PDF could also stand for "Parallel and LAzY Analyzer for PDF". In theory extraction is very scalable, because each page has a distinct set of content streams, and (despite the presence of a "resource manager" in pdfminer.six) the bulk of the time is spent parsing and interpreting these streams. Because in PLAYA the entire PDF is memory-mapped read-only, it can obviously be shared between processes. Sharing other resources between processes efficiently is more complicated and probably not worth the trouble. I will do a bit of experimentation with …
I think the general principle can also be applied to … (note that because all of this is pure Python and doesn't really involve I/O, using threads won't improve anything).
This is actually relatively easy. With https://github.com/dhdaines/playa/blob/parallel/benchmarks/parallel_miner.py I get around a 2x speedup with 4 CPUs on a 486-page document:

$ python benchmarks/parallel_miner.py -n 4 samples/contrib/Rgl-1314-2021-Z-en-vigueur-20240823.pdf
Submitting pages 0 to 121
Submitting pages 122 to 243
Submitting pages 244 to 365
Submitting pages 366 to 485
pdfminer.six (4 CPUs) took 36.42s
pdfminer.six (single) took 73.97s

Without actually profiling this, it seems that a lot of the overhead is in the serialization and deserialization of the layout analysis objects, as they are deeply nested. There is also the complication (as shown in the code above) that … (note that this backreferencing is actually also a memory leak in …).
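The page-range splitting visible in the log above can be sketched roughly as follows. This is not the actual benchmark code: `chunk_ranges` and `extract_range` are hypothetical names, and the worker body is a stub where something like pdfminer.six's `extract_pages` would run and return picklable results.

```python
from concurrent.futures import ProcessPoolExecutor

def chunk_ranges(num_pages: int, num_workers: int):
    """Split [0, num_pages) into contiguous half-open ranges,
    one per worker."""
    size = -(-num_pages // num_workers)  # ceiling division
    return [(start, min(start + size, num_pages))
            for start in range(0, num_pages, size)]

def extract_range(path: str, start: int, end: int):
    # Stub: a real worker would open the PDF itself and run
    # page-level extraction on pages [start, end), returning
    # only picklable results to the parent process.
    ...

def parallel_extract(path: str, num_pages: int, num_workers: int = 4):
    # Each worker gets one contiguous range of pages.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(extract_range, path, s, e)
                   for s, e in chunk_ranges(num_pages, num_workers)]
        return [f.result() for f in futures]
```

For a 486-page document and 4 workers, `chunk_ranges` produces exactly the ranges in the log: 0–121, 122–243, 244–365, 366–485 (the upper bounds are exclusive in the code, inclusive in the log).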
It's somewhat apples and oranges, but this is borne out if we do something similar using PLAYA's …: https://github.com/dhdaines/playa/blob/parallel/benchmarks/parallel.py

$ python benchmarks/parallel.py -n 4 samples/contrib/Rgl-1314-2021-Z-en-vigueur-20240823.pdf
PLAYA (4 CPUs) took 18.51s
PLAYA (single) took 50.26s

(this is even with "spawn" as the …)
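For context, the process start method can be selected explicitly with the standard library. A minimal sketch, assuming nothing about PLAYA's API (`work` is a stand-in for a per-page worker): "spawn" starts fresh interpreters, so each worker re-imports its module and re-opens the file rather than inheriting parsed state via fork.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def work(x: int) -> int:
    # Stand-in for a per-page worker; a real one would re-open
    # the (memory-mapped) PDF in the child process.
    return x * x

def run_parallel(n: int, workers: int = 2):
    # Force the "spawn" start method regardless of platform default.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
        return list(pool.map(work, range(n)))
```

That "spawn" still beats fork-based sharing here suggests the per-page work dominates the process startup cost on a document of this size.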
Thank you for sharing such a comprehensive code example. It clarified why my previous attempts at parallel PDF processing were unsuccessful. Thanks in particular for pointing out that PDFObjRef objects need to be detached from their parent PDFDocument before serialization, as shown in your parallel processing implementation. I noticed how you handle the layout in PLAYA. I recently came across an interesting Chinese library that implements an efficient text layout sorting algorithm. What makes it particularly noteworthy is its compact implementation in a single Python file and its effective bbox-based sorting method. I suggest adding a reading order to the layout processing; this would help create more consistent PDF-to-Markdown conversions. Thanks for your help!
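The kind of bbox-based reading-order sort described above can be sketched like this. This is a generic illustration, not the library's actual algorithm: `Box` is a hypothetical type with a top-left origin, and boxes whose top edges fall within a tolerance band are treated as one line, then read left to right.

```python
from typing import NamedTuple

class Box(NamedTuple):
    x0: float
    y0: float  # top edge, origin at top-left of the page
    x1: float
    y1: float
    text: str

def reading_order(boxes, line_tol: float = 5.0):
    """Sort boxes top-to-bottom, then left-to-right within a line.

    Boxes whose top edges differ by less than line_tol from the
    first box of the current line are grouped onto that line.
    """
    rows = sorted(boxes, key=lambda b: b.y0)
    lines, current = [], []
    for b in rows:
        if current and b.y0 - current[0].y0 > line_tol:
            lines.append(current)
            current = []
        current.append(b)
    if current:
        lines.append(current)
    # Within each line, read left to right.
    ordered = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda b: b.x0))
    return ordered
```

Real layouts (multi-column pages, sidebars) need more than a single tolerance band, which is presumably where the library's algorithm earns its keep.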
Ah, very interesting! This could go in PAVÉS, where I am implementing actual layout analysis things. By the way, if you want to go faster than pdfminer.six on multiple CPUs, you could try that, as it has a (mostly) pdfminer.six-compatible API: https://github.com/dhdaines/paves/blob/main/benchmarks/miner.py
This will probably not be done by PLAYA, as it purposefully returns everything in exactly the order it appears in the content streams. That said, it would be useful to support the rather lossy approximation of a reading order that is defined by the Tagged PDF standard (PDF 1.7, Section 14.8.2). This largely means doing something useful with certain properties of marked content sections, namely …
That's very helpful, thanks for the library recommendation. I'll definitely keep an eye on PAVÉS.
Hello,
I'm currently using pdfminer.high_level.extract_pages from the pdfminer.six library and have observed that it processes PDF pages sequentially, which can be quite slow. So I'm looking for ways to parallelize this process to improve performance.
I noticed that you've been actively involved in submitting issues for pdfminer.six. I'm curious if there's a known method to parallelize the extraction process? Given that the extraction of individual pages in a PDF is independent, how can we leverage parallel processing to speed things up?
I'm eager to address this performance bottleneck but am not very familiar with this area. Could you offer some guidance or advice on how to approach this?
Thank you!