Support parallel analysis of PDF documents #32

Closed
TheCutestCat opened this issue Dec 26, 2024 · 7 comments · Fixed by #36
Labels
enhancement New feature or request


TheCutestCat commented Dec 26, 2024

Hello,

I'm currently using pdfminer.high_level.extract_pages from the pdfminer.six library and have observed that it processes PDF pages sequentially, which can be quite slow. So I'm looking for ways to parallelize this process to improve performance.

I noticed that you've been actively involved in submitting issues for pdfminer.six. I'm curious if there's a known method to parallelize the extraction process? Given that the extraction of individual pages in a PDF is independent, how can we leverage parallel processing to speed things up?

I'm eager to address this performance bottleneck but am not very familiar with this area. Could you offer some guidance or advice on how to approach this?

Thank you!

@dhdaines
Owner

Amazingly enough, while on top of a mountain last week I had exactly the same thought, perhaps because PLAYA-PDF could also stand for "Parallel and LAzY Analyzer for PDF".

In theory extraction is very scalable because each page has a distinct set of content streams, and (despite the presence of a "resource manager" in pdfminer.six) the bulk of the time is spent parsing and interpreting these streams. Because in PLAYA the entire PDF is memory-mapped read-only, it can obviously be shared between processes. Sharing other resources between processes efficiently is more complicated and probably not worth the trouble.

I will do a bit of experimentation with ProcessPoolExecutor this morning to see what kind of speedup I can get.

@dhdaines
Owner

I think the general principle can also be applied to pdfminer.six. The issue may be that certain objects cannot be pickled and thus can't be passed between processes via multiprocessing.

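(Note that because all of this is pure Python and doesn't really involve I/O, using threads won't improve anything: in standard CPython the global interpreter lock lets only one thread execute Python bytecode at a time.)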

@dhdaines
Owner

This is actually relatively easy with pdfminer.six and extract_pages, but it doesn't scale very well. You can see how to do it here:

https://github.com/dhdaines/playa/blob/parallel/benchmarks/parallel_miner.py

I get around a 2x speedup for 4 CPUs on a 486-page document:

$ python benchmarks/parallel_miner.py -n 4 samples/contrib/Rgl-1314-2021-Z-en-vigueur-20240823.pdf
Submitting pages 0 to 121
Submitting pages 122 to 243
Submitting pages 244 to 365
Submitting pages 366 to 485
pdfminer.six (4 CPUs) took 36.42s
pdfminer.six (single) took 73.97s
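
For readers who want the general shape without opening the linked script, here is a minimal sketch of the chunk-and-submit pattern. It is not the code from parallel_miner.py: `page_ranges`, `process_range` and `run_parallel` are hypothetical names, and the worker body is a placeholder. The ceiling-division split is an assumption that happens to reproduce the page ranges in the log above.

```python
from concurrent.futures import ProcessPoolExecutor
from math import ceil


def page_ranges(n_pages, n_workers):
    """Split pages 0..n_pages-1 into contiguous, inclusive (first, last) ranges."""
    chunk = max(1, ceil(n_pages / n_workers))
    return [
        (start, min(start + chunk, n_pages) - 1)
        for start in range(0, n_pages, chunk)
    ]


def process_range(path, first, last):
    """Hypothetical worker that analyses pages first..last of the PDF at `path`.

    Placeholder body. With pdfminer.six one would presumably pass the page
    indices to extract_pages(path, page_numbers=...) and return only
    picklable results to the parent process.
    """
    return last - first + 1


def run_parallel(path, n_pages, n_workers):
    """Submit one task per page range and collect the results in order."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(process_range, path, first, last)
            for first, last in page_ranges(n_pages, n_workers)
        ]
        return [future.result() for future in futures]
```

Each worker receives only a path and two integers, so nothing large has to be pickled on the way in; the cost is on the way back, when results are returned. A script that calls `run_parallel` should do so under an `if __name__ == "__main__":` guard so that it also works with the "spawn" start method.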

Without actually profiling this, it seems that a lot of the overhead is in the serialization and deserialization of the layout analysis objects, as they are deeply nested. There is also the complication (as shown in the code above) that PDFObjRef objects must be "detached" from the parent PDFDocument in order to be serialized.
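
To illustrate why a back-reference gets in the way of pickling, here is a small sketch using stand-in classes. `Document`, `ObjRef`, `detach` and `can_pickle` are hypothetical and are not pdfminer.six's classes; the lock merely plays the role of some unpicklable resource held by the document.

```python
import pickle
import threading


class Document:
    """Hypothetical stand-in for a parsed document that owns an unpicklable resource."""

    def __init__(self):
        self._lock = threading.Lock()  # locks cannot be pickled


class ObjRef:
    """Hypothetical stand-in for an indirect object reference."""

    def __init__(self, doc, objid):
        self.doc = doc  # strong back-reference to the owning document
        self.objid = objid


def detach(ref):
    """Reduce a reference to plain data with no link back to the document."""
    return {"objid": ref.objid}


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
    except (TypeError, AttributeError, pickle.PicklingError):
        return False
    return True


doc = Document()
ref = ObjRef(doc, 42)
attached_ok = can_pickle(ref)          # pickling follows ref.doc and hits the lock
detached_ok = can_pickle(detach(ref))  # plain data pickles fine
```

The detached form loses the ability to resolve the reference, so the receiving process has to look the object up again by its ID if it needs it.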

(Note that this backreferencing is actually also a memory leak in pdfminer.six, which is fixed in PLAYA by the use of weak references.)
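
As a rough illustration of the weak-reference idea (a sketch with hypothetical classes, not PLAYA's actual implementation): if the reference holds its document through `weakref.ref`, it no longer keeps the document alive.

```python
import gc
import weakref


class Document:
    """Hypothetical document that hands out references to its objects."""

    def __init__(self):
        self.refs = []

    def make_ref(self, objid):
        ref = ObjRef(self, objid)
        self.refs.append(ref)
        return ref


class ObjRef:
    """Hypothetical object reference that points back to its document weakly."""

    def __init__(self, doc, objid):
        self._doc = weakref.ref(doc)  # does not keep the document alive
        self.objid = objid

    @property
    def doc(self):
        """The owning document, or None once it has been freed."""
        return self._doc()


doc = Document()
ref = doc.make_ref(7)
alive_before = ref.doc is doc

del doc       # drop the only strong reference to the document
gc.collect()  # not strictly needed in CPython, included for clarity
alive_after = ref.doc is not None
```

With a strong back-reference instead, the surviving `ref` would keep the whole document reachable for as long as the reference itself is held.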

@dhdaines dhdaines changed the title Parallelizing PDF Page Extraction with pdfminer.six Support parallel analysis of PDF documents Dec 27, 2024
@dhdaines dhdaines added the enhancement New feature or request label Dec 27, 2024
@dhdaines dhdaines added this to the 0.3 milestone Dec 27, 2024
@dhdaines
Owner

dhdaines commented Dec 27, 2024

> Without actually profiling this it seems that a lot of the overhead is in the serialization and deserialization of the layout analysis objects as they are deeply nested.

It's somewhat apples and oranges but this is borne out if we do something similar using PLAYA's layout (BEWARE: will be removed in PLAYA 0.3) which returns a flat list of explicitly serializable objects:

https://github.com/dhdaines/playa/blob/parallel/benchmarks/parallel.py

$ python benchmarks/parallel.py -n 4 samples/contrib/Rgl-1314-2021-Z-en-vigueur-20240823.pdf
PLAYA (4 CPUs) took 18.51s
PLAYA (single) took 50.26s

(this is even with "spawn" as the multiprocessing start method which introduces its own extra overhead...)
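
The nested-versus-flat point can be illustrated in isolation with the standard pickle module. This is only a toy comparison on made-up data, not a measurement of pdfminer.six or PLAYA: `make_nested` and `make_flat` are hypothetical helpers that encode the same characters and bounding boxes in two shapes, and pickled size is used as a rough proxy for serialization cost.

```python
import pickle


def make_nested(n_lines, n_chars):
    """Lines containing characters, each character a small dict (tree-like layout)."""
    return [
        {
            "line": i,
            "chars": [
                {"text": "a", "bbox": (float(j), float(i), j + 1.0, i + 1.0)}
                for j in range(n_chars)
            ],
        }
        for i in range(n_lines)
    ]


def make_flat(n_lines, n_chars):
    """The same characters as one flat list of plain tuples."""
    return [
        ("a", float(j), float(i), j + 1.0, i + 1.0)
        for i in range(n_lines)
        for j in range(n_chars)
    ]


nested = make_nested(50, 80)
flat = make_flat(50, 80)

# Serialized size in bytes for each shape.
nested_size = len(pickle.dumps(nested, protocol=pickle.HIGHEST_PROTOCOL))
flat_size = len(pickle.dumps(flat, protocol=pickle.HIGHEST_PROTOCOL))
```

Real layout objects are class instances rather than dicts, which adds further per-object overhead, so the gap in practice is likely larger than this toy case suggests.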

@TheCutestCat
Author

TheCutestCat commented Dec 30, 2024

Thank you for sharing such a comprehensive code example. It clarified why my previous attempts at parallel PDF processing were unsuccessful. Thanks for pointing out that PDFObjRef objects need to be detached from their parent PDFDocument before serialization, as shown in your parallel processing implementation.

I noticed how you handle the layout in PLAYA. I recently came across an interesting Chinese library that implements an efficient text layout sorting algorithm. What makes it particularly noteworthy is its compact implementation in a single Python file and its effective bbox-based sorting method.

I suggest adding a reference order to the layout processing. This would help create more consistent PDF-to-Markdown conversions by maintaining a clear reading order.
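
To make the idea of bbox-based ordering concrete, here is a naive sketch. It is not the algorithm from the library mentioned above: `reading_order` and `line_tol` are hypothetical names, boxes are assumed to be `(x0, y0, x1, y1, text)` tuples in PDF coordinates (origin at the bottom left), and multi-column layouts are not handled.

```python
def reading_order(boxes, line_tol=3.0):
    """Order boxes top-to-bottom, then left-to-right within each line.

    Boxes whose top edges lie within `line_tol` units of the first box
    of the current line are treated as belonging to that line.
    """
    by_top = sorted(boxes, key=lambda box: -box[3])  # highest top edge first
    lines = []
    for box in by_top:
        if lines and abs(lines[-1][0][3] - box[3]) <= line_tol:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within each line, sort by the left edge.
    return [box for line in lines for box in sorted(line, key=lambda b: b[0])]


boxes = [
    (100.0, 700.0, 150.0, 712.0, "world"),
    (10.0, 700.0, 60.0, 711.0, "Hello"),
    (10.0, 680.0, 80.0, 692.0, "second line"),
]
ordered_text = [box[4] for box in reading_order(boxes)]
```

A heuristic like this breaks down on multi-column pages, tables and rotated text, which is where a more careful layout analysis step would be needed.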

Thanks for your help! Hope this helps.

@dhdaines
Owner

> I noticed how you handle the layout in PLAYA. I recently came across an interesting Chinese library that implements an efficient text layout sorting algorithm. What makes it particularly noteworthy is its compact implementation in a single Python file and its effective bbox-based sorting method.

Ah, very interesting! This could go in PAVÉS, where I am implementing actual layout analysis things. By the way, if you want to go faster than pdfminer.six on multiple CPUs, you could try that, as it has a (mostly) pdfminer.six-compatible API: https://github.com/dhdaines/paves/blob/main/benchmarks/miner.py

> I suggest adding a reference order to the layout processing. This would help create more consistent PDF-to-Markdown conversions by maintaining a clear reading order.

This will probably not be done by PLAYA as it purposefully returns everything in exactly the order it appears in the content streams.

That said, it would be useful to support the rather lossy approximation of a reading order that is defined by the Tagged PDF standard (PDF 1.7, Section 14.8.2). This largely means doing something useful with certain properties of marked content sections, namely /ActualText, /ReversedChars and /TagSuspect.

@TheCutestCat
Author

That's very helpful, thanks for the library recommendation. I'll definitely keep an eye on PAVÉS.
