Support parallel analysis of PDF documents #32

TheCutestCat · 2024-12-26T07:38:05Z

Hello,

I'm currently using pdfminer.high_level.extract_pages from the pdfminer.six library and have observed that it processes PDF pages sequentially, which can be quite slow, so I'm looking for ways to parallelize this process to improve performance.

I noticed that you've been actively involved in submitting issues for pdfminer.six. I'm curious if there's a known method to parallelize the extraction process? Given that the extraction of individual pages in a PDF is independent, how can we leverage parallel processing to speed things up?

I'm eager to address this performance bottleneck but am not very familiar with this area. Could you offer some guidance or advice on how to approach this?

Thank you!

dhdaines · 2024-12-27T13:48:38Z

Amazingly enough, while on top of a mountain last week I had exactly the same thought, perhaps because PLAYA-PDF could also stand for "Parallel and LAzY Analyzer for PDF".

In theory extraction is very scalable because each page has a distinct set of content streams, and (despite the presence of a "resource manager" in pdfminer.six) the bulk of the time is spent parsing and interpreting these streams. Because in PLAYA the entire PDF is memory-mapped read-only, it can obviously be shared between processes. Sharing other resources between processes efficiently is more complicated and probably not worth the trouble.
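As a minimal illustration of the memory-mapping point (this is not PLAYA's actual code, and `read_slice` is a hypothetical helper), each worker process can open and map the same file read-only instead of receiving a copy of the data. The operating system can then back all of those mappings with the same cached pages:

```python
import mmap


def read_slice(path: str, start: int, length: int) -> bytes:
    """Map `path` read-only and copy out one byte range.

    Hypothetical helper: a real worker would parse content streams
    straight out of the mapping instead of copying bytes.
    """
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            return bytes(buf[start:start + length])
```

Only the path (a short string) has to be sent to each worker, which is cheap to serialize.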

I will do a bit of experimentation with ProcessPoolExecutor this morning to see what kind of speedup I can get.

dhdaines · 2024-12-27T13:54:41Z

I think the general principle can also be applied to pdfminer.six; the issue may be that certain objects cannot be pickled and thus can't be shared via multiprocessing.

(note that because all of this is pure-Python and doesn't really involve I/O, using threads won't improve anything)

dhdaines · 2024-12-27T17:16:03Z

This is actually relatively easy with pdfminer.six and extract_pages, but it doesn't scale very well. You can see how to do it here:

https://github.com/dhdaines/playa/blob/parallel/benchmarks/parallel_miner.py

I get around a 2x speedup for 4 CPUs on a 486-page document:

$ python benchmarks/parallel_miner.py -n 4 samples/contrib/Rgl-1314-2021-Z-en-vigueur-20240823.pdf
Submitting pages 0 to 121
Submitting pages 122 to 243
Submitting pages 244 to 365
Submitting pages 366 to 485
pdfminer.six (4 CPUs) took 36.42s
pdfminer.six (single) took 73.97s
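The page ranges printed above are consistent with cutting the 486 pages into contiguous chunks of ceil(486 / 4) = 122 pages. A minimal sketch of that kind of split with ProcessPoolExecutor follows. It is not the code from the linked benchmark: `page_ranges`, `analyze_range` and `analyze_parallel` are hypothetical names, and the worker is a stand-in that does no real PDF parsing.

```python
import math
from concurrent.futures import ProcessPoolExecutor


def page_ranges(npages: int, ncpu: int) -> list[tuple[int, int]]:
    """Split pages 0..npages-1 into at most `ncpu` contiguous inclusive ranges."""
    chunk = max(1, math.ceil(npages / ncpu))
    return [(start, min(start + chunk, npages) - 1)
            for start in range(0, npages, chunk)]


def analyze_range(path: str, first: int, last: int) -> list[str]:
    """Hypothetical worker: a real one would open `path` itself and
    extract pages first..last, returning only picklable results."""
    return [f"{path}: page {n}" for n in range(first, last + 1)]


def analyze_parallel(path: str, npages: int, ncpu: int) -> list[str]:
    """Submit one task per page range and collect results in page order."""
    results: list[str] = []
    with ProcessPoolExecutor(max_workers=ncpu) as pool:
        futures = [pool.submit(analyze_range, path, first, last)
                   for first, last in page_ranges(npages, ncpu)]
        for future in futures:  # iterate in submission order, not completion order
            results.extend(future.result())
    return results


if __name__ == "__main__":
    # The guard matters when the start method is "spawn": workers re-import this module.
    for first, last in page_ranges(486, 4):
        print(f"Submitting pages {first} to {last}")
    print(len(analyze_parallel("example.pdf", npages=10, ncpu=2)))
```

Each worker receives only a path and two integers, so the cost of the approach is dominated by whatever the workers send back.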

Without actually profiling this it seems that a lot of the overhead is in the serialization and deserialization of the layout analysis objects as they are deeply nested. Also there is the complication (as shown in the code above) that PDFObjRef objects must be "detached" from the parent PDFDocument in order to be serialized.
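To illustrate the "detaching" problem in general terms (a sketch with hypothetical `Document`, `ObjRef` and `detach` names, not pdfminer.six's classes or the benchmark's code): a reference object that points back at a document holding an open file cannot be pickled until that back-reference is dropped.

```python
import os
import pickle


class Document:
    """Hypothetical parsed document that owns an open file handle."""

    def __init__(self, path: str) -> None:
        self.fp = open(path, "rb")  # open file objects cannot be pickled

    def close(self) -> None:
        self.fp.close()


class ObjRef:
    """Hypothetical indirect object reference pointing back at its document."""

    def __init__(self, doc, objid: int) -> None:
        self.doc = doc
        self.objid = objid


def detach(value):
    """Return a copy of `value` in which every ObjRef has lost its document."""
    if isinstance(value, ObjRef):
        return ObjRef(None, value.objid)
    if isinstance(value, list):
        return [detach(item) for item in value]
    if isinstance(value, dict):
        return {key: detach(item) for key, item in value.items()}
    return value


if __name__ == "__main__":
    doc = Document(os.devnull)
    attrs = {"Font": ObjRef(doc, 12), "Kids": [ObjRef(doc, 13), 7]}
    try:
        pickle.dumps(attrs)  # fails: the refs drag the open file along
    except TypeError as exc:
        print("cannot pickle attached refs:", exc)
    restored = pickle.loads(pickle.dumps(detach(attrs)))
    print(restored["Font"].objid, restored["Font"].doc)
    doc.close()
```

A detached reference can no longer be resolved on the receiving side, so anything the parent process needs has to be resolved in the worker before the result is returned.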

(note that this back-referencing is actually also a memory leak in pdfminer.six, which is fixed in PLAYA by the use of weak references)
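The effect of weak references can be shown with a small sketch (hypothetical classes, not the pdfminer.six or PLAYA implementations; the behaviour described relies on CPython's reference counting). A strong back-reference forms a cycle that keeps the document alive until the cycle collector runs, whereas a weak one lets it be freed as soon as the last outside reference goes away:

```python
import gc
import weakref


class Document:
    """Hypothetical document that keeps a list of its reference objects."""

    def __init__(self) -> None:
        self.refs = []


class StrongChild:
    def __init__(self, doc: Document) -> None:
        self.doc = doc  # strong back-reference: doc -> refs -> child -> doc


class WeakChild:
    def __init__(self, doc: Document) -> None:
        self._doc = weakref.ref(doc)  # weak back-reference: no cycle

    @property
    def doc(self) -> Document:
        doc = self._doc()
        if doc is None:
            raise RuntimeError("document has been freed")
        return doc


def freed_without_gc(child_class) -> bool:
    """Report whether a Document is freed by reference counting alone."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cycle collector out of the experiment
    try:
        doc = Document()
        doc.refs.append(child_class(doc))
        probe = weakref.ref(doc)
        del doc
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()


if __name__ == "__main__":
    print("weak:", freed_without_gc(WeakChild))
    print("strong:", freed_without_gc(StrongChild))
```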

dhdaines · 2024-12-27T18:11:42Z

> Without actually profiling this it seems that a lot of the overhead is in the serialization and deserialization of the layout analysis objects as they are deeply nested.

It's somewhat apples and oranges, but this is borne out if we do something similar using PLAYA's layout (BEWARE: will be removed in PLAYA 0.3), which returns a flat list of explicitly serializable objects:

https://github.com/dhdaines/playa/blob/parallel/benchmarks/parallel.py

$ python benchmarks/parallel.py -n 4 samples/contrib/Rgl-1314-2021-Z-en-vigueur-20240823.pdf
PLAYA (4 CPUs) took 18.51s
PLAYA (single) took 50.26s

(this is even with "spawn" as the multiprocessing start method which introduces its own extra overhead...)
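What "a flat list of explicitly serializable objects" could look like, as a sketch only (the `Node` tree and `flatten` are hypothetical and are not PLAYA's or pdfminer.six's data model): a nested layout tree is walked once in the worker and reduced to plain tuples, which pickle can handle without following any object graph.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical nested layout object (page > line > char, for example)."""
    kind: str
    bbox: tuple
    text: str = ""
    children: list = field(default_factory=list)


def flatten(node: Node, out=None) -> list:
    """Depth-first walk that emits one plain (kind, bbox, text) tuple per node."""
    if out is None:
        out = []
    out.append((node.kind, node.bbox, node.text))
    for child in node.children:
        flatten(child, out)
    return out


if __name__ == "__main__":
    page = Node("page", (0, 0, 612, 792), children=[
        Node("line", (50, 700, 62, 712), children=[
            Node("char", (50, 700, 56, 712), "H"),
            Node("char", (56, 700, 62, 712), "i"),
        ]),
    ])
    records = flatten(page)
    # Plain tuples of numbers and strings survive a pickle round trip unchanged.
    assert pickle.loads(pickle.dumps(records)) == records
    print(len(records), "records")
```

The parent-child structure is lost in this form, which is the trade-off for cheaper transfer between processes.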

TheCutestCat · 2024-12-30T03:17:13Z

Thank you for sharing such a comprehensive code example. It clarified why my previous attempts at parallel PDF processing were unsuccessful. Thanks also for pointing out that PDFObjRef objects need to be detached from their parent PDFDocument before serialization, as shown in your parallel processing implementation.

I noticed how you handle the layout in PLAYA. I recently came across an interesting Chinese library that implements an efficient text layout sorting algorithm. What makes it particularly noteworthy is its compact implementation in a single Python file and its effective bbox-based sorting method.

I suggest adding a reference order to the layout processing. This would help create more consistent PDF-to-Markdown conversions by maintaining a clear reading order.
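To make the idea concrete, here is a deliberately naive bbox-based ordering. It is not the algorithm from the library mentioned above; `Box`, `reading_order` and the `line_tol` tolerance are assumptions for illustration. It assumes PDF-style coordinates with y increasing upward and a single-column page:

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical text box; (x0, y0) is bottom-left, (x1, y1) is top-right."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(boxes, line_tol: float = 3.0) -> list:
    """Group boxes into lines by their top edge, then read each line left to right."""
    lines = []
    for box in sorted(boxes, key=lambda b: -b.y1):  # top of the page first
        if lines and abs(lines[-1][0].y1 - box.y1) <= line_tol:
            lines[-1].append(box)  # close enough to the current line's top edge
        else:
            lines.append([box])
    return [box for line in lines for box in sorted(line, key=lambda b: b.x0)]


if __name__ == "__main__":
    boxes = [
        Box(300, 700, 500, 720, "right-top"),
        Box(50, 701, 250, 721, "left-top"),
        Box(50, 650, 250, 670, "second"),
    ]
    print([box.text for box in reading_order(boxes)])
```

On a multi-column page this would interleave the columns, which is where a real layout algorithm is needed.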

Thanks for your help! Hope this helps.

dhdaines · 2024-12-30T20:30:31Z

> I noticed how you handle the layout in PLAYA. I recently came across an interesting Chinese library that implements an efficient text layout sorting algorithm. What makes it particularly noteworthy is its compact implementation in a single Python file and its effective bbox-based sorting method.

Ah, very interesting! This could go in PAVÉS, where I am implementing actual layout analysis things. By the way, if you want to go faster than pdfminer.six on multiple CPUs, you could try PAVÉS, as it has a (mostly) pdfminer.six-compatible API: https://github.com/dhdaines/paves/blob/main/benchmarks/miner.py

> I suggest adding a reference order to the layout processing. This would help create more consistent PDF-to-Markdown conversions by maintaining a clear reading order.

This will probably not be done by PLAYA as it purposefully returns everything in exactly the order it appears in the content streams.

That said, it would be useful to support the rather lossy approximation of a reading order that is defined by the Tagged PDF standard (PDF 1.7, Section 14.8.2). This largely means doing something useful with certain properties of marked content sections, namely /ActualText, /ReversedChars and /TagSuspect.

TheCutestCat · 2024-12-31T03:52:56Z

That's very helpful, thanks for the library recommendation. I'll definitely keep an eye on PAVÉS.

dhdaines changed the title ~~Parallelizing PDF Page Extraction with pdfminer.six~~ Support parallel analysis of PDF documents Dec 27, 2024

dhdaines added the enhancement New feature or request label Dec 27, 2024

dhdaines added this to the 0.3 milestone Dec 27, 2024

dhdaines mentioned this issue Dec 29, 2024

Support parallel operations over pages #36

Merged

dhdaines closed this as completed in #36 Dec 30, 2024
