Using pypdfium2 instead of pypdf as the default document loader for langchain.document_loaders #3918

jerrytigerxu · 2023-05-01T22:50:04Z

PyPDFLoader, which is based on the pypdf.PdfReader class, is considerably slower than loading with pypdfium2.PdfDocument: on average, PyPDFLoader takes 1000% more time to load PDFs than pypdfium2.

Using pypdfium2 instead of pypdf would save a substantial amount of time, especially when handling PDF files with over 500 pages.
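For anyone who wants to check the timing on their own files, here is a minimal sketch of a comparison, not the benchmark behind the numbers above. The helper names (`load_with_pypdf`, `load_with_pypdfium2`, `average_seconds`) are hypothetical, and it times plain per-page text extraction with each library rather than the LangChain loaders themselves.

```python
import time
from typing import Callable, List


def load_with_pypdf(path: str) -> List[str]:
    """Extract text page by page with pypdf (imported lazily)."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def load_with_pypdfium2(path: str) -> List[str]:
    """Extract text page by page with pypdfium2 (imported lazily)."""
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(path)
    texts = []
    for index in range(len(pdf)):
        textpage = pdf[index].get_textpage()
        texts.append(textpage.get_text_range())
    return texts


def average_seconds(
    loader: Callable[[str], List[str]], path: str, repeats: int = 3
) -> float:
    """Return the mean wall-clock time of loader(path) over `repeats` runs."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        loader(path)
    return (time.perf_counter() - start) / repeats


# Example usage (the file name is a placeholder, use one of your own PDFs):
# slow = average_seconds(load_with_pypdf, "large.pdf")
# fast = average_seconds(load_with_pypdfium2, "large.pdf")
# print(f"pypdf: {slow:.2f}s, pypdfium2: {fast:.2f}s")
```

Results will depend on the PDF, the machine, and the library versions, so treat any ratio you measure as specific to your setup.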

sandorkonya · 2023-06-07T14:11:51Z

@jerrytigerxu, the PDF loader saves the page number as metadata; could we also save the document's absolute path with it?
Use case: I write articles for which I use dozens of reference articles as a base. I would like to see the actual PDF page that each resulting chunk originates from (like a semantic search). What do you think, is this feasible within LangChain?

2 replies

jerrytigerxu Jun 7, 2023
Author

@sandorkonya That's a good question. I'm not entirely sure if that's feasible within LangChain, but I do think getting the absolute path is possible.

sandorkonya Jun 7, 2023

@jerrytigerxu Yes, the File reader API seems to be able to do this on file upload, so basically either one more metadata field stored along with the chunk, or some kind of storage that creates a UUID for each file name and assigns those UUIDs to the chunks (to avoid bloating the chunk metadata with the same path string over and over again).
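A rough sketch of the second option, using only the standard library. `SourceRegistry`, `tag_chunk`, and the `source_id` metadata key are hypothetical names made up for illustration; nothing here is an existing LangChain API, and the chunk metadata is treated as a plain dict.

```python
import os
import uuid
from typing import Dict


class SourceRegistry:
    """Maps each absolute file path to one UUID string (hypothetical helper)."""

    def __init__(self) -> None:
        self._id_by_path: Dict[str, str] = {}
        self._path_by_id: Dict[str, str] = {}

    def register(self, file_path: str) -> str:
        """Return the UUID for file_path, creating one on first sight."""
        absolute = os.path.abspath(file_path)
        if absolute not in self._id_by_path:
            source_id = str(uuid.uuid4())
            self._id_by_path[absolute] = source_id
            self._path_by_id[source_id] = absolute
        return self._id_by_path[absolute]

    def resolve(self, source_id: str) -> str:
        """Look up the absolute path that a UUID stands for."""
        return self._path_by_id[source_id]


def tag_chunk(metadata: dict, file_path: str, registry: SourceRegistry) -> dict:
    """Return a copy of the chunk metadata with a short source_id added."""
    tagged = dict(metadata)
    tagged["source_id"] = registry.register(file_path)
    return tagged


# Example usage: two chunks from the same file share one source_id.
# registry = SourceRegistry()
# first = tag_chunk({"page": 0}, "paper.pdf", registry)
# second = tag_chunk({"page": 1}, "paper.pdf", registry)
# registry.resolve(first["source_id"])  # absolute path of paper.pdf
```

The registry lives in memory here; to survive restarts it would need to be persisted somewhere (a JSON file or a small database table, for example).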

vader-valencia · 2024-12-04T16:55:34Z

Leaving the comment here in case it helps anyone else.

@jerrytigerxu, this was a fantastic idea! It saved me literally hours of debugging work!

Package versions used:

python 3.12
langchain_community==0.3.7
pypdfium2==4.30.0

Loader code:

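from langchain_community.document_loaders import PyPDFium2Loader

loader = PyPDFium2Loader(file_path)  # file_path: path to the PDF to load
pages = loader.load()  # returns a list of Document objects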

1 reply

jerrytigerxu Dec 4, 2024
Author

Appreciate it!
