Using pypdfium2 instead of pypdf as the default document loader for langchain.document_loaders #3918
@jerrytigerxu, the PDF loader saves the page number as metadata. Could we also save the document's absolute path with it?
Leaving this comment here in case it helps anyone else. @jerrytigerxu, this was a fantastic idea! It saved me literally hours of debugging work! Package versions used:
Loader code:
The PyPDFLoader class, which is built on pypdf.PdfReader, is considerably slower than loading with pypdfium2.PdfDocument: on average, PyPDFLoader takes about 1000% more time to load PDFs than pypdfium2.
Using pypdfium2 instead of pypdf would cut load times substantially, especially when handling PDF files with over 500 pages.
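For anyone who wants to check the difference on their own files, here is a minimal sketch of how the two approaches could be compared. It is not the LangChain loader code; the helper names (`load_with_pypdfium2`, `load_with_pypdf`, `time_loader`) and the dictionary layout are made up for illustration, and it assumes pypdfium2 4.x or later (for `get_text_range()`) and a recent pypdf.

```python
import time
from typing import Callable, Dict, List

# Hypothetical page record used only in this sketch: text plus simple metadata.
Page = Dict[str, object]


def load_with_pypdfium2(path: str) -> List[Page]:
    """Extract text page by page with pypdfium2 (assumes pypdfium2 >= 4)."""
    import pypdfium2 as pdfium  # imported lazily so the timing helper has no hard dependency

    pdf = pdfium.PdfDocument(path)
    pages: List[Page] = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            text = textpage.get_text_range()
            # Close child objects before their parents.
            textpage.close()
            page.close()
            pages.append({"text": text, "source": path, "page": index})
    finally:
        pdf.close()
    return pages


def load_with_pypdf(path: str) -> List[Page]:
    """Extract text page by page with pypdf.PdfReader."""
    from pypdf import PdfReader  # imported lazily, same reason as above

    reader = PdfReader(path)
    return [
        {"text": page.extract_text() or "", "source": path, "page": index}
        for index, page in enumerate(reader.pages)
    ]


def time_loader(loader: Callable[[str], List[Page]], path: str, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs of `loader(path)`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        loader(path)
        best = min(best, time.perf_counter() - start)
    return best


# Example usage (the file name is a placeholder):
# slow = time_loader(load_with_pypdf, "large.pdf")
# fast = time_loader(load_with_pypdfium2, "large.pdf")
# print(f"pypdf: {slow:.2f}s, pypdfium2: {fast:.2f}s, ratio: {slow / fast:.1f}x")
```

The ratio you see will depend on the PDFs themselves (page count, fonts, embedded images), so treat any single number as a rough indication rather than a fixed speedup.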