PyPDFLoader parse pdf with extract_images=True encountered an error #26652

XAGU · 2024-09-19T08:48:42Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

PyPDFLoader
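The report gives only the class name here. A minimal sketch of the call that presumably triggers the error, based on the title and the traceback; the function name `load_pdf_with_images` and the file name `example.pdf` are hypothetical, and the failing PDF itself is not included in the report:

```python
def load_pdf_with_images(path: str):
    """Load a PDF with image extraction enabled (the setting named in the title)."""
    # Imported lazily so that defining this helper does not require the package.
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(path, extract_images=True)
    # load() is the entry point shown at the top of the traceback.
    return loader.load()


# Usage (hypothetical file name, not from the report):
# docs = load_pdf_with_images("example.pdf")
```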

Error Message and Stack Trace (if applicable)

  File "envs\xxx\Lib\site-packages\langchain_core\document_loaders\base.py", line 30, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\langchain_community\document_loaders\pdf.py", line 202, in lazy_load
    yield from self.parser.parse(blob)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\langchain_core\document_loaders\base.py", line 126, in parse
    return list(self.lazy_parse(blob))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py", line 124, in lazy_parse
    yield from [
               ^
  File "envs\xxx\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py", line 127, in <listcomp>
    + self._extract_images_from_page(page),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py", line 142, in _extract_images_from_page
    if xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITHOUT_LOSS:
       ~~~~~~~~~~~~^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\pypdf\generic\_data_structures.py", line 319, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '/Filter'
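The traceback shows that `xObject[obj]["/Filter"]` is indexed directly, which fails for any image stream that has no `/Filter` entry (the entry is optional in PDF stream dictionaries). A minimal sketch of a defensive lookup follows; it uses plain dicts in place of pypdf objects, and the helper name `filter_name` is hypothetical, not part of LangChain or pypdf:

```python
from typing import Any, Mapping, Optional


def filter_name(image_obj: Mapping[str, Any]) -> Optional[str]:
    """Return the filter name without its leading slash, or None.

    Hypothetical helper: .get() avoids the KeyError raised by direct indexing.
    """
    raw = image_obj.get("/Filter")
    if raw is None:
        return None  # no /Filter entry on this image stream
    if not isinstance(raw, str):
        return None  # e.g. an array of filters; not handled in this sketch
    return raw[1:]  # strip the leading "/" as the original line does


# Plain dicts standing in for image XObjects (assumed shapes, for illustration).
with_filter = {"/Subtype": "/Image", "/Filter": "/FlateDecode"}
without_filter = {"/Subtype": "/Image"}

print(filter_name(with_filter))
print(filter_name(without_filter))
```

A caller could skip an image when the helper returns `None` instead of letting the whole load fail; whether skipping is the right behavior for unfiltered images is a design decision for the maintainers.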

Description

System Info

langchain: 0.2.12
langchain_community: 0.2.11

dosubot bot added the 🤖:bug label (related to a bug, vulnerability, or unexpected error with an existing feature) on Sep 19, 2024