PyPDFLoader raises KeyError: '/Filter' when parsing a PDF with extract_images=True #26652

Open
5 tasks done
XAGU opened this issue Sep 19, 2024 · 0 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

XAGU commented Sep 19, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

PyPDFLoader
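
The example above names only the loader class. A reproduction consistent with the title and the traceback below might look like the following sketch. The function name `load_pdf_with_images` and the file path in the usage comment are placeholders, not taken from the report.

```python
def load_pdf_with_images(pdf_path: str):
    """Load a PDF with image extraction enabled (the setting named in the title)."""
    # Imported inside the function so the sketch can be read and imported
    # even where langchain_community is not installed.
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(pdf_path, extract_images=True)
    # load() consumes lazy_load(), which is where the traceback starts.
    return loader.load()


# Usage (placeholder path, not from the report):
# docs = load_pdf_with_images("example.pdf")
```

Whether the error appears presumably depends on the PDF: going by the traceback, it needs a page with an image object that has no "/Filter" entry.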

Error Message and Stack Trace (if applicable)

  File "envs\xxx\Lib\site-packages\langchain_core\document_loaders\base.py", line 30, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\langchain_community\document_loaders\pdf.py", line 202, in lazy_load
    yield from self.parser.parse(blob)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\langchain_core\document_loaders\base.py", line 126, in parse
    return list(self.lazy_parse(blob))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py", line 124, in lazy_parse
    yield from [
               ^
  File "envs\xxx\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py", line 127, in <listcomp>
    + self._extract_images_from_page(page),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py", line 142, in _extract_images_from_page
    if xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITHOUT_LOSS:
       ~~~~~~~~~~~~^^^^^^^^^^^
  File "envs\xxx\Lib\site-packages\pypdf\generic\_data_structures.py", line 319, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '/Filter'
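
The last two frames show the failure: `_extract_images_from_page` reads `xObject[obj]["/Filter"]` unconditionally, and pypdf raises `KeyError` because that entry is missing. In PDF, `/Filter` is an optional entry of a stream dictionary, so an image stored without any filter has no such key. The sketch below models the lookup with plain dicts rather than pypdf objects. `LOSSLESS_FILTERS`, `filter_name`, and `has_lossless_filter` are hypothetical names, not part of langchain_community, and the filter set is an assumed example, not the library's actual list.

```python
# Hypothetical stand-in for the loader's filter list; the contents are assumed.
LOSSLESS_FILTERS = {"FlateDecode", "LZWDecode"}


def filter_name(xobj):
    """Return the filter name without its leading '/', or None if unavailable."""
    if "/Filter" not in xobj:
        return None  # image stream stored without a filter
    value = xobj["/Filter"]
    if not isinstance(value, str):
        return None  # e.g. an array of filters; not handled in this sketch
    return value[1:]


def has_lossless_filter(xobj):
    """True only when a single, known lossless filter is declared."""
    name = filter_name(xobj)
    return name is not None and name in LOSSLESS_FILTERS


unfiltered = {"/Subtype": "/Image"}
flate = {"/Subtype": "/Image", "/Filter": "/FlateDecode"}

# The unguarded lookup from the traceback fails on the unfiltered image.
try:
    unfiltered["/Filter"][1:]
except KeyError as exc:
    print("unguarded lookup failed:", exc)

print(filter_name(unfiltered), has_lossless_filter(flate))
```

The guard only avoids the exception. How an image without a filter should then be decoded is a separate question that this sketch leaves open.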

Description

[image attached to the original issue; not reproduced here]

System Info

langchain: 0.2.12
langchain_community: 0.2.11

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Sep 19, 2024