Images contained in objects of type "/Pattern" are not retrieved #2613

Closed
0xNath opened this issue May 1, 2024 · 8 comments · Fixed by #2615 · May be fixed by #2637
Labels
workflow-images: From a user's perspective, image handling is the affected feature/workflow

Comments


0xNath commented May 1, 2024

Explanation

Hello,
First of all, thanks for your work; it's a very helpful library.

I am not able to extract images from a PDF generated with OnlyOffice:
B2.pdf

After looking into the PDF structure, it seems that the image on this page is contained inside a Tiling Pattern object, which neither "_page._get_ids_image" nor "_page._get_image" can handle.

I took a look at the PDF standard, and it specifies that Tiling Patterns can contain images, so this is not an OnlyOffice issue.

I haven't read the standard's section on Patterns completely yet, but once I have, I'd like to propose a change so that images can at least be retrieved from them: when we get the images of a page, Patterns would be considered as well.
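For illustration, the nesting described above (page /Resources, then /Pattern, then the pattern's own /Resources and /XObject) could be walked roughly like this. It is only a sketch, not pypdf's API: `iter_pattern_images` and `_resolve` are hypothetical names, and the code assumes the resources behave like dictionaries whose values may expose `get_object()` for indirect references, as pypdf objects do.

```python
from typing import Any, Iterator, Mapping, Tuple


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def iter_pattern_images(resources: Mapping[str, Any]) -> Iterator[Tuple[str, str, Any]]:
    """Yield (pattern_name, xobject_name, xobject) for images inside tiling patterns."""
    patterns = _resolve(resources.get("/Pattern", {}))
    for pattern_name in patterns:
        pattern = _resolve(patterns[pattern_name])
        # Tiling patterns have /PatternType 1 and carry their own /Resources;
        # shading patterns (/PatternType 2) are skipped.
        if _resolve(pattern.get("/PatternType")) != 1:
            continue
        pattern_resources = _resolve(pattern.get("/Resources", {}))
        xobjects = _resolve(pattern_resources.get("/XObject", {}))
        for xobject_name in xobjects:
            xobject = _resolve(xobjects[xobject_name])
            if _resolve(xobject.get("/Subtype")) == "/Image":
                yield pattern_name, xobject_name, xobject
```

Because it only relies on dictionary access, the sketch can be tried on plain nested dicts that mimic the layout of B2.pdf before pointing it at real page resources.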

What do you think?

Have a nice day!

@stefan6419846
Collaborator

Thanks for the report. When determining the images associated with a page, pypdf does indeed not consider nested XObjects.

stefan6419846 added the workflow-images label May 1, 2024
Collaborator

pubpub-zz commented May 1, 2024

pypdf can look into sub-XObjects; however, here you are looking for an object that is part of a pattern, which in my view is not the usual way to do things.
Here is a proposal to extract your image:

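import pypdf

r = pypdf.PdfReader("B2.pdf")
# Walk from the page resources into pattern /P1, then into that pattern's own XObjects.
x1 = r.pages[0]["/Resources"]["/Pattern"]["/P1"]["/Resources"]["/XObject"]["/X1"]
# _xobj_to_image is a private helper; element [2] of its result is the image object.
img = pypdf.filters._xobj_to_image(x1)[2]
img.show()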

I will also try to propose an easier way to extract an image.
Edit: I've found a better way.

Collaborator

pubpub-zz commented May 1, 2024

With the new PR, extraction will be easier:

import pypdf
r = pypdf.PdfReader("B2.pdf")
img = r.pages[0]["/Resources"]["/Pattern"]["/P1"]["/Resources"]["/XObject"]["/X1"].decode_as_image()
img.show()
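Both snippets hard-code the names /P1 and /X1, which are specific to B2.pdf, and a chained lookup like that raises KeyError as soon as one level is missing. A more forgiving lookup might look like the sketch below. `dig` and `_resolve` are hypothetical helpers, not part of pypdf, and the code only assumes dictionary-like objects whose values may expose `get_object()`.

```python
from collections.abc import Mapping
from typing import Any, Optional


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def dig(obj: Any, *keys: str) -> Optional[Any]:
    """Follow a chain of dictionary keys; return None if any level is missing."""
    current = obj
    for key in keys:
        current = _resolve(current)
        if not isinstance(current, Mapping) or key not in current:
            return None
        current = current[key]
    return _resolve(current)
```

Called on a page's resources with the keys "/Pattern", "/P1", "/Resources", "/XObject", "/X1", this would return the same object the snippets above decode, or None for a PDF that is laid out differently.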

Author

0xNath commented May 1, 2024

Wouldn't it be better for the function that is supposed to extract all images of a page to actually extract all images of the page?

The PDF standard says that images can be stored inside Patterns, so we should expect to find images in them.

Collaborator

pubpub-zz commented May 1, 2024

I agree that images can be stored in patterns, but the approach used here is not common. A pattern is normally used to provide an image that is repeated across a surface.
There are too many places where images could be (patterns, annotations, ...); covering them all would be quite complex, and retrieving an image out of its context may not be very efficient.

Author

0xNath commented May 1, 2024

We could add a boolean parameter (recurse, deepSearch or whatever) to the _page.images method.

When set to False, the standard methods _page._get_ids_image and _page._get_image would be called, keeping image retrieval in its simplest form: the inline images and the image dictionaries of the page.

When set to True, we could call the standard methods and, on top of their results, return images found in "special" cases like Patterns.

This way we still keep it efficient for the current usage.
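As a rough illustration of that opt-in flag, here is a sketch that works on a resources dictionary rather than on pypdf internals. `image_keys` and `_resolve` are hypothetical names; the real `_get_ids_image` and `_get_image` are not called here, inline images (which live in the content stream) are left out, and the only assumption is dictionary-like objects whose values may expose `get_object()`.

```python
from typing import Any, List, Mapping, Tuple


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def image_keys(resources: Mapping[str, Any], recurse: bool = False) -> List[Tuple[str, ...]]:
    """Return key paths of image XObjects; look inside patterns only if recurse is True."""
    found: List[Tuple[str, ...]] = []
    xobjects = _resolve(resources.get("/XObject", {}))
    for name in xobjects:
        xobject = _resolve(xobjects[name])
        if _resolve(xobject.get("/Subtype")) == "/Image":
            found.append((name,))
    if recurse:
        patterns = _resolve(resources.get("/Pattern", {}))
        for pattern_name in patterns:
            pattern = _resolve(patterns[pattern_name])
            inner = _resolve(pattern.get("/Resources", {}))
            # Images found in a pattern are reported with the path that leads to them.
            for key in image_keys(inner, recurse=True):
                found.append(("/Pattern", pattern_name) + key)
    return found
```

With `recurse=False` the cost stays the same as a plain scan of the page's /XObject dictionary, which matches the efficiency argument. A real implementation would probably also want a guard against reference cycles in malformed files.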

@pubpub-zz
Collaborator

We can propose a PR

Author

0xNath commented May 1, 2024

Well well well, _page.images isn't a method but a property so passing a parameter to it isn't an option...
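To make the limitation concrete, here is a toy example in plain Python. `FakePage` and `get_images` are made-up names for illustration and say nothing about how pypdf is or will be structured.

```python
class FakePage:
    """Stand-in for a page object; not pypdf's PageObject."""

    def __init__(self, top_level, in_patterns):
        self._top_level = list(top_level)
        self._in_patterns = list(in_patterns)

    @property
    def images(self):
        # A property is read as `page.images`; there is no call, so no arguments.
        return list(self._top_level)

    def get_images(self, recurse=False):
        # A regular method can carry the opt-in flag instead.
        found = list(self._top_level)
        if recurse:
            found.extend(self._in_patterns)
        return found


page = FakePage(top_level=["/Im0"], in_patterns=["/X1"])

call_error = None
try:
    # This calls the list returned by the property, not the property itself.
    page.images(recurse=True)
except TypeError as exc:
    call_error = exc
```

So an opt-in deep search would need a separate method (or some other switch) next to the existing property, rather than a parameter on it.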

stefan6419846 changed the title from 'Image contained in objects of type "/Pattern" are not retrived' to 'Images contained in objects of type "/Pattern" are not retrieved' on May 2, 2024
0xNath added a commit to 0xNath/pypdf that referenced this issue Aug 3, 2024
3 participants