Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert fails when getting the mediabox property for certain PDFs #2991

Open
Paethon opened this issue Dec 5, 2024 · 13 comments · May be fixed by #3001
Open

Assert fails when getting the mediabox property for certain PDFs #2991

Paethon opened this issue Dec 5, 2024 · 13 comments · May be fixed by #3001
Labels
is-robustness-issue From a users perspective, this is about robustness needs-example-code The issue needs a minimal and complete (e.g. all imports) example showing the problem needs-pdf The issue needs a PDF file to show the problem

Comments

@Paethon
Copy link

Paethon commented Dec 5, 2024

Hi

We are processing quite a lot of PDFs, and from time to time we see the following assert fail on specific PDFs when trying to get the mediabox property of a page.

assert len(arr) == 4

Here is the content of page:

{'/Contents': [IndirectObject(34, 0, 131870331607200)],
 '/CropBox': [0, 0, 595, 841, 0, 0, 595, 841],
 '/MediaBox': [0, 0, 595, 841, 0, 0, 595, 841],
 '/Parent': IndirectObject(1, 0, 131870331607200),
 '/Resources': IndirectObject(5, 0, 131870331607200),
 '/Type': '/Page'}

Is this "just" a malformed PDF (it opens without problem in a wide range of pdf readers)? Unfortunately, I can't share the PDF, since it contains sensitive customer information.

@Paethon Paethon changed the title Assert fails when getting the mediabox property fails for certain PDFs Assert fails when getting the mediabox property for certain PDFs Dec 5, 2024
@stefan6419846
Copy link
Collaborator

I just had a look at the PDF specification (ISO 32000-2:2020/PDF 2.0): According to table 31 in section 7.7.3.3, the MediaBox is a rectangle:

(Required; inheritable) A rectangle (see 7.9.5, "Rectangles"), expressed in default user space units, that shall define the boundaries of the physical medium on which the page shall be displayed or printed (see 14.11.2, "Page boundaries").

Looking into section 7.9.5, we can read:

A rectangle shall be written as an array of four numbers giving the coordinates of a pair of diagonally opposite corners.

The PDF 1.7 specification (ISO 32000-1:2008) states the same here.

For this reason, your PDF files are indeed broken, although we might consider truncating rectangles to their first four elements and issue a warning instead. Feel free to submit a corresponding PR.

Just out of curiosity: Do you have any information on the origin/software generating these faulty files?

@stefan6419846 stefan6419846 added is-uncaught-exception Use this label only for issues caused by broken PDF documents that cannot be recovered. is-robustness-issue From a users perspective, this is about robustness and removed is-uncaught-exception Use this label only for issues caused by broken PDF documents that cannot be recovered. labels Dec 5, 2024
@Paethon
Copy link
Author

Paethon commented Dec 5, 2024

Yes, it says: Producer: "pypdf" 😂
image

Although I have the feeling that they have a different origin and were just processed afterwards using pypdf (but not 100% sure)

@stefan6419846
Copy link
Collaborator

Ouch. This might have been an issue from our side, but does not necessarily have to be ;)

@Paethon
Copy link
Author

Paethon commented Dec 5, 2024

I'll see if I can open a PR in the next few days 👍

@sjudd
Copy link

sjudd commented Dec 13, 2024

I've run into this as well, my media box looks slightly different:

/CropBox: [
0,
0,
612,
792,
0,
0,
612,
792,
0,
0
],

/MediaBox: [
0,
0,
612,
792,
0,
0,
612,
792,
0,
0
],

Sadly I also can't directly share the file.

sjudd added a commit to sjudd/pypdf that referenced this issue Dec 13, 2024
@sjudd
Copy link

sjudd commented Dec 13, 2024

I sent a PR for the read issue reported here. I also realized that pypdf is the source for the corrupt boxes and I should be able to reproduce it reliably locally, so I'll look at a fix for the write side issue too.

@sjudd
Copy link

sjudd commented Dec 13, 2024

Here's the how I'm reproducing. I have 5 input files (again all with customer data :/):

pdf_writer = PdfWriter()
for i in range(1, 6):
    path = <path_to_file_i>
    with PdfReader(path) as pdf_reader:
        for page in range(len(pdf_reader.pages)):
            pdf_writer.add_page(pdf_reader.pages[page])
output_path = <anywhere>
with open(output_path, "wb") as f:
    pdf_writer.write(f)

reader = PdfReader(output_path)
for i, page in enumerate(reader.pages, 1):
    page.mediabox # this eventually throws

One fun thing, if I read the mediabox value from each page before I add it to the pdf_writer, the merged pdf mediabox is no longer corrupt. E.g. this never throws:

pdf_writer = PdfWriter()
for i in range(1, 6):
    path = <path_to_file_i>
    with PdfReader(path) as pdf_reader:
        for page in range(len(pdf_reader.pages)):
            print(i, pdf_reader.pages[page].mediabox) # Added!!
            pdf_writer.add_page(pdf_reader.pages[page])
output_path = <anywhere>
with open(output_path, "wb") as f:
    pdf_writer.write(f)

reader = PdfReader(output_path)
for i, page in enumerate(reader.pages, 1):
    page.mediabox # this now never throws!

sjudd added a commit to sjudd/pypdf that referenced this issue Dec 13, 2024
@sjudd
Copy link

sjudd commented Dec 13, 2024

I've narrowed this down to the get_object call:

"PageObject", page_org.clone(self, False, excluded_keys).get_object()
. Here's the page instance printed before and after it (I removed the clone part):

27 {'/Type': '/Page', '/Parent': IndirectObject(2, 0, 4371288464), '/Resources': IndirectObject(16, 0, 4371288464), '/Contents': [IndirectObject(15, 0, 4371288464)], '/MediaBox': [0, 0, 612, 792], '/CropBox': [0, 0, 612, 792]} [0, 0, 612, 792]
27 {'/Type': '/Page', '/Resources': IndirectObject(242, 0, 4367985936), '/Contents': [IndirectObject(250, 0, 4367985936)], '/MediaBox': [0, 0, 612, 792, 0, 0, 612, 792], '/CropBox': [0, 0, 612, 792, 0, 0, 612, 792]} [0, 0, 612, 792, 0, 0, 612, 792]

I think I'm going to go back to my normal job for now, but let me know if I can help reproduce later.

@stefan6419846 stefan6419846 added needs-pdf The issue needs a PDF file to show the problem needs-example-code The issue needs a minimal and complete (e.g. all imports) example showing the problem labels Dec 14, 2024
@stefan6419846
Copy link
Collaborator

Thanks for the further analysis. Unfortunately, this seems to indicate that there might be a real bug, thus just fixing the constructor of RectangleObject does not really look like the correct solution.

To go on with this and allow further analysis, we would really need some actual minimal example code showing this behavior, including an actual PDF file. I have not been able to reproduce this actual issue with the current code provided and one random PDF file.

@sjudd
Copy link

sjudd commented Dec 14, 2024

I totally agree there's a bug on the write side.

But I think you can still consider a workaround on the read side. There will be some number of saved PDFs impacted by this issue from pypdf in the real world. Fixing the write issue will stop new corrupt PDFs going forward, but it won't help read the already corrupt files saved previously. Chrome and preview are able to view these files without any issues I can see anyway.

Understood regarding reproducibility, I can keep poking at it eventually. Can you point me to the real implementation of get_object for Page? Or the possible examples? I got lost when that method seems to just return self but then also mutated the mediabox. Or if that is the only possible implementation that's useful to know too

@stefan6419846
Copy link
Collaborator

get_object() indeed returns self in most of the cases and is commonly used to resolve indirect references (IndirectObject) to their actual object (the one pointing to). Just calling get_object() should in theory not affect this AFAIK unless it triggers some currently unknown side effect.

@RomanFaust0v
Copy link

Have the same problem, started after nov 15. I'm also noticed that mediaBox size duplicates accordingly to number of pages in files: let's say i'm merging two pdf documents with two and three pages. In the result pdf document will be 5 pages with different mediabox for every page: first two pages will have mediabox sizes merged two times and last three pages size will be merged three time for the number of pages.

@stefan6419846
Copy link
Collaborator

Do you happen to be able to provide a complete reproducing example including a PDF file and the corresponding code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-robustness-issue From a users perspective, this is about robustness needs-example-code The issue needs a minimal and complete (e.g. all imports) example showing the problem needs-pdf The issue needs a PDF file to show the problem
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants