Assert fails when getting the mediabox property for certain PDFs #2991

Paethon · 2024-12-05T13:51:19Z

Hi

We are processing quite a lot of PDFs, and from time to time we see the following assert fail on specific PDFs when trying to get the mediabox property of a page.

pypdf/pypdf/generic/_rectangle.py

Line 24 in 27edc06

assert len(arr) == 4

Here is the content of page:

{'/Contents': [IndirectObject(34, 0, 131870331607200)],
 '/CropBox': [0, 0, 595, 841, 0, 0, 595, 841],
 '/MediaBox': [0, 0, 595, 841, 0, 0, 595, 841],
 '/Parent': IndirectObject(1, 0, 131870331607200),
 '/Resources': IndirectObject(5, 0, 131870331607200),
 '/Type': '/Page'}

Is this "just" a malformed PDF (it opens without problem in a wide range of pdf readers)? Unfortunately, I can't share the PDF, since it contains sensitive customer information.


stefan6419846 · 2024-12-05T14:04:58Z

I just had a look at the PDF specification (ISO 32000-2:2020/PDF 2.0): According to table 31 in section 7.7.3.3, the MediaBox is a rectangle:

(Required; inheritable) A rectangle (see 7.9.5, "Rectangles"), expressed in default user space units, that shall define the boundaries of the physical medium on which the page shall be displayed or printed (see 14.11.2, "Page boundaries").

Looking into section 7.9.5, we can read:

A rectangle shall be written as an array of four numbers giving the coordinates of a pair of diagonally opposite corners.

The PDF 1.7 specification (ISO 32000-1:2008) states the same here.

For this reason, your PDF files are indeed broken, although we might consider truncating rectangles to their first four elements and issuing a warning instead. Feel free to submit a corresponding PR.
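To make the suggestion concrete, here is a minimal sketch of "truncate to four elements and warn". It is not pypdf's code: `truncate_rectangle` is a hypothetical helper operating on plain Python sequences, and the choice to raise on fewer than four numbers is an assumption.

```python
import warnings
from typing import Sequence, Tuple


def truncate_rectangle(arr: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return the first four numbers of ``arr``, warning if extras were dropped.

    Hypothetical helper for illustration only, not part of pypdf.
    """
    if len(arr) < 4:
        # Assumption: too few numbers cannot be repaired, so fail loudly.
        raise ValueError(f"rectangle needs four numbers, got {len(arr)}")
    if len(arr) > 4:
        warnings.warn(
            f"rectangle has {len(arr)} elements, using the first four",
            stacklevel=2,
        )
    x0, y0, x1, y1 = arr[:4]
    return (x0, y0, x1, y1)


# The MediaBox from the report above: eight numbers instead of four.
box = truncate_rectangle([0, 0, 595, 841, 0, 0, 595, 841])
```

A well-formed four-number array passes through unchanged and without a warning.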

Just out of curiosity: Do you have any information on the origin/software generating these faulty files?

Paethon · 2024-12-05T14:10:05Z

Yes, it says: Producer: "pypdf" 😂

Although I have the feeling that they have a different origin and were just processed afterwards using pypdf (but not 100% sure)

stefan6419846 · 2024-12-05T14:12:46Z

Ouch. This might have been an issue from our side, but does not necessarily have to be ;)

Paethon · 2024-12-05T14:16:58Z

I'll see if I can open a PR in the next few days 👍

sjudd · 2024-12-13T22:31:42Z

I've run into this as well, my media box looks slightly different:

/CropBox: [0, 0, 612, 792, 0, 0, 612, 792, 0, 0],

/MediaBox: [0, 0, 612, 792, 0, 0, 612, 792, 0, 0],

Sadly I also can't directly share the file.


sjudd · 2024-12-13T23:22:30Z

I sent a PR for the read issue reported here. I also realized that pypdf is the source for the corrupt boxes and I should be able to reproduce it reliably locally, so I'll look at a fix for the write side issue too.

sjudd · 2024-12-13T23:32:57Z

Here's how I'm reproducing it. I have 5 input files (again, all with customer data :/):

from pypdf import PdfReader, PdfWriter

pdf_writer = PdfWriter()
for i in range(1, 6):
    path = <path_to_file_i>
    with PdfReader(path) as pdf_reader:
        for page in range(len(pdf_reader.pages)):
            pdf_writer.add_page(pdf_reader.pages[page])
output_path = <anywhere>
with open(output_path, "wb") as f:
    pdf_writer.write(f)

reader = PdfReader(output_path)
for i, page in enumerate(reader.pages, 1):
    page.mediabox # this eventually throws

One fun thing, if I read the mediabox value from each page before I add it to the pdf_writer, the merged pdf mediabox is no longer corrupt. E.g. this never throws:

pdf_writer = PdfWriter()
for i in range(1, 6):
    path = <path_to_file_i>
    with PdfReader(path) as pdf_reader:
        for page in range(len(pdf_reader.pages)):
            print(i, pdf_reader.pages[page].mediabox) # Added!!
            pdf_writer.add_page(pdf_reader.pages[page])
output_path = <anywhere>
with open(output_path, "wb") as f:
    pdf_writer.write(f)

reader = PdfReader(output_path)
for i, page in enumerate(reader.pages, 1):
    page.mediabox # this now never throws!


sjudd · 2024-12-13T23:59:17Z

I've narrowed this down to the get_object call:

pypdf/pypdf/_writer.py

Line 487 in 27edc06

"PageObject", page_org.clone(self, False, excluded_keys).get_object()

Here's the page instance printed before and after it (I removed the clone part):

27 {'/Type': '/Page', '/Parent': IndirectObject(2, 0, 4371288464), '/Resources': IndirectObject(16, 0, 4371288464), '/Contents': [IndirectObject(15, 0, 4371288464)], '/MediaBox': [0, 0, 612, 792], '/CropBox': [0, 0, 612, 792]} [0, 0, 612, 792]
27 {'/Type': '/Page', '/Resources': IndirectObject(242, 0, 4367985936), '/Contents': [IndirectObject(250, 0, 4367985936)], '/MediaBox': [0, 0, 612, 792, 0, 0, 612, 792], '/CropBox': [0, 0, 612, 792, 0, 0, 612, 792]} [0, 0, 612, 792, 0, 0, 612, 792]

I think I'm going to go back to my normal job for now, but let me know if I can help reproduce later.

stefan6419846 · 2024-12-14T11:31:37Z

Thanks for the further analysis. Unfortunately, this seems to indicate that there might be a real bug, thus just fixing the constructor of RectangleObject does not really look like the correct solution.

To go on with this and allow further analysis, we would really need some actual minimal example code showing this behavior, including an actual PDF file. I have not been able to reproduce this actual issue with the current code provided and one random PDF file.

sjudd · 2024-12-14T14:23:15Z

I totally agree there's a bug on the write side.

But I think you can still consider a workaround on the read side. There will be some number of saved PDFs in the real world that pypdf has already corrupted this way. Fixing the write issue will stop new corrupt PDFs going forward, but it won't help read the corrupt files saved previously. Chrome and Preview can view these files without any issues that I can see.

Understood regarding reproducibility; I can keep poking at it eventually. Can you point me to the real implementation of get_object for Page? Or to possible examples? I got lost because that method seems to just return self, yet the mediabox was mutated afterwards. Or if that is the only possible implementation, that's useful to know too.

stefan6419846 · 2024-12-14T14:38:48Z

get_object() indeed returns self in most cases and is commonly used to resolve indirect references (IndirectObject) to their actual object (the one being pointed to). Just calling get_object() should in theory not affect this AFAIK, unless it triggers some currently unknown side effect.
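As a rough illustration of that contract, here is a toy sketch. `DirectObject` and `Reference` are hypothetical stand-ins written for this example, not pypdf's classes, and the lookup table is an assumption about how an indirect reference might be resolved.

```python
class DirectObject:
    """Hypothetical stand-in for an object stored inline."""

    def __init__(self, value):
        self.value = value

    def get_object(self):
        # A direct object is already "the actual object".
        return self


class Reference:
    """Hypothetical stand-in for an indirect reference: an ID plus a lookup table."""

    def __init__(self, idnum, table):
        self.idnum = idnum
        self.table = table

    def get_object(self):
        # Follow the reference to the object it points to.
        return self.table[self.idnum].get_object()


objects = {34: DirectObject([0, 0, 595, 841])}
direct = objects[34]
resolved = Reference(34, objects).get_object()
```

In this sketch neither call changes `value`, which matches the expectation that resolving a reference is free of side effects.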

RomanFaust0v · 2025-01-05T05:15:52Z

I have the same problem; it started after Nov 15. I also noticed that the mediabox values are duplicated according to the number of pages in each file. Say I'm merging two PDF documents with two and three pages. The resulting PDF has 5 pages with different mediaboxes: the first two pages have the mediabox values repeated two times, and the last three pages have them repeated three times, matching the page count of the file they came from.
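One way to check a box for this pattern is sketched below. `repeated_rectangle` is a hypothetical helper on plain lists, not a pypdf function; it reports how many copies of the first four numbers a box contains, allowing a cut-off final copy like the ten-element boxes reported earlier in this thread.

```python
def repeated_rectangle(box):
    """Return (rect, copies) if ``box`` is its first four numbers repeated, else None.

    ``copies`` is fractional when the last copy is cut short.
    Hypothetical diagnostic helper, not part of pypdf.
    """
    if len(box) < 4:
        return None
    rect = list(box[:4])
    for i, value in enumerate(box):
        # Every element must match the rectangle at the same position modulo 4.
        if value != rect[i % 4]:
            return None
    return rect, len(box) / 4


doubled = repeated_rectangle([0, 0, 595, 841, 0, 0, 595, 841])
partial = repeated_rectangle([0, 0, 612, 792, 0, 0, 612, 792, 0, 0])
```

A result with `copies` greater than 1 suggests the box was duplicated rather than being arbitrary garbage, in which case keeping only the first four numbers should recover the original rectangle.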

stefan6419846 · 2025-01-05T10:13:03Z

Do you happen to be able to provide a complete reproducing example including a PDF file and the corresponding code?

Paethon changed the title ~~Assert fails when getting the mediabox property fails for certain PDFs~~ Assert fails when getting the mediabox property for certain PDFs Dec 5, 2024

sjudd added a commit to sjudd/pypdf that referenced this issue Dec 13, 2024

BUG: Truncate mediabox and cropbox values with > 4 points.

83e1638

Closes py-pdf#2991

sjudd linked a pull request Dec 13, 2024 that will close this issue

BUG: Truncate mediabox and cropbox values with > 4 points. #3001

Open

sjudd added a commit to sjudd/pypdf that referenced this issue Dec 13, 2024

BUG: Truncate mediabox and cropbox values with > 4 points.

15ea592

Closes py-pdf#2991

stefan6419846 added the needs-pdf (The issue needs a PDF file to show the problem) and needs-example-code (The issue needs a minimal and complete (e.g. all imports) example showing the problem) labels Dec 14, 2024
