Assert fails when getting the mediabox property for certain PDFs #2991
Comments
I just had a look at the PDF specification (ISO 32000-2:2020/PDF 2.0): According to table 31 in section 7.7.3.3, the MediaBox is a rectangle:
Looking into section 7.9.5, we can read:
The PDF 1.7 specification (ISO 32000-1:2008) states the same here. For this reason, your PDF files are indeed broken, although we might consider truncating rectangles to their first four elements and issuing a warning instead. Feel free to submit a corresponding PR. Just out of curiosity: Do you have any information on the origin/software generating these faulty files?
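For illustration, the truncate-and-warn idea mentioned above could look roughly like the sketch below. `truncate_rectangle` is a hypothetical standalone helper, not part of pypdf, and it works on a plain sequence of numbers rather than on pypdf objects; how an actual fix would be wired into the library is not specified in this thread.

```python
import warnings
from typing import Sequence, Tuple


def truncate_rectangle(values: Sequence[float]) -> Tuple[float, float, float, float]:
    """Hypothetical helper (not pypdf API): keep only the first four numbers.

    A PDF rectangle is an array of four numbers. If more are present,
    warn and drop the extras instead of failing an assertion.
    """
    if len(values) < 4:
        # Too few numbers cannot be repaired by truncation.
        raise ValueError(f"rectangle needs at least 4 numbers, got {len(values)}")
    if len(values) > 4:
        warnings.warn(
            f"rectangle has {len(values)} elements, truncating to the first 4",
            stacklevel=2,
        )
    x1, y1, x2, y2 = (float(v) for v in values[:4])
    return (x1, y1, x2, y2)
```

Calling `truncate_rectangle([0, 0, 612, 792, 0, 0, 612, 792])` would emit a warning and return the first four values as floats; a well-formed four-element input passes through silently.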
Ouch. This might have been an issue from our side, but does not necessarily have to be ;)
I'll see if I can open a PR in the next few days 👍
I've run into this as well, my media box looks slightly different:
Sadly I also can't directly share the file.
I sent a PR for the read issue reported here. I also realized that pypdf is the source for the corrupt boxes and I should be able to reproduce it reliably locally, so I'll look at a fix for the write side issue too.
Here's how I'm reproducing it. I have 5 input files (again, all with customer data :/):

    from pypdf import PdfReader, PdfWriter

    pdf_writer = PdfWriter()
    for i in range(1, 6):
        path = <path_to_file_i>
        with PdfReader(path) as pdf_reader:
            for page in range(len(pdf_reader.pages)):
                pdf_writer.add_page(pdf_reader.pages[page])
    output_path = <anywhere>
    with open(output_path, "wb") as f:
        pdf_writer.write(f)
    # Re-read the merged file and touch every page's mediabox.
    reader = PdfReader(output_path)
    for i, page in enumerate(reader.pages, 1):
        page.mediabox  # this eventually throws

One fun thing: if I read the mediabox value from each page before I add it to the pdf_writer, the merged PDF's mediabox is no longer corrupt. E.g. this never throws:

    from pypdf import PdfReader, PdfWriter

    pdf_writer = PdfWriter()
    for i in range(1, 6):
        path = <path_to_file_i>
        with PdfReader(path) as pdf_reader:
            for page in range(len(pdf_reader.pages)):
                print(i, pdf_reader.pages[page].mediabox)  # Added!!
                pdf_writer.add_page(pdf_reader.pages[page])
    output_path = <anywhere>
    with open(output_path, "wb") as f:
        pdf_writer.write(f)
    reader = PdfReader(output_path)
    for i, page in enumerate(reader.pages, 1):
        page.mediabox  # this now never throws!
I've narrowed this down to the Line 487 in 27edc06
I think I'm going to go back to my normal job for now, but let me know if I can help reproduce later.
Thanks for the further analysis. Unfortunately, this seems to indicate that there might be a real bug, thus just fixing the constructor of To go on with this and allow further analysis, we would really need some actual minimal example code showing this behavior, including an actual PDF file. I have not been able to reproduce this actual issue with the current code provided and one random PDF file.
I totally agree there's a bug on the write side. But I think you can still consider a workaround on the read side. There will be some number of saved PDFs impacted by this issue from pypdf in the real world. Fixing the write issue will stop new corrupt PDFs going forward, but it won't help read the already corrupt files saved previously. Chrome and Preview are able to view these files without any issues I can see anyway. Understood regarding reproducibility, I can keep poking at it eventually. Can you point me to the real implementation of
I have the same problem; it started after Nov 15. I also noticed that the mediabox values are duplicated according to the number of pages in each source file. Say I'm merging two PDF documents with two and three pages. The resulting PDF has 5 pages whose mediaboxes differ: on the first two pages the mediabox values are repeated two times, and on the last three pages they are repeated three times, matching the page count of the file each page came from.
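To make that pattern concrete, here is a small sketch that checks whether a box array is exactly n copies of its first four numbers. `repetition_count` is a hypothetical helper for illustration only, not pypdf API, and it assumes the corrupt arrays are exact repetitions, which this thread suggests but does not confirm for every affected file.

```python
from typing import Optional, Sequence


def repetition_count(values: Sequence[float]) -> Optional[int]:
    """Hypothetical helper: return n if values is exactly n copies of its
    first four numbers, otherwise None."""
    if len(values) < 4 or len(values) % 4 != 0:
        return None
    first = list(values[:4])
    # Split into consecutive groups of four and compare each to the first group.
    chunks = [list(values[i:i + 4]) for i in range(0, len(values), 4)]
    return len(chunks) if all(chunk == first for chunk in chunks) else None
```

Under that assumption, a page taken from a three-page source file would give a count of 3, while a healthy four-element box gives 1.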
Do you happen to be able to provide a complete reproducing example including a PDF file and the corresponding code?
Hi
We are processing quite a lot of PDFs, and from time to time we see the following assert fail on specific PDFs when trying to get the mediabox property of a page.
pypdf/pypdf/generic/_rectangle.py
Line 24 in 27edc06
Here is the content of page:
Is this "just" a malformed PDF (it opens without problem in a wide range of PDF readers)? Unfortunately, I can't share the PDF, since it contains sensitive customer information.