Missing PDFDocEncoding #609
GreyWyvern added a commit to GreyWyvern/pdfparser that referenced this issue Jul 4, 2023
Regular PDF metadata (outside of XMP), depending on the characters it includes, can be encoded as UTF-8 (escaped or binary bytes), or using PDFDocEncoding, a proprietary Adobe encoding that is similar to, but not exactly like, CP1252. For more information on the PDFDocEncoding character set, see: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

Another issue is that, regardless of the storage encoding used, Adobe Acrobat will attempt to add a backslash followed by a line break (\r) to metadata text to avoid long line lengths (~127 bytes) in the saved PDF data. Unfortunately, the method it uses does not seem to be binary-safe, so multi-byte UTF-8 sequences can be split apart and must be repaired.

This commit enables decoding PDF metadata using PDFDocEncoding, and also repairs the added line breaks in both PDFDocEncoding and UTF-8. It also adds a sample file "Issue609.pdf" containing both UTF-8 and PDFDocEncoding encoded metadata fields for testing. The name of the file references PDFParser issue smalot#609: smalot#609
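To illustrate the line-break repair described in the commit message, here is a minimal sketch in Python (the project itself is PHP). It is not the code from the commit. It assumes the inserted sequence is a backslash followed by CR, LF, or CRLF, and that removing it at the byte level, before any character decoding, is enough to rejoin a split UTF-8 sequence. The function name `strip_line_continuations` is hypothetical.

```python
import re

# Assumption: the inserted break is a backslash followed by CR, LF, or CRLF.
# Matching on bytes (not str) lets us remove it even when it landed in the
# middle of a multi-byte UTF-8 sequence.
_LINE_CONTINUATION = re.compile(rb"\\(?:\r\n|\r|\n)")


def strip_line_continuations(raw: bytes) -> bytes:
    """Remove backslash + end-of-line pairs from raw metadata bytes."""
    return _LINE_CONTINUATION.sub(b"", raw)


# Simulate a break inserted between the two bytes of "é" (0xC3 0xA9).
original = "Résumé".encode("utf-8")
damaged = original[:2] + b"\\\r" + original[2:]

repaired = strip_line_continuations(damaged)
print(repaired.decode("utf-8"))
```

Decoding `damaged` directly as UTF-8 would fail, because the lead byte 0xC3 is followed by a backslash rather than a continuation byte. Doing the repair on bytes first avoids that.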
k00ni added a commit that referenced this issue Jul 11, 2023
* Enable PDFDocEncoding support

  Regular PDF metadata (outside of XMP), depending on the characters it includes, can be encoded as UTF-8 (escaped or binary bytes), or using PDFDocEncoding, a proprietary Adobe encoding that is similar to, but not exactly like, CP1252. For more information on the PDFDocEncoding character set, see: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

  Another issue is that, regardless of the storage encoding used, Adobe Acrobat will attempt to add a backslash followed by a line break (\r) to metadata text to avoid long line lengths (~127 bytes) in the saved PDF data. Unfortunately, the method it uses does not seem to be binary-safe, so multi-byte UTF-8 sequences can be split apart and must be repaired.

  This commit enables decoding PDF metadata using PDFDocEncoding, and also repairs the added line breaks in both PDFDocEncoding and UTF-8. It also adds a sample file "Issue609.pdf" containing both UTF-8 and PDFDocEncoding encoded metadata fields for testing. The name of the file references PDFParser issue #609: #609

* Update PDFDocEncoding.php

  I hope I am not assuming too much by adding myself as the author of this file!

* PR #611 suggested changes

  Add comments in Document.php
  Use plain class PDFDocEncoding, do not extend AbstractEncoding
  array() => []
  Break up class functions into one that returns the code table, and another that uses the table to perform the conversion

* fixed coding style issues in Document.php

* fixed coding style issue in PDFDocEncoding.php

---------

Co-authored-by: Konrad Abicht <[email protected]>
This was fixed by #611, wasn't it? @GreyWyvern, if not, please reopen.
The Adobe PDF Reference defines a special encoding which is an extension of Latin1 such that:
See also: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf
As this includes "document information" such as titles, authors and other details, PdfParser should use PDFDocEncoding to translate these strings.
Here is a proposed 'PDFDocEncoding.php' file I quickly mocked up but haven't tested yet. You can give it a shot; I will also see if I can create a branch where this works and submit a PR.
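Separately from the proposed PHP file, the table-lookup idea can be sketched in Python. This is an illustration only, not the contents of 'PDFDocEncoding.php'. The override table is deliberately partial: it lists only a handful of bytes where PDFDocEncoding differs from Latin-1 and falls back to the Latin-1 code point for everything else, which is an approximation. The names `PDFDOC_OVERRIDES` and `decode_pdfdoc` are hypothetical, and a real implementation would carry the full table from the PDF Reference.

```python
# Partial table (assumption: illustrative subset only). These are bytes whose
# PDFDocEncoding meaning differs from Latin-1; see the PDF Reference for the
# complete character set.
PDFDOC_OVERRIDES = {
    0x80: "\u2022",  # bullet
    0x84: "\u2014",  # em dash
    0x85: "\u2013",  # en dash
    0x92: "\u2122",  # trade mark sign
    0xA0: "\u20ac",  # euro sign
}


def decode_pdfdoc(data: bytes) -> str:
    """Decode bytes via the override table, else treat the byte as Latin-1."""
    # chr(b) for b in 0..255 is exactly the Latin-1 mapping.
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in data)


print(decode_pdfdoc(b"5\xa0 \x80 caf\xe9"))
```

Splitting the code table from the function that applies it mirrors the structure suggested in the PR #611 review, where one function returns the table and another performs the conversion.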