Enable PDFDocEncoding support for metadata #611

Merged 5 commits on Jul 11, 2023

Commits on Jul 4, 2023

  1. Enable PDFDocEncoding support

    Regular PDF metadata (outside of XMP) is stored in one of two ways, depending on the characters it contains: as UTF-8 bytes (escaped or raw binary), or in PDFDocEncoding, a proprietary Adobe encoding that is similar to, but not identical to, CP1252.
    
    For more information on the PDFDocEncoding character set, see: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf
    
    Another issue is that, regardless of the storage encoding, Adobe Acrobat inserts a backslash followed by a line break (\r) into metadata text to keep lines short (~127 bytes) in the saved PDF data. Unfortunately, the method it uses does not appear to be binary-safe: the inserted bytes can split a multi-byte UTF-8 sequence, leaving bytes that no longer decode and must be repaired.
    
    This commit enables decoding PDF metadata using PDFDocEncoding, and also repairs the inserted line breaks in both PDFDocEncoding and UTF-8 text.
    
    It also adds a sample file, "Issue609.pdf", containing both UTF-8 and PDFDocEncoding encoded metadata fields for testing. The file name references PDFParser issue smalot#609.
    GreyWyvern committed Jul 4, 2023
    06958a8
  2. Update PDFDocEncoding.php

    I hope I am not assuming too much by adding myself as the author of this file!
    GreyWyvern committed Jul 4, 2023
    e7c30c5
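
The line-break repair described in the first commit message can be illustrated with a small sketch. This is written in Python for brevity and is not the PHP code from this PR; the function name `strip_line_continuations` is hypothetical, and the sketch assumes the inserted sequence is a backslash followed by CR, LF, or CRLF. It shows how such a sequence can split a two-byte UTF-8 character, and how removing it restores the original bytes.

```python
import re

# Assumption: the inserted sequence is a backslash followed by CR, LF, or CRLF.
_CONTINUATION = re.compile(rb"\\(?:\r\n|\r|\n)")


def strip_line_continuations(raw: bytes) -> bytes:
    """Remove backslash + end-of-line pairs from a raw metadata string."""
    return _CONTINUATION.sub(b"", raw)


title = "Résumé für Zoë".encode("utf-8")

# Simulate a break landing between the two bytes that encode "é".
cut = title.index("é".encode("utf-8")) + 1
damaged = title[:cut] + b"\\\r" + title[cut:]

try:
    damaged.decode("utf-8")
    damaged_is_valid = True
except UnicodeDecodeError:
    # The split sequence is no longer valid UTF-8.
    damaged_is_valid = False

repaired = strip_line_continuations(damaged)
print(damaged_is_valid, repaired.decode("utf-8"))
```

The repair has to run on the raw bytes, before any attempt to decode them as UTF-8. A real implementation would also need to avoid touching a genuinely escaped backslash that happens to precede a line break, which this sketch does not handle.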

Commits on Jul 6, 2023

  1. PR smalot#611 suggested changes

    Add comments in Document.php
    Use plain class PDFDocEncoding, do not extend AbstractEncoding
    Replace array() with []
    Split the class functions in two: one that returns the code table, and another that uses the table to perform the conversion
    GreyWyvern committed Jul 6, 2023
    d660b77
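
The split described above (one function returning the code table, another using it to convert) might look roughly like the following. This is a Python illustration, not the PR's `PDFDocEncoding.php`; the names `pdfdoc_code_table` and `pdfdoc_to_str` are hypothetical, and the table is a deliberately partial subset of PDFDocEncoding. It also shows two positions where PDFDocEncoding and CP1252 disagree.

```python
def pdfdoc_code_table() -> dict:
    """Return a PARTIAL byte -> character table (illustrative subset only)."""
    return {
        0x80: "\u2022",  # bullet (CP1252 has the euro sign here)
        0x84: "\u2014",  # em dash
        0x85: "\u2013",  # en dash
        0x92: "\u2122",  # trademark sign
        0xA0: "\u20AC",  # euro sign (CP1252 has a no-break space here)
    }


def pdfdoc_to_str(raw: bytes) -> str:
    """Convert PDFDocEncoding bytes to text using the code table."""
    table = pdfdoc_code_table()
    # Assumption: bytes missing from the partial table fall back to Latin-1,
    # which agrees with PDFDocEncoding for printable ASCII and for nearly
    # all of 0xA1-0xFF. A full table would not need this fallback.
    return "".join(table.get(byte, chr(byte)) for byte in raw)


sample = b"\xa05 \x80 item"
as_pdfdoc = pdfdoc_to_str(sample)     # euro sign, "5", bullet, "item"
as_cp1252 = sample.decode("cp1252")   # no-break space, "5", euro sign, "item"
print(repr(as_pdfdoc), repr(as_cp1252))
```

Keeping the table in its own function lets tests and other callers inspect the mapping without running a conversion.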

Commits on Jul 10, 2023

  1. a029772
  2. 66e8d9e