Missing PDFDocEncoding #609

Closed
GreyWyvern opened this issue Jun 27, 2023 · 1 comment

@GreyWyvern
Contributor

The Adobe PDF Reference defines a special encoding which is an extension of Latin1 such that:

Informational or content strings can be represented in Unicode. These strings include text annotations, bookmark names, article names, document information, date strings, etc. In PDF 1.1 these strings are stored in PDFDocEncoding, which is a superset of ISOLatin1.
PDF Reference 1.2 - https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.2.pdf

See also: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

As this includes "document information" such as titles, authors and other details, PdfParser should use PDFDocEncoding to translate these strings.
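
As a rough illustration of how a reader could tell the two storage forms apart: per the PDF specification, a text string that starts with the byte order mark FE FF is UTF-16BE, PDF 2.0 also permits UTF-8 marked by EF BB BF, and anything else is read as PDFDocEncoding. The function name below is hypothetical and is not part of PdfParser; this is only a minimal sketch of the check.

```php
<?php

/**
 * Hypothetical helper, not part of PdfParser: classify a raw PDF text
 * string by its leading byte order mark (BOM).
 */
function detectPdfTextEncoding(string $raw): string
{
    // UTF-16BE text strings begin with the BOM bytes FE FF.
    if (0 === strncmp($raw, "\xFE\xFF", 2)) {
        return 'UTF-16BE';
    }

    // PDF 2.0 additionally allows UTF-8, marked by the BOM bytes EF BB BF.
    if (0 === strncmp($raw, "\xEF\xBB\xBF", 3)) {
        return 'UTF-8';
    }

    // No BOM: the bytes are to be read as PDFDocEncoding.
    return 'PDFDocEncoding';
}
```

A caller would branch on the returned label before converting the metadata value to UTF-8.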

Here is a proposed 'PDFDocEncoding.php' file I quickly mocked up but haven't tested yet. You can give it a shot; I will also see if I can create a branch where this works and submit a PR.

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

// Source : https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.2.pdf
// Source : https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

namespace Smalot\PdfParser\Encoding;

/**
 * Class PDFDocEncoding
 */
class PDFDocEncoding extends AbstractEncoding
{
    public function getTranslations(): array
    {
        $encoding =
          '.notdef .notdef .notdef .notdef .notdef .notdef .notdef .notdef '.
          '.notdef .notdef .notdef .notdef .notdef .notdef .notdef .notdef '.
          '.notdef .notdef .notdef .notdef .notdef .notdef .notdef .notdef '.
          'breve caron circumflex dotaccent hungarumlaut ogonek ring tilde '.
          'space exclam quotedbl numbersign dollar percent ampersand quotesingle '.
          'parenleft parenright asterisk plus comma hyphen period slash zero one '.
          'two three four five six seven eight nine colon semicolon less equal '.
          'greater question at A B C D E F G H I J K L M N O P Q R S T U V W X '.
          'Y Z bracketleft backslash bracketright asciicircum underscore '.
          'grave a b c d e f g h i j k l m n o p q r s t u v w x y z '.
          'braceleft bar braceright asciitilde .notdef bullet dagger daggerdbl '.
          'ellipsis emdash endash florin fraction guilsinglleft guilsinglright '.
          'minus perthousand quotedblbase quotedblleft quotedblright quoteleft '.
          'quoteright quotesinglbase trademark fi fl Lslash OE Scaron Ydieresis '.
          'Zcaron dotlessi lslash oe scaron zcaron .notdef Euro exclamdown cent '.
          'sterling currency yen brokenbar section dieresis copyright '.
          'ordfeminine guillemotleft logicalnot .notdef registered macron degree '.
          'plusminus twosuperior threesuperior acute mu paragraph '.
          'periodcentered cedilla onesuperior ordmasculine guillemotright '.
          'onequarter onehalf threequarters questiondown Agrave Aacute '.
          'Acircumflex Atilde Adieresis Aring AE Ccedilla Egrave Eacute '.
          'Ecircumflex Edieresis Igrave Iacute Icircumflex Idieresis Eth Ntilde '.
          'Ograve Oacute Ocircumflex Otilde Odieresis multiply Oslash Ugrave '.
          'Uacute Ucircumflex Udieresis Yacute Thorn germandbls agrave aacute '.
          'acircumflex atilde adieresis aring ae ccedilla egrave eacute '.
          'ecircumflex edieresis igrave iacute icircumflex idieresis eth ntilde '.
          'ograve oacute ocircumflex otilde odieresis divide oslash ugrave '.
          'uacute ucircumflex udieresis yacute thorn ydieresis';

        return explode(' ', $encoding);
    }
}
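
The glyph-name table above still needs a step that turns bytes into Unicode text. The following is a minimal sketch of one way to do that, assuming a direct byte-to-code-point table rather than glyph names; it is not the code that was later merged, and all function names here are hypothetical. Only the bytes that differ from ISO Latin-1 need an explicit entry, and the three undefined codes (0x7F, 0x9F, 0xAD) are replaced with U+FFFD as an assumption of this sketch.

```php
<?php

/** Hypothetical: PDFDocEncoding bytes whose code point differs from Latin-1. */
function pdfDocEncodingOverrides(): array
{
    return [
        0x18 => 0x02D8, 0x19 => 0x02C7, 0x1A => 0x02C6, 0x1B => 0x02D9,
        0x1C => 0x02DD, 0x1D => 0x02DB, 0x1E => 0x02DA, 0x1F => 0x02DC,
        0x80 => 0x2022, 0x81 => 0x2020, 0x82 => 0x2021, 0x83 => 0x2026,
        0x84 => 0x2014, 0x85 => 0x2013, 0x86 => 0x0192, 0x87 => 0x2044,
        0x88 => 0x2039, 0x89 => 0x203A, 0x8A => 0x2212, 0x8B => 0x2030,
        0x8C => 0x201E, 0x8D => 0x201C, 0x8E => 0x201D, 0x8F => 0x2018,
        0x90 => 0x2019, 0x91 => 0x201A, 0x92 => 0x2122, 0x93 => 0xFB01,
        0x94 => 0xFB02, 0x95 => 0x0141, 0x96 => 0x0152, 0x97 => 0x0160,
        0x98 => 0x0178, 0x99 => 0x017D, 0x9A => 0x0131, 0x9B => 0x0142,
        0x9C => 0x0153, 0x9D => 0x0161, 0x9E => 0x017E, 0xA0 => 0x20AC,
    ];
}

/** Encode a code point from the Basic Multilingual Plane as UTF-8. */
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)).chr(0x80 | ($cp & 0x3F));
    }

    return chr(0xE0 | ($cp >> 12))
        .chr(0x80 | (($cp >> 6) & 0x3F))
        .chr(0x80 | ($cp & 0x3F));
}

/** Hypothetical: convert a PDFDocEncoding byte string to UTF-8. */
function pdfDocEncodingToUtf8(string $bytes): string
{
    $overrides = pdfDocEncodingOverrides();
    $undefined = [0x7F, 0x9F, 0xAD];
    $out = '';

    for ($i = 0, $n = strlen($bytes); $i < $n; ++$i) {
        $code = ord($bytes[$i]);
        if (in_array($code, $undefined, true)) {
            // Assumption: undefined codes become the replacement character.
            $out .= "\u{FFFD}";
            continue;
        }
        // All other bytes share their code point with Latin-1;
        // control bytes below 0x18 are passed through unchanged.
        $out .= codePointToUtf8($overrides[$code] ?? $code);
    }

    return $out;
}
```

Keeping the table in one function and the conversion in another makes the table easy to check against the PDF Reference on its own.
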
GreyWyvern added a commit to GreyWyvern/pdfparser that referenced this issue Jul 4, 2023
Regular PDF metadata (outside of XMP), depending on the characters it includes, can be encoded in UTF-8 escaped (or binary) bytes, or using a proprietary Adobe encoding PDFDocEncoding which is similar to, but not exactly like CP1252.

For more information on the PDFDocEncoding character set, see: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

Another issue is that, regardless of the storage encoding used, Adobe Acrobat inserts a backslash followed by a carriage return (\r) into metadata text to keep lines in the saved PDF data short (about 127 bytes). Unfortunately, the method used to do this does not seem to be binary-safe, so it can corrupt UTF-8 byte sequences, which must then be repaired.

This commit enables decoding PDF metadata using PDFDocEncoding, and also repairs added line-feeds in both PDFDocEncoding and UTF-8.

It also adds a sample file "Issue609.pdf" containing both UTF-8 and PDFDocEncoding encoded metadata fields for testing. The name of the file references PDFParser issue smalot#609.
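
The line-wrap repair mentioned in the commit message above can be sketched as follows. In a PDF literal string, a backslash directly followed by an end-of-line marker is a line continuation and is not part of the string value, so removing those pairs rejoins the bytes on either side. The function name is hypothetical, and the fix that was actually merged may do more than this.

```php
<?php

/**
 * Hypothetical helper, not the merged fix: remove backslash + end-of-line
 * pairs that were inserted into a raw metadata string to wrap long lines.
 */
function stripLineContinuations(string $raw): string
{
    // Match a literal backslash followed by CRLF, CR or LF.
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace('/\\\\(?:\r\n|\r|\n)/', '', $raw) ?? $raw;
}
```

For example, the raw bytes `Caf\xC3`, backslash, `\r`, `\xA9` become the valid UTF-8 string "Café" once the continuation is removed. An escaped letter such as a backslash followed by `n` is left alone, because only a real end-of-line byte after the backslash matches.
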
k00ni added a commit that referenced this issue Jul 11, 2023
* Enable PDFDocEncoding support

* Update PDFDocEncoding.php

I hope I am not assuming too much by adding myself as the author of this file!

* PR #611 suggested changes

Add comments in Document.php
Use plain class PDFDocEncoding, do not extend AbstractEncoding
array() => []
Break up class functions into one that returns the code table, and another that uses the table to perform the conversion

* fixed coding style issues in Document.php

* fixed coding style issue in PDFDocEncoding.php

---------

Co-authored-by: Konrad Abicht <[email protected]>
@k00ni
Collaborator

k00ni commented Jul 11, 2023

This was fixed by #611, wasn't it? @GreyWyvern

If not, please reopen.

@k00ni k00ni closed this as completed Jul 11, 2023