Missing PDFDocEncoding #609

GreyWyvern · 2023-06-27T20:06:35Z

The Adobe PDF Reference defines a special encoding, PDFDocEncoding, which is an extension of Latin-1. The PDF Reference 1.2 describes its use as follows:

"Informational or content strings can be represented in Unicode. These strings include text annotations, bookmark names, article names, document information, date strings, etc. In PDF 1.1 these strings are stored in PDFDocEncoding, which is a superset of ISOLatin1."
Source: PDF Reference 1.2 - https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.2.pdf

See also: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

As this includes "document information" such as titles, authors and other details, PdfParser should use PDFDocEncoding to translate these strings.
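
To make the distinction concrete: the PDF specifications tell the encodings of a text string apart by its leading bytes. A UTF-16BE string starts with the byte order mark FE FF, PDF 2.0 adds UTF-8 strings starting with EF BB BF, and anything else is PDFDocEncoding. The following is only a minimal sketch of that check; `detectTextStringEncoding()` is a hypothetical helper name, not an existing PdfParser function.

```php
<?php

// Hypothetical helper (not part of PdfParser): guess how a PDF text string
// is encoded by looking at its leading byte order mark.
function detectTextStringEncoding(string $bytes): string
{
    // UTF-16BE text strings begin with the BOM bytes FE FF.
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }

    // PDF 2.0 also allows UTF-8 text strings marked with EF BB BF.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }

    // No marker: the bytes are to be read as PDFDocEncoding.
    return 'PDFDocEncoding';
}

echo detectTextStringEncoding("\xFE\xFF\x00T\x00i"), "\n";
echo detectTextStringEncoding("Caf\xE9"), "\n";
```

A string with no marker falls through to PDFDocEncoding, which is the case this issue is about.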

Here is a proposed 'PDFDocEncoding.php' file I quickly mocked up but haven't tested yet. You can give it a shot; I will also see if I can create a branch where this works and submit a PR.

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

// Source : https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.2.pdf
// Source : https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

namespace Smalot\PdfParser\Encoding;

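/**
 * Class PDFDocEncoding
 *
 * Glyph names for PDFDocEncoding: 256 entries, one per byte value
 * (0x00-0xFF). '.notdef' marks bytes without a glyph in this table.
 */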
class PDFDocEncoding extends AbstractEncoding
{
    public function getTranslations(): array
    {
        $encoding =
          '.notdef .notdef .notdef .notdef .notdef .notdef .notdef .notdef '.
          '.notdef .notdef .notdef .notdef .notdef .notdef .notdef .notdef '.
          '.notdef .notdef .notdef .notdef .notdef .notdef .notdef .notdef '.
          'breve caron circumflex dotaccent hungarumlaut ogonek ring tilde '.
          'space exclam quotedbl numbersign dollar percent ampersand quotesingle '.
          'parenleft parenright asterisk plus comma hyphen period slash zero one '.
          'two three four five six seven eight nine colon semicolon less equal '.
          'greater question at A B C D E F G H I J K L M N O P Q R S T U V W X '.
          'Y Z bracketleft backslash bracketright asciicircum underscore '.
          'grave a b c d e f g h i j k l m n o p q r s t u v w x y z '.
          'braceleft bar braceright asciitilde .notdef bullet dagger daggerdbl '.
          'ellipsis emdash endash florin fraction guilsinglleft guilsinglright '.
          'minus perthousand quotedblbase quotedblleft quotedblright quoteleft '.
          'quoteright quotesinglbase trademark fi fl Lslash OE Scaron Ydieresis '.
          'Zcaron dotlessi lslash oe scaron zcaron .notdef Euro exclamdown cent '.
          'sterling currency yen brokenbar section dieresis copyright '.
          'ordfeminine guillemotleft logicalnot .notdef registered macron degree '.
          'plusminus twosuperior threesuperior acute mu paragraph '.
          'periodcentered cedilla onesuperior ordmasculine guillemotright '.
          'onequarter onehalf threequarters questiondown Agrave Aacute '.
          'Acircumflex Atilde Adieresis Aring AE Ccedilla Egrave Eacute '.
          'Ecircumflex Edieresis Igrave Iacute Icircumflex Idieresis Eth Ntilde '.
          'Ograve Oacute Ocircumflex Otilde Odieresis multiply Oslash Ugrave '.
          'Uacute Ucircumflex Udieresis Yacute Thorn germandbls agrave aacute '.
          'acircumflex atilde adieresis aring ae ccedilla egrave eacute '.
          'ecircumflex edieresis igrave iacute icircumflex idieresis eth ntilde '.
          'ograve oacute ocircumflex otilde odieresis divide oslash ugrave '.
          'uacute ucircumflex udieresis yacute thorn ydieresis';

        return explode(' ', $encoding);
    }
}
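
The table above maps byte values to glyph names; producing readable metadata also needs a Unicode value for each byte. As a rough sketch of that second step, the function below converts a PDFDocEncoding byte string directly to UTF-8. `pdfDocEncodingToUtf8()` is a hypothetical name, not PdfParser's API. Two choices here are assumptions of this sketch: the undefined bytes 0x7F, 0x9F and 0xAD become U+FFFD, and control bytes below 0x18 are passed through unchanged.

```php
<?php

// Hypothetical helper for illustration only; not PdfParser's API.
function pdfDocEncodingToUtf8(string $bytes): string
{
    // Bytes whose meaning differs from ISO Latin-1, stored as UTF-8 strings.
    static $special = [
        0x18 => "\u{02D8}", 0x19 => "\u{02C7}", 0x1A => "\u{02C6}", 0x1B => "\u{02D9}",
        0x1C => "\u{02DD}", 0x1D => "\u{02DB}", 0x1E => "\u{02DA}", 0x1F => "\u{02DC}",
        0x80 => "\u{2022}", 0x81 => "\u{2020}", 0x82 => "\u{2021}", 0x83 => "\u{2026}",
        0x84 => "\u{2014}", 0x85 => "\u{2013}", 0x86 => "\u{0192}", 0x87 => "\u{2044}",
        0x88 => "\u{2039}", 0x89 => "\u{203A}", 0x8A => "\u{2212}", 0x8B => "\u{2030}",
        0x8C => "\u{201E}", 0x8D => "\u{201C}", 0x8E => "\u{201D}", 0x8F => "\u{2018}",
        0x90 => "\u{2019}", 0x91 => "\u{201A}", 0x92 => "\u{2122}", 0x93 => "\u{FB01}",
        0x94 => "\u{FB02}", 0x95 => "\u{0141}", 0x96 => "\u{0152}", 0x97 => "\u{0160}",
        0x98 => "\u{0178}", 0x99 => "\u{017D}", 0x9A => "\u{0131}", 0x9B => "\u{0142}",
        0x9C => "\u{0153}", 0x9D => "\u{0161}", 0x9E => "\u{017E}", 0xA0 => "\u{20AC}",
        // Assumption: undefined bytes map to U+FFFD REPLACEMENT CHARACTER.
        0x7F => "\u{FFFD}", 0x9F => "\u{FFFD}", 0xAD => "\u{FFFD}",
    ];

    $out = '';
    $length = strlen($bytes);
    for ($i = 0; $i < $length; ++$i) {
        $byte = ord($bytes[$i]);
        if (isset($special[$byte])) {
            $out .= $special[$byte];
        } elseif ($byte < 0x80) {
            // ASCII (and, by assumption, low control bytes) pass through.
            $out .= $bytes[$i];
        } else {
            // 0xA1-0xFF match Latin-1: encode U+00A1..U+00FF as two UTF-8 bytes.
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }

    return $out;
}

echo pdfDocEncodingToUtf8("Caf\xE9 \x80 100 \xA0"), "\n";
```

Byte 0xA0 shows why plain Latin-1 or CP1252 conversion is not enough: it is a no-break space in both of those, but the Euro sign in PDFDocEncoding.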


Regular PDF metadata (outside of XMP), depending on the characters it includes, can be encoded as UTF-8 escaped (or binary) bytes, or in PDFDocEncoding, a proprietary Adobe encoding that is similar to, but not exactly like, CP1252. For more information on the PDFDocEncoding character set, see: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

Another issue is that, regardless of the storage encoding used, Adobe Acrobat will attempt to add a backslash followed by a line break (\r) to metadata text to avoid long line lengths (~127 bytes) in the saved PDF data. Unfortunately, the method used does not seem to be binary-safe, leaving saved UTF-8 bytes corrupted so that they must be repaired.

This commit enables decoding PDF metadata using PDFDocEncoding, and also repairs the added line breaks in both PDFDocEncoding and UTF-8. It also adds a sample file "Issue609.pdf" containing both UTF-8 and PDFDocEncoding encoded metadata fields for testing. The file name references PDFParser issue smalot#609.
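
The commit's own repair code is not reproduced in this thread. As a rough illustration of the idea only: PDF syntax treats a backslash immediately followed by an end-of-line marker inside a literal string as a continuation that is not part of the string. The sketch below strips such pairs from the raw bytes before any character-set decoding, so a multi-byte UTF-8 sequence split by the inserted break is joined again. `removeLineContinuations()` is a hypothetical name, and the handling of escaped backslashes is this sketch's own design choice, not something described in the commit.

```php
<?php

// Hypothetical helper (not PdfParser's API): drop "backslash + end-of-line"
// continuations from the raw bytes of a PDF literal string.
function removeLineContinuations(string $raw): string
{
    // First alternative: an escaped backslash (two backslashes), captured
    // and kept so it is never mistaken for the start of a continuation.
    // Second alternative: one backslash followed by CRLF, CR or LF, which
    // is removed because group 1 is empty for that match.
    $pattern = '/(\\\\\\\\)|\\\\(?:\r\n|\r|\n)/';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '$1', $raw) ?? $raw;
}

// "é" (C3 A9) split in the middle by an inserted backslash + CR.
$broken = "Caf\xC3\\\r\xA9";
echo removeLineContinuations($broken), "\n";
```

Running the repair on bytes rather than on decoded text matters here: once a split sequence has been decoded as two separate characters, the original character can no longer be recovered reliably.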

* Enable PDFDocEncoding support

* Update PDFDocEncoding.php

  I hope I am not assuming too much by adding myself as the author of this file!

* PR #611 suggested changes

  - Add comments in Document.php
  - Use plain class PDFDocEncoding, do not extend AbstractEncoding
  - array() => []
  - Break up class functions into one that returns the code table, and another that uses the table to perform the conversion

* fixed coding style issues in Document.php

* fixed coding style issue in PDFDocEncoding.php

---------

Co-authored-by: Konrad Abicht <[email protected]>

k00ni · 2023-07-11T14:56:22Z

This was fixed by #611, wasn't it, @GreyWyvern?

If not, please reopen.

k00ni added enhancement de-/encoding issue labels Jun 28, 2023

k00ni mentioned this issue Jun 28, 2023

Read XMP Metadata and add it to data returned by getDetails() #606

Merged

GreyWyvern mentioned this issue Jul 4, 2023

Enable PDFDocEncoding support for metadata #611

Merged

k00ni closed this as completed Jul 11, 2023
