Decode to wrong character #607

huynq55 · 2023-06-23T03:46:46Z

PHP Version: 8.1.13
PDFParser Version: 2.5.0

Description:

In my PDF file, there are strings containing the character Đ, for example: "Địa chỉ". The characters 'ị', 'a', 'c', 'h' and 'ỉ' are each encoded with 2 bytes, while the character 'Đ' is encoded with 3 bytes. I found this by inspecting the PDF file directly, but I don't understand why 'Đ' takes 3 bytes. PdfParser does not detect this and decodes 2 bytes at a time, so every character after the 'Đ' is read from the wrong offset and decoded incorrectly.

PDF input

1C23TAZ_0000178321.pdf

Expected output & actual output

Expected output: Địa chỉ

Actual output: non-readable text

Bytes sequence: 01 5c 62 04 cf 00 44 00 03 00 46 00 4b 04 cd
01 5c 62 => can't decode
04 cf => ị
00 44 => a
00 03 => space
00 46 => c
00 4b => h
04 cd => ỉ
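
To illustrate why one 3-byte character breaks everything after it, here is a minimal sketch (not PdfParser's actual code) that splits the byte sequence quoted above into fixed 2-byte codes. The variable names are only for illustration.

```php
<?php
// Byte sequence copied from the report above.
$hex = '01 5c 62 04 cf 00 44 00 03 00 46 00 4b 04 cd';
$bytes = hex2bin(str_replace(' ', '', $hex));

// Naive fixed-width decoding: cut the string into 2-byte codes.
$codes = array_map('bin2hex', str_split($bytes, 2));

echo implode(' ', $codes), PHP_EOL;
// prints: 015c 6204 cf00 4400 0300 4600 4b04 cd
```

None of the resulting codes after the first matches the expected 04cf, 0044, 0003, 0046, 004b, 04cd, because the extra byte shifts every later pair by one position.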

Code

use Smalot\PdfParser\Parser;

// $file holds the path to the sample PDF attached above.
$parser = new Parser();
$document = $parser->parseFile($file);
$data = $document->getText();

GreyWyvern · 2023-07-07T19:39:53Z

I believe this is some kind of issue with the PDF generating software HiQPdf 11.1. When I opened the sample PDF in Adobe Acrobat, deleted one letter (the "o" in "month" from "Ngày (Date) 10 tháng (month) 05 năm (year) 2023") and typed it back in, then saved the file, it runs through PdfParser just fine, handling every Đ like a champ.

The same thing occurs when editing no text, but saving the PDF as a "Reduced size PDF" using Adobe Acrobat. All letters get interpreted correctly.

I think there's something to be said for PdfParser to be able to handle mis-coded[1] text (because Adobe sure can, just by loading it) but I don't think this is a bug in PdfParser per se. Probably more like an enhancement to handle this kind of situation.

[1] Is it actually mis-coded? Probably more testing is needed.

k00ni · 2023-07-10T06:43:11Z

@huynq55 can you check that again, please? If it is as @GreyWyvern described, I am in favor of closing this issue.

GreyWyvern · 2023-07-10T19:43:00Z

I don't think this should be closed, because Adobe can open the file and read the characters properly, so PdfParser should be able to do that too. We can at least be sure that Adobe and HiQPdf 11.1 are saving these bytes (or fonts?) in different ways. That is the first place to look.

GreyWyvern · 2023-07-13T15:46:09Z

Just returning to this to see if it's affected by PR 614. This particular issue is not a font issue, but an issue with the way HiQPdf 11.1 is saving the bytes. Whereas in 614 whole blocks of text were being assigned the wrong character map, in this case substrings within correctly encoded blocks are being saved in a weird way.

This is one such block. Parts of it are encoded fine, but a central portion has been changed. Adobe can read this, so there is some way to correct it, but it doesn't have anything to do with incorrectly specified fonts.

768 Nguyễn Thị �\b���Q�K�����3�K�m���Q�J���7�K���Q�K���0�����/���L�����7�3���7�K�����\bức, TP Hồ Chí Minh, Việt Nam.
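
One possible reading, not confirmed anywhere in this thread: the bytes 5c 62 are the ASCII characters `\b`, and the PDF specification defines `\b` inside a literal string as an escape for the single byte 0x08. If the escapes are resolved before the string is cut into 2-byte codes, the sequence from the original report has an even length again. The sketch below uses a hypothetical helper, `unescapePdfLiteral`, which is not part of PdfParser's API; whether code 0108 really maps to 'Đ' in this file's font is also an assumption.

```php
<?php
// Hypothetical helper (not PdfParser's API): resolve the single-character
// escapes that the PDF specification defines for literal strings.
// Octal escapes (\ddd) are deliberately left out to keep the sketch short.
function unescapePdfLiteral(string $raw): string
{
    $map = [
        'n' => "\n", 'r' => "\r", 't' => "\t",
        'b' => "\x08", 'f' => "\x0C",
        '(' => '(', ')' => ')', '\\' => '\\',
    ];

    return preg_replace_callback(
        '/\\\\([nrtbf()\\\\])/',
        fn (array $m): string => $map[$m[1]],
        $raw
    );
}

// Byte sequence from the original report: 01 5c 62 04 cf 00 44 ...
$raw = hex2bin('015c6204cf004400030046004b04cd');

// Resolve escapes first, then split into 2-byte codes.
$codes = array_map('bin2hex', str_split(unescapePdfLiteral($raw), 2));

echo implode(' ', $codes), PHP_EOL;
// prints: 0108 04cf 0044 0003 0046 004b 04cd
```

That yields seven codes for the seven characters of "Địa chỉ", with the last six matching the values listed in the report.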

GreyWyvern · 2023-08-10T18:20:44Z

The sample file 1C23TAZ_0000178321.pdf is now extracting properly in the latest release v2.7.0 and was probably fixed by #597.

k00ni added bug de-/encoding issue labels Jun 23, 2023

GreyWyvern mentioned this issue Jul 7, 2023

text encoding breaks in the middle of the line #586

Closed
