Decode to wrong character #607

Open
huynq55 opened this issue Jun 23, 2023 · 5 comments

@huynq55
huynq55 commented Jun 23, 2023

  • PHP Version: 8.1.13
  • PDFParser Version: 2.5.0

Description:

In my PDF file, there are strings containing the character Đ, for example "Địa chỉ". The characters 'ị', 'a', 'c', 'h' and 'ỉ' are each encoded with 2 bytes, while the character 'Đ' is encoded with 3 bytes. I found this by inspecting the PDF file, but I don't understand why 'Đ' takes 3 bytes. PdfParser doesn't detect this and decodes 2 bytes at a time, so all the characters are decoded incorrectly.

PDF input

1C23TAZ_0000178321.pdf

Expected output & actual output

Expected output: Địa chỉ

Actual output: non-readable text

Bytes sequence: 01 5c 62 04 cf 00 44 00 03 00 46 00 4b 04 cd
01 5c 62 => can't decode
04 cf => ị
00 44 => a
00 03 => space
00 46 => c
00 4b => h
04 cd => ỉ
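
The misalignment described above can be reproduced with a small sketch. It only illustrates naive fixed-width grouping and is not PdfParser's actual decoding code; the helper name `groupBytes` is hypothetical.

```php
<?php
// Hypothetical helper: split a space-separated hex dump into fixed-width groups.
function groupBytes(string $hex, int $width = 2): array
{
    $bytes = explode(' ', trim($hex));
    // array_chunk groups the bytes; implode rebuilds each group as "xx yy".
    return array_map(
        fn(array $chunk): string => implode(' ', $chunk),
        array_chunk($bytes, $width)
    );
}

// Byte sequence reported in this issue for "Địa chỉ".
$sequence = '01 5c 62 04 cf 00 44 00 03 00 46 00 4b 04 cd';

$groups = groupBytes($sequence);
print_r($groups);
```

Because the sequence has 15 bytes, grouping by 2 leaves a lone trailing byte, and none of the expected codes (`04 cf`, `00 44`, and so on) appear as a group: every pair after the first is shifted by one byte.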

Code

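use Smalot\PdfParser\Parser;

$parser = new Parser();
$document = $parser->parseFile($file); // $file: path to 1C23TAZ_0000178321.pdf
$data = $document->getText();          // comes back as non-readable text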

@GreyWyvern
Contributor

I believe this is some kind of issue with the PDF generating software HiQPdf 11.1. When I opened the sample PDF in Adobe Acrobat, deleted one letter (the "o" in "month" from "Ngày (Date) 10 tháng (month) 05 năm (year) 2023") and typed it back in, then saved the file, it runs through PdfParser just fine, handling every Đ like a champ.

The same thing occurs when editing no text, but saving the PDF as a "Reduced size PDF" using Adobe Acrobat. All letters get interpreted correctly.

I think there's something to be said for PdfParser to be able to handle mis-coded[1] text (because Adobe sure can, just by loading it) but I don't think this is a bug in PdfParser per se. Probably more like an enhancement to handle this kind of situation.

  1. Is it actually mis-coded? Probably more testing needed.

@k00ni
Collaborator

k00ni commented Jul 10, 2023

@huynq55 can you check that again please? If what @GreyWyvern said is the case, I am in favor of closing this issue.

@GreyWyvern
Contributor

I don't think this should be closed, because Adobe can open the file and read the characters properly, so PdfParser should be able to do that too. We can at least be sure that Adobe and HiQPdf 11.1 save these bytes (or fonts?) in different ways, which gives us a first place to look.

@GreyWyvern
Contributor

Just returning to this to see if it's affected by PR 614. This particular issue is not a font issue, but an issue with the way HiQPdf 11.1 is saving the bytes. Whereas in 614 whole blocks of text were being assigned the wrong character map, in this case substrings within correctly encoded blocks are being saved in a weird way.

This is one such block. Parts of it are encoded fine, but a central portion has been changed. Adobe can read this, so there is some way to correct it, but it doesn't have anything to do with incorrectly specified fonts.

768 Nguyễn Thị �\b���Q�K�����3�K�m���Q�J���7�K���Q�K���0�����/���L�����7�3���7�K�����\bức, TP Hồ Chí Minh, Việt Nam.
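
One detail that may be relevant, offered as an assumption rather than something confirmed in this thread: the garbled block contains `\b`, and the reported bytes `5c 62` are the ASCII codes for `\` and `b`. In a PDF literal string, `\b` is the escape sequence for a single backspace byte (0x08). If that is what is happening here, resolving the escape before grouping turns the 15-byte sequence into 14 bytes that split evenly into 2-byte codes. The sketch below handles only the single-letter escapes; `unescapeLiteral` is a hypothetical helper, not a PdfParser function.

```php
<?php
// Hypothetical helper: resolve single-letter escapes of a PDF literal string.
// Octal escapes (\ddd) and line continuations are deliberately left out.
function unescapeLiteral(string $raw): string
{
    $map = [
        'n' => "\n", 'r' => "\r", 't' => "\t",
        'b' => "\x08", 'f' => "\x0C",
        '(' => '(', ')' => ')', '\\' => '\\',
    ];
    $out = '';
    $len = strlen($raw);
    for ($i = 0; $i < $len; $i++) {
        $next = $i + 1 < $len ? $raw[$i + 1] : null;
        if ($raw[$i] === '\\' && $next !== null && isset($map[$next])) {
            $out .= $map[$next]; // replace the two-byte escape with one byte
            $i++;                // skip the escaped letter
        } else {
            $out .= $raw[$i];
        }
    }
    return $out;
}

// Byte sequence reported in this issue, converted from hex to raw bytes.
$raw = hex2bin(str_replace(' ', '', '01 5c 62 04 cf 00 44 00 03 00 46 00 4b 04 cd'));

// Split the unescaped bytes into 2-byte codes and show them as hex.
$codes = array_map('bin2hex', str_split(unescapeLiteral($raw), 2));
print_r($codes);
```

Under this assumption the first code becomes `0108` and the remaining codes line up with the ones listed in the report (`04cf`, `0044`, `0003`, `0046`, `004b`, `04cd`). Whether `0108` actually maps to Đ in this file's character map has not been checked here.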

@GreyWyvern
Contributor

The sample file 1C23TAZ_0000178321.pdf is now extracting properly in the latest release v2.7.0 and was probably fixed by #597.
