Some characters are replaced with \f and \b #124

Closed
rplc opened this issue Oct 29, 2016 · 9 comments · Fixed by #597

@rplc

rplc commented Oct 29, 2016

Hello,

Great library. It works like a charm in most cases.
But sometimes there is a problem... Currently I've got a problem that occurs in some pdfs (not all!).

The pdf seems to be ok but the parser replaces some characters in the pdf with \f and \b.
I use the parser in a standard/simple way, see the function call below.

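require 'vendor/autoload.php'; // Composer autoloader (assumes a Composer install)

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile("test.pdf");
// Dump the extracted text so the stray \f and \b sequences can be inspected
file_put_contents("tmp.txt", $pdf->getText());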

Here are two files. The source pdf and the txt file containing the whole pdf content (including the strange replacements).
test.pdf
tmp.txt
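
To narrow down which replacements show up in the extracted text, a small counting helper can be run over the output. This is only a sketch: `countStrayEscapes` is a hypothetical name and is not part of PdfParser. It assumes the stray sequences may appear either as the two literal characters `\b` / `\f` or as the raw control bytes 0x08 / 0x0C, and counts both forms.

```php
<?php
// Hypothetical diagnostic helper (not part of PdfParser).
// Counts literal "\b" / "\f" sequences and the raw control bytes
// they would decode to, so both forms can be told apart.
function countStrayEscapes(string $text): array
{
    return [
        // Single quotes keep the backslash literal: two characters each.
        'literal_b' => substr_count($text, '\b'),
        'literal_f' => substr_count($text, '\f'),
        // Raw backspace (0x08) and form feed (0x0C) bytes.
        'byte_08'   => substr_count($text, "\x08"),
        'byte_0c'   => substr_count($text, "\x0C"),
    ];
}
```

It could be called as `print_r(countStrayEscapes($pdf->getText()));` right after parsing, or on the contents of tmp.txt.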

Would be great if somebody could help me.

Greetings Alex

@rplc
Author

rplc commented Nov 2, 2016

I've tracked the problem down to the translateChar method of the Font class (Font.php).
But I still have no clue what causes the problem...

@aariza-argentina

I have the same problem, but I found that it only occurs on the first page. Apparently, translateChar is not used to translate page 0.

@Connum
Contributor

Connum commented Sep 29, 2020

I went back in time using the provided PDF and can confirm that this issue was fixed with 4f4fd10

@k00ni k00ni closed this as completed Sep 30, 2020
@k00ni k00ni added the bug label Sep 30, 2020
@adrianbj

This fix works great, but it's not in Font.php in the version I installed via composer. Is that still expected, or should it be in there by now?

@k00ni
Collaborator

k00ni commented Jan 27, 2021

We had to revert that fix, because it caused problems.

Reference: 1862686#diff-27bc594b24bd5e2779e8d81ee79810d0ffda03f200f0d53be007a44a9d2cb2de

@k00ni k00ni reopened this Jan 27, 2021
@Connum
Contributor

Connum commented Jan 27, 2021

We could add \b to the replacements list (actually, I thought I had when reverting the fix, but obviously I only added \f), but we'd have to check that this doesn't break anything else of course.

I'm not sure, though, whether we can include this sample PDF for testing, because it contains personal data... @adrianbj can you provide a sample PDF file that we can include for the unit tests?

@Connum
Contributor

Connum commented Jan 27, 2021

I tested this and it's not enough to just add \b to the replacements: the letter that's supposed to be in its place is still missing. Apparently, stripcslashes() does something else to the input that results in the "\b" vanishing from the output while preserving the actual letter. But I don't think I'll find any more time to look into this.
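
One possible reading of that difference, not confirmed anywhere in this thread, is sketched below. Deleting the two characters `\b` throws away the character code they stand for, while stripcslashes() decodes them into the single byte 0x08, which a later lookup could still map to a letter. The input string and the `$fontTable` mapping are made up for illustration and are not taken from Font.php.

```php
<?php
// Illustrative input: "a", then the two characters "\" and "b", then "c".
// Single quotes keep the backslash sequence literal.
$raw = 'a\bc';

// Approach 1: treat "\b" as junk and delete it.
// Whatever character it encoded is gone.
$deleted = str_replace('\b', '', $raw);

// Approach 2: stripcslashes() decodes C-style escapes,
// so the sequence "\b" becomes the single byte 0x08.
$decoded = stripcslashes($raw);

// Hypothetical font table (an assumption for this sketch):
// character code 0x08 maps to the glyph "X".
$fontTable = ["\x08" => 'X'];
$text = strtr($decoded, $fontTable);

echo $deleted, "\n"; // the letter between "a" and "c" is missing
echo $text, "\n";    // the letter survives via the table lookup
```

If this reading is right, any fix would need to decode the escape before the font translation step rather than strip it afterwards.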

@adrianbj

Thanks for explaining - I don't think I am seeing any issues using stripcslashes() here, so will stick with that for the moment. You can use this PDF for testing. It would definitely be great to have a better solution for this. I took a look at some other PDF2TXT libraries (including the one I have used in the past) and they do mostly seem to use a character replacement approach for this, rather than stripcslashes(), but they are structured so differently that I didn't see an easy way to bring their solutions into this library.

healthy-chesapeake-waterways.pdf
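
A character replacement approach of the kind mentioned above might look roughly like the following. This is a sketch under assumptions, not code from this library or from any of the other libraries referred to: `unescapePdfLiteral` and `PDF_ESCAPES` are hypothetical names. The table lists the backslash escapes the PDF specification defines for literal strings, except octal `\ddd` escapes, which are omitted for brevity.

```php
<?php
// Backslash escapes defined for PDF literal strings, mapped to the
// bytes they stand for. Octal escapes (\ddd) are not handled here.
const PDF_ESCAPES = [
    '\\n'  => "\n",
    '\\r'  => "\r",
    '\\t'  => "\t",
    '\\b'  => "\x08", // backspace
    '\\f'  => "\x0C", // form feed
    '\\('  => '(',
    '\\)'  => ')',
    '\\\\' => '\\',   // escaped backslash
];

// Hypothetical helper: decodes the escapes above in a single pass.
// strtr() with an array does not rescan replaced text, so an escaped
// backslash followed by "b" is not decoded a second time.
function unescapePdfLiteral(string $raw): string
{
    return strtr($raw, PDF_ESCAPES);
}
```

Unlike stripcslashes(), this table leaves sequences such as `\x41` or `\a` untouched, since those are C escapes rather than PDF ones. Whether that narrower behavior is what this library needs would have to be checked against its test PDFs.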

@GreyWyvern
Contributor

With #597, getText() extracts the text from both example PDFs cleanly, without any \b or \f.

@k00ni k00ni linked a pull request Jul 23, 2023 that will close this issue