Some characters are replaced with \f and \b #124

rplc · 2016-10-29T14:06:19Z

Hello,

Great library. It works like a charm in most cases.
But sometimes there is a problem... Currently I have one that occurs in some PDFs (not all!).

The pdf seems to be ok but the parser replaces some characters in the pdf with \f and \b.
I use the parser in a standard/simple way, see the function call below.

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile("test.pdf");
file_put_contents("tmp.txt", $pdf->getText());

Here are two files. The source pdf and the txt file containing the whole pdf content (including the strange replacements).
test.pdf
tmp.txt

Would be great if somebody could help me.

Greetings Alex


rplc · 2016-11-02T12:14:59Z

I've tracked the problem down to the translateChar method in the Font.php class.
But I still have no clue what causes the problem...

aariza-argentina · 2018-09-05T18:25:55Z

I have the same problem, but I found that it only occurs on the first page. Apparently translateChar is not used to translate page 0.

Connum · 2020-09-29T17:51:51Z

I went back in time using the provided PDF and can confirm that this issue was fixed with 4f4fd10

adrianbj · 2021-01-27T00:38:47Z

This fix works great, but it's not in Font.php in the version I installed via composer. Is that still expected, or should it be in there by now?

k00ni · 2021-01-27T08:36:41Z

We had to revert that fix, because it caused problems.

Reference: 1862686#diff-27bc594b24bd5e2779e8d81ee79810d0ffda03f200f0d53be007a44a9d2cb2de

Connum · 2021-01-27T10:38:46Z

We could add \b to the replacements list (actually, I thought I had when reverting the fix, but obviously I only added \f), but we'd have to check that this doesn't break anything else of course.

I'm not sure, though, whether we can include this sample PDF for testing, because it contains personal data... @adrianbj can you provide a sample PDF file that we can include for the unit tests?

Connum · 2021-01-27T11:56:02Z

I tested this and it's not enough to just add \b to the replacements... (The letter that's supposed to be in its place is still missing.) Apparently, stripcslashes() does something else to the input that results in the "\b" vanishing from the output while preserving the actual letter. But I don't think I'll find any more time to look into this.
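
To make the difference between the two approaches concrete, here is a minimal standalone sketch in plain PHP. It does not use the library and is not the code in Font.php; the sample string and variable names are made up for illustration, and it assumes the stray sequences are a literal backslash followed by b or f. It only shows what each built-in does to such a string, and does not by itself explain why the missing letter comes back in the real parser.

```php
<?php
// Hypothetical sample: in a single-quoted PHP string, \b and \f stay as
// two characters each (a backslash plus a letter).
$raw = 'x\bY\fz';

// stripcslashes() decodes C-style escapes: \b becomes the backspace byte
// (0x08) and \f becomes the form feed byte (0x0C).
$decoded = stripcslashes($raw);

// A replacement list only deletes the exact two-character sequences it names.
$stripped = str_replace(['\f', '\b'], '', $raw);

echo bin2hex($decoded), PHP_EOL; // 7808590c7a
echo $stripped, PHP_EOL;         // xYz
```

With this input, stripcslashes() turns each two-character sequence into a single control byte, while str_replace() simply deletes it. stripcslashes() also decodes other escapes such as octal and hexadecimal ones, which a short replacement list would leave untouched.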

adrianbj · 2021-01-27T14:40:06Z

Thanks for explaining - I don't think I am seeing any issues using stripcslashes() here, so will stick with that for the moment. You can use this PDF for testing. It would definitely be great to have a better solution for this. I took a look at some other PDF2TXT libraries (including the one I have used in the past) and they do mostly seem to use a character replacement approach for this, rather than stripcslashes(), but they are structured so differently that I didn't see an easy way to bring their solutions into this library.

healthy-chesapeake-waterways.pdf

GreyWyvern · 2023-07-21T13:37:54Z

With #597, getText() extracts the text from both example PDFs nicely, without any \b or \f.

k00ni closed this as completed Sep 30, 2020

k00ni added the bug label Sep 30, 2020

k00ni reopened this Jan 27, 2021

k00ni linked a pull request Jul 23, 2023 that will close this issue

str_replace in Font.php now seems to work as expected #597

Merged

k00ni closed this as completed in #597 Jul 26, 2023
