Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hexadecimal decoding (fixes #715) #716

Closed
wants to merge 6 commits into from

Conversation

krzyc
Copy link

@krzyc krzyc commented May 27, 2024

Type of pull request

  • Bug fix (involves code and configuration changes)

About

Fix for #715 as requested by @k00ni. Tests are passing but I am not confident that this will not break pdfparser because I lack knowledge of internal work of library and PDF format.

@k00ni
Copy link
Collaborator

k00ni commented May 28, 2024

@krzyc
Copy link
Author

krzyc commented May 28, 2024

@k00ni I have fixed PHP 7.x incompatibility.
Regarding testSpecialCharsEncodedAsHex test failed - I have still not found a method to fix it - I have added this test because during additional inspection I have discovered that there is more related problems with round brackets.
But I think that PR is already in state that allows us to verify whether my approach is correct.

@k00ni
Copy link
Collaborator

k00ni commented May 28, 2024

Is there a reference in the PDF specification you referring to with this changes in general?

@krzyc
Copy link
Author

krzyc commented May 28, 2024

Probably, but it was not intentional. Let it be "PDF Reference sixth edition" by Adobe Systems Incorporated (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf) page 53:
"Any characters may appear in a string except unbalanced parentheses and the backslash, which must be treated specially" e.g.

( Strings may contain balanced parentheses ( ) and
special characters ( * ! & } ^ % and so on ) . )

(Maybe tests should include this string instead of mine.)

I was mainly interested in reading Sig objects (https://www.adobe.com/devnet-docs/acrobatetk/tools/DigSig/Acrobat_DigitalSignatures_in_PDF.pdf), but pdfparser was truncating binary data so I was forced to dive into basics and found more problems as in attached tests.

My concern is I do not understand why pdfparser is parsing strings like this:

// Find next ')' not escaped.
$cur_start_text = $start_search_end = 0;
while (false !== ($cur_start_pos = strpos($name, ')', $start_search_end))) {
$cur_extract = substr($name, $cur_start_text, $cur_start_pos - $cur_start_text);
preg_match('/(?P<escape>[\\\]*)$/s', $cur_extract, $match);
if (!(\strlen($match['escape']) % 2)) {
break;
}
$start_search_end = $cur_start_pos + 1;
}
// Extract string.
$name = substr($name, 0, (int) $cur_start_pos);

so I am not comfortable at modifying it.

I am looking for advice or help in fixing problem as I written earlier in #715.

@krzyc
Copy link
Author

krzyc commented May 29, 2024

I have rewritten element string parser and now tests pass.
I have two observations:

  • backslash followed by space should return backslash and space, but this triggers fail of tests, so I kept old behavior (returning space only) for compatibility - to be discussed.
  • octal parsing should be probably moved to parsing loop, but I kept current behavior - to be discussed.

I am awaiting your feedback.

@krzyc krzyc marked this pull request as ready for review May 29, 2024 11:22
@k00ni
Copy link
Collaborator

k00ni commented May 30, 2024

Thanks for your effort! I try to give you feedback as soon as possible.

- Moved some test-related code to separate function to improve code
readability
- renamed and refined testSpecialCharsEncodedAsHex: you don't need an
if-clause in a test if you insist on certain values along the way, just
use assertX to check for expected values ("fail early")
- ElementString: moved the part which handles escaped characters to a
separate function to improve code readability
- added references / comments
@k00ni k00ni linked an issue Jun 5, 2024 that may be closed by this pull request
... which fail (mostly) without the fixes in this branch.
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krzyc Again, thanks for your effort. I dove into your code and the PDF specification 1.7 and holy cow, the spec has over 1300 pages. This topic is not easy but I think I got what you try to achieve with your patch as well as what the spec says in this regard. But further feedback is welcomed as always, CC: @j0k3r @GreyWyvern.


I took the liberty to change some of your code. Here is a summary of my changes:

  • refactored some code parts
  • added references and some comments
  • added a few tests

(for further details, please see my commit notes)

About the tests: I copied a few example texts from the spec (page 54 + 55) to see if PDFParser fails without your fixes. It does most of the time. Only this one works pre- and post patch.

What is great is the fact that it now parses strings like (ref):

(Strings may contain balanced parentheses ( ) and
special characters (*!&}^% and so on).)

Pre-patch PDFParser would have only got until Strings may contain balanced parentheses (. This change is super relevant for scientific papers which often use parentheses in some way (( ), [ ], { }, ...).

I can image there are further edge cases which still might not work. But we should aim for the following: It can get new features/capabilities as long as it retains its current ones.

To your knowledge, are there any cases left which worked before but don't now?

Comment on lines +87 to +91
case ' ': // TODO: this should probably be removed - kept for compatibility
$processedName .= $nextChar;
$name = substr($name, 1);
++$position;
break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with "kept for compatibility"? To what?

Comment on lines -50 to 55
// repackage string as standard
$name = '('.self::decode($name).')';
$element = ElementDate::parse($name, $document);
$name = self::decode($name);
$element = ElementDate::parse('('.$name.')', $document);

if (!$element) {
$element = ElementString::parse($name, $document);
$element = new self($name);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please give me a short summary what you try to accomplish here.

Why removing the reference to $document?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also why switching from ElementString to ElementHexa? Is it to match others existing behavior from other elements classes?

@GreyWyvern
Copy link
Contributor

GreyWyvern commented Jun 5, 2024

After an initial review, this appears to be an issue solely with ElementString::parse. There are zero changes required to the ElementHexa.php file IMHO. ElementHexa::parse sends ElementString::parse a perfectly valid string, and it's ElementString that proceeds to mess it up.

The only reason ElementHexa is involved at all is because it's a vector that allows this sort of tricky (but still valid!) string to reach ElementString, whereas normally ElementString would rarely see such strings.

ElementString::parse expects all internal parentheses to be escaped, and I would wager that in >99% of PDFs, they are. But according to the PDF Reference, balanced, unescaped parentheses are allowed. So this is a matter of just updating ElementString::parse to handle balanced, unescaped parentheses. I haven't fully examined the updated code in ElementString::parse in this PR, but as long as it does that, there should be no other edits necessary to ElementHexa.php.

Edit: In fact, when I update the test hexadecimal string to '\(\(\\\\' the output is '()\', not truncated at all, because all the parentheses were escaped as ElementString::parse expected.

@k00ni
Copy link
Collaborator

k00ni commented Jun 21, 2024

What are your plans here @krzyc?

As mentioned, the change in ElementHexa does not seem to be required.

@GreyWyvern
Copy link
Contributor

I've been thinking about this some more and it's a little more complicated than it seems at first glance.

ElementString accepts (and currently expects) strings with escaped slashes and parentheses. 99% of the time strings from PDF document streams will be escaped, BUT unescaped, balanced parentheses are allowed. Unescaped backslashes are NOT allowed; according to the PDF reference if they aren't followed by a valid escape character, they should be ignored.

And that is the problem with strings from ElementHexa. It passes a completely unescaped (but valid) string, but if it contains literal backslashes, then further parsing it will remove them or result in an incorrect special character.

Strings that originate from ElementHexa should be passed to ElementString to create the appropriate object, but ElementString::parse() should probably not be called since the string is already unescaped. Or if necessary, a new function to process already unescaped strings can be added.

@k00ni
Copy link
Collaborator

k00ni commented Jul 19, 2024

I don't want to let this PR become stale, but my time is limited currently. What is the minimal amount of work to finish this? A revert of the ElementHexa part and some refinement should be enough. Any suggestions?

CC @GreyWyvern @j0k3r

@k00ni
Copy link
Collaborator

k00ni commented Aug 16, 2024

Closed due to inactivity and no response from the author.

@k00ni k00ni closed this Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Closing round bracket encoded in hexadecimal format breaks parsing
4 participants