Allow for CRC32 checksum gzuncompress() (#622)
* Allow for CRC32 checksum gzuncompress()

A zlib-compressed stream may carry a CRC32 checksum instead of the Adler-32 checksum that PHP's gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042
Partially resolves #592.
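
For illustration only (this is not part of the commit), a minimal sketch of the two checksum conventions, using only PHP's built-in zlib and hash functions. The sample string `$payload` is an arbitrary assumption. It compares the trailer of a zlib stream (Adler-32, big-endian) with the trailer of a gzip stream (CRC-32, little-endian), which is the kind of mismatch that makes gzuncompress() report a data error.

```php
<?php
// Illustrative sample data; any string works.
$payload = 'Hello, FlateDecode!';

// zlib format (RFC 1950): the last 4 bytes hold the Adler-32 of the
// uncompressed data, stored big-endian.
$zlibStream  = gzcompress($payload);
$zlibTrailer = substr($zlibStream, -4);
$adler32     = hash('adler32', $payload, true); // raw 4-byte digest

// gzip format (RFC 1952): the last 8 bytes hold the CRC-32 followed by
// the uncompressed length, both stored little-endian.
$gzipStream = gzencode($payload);
$gzipCrc    = substr($gzipStream, -8, 4);
$crc32      = pack('V', crc32($payload));

var_dump($zlibTrailer === $adler32);
var_dump($gzipCrc === $crc32);
```

A stream that carries a zlib header but a CRC-32 trailer therefore fails the Adler-32 check in gzuncompress(), even though the compressed payload itself is intact.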

* Simplify decodeFilterFlateDecode() error-handling

Instead of setting an error handler to catch the E_WARNINGs that gzuncompress() emits, suppress them with an @ so we can do away with the try/catch. Make a note of this in the comments.
Switch from tempnam() to tmpfile() because tempnam() can emit E_NOTICEs, which would have to be suppressed as well. tmpfile() just returns a handle or false.
Limit file_get_contents() by $decodeMemoryLimit. Unlike gzuncompress(), for which a limit value of zero (0) means "no limit", file_get_contents() takes null to mean "no limit".
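
A minimal sketch of the two building blocks described above, outside of PdfParser and for illustration only. The variable names and sample strings are assumptions. It shows that a suppressed gzuncompress() call signals failure through its return value, and that the path of a tmpfile() handle can be read from its stream metadata.

```php
<?php
// With @, the E_WARNING is silenced and failure is reported
// through the false return value instead.
$result = @gzuncompress('something');
var_dump($result); // bool(false)

// tmpfile() returns a stream handle, or false on failure.
// The file is removed automatically when the handle is closed.
$contents = null;
$handle = tmpfile();
if (false !== $handle) {
    fwrite($handle, 'sample bytes');

    // The 'uri' entry of the stream metadata holds the file's path,
    // so the file can be reopened by name while the handle is open.
    $path = stream_get_meta_data($handle)['uri'];
    $contents = file_get_contents($path);

    fclose($handle);
}
```

Checking the return value against false keeps the control flow linear, which is what lets the commit drop the custom error handler and the try/catch/finally block.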

* Update FilterHelper.php

Fix for PHP < 8.0, which does not accept null as the length limit for file_get_contents().
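
The branching this implies can be sketched as a small standalone helper. `readWithLimit()` is a hypothetical name used only for this example and does not exist in PdfParser. It maps a limit of 0 ("no limit" in gzuncompress() terms) to a call that omits the length argument entirely, so null is never passed as the length.

```php
<?php
/**
 * Hypothetical helper, for illustration only.
 * A $limit of 0 means "no limit"; any other value caps the bytes read.
 *
 * @return string|false
 */
function readWithLimit(string $path, int $limit)
{
    if (0 === $limit) {
        // Omit the length argument instead of passing null,
        // which works on PHP 7 and PHP 8 alike.
        return file_get_contents($path);
    }

    return file_get_contents($path, false, null, 0, $limit);
}

// Usage: write ten bytes to a temporary file and read them back.
$tmp = tmpfile();
fwrite($tmp, 'abcdefghij');
$tmpPath = stream_get_meta_data($tmp)['uri'];

$all   = readWithLimit($tmpPath, 0);
$first = readWithLimit($tmpPath, 4);

fclose($tmp);
```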
GreyWyvern authored Aug 5, 2023
1 parent c974994 commit 2608ac3
Showing 3 changed files with 39 additions and 27 deletions.
Binary file added samples/bugs/Issue592.pdf
Binary file not shown.
48 changes: 24 additions & 24 deletions src/Smalot/PdfParser/RawData/FilterHelper.php
@@ -233,32 +233,32 @@ protected function decodeFilterASCII85Decode(string $data): string
      */
     protected function decodeFilterFlateDecode(string $data, int $decodeMemoryLimit): ?string
     {
-        /*
-         * gzuncompress may throw a not catchable E_WARNING in case of an error (like $data is empty)
-         * the following set_error_handler changes an E_WARNING to an E_ERROR, which is catchable.
-         */
-        set_error_handler(function ($errNo, $errStr) {
-            if (\E_WARNING === $errNo) {
-                throw new \Exception($errStr);
-            } else {
-                // fallback to default php error handler
-                return false;
-            }
-        });
+        // Uncatchable E_WARNING for "data error" is @ suppressed
+        // so execution may proceed with an alternate decompression
+        // method.
+        $decoded = @gzuncompress($data, $decodeMemoryLimit);

-        $decoded = null;
-
-        // initialize string to return
-        try {
-            $decoded = gzuncompress($data, $decodeMemoryLimit);
-            if (false === $decoded) {
-                throw new \Exception('decodeFilterFlateDecode: invalid code');
+        if (false === $decoded) {
+            // If gzuncompress() failed, try again using the compress.zlib://
+            // wrapper to decode it in a file-based context.
+            // See: https://www.php.net/manual/en/function.gzuncompress.php#79042
+            // Issue: https://github.com/smalot/pdfparser/issues/592
+            $ztmp = tmpfile();
+            if (false != $ztmp) {
+                fwrite($ztmp, "\x1f\x8b\x08\x00\x00\x00\x00\x00".$data);
+                $file = stream_get_meta_data($ztmp)['uri'];
+                if (0 === $decodeMemoryLimit) {
+                    $decoded = file_get_contents('compress.zlib://'.$file);
+                } else {
+                    $decoded = file_get_contents('compress.zlib://'.$file, false, null, 0, $decodeMemoryLimit);
+                }
+                fclose($ztmp);
             }
-        } catch (\Exception $e) {
-            throw $e;
-        } finally {
-            // Restore old handler just in case it was customized outside of PDFParser.
-            restore_error_handler();
         }

+        if (false === \is_string($decoded) || '' === $decoded) {
+            // If the decoded string is empty, that means decoding failed.
+            throw new \Exception('decodeFilterFlateDecode: invalid data');
+        }
+
         return $decoded;
18 changes: 15 additions & 3 deletions tests/PHPUnit/Integration/RawData/FilterHelperTest.php
@@ -36,6 +36,7 @@
 namespace PHPUnitTests\Integration\RawData;

 use PHPUnitTests\TestCase;
+use Smalot\PdfParser\Parser;
 use Smalot\PdfParser\RawData\FilterHelper;

 class FilterHelperTest extends TestCase
@@ -113,7 +114,7 @@ public function testDecodeFilterFlateDecode(): void
     public function testDecodeFilterFlateDecodeEmptyString(): void
     {
         $this->expectException(\Exception::class);
-        $this->expectExceptionMessage('gzuncompress(): data error');
+        $this->expectExceptionMessage('decodeFilterFlateDecode: invalid data');

         $this->fixture->decodeFilter('FlateDecode', '');
     }
@@ -124,13 +125,24 @@ public function testDecodeFilterFlateDecodeEmptyString(): void
     public function testDecodeFilterFlateDecodeUncompressedString(): void
     {
         $this->expectException(\Exception::class);
-        $this->expectExceptionMessage('gzuncompress(): data error');
+        $this->expectExceptionMessage('decodeFilterFlateDecode: invalid data');

         $this->fixture->decodeFilter('FlateDecode', 'something');
     }

     /**
-     * How does function behave if an uncompressed string was given.
+     * How does function behave if compression checksum is CRC32 instead of Adler-32.
+     * See: https://github.com/smalot/pdfparser/issues/592
+     */
+    public function testDecodeFilterFlateDecodeCRC32Checksum(): void
+    {
+        $document = (new Parser())->parseFile($this->rootDir.'/samples/bugs/Issue592.pdf');
+
+        self::assertStringContainsString('Two Westbrook Corporate Center Suite 500', $document->getText());
+    }
+
+    /**
+     * How does function behave if an unknown filter name was given.
      */
     public function testDecodeFilterUnknownFilter(): void
     {