Error when obtaining array from a PDF #681

Open
andresflorez12 opened this issue Feb 25, 2024 · 1 comment
@andresflorez12

  • PHP Version: 8.0
  • PDFParser Version: ^2.8

Description:

As you can see, the PDF contains a table with rows and columns, but when it is parsed the order of the values changes. In many cases an empty cell produces no value at all, so I cannot build arrays that keep the values in order: the values come out in an order that does not follow the rows and columns of the table.

PDF input

ARRIVALS.pdf

Expected output & actual output

Expected output (just small fragments):
[42] => Time,[41] => ETA/ARR
[11] => 20-JUL-2010 0900 [12] => 07-DEC-2012 0045 [13] => 30-JAN-2013 0700 [14] => 23-FEB-2013 1000 [15] => 21-MAR-2013 1730 [16] => 09-MAY-2013 0820 [17] => 27-JUL-2013 2250 [18] => 02-SEP-2013 0700 [19] => 13-NOV-2013 1942 [20] => 20-NOV-2013 2052 [21] => 15-JAN-2014 1155 [22] => 17-JAN-2014 0345 [23] => 18-JAN-2014 1200 [24] => 16-APR-2014 1200 [25] => 20-JAN-2015 1700 [26] => 20-JAN-2015 1700 [27] => 07-FEB-2015 1824 [28] => 12-MAR-2015 1237 [29] => 20-OCT-2015 1115 [30] => 19-FEB-2016 1900 [31] => 17-JAN-2017 0850 [32] => 21-MAR-2017 1710 [33] => 26-APR-2017 0700 [34] => 13-JUL-2017 0800 [35] => 26-OCT-2017 0700 [36] => 19-FEB-2018 0330 [37] => 10-JUN-2018 1740 [38] => 02-JUL-2018 0600 [39] => 10-JUL-2018 0710 [40] => 29-AUG-2018 0730
[323] => IMO #
[305]=>'' [306] => 1010179 [307] => 9195248 [307A]=>'' [307B]=>'' [308] => 6903395 [309] => 9147215 [310] => 9607435 [311] => 9192741 [312] => 5100532 [313] => 1010117 [314] => 9669471 [315] => 9551480 [316] => 8906169 [317] => 6725066 [318] => 743807 [319] => 9262003 [320] => 9773870 [321] => 6903981 [322] => 9235490
ACTUAL OUTPUT
Array ( [0] => Arrivals [1] => Page [2] => 1 [3] => Enhanced Vessel Traffic Management System [4] => Prepared on: [5] => Report Id: [6] => SY5700RP [7] => Run by: [8] => REPORTS [9] => 05-FEB-2024 1130 [10] => Atlantic [11] => 20-JUL-2010 0900 [12] => 07-DEC-2012 0045 [13] => 30-JAN-2013 0700 [14] => 23-FEB-2013 1000 [15] => 21-MAR-2013 1730 [16] => 09-MAY-2013 0820 [17] => 27-JUL-2013 2250 [18] => 02-SEP-2013 0700 [19] => 13-NOV-2013 1942 [20] => 20-NOV-2013 2052 [21] => 15-JAN-2014 1155 [22] => 17-JAN-2014 0345 [23] => 18-JAN-2014 1200 [24] => 16-APR-2014 1200 [25] => 20-JAN-2015 1700 [26] => 20-JAN-2015 1700 [27] => 07-FEB-2015 1824 [28] => 12-MAR-2015 1237 [29] => 20-OCT-2015 1115 [30] => 19-FEB-2016 1900 [31] => 17-JAN-2017 0850 [32] => 21-MAR-2017 1710 [33] => 26-APR-2017 0700 [34] => 13-JUL-2017 0800 [35] => 26-OCT-2017 0700 [36] => 19-FEB-2018 0330 [37] => 10-JUN-2018 1740 [38] => 02-JUL-2018 0600 [39] => 10-JUL-2018 0710 [40] => 29-AUG-2018 0730 [41] => ETA/ARR [42] => Time [43] => * [44] => * [45] => * [46] => * [47] => * [48] => * [49] => * [50] => * [51] => * [52] => * [53] => * [54] => * [55] => * [56] => * [57] => * [58] => * [59] => * [60] => * [61] => * [62] => * [63] => * [64] => * [65] => * [66] => * [67] => * [68] => * [69] => * [70] => * [71] => * [72] => Bk Date [73] => 3009758 [74] => 6007662 [75] => 3002206 [76] => 0354104 [77] => 3013206 [78] => 0368156 [79] => 0384917 [80] => 3013608 [81] => 3013713 [82] => 3008829 [83] => 3000660 [84] => 3013815 [85] => 3009984 [86] => 3014090 [87] => 3013301 [88] => 6006991 [89] => 3015105 [90] => 0770795 [91] => 0235661 [92] => 3016136 [93] => 3019329 [94] => 6002381 [95] => 3020297 [96] => 6014554 [97] => 3009692 [98] => 6015644 [99] => 3006914 [100] => 3015287 [101] => 0272787 [102] => 6016569 [103] => SIN [104] => * [105] => DN 136 [106] => ELANDESS [107] => MAREIKE B [108] => MATT II [109] => BLACK SHEEP [110] => SONNY [111] => STADT DUESSELDORF [112] => SENTA [113] => MILITOS [114] => ATLANTIC 
LIGURIA [115] => CANDELA V [116] => IMAGINE [117] => ORINOQUIA I [118] => CONCEPCION [119] => CERRO ITAMUT [120] => PARITA I [121] => T/T BLACK JACK [122] => VB CALIFORNIA [123] => ADONAI [124] => CACIQUE [125] => DWS XPRESS [126] => SINA [127] => ABOUT TIME [128] => ARCANGEL SAN RAFAEL [129] => MAMMA MIA [130] => ARCANGEL SAN GABRIEL [131] => GREAT PORTOBELLO [132] => HC SVEA KIM [133] => ICB - 01 [134] => OÑI LEKUN [135] => Name [136] => 34.12 [137] => 196.85 [138] => 283.46 [139] => 120.01 [140] => 49.11 [141] => 244.09 [142] => 480.54 [143] => 40.85 [144] => 899.54 [145] => 600.39 [146] => 169.95 [147] => 214.90 [148] => 36.58 [149] => 70.00 [150] => 94.82 [151] => 89.90 [152] => 36.09 [153] => 105.31 [154] => 194.69 [155] => 151.71 [156] => 34.45 [157] => 328.05 [158] => 39.37 [159] => 95.28 [160] => 117.59 [161] => 95.47 [162] => 400.50 [163] => 424.70 [164] => 120.20 [165] => 104.99 [166] => Length [167] => 32.87 [168] => 36.19 [169] => 42.72 [170] => 45.57 [171] => 14.70 [172] => 33.53 [173] => 75.46 [174] => 11.93 [175] => 164.16 [176] => 89.99 [177] => 27.66 [178] => 40.88 [179] => 9.97 [180] => 48.26 [181] => 45.93 [182] => 40.03 [183] => 9.84 [184] => 30.35 [185] => 34.81 [186] => 28.35 [187] => 10.50 [188] => 62.32 [189] => 13.12 [190] => 41.34 [191] => 26.41 [192] => 43.41 [193] => 54.09 [194] => 52.79 [195] => 50.00 [196] => 39.37 [197] => Beam [198] => D [199] => HML [200] => CC [201] => Rest [202] => N [203] => H [204] => N [205] => N [206] => N [207] => 7 [208] => N [209] => N [210] => H [211] => 1 [212] => N [213] => N [214] => N [215] => N [216] => N [217] => N [218] => N [219] => N [220] => H [221] => N [222] => N [223] => N [224] => N [225] => N [226] => N [227] => N [228] => 7 [229] => H [230] => N [231] => N [232] => Pd [233] => S12GA [234] => Sched [235] => No. 
[236] => Y [237] => N [238] => N [239] => N [240] => N [241] => N [242] => N [243] => N [244] => N [245] => N [246] => N [247] => N [248] => Y [249] => N [250] => Y [251] => Y [252] => N [253] => N [254] => N [255] => N [256] => N [257] => N [258] => N [259] => N [260] => N [261] => N [262] => N [263] => N [264] => N [265] => N [266] => Tr [267] => Flg [268] => HRM [269] => CPC [270] => HRM+ [271] => HRM [272] => HRM [273] => Hold P [274] => 22-JUL-2010 0828* [275] => First Lock Time [276] => Depart Last Lock [277] => ASA [278] => ASA [279] => PA [280] => GATE [281] => AGENSA [282] => SEASAG [283] => CENTCO [284] => FERNIE [285] => INCH [286] => TINAMC [287] => STANLE [288] => PCC [289] => PCC [290] => ROZO [291] => ASA [292] => INTCAR [293] => STWARD [294] => MASTER [295] => COSCO [296] => CENTCO [297] => ATLASM [298] => MASTER [299] => ATLASM [300] => ONIX [301] => ATLASM [302] => ASA [303] => NL [304] => Agent [305] => Customer [306] => 1010179 [307] => 9195248 [308] => 6903395 [309] => 9147215 [310] => 9607435 [311] => 9192741 [312] => 5100532 [313] => 1010117 [314] => 9669471 [315] => 9551480 [316] => 8906169 [317] => 6725066 [318] => 743807 [319] => 9262003 [320] => 9773870 [321] => 6903981 [322] => 9235490 [323] => IMO # [324] => Vsl [325] => Cd [326] => 14 [327] => 21 [328] => 01 [329] => 14 [330] => 21 [331] => 28 [332] => 07 [333] => 21 [334] => 04 [335] => 29 [336] => 01 [337] => 21 [338] => 21 [339] => 14 [340] => 18 [341] => 18 [342] => 21 [343] => 18 [344] => 01 [345] => 21 [346] => 21 [347] => 07 [348] => 21 [349] => 18 [350] => 21 [351] => 18 [352] => 28 [353] => 01 [354] => 14 [355] => 50 [356] => 1,090 [357] => 2,545 [358] => 251 [359] => 1,141 [360] => 9,528 [361] => 18 [362] => 23,843 [363] => 423 [364] => 1,503 [365] => 484 [366] => 359 [367] => 331 [368] => 810 [369] => 458 [370] => 4,462 [371] => 299 [372] => 4,605 [373] => 6,382 [374] => Gross [375] => Ton [376] => 2009 [377] => 2001 [378] => 1984 [379] => 1968 [380] => 1998 [381] => 1997 
[382] => 1999 [383] => 1956 [384] => 2011 [385] => 2007 [386] => 2013 [387] => 2011 [388] => 1989 [389] => 1967 [390] => 2002 [391] => 2003 [392] => 2007 [393] => 1969 [394] => 2000 [395] => Yr [396] => Blt [397] => 11/06 [398] => 16/00 [399] => 02/00 [400] => 22/08 [401] => 09/00 [402] => 10/08 [403] => 09/00 [404] => 19/06 [405] => 05/00 [406] => 13/00 [407] => 13/02 [408] => 05/00 [409] => 10/00 [410] => Max [411] => TFW [412] => Visit No. [413] => 178234 [414] => 229018 [415] => 231510 [416] => 232797 [417] => 234001 [418] => 236363 [419] => 239894 [420] => 241625 [421] => 244750 [422] => 245363 [423] => 247804 [424] => 247917 [425] => 247661 [426] => 249568 [427] => 264857 [428] => 264860 [429] => 265640 [430] => 267371 [431] => 278035 [432] => 283425 [433] => 298998 [434] => 302484 [435] => 304483 [436] => 308604 [437] => 314107 [438] => 320699 [439] => 327002 [440] => 327464 [441] => 328501 [442] => 331162 [443] => PMX+ )

Code

// Include the Composer autoloader if not already done.
require 'pdfparser/vendor/autoload.php';

// Configure the parser.
$config = new \Smalot\PdfParser\Config();
$config->setIgnoreEncryption(true);
// Treat only FF and CR as PDF whitespace (the library default is NUL, HT, LF, FF, CR, SP).
$config->setPdfWhitespaces("\f\r");
//$config->pdfWhitespacesRegex = '[\0\f\r ]';
$config->setFontSpaceLimit(10);
$config->setHorizontalOffset(' ');

// Parse the PDF file and build the necessary objects.
$parser = new \Smalot\PdfParser\Parser([], $config);
//$pdf = $parser->parseFile('ReportAU_2145225.pdf');
$pdf = $parser->parseFile('ARRIVALS.pdf');
$pages = $pdf->getPages();

// Loop over each page and print its text array.
foreach ($pages as $page) {
    print_r($page->getTextArray());
    echo "\n\n\n";
}
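For reference, one possible way to rebuild table rows is to work from positioned text instead of the flat array. The sketch below is only an illustration: `groupIntoRows` is a hypothetical helper, not part of PDFParser, and it assumes input shaped like the `getDataTm()` entries used later in this thread (`[[a, b, c, d, x, y], text]`) with correctly reported positions. The sample coordinates and cell values are made up.

```php
<?php
// Hypothetical helper (not part of PDFParser): group positioned text
// entries of the form [[a, b, c, d, x, y], text] into table rows.
function groupIntoRows(array $dataTm, float $yTolerance = 2.0): array
{
    // PDF y coordinates grow upwards, so sort descending to read top to bottom.
    usort($dataTm, fn (array $l, array $r): int => (float) $r[0][5] <=> (float) $l[0][5]);

    $rows = [];
    $rowY = null;
    foreach ($dataTm as [$matrix, $text]) {
        $y = (float) $matrix[5];
        // Start a new row when the baseline moves by more than the tolerance.
        if ($rowY === null || abs($rowY - $y) > $yTolerance) {
            $rows[] = [];
            $rowY = $y;
        }
        $rows[count($rows) - 1][] = ['x' => (float) $matrix[4], 'text' => $text];
    }

    // Within each row, order the cells left to right and keep only the text.
    foreach ($rows as &$row) {
        usort($row, fn (array $l, array $r): int => $l['x'] <=> $r['x']);
        $row = array_column($row, 'text');
    }
    unset($row);

    return $rows;
}

// Made-up sample data: two header cells and one data row, out of order.
$sample = [
    [[1, 0, 0, 1, 200.0, 700.0], 'IMO #'],
    [[1, 0, 0, 1, 50.0, 700.5], 'Name'],
    [[1, 0, 0, 1, 200.0, 680.0], '1234567'],
    [[1, 0, 0, 1, 50.0, 680.0], 'VESSEL A'],
];
print_r(groupIntoRows($sample));
```

This only restores row order. Keeping empty cells in place would additionally require mapping each x coordinate to a known column position, which depends on the layout of the specific PDF.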

k00ni added the bug label Feb 26, 2024
@vinceDeNoisy

PHP Version: 8.2.0
PDFParser Version: v2.9.0

Exactly the same issue here. It was working until I had to update PDFParser for PHP 8.

ep4_1_2024-02.pdf

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($fileName);
$page = $pdf->getPages()[2];

echo "\ngetText\n";
print_r($page->getText());
// expected output: a different separator for lines and words (typically "\n" and "\t" or " ")
// actual output: "\n" and " " between each field => impossible to parse lines

echo "\ngetTextArray\n";
print_r($page->getTextArray());
// expected output: an array with the same structure as the PDF
// actual output: an array with one separate value for each word

echo "\ngetDataTm\n";
$data = $page->getDataTm();
foreach ($data as $td) {
    $text = $td[1];
    // Skip entries that contain only whitespace.
    if (trim($text) === '') {
        continue;
    }
    echo 'text = ' . $text . "\n";
    echo 'transformation matrix = (' . $td[0][0] . ',' . $td[0][1] . ',' . $td[0][2] . ',' . $td[0][3] . ")\n";
    echo 'position x=' . $td[0][4] . ' y=' . $td[0][5] . "\n";
}
// expected output: a different position for each text element
// actual output: the same position for every text element
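
To make the last observation easier to verify, the positions can be tallied rather than read off the printed output. This is a minimal sketch: `distinctPositions` is a hypothetical helper, not part of PDFParser, and it assumes the same `[[a, b, c, d, x, y], text]` entry shape as the loop above. The sample data is made up.

```php
<?php
// Hypothetical diagnostic (not part of PDFParser): count how many
// non-empty text entries share each (x, y) position.
function distinctPositions(array $dataTm): array
{
    $seen = [];
    foreach ($dataTm as [$matrix, $text]) {
        // Ignore entries that contain only whitespace.
        if (trim((string) $text) === '') {
            continue;
        }
        // Round to two decimals so tiny float differences do not split a position.
        $key = sprintf('%.2F,%.2F', (float) $matrix[4], (float) $matrix[5]);
        $seen[$key] = ($seen[$key] ?? 0) + 1;
    }

    return $seen;
}

// Made-up sample: three words reported at the same position, one blank entry.
$sample = [
    [[1, 0, 0, 1, 72.0, 720.0], 'first'],
    [[1, 0, 0, 1, 72.0, 720.0], 'second'],
    [[1, 0, 0, 1, 72.0, 720.0], ' '],
    [[1, 0, 0, 1, 72.0, 720.0], 'third'],
];
print_r(distinctPositions($sample));
```

A result with a single key for a page that visibly contains a table would be consistent with the behaviour described above; several keys would suggest the positions are being reported per element.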
