Many lines inside a spreadsheet cell: heuristic fails #57
Comments
Thanks, @mhkeller. Can you share that PDF? I'd love to take a look at it.
Also, if you're using Tabula from the command line, you might want to try tabula-java (pure Java, easier to install, ~3x faster).
Just sent you an extract. Re: tabula-extractor 😓 I just got it running.
OK, so the output is expected for that PDF. Without visual separators ("ruling lines") we can't really merge multi-line cells. I usually post-process those cases with a script that "rolls up" multi-line cells.
I shall do the same.
Unfortunately, the sheer variety of shitty PDFs out there precludes a heuristic attempt at "rolling up" these sorts of cells automatically. Some flow onto the next line, some flow onto the previous line, some center-align. It's a real mess. I've thought about writing a roller-upper framework, but for some reason concluded it was impossible.
Ya, it sounds rough. I'll post a link for whatever I end up implementing. The more examples the merrier, I guess, right?
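For anyone landing here later, a minimal sketch of the kind of "roll up" post-processing script discussed above. This is not part of Tabula or tabula-java; the function name `roll_up` and its parameters are made up for illustration. It assumes the extracted rows all have the same number of columns, and that a row is a continuation whenever its key column (the first one by default) is blank. It only covers the "flows onto the previous line" and "flows onto the next line" cases, not center-aligned cells.

```python
from typing import List


def _merge(first: List[str], second: List[str], sep: str) -> List[str]:
    # Join two rows cell by cell, skipping empty cells so no stray separators appear.
    return [
        sep.join(part for part in (a.strip(), b.strip()) if part)
        for a, b in zip(first, second)
    ]


def roll_up(
    rows: List[List[str]],
    key_col: int = 0,
    direction: str = "previous",
    sep: str = " ",
) -> List[List[str]]:
    """Merge continuation rows into a neighbouring anchor row.

    Assumption: a row whose key column is blank is a continuation.
    direction="previous" folds it into the closest anchor row above;
    direction="next" folds it into the closest anchor row below.
    """
    if direction not in ("previous", "next"):
        raise ValueError("direction must be 'previous' or 'next'")

    # For "next", walk the table bottom-up so the anchor is always seen first.
    ordered = rows if direction == "previous" else rows[::-1]
    merged: List[List[str]] = []
    for row in ordered:
        is_continuation = not row[key_col].strip()
        if is_continuation and merged:
            if direction == "previous":
                merged[-1] = _merge(merged[-1], row, sep)
            else:
                # The continuation sits above its anchor, so its text goes first.
                merged[-1] = _merge(row, merged[-1], sep)
        else:
            # Anchor row, or a continuation with nothing to attach to.
            merged.append([cell.strip() for cell in row])
    return merged if direction == "previous" else merged[::-1]
```

The rows can come from `csv.reader` over Tabula's CSV output. Which column counts as the key and which direction the text flows has to be decided per PDF, which is the reason this stays a per-document script rather than an automatic heuristic.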
Here's a case that we might want to look into: https://www.dropbox.com/s/0i6ae5kgtcy0frb/s-013163.pdf
It's definitely a "spreadsheet", but the ratio of lines of text to ruling lines is far from the heuristic's defined threshold.