This repository has been archived by the owner on Jan 20, 2021. It is now read-only.

Many lines inside a spreadsheet cell: heuristic fails #57

Open
jazzido opened this issue Jan 13, 2014 · 8 comments

Comments

@jazzido
Contributor

jazzido commented Jan 13, 2014

Here's a case that we might want to look into: https://www.dropbox.com/s/0i6ae5kgtcy0frb/s-013163.pdf

It's definitely a "spreadsheet", but the lines-of-text / ruling-lines ratio is way below/above the heuristic's defined threshold.

@mhkeller

I can't access that PDF, but it might be similar to an issue I'm having.
[Screenshot, 2015-03-11 11:41:56 AM]

For rows like the above where HOOKS & HODGES has a line break, I'm getting one row with a lot of blanks and HODGES in the name cell.

@jazzido
Contributor Author

jazzido commented Mar 11, 2015

Thanks, @mhkeller

Can you share that PDF? I'd love to take a look at it.

@jazzido
Contributor Author

jazzido commented Mar 11, 2015

Also, if you're using Tabula from the command line, you might want to try tabula-java (pure Java, easier to install, ~3x faster). tabula-extractor is going to be deprecated soon.

@mhkeller

Just sent you an extract. Re: tabula-extractor 😓 I just got it running.

@jazzido
Contributor Author

jazzido commented Mar 11, 2015

OK, so the output is expected for that PDF. Without visual separators ("ruling lines") we can't really merge multiline cells.

I usually post-process those cases with a script that "rolls up" multiline cells.
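For illustration, a minimal sketch of what such a "roll up" script could look like. This is not the script mentioned above. The function name `roll_up`, the `min_filled` threshold, and the sample rows are all hypothetical, and the sketch assumes the overflow text lands in a mostly blank row directly below the row it belongs to (as in the HOOKS & HODGES case).

```python
def roll_up(rows, min_filled=2, sep=" "):
    """Merge sparse "continuation" rows into the row above them.

    Assumption: a row with fewer than `min_filled` non-blank cells is the
    overflow of a multiline cell in the previous row. Fully blank rows
    are dropped.
    """
    merged = []
    for row in rows:
        cells = [c.strip() for c in row]
        filled = sum(1 for c in cells if c)
        if merged and 0 < filled < min_filled and len(cells) == len(merged[-1]):
            # Continuation row: append each non-blank cell to the cell above.
            prev = merged[-1]
            for i, c in enumerate(cells):
                if c:
                    prev[i] = (prev[i] + sep + c) if prev[i] else c
        elif filled:
            # Regular row (or a sparse row with nothing above it): keep as-is.
            merged.append(cells)
    return merged


# Made-up sample data, shaped like the extraction output described in this thread.
sample = [
    ["HOOKS &", "123 Main St", "42"],
    ["HODGES", "", ""],
    ["SMITH", "9 Elm Rd", "7"],
]
rolled = roll_up(sample)
```

The threshold is a guess that has to be tuned per document: a table with legitimately sparse rows would be merged incorrectly, so it is worth eyeballing the result.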

@mhkeller

I shall do the same.

@jeremybmerrill
Member

Unfortunately, the number of ways that shitty PDFs exist precludes a heuristic attempt at "rolling up" these sorts of cells automatically. Some flow onto the next line, some flow onto the previous line, some center-align. It's a real mess. I've thought about writing a roller-upper framework, but for some reason concluded it was impossible.
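To make the layout problem concrete, here is a hedged sketch in which the caller states the layout explicitly instead of the code guessing it. The names `roll_up_with_layout`, `is_sparse`, and `join_cells` are hypothetical, as is the `min_filled` threshold. The sketch assumes every row has the same number of columns. It handles only two of the cases mentioned above: fragments that follow their row (`"after"`) and fragments that precede it (`"before"`). Center-aligned cells, with fragments on both sides, are not covered.

```python
def is_sparse(row, min_filled=2):
    """True when a row has some text but fewer than `min_filled` filled cells."""
    filled = sum(1 for c in row if c.strip())
    return 0 < filled < min_filled


def join_cells(upper, lower, sep=" "):
    """Join two rows cell by cell, skipping blank parts."""
    return [sep.join(p for p in (a.strip(), b.strip()) if p)
            for a, b in zip(upper, lower)]


def roll_up_with_layout(rows, layout="after"):
    """Merge fragment rows into their full row.

    layout="after":  fragments appear below the row they belong to.
    layout="before": fragments appear above the row they belong to.
    """
    if layout not in ("after", "before"):
        raise ValueError("layout must be 'after' or 'before'")
    out = []
    pending = None  # fragments waiting for the next full row ("before" only)
    for row in rows:
        if not any(c.strip() for c in row):
            continue  # drop fully blank rows
        if is_sparse(row):
            if layout == "after" and out:
                out[-1] = join_cells(out[-1], row)
            elif layout == "before":
                pending = list(row) if pending is None else join_cells(pending, row)
            else:
                out.append(list(row))  # fragment with nothing above it
        else:
            full = list(row)
            if pending is not None:
                full = join_cells(pending, full)
                pending = None
            out.append(full)
    if pending is not None:
        out.append(pending)  # trailing fragment with no full row after it
    return out
```

The same fragment row is merged into different neighbours depending on `layout`, which is why a single automatic rule is hard to get right across PDFs.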

@mhkeller

Ya it sounds rough. I'll post a link for whatever I end up implementing. The more examples the merrier I guess right?
