Implement line color filter #421

Open: jazzpi wants to merge 3 commits into master

jazzpi commented May 5, 2021

Addresses #21.

I've never worked with PDFBox before, so I hope this is the right approach. It works at least for this file: without the color filter, some of the hyperlink underlines are detected as rulings, which splits those rows. However, it doesn't work for this file, where without the color filter all cells are simply detected as separate. With the color filter, it exports the following CSV:

A,B
4","2
5,6

Is this an issue with the color filter itself, or is it related to the red and black lines crossing?

Other notes:

  • I'm not sure whether this is a good way to pass the line color filter argument to the ObjectExtractorStreamEngine.
  • I haven't added tests yet. I should hopefully have some time next week to debug further and add them.
  • I couldn't come up with a sensible short-style command line option, so I only added a long-style one.
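
For readers unfamiliar with the idea, here is a minimal sketch of filtering line segments by stroke color before they are treated as rulings. It is not the code in this PR and does not use tabula-java or PDFBox types: `ColoredLine`, `closeTo`, and `keptLabels` are hypothetical names, colors are assumed to be RGB components in the range 0 to 1, and the filter is assumed to return true for colors that should be kept. Passing the filter as a `java.util.function.Predicate` is only one possible way to hand it to the code that collects the lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not part of tabula-java: keep only line segments
// whose stroke color passes a caller-supplied filter.
class LineColorFilterSketch {
    // Stand-in for a detected line segment; only a label and an RGB color.
    static final class ColoredLine {
        final String label;
        final float[] rgb; // assumed RGB components in [0, 1]

        ColoredLine(String label, float r, float g, float b) {
            this.label = label;
            this.rgb = new float[] {r, g, b};
        }
    }

    // Builds a filter that accepts colors within `tolerance` of the target.
    static Predicate<float[]> closeTo(float r, float g, float b, float tolerance) {
        return rgb -> Math.abs(rgb[0] - r) <= tolerance
                && Math.abs(rgb[1] - g) <= tolerance
                && Math.abs(rgb[2] - b) <= tolerance;
    }

    // Returns the labels of the lines whose color passes the filter.
    static List<String> keptLabels(List<ColoredLine> lines, Predicate<float[]> colorFilter) {
        List<String> kept = new ArrayList<>();
        for (ColoredLine line : lines) {
            if (colorFilter.test(line.rgb)) {
                kept.add(line.label);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ColoredLine> lines = new ArrayList<>();
        lines.add(new ColoredLine("table border", 0f, 0f, 0f));
        lines.add(new ColoredLine("hyperlink underline", 0f, 0f, 1f));
        // Keep near-black lines only, so the blue underline is dropped.
        System.out.println(keptLabels(lines, closeTo(0f, 0f, 0f, 0.05f)));
    }
}
```

With a near-black filter, the blue underline in this toy input is dropped and only the border survives, which is the effect described above for the hyperlink underlines.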


jazzpi commented May 11, 2021

... so after a couple of hours of debugging I just realized that this happens because the line breaks used by tabula-java are carriage returns instead of line feeds, which means the beginning of the line is overwritten when the output is displayed. The color filter actually works just fine.
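
To illustrate the effect, here is a small sketch of how text containing bare carriage returns can look once displayed. It is a simplified model, not tabula-java code and not an exact terminal emulation: `CarriageReturnDemo` and `render` are hypothetical names, a carriage return is assumed to move the cursor back to column 0 without starting a new line, and a line feed is assumed to start a fresh line at column 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tabula-java: approximates how a
// terminal displays text that contains bare carriage returns.
class CarriageReturnDemo {
    static List<String> render(String text) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int cursor = 0;
        for (char c : text.toCharArray()) {
            if (c == '\r') {
                // Carriage return: back to column 0, same line.
                cursor = 0;
            } else if (c == '\n') {
                // Line feed: finish this line (simplified to also reset the column).
                lines.add(line.toString());
                line.setLength(0);
                cursor = 0;
            } else {
                // Printable character: overwrite in place or append at the end.
                if (cursor < line.length()) {
                    line.setCharAt(cursor, c);
                } else {
                    line.append(c);
                }
                cursor++;
            }
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // The text after '\r' overwrites the start of the same line.
        System.out.println(render("abcdef\rXY")); // prints [XYcdef]
    }
}
```

Viewing the raw bytes of the output (for example in a hex viewer) rather than the displayed text is one way to tell this apart from a genuine extraction error.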

I've added a test as well and think this is ready for review.
