Command line option that returns detected areas #11
Comments
Curious, @soupgrey, would you parse the output to use it programmatically, or just read it yourself to check before running the admittedly time-consuming process? In either case, can you suggest what sample output would look like? E.g., for human-readable output, are you thinking something like this?
I would like to use it for debugging strange tabula output. Sometimes PDF table extraction does not work perfectly (in most cases when dealing with multi-line cells or poor-quality PDF files). I would like to review what TableGuesser treats as a table and, if it is a bad choice, reprocess the PDF with a manually selected area. Then I could see how much I can trust TableGuesser to automate PDF processing :) Preferred output would be CSV format like:
BTW - Great software :)
Thanks. It's definitely the case that table detection is imperfect. We're also definitely working on it, so if you get errors, debug output is helpful. Since this isn't a bug, I'm not going to fix it right now, but this is an easy and doable feature request.
Another debugging option would be useful for me. It would be great if tabula were able to return the location of the extracted row text. Something like:
With this information it would be possible to test the coverage of an area or page extraction, that is, to check how much of the data was recognized and extracted. I don't know how reasonable this sounds to you, but for me it could improve automation and verification.
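To make the coverage idea concrete, here is a minimal sketch in Python. It is not tabula code: the `Box` type, the `contains` and `coverage` helpers, and the (top, left, bottom, right) coordinate layout are all assumptions made for illustration. It counts what fraction of the text lines on a page fall entirely inside the location of some extracted row.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical rectangle in page coordinates (not a tabula type)."""
    top: float
    left: float
    bottom: float
    right: float


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer.top <= inner.top and outer.left <= inner.left
            and outer.bottom >= inner.bottom and outer.right >= inner.right)


def coverage(extracted_rows: List[Box], page_lines: List[Box]) -> float:
    """Fraction of page text lines covered by at least one extracted row."""
    if not page_lines:
        return 0.0
    covered = sum(
        1 for line in page_lines
        if any(contains(row, line) for row in extracted_rows)
    )
    return covered / len(page_lines)


# Two extracted rows, four text lines on the page: two lines are covered.
rows = [Box(10, 10, 20, 90), Box(30, 10, 40, 90)]
lines = [Box(10, 10, 20, 90), Box(31, 12, 39, 80),
         Box(120, 10, 130, 90), Box(35, 10, 45, 90)]
ratio = coverage(rows, lines)
```

A low ratio would flag a page where a large share of the text was missed by the extraction and is worth reviewing by hand.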
@soupgrey, you should check out the debug output in Cell (with the debug level set to SUPERDEBUG): https://github.com/jazzido/tabula-extractor/blob/pre07/lib/tabula/entities/cell.rb. It should do some of what I think you're asking for. This output isn't available for all extractions (just the "spreadsheet" method ones), but eventually it should be. I'd love to hear your feedback.
Feature request.
I'd love to see a command-line option to get information about the rectangles found by TableGuesser: a "dry run mode" to see what portion of the PDF tabula-extractor will be processing.
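As a rough illustration of what such a dry run might print, the Python sketch below formats a list of detected rectangles as CSV. Everything in it is an assumption: the `areas_to_csv` helper is hypothetical, and the column names (page, top, left, bottom, right) are a guess at a convenient layout, not an existing tabula-extractor option or output format.

```python
import csv
import io
from typing import Iterable, Tuple

# Hypothetical record: (page number, top, left, bottom, right).
Area = Tuple[int, float, float, float, float]


def areas_to_csv(areas: Iterable[Area]) -> str:
    """Render detected areas as CSV text, one rectangle per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Header row; the column names are an assumed layout.
    writer.writerow(["page", "top", "left", "bottom", "right"])
    for page, top, left, bottom, right in areas:
        writer.writerow([page, top, left, bottom, right])
    return buf.getvalue()


# Made-up coordinates standing in for what a table detector might report.
report = areas_to_csv([(1, 50.0, 30.5, 400.0, 560.0)])
```

Output in this shape could be checked by eye or fed back in as a manually corrected area for a second run.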