PDF example of truncated highlight #33

Open
tadeoos opened this issue Jul 29, 2020 · 2 comments
Labels
bug, pdfminer (Issue in pdfminer)

Comments


tadeoos commented Jul 29, 2020

Hey all,
Thank you for this fantastic script! It works very well, but I found a PDF (attached) whose highlights are severely truncated. I tweaked the boxhit function to return True if there is any overlap at all, which gave me better results, but the script still does not pick up the last line of each highlight. It looks like the original boxes and the rectangle in the Annotation object really are missing this last line (the annotation's y0 is bigger than the item's)...

Anyway, I can provide more info if you'd like, and I'd very much appreciate any insight into fixing this, although it may be more of a pdfminer issue...

pwc-tax-guide.pdf
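
To illustrate the y0 observation above: in PDF page space, and in pdfminer's layout boxes, the origin is at the bottom-left, so the last line of a highlight has the smallest y values. If the annotation rectangle's y0 is larger than a line's y0, part or all of that line sits below the rectangle. A minimal sketch, assuming boxes are plain (x0, y0, x1, y1) tuples; the name `lines_cut_off_below` and the sample numbers are hypothetical and are not taken from the script or the attached PDF:

```python
from typing import List, Tuple

# (x0, y0, x1, y1) with the origin at the bottom-left of the page
Box = Tuple[float, float, float, float]


def lines_cut_off_below(annot_rect: Box, line_boxes: List[Box]) -> List[Tuple[Box, float]]:
    """Return (line_box, missing_height) for lines extending below the rectangle."""
    _, rect_y0, _, _ = annot_rect
    cut_off = []
    for line in line_boxes:
        _, line_y0, _, line_y1 = line
        if line_y0 < rect_y0:
            # Height of the part of the line that lies under the rectangle's bottom edge
            missing = min(line_y1, rect_y0) - line_y0
            cut_off.append((line, missing))
    return cut_off


# Made-up example: the rectangle stops at y=112, but the last line spans y=100..112
rect = (50, 112, 300, 140)
lines = [(50, 126, 300, 138), (50, 100, 300, 112)]
print(lines_cut_off_below(rect, lines))
```

A line reported with a missing height equal to its full height lies completely outside the rectangle, so no overlap threshold could recover it.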

Owner

0xabu commented Jul 30, 2020

Thanks for the report and sample PDF. I've futzed with the hit detection algorithm quite a few times before, but I haven't had any reports of issues with it for a long time, so I suspect this may be an issue with the PDF annotation software as much as it is with pdfminer. I could consider making the 0.5 constant tunable, but it sounds like that wouldn't have fully solved your issue (?)
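
A tunable version of the hit test discussed here might look like the following. This is a minimal sketch under stated assumptions, not the script's actual code: the names `Box`, `overlap_fraction`, `box_hit`, and the `min_overlap` parameter are hypothetical, with 0.5 standing in for the constant mentioned above and 0 meaning "any overlap at all".

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle; (x0, y0) is bottom-left, (x1, y1) is top-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_fraction(item: Box, box: Box) -> float:
    """Fraction of the item's area that lies inside the box (0.0 to 1.0)."""
    x_overlap = max(0.0, min(item.x1, box.x1) - max(item.x0, box.x0))
    y_overlap = max(0.0, min(item.y1, box.y1) - max(item.y0, box.y0))
    item_area = (item.x1 - item.x0) * (item.y1 - item.y0)
    if item_area <= 0:
        return 0.0  # degenerate items never count as hits
    return (x_overlap * y_overlap) / item_area


def box_hit(item: Box, box: Box, min_overlap: float = 0.5) -> bool:
    """True if at least min_overlap of the item is covered by the box."""
    fraction = overlap_fraction(item, box)
    if min_overlap <= 0:
        # Threshold of zero: accept any non-empty intersection
        return fraction > 0
    return fraction >= min_overlap


char = Box(0, 0, 10, 10)
partly_covered = Box(0, 6, 10, 20)   # covers the top 40% of the character
not_covered = Box(0, 12, 10, 20)     # sits entirely above the character
print(box_hit(char, partly_covered))                 # False at the default 0.5
print(box_hit(char, partly_covered, min_overlap=0))  # True
print(box_hit(char, not_covered, min_overlap=0))     # False
```

Lowering `min_overlap` only helps characters that partly intersect the annotation rectangle. Text lying entirely outside the rectangle is still missed, whatever the threshold.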

Author

tadeoos commented Aug 11, 2020

Thanks, @0xabu, for the quick reply! Even with the 0.5 constant lowered to 0, I'm still missing some of the letters...

I dug a bit deeper and I believe it is a pdfminer.six issue.
I filed (or rather, commented on an existing) issue there. Just in case, I'm leaving a link here: pdfminer/pdfminer.six#281 (comment)

0xabu added the bug and pdfminer (Issue in pdfminer) labels Mar 4, 2021