Feature: Outputting an annotation and the entire sentence where the annotation is located #87

Open
Shadowalker1995 opened this issue Aug 30, 2023 · 5 comments

Comments

@Shadowalker1995

Shadowalker1995 commented Aug 30, 2023

Thank you so much for developing this module, it's fantastic. Would it be possible to output each annotation together with the entire sentence in which it is located?

If so, could you point me to the general approach? Thanks.

@Shadowalker1995 Shadowalker1995 changed the title Feature: Outputting a comment and the entire sentence where the comment is located Feature: Outputting an annotation and the entire sentence where the annotation is located Aug 30, 2023
@0xabu
Owner

0xabu commented Aug 30, 2023

Capturing text before/after an annotation is implemented in the code as "context", but is currently used only for strikeout annotations. My expectation was that anyone adding a comment on a specific sentence would use highlight annotations, where the highlight covers the text you want to include with the annotation.

If you did want to include context with other annotation types, you could probably modify Annotation.wants_context to capture context for those types where you need it, then implement some heuristics for deciding where the sentence boundaries lie -- the current algorithm for this is implemented by trim_context in the markdown printer.

I'm not sure I'd accept such a change in this repo though. It sounds pretty hard to manage -- in particular identifying sentence boundaries reliably is likely to be problematic, so this could easily produce undesired output, and if you want sentences the next user will want paragraphs, etc. I think perhaps you should be willing to do a bit more work when annotating the document in the first place :)
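
As a rough illustration of the sentence-boundary heuristic discussed above, here is a minimal sketch. It is not the `trim_context` code from pdfannots; `enclosing_sentence` and its regular expression are hypothetical, and it assumes you already have the surrounding text as a plain string plus the character offsets of the annotated span.

```python
import re

# Assumption for this sketch: a sentence ends at '.', '!' or '?' followed by
# whitespace. This is a hypothetical helper, not part of pdfannots.
_SENTENCE_END = re.compile(r'[.!?](?=\s)')


def enclosing_sentence(text: str, start: int, end: int) -> str:
    """Return the sentence(s) of `text` that cover the span text[start:end]."""
    begin = 0
    finish = len(text)
    for m in _SENTENCE_END.finditer(text):
        if m.end() <= start:
            # Terminator before the span: the sentence starts after it.
            begin = m.end()
        elif m.end() >= end:
            # First terminator at or after the end of the span: stop here.
            finish = m.end()
            break
    return text[begin:finish].strip()


text = "Intro here. The quick brown fox jumps. Outro."
start = text.index("brown fox")
print(enclosing_sentence(text, start, start + len("brown fox")))
```

A rule this simple splits in the wrong place on abbreviations such as "e.g. " or "Fig. 3", which is exactly the kind of unreliable output described above.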

@Shadowalker1995
Author

Thanks for your quick reply. I understand what you mean and could implement the change as you describe.

As a Ph.D. candidate, my main task involves reading and annotating literature. Your tool has been helpful in exporting my annotations in a specific format, which has significantly aided my work. However, I also use annotations to mark well-used words and phrases, and I would like to export these annotations along with their context (i.e., the whole sentence). This would help me better understand the meaning and usage of each phrase when I review my notes.

I have also checked the pdfminer module. Its documentation says that to extract all of the text, we can do:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

# Walk the layout of each page and print every text container found on it
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

How can I match an annotation to one of these elements? Is this possible?

@0xabu
Owner

0xabu commented Aug 30, 2023

That's basically the problem at the core of pdfannots :) Most page elements have x/y coordinates, and each annotation consists of one or more bounding boxes, so the problem mostly boils down to processing the text elements and then checking for intersections between them and the annotation boxes. However, you can't just use LTTextContainers for this, as those are too large (e.g. entire boxes or lines); rather, you have to look at the characters inside them. The logic for this is in _PDFProcessor.render and its helpers like test_boxes.
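
To make the intersection idea concrete, here is a minimal sketch under stated assumptions. It is not the pdfannots implementation (the actual test in `test_boxes` may differ); `Char`, `boxes_intersect` and `text_under_annotation` are hypothetical names, and boxes are assumed to be `(x0, y0, x1, y1)` tuples in the same coordinate space, as with pdfminer's `bbox`.

```python
from typing import Iterable, NamedTuple, Tuple

# (x0, y0, x1, y1), the same layout as a pdfminer bbox
Box = Tuple[float, float, float, float]


class Char(NamedTuple):
    """Hypothetical stand-in for a single character from the page layout."""
    text: str
    bbox: Box


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two rectangles overlap (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def text_under_annotation(chars: Iterable[Char],
                          annot_boxes: Iterable[Box]) -> str:
    """Join, in reading order, the characters overlapping any annotation box."""
    boxes = list(annot_boxes)
    return ''.join(c.text for c in chars
                   if any(boxes_intersect(c.bbox, b) for b in boxes))


# Three characters laid out left to right on one line.
line = [Char('H', (0, 0, 10, 10)),
        Char('i', (10, 0, 20, 10)),
        Char('!', (20, 0, 30, 10))]
print(text_under_annotation(line, [(5, 0, 18, 10)]))  # → Hi
```

With pdfminer, each `LTChar` found inside a text line exposes `bbox` and `get_text()`, so a real character could be wrapped as `Char(c.get_text(), c.bbox)` before running the same check.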

@Shadowalker1995
Author

I will give it a try. Thank you for your kind guidance.

@Shadowalker1995
Author

I have finished an initial implementation. Here is my repo.
