Feature: Outputting an annotation and the entire sentence where the annotation is located #87

Shadowalker1995 · 2023-08-30T04:24:39Z

Thank you so much for developing this module, it's fantastic. Would it be possible to output each annotation together with the entire sentence in which it is located?

If so, could you point me to the general approach? Thanks.

0xabu · 2023-08-30T10:59:46Z

Capturing text before/after an annotation is implemented in the code as "context", but is currently used only for strikeout annotations. My expectation was that anyone adding a comment on a specific sentence would use highlight annotations, where the highlight covers the text you want to include with the annotation.

If you did want to include context with other annotation types, you could probably modify Annotation.wants_context to capture context for those types where you need it, then implement some heuristics for deciding where the sentence boundaries lie -- the current algorithm for this is implemented by trim_context in the markdown printer.

I'm not sure I'd accept such a change in this repo though. It sounds pretty hard to manage -- in particular identifying sentence boundaries reliably is likely to be problematic, so this could easily produce undesired output, and if you want sentences the next user will want paragraphs, etc. I think perhaps you should be willing to do a bit more work when annotating the document in the first place :)
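
For illustration only, here is a minimal sketch of the kind of sentence-boundary heuristic discussed above. It is not code from pdfannots and does not reproduce trim_context; the function name `expand_to_sentence` and its three string parameters are hypothetical, and it assumes the text before and after the annotated span has already been captured as context.

```python
import re

# Hypothetical helper, NOT part of pdfannots: widen an annotated span to the
# sentence that contains it, given the text captured before and after it.

# A sentence terminator, optional closing quote/bracket, then whitespace.
_BREAK_BEFORE = re.compile(r'[.!?][")\]]*\s+')
# Same terminator, followed by whitespace or the end of the captured text.
_BREAK_AFTER = re.compile(r'[.!?][")\]]*(?=\s|$)')


def expand_to_sentence(before: str, marked: str, after: str) -> str:
    # Start just after the last sentence break in the preceding context.
    start = 0
    for match in _BREAK_BEFORE.finditer(before):
        start = match.end()
    # Stop at the first sentence break in the following context, or keep
    # all of it when no break is found.
    match = _BREAK_AFTER.search(after)
    end = match.end() if match else len(after)
    return before[start:] + marked + after[:end]


sentence = expand_to_sentence(
    "We tried A. In this paper we ", "shed light on", " the problem. Then B."
)
print(sentence)
```

A regex like this splits wrongly on abbreviations such as "e.g." or "Fig. 3", which is exactly the reliability concern raised above; treat it as a starting point rather than a robust solution.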

Shadowalker1995 · 2023-08-30T14:30:29Z

Thanks for your quick reply. I understand what you mean and could implement it as you describe.

As a Ph.D. candidate, my main task involves reading and annotating literature. Your tool has been helpful in exporting my annotations in a specific format, which has significantly aided my work. However, some of my annotations mark well-used words and phrases, and I would like to export these along with their context (i.e., the whole sentence). This would help me better understand the meaning and usage of each phrase when I review my notes.

I have also checked the pdfminer module. Its documentation says that to extract all of the text, we could do:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

How can I use the annot to match an element? Is this possible?

0xabu · 2023-08-30T14:38:11Z

That's basically the problem at the core of pdfannots :) Most page elements have x/y coordinates, and each annotation consists of one or more bounding boxes, so the problem mostly boils down to processing the text elements and then checking for intersections between them and the annotation boxes. However, you can't just use LTTextContainers for this, as those are too large (e.g. entire boxes or lines); instead you have to look at the characters inside them. The logic for this is in _PDFProcessor.render and its helpers like test_boxes.
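
The intersection idea can be sketched roughly as follows. This is a simplified illustration, not the logic in _PDFProcessor.render or test_boxes: the `Box` class and the `overlaps` and `chars_under_annotation` functions are hypothetical names, and characters are modelled as plain (text, box) pairs so the example runs without pdfminer installed. With pdfminer, such pairs could presumably be built from each LTChar's `get_text()` and `bbox`.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    # Two rectangles intersect when they overlap on both axes.
    # Strict comparisons mean boxes that merely touch do not count.
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def chars_under_annotation(
    chars: Iterable[Tuple[str, Box]], annot_boxes: Iterable[Box]
) -> str:
    # Keep, in reading order, every character whose box intersects
    # at least one of the annotation's boxes.
    boxes = list(annot_boxes)
    return "".join(
        text for text, box in chars if any(overlaps(box, b) for b in boxes)
    )


# Three characters on one line; the annotation covers only the first two.
line = [
    ("H", Box(0, 0, 5, 10)),
    ("i", Box(5, 0, 8, 10)),
    ("!", Box(20, 0, 23, 10)),
]
captured = chars_under_annotation(line, [Box(0, 0, 9, 10)])
print(captured)
```

A real implementation would likely need a tolerance or minimum-overlap threshold, since highlight boxes drawn in a PDF viewer rarely align exactly with glyph boxes.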

Shadowalker1995 · 2023-09-01T07:22:19Z

I will give it a try. Thank you for your kind guidance.

Shadowalker1995 · 2023-09-04T15:16:27Z

I have finished an initial implementation. Here is my repo.

Shadowalker1995 changed the title ~~Feature: Outputting a comment and the entire sentence where the comment is located~~ Feature: Outputting an annotation and the entire sentence where the annotation is located Aug 30, 2023

0xabu added the enhancement label Aug 30, 2023
