sonebu/pdf-annotation-extraction

Extracting (some) Annotations from PDFs

A short script that uses the pypdf library to extract a list of annotations from a PDF.

Currently only works for the following types of annotations:

  • Highlighted notes
  • "Caret" annotations (strikethrough, but with a text suggestion)
  • Strikethrough annotations (just says "remove this part")

Ignores hyperlinks in the text, such as those generated for citations in LaTeX-built academic article PDFs.
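As a rough illustration of the filtering described above, here is a minimal sketch and not necessarily how extract_annotations.py does it. It assumes the three supported types correspond to the PDF annotation subtypes /Highlight, /Caret and /StrikeOut, that hyperlinks appear as /Link, and that the reviewer's comment text sits in the /Contents entry. The names SUPPORTED, describe_annotation and iter_annotations are made up for this example.

```python
from typing import Any, Iterator, Mapping, Optional, Tuple

# Assumed mapping from PDF annotation subtype to a human-readable label.
SUPPORTED = {
    "/Highlight": "highlighted note",
    "/Caret": "caret (replace with suggestion)",
    "/StrikeOut": "strikethrough (remove)",
}


def describe_annotation(annot: Mapping[str, Any]) -> Optional[dict]:
    """Return a small record for a supported annotation, else None."""
    subtype = str(annot["/Subtype"]) if "/Subtype" in annot else ""
    if subtype not in SUPPORTED:
        return None  # e.g. "/Link" hyperlinks are skipped here
    # Assumption: the comment text, if any, is stored under /Contents.
    comment = str(annot["/Contents"]) if "/Contents" in annot else ""
    return {"kind": SUPPORTED[subtype], "comment": comment}


def iter_annotations(pdf_path: str) -> Iterator[Tuple[int, dict]]:
    """Yield (page_number, record) pairs for one PDF file."""
    from pypdf import PdfReader  # imported lazily; needs `pip install pypdf`

    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        if "/Annots" not in page:
            continue  # page has no annotations at all
        for ref in page["/Annots"]:
            record = describe_annotation(ref.get_object())
            if record is not None:
                yield page_number, record
```

Keeping the subtype check in a function that only needs a dictionary-like object makes it easy to try out on plain dicts without a PDF at hand.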

Probably useful for grad students who need to make sure they address every comment/annotation their advisor or a reviewer makes on a PDF they shared (e.g., an article or a response letter). I used this simply to generate a list of action items against which I can check my work and tell whether I missed a review point or comment.

Note that the annotations are saved as vector items in the page coordinate frame rather than being attached to specific parts of the text, which is how the author typically remembers them. In other words, you don't get which word was struck out; you only get its location on the page, which is not very useful because it is expressed in page-coordinate units that are not intuitive.
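To make such a location a little easier to read, one option is to convert it into a position relative to the page. The sketch below is only an illustration under stated assumptions: the annotation's /Rect is [x0, y0, x1, y1] in default PDF user space (1/72 inch per unit), and the page origin is at the bottom-left corner with no offset. The function name rect_to_page_position is hypothetical.

```python
from typing import Sequence

POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch
MM_PER_INCH = 25.4


def rect_to_page_position(
    rect: Sequence[float], page_width: float, page_height: float
) -> dict:
    """Describe roughly where an annotation rectangle sits on the page.

    Assumes rect = [x0, y0, x1, y1] with the origin at the bottom-left
    corner of the page, and page sizes given in the same units.
    """
    x0, y0, x1, y1 = (float(v) for v in rect)
    center_x = (x0 + x1) / 2
    center_y = (y0 + y1) / 2
    return {
        # Horizontal position of the center, as a percentage from the left edge.
        "from_left_pct": round(100 * center_x / page_width),
        # Vertical position of the center, measured downward from the top edge.
        "from_top_pct": round(100 * (1 - center_y / page_height)),
        "from_top_mm": round(
            (page_height - center_y) / POINTS_PER_INCH * MM_PER_INCH, 1
        ),
    }
```

With pypdf, the page size could plausibly be taken from page.mediabox.width and page.mediabox.height, though a media box that does not start at (0, 0) would need an extra offset that this sketch ignores.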

Installation

Just install pypdf via pip:

pip install pypdf

Usage

python extract_annotations.py pdffile.pdf > example_output.txt
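Because the command above redirects standard output into a file, the script only needs to print one line per annotation. The following sketch shows one way such output could be turned into a numbered checklist. The (page, kind, comment) tuple layout and the names format_action_items and write_report are assumptions made for this example, not the actual output format of extract_annotations.py.

```python
import sys
from typing import Iterable, List, TextIO, Tuple


def format_action_items(items: Iterable[Tuple[int, str, str]]) -> List[str]:
    """Turn (page, kind, comment) tuples into numbered checklist lines."""
    lines = []
    for number, (page, kind, comment) in enumerate(items, start=1):
        # Strikethroughs often carry no comment text, so use a placeholder.
        text = comment.strip() or "(no comment text)"
        lines.append(f"[ ] {number}. p.{page} {kind}: {text}")
    return lines


def write_report(
    items: Iterable[Tuple[int, str, str]], stream: TextIO = sys.stdout
) -> None:
    """Print the checklist, one item per line, to stdout by default."""
    for line in format_action_items(items):
        print(line, file=stream)
```

Writing to standard output rather than to a fixed file name keeps the script usable both with a shell redirect and in a pipeline.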

References

Inspired by the following discussions:

and the following page from the pypdf documentation:
