Skip to content

A script that extracts slides from lecture video and converts them into a searchable OCRed PDF.

Notifications You must be signed in to change notification settings

johan456789/slide-extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 

Repository files navigation

slide-extractor

A script that extracts slides from lecture video and converts them into a searchable OCRed PDF.

This script extracts different frames from lecture videos in current directory recursively (imagehash, cv2), combine frames into image-only PDFs (img2pdf), OCR the frames and output text-only PDFs (tesserect, ghostscript), and merge text-only and image-only PDFs into high quality searchable lecture slides.

Usage:

Put slide-extractor.py in the video directory, run python slide-extractor.py. The output PDFs will be stored in the same (sub)directories as those videos.

Dependencies

brew install tesseract ghostscript
pip install tqdm pillow imagehash opencv-python PyPDF2 img2pdf

Tested environment: Python 3.7.2, macOS

Homebrew packages: tesserect, ghostscript

Python packages: tqdm, pillow, imagehash, opencv-python, PyPDF2, img2pdf

  • Other possible candidate libraries for this tiny project and why they are not used:

    • imagemagick: convert *.png out.pdf it re-encodes the image. With zip compression (-compress Zip) you can get lossless output, but the file will be larger. img2pdf does not re-encode by default, runs faster, and uses less memory, so img2pdf is used.

    • OCRmyPDF: ocrmypdf in.pdf out-ocr.pdf Tesseract & ghostscript pipeline is actually faster and has better image quality, as it uses the original images in OCRed PDFs (downsides: high I/O, larger output files), so ocrmypdf is not used. If smaller PDF is desired, just do further compression using other software.

      $ time (for i in frame*.png; do tesseract -c textonly_pdf=1 $i $i pdf; done; gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=combine-text.pdf -dBATCH frame*.pdf; python merge.py;)
      
      real	0m35.962s
      user	0m28.935s
      sys	0m1.890s
      
      $ time ocrmypdf in.pdf out-ocr.pdf
      
      real	0m39.866s
      user	1m11.777s
      sys	0m7.876s

Sidenote

This program is intended for use on MOOC videos. For Cousera and edX, you can check out coursera-dl and edx-dl to download videos in batch.

About

A script that extracts slides from lecture video and converts them into a searchable OCRed PDF.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages