slide-extractor

A script that extracts slides from lecture videos and converts them into searchable, OCRed PDFs.

This script recursively scans the current directory for lecture videos, extracts the distinct frames (imagehash, cv2), combines the frames into image-only PDFs (img2pdf), OCRs the frames into text-only PDFs (tesseract, ghostscript), and merges the text-only and image-only PDFs into high-quality searchable lecture slides.
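The frame-selection step can be illustrated with a small, dependency-free sketch. This is not the code in slide-extractor.py: it assumes a simple average hash over grayscale pixel grids as a stand-in for imagehash, and the names `average_hash`, `hamming`, `distinct_frames` and the `threshold` value are hypothetical. The idea is that a frame is kept only when its hash differs enough from the last kept frame, so small noise within one slide does not produce a new page.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the frame's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Count the positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def distinct_frames(frames, threshold=2):
    """Return indices of frames whose hash differs from the last kept frame
    by more than `threshold` bits (threshold is an assumed value)."""
    kept = []
    last_hash = None
    for index, pixels in enumerate(frames):
        current = average_hash(pixels)
        if last_hash is None or hamming(current, last_hash) > threshold:
            kept.append(index)
            last_hash = current
    return kept


if __name__ == "__main__":
    slide_a = [[0, 0, 255, 255], [0, 0, 255, 255]]
    slide_a_noisy = [[0, 5, 250, 255], [3, 0, 255, 251]]  # same slide, slight noise
    slide_b = [[255, 255, 0, 0], [255, 255, 0, 0]]
    print(distinct_frames([slide_a, slide_a_noisy, slide_b, slide_b]))
```

In a real run the pixel grids would come from decoded video frames, and a perceptual hash from imagehash would likely be more robust than this toy average hash.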

Usage:

Put `slide-extractor.py` in the directory that contains the videos, then run `python slide-extractor.py`. Each output PDF is stored in the same directory or subdirectory as its source video.
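As a rough illustration of the recursive lookup and output placement described here, the following sketch walks a directory tree and maps each video to a PDF path next to it. The extension list, the function names, and the rule that the PDF reuses the video's file name are assumptions made for the example, not details taken from the script.

```python
from pathlib import Path

# Assumed set of video extensions; the actual script may accept others.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov"}


def find_videos(root="."):
    """Return all video files under `root`, searching subdirectories too."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS
    )


def output_pdf_path(video):
    """Place the PDF beside the video, reusing its name (an assumed convention)."""
    return Path(video).with_suffix(".pdf")


if __name__ == "__main__":
    for video in find_videos("."):
        print(video, "->", output_pdf_path(video))
```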

Dependencies

brew install tesseract ghostscript

pip install tqdm pillow imagehash opencv-python PyPDF2 img2pdf

Tested environment: Python 3.7.2, macOS

Homebrew packages: tesseract, ghostscript

Python packages: tqdm, pillow, imagehash, opencv-python, PyPDF2, img2pdf

Other possible candidate libraries for this tiny project and why they are not used:
- imagemagick: `convert *.png out.pdf` re-encodes the images. With zip compression (`-compress Zip`) the output is lossless, but the file is larger. img2pdf does not re-encode by default, runs faster, and uses less memory, so img2pdf is used.
- OCRmyPDF: `ocrmypdf in.pdf out-ocr.pdf`. The Tesseract and Ghostscript pipeline was faster in the timing below and gives better image quality because it keeps the original images in the OCRed PDF (downsides: high I/O and larger output files), so ocrmypdf is not used. If a smaller PDF is desired, compress the output further with other software.
```
$ time (for i in frame*.png; do tesseract -c textonly_pdf=1 $i $i pdf; done; gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=combine-text.pdf -dBATCH frame*.pdf; python merge.py;)

real	0m35.962s
user	0m28.935s
sys	0m1.890s

$ time ocrmypdf in.pdf out-ocr.pdf

real	0m39.866s
user	1m11.777s
sys	0m7.876s
```
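The shell loop timed above could be driven from Python along the following lines. This is a sketch rather than the script's actual code: the function names are hypothetical, while the command-line arguments are copied from the benchmark command. It assumes `tesseract` and `gs` are on the PATH, and it does not cover the final merge step (`merge.py`), which is not shown in this README.

```python
import subprocess
from pathlib import Path


def tesseract_command(image):
    """Build the per-frame OCR command: tesseract -c textonly_pdf=1 IMG IMG pdf."""
    return ["tesseract", "-c", "textonly_pdf=1", str(image), str(image), "pdf"]


def ghostscript_command(page_pdfs, output="combine-text.pdf"):
    """Build the command that concatenates the per-frame text-only PDFs."""
    return [
        "gs",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        f"-sOUTPUTFILE={output}",
        "-dBATCH",
        *[str(pdf) for pdf in page_pdfs],
    ]


def ocr_frames(frame_dir="."):
    """OCR every frame*.png in `frame_dir`, then combine the text-only pages."""
    # Lexicographic order, like the shell glob (frame10 sorts before frame2).
    frames = sorted(Path(frame_dir).glob("frame*.png"))
    for frame in frames:
        subprocess.run(tesseract_command(frame), check=True)
    # tesseract appends ".pdf" to the output base, so frame1.png gives frame1.png.pdf
    page_pdfs = [Path(str(frame) + ".pdf") for frame in frames]
    subprocess.run(ghostscript_command(page_pdfs), check=True)


if __name__ == "__main__":
    ocr_frames(".")
```

Passing argument lists to `subprocess.run` instead of a shell string avoids quoting problems with file names that contain spaces.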

Sidenote

This program is intended for use on MOOC videos. For Coursera and edX, you can use coursera-dl and edx-dl to download videos in batch.
