TextIT

Prerequisites

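Install the system and Python dependencies, then download the fastText language identification model (lid.176.bin) into src/textit/processors/lang_id: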

sudo apt install libreoffice
conda install conda-forge::tesseract
conda install conda-forge::ghostscript
pip3 install -r requirements.txt
cd src/textit/processors && mkdir -p lang_id && cd lang_id && touch __init__.py && wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Usage

The following command converts all the files in tests/fixtures into JSON files in extracted_text.

python use_extractor.py tests/fixtures extracted_text/

To write the output files into a two-level directory structure based on each file's hash, add:

--use_hash_directories
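
For readers unfamiliar with hash-based layouts, the idea can be sketched as follows. This is only an illustration, not the code used by use_extractor.py: the function name hashed_output_path, the choice of SHA-256, and the two-characters-per-level split are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def hashed_output_path(content: bytes, output_root: Path) -> Path:
    """Hypothetical helper: map file content to <root>/<aa>/<bb>/<digest>.json.

    Assumes SHA-256 and two hex characters per directory level; the real
    tool may use a different hash or a different split.
    """
    digest = hashlib.sha256(content).hexdigest()
    # First level: characters 0-1 of the digest; second level: characters 2-3.
    return output_root / digest[:2] / digest[2:4] / f"{digest}.json"


if __name__ == "__main__":
    # Show where a file containing b"hello" would land under extracted_text/.
    print(hashed_output_path(b"hello", Path("extracted_text")))
```

Spreading files across nested directories like this keeps any single directory from growing too large when many documents are extracted.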