HTRC-Text-Processing Library

A tool for processing pairtree-format data from the 17 million digitized works at HathiTrust.

About `htrc-text-processing` Library

Detailed Description goes here.

To install,

pip install htrc-text-processing

That's it! This library is written for Python 3.6+. If you are new to Python, note that the command above requires pip to be installed.

Function: get_zips()

A function that finds the zip files at the end of the pairtree, moves them to a new folder, expands them there, and then removes the zips.

Inputs:
1. Path (string) to directory that holds the pairtree.
2. Path (string) to directory that will hold the folders from expanded zips.
```
htrc_text_processing.get_zips('<path to pairtree parent/s>', '<path to output directory>')
```
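To make the steps described above concrete, here is a minimal sketch of the same idea using only the standard library. It is not the library's implementation: the function name `collect_and_expand_zips`, the choice to expand each zip into a folder named after the zip, and the returned list are all assumptions made for illustration.

```python
import shutil
import zipfile
from pathlib import Path


def collect_and_expand_zips(pairtree_root, output_dir):
    """Hypothetical sketch (not the library's code): move every zip found
    under pairtree_root into output_dir, expand it, then delete the zip."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    expanded = []
    # sorted() materialises the matches first, so moving files
    # does not disturb the directory walk.
    for zip_path in sorted(Path(pairtree_root).rglob("*.zip")):
        moved = output / zip_path.name
        shutil.move(str(zip_path), str(moved))
        # Assumption: each zip is expanded into a folder named after it.
        target = output / moved.stem
        with zipfile.ZipFile(str(moved)) as archive:
            archive.extractall(str(target))
        moved.unlink()  # remove the zip once its contents are expanded
        expanded.append(target)
    return expanded
```

The sketch returns the list of expanded folders so that a caller could pass them on to a later step; whether `get_zips()` returns anything is not stated here.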
Function: normalize_txt_file_names()

A function that cleans and normalizes page file names.

Example: turns 39002088672754_000001.txt into 00000001.txt
```
htrc_text_processing.normalize_txt_file_names('txt path or dir to txts') 
```
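The renaming in the example above can be sketched as follows. This is an illustration only, and the rule it applies (keep the digits after the last underscore and zero-pad them to eight characters) is inferred from the single example given; `normalized_page_name` and `rename_pages` are hypothetical names, not part of the library.

```python
import os


def normalized_page_name(file_name, width=8):
    """Hypothetical helper: keep the page sequence after the last
    underscore and zero-pad it to `width` digits."""
    stem, ext = os.path.splitext(file_name)
    sequence = stem.rsplit("_", 1)[-1]
    return sequence.zfill(width) + ext


def rename_pages(directory):
    """Rename every .txt file in `directory` to its normalized name."""
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            new_name = normalized_page_name(name)
            if new_name != name:
                os.rename(os.path.join(directory, name),
                          os.path.join(directory, new_name))


print(normalized_page_name("39002088672754_000001.txt"))  # 00000001.txt
```

Names that are already normalized pass through unchanged, so running the sketch twice on the same directory is harmless.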
Function: clean_vol()

Inputs:
1. List of paths (strings) to directories that hold page files, one directory per volume
2. Path (string) to the output directory where each volume is stored as a single text file after its pages are cleaned and concatenated
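As a rough illustration of the concatenation step for one volume, consider the sketch below. It is not the library's code: the cleaning rules are not documented here, so stripping surrounding whitespace stands in for them, and both the helper name `concatenate_pages` and the `<volume folder name>.txt` output name are assumptions.

```python
import os


def concatenate_pages(page_dir, out_dir):
    """Hypothetical sketch: join a volume's page files, in name order,
    into <out_dir>/<volume folder name>.txt and return that path."""
    os.makedirs(out_dir, exist_ok=True)
    volume_id = os.path.basename(os.path.normpath(page_dir))
    out_path = os.path.join(out_dir, volume_id + ".txt")
    pages = sorted(n for n in os.listdir(page_dir) if n.endswith(".txt"))
    with open(out_path, "w", encoding="utf-8") as out_file:
        for name in pages:
            with open(os.path.join(page_dir, name), encoding="utf-8") as page:
                # Placeholder "cleaning": trim whitespace around each page.
                out_file.write(page.read().strip() + "\n")
    return out_path
```

Sorting by file name gives the correct page order only when the page names are zero-padded to a common width.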
Function: check_vol()

Inputs:
1. List of page directories, one per volume
2. Output directory that holds the cleaned volumes
Output:
1. List of page directories that have not been cleaned yet
```
new_page_directory_list = htrc_text_processing.check_vol(page_directory_list, clean_vol_out_dir)
```
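The kind of filtering this check performs might look like the sketch below. It is an illustration rather than the library's implementation, and it assumes that a cleaned volume is saved as `<volume folder name>.txt` in the output directory; that naming and the function name `pending_volumes` are hypothetical.

```python
import os


def pending_volumes(page_directory_list, clean_vol_out_dir):
    """Hypothetical sketch: keep page directories with no cleaned file yet.
    Assumes a cleaned volume is stored as <volume folder name>.txt."""
    pending = []
    for page_dir in page_directory_list:
        volume_id = os.path.basename(os.path.normpath(page_dir))
        cleaned = os.path.join(clean_vol_out_dir, volume_id + ".txt")
        if not os.path.exists(cleaned):
            pending.append(page_dir)
    return pending
```

A check like this lets an interrupted cleaning run resume without redoing volumes that are already finished.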
