A tool for processing pairtree-format data from the 17 million digitized works at HathiTrust.
To install,
pip install htrc-text-processing
That's it! This library is written for Python 3.6+. If you are new to Python, note that you need pip to run the command above.
-
Function:
get_zips()
A function that finds the zip files at the end of the pairtree, moves them to a new folder and expands them, removing the zips.
Inputs:
- Path (string) to directory that holds the pairtree.
- Path (string) to directory that will hold the folders from expanded zips.
htrc_text_processing.get_zips('<path to pairtree parent/s>', '<path to output directory>')
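To make the described behavior concrete, here is a minimal standard-library sketch of the same idea: walk a directory tree, move each zip to an output folder, expand it, and delete the zip. This is not the library's implementation. The function name `extract_zips_sketch` and the choice to expand each zip into a folder named after the zip file are assumptions made for illustration.

```python
import os
import shutil
import zipfile


def extract_zips_sketch(pairtree_root, out_dir):
    """Illustrative only (hypothetical helper, not part of htrc_text_processing).

    Finds every .zip under pairtree_root, moves it into out_dir,
    expands it into a folder named after the zip, and removes the zip.
    """
    os.makedirs(out_dir, exist_ok=True)
    extracted = []
    for dirpath, _dirnames, filenames in os.walk(pairtree_root):
        for name in filenames:
            if not name.lower().endswith(".zip"):
                continue
            # Move the zip out of the pairtree into the output directory.
            moved = shutil.move(os.path.join(dirpath, name),
                                os.path.join(out_dir, name))
            # Assumption: one output folder per zip, named without ".zip".
            target = os.path.join(out_dir, os.path.splitext(name)[0])
            with zipfile.ZipFile(moved) as zf:
                zf.extractall(target)
            os.remove(moved)
            extracted.append(target)
    return extracted
```

The sketch assumes the output directory is not inside the pairtree being walked.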
-
Function:
normalize_txt_file_names()
A function that cleans and normalizes page file names.
Example: turns
39002088672754_000001.txt
into
00000001.txt
htrc_text_processing.normalize_txt_file_names('txt path or dir to txts')
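The renaming rule in the example above can be sketched with a regular expression. This is only an illustration of the pattern, not the library's code. The helper name `normalize_page_name_sketch` and the `width` parameter are hypothetical, and the 8-digit padding is inferred from the single example shown.

```python
import re


def normalize_page_name_sketch(file_name, width=8):
    """Illustrative only (hypothetical helper).

    Keeps the page number that directly precedes ".txt" and zero-pads it.
    Names that do not end in digits plus ".txt" are returned unchanged.
    """
    match = re.search(r"(\d+)\.txt$", file_name)
    if match is None:
        return file_name
    page_number = int(match.group(1))
    return f"{page_number:0{width}d}.txt"
```

Called on `39002088672754_000001.txt`, the sketch drops the volume prefix and returns `00000001.txt`.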
-
Function:
clean_vol()
Inputs:
- List of paths (strings) to directories that hold page files, one per volume
- Path (string) to output directory where clean single text files will be stored after cleaning and concatenating pages together
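The concatenation half of this step can be sketched as follows. The cleaning that clean_vol() performs is not described here, so the sketch leaves it out and only joins pages in file-name order. The helper name `concatenate_pages_sketch` and the output naming (one `<volume directory name>.txt` per volume) are assumptions, not the library's documented behavior.

```python
import os


def concatenate_pages_sketch(page_dirs, out_dir):
    """Illustrative only (hypothetical helper, no text cleaning applied).

    For each volume directory, joins its .txt pages in sorted order
    and writes one text file per volume into out_dir.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for page_dir in page_dirs:
        # Sorted order works because normalized page names are zero-padded.
        pages = sorted(f for f in os.listdir(page_dir) if f.endswith(".txt"))
        texts = []
        for page in pages:
            with open(os.path.join(page_dir, page), encoding="utf-8") as fh:
                texts.append(fh.read().strip())
        # Assumption: the output file is named after the volume directory.
        vol_name = os.path.basename(os.path.normpath(page_dir))
        out_path = os.path.join(out_dir, vol_name + ".txt")
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write("\n".join(texts))
        written.append(out_path)
    return written
```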
-
Function:
check_vol()
Inputs:
- List of page directories
- Output directory for cleaned volumes
Output:
- List of page directories that have not been cleaned yet
new_page_directory_list = htrc_text_processing.check_vol(page_directory_list, clean_vol_out_dir)
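One way to picture this check is as a filter over the directory list. The sketch below is not the library's implementation. It assumes, purely for illustration, that a cleaned volume is stored as `<volume directory name>.txt` in the output directory, and the helper name `uncleaned_dirs_sketch` is hypothetical.

```python
import os


def uncleaned_dirs_sketch(page_dirs, clean_out_dir):
    """Illustrative only (hypothetical helper).

    Returns the page directories that have no matching cleaned
    text file in clean_out_dir yet.
    """
    remaining = []
    for page_dir in page_dirs:
        vol_name = os.path.basename(os.path.normpath(page_dir))
        # Assumption: cleaned output is named "<volume directory name>.txt".
        cleaned_path = os.path.join(clean_out_dir, vol_name + ".txt")
        if not os.path.isfile(cleaned_path):
            remaining.append(page_dir)
    return remaining
```

A filter like this makes it possible to resume an interrupted run by passing only the returned list back to the cleaning step.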