OCR Converters for XTF Sites (e.g., Campus Publications)

This repository contains a script to convert simple OCR data, in the form of a word list and page coordinates, into the format needed by the Internet Archive Bookreader.

Please note that the script contains hardcoded references to the LDR pair tree and the ark_data.db database. Coordinate with the systems administrators to get access to these locations in the filesystem and adjust the script before proceeding.

To run this script, start by setting up a Python virtual environment. Activate the environment, clone this repo, and install its dependencies:

python3 -m venv venv
source venv/bin/activate
git clone https://github.com/uchicago-library/ocr_converters.git
cd ocr_converters
pip install -r requirements.txt
pip install -r requirements_dev.txt

Then, run the program like this:

python build_ia_bookreader_ocr.py <identifier> <min-year> <max-year> [<shrink_to_height>]

<identifier> is the mvol identifier to produce OCR for, for example, mvol-0001-0002-0003. <min-year> is the earliest year for any item from this item's journal. (This is necessary because each item contains metadata for the entire title.) <max-year> is the latest year for any item from this item's journal. <shrink_to_height> is optional; use it when the JPEG images shown in the Internet Archive Bookreader have been shrunk to a smaller pixel height than the original master file.
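To illustrate why <shrink_to_height> matters: word coordinates measured on the master image have to be scaled by the same factor as the image itself, or the highlights will not line up with the shrunken JPEG. The following is a minimal sketch of that arithmetic only, not the script's actual implementation. The function name scale_box and the (left, top, right, bottom) box layout are assumptions made for illustration.

```python
def scale_box(box, master_height, shrink_to_height):
    """Scale a (left, top, right, bottom) pixel box from the master image
    to an image shrunk to shrink_to_height pixels tall.

    Hypothetical helper: the box layout is an assumption, not taken from
    build_ia_bookreader_ocr.py.
    """
    if master_height <= 0 or shrink_to_height <= 0:
        raise ValueError("heights must be positive")
    # Width and height shrink by the same factor when aspect ratio is kept.
    factor = shrink_to_height / master_height
    return tuple(round(value * factor) for value in box)
```

For example, a word box on a 4000px-tall master shrinks by a factor of 0.25 when the JPEG is 1000px tall.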

The script outputs OCR for the Internet Archive Bookreader that is used in XTF sites like Campus Publications.

XTF File Layout

Get the XTF production and development server names from the systems administrators. XTF uses a data directory: cd into that directory, then cd into bookreader. You'll find a sequence of directories, one for each digital object. Each will be named something like "mvol-0001-0002-0003". This is the internal identifier the Preservation Department uses to track these files.

Inside each directory is a sequence of JPEGs. Each filename is an eight-digit number with leading zeroes:

00000001.jpg
00000002.jpg
00000003.jpg
etc.

These are the page images for this item. To add a new item to XTF, use your favorite utility to convert TIFF files to JPEGs, optionally shrinking them to some smaller height. (If you shrink them you can use the <shrink_to_height> option on build_ia_bookreader_ocr.py above.)
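As a rough sketch of the bookkeeping around that conversion, the helpers below map master TIFF names to the eight-digit JPEG names and compute the dimensions of a shrunken page. They do not convert any images; that is left to whichever utility you use. Both function names are hypothetical, and the sketch assumes that sorting the TIFF filenames gives the correct page order, which may not hold for every deposit.

```python
def plan_page_names(tiff_names):
    """Map TIFF filenames to sequential eight-digit JPEG names.

    Assumption: sorted filename order equals page order.
    """
    return {
        name: f"{number:08d}.jpg"
        for number, name in enumerate(sorted(tiff_names), start=1)
    }


def shrunken_size(width, height, shrink_to_height):
    """Return (width, height) after shrinking to shrink_to_height pixels
    tall while preserving the aspect ratio."""
    if height <= 0 or shrink_to_height <= 0:
        raise ValueError("heights must be positive")
    factor = shrink_to_height / height
    return (round(width * factor), shrink_to_height)
```

The height passed to shrunken_size is the same value you would then pass as <shrink_to_height> to build_ia_bookreader_ocr.py.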

Each directory also contains a thumbnail image, <identifier>.jpg, which is 100px tall, e.g.:

mvol-0001-0002-0003.jpg

Each contains a PDF, with all page images:

mvol-0001-0002-0003.pdf

The OCR produced above is stored at:

mvol-0001-0002-0003.xml

And the text of the document itself, with no OCR information, lives in:

mvol-0001-0002-0003.txt

The entire file layout should look like this:

00000001.jpg
00000002.jpg
00000003.jpg
mvol-0001-0002-0003.jpg
mvol-0001-0002-0003.pdf
mvol-0001-0002-0003.txt
mvol-0001-0002-0003.xml
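Before copying a directory to the XTF servers, it can help to confirm that it matches this layout. The check below is a minimal sketch, not part of this repository: the function name check_layout is hypothetical, and it assumes that page images are numbered contiguously starting at 00000001.jpg.

```python
import re
from pathlib import Path

# Page images are named with exactly eight digits, e.g. 00000001.jpg.
PAGE_RE = re.compile(r"^\d{8}\.jpg$")


def check_layout(item_dir):
    """Return a list of problems found in one item directory (empty if none)."""
    item_dir = Path(item_dir)
    identifier = item_dir.name  # e.g. mvol-0001-0002-0003
    names = {p.name for p in item_dir.iterdir() if p.is_file()}
    problems = []

    # Thumbnail, PDF, plain text, and OCR XML are all named after the identifier.
    for ext in ("jpg", "pdf", "txt", "xml"):
        expected = f"{identifier}.{ext}"
        if expected not in names:
            problems.append(f"missing {expected}")

    pages = sorted(name for name in names if PAGE_RE.match(name))
    if not pages:
        problems.append("no page images found")

    # Assumption: pages run 00000001.jpg, 00000002.jpg, ... with no gaps.
    for number, name in enumerate(pages, start=1):
        expected = f"{number:08d}.jpg"
        if name != expected:
            problems.append(f"expected {expected}, found {name}")
            break
    return problems
```

Calling check_layout("mvol-0001-0002-0003") on a complete directory returns an empty list; otherwise each entry names one missing or misnumbered file.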

Because input data tends to change with each deposit, I write ad-hoc scripts to get data into this format and scp it to the XTF servers.

Re-Indexing the Site

To re-index the site, look in the XTF bin directory. To rebuild the index completely, run:

./textIndexer -clean -index default

Note that this will probably take about half an hour, during which time the site will be unavailable.
