Here you'll find (meta)data extracted from the M&TS A Files corpus along with useful files and scripts for managing that (meta)data.
E.g.,
data/metadata_outputs
- contains JSON metadata documents, one per document image
- the metadata was extracted with a spaCy LLM model
- see: Deployment_Framework and Model_Trainer repos
- the documents reference the original image as served through a IIIF Image API
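For orientation, IIIF Image API request URLs follow the pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. Below is a minimal sketch of building such a URL in Python. The server URL and identifier are hypothetical placeholders, not values from this corpus, and the JSON field that holds each document's image reference is not shown here. The `max` size keyword assumes Image API 2.1 or 3.0.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL from its path segments."""
    # The identifier is percent-encoded in full, so any "/" inside it
    # becomes %2F and cannot be mistaken for a path separator.
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


if __name__ == "__main__":
    # Hypothetical server and identifier, for illustration only.
    print(iiif_image_url("https://iiif.example.org/iiif/3",
                         "hypothetical-afile/page-0001"))
```

Changing `region` or `size` (for example `size="600,"` for a 600-pixel-wide derivative) requests a crop or a scaled version of the same image.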
schemas/document-schema.json
- is the machine-readable JSON schema for the documents above
lib/validate_json_to_schema.py
- a script that checks whether every JSON document in data/metadata_outputs conforms to the schema
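The kind of check described above can be sketched as follows. This is not the contents of lib/validate_json_to_schema.py, only an illustration under stated assumptions: that the third-party `jsonschema` package is installed, that the documents use a `.json` extension, and that they may sit in subdirectories. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_documents(docs_dir):
    """Yield (path, parsed JSON) for every *.json file under docs_dir."""
    for path in sorted(Path(docs_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path, json.load(handle)


def find_violations(schema, documents):
    """Return {path: [error messages]} for documents that fail the schema."""
    # Assumption: the third-party `jsonschema` package is available.
    from jsonschema.validators import validator_for

    validator_cls = validator_for(schema)  # picks the draft from "$schema"
    validator_cls.check_schema(schema)     # fail early on a malformed schema
    validator = validator_cls(schema)

    violations = {}
    for path, document in documents:
        messages = [err.message for err in validator.iter_errors(document)]
        if messages:
            violations[str(path)] = messages
    return violations


def main():
    """Validate the repository's documents; returns a process exit code."""
    schema_text = Path("schemas/document-schema.json").read_text(encoding="utf-8")
    failures = find_violations(json.loads(schema_text),
                               load_documents("data/metadata_outputs"))
    for path, messages in failures.items():
        print(f"{path}: {'; '.join(messages)}")
    return 1 if failures else 0
```

Calling `main()` from the repository root prints one line per failing document and returns a non-zero code if any document fails, which makes the check usable in CI.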
You'll need Python, Git, and Poetry installed to use the data and run the scripts available in /lib. On macOS, we recommend using Homebrew and asdf.
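If you are unsure whether the command-line prerequisites are already present, a quick check along these lines may help. It is a minimal sketch: the executable names `git` and `poetry` are assumptions about how the tools are installed on your system, and `missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil
import sys

REQUIRED_TOOLS = ("git", "poetry")  # assumed executable names


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is running this check.")
    missing = missing_tools()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```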
If you already have Homebrew installed:
brew install coreutils curl git gh
brew install asdf
Then follow the instructions for your system to add asdf to your shell's PATH. If you're using ZSH, for example, you'll run:
echo -e "\n. $(brew --prefix asdf)/libexec/asdf.sh" >> ${ZDOTDIR:-~}/.zshrc
source ~/.zshrc
asdf plugin-add python
asdf plugin-add direnv
asdf direnv setup --shell zsh --version latest # for ZSH; use --shell bash for Bash
brew install pipx
pipx ensurepath
source ~/.zshrc # for ZSH; use ~/.bashrc for Bash
pipx install poetry
gh repo clone Migrants-and-The-State/extracted-data && cd extracted-data
asdf install python
poetry install
poetry run python lib/validate_json_to_schema.py