Scripts and programs for downloading the metadata and text for the History Lab-Muckrock COVID-19 Collection.
Requires Python version 3.9 or higher.
- Ensure the underlying C libraries for OCRmyPDF and pdftotext are installed. See these pages for OS-specific guidance:
- Clone this repo
- Install Python requirements:
pip install -r requirements.txt
python -m spacy download en_core_web_lg
- Set the environment variables MR_USER and MR_PSWD to the credentials of a DocumentCloud account with access to the History Lab COVID-19 Archive project. Set the PG* environment variables to the connection settings for the PostgreSQL FOIArchive database.
export MR_USER=<muckrock_username>
export MR_PSWD=<muckrock_password>
export PGUSER=<postgres_username>
export PGPASSWORD=<postgres_password>
export PGHOST=<postgres_host>
export AWS_ACCESS_KEY_ID=<aws_access_key_id>
export AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>
- Run the programs in the pipeline:
python metadata_download.py
python text_download.py
python pii_detect.py
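Before running the pipeline, it can help to confirm that the variables from the export step are actually set. The following is a minimal sketch, not part of this repo: the `missing_vars` helper and the `REQUIRED_VARS` list are hypothetical names, and the list simply mirrors the export lines above.

```python
import os

# Names copied from the export lines in this README. Other PG* variables
# (for example a database name or port) are not listed there, so they are
# deliberately left out of this check.
REQUIRED_VARS = [
    "MR_USER",
    "MR_PSWD",
    "PGUSER",
    "PGPASSWORD",
    "PGHOST",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_vars(environ=None):
    """Return the names in REQUIRED_VARS that are unset or empty.

    `environ` defaults to the process environment; passing a plain dict
    makes the helper easy to exercise without touching real credentials.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_vars()` in a Python shell returns an empty list when everything is set, and otherwise returns the names that still need to be exported.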
- The metadata program downloads data to a CSV file. SQL scripts are then used to load the data.
- The text program downloads data directly into a database table.
- The PII detection program reviews the text and stores any PII it finds in a database table.
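The three programs can also be chained from a small wrapper that stops at the first failure. This is a sketch under assumptions rather than something shipped with the repo: `run_pipeline` and `PIPELINE` are hypothetical names, the scripts are assumed to be in the current directory, and the SQL load step for the metadata CSV is not included because this README does not name those scripts.

```python
import subprocess
import sys

# Script names and order as listed in this README.
PIPELINE = ["metadata_download.py", "text_download.py", "pii_detect.py"]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script in order with the current Python interpreter.

    Stops at the first script that exits with a non-zero code and returns
    the list of scripts that completed successfully. `runner` is
    injectable so the function can be tried without running the real
    downloads.
    """
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(
                f"{script} failed with exit code {result.returncode}",
                file=sys.stderr,
            )
            break
        completed.append(script)
    return completed
```

If the metadata has to be loaded into the database before the text download can run, the scripts would need to be invoked separately, with the SQL load in between.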