Ingestion Checklist

xinru1414 edited this page Aug 5, 2021 · 15 revisions

Instructions

The following checklist should be used when ingesting new volumes. If you have been ingesting for some time, you may be tempted to skip some of these steps. Don't! I suggest copying this file to one named "checklist.md" and placing it in the ingestion directory. This way, you leave a record that you have gone through these steps.

Steps

  • Clone and build the ACL Anthology GitHub repo, and set up a data directory, e.g. $DATA
  • Download the ingestion data zip file DATA.zip to $DATA
  • Unpack DATA.zip in $DATA, create a date-venue folder in the Dropbox ingest dir, and upload the files
  • Check out a new branch in the ACL Anthology repo: git checkout -b YOUR_BRANCH_NAME
  • Run the ingestion command: python bin/ingest.py --ingest-date 2020-04-19 PATH/TO/DATA/data/*/proceedings
  • Run git diff data/yaml/venues.yaml to check the changes to venues.yaml. Specifically, remove ordinal numbers from venue names, e.g. The First, 32nd
  • Update data/yaml/joint.yaml when needed
  • The information needed for joint.yaml can be found in the newly generated .xml files. Normally, tutorials, SRW, etc. are included automatically because they share the same collection ID (e.g. 2021-eacl); workshops that have different collection IDs are not included automatically
  • Make sure to update collections-volume IDs, not just the collection IDs
  • Check meta files in $DATA and modify data/yaml/sigs/sig files when needed
  • Check all newly generated .xml files
  • Check that editor names are split correctly, and spot-check a few authors
  • By convention, the volume name should usually be "1" if there is just a single volume. If there are other volumes, they can use names
  • The volume name is determined by data in the file proceedings/meta, so you could also look at that file ahead of time
  • Make sure the location, year, etc. are reasonable
  • Run make check and make sure all tests pass
  • Run git add ABOVE_NEWLY_GENERATED_FILES to stage the newly generated files
  • Run git commit -m "YOUR_MESSAGE" to commit your changes
  • Run git push origin YOUR_BRANCH_NAME to push your changes
  • On GitHub, open a new pull request, assign reviewers to acl-org/anthology, and choose ingestion under labels
  • Under dir ~/anthology-files, upload all generated attachments and PDFs by running e.g. rsync -ave ssh anth:anthology-files
  • Clean out the ~/anthology-files directory
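
The venue-name cleanup in the venues.yaml step above can be illustrated with a small Python sketch. This is not part of the Anthology tooling: the helper name `strip_venue_ordinal`, the list of spelled-out ordinals, and the sample venue names are all hypothetical, and the step in the checklist is done by hand while reviewing the diff. The sketch only shows the kind of leading ordinal ("The First", "32nd") that should not end up in a venue name.

```python
import re

# Spelled-out ordinals to strip; an illustrative, deliberately incomplete list.
_SPELLED = "first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth"

# Optional leading "The", then a numeric ordinal (32nd, 4th) or a spelled-out one,
# followed by whitespace. Anchored at the start of the name.
_ORDINAL = re.compile(
    rf"^(?:the\s+)?(?:\d+(?:st|nd|rd|th)|{_SPELLED})\s+",
    flags=re.IGNORECASE,
)


def strip_venue_ordinal(name: str) -> str:
    """Drop a leading ordinal such as 'The First' or '32nd' from a venue name.

    Hypothetical helper for illustration; names without a leading ordinal
    are returned unchanged (apart from surrounding whitespace).
    """
    return _ORDINAL.sub("", name.strip())


if __name__ == "__main__":
    # Made-up venue names, used only to show the behaviour.
    for raw in (
        "The First Workshop on Example Topics",
        "32nd Annual Meeting on Example Topics",
        "Workshop on Example Topics",
    ):
        print(f"{raw!r} -> {strip_venue_ordinal(raw)!r}")
```

A pattern like this only catches ordinals at the start of a name, so entries in the git diff output still need to be read individually.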

CL and TACL ingestion

Ingestion for CL and TACL follows a different set of steps:

  • Connect to MIT Press
  • Download all new files
  • Ingest with ingest_mitpress.py