Ingestion Checklist
xinru1414 edited this page Aug 5, 2021 · 15 revisions
The following is a checklist that should be used when ingesting new volumes. If you have been ingesting for some time, you may be tempted to skip some of these steps. Don't! I suggest copying this file to one named "checklist.md" and placing it in the ingestion directory. This way, you can verify to posterity that you have gone through these steps.
- Clone and build the ACL Anthology GitHub repo, and set up a data directory, e.g. $DATA
- Download the ingestion data zip file DATA.zip to $DATA
- Unpack DATA.zip in $DATA, create a date-venue folder in the Dropbox ingest directory, and upload the files
- Check out a new branch under the ACL Anthology repo:
  git checkout -b YOUR_BRANCH_NAME
- Run the ingestion command:
  python bin/ingest.py --ingest-date 2020-04-19 PATH/TO/DATA/data/*/proceedings
- Check the file venues.yaml by running
  git diff data/yaml/venues.yaml
  Specifically, remove all ordinal numbers from venue names, e.g. "The First", "32nd"
- Update data/yaml/joint.yaml when needed
  - Such information can be found in the newly generated .xml files. Normally, tutorials, SRW, etc. are included automatically because they share the same collection ID, e.g. 2021-eacl; what is not included are the workshops that have different collection IDs
  - Make sure to update the collection-volume IDs, not just the collection IDs
- Check the meta files in $DATA and modify the data/yaml/sigs/sig files when needed
- Check all newly generated .xml files
  - Check that editor names are split correctly, and spot-check a few authors
  - The volume name should usually be "1" if there is just a single volume; that is the convention. If there are other volumes, they can use names
  - The volume name is determined by data in the file proceedings/meta, so you could also look at that ahead of time
  - Make sure the location, year, etc. are reasonable
- Run the checks and make sure all tests pass:
  make check
- Stage the newly generated files:
  git add ABOVE_NEWLY_GENERATED_FILES
- Commit your changes:
  git commit -m "YOUR_MESSAGE"
- Push your changes:
  git push origin YOUR_BRANCH_NAME
- On GitHub, open a new pull request, assign reviewers to acl-org/anthology, and choose ingestion under labels
- Under the directory ~/anthology-files, upload all generated attachments and PDFs by running e.g.
  rsync -ave ssh anth:anthology-files
- Clean out the directory ~/anthology-files
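
The venue-name cleanup in the venues.yaml step above could be partly automated. The following is only a minimal sketch, assuming the venue names are available as plain strings. The function name `strip_ordinals` and the list of spelled-out ordinals are hypothetical and not part of the Anthology tooling, so every suggested change should still be reviewed by hand.

```python
import re

# Spelled-out ordinals to strip. An illustrative, hypothetical list;
# extend it as needed.
ORDINAL_WORDS = (
    "first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth"
)

# Matches an optional leading "The", then either a numeric ordinal such as
# "32nd" (any of the suffixes st/nd/rd/th after any digits, so malformed
# forms like "32th" are caught too) or a spelled-out ordinal, followed by
# any trailing whitespace.
ORDINAL_RE = re.compile(
    rf"\b(?:the\s+)?(?:\d+(?:st|nd|rd|th)|(?:{ORDINAL_WORDS}))\b\s*",
    flags=re.IGNORECASE,
)


def strip_ordinals(name: str) -> str:
    """Return the venue name with ordinal numbers removed."""
    cleaned = ORDINAL_RE.sub("", name)
    # Collapse any doubled spaces left behind and trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Because the pattern is purely lexical, it also strips ordinals that are a legitimate part of a name (for example the "Second" in "Second Language Acquisition"), which is why the output is best treated as a list of candidates to confirm in the git diff rather than as a final edit.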
There are several different steps for CL and TACL ingestion:
- Connect to MIT Press
- Download all new files
- Ingest with ingest_mitpress.py