Ingestion Checklist
Matt Post edited this page Jan 1, 2022 · 15 revisions
The following checklist should be used when ingesting new volumes. If you have been ingesting for some time, you may be tempted to skip some of these steps. Don't! I suggest copying this file to one named "checklist.md" and placing it in the ingestion directory. This way, you can verify to posterity that you have gone through these steps.
- Ensure that there are editors listed in the `<meta>` block
- Update `joint.yaml` if there are workshops that need to be listed together with the parent event
- Add events to their relevant SIGs
- Look at the venue listing for prior years, and ensure that the new volume titles are consistent
- Navigate to the event page preview (e.g., https://preview.aclanthology.org/icnlsp-ingestion/events/icnlsp-2021/), and page through, to see if there are any glaring mistakes
- Download 3–5 PDFs (including the first and last one) and make sure they match the page
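The first item in the checklist above can be partly automated. The following is a minimal sketch, assuming each volume in the generated XML is a `<volume>` element containing a `<meta>` block with one `<editor>` child per editor. The function name `volumes_missing_editors` and the sample XML are hypothetical and are not part of the Anthology tooling.

```python
import xml.etree.ElementTree as ET
from typing import List


def volumes_missing_editors(xml_text: str) -> List[str]:
    """Return the ids of <volume> elements whose <meta> block lists no <editor>."""
    root = ET.fromstring(xml_text)
    missing = []
    for volume in root.iter("volume"):
        meta = volume.find("meta")
        # A volume with no <meta> block at all is also reported.
        if meta is None or meta.find("editor") is None:
            missing.append(volume.get("id", "?"))
    return missing


# Hypothetical sample data, for illustration only.
SAMPLE = """
<collection id="2021.example">
  <volume id="1">
    <meta>
      <booktitle>Proceedings of an Example Conference</booktitle>
      <editor><first>Ada</first><last>Lovelace</last></editor>
    </meta>
  </volume>
  <volume id="2">
    <meta>
      <booktitle>Proceedings of an Example Workshop</booktitle>
    </meta>
  </volume>
</collection>
"""

print(volumes_missing_editors(SAMPLE))  # ['2']
```

An empty result only means that every volume lists at least one editor; whether the names are correct and correctly split still needs a manual look.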
This section contains technical details for the ingestion process. The checklist above should be used after this is done.
- Clone and build the ACL Anthology GitHub repo, and choose a data directory, e.g. `$DATA`
- Download the ingestion data zip file `DATA.zip` to `$DATA`
- Unpack `DATA.zip` in `$DATA`, create a date-venue folder in the Dropbox ingest directory, and upload the files
- Check out a new branch in the ACL Anthology repo: `git checkout -b YOUR_BRANCH_NAME`
- Run the ingestion command: `python bin/ingest.py --ingest-date 2020-04-19 PATH/TO/DATA/data/*/proceedings`
- Run `python bin/write_bibkeys_to_xml.py -c` to back-ingest bibkeys for the newly generated XML file
- Run `git diff data/yaml/venues.yaml` to check the file `venues.yaml`. Specifically, remove all ordinal numbers from venue names, e.g. "The First", "32th"
- Update `data/yaml/joint.yaml` when needed
  - Such information can be found in the newly generated .xml files. Normally, tutorials, SRW, etc. are included automatically because they share the same collection ID, e.g. 2021-eacl; what is not included are the workshops that have different collection IDs
  - Make sure to update collection-volume IDs, not just the collection IDs
- Check the meta files in `$DATA` and modify the `data/yaml/sigs/sig` files when needed
- Check all newly generated .xml files
  - Check that editor names are split correctly, and spot-check a few authors
  - The volume name should usually be "1" if there is just a single volume. That's the convention. If there are other volumes, then they can use names
  - The volume name is determined by data in the file `proceedings/meta`, so you could also look at that ahead of time
  - Make sure location, year, etc. are reasonable
- Run `make check` and make sure all tests pass
- Run `git add ABOVE_NEWLY_GENERATED_FILES`
- Run `git commit -m "YOUR_MESSAGE"` to commit your changes
- Run `git push origin YOUR_BRANCH_NAME` to push your changes
- On GitHub, open a new pull request, set the reviewers to `acl-org/anthology`, and choose `ingestion` under `labels`
- Under the directory `~/anthology-files`, upload all generated attachments and PDFs by running e.g. `rsync -ave ssh pdf anth:anthology-files`
- Clean out the directory `~/anthology-files`
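The `venues.yaml` step above asks for ordinal numbers such as "The First" or "32th" to be removed from venue names. The following is a minimal sketch of that cleanup, assuming the ordinal appears at the start of the name and, when spelled out, is no higher than "twentieth". The helper `strip_leading_ordinal` is hypothetical and is not part of the Anthology scripts; it is only meant to help spot names that need editing by hand.

```python
import re

# Spelled-out ordinals handled by this sketch (an arbitrary cut-off at "twentieth").
_ORDINAL_WORDS = (
    "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth", "twentieth",
)

# Optional leading "The", then either digits plus a suffix ("32th", "2nd")
# or a spelled-out ordinal, followed by whitespace.
_ORDINAL_RE = re.compile(
    r"^\s*(?:the\s+)?(?:\d+(?:st|nd|rd|th)|" + "|".join(_ORDINAL_WORDS) + r")\s+",
    re.IGNORECASE,
)


def strip_leading_ordinal(name: str) -> str:
    """Remove a leading ordinal (and the "The" before it) from a venue name."""
    return _ORDINAL_RE.sub("", name, count=1).strip()


# Hypothetical venue names, for illustration only.
print(strip_leading_ordinal("The First Workshop on Examples"))  # Workshop on Examples
print(strip_leading_ordinal("32th Conference on Examples"))     # Conference on Examples
print(strip_leading_ordinal("Workshop on Examples"))            # Workshop on Examples
```

Names without a leading ordinal are returned unchanged, so the function can be applied to every new entry and the results compared against the `git diff` output.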
There are several different steps for CL and TACL ingestion:
- Connect to MIT Press
- Download all new files
- Ingest with `ingest_mitpress.py`