Ingestion Checklist
Matt Post edited this page Jan 14, 2024 · 15 revisions
The following checklist should be used when ingesting new volumes. If you have been ingesting for some time, you may be tempted to skip some of these steps. Don't! I suggest copying this file to one named "checklist.md" and placing it in the ingestion directory. This way, you have a record for posterity that you went through these steps.
- Make sure the branch is merged with the latest `master` branch
- Ensure that there are editors listed in the `<meta>` block
- If it's a workshop, add a `<venue>ws</venue>` tag
- Add events to their relevant SIGs
- Look at the venue listing for prior years, and ensure that the new volume titles are consistent. You can do this by clicking on the venue name from a paper page, which will take you to the venue listing.
- Navigate to the event page preview (e.g., https://preview.aclanthology.org/icnlsp-ingestion/events/icnlsp-2021/), and page through to see if there are any glaring mistakes
- Skim through the complete listing, looking for mis-parsed author names.
- Download the frontmatter and verify that the table of contents matches at least three randomly-selected papers
- Download 3–5 PDFs (including the first and last one) and make sure they are correct (title, authors, page numbers).
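The editor check in the list above can be partly automated. The following is a minimal sketch, not part of the Anthology tooling: the function name `volumes_missing_editors` and the sample XML are hypothetical, and the sketch assumes each `<volume>` element carries an `id` attribute and a `<meta>` child that lists editors as `<editor>` elements.

```python
import xml.etree.ElementTree as ET


def volumes_missing_editors(xml_text):
    """Return the ids of <volume> elements whose <meta> block has no <editor>."""
    root = ET.fromstring(xml_text)
    missing = []
    for volume in root.iter("volume"):
        meta = volume.find("meta")
        # A volume is flagged if it has no <meta> at all or no <editor> inside it.
        if meta is None or meta.find("editor") is None:
            missing.append(volume.get("id"))
    return missing


# Hypothetical sample data, for illustration only.
SAMPLE = """<collection id="2021.example">
  <volume id="1">
    <meta>
      <booktitle>Proceedings of an Example Workshop</booktitle>
      <editor><first>Ada</first><last>Lovelace</last></editor>
      <venue>ws</venue>
    </meta>
  </volume>
  <volume id="2">
    <meta>
      <booktitle>Proceedings of Another Example Workshop</booktitle>
    </meta>
  </volume>
</collection>"""

print(volumes_missing_editors(SAMPLE))
```

On the sample data this flags only volume `2`, which has no `<editor>` in its `<meta>` block. A check like this complements, but does not replace, reading the XML by eye.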
This section contains technical details for the ingestion process. The checklist above should be used after this is done.
- Clone and build the ACL Anthology GitHub repo, and set up a data directory, e.g. `$DATA`
- Download the ingestion data zip file `DATA.zip` to `$DATA`
- Unpack `DATA.zip` in `$DATA`, create a date-venue folder in the Dropbox ingest directory, and upload the files
- Check out a new branch in the ACL Anthology repo: `git checkout -b YOUR_BRANCH_NAME`
- Run the ingestion command `python bin/ingest.py --ingest-date 2020-04-19 PATH/TO/DATA/data/*/proceedings`
- Run `python bin/write_bibkeys_to_xml.py -c` to back-ingest bibkeys for the newly generated XML file
- Run `git diff data/yaml/venues.yaml` to check the file `venues.yaml`. Specifically, remove all ordinal numbers from venue names, e.g. "The First", "32nd"
- Update `data/yaml/joint.yaml` when needed
  - Such information can be found in the newly generated .xml files. Normally, tutorials, SRW, etc. are included automatically because they share the same collection ID (e.g. 2021-eacl); what isn't included are the workshops that have different collection IDs
  - Make sure to update collection-volume IDs, not just the collection IDs
- Check the meta files in `$DATA` and modify the `data/yaml/sigs/sig` files when needed
- Check all newly generated .xml files
  - Check that editor names are split correctly, and spot-check a few authors
  - By convention, the volume name should usually be "1" if there is just a single volume. If there are other volumes, they can use names
  - The volume name is determined by data in the file `proceedings/meta`, so you could also look at that ahead of time
  - Make sure location, year, etc. are reasonable
- Run `make check` and make sure all tests pass
- Run `git add ABOVE_NEWLY_GENERATED_FILES`
- Run `git commit -m "YOUR_MESSAGE"` to commit your changes
- Run `git push origin YOUR_BRANCH_NAME` to push your changes
- On GitHub, open a new pull request, assign reviewers to `acl-org/anthology`, and choose `ingestion` under `labels`
- Under the directory `~/anthology-files`, upload all generated attachments and PDFs by running e.g. `rsync -ave ssh pdf anth:anthology-files`
- Clean out the directory `~/anthology-files`
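When reviewing `venues.yaml` for ordinal numbers in venue names, a small helper can flag candidates for manual removal. This is a rough sketch under stated assumptions: the function `find_ordinals` and the word list are hypothetical, not part of the Anthology scripts, and the pattern only covers digit ordinals plus spelled-out ordinals up to "tenth".

```python
import re

# Matches digit ordinals ("32nd", "32th") and a few spelled-out ones ("First").
# The word list is an illustrative assumption and is not exhaustive.
ORDINAL_PATTERN = re.compile(
    r"\b(?:\d+(?:st|nd|rd|th)"
    r"|first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth)\b",
    re.IGNORECASE,
)


def find_ordinals(venue_name):
    """Return every ordinal-looking token found in a venue name."""
    return ORDINAL_PATTERN.findall(venue_name)


# Hypothetical venue names, for illustration only.
for name in ["The First Workshop on Examples", "32nd Annual Example Meeting"]:
    print(name, "->", find_ordinals(name))
```

The helper only reports matches and leaves the editing to you, since some venue names may legitimately contain words such as "Second" (for example, "Second Language Acquisition").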
There are several different steps for CL and TACL ingestion:
- Connect to MIT press
- Download all new files
- Ingest with `ingest_mitpress.py`