This extractor pulls out the contents of `<titleproper>`, `<scopecontent>` (all paragraphs), and the children of `<origination>`, and puts them into a CSV under `dc:title`, `dc:description`, and `dc:creator`, respectively. Multiple values in a field are separated by `|`, and multiple paragraphs in `<scopecontent>` are separated by Unicode line breaks. Quotation marks are replaced by Unicode quotation marks so that each field can safely be wrapped in quotation marks. Much of this handling reflects the specific needs of CurateND's batch ingester and the characters found in Notre Dame's finding aids.
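
The extraction described above can be sketched roughly as follows. This is a minimal illustration, not the code in `process.py`: it assumes Python 3, EAD files without an XML namespace, U+2028 as the "Unicode line break", and curly quotes as the replacement quotation marks. The sample document and the helper names (`clean`, `text_of`, `extract`, `to_csv`) are made up for the example.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Made-up sample; real finding aids are far larger and may use a namespace.
SAMPLE_EAD = """<ead>
  <titleproper>Papers of "Jane" Doe</titleproper>
  <origination><persname>Doe, Jane</persname><corpname>Example Co.</corpname></origination>
  <scopecontent><p>First paragraph.</p><p>Second paragraph.</p></scopecontent>
</ead>"""

LINE_BREAK = "\u2028"  # assumption: Unicode LINE SEPARATOR between paragraphs
FIELDS = ["dc:title", "dc:description", "dc:creator"]


def clean(text):
    """Collapse whitespace and replace straight quotes with curly ones."""
    text = " ".join((text or "").split())
    text = re.sub(r'"([^"]*)"', "\u201c\\1\u201d", text)  # paired quotes
    return text.replace('"', "\u201d")  # any unpaired quote left over


def text_of(elem):
    """All text inside an element, including text of nested tags."""
    return clean("".join(elem.itertext()))


def extract(root):
    title = root.find(".//titleproper")
    creators = [text_of(c) for o in root.iter("origination") for c in o]
    paragraphs = [text_of(p) for s in root.iter("scopecontent") for p in s.iter("p")]
    return {
        "dc:title": text_of(title) if title is not None else "",
        "dc:description": LINE_BREAK.join(paragraphs),
        "dc:creator": "|".join(creators),
    }


def to_csv(rows):
    """Write rows with every field wrapped in quotation marks."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


row = extract(ET.fromstring(SAMPLE_EAD))
csv_text = to_csv([row])
```

Because straight quotation marks never survive `clean`, wrapping every field in quotation marks cannot collide with the field contents.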
The extractor sets the filename, minus ".xml", as `dc:identifier`, which is used for internal purposes. Similarly, it creates a link to the Archives' website as `dc:source`.
The extractor adds hardcoded fields for `type`, `owner`, and `access`, and the filename as `files`, all of which are specific to CurateND's batch ingester.
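
As a rough sketch of how the identifier, source link, and hardcoded fields might be assembled for one file: the base URL and the hardcoded values below are placeholders invented for illustration, and the function name `admin_fields` is hypothetical. The real values are set in `createCSV` in `process.py`.

```python
import os

# Placeholder values, NOT the ones used by CurateND or the Archives.
ARCHIVES_BASE_URL = "https://archives.example.edu/findaids/"
HARDCODED = {"type": "ExampleType", "owner": "example_owner", "access": "example_access"}


def admin_fields(xml_path):
    """Build the non-descriptive columns for one finding aid file."""
    filename = os.path.basename(xml_path)
    # The identifier is the filename minus its ".xml" extension.
    if filename.lower().endswith(".xml"):
        identifier = filename[:-4]
    else:
        identifier = filename
    fields = dict(HARDCODED)  # copy so the shared defaults stay untouched
    fields["dc:identifier"] = identifier
    fields["dc:source"] = ARCHIVES_BASE_URL + identifier  # assumed link pattern
    fields["files"] = filename
    return fields


example = admin_fields("batch/ABC123.xml")
```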
- Open `fa-new.sh`. Make sure the path to the batch is correct; it appears on many lines.
- Open `process.py` and ensure `directory` on line 86 is the correct path to the batch.
- Edit variable `directory` (line 86) or turn it into a `raw_input` string and add to the end.
- Edit appropriate lines in `createCSV` (line 125). Lines which should be considered have comments explaining the internal uses.
- Make any decisions in line 25 re: the desired separator between `<part>` elements.
- Run `python process.py`.
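
To make the `<part>` separator decision concrete, here is a small sketch of joining the `<part>` children of a name element with a configurable separator. It assumes Python 3 and un-namespaced XML; `PART_SEPARATOR` and `join_parts` are names invented for this example and may not match what line 25 of `process.py` actually does.

```python
import xml.etree.ElementTree as ET

PART_SEPARATOR = ", "  # the decision to make; any string works here


def join_parts(name_elem, separator=PART_SEPARATOR):
    """Join the text of all <part> elements; fall back to the element's own text."""
    parts = [" ".join("".join(p.itertext()).split()) for p in name_elem.iter("part")]
    parts = [p for p in parts if p]  # drop empty <part> elements
    if parts:
        return separator.join(parts)
    return " ".join("".join(name_elem.itertext()).split())


persname = ET.fromstring(
    "<persname><part>Doe, Jane</part><part>1900-1980</part></persname>"
)
joined = join_parts(persname)
joined_dashes = join_parts(persname, " -- ")
```

Keep in mind that `|` already separates multiple creators, so a `<part>` separator containing `|` would make the two indistinguishable in the CSV.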
To do: make the path to the batch ingest something that is set once in `fa-new.sh` and then passed to `process-new.py`.
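
One possible shape for this, sketched under assumptions: the shell script passes the path as the first command-line argument, and the Python side falls back to a prompt when no argument is given. The function name `resolve_directory` is hypothetical, and the sketch uses Python 3's `input` where Python 2 would use `raw_input`.

```python
import sys


def resolve_directory(argv, ask=input):
    """Return the batch directory from argv[1], or prompt for it if absent."""
    if len(argv) > 1 and argv[1].strip():
        return argv[1].strip()
    return ask("Path to batch directory: ").strip()


# In the script itself this would be called as resolve_directory(sys.argv).
# An explicit list is used here so the example runs without prompting.
directory = resolve_directory(["process-new.py", "/path/to/batch"])
```

With this in place, the shell script would only need to define the path once and call something like `python process-new.py "$BATCH_DIR"`, where `BATCH_DIR` is a variable name assumed for illustration.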