Conversion process phase: publish
[up](Conversion process phases)
- Installing csv2rdf4lod automation
- Conversion process phase: name
- Conversion process phase: retrieve
- Conversion process phase: csv-ify
- Conversion process phase: create conversion trigger
- Conversion process phase: pull conversion trigger
- ... (rinse and repeat; flavor to taste) ...
- Conversion process phase: tweak enhancement parameters
- Conversion process phase: pull conversion trigger
Pulling the conversion trigger will convert the tabular data in `source/` (or `manual/`) and place the RDF results in the `automatic/` directory. The `automatic/` directory contains output files whose names correspond to the input file names, so that you can easily find the output file that was derived from a given input file. For example, `automatic/HQI_HOSP.csv.e1.ttl` is derived by converting `source/HQI_HOSP.csv`.
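
For example, a conversion cockpit might look something like this after the trigger is pulled (a sketch; the exact output file names depend on which enhancement layers you have defined):

```bash
# Illustrative cockpit contents; output names follow the input-name-plus-layer pattern.
ls source/
#   HQI_FTNT.csv  HQI_HOSP.csv  HQI_HOSP_AHRQ.csv
ls automatic/
#   HQI_FTNT.csv.raw.ttl       HQI_FTNT.csv.e1.ttl
#   HQI_HOSP.csv.raw.ttl       HQI_HOSP.csv.e1.ttl
#   HQI_HOSP_AHRQ.csv.raw.ttl  HQI_HOSP_AHRQ.csv.e1.ttl
```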
The `automatic/` directory contains all of the converted RDF results. This page discusses how csv2rdf4lod-automation can help publish those conversion results in a consistent, self-describing form. Publishing with csv2rdf4lod-automation ensures that the converted results are placed in the same locations (i.e., dump files and named graphs in a SPARQL endpoint) that are mentioned within the dataset's own metadata, which was asserted when the dataset was converted.
While it makes sense to choose output file names that correspond to their input file names (e.g. `HQI_FTNT.csv`, `HQI_HOSP.csv`, and `HQI_HOSP_AHRQ.csv`), it does not make sense to preserve this physical organization when we present our final converted datasets. If we did our job correctly during enhancement, the data from each input file is appropriately connected to the data from the other input files, and this integrated view is the organization that we should present to anyone exploring our collection of results. (For what it's worth, the RDF graph derived from each input file can still be traced back to the data file from which it came by looking at the RDF structure itself.)
The `publish/` directory reorganizes the RDF data from `automatic/` according to the more consistent [source - dataset - version](Directory Conventions) scheme that is central to csv2rdf4lod's design.
When publishing, all files are aggregated into a single VoID dataset. The original file names matter less once the data has been transformed to RDF, because the original file groupings are reflected in the VoID dataset descriptions created during conversion; we aren't losing structure when we aggregate. The aggregation file in `publish/` is created from the conversion files in `automatic/` and is named using its source, dataset, and version identifiers. The files in `publish/` are ready for publication, but are not necessarily published yet.
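
For example, using the source, dataset, and version identifiers that appear in the VoID snippet below (`hub-healthdata-gov`, `hospital-compare`, `2012-Jul-17`), the aggregate would be named along these lines (illustrative listing):

```bash
# One aggregate per version, named using SOURCEID-DATASETID-VERSIONID.
ls publish/
#   hub-healthdata-gov-hospital-compare-2012-Jul-17.ttl.gz
#   hub-healthdata-gov-hospital-compare-2012-Jul-17.void.ttl
#   ...
```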
When the converter transformed each tabular file into RDF, it included some VoID descriptions of the dataset it produced. Combining the VoID from each conversion provides a larger picture of the different parts of the RDF graph that was created:

    <http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17>
       a void:Dataset, conversion:VersionedDataset;
       void:dataDump
          <http://purl.org/twc/health/source/hub-healthdata-gov/file/hospital-compare/version/2012-Jul-17/conversion/hub-healthdata-gov-hospital-compare-2012-Jul-17.ttl> .
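
A quick way to check which dump files the aggregated metadata advertises is to grep the `.void.ttl` file for `void:dataDump` (the file name here is illustrative, following the naming scheme described below):

```bash
# -A 1 includes the following line, since the dump URL may be serialized on its own line.
grep -A 1 "void:dataDump" publish/hub-healthdata-gov-hospital-compare-2012-Jul-17.void.ttl
```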
If `CSV2RDF4LOD_PUBLISH` is "`true`", the conversion trigger will aggregate the output from `automatic/*` into `publish/*` and publish the aggregates in a variety of forms (dump files, endpoint, etc.) according to the current values of the [CSV2RDF4LOD_ environment variables](Controlling automation using CSV2RDF4LOD_ environment variables).
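
As a minimal sketch, the relevant variables can be exported in the shell before pulling the trigger; the values and the trigger script name below are illustrative, and the environment-variables page linked above is the authoritative reference:

```bash
export CSV2RDF4LOD_PUBLISH="true"            # aggregate automatic/* into publish/* and publish
export CSV2RDF4LOD_PUBLISH_COMPRESS="true"   # gzip the dump files
export CSV2RDF4LOD_PUBLISH_NT="false"        # skip the N-TRIPLES dump

./convert-hospital-compare.sh                # hypothetical conversion trigger name
```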
If you've already converted the data and just want to publish the aggregates in an additional way, the scripts in `publish/bin/*.sh` can be used to bypass the state of the environment variables and just do it. The script names follow the pattern `action-source-dataset-version.sh`. `publish/bin/publish.sh` can be used to aggregate and publish according to the environment variables, just like the conversion trigger would do. (Example invocations follow the lists below.)
The following are the most frequently used:

    publish/bin/publish.sh
    publish/bin/virtuoso-load-SOURCEID-DATASETID-VERSIONID.sh
    publish/bin/virtuoso-delete-SOURCEID-DATASETID-VERSIONID.sh
    publish/bin/ln-to-www-root-SOURCEID-DATASETID-VERSIONID.sh

These are less used but still a primary focus:

    publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID-void.sh
    publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID.sh

These haven't been used in a while (we use a Virtuoso endpoint):

    publish/bin/tdbloader-SOURCEID-DATASETID-VERSIONID.sh
    publish/bin/joseki-config-anterior-SOURCEID-DATASETID-VERSIONID.ttl
    publish/bin/4store-SOURCEID-DATASETID-VERSIONID.sh
    publish/bin/lod-materialize-apache-SOURCEID-DATASETID-VERSIONID.sh
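
For example, from within the conversion cockpit the invocations look roughly like this (the identifiers are illustrative, matching the hospital-compare example above):

```bash
# Aggregate and publish according to the CSV2RDF4LOD_ environment variables,
# just like the conversion trigger would:
publish/bin/publish.sh

# Or bypass the environment variables and load this one version into Virtuoso:
publish/bin/virtuoso-load-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh

# ...and link the dump files under the web root so they can be downloaded:
publish/bin/ln-to-www-root-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh
```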
As noted above, when `CSV2RDF4LOD_PUBLISH` is "`true`" the conversion trigger aggregates the output from `automatic/*` into `publish/*`, which results in files named in the form:

    publish/SOURCEID-DATASETID-VERSIONID.ttl.gz
    publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl
    publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz
    publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl
    publish/SOURCEID-DATASETID-VERSIONID.nt.gz
    publish/SOURCEID-DATASETID-VERSIONID.nt.graph
    publish/SOURCEID-DATASETID-VERSIONID.void.ttl
    publish/SOURCEID-DATASETID-VERSIONID.pml.ttl
(The code that aggregates from `automatic/` to `publish/` is `$CSV2RDF4LOD_HOME/bin/convert-aggregate.sh`.)
- `publish/SOURCEID-DATASETID-VERSIONID.nt.graph` - contains one line with the URI of the dataset version, which is useful when loading into a named graph.
  - The same URI can be obtained by running `cr-dataset-uri.sh` from the conversion cockpit.
- `publish/SOURCEID-DATASETID-VERSIONID.ttl.gz` - all of the dataset in Turtle syntax, gzipped.
  - Dump files are compressed if `CSV2RDF4LOD_PUBLISH_COMPRESS="true"`.
- `publish/SOURCEID-DATASETID-VERSIONID.raw.ttl.gz` - only the raw layer in Turtle syntax, gzipped.
- `publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl` - only a sample of the raw layer in Turtle syntax.
- `publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz` - only the enhancement 1 layer in Turtle syntax, gzipped.
- `publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl` - only a sample of the enhancement 1 layer in Turtle syntax.
- `publish/SOURCEID-DATASETID-VERSIONID.nt.gz` - all of the dataset in N-TRIPLES syntax, gzipped.
  - Only produced if `CSV2RDF4LOD_PUBLISH_NT="true"`.
- `publish/SOURCEID-DATASETID-VERSIONID.void.ttl` - all metadata, including DC, VoID, and PML.
  - Would be more appropriately named `publish/SOURCEID-DATASETID-VERSIONID.meta.ttl`.
- `publish/SOURCEID-DATASETID-VERSIONID.pml.ttl` - all provenance-related metadata, including PML, OPM, Provenir, etc.
  - Would be more appropriately named `publish/SOURCEID-DATASETID-VERSIONID.provenance.ttl`.
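
A minimal sketch of how these files are typically used by hand (file names are illustrative; how you load the dump depends on your triple store):

```bash
VERSION=hub-healthdata-gov-hospital-compare-2012-Jul-17   # illustrative identifiers

# The named graph URI that was asserted during conversion:
GRAPH=$(cat "publish/$VERSION.nt.graph")
echo "Load into named graph: $GRAPH"

# Peek at the aggregated Turtle dump without unpacking it to disk:
gunzip -c "publish/$VERSION.ttl.gz" | head
```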
The `publish/` directory in the conversion cockpit contains files ready to be released into the wild. Some options for what to do with them:
- Generic: Publishing conversion results with a Virtuoso triplestore
- Use Case: Publishing LOGD's International Open Government Data Search data
We've run into a few situations where some third parties are OK with us having and hosting the data, but not with having it listed in our data catalog (security through obscurity).

To prevent the dataset's metadata from being included in the metadata named graph (from which the dataset catalog is created), invoke:

    mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST
The next time `$CSV2RDF4LOD_HOME/bin/convert-aggregate.sh` reproduces the VoID file, it will see that the `.DO_NOT_LIST` file is present and will give the newly created file the `.DO_NOT_LIST` suffix as well.

This works because `$CSV2RDF4LOD_HOME/bin/cr-publish-void-to-endpoint.sh` looks for files matching `*/version/*/publish -name "*void.ttl"`, as sketched below.
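
That search corresponds roughly to the following `find` invocation (a sketch; the exact flags used by `cr-publish-void-to-endpoint.sh` may differ):

```bash
# Gather the VoID metadata files that feed the dataset catalog:
find */version/*/publish -name "*void.ttl"
# A file renamed to *.void.ttl.DO_NOT_LIST no longer matches "*void.ttl",
# so its dataset stays out of the metadata named graph.
```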
To let the metadata flow, just move it back:

    mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl
Use it!
- Follow linked data
- Grab a dump file off of the web
- Query your SPARQL endpoint
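
For example, after loading into a Virtuoso endpoint, a quick sanity check over the dataset version's named graph might look like this (the endpoint URL and file name are illustrative):

```bash
GRAPH=$(cat publish/hub-healthdata-gov-hospital-compare-2012-Jul-17.nt.graph)  # illustrative

# Count the triples loaded into that named graph.
curl --get 'http://localhost:8890/sparql' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode "query=SELECT (COUNT(*) AS ?n) WHERE { GRAPH <$GRAPH> { ?s ?p ?o } }"
```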
Review:
- Follow through A quick and easy conversion
- Remember the Conversion process phases
- Check out Real world examples