
Conversion process phase: publish

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

[up](Conversion process phases)

What is first?

Pulling the conversion trigger will convert the tabular data in source/ (or manual/) and place the RDF results in the automatic/ directory. The automatic/ directory contains output files whose names correspond to the input filenames, so that you can easily find the output file that was derived from a given input file. For example, automatic/HQI_HOSP.csv.e1.ttl is derived by converting source/HQI_HOSP.csv.
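
For instance, with the Hospital Compare inputs mentioned below, a conversion cockpit might look roughly like this after the trigger has run (the listing is illustrative; .e1 marks enhancement layer 1):

```
# Illustrative cockpit contents after pulling the conversion trigger.
ls source/ automatic/
# source/:    HQI_FTNT.csv  HQI_HOSP.csv  HQI_HOSP_AHRQ.csv
# automatic/: HQI_FTNT.csv.e1.ttl  HQI_HOSP.csv.e1.ttl  HQI_HOSP_AHRQ.csv.e1.ttl
```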

What we will cover

The automatic/ directory contains all of the converted RDF results. This page discusses what csv2rdf4lod-automation can do to help publish those results in a consistent, self-described form. Publishing with csv2rdf4lod-automation ensures that the converted results end up in the same locations (i.e., dump file URLs and named graphs in a SPARQL endpoint) that the dataset's own metadata asserted when it was converted.

Let's get to it!

Grouping the conversions of all tabular input files

While it makes sense to choose output filenames that correspond to their input filenames (e.g. HQI_FTNT.csv, HQI_HOSP.csv, and HQI_HOSP_AHRQ.csv), it does not make sense to preserve this physical organization when we present our final converted datasets. If we did our job correctly during enhancement, the data from each input file is appropriately connected to the data from the other input files, and this integrated view is the organization that we should present to anyone exploring our collection of results. (For what it's worth, the RDF graph derived from each input file can be traced back to the data file from which it came by looking at the RDF structure itself.)

The publish/ directory reorganizes the RDF data from automatic/ according to the more consistent [source - dataset - version](Directory Conventions) scheme that is central to csv2rdf4lod's design.

When publishing, all files are aggregated into a single VoID dataset. The original file names are less important after the data has been transformed to RDF: the original file groupings are reflected in the VoID dataset descriptions created during conversion, so we aren't losing structure when we aggregate. The aggregation file in publish/ is created from the conversion files in automatic/ and is named using its source, dataset, and version identifiers. The files in publish/ are ready for publication, but are not necessarily published yet.
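
Conceptually, the aggregation amounts to concatenating the conversion output into a single file named by those identifiers; a rough sketch using the identifiers from the example below (the real work, including compression and the metadata files, is done by $CSV2RDF4LOD_HOME/bin/convert-aggregate.sh):

```
# Rough sketch of the aggregation step (illustrative only; the real logic
# lives in $CSV2RDF4LOD_HOME/bin/convert-aggregate.sh).
source_id="hub-healthdata-gov"; dataset_id="hospital-compare"; version_id="2012-Jul-17"
cat automatic/*.ttl > "publish/$source_id-$dataset_id-$version_id.ttl"
gzip "publish/$source_id-$dataset_id-$version_id.ttl"   # when CSV2RDF4LOD_PUBLISH_COMPRESS="true"
```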

The converter provides metadata for free

When the converter transforms each tabular file into RDF, it includes metadata about the RDF dataset that it produces. Many existing vocabularies are reused to assert this metadata, including FOAF, DCTerms, VoID, PML, and VANN. Combining the metadata from each conversion provides a bigger picture of how the different parts of the RDF graph are organized. The principal organization is done with VoID, which creates a hierarchy of void:Datasets according to the [source - dataset - version](Directory Conventions) scheme.

<http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17> 
   a void:Dataset, conversion:VersionedDataset;
   void:dataDump 
   <http://purl.org/twc/health/source/hub-healthdata-gov/file/hospital-compare/version/2012-Jul-17/conversion/hub-healthdata-gov-hospital-compare-2012-Jul-17.ttl>;
   void:subset 
   <http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17/conversion/enhancement/1>;
.

<http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17/conversion/enhancement/1> 
   a void:Dataset, conversion:Dataset, conversion:LayerDataset;
   void:dataDump 
   <http://purl.org/twc/health/source/hub-healthdata-gov/file/hospital-compare/version/2012-Jul-17/conversion/hub-healthdata-gov-hospital-compare-2012-Jul-17.e1.ttl> .

(Figure: workflow of downloading a file into source/, making manual modifications in manual/, converting automatically into automatic/, aggregating results into publish/, and publishing to /var/www/ and a SPARQL endpoint.)

How do I publish my converted data?

Remember to use [cr-vars.sh](Script: cr vars.sh) to see the environment variables that are used to control csv2rdf4lod-automation.

If CSV2RDF4LOD_PUBLISH is "true", the conversion trigger will aggregate the output from automatic/* into publish/* and publish the aggregates in a variety of forms (dump files, a SPARQL endpoint, etc.) according to the current values of the [CSV2RDF4LOD environment variables](Controlling automation using CSV2RDF4LOD_ environment variables).

If you've already converted the data and just want to publish the aggregates in an additional way, the scripts in publish/bin/*.sh can be used to bypass the environment variable settings and just do it. The script names follow the pattern action-source-dataset-version.sh. publish/bin/publish.sh can be used to aggregate and publish according to the environment variables, just like the conversion trigger would.

The following are the most frequently used:

  • publish/bin/publish.sh
  • publish/bin/virtuoso-load-SOURCEID-DATASETID-VERSIONID.sh
  • publish/bin/virtuoso-delete-SOURCEID-DATASETID-VERSIONID.sh
  • publish/bin/ln-to-www-root-SOURCEID-DATASETID-VERSIONID.sh
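
For the hospital-compare example used on this page, those names expand as follows (a sketch assuming the source, dataset, and version identifiers from the VoID example above; run the scripts from the conversion cockpit):

```
# Aggregate automatic/* into publish/* and publish per the CSV2RDF4LOD_* settings.
publish/bin/publish.sh

# Load the aggregated dump into (or delete it from) the named graph for this version.
publish/bin/virtuoso-load-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh
publish/bin/virtuoso-delete-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh

# Symlink the dump files into the web root so the void:dataDump URLs resolve.
publish/bin/ln-to-www-root-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh
```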

These are less used but still a primary focus:

  • publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID-void.sh
  • publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID.sh

These haven't been used in a while (we use a Virtuoso endpoint):

  • publish/bin/tdbloader-SOURCEID-DATASETID-VERSIONID.sh
  • publish/bin/joseki-config-anterior-SOURCEID-DATASETID-VERSIONID.ttl
  • publish/bin/4store-SOURCEID-DATASETID-VERSIONID.sh
  • publish/bin/lod-materialize-apache-SOURCEID-DATASETID-VERSIONID.sh

What gets aggregated into publish/?

If CSV2RDF4LOD_PUBLISH is "true", the conversion trigger will aggregate the output from automatic/* into publish/*, which results in files named in the form:

publish/SOURCEID-DATASETID-VERSIONID.ttl.gz
publish/SOURCEID-DATASETID-VERSIONID.raw.ttl.gz
publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl
publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz
publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl
publish/SOURCEID-DATASETID-VERSIONID.nt.gz
publish/SOURCEID-DATASETID-VERSIONID.nt.graph
publish/SOURCEID-DATASETID-VERSIONID.void.ttl
publish/SOURCEID-DATASETID-VERSIONID.pml.ttl

(The aggregation from automatic/ to publish/ is performed by $CSV2RDF4LOD_HOME/bin/convert-aggregate.sh.)

  • publish/SOURCEID-DATASETID-VERSIONID.nt.graph

    • This contains one line with the URI of the dataset version, which is useful when loading into a named graph (see the sketch after this list).
    • The same URI could be obtained by running cr-dataset-uri.sh from the conversion cockpit.
  • publish/SOURCEID-DATASETID-VERSIONID.ttl.gz

    • This is all of the dataset in Turtle syntax, gzipped.
    • Dump files will be compressed if CSV2RDF4LOD_PUBLISH_COMPRESS="true".
  • publish/SOURCEID-DATASETID-VERSIONID.raw.ttl.gz

    • This is only the raw layer in Turtle syntax, gzipped.
  • publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl

    • This is only a sample of the raw layer in Turtle syntax (not gzipped).
  • publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz

    • This is only the enhancement 1 layer in Turtle syntax, gzipped.
  • publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl

    • This is only a sample of the enhancement 1 layer in Turtle syntax (not gzipped).
  • publish/SOURCEID-DATASETID-VERSIONID.nt.gz

    • This is all of the dataset in N-TRIPLES syntax, gzipped.
    • Only produced if CSV2RDF4LOD_PUBLISH_NT="true".
  • publish/SOURCEID-DATASETID-VERSIONID.void.ttl

    • This is all metadata, including DC, VoID, and PML.
    • Would be more appropriately named publish/SOURCEID-DATASETID-VERSIONID.meta.ttl
  • publish/SOURCEID-DATASETID-VERSIONID.pml.ttl

    • This is all provenance-related metadata, including PML, OPM, Provenir, etc.
    • Would be more appropriately named publish/SOURCEID-DATASETID-VERSIONID.provenance.ttl
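
If you load a dump into an endpoint yourself rather than using the virtuoso-load script above, the .nt.graph file supplies the named-graph URI. A minimal sketch, assuming a triple store that exposes the SPARQL 1.1 Graph Store protocol at a hypothetical http://localhost:8890/sparql-graph-crud URL (the exact endpoint and any authentication depend on your setup):

```
# Use the .nt.graph file to name the graph when loading the dump yourself.
graph=$(cat publish/SOURCEID-DATASETID-VERSIONID.nt.graph)   # one line: the versioned dataset URI
zcat publish/SOURCEID-DATASETID-VERSIONID.ttl.gz | \
  curl -X POST -H "Content-Type: text/turtle" --data-binary @- \
       "http://localhost:8890/sparql-graph-crud?graph=$graph"
```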

Publishing what is in publish/

The publish/ directory in the conversion cockpit contains files ready to be released into the wild; the publish/bin/ scripts listed above cover the usual options (loading into Virtuoso, linking the dump files into the web root, and materializing linked data files).

Environment variables that affect publishing:

csv2rdf4lod-automation primarily uses Virtuoso. Use [cr-vars.sh](Script: cr vars.sh) to see the full list of environment variables, including the ones that tell csv2rdf4lod-automation how to reach your Virtuoso triple store.
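
As a minimal sketch, a publishing configuration using only the variables discussed on this page might look like this (values are illustrative; the Virtuoso connection details such as endpoint, port, and credentials have their own variables, whose exact names cr-vars.sh will show):

```
# Example publishing configuration (illustrative values only).
export CSV2RDF4LOD_PUBLISH="true"                         # aggregate automatic/* into publish/* and publish it
export CSV2RDF4LOD_PUBLISH_COMPRESS="true"                # gzip the dump files
export CSV2RDF4LOD_PUBLISH_NT="true"                      # also produce an N-TRIPLES dump
export CSV2RDF4LOD_PUBLISH_RDFXML="false"                 # skip RDF/XML
export CSV2RDF4LOD_CONVERT_DUMP_FILE_EXTENSIONS="cr:auto" # let dump-file-extensions.sh pick the extensions
```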

Preventing a dataset from being listed

We've run into a few situations where some third parties are OK with us having the data and hosting it, but not having it listed in our data catalog (security through obscurity).

To prevent having the dataset's metadata included in the metadata named graph (from which the dataset catalog is created), invoke:

mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST

Next time $CSV2RDF4LOD_HOME/bin/convert-aggregate.sh reproduces the VoID file, it will see that the .DO_NOT_LIST file is present and will rename the newly created VoID file so that it also ends in .DO_NOT_LIST.

This works because $CSV2RDF4LOD_HOME/bin/cr-publish-void-to-endpoint.sh gathers metadata by looking for files matching */version/*/publish -name "*void.ttl", which does not match the renamed file.

To let the metadata flow, just move it back:

mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl
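
The regeneration step therefore behaves roughly like this (a simplified sketch, not the actual convert-aggregate.sh code):

```
# Simplified sketch of how the DO_NOT_LIST marker is honored when the VoID
# file is regenerated (illustrative; not the actual convert-aggregate.sh).
void="publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl"
if [ -e "$void.DO_NOT_LIST" ]; then
   mv "$void" "$void.DO_NOT_LIST"   # newly regenerated VoID file stays unlisted
fi
```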

CSV2RDF4LOD_CONVERT_DUMP_FILE_EXTENSIONS

When CSV2RDF4LOD_CONVERT_DUMP_FILE_EXTENSIONS is cr:auto, csv2rdf4lod-automation uses dump-file-extensions.sh to determine the correct value to pass to the converter, based on CSV2RDF4LOD_PUBLISH_COMPRESS, CSV2RDF4LOD_PUBLISH_RDFXML, and CSV2RDF4LOD_PUBLISH_NT.
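
A simplified sketch of that decision (illustrative only; not the actual dump-file-extensions.sh):

```
# Simplified sketch of choosing dump-file extensions from the publishing
# environment variables (illustrative; not the actual dump-file-extensions.sh).
extensions="ttl"
[ "$CSV2RDF4LOD_PUBLISH_NT"     = "true" ] && extensions="$extensions,nt"
[ "$CSV2RDF4LOD_PUBLISH_RDFXML" = "true" ] && extensions="$extensions,rdf"
if [ "$CSV2RDF4LOD_PUBLISH_COMPRESS" = "true" ]; then
   extensions="$(echo "$extensions" | sed 's/,/.gz,/g').gz"   # e.g. ttl.gz,nt.gz
fi
echo "$extensions"   # the kind of value handed to the converter (see -VoIDDumpExtensions below)
```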

The csv2rdf4lod Java converter accepts the following arguments that are related to file extensions:

  • -VoIDDumpExtensions / -vde gets its values from dump-file-extensions.sh.
    • The void:dataDump links in the VoID files do not respond to this parameter. More to look at here...
  • -outputExtension / -oe does not appear to be given to the converter - was it intended for the extension of the data dump?

Developer notes

What's next?

Use it!

  • Follow linked data
  • Grab a dump file off of the web
  • Query your SPARQL endpoint
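
For example, a quick check that a load worked, assuming a SPARQL endpoint at a hypothetical http://localhost:8890/sparql and the named-graph URI from the .nt.graph file:

```
# Count the triples in the named graph that was just loaded (endpoint URL is
# hypothetical; substitute your own).
graph=$(cat publish/SOURCEID-DATASETID-VERSIONID.nt.graph)
curl -G "http://localhost:8890/sparql" \
     -H "Accept: text/csv" \
     --data-urlencode "query=SELECT (COUNT(*) AS ?triples) WHERE { GRAPH <$graph> { ?s ?p ?o } }"
```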

Review:
