Conversion process phase: publish
[up](Conversion process phases)
- Installing csv2rdf4lod automation
- Conversion process phase: name
- Conversion process phase: retrieve
- Conversion process phase: csv-ify
- Conversion process phase: create conversion trigger
- Conversion process phase: pull conversion trigger
- ... (rinse and repeat; flavor to taste) ...
- Conversion process phase: tweak enhancement parameters
- Conversion process phase: pull conversion trigger
Pulling the conversion trigger will convert the tabular data in `source/` (or `manual/`) and place the RDF results in the `automatic/` directory. The `automatic/` directory contains output files whose names correspond to the input file names, so that you can easily find the output file that was derived from a given input file. For example, `automatic/HQI_HOSP.csv.e1.ttl` is derived by converting `source/HQI_HOSP.csv`.
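
For example, a conversion cockpit might look something like this after the trigger is pulled (a sketch; the exact output file names depend on which enhancement layers you have defined):

```bash
# Illustrative cockpit contents; output names follow the input-name-plus-layer pattern.
ls source/
#   HQI_FTNT.csv  HQI_HOSP.csv  HQI_HOSP_AHRQ.csv
ls automatic/
#   HQI_FTNT.csv.raw.ttl       HQI_FTNT.csv.e1.ttl
#   HQI_HOSP.csv.raw.ttl       HQI_HOSP.csv.e1.ttl
#   HQI_HOSP_AHRQ.csv.raw.ttl  HQI_HOSP_AHRQ.csv.e1.ttl
```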
The `automatic/` directory contains all of the converted RDF results. This page discusses how csv2rdf4lod-automation can help publish those conversion results in a consistent, self-describing form. Publishing with csv2rdf4lod-automation ensures that the converted results are placed in the same locations (i.e., dump files and named graphs in a SPARQL endpoint) that are mentioned within the dataset's own metadata, which was asserted when the dataset was converted.
While it makes sense to choose output file names that correspond to their input file names (e.g. `HQI_FTNT.csv`, `HQI_HOSP.csv`, and `HQI_HOSP_AHRQ.csv`), it does not make sense to preserve this physical organization when we present our final converted datasets. If we did our job correctly during enhancement, the data from each input file is appropriately connected to the data from the other input files, and this integrated view is the organization that we should present to anyone exploring our collection of results. (For what it's worth, the RDF graph derived from each input file can still be traced back to the data file from which it came by looking at the RDF structure itself.)
The `publish/` directory reorganizes the RDF data from `automatic/` according to the more consistent [source - dataset - version](Directory Conventions) scheme that is central to csv2rdf4lod's design.
When publishing, all files are aggregated into a single VoID dataset. The original file names matter less once the data has been transformed to RDF, because the original file groupings are reflected in the VoID dataset descriptions created during conversion; we aren't losing structure when we aggregate. The aggregation file in `publish/` is created from the conversion files in `automatic/` and is named using its source, dataset, and version identifiers. The files in `publish/` are ready for publication, but are not necessarily published yet.
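
For example, using the source, dataset, and version identifiers that appear in the VoID snippet below (`hub-healthdata-gov`, `hospital-compare`, `2012-Jul-17`), the aggregate would be named along these lines (illustrative listing):

```bash
# One aggregate per version, named using SOURCEID-DATASETID-VERSIONID.
ls publish/
#   hub-healthdata-gov-hospital-compare-2012-Jul-17.ttl.gz
#   hub-healthdata-gov-hospital-compare-2012-Jul-17.void.ttl
#   ...
```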
When the converter transformed each tabular file into RDF, it included some VoID descriptions of the dataset it produced. Combining the VoID from each conversion provides a larger picture of the different parts of the RDF graph that was created:

    <http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17>
       a void:Dataset, conversion:VersionedDataset;
       void:dataDump
          <http://purl.org/twc/health/source/hub-healthdata-gov/file/hospital-compare/version/2012-Jul-17/conversion/hub-healthdata-gov-hospital-compare-2012-Jul-17.ttl> .
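
A quick way to check which dump files the aggregated metadata advertises is to grep the `.void.ttl` file for `void:dataDump` (the file name here is illustrative, following the naming scheme described below):

```bash
# -A 1 includes the following line, since the dump URL may be serialized on its own line.
grep -A 1 "void:dataDump" publish/hub-healthdata-gov-hospital-compare-2012-Jul-17.void.ttl
```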
If `CSV2RDF4LOD_PUBLISH` is "`true`", the conversion trigger will aggregate the output from `automatic/*` into `publish/*` and publish the aggregates in a variety of forms (dump files, endpoint, etc.) according to the current values of the [CSV2RDF4LOD_ environment variables](Controlling automation using CSV2RDF4LOD_ environment variables).
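
As a minimal sketch, the relevant variables can be exported in the shell before pulling the trigger; the values and the trigger script name below are illustrative, and the environment-variables page linked above is the authoritative reference:

```bash
export CSV2RDF4LOD_PUBLISH="true"            # aggregate automatic/* into publish/* and publish
export CSV2RDF4LOD_PUBLISH_COMPRESS="true"   # gzip the dump files
export CSV2RDF4LOD_PUBLISH_NT="false"        # skip the N-TRIPLES dump

./convert-hospital-compare.sh                # hypothetical conversion trigger name
```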
If you've already converted the data and just want to publish the aggregates in an additional way, the scripts in `publish/bin/*.sh` can be used to bypass the state of the environment variables and just do it. The script names follow the pattern `action-source-dataset-version.sh`. `publish/bin/publish.sh` can be used to aggregate and publish according to the environment variables, just like the conversion trigger would do. (Example invocations follow the lists below.)
The following are the most frequently used:

    publish/bin/publish.sh
    publish/bin/virtuoso-load-SOURCEID-DATASETID-VERSIONID.sh
    publish/bin/virtuoso-delete-SOURCEID-DATASETID-VERSIONID.sh
    publish/bin/ln-to-www-root-SOURCEID-DATASETID-VERSIONID.sh

These are less used but still a primary focus:

    publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID-void.sh
    publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID.sh

These haven't been used in a while (we use a Virtuoso endpoint):

    publish/bin/tdbloader-SOURCEID-DATASETID-VERSIONID.sh
    publish/bin/joseki-config-anterior-SOURCEID-DATASETID-VERSIONID.ttl
    publish/bin/4store-SOURCEID-DATASETID-VERSIONID.sh
    publish/bin/lod-materialize-apache-SOURCEID-DATASETID-VERSIONID.sh
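
For example, from within the conversion cockpit the invocations look roughly like this (the identifiers are illustrative, matching the hospital-compare example above):

```bash
# Aggregate and publish according to the CSV2RDF4LOD_ environment variables,
# just like the conversion trigger would:
publish/bin/publish.sh

# Or bypass the environment variables and load this one version into Virtuoso:
publish/bin/virtuoso-load-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh

# ...and link the dump files under the web root so they can be downloaded:
publish/bin/ln-to-www-root-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh
```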
As noted above, when `CSV2RDF4LOD_PUBLISH` is "`true`" the conversion trigger aggregates the output from `automatic/*` into `publish/*`, which results in files named in the form:

    publish/SOURCEID-DATASETID-VERSIONID.ttl.gz
    publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl
    publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz
    publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl
    publish/SOURCEID-DATASETID-VERSIONID.nt.gz
    publish/SOURCEID-DATASETID-VERSIONID.nt.graph
    publish/SOURCEID-DATASETID-VERSIONID.void.ttl
    publish/SOURCEID-DATASETID-VERSIONID.pml.ttl
(The code that aggregates from `automatic/` to `publish/` is `$CSV2RDF4LOD_HOME/bin/convert-aggregate.sh`.)
- `publish/SOURCEID-DATASETID-VERSIONID.nt.graph` - contains one line with the URI of the dataset version, which is useful when loading into a named graph.
  - The same URI can be obtained by running `cr-dataset-uri.sh` from the conversion cockpit.
- `publish/SOURCEID-DATASETID-VERSIONID.ttl.gz` - all of the dataset in Turtle syntax, gzipped.
  - Dump files are compressed if `CSV2RDF4LOD_PUBLISH_COMPRESS="true"`.
- `publish/SOURCEID-DATASETID-VERSIONID.raw.ttl.gz` - only the raw layer in Turtle syntax, gzipped.
- `publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl` - only a sample of the raw layer in Turtle syntax.
- `publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz` - only the enhancement 1 layer in Turtle syntax, gzipped.
- `publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl` - only a sample of the enhancement 1 layer in Turtle syntax.
- `publish/SOURCEID-DATASETID-VERSIONID.nt.gz` - all of the dataset in N-TRIPLES syntax, gzipped.
  - Only produced if `CSV2RDF4LOD_PUBLISH_NT="true"`.
- `publish/SOURCEID-DATASETID-VERSIONID.void.ttl` - all metadata, including DC, VoID, and PML.
  - Would be more appropriately named `publish/SOURCEID-DATASETID-VERSIONID.meta.ttl`.
- `publish/SOURCEID-DATASETID-VERSIONID.pml.ttl` - all provenance-related metadata, including PML, OPM, Provenir, etc.
  - Would be more appropriately named `publish/SOURCEID-DATASETID-VERSIONID.provenance.ttl`.
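
A minimal sketch of how these files are typically used by hand (file names are illustrative; how you load the dump depends on your triple store):

```bash
VERSION=hub-healthdata-gov-hospital-compare-2012-Jul-17   # illustrative identifiers

# The named graph URI that was asserted during conversion:
GRAPH=$(cat "publish/$VERSION.nt.graph")
echo "Load into named graph: $GRAPH"

# Peek at the aggregated Turtle dump without unpacking it to disk:
gunzip -c "publish/$VERSION.ttl.gz" | head
```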
The `publish/` directory in the conversion cockpit contains files ready to be released into the wild. Some options for what to do with them:
- Generic: Publishing conversion results with a Virtuoso triplestore
- Use Case: Publishing LOGD's International Open Government Data Search data
We've run into a few situations where some third parties are OK with us having and hosting the data, but not with having it listed in our data catalog (security through obscurity).

To prevent the dataset's metadata from being included in the metadata named graph (from which the dataset catalog is created), invoke:

    mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST
The next time `$CSV2RDF4LOD_HOME/bin/convert-aggregate.sh` reproduces the VoID file, it will see that the `.DO_NOT_LIST` file is present and will give the newly created file the `.DO_NOT_LIST` suffix as well.

This works because `$CSV2RDF4LOD_HOME/bin/cr-publish-void-to-endpoint.sh` looks for files matching `*/version/*/publish -name "*void.ttl"`, as sketched below.
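
That search corresponds roughly to the following `find` invocation (a sketch; the exact flags used by `cr-publish-void-to-endpoint.sh` may differ):

```bash
# Gather the VoID metadata files that feed the dataset catalog:
find */version/*/publish -name "*void.ttl"
# A file renamed to *.void.ttl.DO_NOT_LIST no longer matches "*void.ttl",
# so its dataset stays out of the metadata named graph.
```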
To let the metadata flow, just move it back:

    mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl
Use it!
- Follow linked data
- Grab a dump file off of the web
- Query your SPARQL endpoint
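
For example, after loading into a Virtuoso endpoint, a quick sanity check over the dataset version's named graph might look like this (the endpoint URL and file name are illustrative):

```bash
GRAPH=$(cat publish/hub-healthdata-gov-hospital-compare-2012-Jul-17.nt.graph)  # illustrative

# Count the triples loaded into that named graph.
curl --get 'http://localhost:8890/sparql' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode "query=SELECT (COUNT(*) AS ?n) WHERE { GRAPH <$GRAPH> { ?s ?p ?o } }"
```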
Review:
- Follow through A quick and easy conversion
- Remember the Conversion process phases
- Check out Real world examples