
Conversion process phase: publish

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

[up](Conversion process phases)

What is first?

Pulling the conversion trigger will convert the tabular data in source/ (or manual/) and place the RDF results in the automatic/ directory. The automatic/ directory contains output files whose names correspond to the input filenames, so that you can easily find the output file that was derived from a given input file. For example, automatic/HQI_HOSP.csv.e1.ttl is derived by converting source/HQI_HOSP.csv.
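
For instance, with the Hospital Compare inputs mentioned below, a conversion cockpit might look roughly like this after the trigger has run (the listing is illustrative; .e1 marks enhancement layer 1):

```
# Illustrative cockpit contents after pulling the conversion trigger.
ls source/ automatic/
# source/:    HQI_FTNT.csv  HQI_HOSP.csv  HQI_HOSP_AHRQ.csv
# automatic/: HQI_FTNT.csv.e1.ttl  HQI_HOSP.csv.e1.ttl  HQI_HOSP_AHRQ.csv.e1.ttl
```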

What we will cover

The automatic/ directory contains all of the converted RDF results. This page discusses what csv2rdf4lod-automation can do to help publish those results in a consistent, self-described form. Publishing with csv2rdf4lod-automation ensures that the converted results end up in the same locations (i.e., dump file URLs and named graphs in a SPARQL endpoint) that the dataset's own metadata asserted when it was converted.

Let's get to it!

Grouping the conversions of all tabular input files

While it makes sense to choose output filenames that correspond to their input filenames (e.g. HQI_FTNT.csv, HQI_HOSP.csv, and HQI_HOSP_AHRQ.csv), it does not make sense to preserve this physical organization when we present our final converted datasets. If we did our job correctly during enhancement, the data from each input file is appropriately connected to the data from the other input files, and this integrated view is the organization that we should present to anyone exploring our collection of results. (For what it's worth, the RDF graph derived from each input file can be traced back to the data file from which it came by looking at the RDF structure itself.)

The publish/ directory reorganizes the RDF data from automatic/ according to the more consistent [source - dataset - version](Directory Conventions) scheme that is central to csv2rdf4lod's design.

When publishing, all files are aggregated into a single VoID dataset. The original file names are less important after the data has been transformed to RDF: the original file groupings are reflected in the VoID dataset descriptions created during conversion, so we aren't losing structure when we aggregate. The aggregation file in publish/ is created from the conversion files in automatic/ and is named using its source, dataset, and version identifiers. The files in publish/ are ready for publication, but are not necessarily published yet.
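
Conceptually, the aggregation amounts to concatenating the conversion output into a single file named by those identifiers; a rough sketch using the identifiers from the example below (the real work, including compression and the metadata files, is done by $CSV2RDF4LOD_HOME/bin/convert-aggregate.sh):

```
# Rough sketch of the aggregation step (illustrative only; the real logic
# lives in $CSV2RDF4LOD_HOME/bin/convert-aggregate.sh).
source_id="hub-healthdata-gov"; dataset_id="hospital-compare"; version_id="2012-Jul-17"
cat automatic/*.ttl > "publish/$source_id-$dataset_id-$version_id.ttl"
gzip "publish/$source_id-$dataset_id-$version_id.ttl"   # when CSV2RDF4LOD_PUBLISH_COMPRESS="true"
```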

The converter provides metadata for free

When the converter transforms each tabular file into RDF, it includes metadata about the RDF dataset that it produces. Many existing vocabularies are reused to assert this metadata, including FOAF, DCTerms, VoID, PML, and VANN. Combining the metadata from each conversion provides a bigger picture of how the different parts of the RDF graph are organized. The principal organization is done with VoID, which creates a hierarchy of void:Datasets according to the [source - dataset - version](Directory Conventions) scheme.

<http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17> 
   a void:Dataset, conversion:VersionedDataset;
   void:dataDump 
   <http://purl.org/twc/health/source/hub-healthdata-gov/file/hospital-compare/version/2012-Jul-17/conversion/hub-healthdata-gov-hospital-compare-2012-Jul-17.ttl>;
   void:subset 
   <http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17/conversion/enhancement/1>;
.

<http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17/conversion/enhancement/1> 
   a void:Dataset, conversion:Dataset, conversion:LayerDataset;
   void:dataDump 
   <http://purl.org/twc/health/source/hub-healthdata-gov/file/hospital-compare/version/2012-Jul-17/conversion/hub-healthdata-gov-hospital-compare-2012-Jul-17.e1.ttl> .

(Figure: workflow of downloading a file into source/, making manual modifications in manual/, converting automatically into automatic/, aggregating results into publish/, and publishing to /var/www/ and a SPARQL endpoint.)

How do I publish my converted data?

Remember to use [cr-vars.sh](Script: cr vars.sh) to see the environment variables that are used to control csv2rdf4lod-automation.

If CSV2RDF4LOD_PUBLISH is "true", the conversion trigger will aggregate the output from automatic/* into publish/* and publish the aggregates in a variety of forms (dump files, a SPARQL endpoint, etc.) according to the current values of the [CSV2RDF4LOD environment variables](Controlling automation using CSV2RDF4LOD_ environment variables).

If you've already converted the data and just want to publish the aggregates in an additional way, the scripts in publish/bin/*.sh can be used to bypass the environment variable settings and just do it. The script names follow the pattern action-source-dataset-version.sh. publish/bin/publish.sh can be used to aggregate and publish according to the environment variables, just like the conversion trigger would.

The following are the most frequently used:

  • publish/bin/publish.sh
  • publish/bin/virtuoso-load-SOURCEID-DATASETID-VERSIONID.sh
  • publish/bin/virtuoso-delete-SOURCEID-DATASETID-VERSIONID.sh
  • publish/bin/ln-to-www-root-SOURCEID-DATASETID-VERSIONID.sh
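
For the hospital-compare example used on this page, those names expand as follows (a sketch assuming the source, dataset, and version identifiers from the VoID example above; run the scripts from the conversion cockpit):

```
# Aggregate automatic/* into publish/* and publish per the CSV2RDF4LOD_* settings.
publish/bin/publish.sh

# Load the aggregated dump into (or delete it from) the named graph for this version.
publish/bin/virtuoso-load-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh
publish/bin/virtuoso-delete-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh

# Symlink the dump files into the web root so the void:dataDump URLs resolve.
publish/bin/ln-to-www-root-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh
```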

These are less used but still a primary focus:

  • publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID-void.sh
  • publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID.sh

These haven't been used in a while (we use a Virtuoso endpoint):

  • publish/bin/tdbloader-SOURCEID-DATASETID-VERSIONID.sh
  • publish/bin/joseki-config-anterior-SOURCEID-DATASETID-VERSIONID.ttl
  • publish/bin/4store-SOURCEID-DATASETID-VERSIONID.sh
  • publish/bin/lod-materialize-apache-SOURCEID-DATASETID-VERSIONID.sh

What gets aggregated into publish/?

If CSV2RDF4LOD_PUBLISH is "true", the conversion trigger will aggregate the output from automatic/* into publish/*, which results in files named in the form:

publish/SOURCEID-DATASETID-VERSIONID.ttl.gz
publish/SOURCEID-DATASETID-VERSIONID.raw.ttl.gz
publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl
publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz
publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl
publish/SOURCEID-DATASETID-VERSIONID.nt.gz
publish/SOURCEID-DATASETID-VERSIONID.nt.graph
publish/SOURCEID-DATASETID-VERSIONID.void.ttl
publish/SOURCEID-DATASETID-VERSIONID.pml.ttl

(The aggregation from automatic/ to publish/ is performed by $CSV2RDF4LOD_HOME/bin/convert-aggregate.sh.)

  • publish/SOURCEID-DATASETID-VERSIONID.nt.graph

    • This contains one line with the URI of the dataset version, which is useful when loading into a named graph (see the sketch after this list).
    • The same URI could be obtained by running cr-dataset-uri.sh from the conversion cockpit.
  • publish/SOURCEID-DATASETID-VERSIONID.ttl.gz

    • This is all of the dataset in Turtle syntax, gzipped.
    • Dump files will be compressed if CSV2RDF4LOD_PUBLISH_COMPRESS="true".
  • publish/SOURCEID-DATASETID-VERSIONID.raw.ttl.gz

    • This is only the raw layer in Turtle syntax, gzipped.
  • publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl

    • This is only a sample of the raw layer in Turtle syntax (not gzipped).
  • publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz

    • This is only the enhancement 1 layer in Turtle syntax, gzipped.
  • publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl

    • This is only a sample of the enhancement 1 layer in Turtle syntax (not gzipped).
  • publish/SOURCEID-DATASETID-VERSIONID.nt.gz

    • This is all of the dataset in N-TRIPLES syntax, gzipped.
    • Only produced if CSV2RDF4LOD_PUBLISH_NT="true".
  • publish/SOURCEID-DATASETID-VERSIONID.void.ttl

    • This is all metadata, including DC, VoID, and PML.
    • Would be more appropriately named publish/SOURCEID-DATASETID-VERSIONID.meta.ttl
  • publish/SOURCEID-DATASETID-VERSIONID.pml.ttl

    • This is all provenance-related metadata, including PML, OPM, Provenir, etc.
    • Would be more appropriately named publish/SOURCEID-DATASETID-VERSIONID.provenance.ttl
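
If you load a dump into an endpoint yourself rather than using the virtuoso-load script above, the .nt.graph file supplies the named-graph URI. A minimal sketch, assuming a triple store that exposes the SPARQL 1.1 Graph Store protocol at a hypothetical http://localhost:8890/sparql-graph-crud URL (the exact endpoint and any authentication depend on your setup):

```
# Use the .nt.graph file to name the graph when loading the dump yourself.
graph=$(cat publish/SOURCEID-DATASETID-VERSIONID.nt.graph)   # one line: the versioned dataset URI
zcat publish/SOURCEID-DATASETID-VERSIONID.ttl.gz | \
  curl -X POST -H "Content-Type: text/turtle" --data-binary @- \
       "http://localhost:8890/sparql-graph-crud?graph=$graph"
```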

Publishing what is in publish/

The publish/ directory in the conversion cockpit contains files ready to be released into the wild; the publish/bin/ scripts listed above cover the usual options (loading into Virtuoso, linking the dump files into the web root, and materializing linked data files).

Environment variables that affect publishing:

csv2rdf4lod-automation primarily uses Virtuoso. Use [cr-vars.sh](Script: cr vars.sh) to see the full list of environment variables, including the ones that tell csv2rdf4lod-automation how to reach your Virtuoso triple store.
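
As a minimal sketch, a publishing configuration using only the variables discussed on this page might look like this (values are illustrative; the Virtuoso connection details such as endpoint, port, and credentials have their own variables, whose exact names cr-vars.sh will show):

```
# Example publishing configuration (illustrative values only).
export CSV2RDF4LOD_PUBLISH="true"                         # aggregate automatic/* into publish/* and publish it
export CSV2RDF4LOD_PUBLISH_COMPRESS="true"                # gzip the dump files
export CSV2RDF4LOD_PUBLISH_NT="true"                      # also produce an N-TRIPLES dump
export CSV2RDF4LOD_PUBLISH_RDFXML="false"                 # skip RDF/XML
export CSV2RDF4LOD_CONVERT_DUMP_FILE_EXTENSIONS="cr:auto" # let dump-file-extensions.sh pick the extensions
```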

Preventing a dataset from being listed

We've run into a few situations where some third parties are OK with us having the data and hosting it, but not having it listed in our data catalog (security through obscurity).

To prevent having the dataset's metadata included in the metadata named graph (from which the dataset catalog is created), invoke:

mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST

Next time $CSV2RDF4LOD_HOME/bin/convert-aggregate.sh reproduces the VoID file, it will see that the .DO_NOT_LIST file is present and will rename the newly created VoID file so that it also ends in .DO_NOT_LIST.

This works because $CSV2RDF4LOD_HOME/bin/cr-publish-void-to-endpoint.sh gathers metadata by looking for files matching */version/*/publish -name "*void.ttl", which does not match the renamed file.

To let the metadata flow, just move it back:

mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl
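
The regeneration step therefore behaves roughly like this (a simplified sketch, not the actual convert-aggregate.sh code):

```
# Simplified sketch of how the DO_NOT_LIST marker is honored when the VoID
# file is regenerated (illustrative; not the actual convert-aggregate.sh).
void="publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl"
if [ -e "$void.DO_NOT_LIST" ]; then
   mv "$void" "$void.DO_NOT_LIST"   # newly regenerated VoID file stays unlisted
fi
```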

CSV2RDF4LOD_CONVERT_DUMP_FILE_EXTENSIONS

When CSV2RDF4LOD_CONVERT_DUMP_FILE_EXTENSIONS is cr:auto, csv2rdf4lod-automation uses dump-file-extensions.sh to determine the correct value to pass to the converter, based on CSV2RDF4LOD_PUBLISH_COMPRESS, CSV2RDF4LOD_PUBLISH_RDFXML, and CSV2RDF4LOD_PUBLISH_NT.
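
A simplified sketch of that decision (illustrative only; not the actual dump-file-extensions.sh):

```
# Simplified sketch of choosing dump-file extensions from the publishing
# environment variables (illustrative; not the actual dump-file-extensions.sh).
extensions="ttl"
[ "$CSV2RDF4LOD_PUBLISH_NT"     = "true" ] && extensions="$extensions,nt"
[ "$CSV2RDF4LOD_PUBLISH_RDFXML" = "true" ] && extensions="$extensions,rdf"
if [ "$CSV2RDF4LOD_PUBLISH_COMPRESS" = "true" ]; then
   extensions="$(echo "$extensions" | sed 's/,/.gz,/g').gz"   # e.g. ttl.gz,nt.gz
fi
echo "$extensions"   # the kind of value handed to the converter (see -VoIDDumpExtensions below)
```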

The csv2rdf4lod Java converter accepts the following arguments that are related to file extensions:

  • -VoIDDumpExtensions / -vde gets its values from dump-file-extensions.sh.
    • The void:dataDump links in the VoID files do not respond to this parameter. More to look at here...
  • -outputExtension / -oe does not appear to be given to the converter - was it intended for the extension of the data dump?

Developer notes

What's next?

Use it!

  • Follow linked data
  • Grab a dump file off of the web
  • Query your SPARQL endpoint
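
For example, a quick check that a load worked, assuming a SPARQL endpoint at a hypothetical http://localhost:8890/sparql and the named-graph URI from the .nt.graph file:

```
# Count the triples in the named graph that was just loaded (endpoint URL is
# hypothetical; substitute your own).
graph=$(cat publish/SOURCEID-DATASETID-VERSIONID.nt.graph)
curl -G "http://localhost:8890/sparql" \
     -H "Accept: text/csv" \
     --data-urlencode "query=SELECT (COUNT(*) AS ?triples) WHERE { GRAPH <$graph> { ?s ?p ?o } }"
```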

Review:
