One click data dump

Tim L edited this page Feb 27, 2014 · 35 revisions

What is first

What we will cover

This page describes how csv2rdf4lod-automation produces the "one click data dump", how it is published, and how it is described in the VoID metadata.

The following pattern, built from two csv2rdf4lod variables, expands to the URL for the dump of unique URI nodes:

  • $CSV2RDF4LOD_BASE_URI/source/$CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID/file/cr-full-dump/version/latest/conversion/us-cr-full-dump-latest.ttl.gz
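As a rough illustration of how that pattern expands, the sketch below sets the two variables to example values taken from the healthdata.tw.rpi.edu URLs on this page. It assumes that the `us` prefix in the file name stands for the source identifier, which matches the published URL given further down but is not confirmed here; `dump_url` is a hypothetical name used only for this example.

```shell
# Example values only; a real csv2rdf4lod node sets these in its own environment.
CSV2RDF4LOD_BASE_URI='http://healthdata.tw.rpi.edu'
CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID='healthdata-tw-rpi-edu'

# Assumption: the file name is prefixed with the source identifier.
dump_url="${CSV2RDF4LOD_BASE_URI}/source/${CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID}/file/cr-full-dump/version/latest/conversion/${CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID}-cr-full-dump-latest.ttl.gz"

echo "$dump_url"
```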

Let's get to it!

cr-full-dump.sh gathers all versioned dataset dump files into a single gzipped N-Triples file that contains all RDF data in a [csv2rdf4lod node](csv2rdf4lod automation data root):

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ du -sh publish/purl-org-twc-health.nt.gz

52M	publish/purl-org-twc-health.nt.gz

It also extracts all URI subjects and objects from the full data dump, using uri-nodes.sh. The output file deliberately keeps duplicates so that we can investigate the popularity of different nodes.

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-node-occurrences.txt

<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
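To make the idea of node popularity concrete, here is a minimal sketch that pulls URI subjects and objects out of a tiny N-Triples sample and counts how often each occurs. It is not the implementation of uri-nodes.sh; the `example.org` triples and the variable names are invented for illustration, and the `awk` filter assumes one whitespace-separated triple per line.

```shell
# Hypothetical N-Triples sample standing in for the full dump.
dump=$(mktemp)
cat > "$dump" <<'EOF'
<http://example.org/a> <http://example.org/p> <http://example.org/b> .
<http://example.org/a> <http://example.org/p> "a literal" .
<http://example.org/b> <http://example.org/p> <http://example.org/a> .
EOF

# Keep the subject ($1) and object ($3) when they are URIs.
# Predicates, literals, and blank nodes are skipped.
occurrences=$(awk '$1 ~ /^<.*>$/ {print $1} $3 ~ /^<.*>$/ {print $3}' "$dump")

# Count occurrences per node, most frequent first.
popularity=$(echo "$occurrences" | sort | uniq -c | sort -rn | awk '{print $1, $2}')
echo "$popularity"

rm -f "$dump"
```

On a real node the same counting pipeline could be pointed at automatic/purl-org-twc-health-uri-node-occurrences.txt instead of the sample.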

It also distills the occurrences into a list of unique URI nodes. The list also happens to be sorted.

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-nodes.txt

<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=>
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=>
<di:sha-256;0cHKWqDWnClj8WXl2Osl8Zh1VnfhcCdAyU5bHXgZ7Tg=>
<di:sha-256;0gSL5RbDvMTnayr9nZqQFPxDfwazT9O1ougjGk3LYVs=>
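One simple way to get a de-duplicated list that also comes out sorted is `sort -u`. The sketch below shows this on an invented occurrences file; it is an assumption about a workable approach, not a description of what cr-full-dump.sh actually runs.

```shell
# Hypothetical occurrences file with one repeated node.
occurrences=$(mktemp)
printf '%s\n' \
  '<http://example.org/b>' \
  '<http://example.org/a>' \
  '<http://example.org/b>' > "$occurrences"

# sort -u removes duplicates and sorts in one pass;
# LC_ALL=C makes the byte-wise ordering reproducible across machines.
unique_nodes=$(LC_ALL=C sort -u "$occurrences")
echo "$unique_nodes"

rm -f "$occurrences"
```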

The previous list of unique RDF URI nodes can be used to create a simple RDF file to relate all of the nodes to a special dataset:

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-nodes.ttl

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=> a rdfs:Resource .
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=> a rdfs:Resource .
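A file of this shape can be produced from the unique node list with very little machinery. The following sketch writes the prefix declaration and then appends ` a rdfs:Resource .` to every node; the input nodes and temporary file names are hypothetical, and the real script may build the file differently.

```shell
# Hypothetical list of unique URI nodes, one per line, already in <...> form.
nodes=$(mktemp)
printf '%s\n' '<http://example.org/a>' '<http://example.org/b>' > "$nodes"

ttl=$(mktemp)
{
  echo '@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .'
  echo
  # Turn each node into a one-line Turtle statement.
  sed 's/$/ a rdfs:Resource ./' "$nodes"
} > "$ttl"

cat "$ttl"
```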

We can reuse aggregate-source-rdf.sh to package the resource list into the conventional filenames, publish as dump files on the web, and publish to the triple store:

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ aggregate-source-rdf.sh automatic/purl-org-twc-health-uri-nodes.ttl

publish/purl-org-twc-health.nt.gz

publish/healthdata-tw-rpi-edu-cr-full-dump-latest.ttl
publish/healthdata-tw-rpi-edu-cr-full-dump-latest.nt
publish/healthdata-tw-rpi-edu-cr-full-dump-latest.sd_name
publish/healthdata-tw-rpi-edu-cr-full-dump-latest.void.ttl

publish/bin
publish/bin/ln-to-www-root-healthdata-tw-rpi-edu-cr-full-dump-latest.sh
publish/bin/virtuoso-load-healthdata-tw-rpi-edu-cr-full-dump-latest.sh
publish/bin/virtuoso-delete-healthdata-tw-rpi-edu-cr-full-dump-latest.sh

publish/healthdata-tw-rpi-edu-cr-full-dump-latest.void.ttl (listed above) is published as healthdata-tw-rpi-edu-cr-full-dump-latest.void.ttl, so the dump file for http://healthdata.tw.rpi.edu/void would be purl-org-twc-health.nt.gz.

http://healthdata.tw.rpi.edu/source/healthdata-tw-rpi-edu/file/cr-full-dump/version/latest/conversion/healthdata-tw-rpi-edu-cr-full-dump-latest.ttl.gz contains the minimal RDF file that lists all RDF URI nodes in Turtle.

Ignore - this needs to be edited

Produced by aggregate-source-rdf.sh:

<http://purl.org/twc/health/void>
   void:subset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest> .

<http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest>
   a void:Dataset;
   void:dataDump <SOME_FILE>
.

The URL of the "one-click data download" can (will) be found in the VoID description of the csv2rdf4lod node (e.g. http://healthdata.tw.rpi.edu/void.ttl). (TODO: the file is created from cron, but it isn't published and isn't mentioned in the VoID file yet.)

The cr-full-dump dataset contains mostly void:inDataset links from every resource in the csv2rdf4lod node to itself. It also includes a void:dataDump from the top-level dataset to the dump file that we created (it's just reusing the dump file from cr-full-dump).

<cowboy>        void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
dbpedia:Montana void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
...
<http://purl.org/twc/health/void>
   void:dataDump <SOME_FILE>
.

Providing example resource metadata

http://validator.lod-cloud.net/validate.php requires an example resource, which is asserted for the top-level /void at https://github.com/timrdf/csv2rdf4lod-automation/blob/master/bin/util/cr-full-dump.sh#L251

What is next
