One click data dump

What is first

Vocabulary of Interlinked Datasets (VoID) is an RDF vocabulary to describe sets of RDF graphs -- including where to download them.
We use the Aggregating subsets of converted datasets pattern to produce the "one click data dump" within csv2rdf4lod-automation.

What we will cover

This page describes how csv2rdf4lod-automation produces the "one click data dump", how it is published, and how it is described in the VoID metadata.

Let's get to it!

cr-full-dump.sh gathers all versioned dataset dump files into a single gzipped N-Triples file that contains all of the RDF data in a [csv2rdf4lod node](csv2rdf4lod automation data root):

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ du -sh publish/purl-org-twc-health.nt.gz

52M	publish/purl-org-twc-health.nt.gz
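The aggregation step can be pictured with a minimal sketch. This is not the actual cr-full-dump.sh; it assumes that every versioned dataset has already been serialized as a plain `.nt` file somewhere below a root directory, and the helper name `full_dump` and the demo paths are hypothetical.

```shell
# Sketch only, NOT the real cr-full-dump.sh.
# Assumption: each versioned dataset already has an N-Triples (*.nt) dump file.

# Hypothetical helper: concatenate every *.nt file below $1 into the gzipped file $2.
full_dump() {
  # N-Triples is line-based, so concatenating files yields valid N-Triples.
  find "$1" -type f -name '*.nt' -exec cat {} + | gzip -c > "$2"
}

# Demo on throwaway data (paths are made up for illustration).
demo=$(mktemp -d)
mkdir -p "$demo/source/a/version/1/publish" "$demo/source/b/version/1/publish"
echo '<http://example.org/s1> <http://example.org/p> <http://example.org/o1> .' \
  > "$demo/source/a/version/1/publish/a.nt"
echo '<http://example.org/s2> <http://example.org/p> "a literal" .' \
  > "$demo/source/b/version/1/publish/b.nt"

full_dump "$demo/source" "$demo/full.nt.gz"

# Count the triples in the combined dump.
triple_count=$(gzip -dc "$demo/full.nt.gz" | wc -l | tr -d ' ')
echo "$triple_count"
```

Because the pattern `*.nt` does not match `*.nt.gz`, the output file can live under the same root without being swallowed by a later run.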

cr-full-dump.sh also extracts all URI subjects and objects from the full data dump, using uri-nodes.sh. The output file keeps duplicates so that we can count how often each node occurs.

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-node-occurrences.txt

<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
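As a rough illustration of this kind of extraction, the sketch below pulls URI subjects and objects out of N-Triples with awk. It is a stand-in, not the real uri-nodes.sh; the function name `uri_node_occurrences` is hypothetical, and it assumes one triple per line with whitespace between the terms and before the final dot.

```shell
# Sketch only, NOT the real uri-nodes.sh.
# Reads N-Triples on stdin and prints every URI subject and URI object,
# duplicates included. Assumes whitespace-separated terms, one triple per line.
uri_node_occurrences() {
  awk '
    /^[[:space:]]*#/ { next }      # skip comment lines
    $1 ~ /^<.*>$/ { print $1 }     # subject is a URI (not a blank node)
    $3 ~ /^<.*>$/ { print $3 }     # object is a URI (not a literal or blank node)
  '
}

# Demo: a URI object, a literal object, and a blank-node subject.
occurrences=$(printf '%s\n' \
  '<http://example.org/a> <http://example.org/p> <http://example.org/b> .' \
  '<http://example.org/a> <http://example.org/q> "a literal with <angle> brackets" .' \
  '_:b0 <http://example.org/p> <http://example.org/a> .' \
  | uri_node_occurrences)
printf '%s\n' "$occurrences"
```

Subjects and predicates never contain whitespace in N-Triples, so the third field is always the start of the object. A literal starts with a double quote and a blank node with `_:`, so neither matches the URI pattern.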

It also distills the occurrences into a list of unique URI nodes. These also happen to be sorted.

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-nodes.txt

<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=>
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=>
<di:sha-256;0cHKWqDWnClj8WXl2Osl8Zh1VnfhcCdAyU5bHXgZ7Tg=>
<di:sha-256;0gSL5RbDvMTnayr9nZqQFPxDfwazT9O1ougjGk3LYVs=>
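Both the unique list and a popularity ranking can be derived from an occurrences file with standard tools. The sketch below is one way to do it, not necessarily how the automation does; the function names are hypothetical and the input is made-up data.

```shell
# Sketch only. Function names are hypothetical.

# Unique URI nodes; sort -u also leaves them sorted (byte order under LC_ALL=C).
unique_uri_nodes() {
  LC_ALL=C sort -u
}

# Occurrence count per node, most frequent first.
uri_node_popularity() {
  LC_ALL=C sort | uniq -c | sort -rn
}

# Demo on made-up occurrences.
occurrences='<http://example.org/b>
<http://example.org/a>
<http://example.org/a>
<di:sha-256;example>
<http://example.org/a>'

unique_nodes=$(printf '%s\n' "$occurrences" | unique_uri_nodes)
popularity=$(printf '%s\n' "$occurrences" | uri_node_popularity)

printf '%s\n' "$unique_nodes"
printf '%s\n' "$popularity"
```

Deduplicating with `sort -u` is what makes the unique list come out sorted, which would explain why `di:` URIs appear before `http:` ones in the listing above.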

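The unique URI nodes are also written to a Turtle file in publish/, with each node typed as rdfs:Resource:

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 publish/purl-org-twc-health-uri-nodes.ttl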

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=> a rdfs:Resource .
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=> a rdfs:Resource .
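A Turtle file of this shape can be produced from the unique node list with a single sed expression. This is a sketch under the assumption that the list holds one bracketed URI per line; the function name `uri_nodes_to_turtle` is hypothetical and this is not claimed to be the script's own method.

```shell
# Sketch only. Turns a list of <uri> lines (stdin) into Turtle that types
# each node as rdfs:Resource. The function name is hypothetical.
uri_nodes_to_turtle() {
  echo '@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .'
  echo
  # Drop empty lines, then append the type statement to each URI.
  sed '/^$/d; s/$/ a rdfs:Resource ./'
}

# Demo on made-up URIs.
turtle=$(printf '%s\n' '<http://example.org/a>' '<http://example.org/b>' \
  | uri_nodes_to_turtle)
printf '%s\n' "$turtle"
```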

aggregate-source-rdf.sh produces the following VoID triples:

<http://purl.org/twc/health/void>
   void:subset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest> .

<http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest>
   a void:Dataset;
   void:dataDump <SOME_FILE>
.

The URL of the "one-click data download" will be listed in the VoID description of the csv2rdf4lod node (e.g. http://healthdata.tw.rpi.edu/void.ttl). (TODO: the dump file is created from cron, but it is not yet published or mentioned in the VoID file.)

The cr-full-dump dataset consists mostly of void:inDataset links from every resource in the csv2rdf4lod node to the cr-full-dump dataset. It also includes a void:dataDump link from the top-level dataset to the dump file that we created (it simply reuses the dump file from cr-full-dump).

<cowboy>        void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
dbpedia:Montana void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
...
<http://purl.org/twc/health/void>
   void:dataDump <SOME_FILE>
.
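The void:inDataset links above could be generated from the unique URI node list along these lines. The sketch emits N-Triples with the full VoID property URI rather than the prefixed Turtle shown above; the function name `in_dataset_triples` and the example dataset URI are hypothetical.

```shell
# Sketch only. For each <uri> line on stdin, emit an N-Triples statement
# linking it to the dataset URI given as $1 via void:inDataset.
# The function name is hypothetical.
in_dataset_triples() {
  awk -v ds="$1" 'NF { print $0 " <http://rdfs.org/ns/void#inDataset> <" ds "> ." }'
}

# Demo with a made-up dataset URI.
links=$(printf '%s\n' '<http://example.org/a>' '<http://example.org/b>' \
  | in_dataset_triples 'http://example.org/dataset/cr-full-dump')
printf '%s\n' "$links"
```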

What is next

twc-healthdata example
