
One click data dump

timrdf edited this page Jan 8, 2013 · 35 revisions

What is first

What we will cover

This page describes how csv2rdf4lod-automation produces the "one click data dump", how it is published, and how it is described in the VoID metadata.

Let's get to it!

cr-full-dump.sh gathers all versioned dataset dump files into a single gzipped N-Triples file that contains all of the RDF data in a [csv2rdf4lod node](csv2rdf4lod automation data root):

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ du -sh publish/purl-org-twc-health.nt.gz

52M	publish/purl-org-twc-health.nt.gz
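The gathering step can be sketched in a few lines. This is not the actual cr-full-dump.sh; it is a minimal illustration that assumes every input dump is already a gzipped N-Triples file, and the function name `concat_nt_gz` is hypothetical. It relies on the fact that concatenated gzip files form a valid gzip file, and that N-Triples is line-based, so no parsing is needed to merge dumps.

```shell
#!/bin/bash
# Hypothetical helper, not part of csv2rdf4lod-automation.
# Assumption: every input is already gzipped N-Triples (*.nt.gz).
# Concatenated gzip members decompress as one stream, so a plain `cat`
# yields a valid gzipped N-Triples file containing all input triples.
concat_nt_gz() {
   local out="$1"   # path of the combined dump to write
   shift            # the remaining arguments are the input dump files
   cat "$@" > "$out"
}
```

For example, `concat_nt_gz publish/full.nt.gz a.nt.gz b.nt.gz` (file names are placeholders). The output file must not be one of the inputs.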

It also extracts every URI subject and object from the full data dump, using uri-nodes.sh. The output file keeps duplicates so that we can investigate the popularity of different nodes.

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-node-occurrences.txt

<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
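To make the extraction and the popularity idea concrete, here is a rough sketch. It is not the real uri-nodes.sh; both function names are hypothetical, and it assumes well-formed N-Triples with one triple per line. Under that assumption the first whitespace-separated token is the subject and the third token is the start of the object, which is a URI only when it is wrapped in angle brackets (literals start with a quote, blank nodes with `_:`).

```shell
#!/bin/bash
# Hypothetical helpers, not part of csv2rdf4lod-automation.
# Assumption: the dump is well-formed N-Triples, one triple per line.

# Print every URI subject and URI object, duplicates included.
uri_node_occurrences() {
   local dump="$1"   # a gzipped N-Triples file
   gunzip -c "$dump" | awk '
      $1 ~ /^<.*>$/ { print $1 }   # subject is a URI (skips blank nodes)
      $3 ~ /^<.*>$/ { print $3 }   # object is a URI (skips literals and blank nodes)
   '
}

# Rank nodes by how often they occur; output is "count<TAB>node".
top_uri_nodes() {
   local occurrences="$1"   # one <URI> per line, duplicates allowed
   local limit="${2:-10}"   # number of most frequent nodes to print
   LC_ALL=C sort "$occurrences" | uniq -c | sort -k1,1nr |
      head -n "$limit" | awk '{ print $1 "\t" $2 }'
}
```

Running `uri_node_occurrences` on the dump and then `top_uri_nodes` on its output gives a quick view of which nodes are referenced most often.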

It also distills the occurrences into a list of unique URI nodes. These also happen to be sorted.

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-nodes.txt

<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=>
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=>
<di:sha-256;0cHKWqDWnClj8WXl2Osl8Zh1VnfhcCdAyU5bHXgZ7Tg=>
<di:sha-256;0gSL5RbDvMTnayr9nZqQFPxDfwazT9O1ougjGk3LYVs=>

source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 publish/purl-org-twc-health-uri-nodes.ttl

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=> a rdfs:Resource .
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=> a rdfs:Resource .
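The two derived files shown above could be produced along the following lines. This is only a sketch under the assumption that the occurrences file holds one bracketed URI per line; the function names are hypothetical and the real scripts may work differently.

```shell
#!/bin/bash
# Hypothetical helpers, not part of csv2rdf4lod-automation.
# Assumption: input files contain one <URI> per line.

# Collapse the occurrences into a sorted list of unique nodes.
unique_uri_nodes() {
   LC_ALL=C sort -u "$1"
}

# Type every node in the list as rdfs:Resource, in Turtle.
uri_nodes_to_ttl() {
   echo '@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .'
   echo
   # Drop empty lines, then append the type assertion to each node.
   sed -e '/^$/d' -e 's/$/ a rdfs:Resource ./' "$1"
}
```

`sort -u` explains why the unique list comes out sorted: sorting is how the duplicates are found.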

Produced by aggregate-source-rdf.sh:

<http://purl.org/twc/health/void>
   void:subset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest> .

<http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest>
   a void:Dataset;
   void:dataDump <SOME_FILE>
.

The URL of the "one-click data download" can (will) be found in the VoID description of the csv2rdf4lod node (e.g. http://healthdata.tw.rpi.edu/void.ttl). (TODO: the file is created by cron, but it is not yet published and is not yet mentioned in the VoID file.)

The cr-full-dump dataset consists mostly of void:inDataset links from every resource in the csv2rdf4lod node to the cr-full-dump dataset itself. It also includes a void:dataDump link from the top-level dataset to the dump file that we created (it simply reuses the dump file from cr-full-dump).

<cowboy>        void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
dbpedia:Montana void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
...
<http://purl.org/twc/health/void>
   void:dataDump <SOME_FILE>
.

What is next
