One click data dump
- Vocabulary of Interlinked Datasets (VoID) is an RDF vocabulary to describe sets of RDF graphs -- including where to download them.
- We use the Aggregating subsets of converted datasets pattern to produce the "one click data dump" within csv2rdf4lod-automation.
This page describes how csv2rdf4lod-automation produces the "one click data dump", how it is published, and how it is described in the VoID metadata.
cr-full-dump.sh gathers all versioned dataset dump files into a single gzipped N-Triples file that contains all RDF data in a [csv2rdf4lod node](csv2rdf4lod automation data root):
source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ du -sh publish/purl-org-twc-health.nt.gz
52M publish/purl-org-twc-health.nt.gz
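The gathering step can be pictured with a minimal sketch. It is not taken from cr-full-dump.sh: the function name `aggregate_dumps` is hypothetical, and it assumes every input dump is already N-Triples, either plain or gzipped. N-Triples is line-oriented, so concatenating files yields a valid file.

```shell
# Hypothetical helper, not the actual logic of cr-full-dump.sh.
# Assumes every input is N-Triples, either plain (.nt) or gzipped (.nt.gz).
# Usage: aggregate_dumps OUTPUT.nt.gz INPUT [INPUT ...]
aggregate_dumps() {
  local out="$1"
  shift
  local f
  for f in "$@"; do
    case "$f" in
      *.gz) gzip -dc "$f" ;;  # decompress dumps that are already gzipped
      *)    cat "$f" ;;       # plain N-Triples passes through unchanged
    esac
  done | gzip -c > "$out"     # one gzipped stream holding every triple
}
```

Because the inputs are streamed through a single pipe, the combined dump never has to exist uncompressed on disk.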
It also uses uri-nodes.sh to extract every URI subject and object from the full data dump. The output file keeps duplicates so that we can investigate how often each node occurs.
source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-node-occurrences.txt
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
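As a rough illustration of this kind of extraction (a stand-in, not the contents of uri-nodes.sh), the sketch below reads N-Triples on standard input and prints each URI subject and URI object once per occurrence. The function name `uri_node_occurrences` is hypothetical, and the approach assumes well-formed N-Triples with one triple per line.

```shell
# Hypothetical stand-in for uri-nodes.sh; reads N-Triples on stdin.
# Prints every URI subject and URI object, duplicates included.
uri_node_occurrences() {
  awk '
    # The subject is always the first whitespace-separated field.
    # Blank-node subjects start with "_:" and are skipped.
    $1 ~ /^<.*>$/ { print $1 }
    # The object starts at the third field. Literals start with a
    # double quote, so only objects of the form <...> are kept.
    $3 ~ /^<.*>$/ { print $3 }
  '
}

# Example (not run here):
#   gzip -dc publish/purl-org-twc-health.nt.gz | uri_node_occurrences
```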
It also distills the occurrences into a list of unique URI nodes. These also happen to be sorted.
source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-nodes.txt
<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=>
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=>
<di:sha-256;0cHKWqDWnClj8WXl2Osl8Zh1VnfhcCdAyU5bHXgZ7Tg=>
<di:sha-256;0gSL5RbDvMTnayr9nZqQFPxDfwazT9O1ougjGk3LYVs=>
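Reducing an occurrences list to unique nodes, and counting occurrences to gauge popularity, can both be done with standard tools. The sketch below assumes the occurrences list holds one node per line; the function names are hypothetical, and whether the automation itself uses `sort -u` is an assumption.

```shell
# Hypothetical helpers; both read an occurrences list on stdin.

# One line per distinct node; sort -u also leaves the list sorted.
unique_uri_nodes() {
  LC_ALL=C sort -u
}

# Occurrence count per node, most frequent first.
uri_node_popularity() {
  LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn
}

# Example (not run here):
#   uri_node_popularity < automatic/purl-org-twc-health-uri-node-occurrences.txt | head
```

Pinning `LC_ALL=C` keeps the ordering byte-wise and therefore reproducible across machines with different locale settings.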
The unique URI nodes are also published as Turtle, with each node typed as an rdfs:Resource.
source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 publish/purl-org-twc-health-uri-nodes.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=> a rdfs:Resource .
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=> a rdfs:Resource .
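A Turtle file of this shape can be derived from a list of unique URI nodes in one pass. The following is only a sketch under that assumption; `uri_nodes_to_turtle` is a hypothetical name, and the input is taken to be one `<uri>` per line.

```shell
# Hypothetical helper; reads "<uri>" lines on stdin, writes Turtle.
uri_nodes_to_turtle() {
  # Declare the prefix used by the typing triples below.
  printf '@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n'
  # Skip blank lines and emit one typing triple per URI node.
  awk 'NF { print $1 " a rdfs:Resource ." }'
}

# Example (not run here):
#   uri_nodes_to_turtle < automatic/purl-org-twc-health-uri-nodes.txt
```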
Produced by aggregate-source-rdf.sh:
<http://purl.org/twc/health/void>
   void:subset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest> .
<http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest>
   a void:Dataset;
   void:dataDump <SOME_FILE>
.
The URL of the "one-click data download" will be found in the VoID description of the csv2rdf4lod node (e.g. http://healthdata.tw.rpi.edu/void.ttl). (TODO: the file is created by cron, but it is not yet published and is not yet mentioned in the VoID file.)
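Once the dump is mentioned in the VoID description, a consumer could locate it by looking for void:dataDump triples. The sketch below assumes the VoID description has first been converted from Turtle to N-Triples with whatever RDF tool is at hand, since line-based matching on Turtle is unreliable. The function name `void_data_dumps` is hypothetical.

```shell
# Hypothetical helper; reads an N-Triples VoID description on stdin and
# prints the URL of every void:dataDump object, one per line.
void_data_dumps() {
  awk '
    $2 == "<http://rdfs.org/ns/void#dataDump>" && $3 ~ /^<.*>$/ {
      # Strip the surrounding angle brackets from the object.
      print substr($3, 2, length($3) - 2)
    }
  '
}

# Example (not run here), with void.nt being the converted description:
#   void_data_dumps < void.nt
```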
The cr-full-dump dataset consists mostly of void:inDataset links from every resource in the csv2rdf4lod node to the cr-full-dump dataset itself. It also includes a void:dataDump from the top-level dataset to the dump file that we created (it simply reuses the dump file from cr-full-dump).
<cowboy> void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
dbpedia:Montana void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
...
<http://purl.org/twc/health/void>
   void:dataDump <SOME_FILE>
.
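The void:inDataset links lend themselves to mechanical generation from a list of unique URI nodes. This is a sketch only, written as N-Triples rather than the prefixed Turtle shown above; `in_dataset_links` is a hypothetical name, and the dataset URI is passed in by the caller rather than discovered.

```shell
# Hypothetical helper; reads "<uri>" lines on stdin and writes one
# void:inDataset triple per node, pointing at the given dataset URI.
# Usage: in_dataset_links DATASET_URI < uri-nodes.txt
in_dataset_links() {
  awk -v dataset="$1" '
    NF { print $1 " <http://rdfs.org/ns/void#inDataset> <" dataset "> ." }
  '
}

# Example (not run here):
#   in_dataset_links \
#     http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump \
#     < automatic/purl-org-twc-health-uri-nodes.txt
```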