One click data dump
- Vocabulary of Interlinked Datasets (VoID) is an RDF vocabulary to describe sets of RDF graphs -- including where to download them.
- We use the Aggregating subsets of converted datasets pattern to produce the "one click data dump" within csv2rdf4lod-automation.
- Used by pr whois domain.
This page describes how csv2rdf4lod-automation produces the "one click data dump", how it is published, and how it is described in the VoID metadata.
The following template, once its environment variables are expanded, gives the URL for the dump of unique URI nodes:
$CSV2RDF4LOD_BASE_URI/source/$CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID/file/cr-full-dump/version/latest/conversion/us-cr-full-dump-latest.ttl.gz
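As a minimal sketch of that expansion, the snippet below assigns the two variables the values that appear in the healthdata.tw.rpi.edu example URL further down this page and prints the result. The file name here uses the source identifier as its prefix, which is an assumption made to match that example URL; your own node's values will differ.

```shell
# Example values taken from the healthdata.tw.rpi.edu URL shown later on this page.
CSV2RDF4LOD_BASE_URI="http://healthdata.tw.rpi.edu"
CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID="healthdata-tw-rpi-edu"

# Braces keep the shell from reading past the end of each variable name.
dump_url="${CSV2RDF4LOD_BASE_URI}/source/${CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID}/file/cr-full-dump/version/latest/conversion/${CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID}-cr-full-dump-latest.ttl.gz"
echo "$dump_url"
```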
cr-full-dump.sh gathers all versioned dataset dump files into a single gzipped N-Triples file that contains all RDF data in a [csv2rdf4lod node](csv2rdf4lod automation data root):
source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ du -sh publish/purl-org-twc-health.nt.gz
52M publish/purl-org-twc-health.nt.gz
It also extracts all URI subjects and objects from the full data dump using uri-nodes.sh. The output file keeps duplicates so that we can investigate the popularity of different nodes.
source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-node-occurrences.txt
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject/version/2012-Dec-20>
<http://purl.org/twc/health/source/bioontology-org/dataset/annotator-description-subject>
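The real uri-nodes.sh may work differently; the following is only a rough sketch of the same idea. It assumes one triple per line with whitespace-separated terms and a space before the closing dot. The function name `uri_node_occurrences` is hypothetical.

```shell
# Reads N-Triples on stdin and prints every URI subject and URI object,
# one per line, keeping duplicates. Predicates are ignored.
uri_node_occurrences() {
  awk '
    NF == 0 || $1 ~ /^#/ { next }        # skip blank lines and comments
    {
      if ($1 ~ /^<.*>$/) print $1        # subject is a URI, not a blank node
      if ($3 ~ /^<.*>$/) print $3        # object is a URI, not a literal or blank node
    }'
}

# Possible usage with the files named on this page:
#   gzip -dc publish/purl-org-twc-health.nt.gz | uri_node_occurrences \
#     > automatic/purl-org-twc-health-uri-node-occurrences.txt
```

Literals always start with a double quote in N-Triples, so the third field of a literal-valued triple never matches the URI pattern even when the literal itself contains angle brackets.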
It then distills the occurrences into a list of unique URI nodes, which also happens to be sorted.
source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-nodes.txt
<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=>
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=>
<di:sha-256;0cHKWqDWnClj8WXl2Osl8Zh1VnfhcCdAyU5bHXgZ7Tg=>
<di:sha-256;0gSL5RbDvMTnayr9nZqQFPxDfwazT9O1ougjGk3LYVs=>
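One plausible way to get from the occurrence list to the unique list, and to put the retained duplicates to use, is sketched below. Both function names are hypothetical, and using `sort -u` is an assumption rather than a description of what cr-full-dump.sh actually runs.

```shell
# Collapse the occurrence list to one line per distinct URI node, sorted bytewise.
unique_uri_nodes() {
  LC_ALL=C sort -u
}

# Count how often each node occurs, most frequent first.
uri_node_popularity() {
  LC_ALL=C sort | uniq -c | sort -rn
}

# Possible usage with the files named on this page:
#   unique_uri_nodes < automatic/purl-org-twc-health-uri-node-occurrences.txt \
#     > automatic/purl-org-twc-health-uri-nodes.txt
#   uri_node_popularity < automatic/purl-org-twc-health-uri-node-occurrences.txt | head
```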
The previous list of unique RDF URI nodes can be used to create a simple RDF file to relate all of the nodes to a special dataset:
source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ head -4 automatic/purl-org-twc-health-uri-nodes.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<di:sha-256;09Szn5b5AWS6seKOEtGI4AI44oDMEO9AMYwlA3AdgeY=> a rdfs:Resource .
<di:sha-256;0bep7TlAaIUzGzQe6gN-5Mz0MLwUOi-U6GZAbePGY5s=> a rdfs:Resource .
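A file of this shape can be generated from the unique node list with a few lines of shell. This is a sketch under the assumption that each input line is already a bracketed URI; the function name `uri_nodes_to_turtle` is hypothetical.

```shell
# Reads bracketed URIs on stdin (one per line) and writes a Turtle file
# that types each one as rdfs:Resource.
uri_nodes_to_turtle() {
  printf '%s\n' '@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .'
  awk 'NF { print $0 " a rdfs:Resource ." }'   # NF skips empty input lines
}

# Possible usage with the files named on this page:
#   uri_nodes_to_turtle < automatic/purl-org-twc-health-uri-nodes.txt \
#     > automatic/purl-org-twc-health-uri-nodes.ttl
```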
We can reuse aggregate-source-rdf.sh to package the resource list into the conventional filenames, publish as dump files on the web, and publish to the triple store:
source/healthdata-tw-rpi-edu/cr-full-dump/version/latest$ aggregate-source-rdf.sh automatic/purl-org-twc-health-uri-nodes.ttl
publish/purl-org-twc-health.nt.gz
publish/healthdata-tw-rpi-edu-cr-full-dump-latest.ttl
publish/healthdata-tw-rpi-edu-cr-full-dump-latest.nt
publish/healthdata-tw-rpi-edu-cr-full-dump-latest.sd_name
publish/healthdata-tw-rpi-edu-cr-full-dump-latest.void.ttl
publish/bin
publish/bin/ln-to-www-root-healthdata-tw-rpi-edu-cr-full-dump-latest.sh
publish/bin/virtuoso-load-healthdata-tw-rpi-edu-cr-full-dump-latest.sh
publish/bin/virtuoso-delete-healthdata-tw-rpi-edu-cr-full-dump-latest.sh
The file publish/healthdata-tw-rpi-edu-cr-full-dump-latest.void.ttl listed above is published as healthdata-tw-rpi-edu-cr-full-dump-latest.void.ttl, so the dump file of http://healthdata.tw.rpi.edu/void would be purl-org-twc-health.nt.gz.
http://healthdata.tw.rpi.edu/source/healthdata-tw-rpi-edu/file/cr-full-dump/version/latest/conversion/healthdata-tw-rpi-edu-cr-full-dump-latest.ttl.gz contains the minimal RDF file that lists all RDF URI nodes in Turtle.
Produced by aggregate-source-rdf.sh:
<http://purl.org/twc/health/void>
void:subset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest> .
<http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest>
a void:Dataset;
void:dataDump <SOME_FILE>
.
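To make the shape of that description concrete, the sketch below prints the same two statements for arbitrary URIs. It is not how aggregate-source-rdf.sh produces them; the function name `void_dump_description` and its three positional parameters are hypothetical, and the dump URL in the usage comment is a placeholder because the page leaves it as SOME_FILE.

```shell
# $1: top-level VoID dataset URI
# $2: cr-full-dump dataset URI
# $3: URL of the dump file
void_dump_description() {
  cat <<EOF
@prefix void: <http://rdfs.org/ns/void#> .

<$1>
   void:subset <$2> .

<$2>
   a void:Dataset;
   void:dataDump <$3> .
EOF
}

# Possible usage (the third argument is a placeholder, not a real dump URL):
#   void_dump_description \
#     http://purl.org/twc/health/void \
#     http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump/version/latest \
#     http://example.org/dump.nt.gz
```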
The URL of the "one-click data download" will be found in the VoID description of the csv2rdf4lod node (e.g. http://healthdata.tw.rpi.edu/void.ttl). (TODO: the file is created by cron, but it is not yet published or mentioned in the VoID file.)
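Once the dump is mentioned in a VoID file, a consumer could pull the download URLs out of it with a line-oriented filter like the sketch below. It assumes each `void:dataDump` predicate and its bracketed object sit on the same line and that the `void:` prefix is used; a real Turtle parser would be more robust. The function name `void_data_dumps` is hypothetical.

```shell
# Reads Turtle on stdin and prints the object URL of every void:dataDump
# statement whose predicate and object appear on one line.
void_data_dumps() {
  grep -o 'void:dataDump[[:space:]]*<[^>]*>' | sed 's/^[^<]*<//; s/>$//'
}

# Possible usage against the node's VoID description
# (prints nothing if no void:dataDump is present):
#   curl -sL http://healthdata.tw.rpi.edu/void.ttl | void_data_dumps
```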
The cr-full-dump dataset contains mostly void:inDataset links from every resource in the csv2rdf4lod node to itself. It also includes a void:dataDump from the top-level dataset to the dump file that we created (it's just reusing the dump file from cr-full-dump).
<cowboy> void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
dbpedia:Montana void:inDataset <http://purl.org/twc/health/source/healthdata-tw-rpi-edu/dataset/cr-full-dump> .
...
<http://purl.org/twc/health/void>
void:dataDump <SOME_FILE>
.
http://validator.lod-cloud.net/validate.php requires an example resource, which is asserted for the top-level /void at https://github.com/timrdf/csv2rdf4lod-automation/blob/master/bin/util/cr-full-dump.sh#L251