Finding Linksets among Linked Data Bubbles

What is first

CKAN - a walkthrough of how to add and annotate dataset entries (and the extra requirements to suit the lodcloud group).
One click data dump - easy access to the list of all URIs in a csv2rdf4lod node.
https://github.com/jimmccusker/twc-healthdata/wiki/Listing-twc-healthdata-as-a-LOD-Cloud-Bubble
The analysis described in this page can be enabled as part of csv2rdf4lod's Secondary Derivative Datasets framework.

What we will cover

This page describes how to calculate VoID Linksets between a csv2rdf4lod node and all other bubbles in the Linked Data Diagram, using csv2rdf4lod-automation's one-click data dump and lodcloud's "namespace" annotations. Calculating the Linksets makes it easier to see how a bubble connects to others, which in turn makes it easier to assert the CKAN lodcloud annotation required to get into the diagram.

Let's get to it!

To find links, we need two things:

A list of all RDF nodes in a bubble. We can get this rather easily by running csv2rdf4lod's one-click data dump through nt-nodes.sh.
The namespace for each Linked Data bubble, which is given with the "namespace" annotation in CKAN. For example,
- http://datahub.io/dataset/2000-us-census-rdf's namespace is http://www.rdfabout.com/rdf/usgov/geo/, and
- http://datahub.io/dataset/a-seobook-dataset's namespace is http://seobook.blog.com.

We can get a bubble's namespace by POSTing its URI to a deployed instance of lift-ckan.py (e.g. here), which provides a good RDF description of the contorted annotations in the CKAN data entry.

curl -H "Content-Type: text/turtle" \
  -d '<http://datahub.io/dataset/2000-us-census-rdf> a <http://purl.org/twc/vocab/datafaqs#CKANDataset> .' \
    http://aquarius.tw.rpi.edu/projects/datafaqs/services/sadi/ckan/lift-ckan

returns the following RDF triples (among others). The one we need is void:uriSpace.

<http://datahub.io/dataset/2000-us-census-rdf> a datafaqs:CKANDataset;
    ov:shortName "US Census (rdfabout)";
    dcterms:title "2000 U.S. Census in RDF (rdfabout.com)";
    void:sparqlEndpoint <http://www.rdfabout.com/sparql>;
    void:triples 1002848918;
    void:uriSpace "http://www.rdfabout.com/rdf/usgov/geo/" .
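
Pulling the void:uriSpace value out of a response like the one above can be sketched as follows. This is a rough illustration only: it uses a regular expression instead of an RDF parser, assumes the value is a plain double-quoted literal with no escapes, and the function name uri_space is hypothetical. A real Turtle parser would be more robust.

```python
import re

# Rough pattern for: void:uriSpace "..." (plain string literal, escapes not handled).
URI_SPACE = re.compile(r'void:uriSpace\s+"([^"]*)"')


def uri_space(turtle_text):
    """Return the first void:uriSpace literal found in the text, or None."""
    match = URI_SPACE.search(turtle_text)
    return match.group(1) if match else None


# Abbreviated copy of the response shown above.
response = '''
<http://datahub.io/dataset/2000-us-census-rdf> a datafaqs:CKANDataset;
    void:triples 1002848918;
    void:uriSpace "http://www.rdfabout.com/rdf/usgov/geo/" .
'''
print(uri_space(response))
```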

For example, http://datahub.io/dataset/twc-logd's namespace is http://logd.tw.rpi.edu/, and http://datahub.io/dataset/twc-healthdata references the URIs http://logd.tw.rpi.edu/id/medicare-gov/provider/340070 and http://logd.tw.rpi.edu/id/medicare-gov/provider/340071. Both URIs start with twc-logd's namespace, so they count as links from twc-healthdata to twc-logd.
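
The comparison of node URIs against bubble namespaces can be sketched as a simple prefix match. This is a minimal illustration, not the implementation inside cr-linksets.sh (which this page does not show); the function name find_links and the example.org URI are hypothetical.

```python
from collections import defaultdict


def find_links(node_uris, namespaces):
    """Group node URIs by the bubble whose namespace they start with.

    namespaces maps a bubble's CKAN URI to its namespace string.
    """
    links = defaultdict(set)
    for uri in node_uris:
        for bubble, namespace in namespaces.items():
            if uri.startswith(namespace):
                links[bubble].add(uri)
    return dict(links)


namespaces = {
    "http://datahub.io/dataset/twc-logd": "http://logd.tw.rpi.edu/",
    "http://datahub.io/dataset/2000-us-census-rdf": "http://www.rdfabout.com/rdf/usgov/geo/",
}
nodes = [
    "http://logd.tw.rpi.edu/id/medicare-gov/provider/340070",
    "http://logd.tw.rpi.edu/id/medicare-gov/provider/340071",
    "http://example.org/unrelated",  # hypothetical URI outside both namespaces
]
links = find_links(nodes, namespaces)
for bubble, uris in links.items():
    print(bubble, len(uris))
```

Bubbles with no matching URIs are simply absent from the result, which corresponds to an empty linkset.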

cr-linksets.sh creates a versioned dataset. Use find automatic -type f -size +0b -name linkset.txt to find the non-empty linksets.
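
For scripting the same check, the find command can be approximated in Python as below. This is a sketch under the assumption that the linksets live in files named linkset.txt somewhere beneath a root directory such as automatic; the function name nonempty_linksets is hypothetical.

```python
from pathlib import Path


def nonempty_linksets(root):
    """Return sorted paths of linkset.txt files under root that are not empty."""
    return sorted(
        path
        for path in Path(root).rglob("linkset.txt")
        if path.is_file() and path.stat().st_size > 0
    )
```

Calling nonempty_linksets("automatic") from the dataset's directory lists the same files as the find command.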

Modeling the Linkset

When 50 URIs in http://datahub.io/dataset/twc-healthdata fall within the namespace of http://datahub.io/dataset/2000-us-census-rdf, the Linkset is represented in VoID like this:

<http://datahub.io/dataset/twc-healthdata>
    void:subset :linkset_2000c93158fafa9776550172052af7dc .

:linkset_2000c93158fafa9776550172052af7dc 
     a void:Linkset, void:Dataset;
     void:target 
       <http://datahub.io/dataset/twc-healthdata>, 
       <http://datahub.io/dataset/2000-us-census-rdf>;
     void:triples 50; 
.

<http://www.rdfabout.com/rdf/usgov/geo/blah_1> void:inDataset :linkset_2000c93158fafa9776550172052af7dc .
<http://www.rdfabout.com/rdf/usgov/geo/blah_2> void:inDataset :linkset_2000c93158fafa9776550172052af7dc .
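
A description in the shape shown above could be generated from a list of matching URIs along these lines. This is a minimal sketch: the function name linkset_turtle is hypothetical, the prefix declarations for : and void: are assumed to be supplied elsewhere, and void:triples is set to the number of distinct linked URIs, following the convention of the example.

```python
def linkset_turtle(source, target, name, linked_uris):
    """Render a VoID Linkset description as Turtle text (prefixes not included)."""
    uris = sorted(set(linked_uris))  # drop duplicates, keep output stable
    lines = [
        f"<{source}>",
        f"    void:subset :linkset_{name} .",
        "",
        f":linkset_{name}",
        "    a void:Linkset, void:Dataset;",
        "    void:target",
        f"        <{source}>,",
        f"        <{target}>;",
        f"    void:triples {len(uris)} .",
        "",
    ]
    lines += [f"<{uri}> void:inDataset :linkset_{name} ." for uri in uris]
    return "\n".join(lines) + "\n"


turtle = linkset_turtle(
    "http://datahub.io/dataset/twc-healthdata",
    "http://datahub.io/dataset/2000-us-census-rdf",
    "2000c93158fafa9776550172052af7dc",
    [
        "http://www.rdfabout.com/rdf/usgov/geo/blah_1",
        "http://www.rdfabout.com/rdf/usgov/geo/blah_2",
    ],
)
print(turtle)
```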

We can name the Linkset by hashing the targets and current date. For example:

md5.sh -qs http://datahub.io/dataset/twc-healthdata`date +%s`http://datahub.io/dataset/2000-us-census-rdf
2000c93158fafa9776550172052af7dc
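
The same naming scheme can be sketched in Python. Note the assumptions: md5.sh's exact input handling (for example, whether it hashes a trailing newline) is not shown on this page, so the digest may differ from the script's output, and the function name linkset_name is hypothetical.

```python
import hashlib
import time


def linkset_name(source, target, epoch_seconds=None):
    """Hash source + epoch seconds + target into a 32-character hex name."""
    if epoch_seconds is None:
        epoch_seconds = int(time.time())  # same role as `date +%s`
    text = f"{source}{epoch_seconds}{target}"
    return hashlib.md5(text.encode("utf-8")).hexdigest()


name = linkset_name(
    "http://datahub.io/dataset/twc-healthdata",
    "http://datahub.io/dataset/2000-us-census-rdf",
)
print(f":linkset_{name}")
```

Because the timestamp is part of the hashed string, recalculating a Linkset at a later time yields a new name.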

Where this has been applied

Limitations of this approach

This approach is cheap to calculate because we do not need to find and retrieve the full data dump of each bubble, and we have less instance data to process. However, it misses connections between our bubble and another when both mention the same URIs but those URIs fall outside the other bubble's own namespace.

What is next?

The analysis described in this page can be enabled as part of csv2rdf4lod's Secondary Derivative Datasets framework.
How hard is it to get one click data dumps for bubbles that do not use csv2rdf4lod-automation?
What is the disparity between the manual assertion on the CKAN entry and what was actually found?
How can we model the Linkset calculation so that it naturally provides justification for the resulting CKAN annotation? (SIO-qualifying the void:triples triple and saying it prov:wasDerivedFrom the analysis that produced it. Tie into Jim's aggregation thesis?)
Some thoughts on How to characterize a list of RDF node URIs
CKAN lodcloud RDF vocabulary to use add-metadata.py to submit the Linksets to CKAN (done automatically with cr pingback).
Finding Vocabularies that Datasets Use
https://github.com/timrdf/vsr/wiki/Centrifuge - a new view on the lodcloud, instead of the bubble blob.
edu.rpi.tw.string.uri.NamingAuthorityMatrix implements the PLD citation graph.
