
Finding Linksets among Linked Data Bubbles

Tim L edited this page Apr 21, 2014 · 18 revisions

What is first

What we will cover

This page describes how to calculate VoID Linksets between a csv2rdf4lod node and all other bubbles in the Linked Data Diagram, using csv2rdf4lod-automations' one-click data dump and lodcloud's "namespace" annotations. Calculating the Linksets makes it easier to find out how a bubble is connected to others, which also makes it easier to assert the CKAN lodcloud annotation required to get into the diagram.

Let's get to it!

To find links, we need two things: the URIs that our bubble references, and the namespace of every other bubble in the diagram.

We can get a bubble's namespace by POSTing its URI to a deployed instance of lift-ckan.py, which provides a good RDF description of the contorted annotations in the CKAN data entry. For example, this request:

curl -H "Content-Type: text/turtle" \
  -d '<http://datahub.io/dataset/2000-us-census-rdf> a <http://purl.org/twc/vocab/datafaqs#CKANDataset> .' \
    http://aquarius.tw.rpi.edu/projects/datafaqs/services/sadi/ckan/lift-ckan

returns the following RDF triples (among others). The one we need is void:uriSpace.

<http://datahub.io/dataset/2000-us-census-rdf> a datafaqs:CKANDataset;
    ov:shortName "US Census (rdfabout)";
    dcterms:title "2000 U.S. Census in RDF (rdfabout.com)";
    void:sparqlEndpoint <http://www.rdfabout.com/sparql>;
    void:triples 1002848918;
    void:uriSpace "http://www.rdfabout.com/rdf/usgov/geo/" .
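A minimal sketch of pulling the namespace out of that response, assuming void:uriSpace sits on a single line with a quoted literal, as in the excerpt above. The heredoc stands in for the service's output, and sed is only a line-oriented shortcut here, not a Turtle parser.

```shell
# Extract the quoted value that follows void:uriSpace.
# The heredoc is a copy of the excerpt above; a real run would feed in
# the saved output of the curl request instead.
namespace=$(sed -n 's/.*void:uriSpace "\([^"]*\)".*/\1/p' <<'EOF'
<http://datahub.io/dataset/2000-us-census-rdf> a datafaqs:CKANDataset;
    ov:shortName "US Census (rdfabout)";
    void:triples 1002848918;
    void:uriSpace "http://www.rdfabout.com/rdf/usgov/geo/" .
EOF
)
echo "$namespace"
```

If the response were serialized differently (for example with the literal on its own line), an RDF-aware tool would be the safer choice.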

For example, the namespace of http://datahub.io/dataset/twc-logd is http://logd.tw.rpi.edu/, and http://datahub.io/dataset/twc-healthdata references the URIs http://logd.tw.rpi.edu/id/medicare-gov/provider/340070 and http://logd.tw.rpi.edu/id/medicare-gov/provider/340071. Both URIs fall within that namespace, so twc-healthdata links to twc-logd.
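The namespace test itself is a prefix match. The sketch below is not how cr-linksets.sh does it (that is not shown here); it only illustrates the idea on the two URIs above, plus one made-up URI (http://example.org/elsewhere/1) added for contrast.

```shell
namespace='http://logd.tw.rpi.edu/'

# One referenced URI per line. The last one is hypothetical and sits
# outside the namespace, so it should not be counted.
referenced_uris='http://logd.tw.rpi.edu/id/medicare-gov/provider/340070
http://logd.tw.rpi.edu/id/medicare-gov/provider/340071
http://example.org/elsewhere/1'

links=0
while IFS= read -r uri; do
  case "$uri" in
    # Quoting the variable makes the prefix literal; * matches the rest.
    "$namespace"*) links=$((links + 1)) ;;
  esac
done <<EOF
$referenced_uris
EOF

echo "$links URIs fall within $namespace"
```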

cr-linksets.sh creates a versioned dataset. To find the non-empty linksets, run:

find automatic -type f -size +0b -name linkset.txt

Modeling the Linkset

When 50 URIs occur in both http://datahub.io/dataset/twc-healthdata and http://datahub.io/dataset/2000-us-census-rdf, the resulting Linkset is modeled in VoID like this:

<http://datahub.io/dataset/twc-healthdata>
    void:subset :linkset_2000c93158fafa9776550172052af7dc .

:linkset_2000c93158fafa9776550172052af7dc 
     a void:Linkset, void:Dataset;
     void:target 
       <http://datahub.io/dataset/twc-healthdata>, 
       <http://datahub.io/dataset/2000-us-census-rdf>;
     void:triples 50; 
.

<http://www.rdfabout.com/rdf/usgov/geo/blah_1> void:inDataset :linkset_2000c93158fafa9776550172052af7dc .
<http://www.rdfabout.com/rdf/usgov/geo/blah_2> void:inDataset :linkset_2000c93158fafa9776550172052af7dc .

We can name the Linkset by hashing the two targets together with the current date (as seconds since the epoch). For example:

md5.sh -qs http://datahub.io/dataset/twc-healthdata`date +%s`http://datahub.io/dataset/2000-us-census-rdf
2000c93158fafa9776550172052af7dc
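Where md5.sh is not on the path, the same naming scheme can be sketched with md5sum. This assumes md5.sh -qs hashes its string argument, and that md5sum (GNU coreutils or BusyBox) is available; the variable names are illustrative. Because the timestamp changes on every run, the resulting name will differ from the one shown above.

```shell
source_dataset='http://datahub.io/dataset/twc-healthdata'
target_dataset='http://datahub.io/dataset/2000-us-census-rdf'

# Seconds since the epoch, as in the md5.sh example.
now=$(date +%s)

# printf '%s' avoids a trailing newline, so only the string itself is hashed.
# md5sum prints "<hash>  -"; cut keeps the hash.
hash=$(printf '%s' "${source_dataset}${now}${target_dataset}" | md5sum | cut -d ' ' -f 1)

echo "linkset_${hash}"
```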

Where this has been applied

Limitations of this approach

This approach is cheaper to calculate because we don't need to go through the hassle of finding and retrieving the full data dump of each bubble, and we don't have as much instance data to process. However, it will miss a connection between our bubble and another when both mention the same URIs but those URIs are not in the other bubble's own namespace.
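The missed case can be made concrete with a small sketch. All URIs under http://example.org/ are made up for illustration, and the second list stands for the other bubble's data dump, which this approach never retrieves.

```shell
their_namespace='http://www.rdfabout.com/rdf/usgov/geo/'

# Hypothetical URI lists, one per line, with no duplicates within a list.
our_uris='http://example.org/shared/thing
http://example.org/ours/only'
their_uris='http://example.org/shared/thing'

# What the namespace test sees: our URIs that start with their namespace.
namespace_hits=$(printf '%s\n' "$our_uris" |
  awk -v ns="$their_namespace" 'index($0, ns) == 1 { n++ } END { print n + 0 }')

# What a full comparison of both dumps would see: URIs present in both lists
# (a line is counted when it shows up for the second time).
shared=$(printf '%s\n' "$their_uris" "$our_uris" |
  awk 'seen[$0]++ == 1 { n++ } END { print n + 0 }')

echo "namespace test: $namespace_hits link(s); shared URIs: $shared"
```

Here the two bubbles share a URI, yet the namespace test reports no link, because the shared URI belongs to a third party's namespace.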

What is next?
