Secondary Derivative Datasets
- cr-cron.sh automates much of the construction of secondary derived datasets.
- Aggregating subsets of converted datasets is a less-informative alternative to this page, since that just repackages existing data/metadata instead of deriving novel information from the existing data.
- The need for, and broad applicability of, secondary derived datasets was annealed during twc-healthdata.
pr-enable-dataset.sh provides the ability to enable any of the built-in secondary derived datasets that come with Prizms. Running it from within the conversion [data root](csv2rdf4lod automation data root) without any parameters shows the status of all available derived datasets.
```
lebot@hub:~/prizms/hub/data/source$ cr-pwd.sh
source/
lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh
Available datasets:
   pr-spobal-ng         is *not* enabled at hub/pr-spobal-ng/version/latest/retrieve.sh (/home/lebot/opt/prizms/bin/dataset/pr-spobal-ng.sh)
   cr-aggregate-eparams is *not* enabled at hub/cr-aggregate-eparams/version/latest/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh)
```
pr-enable-dataset.sh leverages csv2rdf4lod-automation's Triggers mechanism to derive secondary datasets using the same mechanisms that are used to process a single dataset. pr-enable-dataset.sh enables a derived dataset by inserting a trigger within the appropriate "SDV" directory conventions.
```
lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh --as-latest cr-aggregate-eparams
Created hub/cr-aggregate-eparams/version/latest/retrieve.sh -> /home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh
```
Rerunning the overview shows that cr-aggregate-eparams is now enabled.
```
lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh
Available datasets:
   pr-spobal-ng         is *not* enabled at hub/pr-spobal-ng/version/latest/retrieve.sh (/home/lebot/opt/prizms/bin/dataset/pr-spobal-ng.sh)
   cr-aggregate-eparams is enabled at hub/cr-aggregate-eparams/version/latest/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh)
```
Now commit the pointers, so that the production user can find them.
```
lebot@hub:~/prizms/hub/data/source/hub$ git add -f pr-neighborlod/src pr-neighborlod/version/retrieve.sh
lebot@hub:~/prizms/hub/data/source/hub$ git commit -m 'enabled pr-neighborlod'
```
Derived datasets are created with a source identifier for "us". CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID is a [CSV2RDF4LOD environment variable](CSV2RDF4LOD environment variables) used to indicate our source identifier.

Providing the --as-latest argument situates the trigger in version/latest/ instead of just in version/.
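At its core, enabling a dataset just plants a soft link named retrieve.sh in the right SDV directory. A minimal sketch of that step, with hypothetical paths and a hypothetical function name (the real pr-enable-dataset.sh also handles the versioned-vs-latest prompt and status reporting):

```shell
#!/bin/bash
# Sketch of what "enabling" amounts to: a soft link named retrieve.sh that
# points at the bundled trigger script. Paths and names here are hypothetical.

enable_dataset() {
  local trigger_script="$1"   # e.g. /opt/prizms/bin/secondary/cr-aggregate-eparams.sh
  local dataset_id
  dataset_id="$(basename "$trigger_script" .sh)"   # dataset id from the script name
  local dest="us/$dataset_id/version/latest"       # --as-latest puts it under version/latest/
  mkdir -p "$dest"
  ln -sf "$trigger_script" "$dest/retrieve.sh"
  echo "Created $dest/retrieve.sh -> $trigger_script"
}
```

Because the enabled state is only a soft link, disabling is just a matter of removing the link again.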
If you mirror a Prizms node, the soft link that pr-enable-dataset.sh creates (and that becomes version controlled) will likely break. Fortunately, Prizms is able to recognize this inconsistency and use the naming convention to automatically fix the reference. See below.
(yes, we need to wrap this)
```
lebot@opendap:~/prizms/opendap/data/source$ pr-enable-dataset.sh
Available datasets:
...
   cr-isdefinedby is enabled at us/cr-isdefinedby/version/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/dataset/cr-isdefinedby.sh)
```

We see it's just a soft link:

```
$ ls -lt us/cr-isdefinedby/version/latest/retrieve.sh
us/cr-isdefinedby/version/latest/retrieve.sh -> /home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/dataset/cr-isdefinedby.sh
```
So, it's safe to remove it with `git rm us/cr-isdefinedby/version/latest/retrieve.sh`.
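That self-repair is possible because the naming convention alone identifies the intended target: the dataset identifier in the link's path names the bundled trigger script. A hedged sketch of the repair step, assuming the version/latest/ layout and treating the trigger directory as a parameter (this is our approximation, not Prizms' actual code):

```shell
#!/bin/bash
# Sketch: re-point a broken retrieve.sh soft link (e.g. after mirroring to a
# host with a different home directory) at the same-named local trigger.
# Assumes the <source>/<dataset>/version/latest/retrieve.sh layout.

repair_trigger_link() {
  local link="$1"          # e.g. us/cr-isdefinedby/version/latest/retrieve.sh
  local trigger_dir="$2"   # where this node keeps its bundled trigger scripts
  if [ -L "$link" ] && [ ! -e "$link" ]; then   # a symlink whose target is gone
    local dataset_id
    dataset_id="$(basename "$(dirname "$(dirname "$(dirname "$link")")")")"
    ln -sf "$trigger_dir/$dataset_id.sh" "$link"
  fi
}
```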
Some of the automated datasets only aggregate useful subsets of existing datasets -- they don't derive new information but simply repackage what exists. See Aggregating subsets of converted datasets for coverage on:
- Aggregating DCAT metadata,
- Aggregating DROID file metadata (gathers across every version, so enable as latest only),
- Aggregating Datasets' Conversion Metadata,
- Aggregating owl:sameAs links,
- Aggregating MetaDatasets,
- Aggregating rdfs:isDefinedBy,
- Aggregating Turtle-in-comments,
- Aggregating a full dump,
- Provenance and metadata created from retrieval, tweaking, conversion, and aggregation, and
- Sitemaps.
cr-isdefinedby.sh gathers up all predicates and classes occurring in a Prizms node and asserts rdfs:isDefinedBy to each term's namespace and prov:wasAttributedTo to its domain. This is used in the web site to organize the terminology that occurs in the data.
- This dataset is incremental, so do not enable it as a "latest version only":
Q: Enable derived dataset 'cr-isdefinedby' as 'latest version only'? [y/n] n
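The namespace that a term is rdfs:isDefinedBy can be computed by truncating its URI after the last '#', or, when there is no fragment, after the last '/'. A small sketch of that truncation (the function name is ours, not something cr-isdefinedby.sh exports):

```shell
#!/bin/bash
# Sketch: compute the namespace of a class or property URI by truncating
# after the last '#' (if any), otherwise after the last '/'.

namespace_of() {
  local uri="$1"
  case "$uri" in
    *'#'*) echo "${uri%#*}#" ;;   # keep everything up to and including the '#'
    */*)   echo "${uri%/*}/" ;;   # otherwise up to and including the last '/'
    *)     echo "$uri" ;;         # no structure to truncate
  esac
}
```

For example, `namespace_of 'http://purl.org/dc/terms/date'` yields `http://purl.org/dc/terms/`.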
cr-linksets gathers up all URIs in a Prizms node that are outside of its namespaces, to find those that fall within a LOD Cloud Diagram bubble. See Finding Linksets among Linked Data Bubbles.
- This dataset recalculates each time. If it is 'latest', a history will not be kept for how this Prizms node became a more integrated part of the rest of the LOD Cloud. If it is "versioned", you will be able to observe this growth.
cr-sitemap.sh produces a sitemap for robots.txt, so that automated agents can navigate the Prizms node data site. See Sindice at Ping the Semantic Web.
Deriving SPO Balance
pr-spobal-ng calculates SPO Balance for every named graph in a Prizms node's SPARQL endpoint. It creates a single conversion:VersionedDataset/void:Dataset that summarizes all un-summarized named graphs. So, a new version of this dataset only appears when new named graphs appear in the endpoint. The size of the spo balance datasets will reflect the growth of the Prizms node.
- pr-spobal-ng does incremental analysis of new named graphs that appear and are not yet summarized, so do not enable it as a "latest":
Q: Enable derived dataset 'pr-spobal-ng' as 'latest version only'? [y/n] n
pr-spobal-full-dump calculates SPO Balance for the Prizms node's full RDF dump.
- pr-spobal-full-dump analyzes the ENTIRE Prizms endpoint, so it's probably best to enable it only as a "latest":
Q: Enable derived dataset 'pr-spobal-full-dump' as 'latest version only'? [y/n] y
pr-neighborlod gathers up all URIs in a Prizms node that are outside of its namespaces, associates each with the URI's domain, and accumulates its Linked Data dereferenced RDF.
A more elaborate analysis would record whether the external URI was dereferenceable as RDF:

```turtle
:new  prov:specializationOf :external ;
      dcterms:date "2013-09-13T12:28:31+00:00"^^xsd:dateTime ;
      a :NotDereferenceable .
```
cr-pingback.sh: see Ping the Semantic Web.
cr-sparql-sd.sh fixes the "http://localhost:8890/sparql" problem when requesting /sparql.
```
curl -H "Accept: application/rdf+xml" -L $CSV2RDF4LOD_PUBLISH_VIRTUOSO_SPARQL_ENDPOINT | \
  perl -pe "s|http://localhost:8890/sparql|$CSV2RDF4LOD_PUBLISH_VIRTUOSO_SPARQL_ENDPOINT|" > sparql.rdf
```
returns
```xml
<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:sd="http://www.w3.org/ns/sparql-service-description#" >
  <rdf:Description rdf:about="http://opendap.tw.rpi.edu/sparql">
    <rdf:type rdf:resource="http://www.w3.org/ns/sparql-service-description#Service" />
    <sd:endpoint rdf:resource="http://opendap.tw.rpi.edu/sparql" />
    <sd:feature rdf:resource="http://www.w3.org/ns/sparql-service-description#UnionDefaultGraph" />
    <sd:feature rdf:resource="http://www.w3.org/ns/sparql-service-description#DereferencesURIs" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/SPARQL_Results_CSV" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/N-Triples" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/RDF_XML" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/Turtle" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/RDFa" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/SPARQL_Results_XML" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/N3" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/SPARQL_Results_JSON" />
    <sd:supportedLanguage rdf:resource="http://www.w3.org/ns/sparql-service-description#SPARQL10Query" />
    <sd:url rdf:resource="http://opendap.tw.rpi.edu/sparql" />
  </rdf:Description>
</rdf:RDF>
```
This only needs to be done once, so answer yes to the question:
Q: Enable derived dataset 'cr-sparql-sd' as 'latest version only'? [y/n] y
pr-aggregate-pingbacks is incremental, where each version contains those pingbacks that have not yet been aggregated. To keep all pingbacks, answer *no*; to keep only the pingbacks since the last time they were aggregated (thus losing the previous ones), answer *yes*.
Q: Enable derived dataset 'pr-aggregate-pingbacks' as 'latest version only'? [y/n] n
The following datasets have been created for special applications and need to be generalized to suit any Prizms node.
- pr-whois-domain, but:
  - has pay-level-domain calculation issues
  - whois services throttle quickly
  - responses are irregularly formatted
  - massive usage restrictions (republishing, etc.)
- triple counts on loaded named graphs, by either SELECT COUNT or by walking the provenance (or both, looking for inconsistencies), e.g. http://opendap.tw.rpi.edu/namedGraphs; the void:Datasets mentioned don't have void:triples.
- those in cr-cron.sh
- cr-mirror-ckan.py (hasn't been exercised beyond healthdata use case; haven't had much need for it since)
- cr-publish-tic-to-endpoint.sh (needs a more efficient grep)
- cr-linksets.sh (works fine within cron; don't fix what's not broken)
- cr-pingback.sh (works fine within cron; don't fix what's not broken)
- vcard address -> lat/long cr-address-coordinates.sh
- raw analysis
- https://github.com/timrdf/rdfstats/wiki
- AlchemyAPI of non-RDF linked data URIs, with dcterms:related for the pointers that alchemy provides. (these need attribution to alchemy)
- pr-sparql-log
https://github.com/timrdf/vsr/wiki/Characterizing-a-list-of-RDF-node-URIs#bte-vocabulary
This is done specifically for SVN paths in opendap.tw. It doesn't make sense to explode the BTE for the entire Prizms node, so we need to figure out a good general-case subset of URIs to process.
- SPARQL constructs ala WCL property chain use case.
pr-neighborlod.sh sits on a SPARQL query that it edits inline. See notes about it here.
An RDF dataset derived from `grep 'GET /sparql' /var/log/apache2/access.log`. This isn't implemented yet, but it was inspired by Mariano Rico's email to dbpedia. Some privacy/security concerns here...
Questions we could answer:
- What clients are accessing my endpoint? (e.g. LODSPeaKr version 20130612, http://sparqles.okfn.org/, sindice?)
I'm really not in the mood to dig into parsing apache logs...
- http://www.leancrew.com/all-this/2013/07/parsing-my-apache-logs/ provides some python
What is the access.log pattern? My directive is `CustomLog /var/log/apache2/access.log combined`, which follows the log pattern `"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""`, with meanings:
- `%h` -- Remote host.
- `%l` -- Remote logname (from identd, if supplied). This will return a dash unless mod_ident is present and IdentityCheck is set On.
- `%u` -- Remote user (from auth; may be bogus if return status (%s) is 401).
- `%t` -- Time the request was received (standard English format).
- `\"%r\"` -- First line of request.
- `%>s` -- Status of the final request; for requests that got internally redirected, plain %s gives the status of the original request.
- `%b` -- Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.
- `\"%{Referer}i\"` -- The contents of the Referer: header line(s) in the request sent to the server.
- `\"%{User-agent}i\"` -- The contents of the User-agent: header line(s) in the request sent to the server.
An example log message:

```
192.168.1.62 - - [10/Jan/2014:02:33:11 +0000] "GET /sparql?show_inline=0&named_graph=&output=rdf&query=... HTTP/1.1" 200 468 "-" "LODSPeaKr version 20130612"
```
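Given the combined format above, the "which clients?" question can get a first answer without a real log parser: the User-agent is always the last quoted field. A rough sketch under that assumption (not the hypothetical pr-sparql-log dataset itself, and it ignores escaped quotes inside agent strings):

```shell
#!/bin/bash
# Sketch: tally User-agent strings for /sparql requests in a combined-format
# Apache access log. Splitting on '"' makes the agent the next-to-last field.

sparql_clients() {
  grep 'GET /sparql' "$1" |
    awk -F'"' '{print $(NF-1)}' |
    sort | uniq -c | sort -rn
}
```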
pr-enable-dataset.sh offers to install any retrieval trigger that it finds by either of the following two methods:
- Any file named "pr-*" situated in the same directory as pr-enable-dataset.sh itself. These are the secondary datasets that are bundled with Prizms itself, which are maintained on GitHub here.
- Any file annotated as being a `#3> <> a conversion:RetrievalTrigger`. This allows any Prizms node instance to add its own secondary derived datasets, which could in turn be used by other Prizms nodes.
- csv2rdf4lod follows the bin/dataset/script.sh convention for several of its secondary derived dataset triggers. The optional corresponding folders (e.g. cr-sitemap/ for cr-sitemap.sh) contain supporting materials that the main trigger requires.
- In the future, this path should be generalized with a CSV2RDF4LOD environment variable for which Prizms instance roots to search; right now it is hard-coded to csv2rdf4lod-automation.
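The second discovery method can be approximated with a recursive grep for the tic annotation; a sketch (the function name and the idea of grepping are ours, the annotation string is the one above):

```shell
#!/bin/bash
# Sketch: find scripts under a directory that declare themselves retrieval
# triggers via the tic annotation "#3> <> a conversion:RetrievalTrigger".

find_retrieval_triggers() {
  grep -rl '#3> <> a conversion:RetrievalTrigger' "$1" 2>/dev/null
}
```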
To add your own secondary dataset:
- Create a retrieval trigger according to the Triggers and SDV organization conventions.
- Accept the arguments `[-n] [version-identifier]` for a dry run and the dataset version to use, respectively.
- Include the [tic](tic turtle in comments) metadata `#3> <> a conversion:RetrievalTrigger` in your retrieval trigger.
- If the trigger is smart enough not to repeat itself mindlessly when called repeatedly, use `#3> <> a conversion:RetrievalTrigger, conversion:Idempotent;` instead. The script cr-idempotent.sh can be used to determine whether a trigger is idempotent.
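The idempotency annotation lends itself to the same kind of check; a minimal sketch of what a test like cr-idempotent.sh's could look like (our approximation, not its actual implementation):

```shell
#!/bin/bash
# Sketch: does a retrieval trigger declare itself idempotent via the tic
# annotation "conversion:Idempotent"? Exit status 0 means yes.

is_idempotent() {
  grep -q 'conversion:Idempotent' "$1"
}
```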
Once a dataset is enabled, cr-cron.sh uses cr-retrieve.sh to pull any retrieval triggers that are in the data root (with the arguments `-w --skip-if-exists`).
Design principles: Monotonicity / Idempotency.