Secondary Derivative Datasets
- cr-cron.sh automates much of the construction of secondary derived datasets.
- Aggregating subsets of converted datasets is a less-informative alternative to this page, since that just repackages existing data/metadata instead of deriving novel information from the existing data.
- The need for, and broad applicability of, secondary derived datasets was annealed during twc-healthdata.
pr-enable-dataset.sh provides the ability to enable any of the built-in secondary derived datasets that come with Prizms. Running it from within the conversion [data root](csv2rdf4lod automation data root) without any parameters shows the status of all available derived datasets.
```
lebot@hub:~/prizms/hub/data/source$ cr-pwd.sh
source/
lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh
Available datasets:
   pr-spobal-ng         is *not* enabled at hub/pr-spobal-ng/version/latest/retrieve.sh (/home/lebot/opt/prizms/bin/dataset/pr-spobal-ng.sh)
   cr-aggregate-eparams is *not* enabled at hub/cr-aggregate-eparams/version/latest/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh)
```
pr-enable-dataset.sh leverages csv2rdf4lod-automation's Triggers mechanism to derive secondary datasets using the same mechanisms that are used to process a single dataset. pr-enable-dataset.sh enables a derived dataset by inserting a trigger within the appropriate "SDV" directory conventions.
```
lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh --as-latest cr-aggregate-eparams
Created hub/cr-aggregate-eparams/version/latest/retrieve.sh -> /home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh
```
Rerunning the overview shows that cr-aggregate-eparams is now enabled.
```
lebot@hub:~/prizms/hub/data/source$ pr-enable-dataset.sh
Available datasets:
   pr-spobal-ng         is *not* enabled at hub/pr-spobal-ng/version/latest/retrieve.sh (/home/lebot/opt/prizms/bin/dataset/pr-spobal-ng.sh)
   cr-aggregate-eparams is enabled at hub/cr-aggregate-eparams/version/latest/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/secondary/cr-aggregate-eparams.sh)
```
Now commit the pointers, so that the production user can find them.
```
lebot@hub:~/prizms/hub/data/source/hub$ git add -f pr-neighborlod/src pr-neighborlod/version/retrieve.sh
lebot@hub:~/prizms/hub/data/source/hub$ git commit -m 'enabled pr-neighborlod'
```
Derived datasets are created with a source identifier for "us". CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID is a [CSV2RDF4LOD environment variable](CSV2RDF4LOD environment variables) used to indicate our source identifier.

Providing the --as-latest argument situates the trigger in version/latest/ instead of just in version/.
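At its core, enabling a dataset just plants a soft link named retrieve.sh in the right SDV directory. A minimal sketch of that step, with hypothetical paths and a hypothetical function name (the real pr-enable-dataset.sh also handles the versioned-vs-latest prompt and status reporting):

```shell
#!/bin/bash
# Sketch of what "enabling" amounts to: a soft link named retrieve.sh that
# points at the bundled trigger script. Paths and names here are hypothetical.

enable_dataset() {
  local trigger_script="$1"   # e.g. /opt/prizms/bin/secondary/cr-aggregate-eparams.sh
  local dataset_id
  dataset_id="$(basename "$trigger_script" .sh)"   # dataset id from the script name
  local dest="us/$dataset_id/version/latest"       # --as-latest puts it under version/latest/
  mkdir -p "$dest"
  ln -sf "$trigger_script" "$dest/retrieve.sh"
  echo "Created $dest/retrieve.sh -> $trigger_script"
}
```

Because the enabled state is only a soft link, disabling is just a matter of removing the link again.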
If you mirror a Prizms node, the soft link that pr-enable-dataset.sh creates (and that becomes version controlled) will likely break. Fortunately, Prizms is able to recognize this inconsistency and use the naming convention to automatically fix the reference. See below.
(yes, we need to wrap this)
```
lebot@opendap:~/prizms/opendap/data/source$ pr-enable-dataset.sh
Available datasets:
...
   cr-isdefinedby is enabled at us/cr-isdefinedby/version/retrieve.sh (/home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/dataset/cr-isdefinedby.sh)
```

We see it's just a soft link:

```
$ ls -lt us/cr-isdefinedby/version/latest/retrieve.sh
us/cr-isdefinedby/version/latest/retrieve.sh -> /home/lebot/opt/prizms/repos/csv2rdf4lod-automation/bin/dataset/cr-isdefinedby.sh
```
So, it's safe to remove it with `git rm us/cr-isdefinedby/version/latest/retrieve.sh`.
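That self-repair is possible because the naming convention alone identifies the intended target: the dataset identifier in the link's path names the bundled trigger script. A hedged sketch of the repair step, assuming the version/latest/ layout and treating the trigger directory as a parameter (this is our approximation, not Prizms' actual code):

```shell
#!/bin/bash
# Sketch: re-point a broken retrieve.sh soft link (e.g. after mirroring to a
# host with a different home directory) at the same-named local trigger.
# Assumes the <source>/<dataset>/version/latest/retrieve.sh layout.

repair_trigger_link() {
  local link="$1"          # e.g. us/cr-isdefinedby/version/latest/retrieve.sh
  local trigger_dir="$2"   # where this node keeps its bundled trigger scripts
  if [ -L "$link" ] && [ ! -e "$link" ]; then   # a symlink whose target is gone
    local dataset_id
    dataset_id="$(basename "$(dirname "$(dirname "$(dirname "$link")")")")"
    ln -sf "$trigger_dir/$dataset_id.sh" "$link"
  fi
}
```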
Some of the automated datasets only aggregate useful subsets of existing datasets -- they don't derive new information but simply repackage what exists. See Aggregating subsets of converted datasets for coverage on:
- Aggregating DCAT metadata,
- Aggregating DROID file metadata (gathers across every version, so enable as latest only),
- Aggregating Datasets' Conversion Metadata,
- Aggregating owl:sameAs links,
- Aggregating MetaDatasets,
- Aggregating rdfs:isDefinedBy,
- Aggregating Turtle-in-comments,
- Aggregating a full dump,
- Provenance and metadata created from retrieval, tweaking, conversion, and aggregation, and
- Sitemaps.
cr-isdefinedby.sh gathers up all predicates and classes occurring in a Prizms node and asserts rdfs:isDefinedBy to each term's namespace and prov:wasAttributedTo to its domain. This is used in the web site to organize the terminology that occurs in the data.
- This dataset is incremental, so do not enable it as a "latest version only":
Q: Enable derived dataset 'cr-isdefinedby' as 'latest version only'? [y/n] n
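The namespace that a term is rdfs:isDefinedBy can be computed by truncating its URI after the last '#', or, when there is no fragment, after the last '/'. A small sketch of that truncation (the function name is ours, not something cr-isdefinedby.sh exports):

```shell
#!/bin/bash
# Sketch: compute the namespace of a class or property URI by truncating
# after the last '#' (if any), otherwise after the last '/'.

namespace_of() {
  local uri="$1"
  case "$uri" in
    *'#'*) echo "${uri%#*}#" ;;   # keep everything up to and including the '#'
    */*)   echo "${uri%/*}/" ;;   # otherwise up to and including the last '/'
    *)     echo "$uri" ;;         # no structure to truncate
  esac
}
```

For example, `namespace_of 'http://purl.org/dc/terms/date'` yields `http://purl.org/dc/terms/`.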
cr-linksets gathers up all URIs in a Prizms node that are outside of its namespaces, to find those that fall within a LOD Cloud Diagram bubble. See Finding Linksets among Linked Data Bubbles.
- This dataset recalculates each time. If it is 'latest', a history will not be kept for how this Prizms node became a more integrated part of the rest of the LOD Cloud. If it is "versioned", you will be able to observe this growth.
cr-sitemap.sh produces a sitemap for robots.txt, so that automated agents can navigate the Prizms node data site. See Sindice at Ping the Semantic Web.
Deriving SPO Balance
pr-spobal-ng calculates SPO Balance for every named graph in a Prizms node's SPARQL endpoint. It creates a single conversion:VersionedDataset/void:Dataset that summarizes all un-summarized named graphs. So, a new version of this dataset only appears when new named graphs appear in the endpoint. The size of the spo balance datasets will reflect the growth of the Prizms node.
- pr-spobal-ng does incremental analysis of new named graphs that appear and are not yet summarized, so do not enable it as a "latest":
Q: Enable derived dataset 'pr-spobal-ng' as 'latest version only'? [y/n] n
pr-spobal-full-dump calculates SPO Balance for the Prizms node's full RDF dump.
- pr-spobal-full-dump analyzes the ENTIRE Prizms endpoint, so it's probably best to enable it only as a "latest":
Q: Enable derived dataset 'pr-spobal-full-dump' as 'latest version only'? [y/n] y
pr-neighborlod gathers up all URIs in a Prizms node that are outside of its namespaces, associates each with the URI's domain, and accumulates its Linked Data dereferenced RDF.
A more elaborate analysis would record whether the external URI was dereferenceable as RDF:

```turtle
:new  prov:specializationOf :external ;
      dcterms:date "2013-09-13T12:28:31+00:00"^^xsd:dateTime ;
      a :NotDereferenceable .
```
cr-pingback.sh: see Ping the Semantic Web.
cr-sparql-sd.sh fixes the "http://localhost:8890/sparql" problem when requesting /sparql.
```
curl -H "Accept: application/rdf+xml" -L $CSV2RDF4LOD_PUBLISH_VIRTUOSO_SPARQL_ENDPOINT | \
  perl -pe "s|http://localhost:8890/sparql|$CSV2RDF4LOD_PUBLISH_VIRTUOSO_SPARQL_ENDPOINT|" > sparql.rdf
```
returns
```xml
<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:sd="http://www.w3.org/ns/sparql-service-description#" >
  <rdf:Description rdf:about="http://opendap.tw.rpi.edu/sparql">
    <rdf:type rdf:resource="http://www.w3.org/ns/sparql-service-description#Service" />
    <sd:endpoint rdf:resource="http://opendap.tw.rpi.edu/sparql" />
    <sd:feature rdf:resource="http://www.w3.org/ns/sparql-service-description#UnionDefaultGraph" />
    <sd:feature rdf:resource="http://www.w3.org/ns/sparql-service-description#DereferencesURIs" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/SPARQL_Results_CSV" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/N-Triples" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/RDF_XML" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/Turtle" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/RDFa" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/SPARQL_Results_XML" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/N3" />
    <sd:resultFormat rdf:resource="http://www.w3.org/ns/formats/SPARQL_Results_JSON" />
    <sd:supportedLanguage rdf:resource="http://www.w3.org/ns/sparql-service-description#SPARQL10Query" />
    <sd:url rdf:resource="http://opendap.tw.rpi.edu/sparql" />
  </rdf:Description>
</rdf:RDF>
```
This only needs to be done once, so answer yes to the question:
Q: Enable derived dataset 'cr-sparql-sd' as 'latest version only'? [y/n] y
pr-aggregate-pingbacks is incremental, where each version contains those pingbacks that have not yet been aggregated. To keep all pingbacks, answer *no*; to keep only the pingbacks since the last time they were aggregated (thus losing the previous ones), answer *yes*.
Q: Enable derived dataset 'pr-aggregate-pingbacks' as 'latest version only'? [y/n] n
The following datasets have been created for special applications and need to be generalized to suit any Prizms node.
- pr-whois-domain, but:
  - has pay-level-domain calculation issues
  - whois services throttle quickly
  - responses are irregularly formatted
  - massive usage restrictions (republishing, etc.)
- triple counts on loaded named graphs, by either SELECT COUNT or by walking the provenance (or both, looking for inconsistencies), e.g. http://opendap.tw.rpi.edu/namedGraphs; the void:Datasets mentioned don't have void:triples.
- those in cr-cron.sh
- cr-mirror-ckan.py (hasn't been exercised beyond healthdata use case; haven't had much need for it since)
- cr-publish-tic-to-endpoint.sh (needs a more efficient grep)
- cr-linksets.sh (works fine within cron; don't fix what's not broken)
- cr-pingback.sh (works fine within cron; don't fix what's not broken)
- vcard address -> lat/long cr-address-coordinates.sh
- raw analysis
- https://github.com/timrdf/rdfstats/wiki
- AlchemyAPI of non-RDF linked data URIs, with dcterms:related for the pointers that alchemy provides. (these need attribution to alchemy)
- pr-sparql-log
https://github.com/timrdf/vsr/wiki/Characterizing-a-list-of-RDF-node-URIs#bte-vocabulary
This is done specifically for SVN paths in opendap.tw. It doesn't make sense to explode the BTE for the entire Prizms node, so we need to figure out a good general-case subset of URIs to process.
- SPARQL constructs ala WCL property chain use case.
pr-neighborlod.sh sits on a SPARQL query that it edits inline. See notes about it here.
An RDF dataset derived from `grep 'GET /sparql' /var/log/apache2/access.log`. This isn't implemented yet, but it was inspired by Mariano Rico's email to dbpedia. Some privacy/security concerns here...
Questions we could answer:
- What clients are accessing my endpoint? (e.g. LODSPeaKr version 20130612, http://sparqles.okfn.org/, sindice?)
I'm really not in the mood to dig into parsing apache logs...
- http://www.leancrew.com/all-this/2013/07/parsing-my-apache-logs/ provides some python
What is the access.log pattern? My directive is `CustomLog /var/log/apache2/access.log combined`, which follows the log pattern `"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""`, with meanings:
- `%h` -- Remote host.
- `%l` -- Remote logname (from identd, if supplied). This will return a dash unless mod_ident is present and IdentityCheck is set On.
- `%u` -- Remote user (from auth; may be bogus if return status (%s) is 401).
- `%t` -- Time the request was received (standard English format).
- `\"%r\"` -- First line of request.
- `%>s` -- Status of the final request; for requests that got internally redirected, plain %s gives the status of the original request.
- `%b` -- Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.
- `\"%{Referer}i\"` -- The contents of the Referer: header line(s) in the request sent to the server.
- `\"%{User-agent}i\"` -- The contents of the User-agent: header line(s) in the request sent to the server.
An example log message:

```
192.168.1.62 - - [10/Jan/2014:02:33:11 +0000] "GET /sparql?show_inline=0&named_graph=&output=rdf&query=... HTTP/1.1" 200 468 "-" "LODSPeaKr version 20130612"
```
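Given the combined format above, the "which clients?" question can get a first answer without a real log parser: the User-agent is always the last quoted field. A rough sketch under that assumption (not the hypothetical pr-sparql-log dataset itself, and it ignores escaped quotes inside agent strings):

```shell
#!/bin/bash
# Sketch: tally User-agent strings for /sparql requests in a combined-format
# Apache access log. Splitting on '"' makes the agent the next-to-last field.

sparql_clients() {
  grep 'GET /sparql' "$1" |
    awk -F'"' '{print $(NF-1)}' |
    sort | uniq -c | sort -rn
}
```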
pr-enable-dataset.sh offers to install any retrieval trigger that it finds by either of the following two methods:
- Any file named "pr-*" situated in the same directory as pr-enable-dataset.sh itself. These are the secondary datasets that are bundled with Prizms itself, which are maintained on GitHub here.
- Any file annotated as being a `#3> <> a conversion:RetrievalTrigger`. This allows any Prizms node instance to add its own secondary derived datasets, which could in turn be used by other Prizms nodes.
- csv2rdf4lod follows the bin/dataset/script.sh convention for several of its secondary derived dataset triggers. The optional corresponding folders (e.g. cr-sitemap/ for cr-sitemap.sh) contain supporting materials that the main trigger requires.
- In the future, this path should be generalized with a CSV2RDF4LOD environment variable for which Prizms instance roots to search; right now it is hard-coded to csv2rdf4lod-automation.
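The second discovery method can be approximated with a recursive grep for the tic annotation; a sketch (the function name and the idea of grepping are ours, the annotation string is the one above):

```shell
#!/bin/bash
# Sketch: find scripts under a directory that declare themselves retrieval
# triggers via the tic annotation "#3> <> a conversion:RetrievalTrigger".

find_retrieval_triggers() {
  grep -rl '#3> <> a conversion:RetrievalTrigger' "$1" 2>/dev/null
}
```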
To add your own secondary dataset:
- Create a retrieval trigger according to the Triggers and SDV organization conventions.
- Accept the arguments `[-n] [version-identifier]` for a dry run and the dataset version to use, respectively.
- Include the [tic](tic turtle in comments) metadata `#3> <> a conversion:RetrievalTrigger` in your retrieval trigger.
- If the trigger is smart enough not to repeat itself mindlessly when called repeatedly, use `#3> <> a conversion:RetrievalTrigger, conversion:Idempotent;` instead. The script cr-idempotent.sh can be used to determine whether a trigger is idempotent.
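The idempotency annotation lends itself to the same kind of check; a minimal sketch of what a test like cr-idempotent.sh's could look like (our approximation, not its actual implementation):

```shell
#!/bin/bash
# Sketch: does a retrieval trigger declare itself idempotent via the tic
# annotation "conversion:Idempotent"? Exit status 0 means yes.

is_idempotent() {
  grep -q 'conversion:Idempotent' "$1"
}
```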
Once a dataset is enabled, cr-cron.sh uses cr-retrieve.sh to pull any retrieval triggers that are in the data root (with the arguments `-w --skip-if-exists`).
Design principles: Monotonicity / Idempotency.