
Publishing LOGD's International Open Government Data Search data

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

The International Open Government Data Search (IOGDS) project developed their own publishing scripts instead of using the publishing scripts provided by csv2rdf4lod-automation.

This page is a collection of notes about what they needed that csv2rdf4lod-automation didn't provide, so that we can incorporate it back into the rest of the core automation. The issues that IOGDS has raised against csv2rdf4lod-automation are listed here.

These are notes from reverse engineering the code that is lying around; they are not intended to be an authoritative explanation of how IOGDS was constructed, just scraps of evidence put together by an outsider.

(Some documentation: http://logd.tw.rpi.edu/lab/project/logd_internaltional_ogd_catalog)

What they did use

IOGDS used the directory conventions of the [data root](csv2rdf4lod automation data root), the enhancement parameters to specify how to transform their CSV scrapings to RDF, and the conversion trigger to invoke the core converter. This got them to the point of having per-file RDF conversion results in manual/ and their aggregations in publish/ (for all 80ish of their datasets).
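
For orientation, the relevant slice of that layout for one of their datasets looks roughly like this (directory names are taken from the version directories listed further below; contents are abbreviated):

    source/                      # the data root
      portalu-de/                # source identifier
        catalog/                 # dataset identifier
          version/
            2011-Sep-13/         # one conversion cockpit
              manual/            # CSV scrapings and their per-file RDF conversion results
              publish/           # aggregated RDF for this versioned dataset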

What they did not use

IOGDS did not use the conversion cockpits' publish/bin/publish.sh to publish the converted RDF into named graphs named after the VoID datasets' URIs, nor did they use the Metadataset conventions described in Aggregating subsets of converted datasets (#237 and #238). Instead, they created a stand-alone PHP script that they placed in the [data root](csv2rdf4lod automation data root): source/logd-iogdc-exec.php.

So, the step that they recreated on their own:

 conversion results on disk -> conversion results in triple store named graph

Which they achieved by running the following commands:

gemini$ cd /work/data-gov/csv2rdf4lod-automation/data/source
gemini$ ./logd-iogdc-exec.php load

When invoking the above, the following parameters are set:

    [time-start]                   => 1317517576
    [time]                         => 2011-10-01T21:06:16-04:00
    [dir-pwd]                      => /mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source
    [dir-temp]                     => /tmp/data-gov/iogdc-dump-ttl
    [filename-temp-all]            => iogdc-dump-all.tar.gz
    [uri-base]                     => http://logd.tw.rpi.edu
    [enhancement-id]               => 1
    [uri-metadata-graph]           => http://purl.org/twc/vocab/conversion/MetaDataset
    [uri-metadata-graph-test]      => http://purl.org/twc/vocab/conversion/MetaDataset-test
    [filename-metadata-graph]      => /tmp/data-gov/iogdc-dump-ttl/metadata-graph.ttl
    [filename-metadata-graph-test] => /tmp/data-gov/iogdc-dump-ttl/metadata-graph-test.ttl
    [namespace-dgtwc]              => http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#
    [uri-metadata-logd]            => http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#metadata-logd
    [uri-metadata-logd-test]       => http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#metadata-logd-test
    [filename-metadata-logd]       => /tmp/data-gov/iogdc-dump-ttl/metadata-logd.ttl
    [filename-metadata-logd-test]  => /tmp/data-gov/iogdc-dump-ttl/metadata-logd-test.ttl
    [option]                       => load
    [dir-start]                    => /mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source

(This is in the old [data root](csv2rdf4lod automation data root), which has been superseded by /srv/logd/data/source)

It then determines a list of version directories that should be part of the load:

  0 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/portalu-de/catalog/version/2011-Sep-13',
  1 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/datanest-fair-play-sk/catalog/version/2011-Sep-13',
  2 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/ottawa-ca/catalog/version/2011-Sep-13',
  ...
  ...
  81 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/datagm-org-uk/catalog/version/2011-Sep-14',
  82 => '/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/data-vic-gov-au/catalog/version/2011-Sep-15',

It then switches on the load parameter and says that it will:

load conversion results into dump dir and then put all into one named graph in triple store - http://purl.org/twc/vocab/conversion/MetaDataset

It's using the deprecated /opt/virtuoso/scripts/vdelete (replaced in May 2011 by $CSV2RDF4LOD_HOME/bin/util/virtuoso/vdelete to eliminate the need for sudo, add logging, and parameterize the Virtuoso configuration instead of hard-coding the bindings):

run_command ( "sudo /opt/virtuoso/scripts/vdelete " . $params["load-uri-target"] );

Fortunately, we can eliminate the hard-coded requirements (there is more than meets the eye: the script above also hard-codes all of the Virtuoso parameters...) and switch over to the latest without a hitch:

run_command ( '$CSV2RDF4LOD_HOME/bin/util/virtuoso/vdelete ' . $params["load-uri-target"] );

Now, we can switch to a development triple store without this PHP script even knowing, by switching the CSV2RDF4LOD environment variables (a portion of cr-vars.sh's output is shown):

CSV2RDF4LOD_PUBLISH_VIRTUOSO_PORT       1112
CSV2RDF4LOD_PUBLISH_VIRTUOSO_INI_PATH   /srv/logd/config/triple-store/virtuoso/development.ini
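
A minimal sketch of pointing a session at the development store before running the load, assuming the two variables can simply be exported in the shell that invokes logd-iogdc-exec.php (normally they are set wherever the csv2rdf4lod environment is sourced):

    export CSV2RDF4LOD_PUBLISH_VIRTUOSO_PORT=1112
    export CSV2RDF4LOD_PUBLISH_VIRTUOSO_INI_PATH=/srv/logd/config/triple-store/virtuoso/development.ini
    cr-vars.sh | grep VIRTUOSO    # confirm which store the utilities will talk to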

It deletes the named graph http://purl.org/twc/vocab/conversion/MetaDataset:

[load-uri-target] => http://purl.org/twc/vocab/conversion/MetaDataset

It then runs through all of the version directories listed (80ish) and gears up to call run_case_task by feeding it the following three values:

/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source/portalu-de/catalog/version/2011-Sep-13
/mnt/raid/data-gov/svn/logd-csv2rdf4lod/data/source
<all of $params, which contains the above two values>

run_case_task hops into the conversion cockpit, unzips the conversion, copies it to a tmp directory, and loads it into the triple store:

/tmp/data-gov/iogdc-dump-ttl/portalu-de-catalog-2011-Sep-13.ttl -> http://purl.org/twc/vocab/conversion/MetaDataset

... with the deprecated /opt/virtuoso/scripts/vload (replaced in May 2011 by $CSV2RDF4LOD_HOME/bin/util/virtuoso/vload to eliminate the need for sudo, eliminate needless file copying before the load, add logging, and parameterize the Virtuoso configuration instead of hard-coding the bindings):

run_command ( "sudo /opt/virtuoso/scripts/vload ttl "

Knowing which datasets are part of IOGDS

Datasets' layers are typed as conversion:DatasetCatalog during enhancement, so the following SPARQL query (results) will list the abstract datasets that should be part of IOGDS (this is the crux of the automation to create the void:subset assertions).

PREFIX void:       <http://rdfs.org/ns/void#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?abstract
WHERE {
  GRAPH ?g  {
    ?abstract 
       a conversion:AbstractDataset;
       void:subset ?versioned .

    ?versioned 
       a conversion:VersionedDataset;
       void:subset ?layer .

    ?layer a conversion:DatasetCatalog .
  }
} 
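
A sketch of running that query from the command line against the endpoint that OGDSearch.php itself uses (see CONFIG_SPARQL_ENDPOINT further below); the file name iogds-catalogs.rq is hypothetical, standing for the query above saved to disk:

    curl -G 'http://gemini.tw.rpi.edu:8890/sparql' \
      --data-urlencode 'query@iogds-catalogs.rq' \
      --data-urlencode 'format=application/sparql-results+json'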

At some point, they wanted to expose the dump files, so they manually set up http://logd.tw.rpi.edu/2011/iogdc-dump-all-v20110907/. This could already have been fulfilled with the following query (results):

PREFIX void:       <http://rdfs.org/ns/void#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?versioned ?dump
WHERE {
  GRAPH ?g  {
    ?abstract 
       a conversion:AbstractDataset;
       void:subset ?versioned .

    ?versioned 
       a conversion:VersionedDataset;
       void:subset ?layer .
    optional{ ?versioned void:dataDump ?dump }

    ?layer a conversion:DatasetCatalog .
  }
} 

S2S gets RDF-encoded configurations from its own triple store

https://scm.escience.rpi.edu/svn/public/s2s/trunk/src/rdf/logd-s2s.owl is a version-controlled RDF description of the OpenSearch service that S2S needs to run the IOGDS demonstration. logd-s2s.owl is loaded into the default graph of a TDB+Joseki triple store. S2S is invoked by providing it the URI of the service http://logd.tw.rpi.edu/s2s/1/1/LogdIntlSearchService. The RDF description is illustrated here:

Fulfilling OpenSearch queries (what S2S needed) with SPARQL queries (what LOGD had)

The S2S Framework needs an OpenSearch web service to obtain data, which is provided by http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch.php (<- gemini:/var/www/html/logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch.php <- google svn). The OpenSearch XML (<- gemini:/var/www/html/logd.tw.rpi.edu/ws/iogdc/1.1/opensearch.xml <- google svn) describes the service, which accepts OpenSearch requests and fulfills them by executing SPARQL queries against the LOGD triple store. http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch.php is described in logd-s2s.owl, which is loaded in a TDB+Joseki triple store and DESCRIBED (and has a new URI that actually resolves).
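
For example, the request=TotalDataset call that the Drupal page below issues against the test script could presumably be issued against the production script in the same way (other OpenSearch parameters are omitted here):

    curl 'http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch.php?request=TotalDataset'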

http://logd.tw.rpi.edu/ws/iogdc/1.1 is version controlled in http://data-gov-wiki.googlecode.com/svn/trunk/web/logd.tw.rpi.edu/ws/iogdc/1.1 (this is a little prettier for navigation) and can be obtained by:

svn checkout http://data-gov-wiki.googlecode.com/svn/trunk/web/logd.tw.rpi.edu/ws/iogdc/1.1

The forward-facing OGDSearch.php depends on phpOGDSearch.php and phpWebUtil.php.

OGDSearch.php creates an OGDSearch, sets some params, and runs it:

$svc = new OGDSearch();
$svc->params_config[OGDSearch::CONFIG_FIELD_TITLE] = "OGDSearch-test";
$svc->params_config[OGDSearch::CONFIG_FIELD_URI_METADATASET] = "<http://purl.org/twc/vocab/conversion/MetaDataset-test>";
$svc->run();

phpOGDSearch.php specifies the endpoint that it queries:

const CONFIG_SPARQL_ENDPOINT= "http://gemini.tw.rpi.edu:8890/sparql";

and recognizes the params that we tweaked:

const CONFIG_FIELD_URI_METADATASET =  "config uri metadataset"; // "<http://purl.org/twc/vocab/conversion/MetaDataset>";

Testing S2S IOGDS demo with development data

There is an S2S SearchService instance that points to the OpenSearch XML description document (<-- gemini:/var/www/html/logd.tw.rpi.edu/ws/iogdc/1.1/opensearch-test.xml <-- google svn ). Looking at these service descriptions, one can see that the service http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch-test.php (<-- gemini:/var/www/html/logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch-test.php <-- google svn) is called to fulfill IOGDS's OpenSearch queries.

The URI for the IOGDC test S2S SearchService instance is http://logd.tw.rpi.edu/s2s/1/1/TestLogdIntlSearchService, which is not dereferenceable, but can be DESCRIBED by a TDB+Joseki endpoint. The service URI is only used to identify the service when calling the init(serviceURI) JavaScript function. (A new URI for the service is now resolvable.)
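
A sketch of that DESCRIBE; the TDB+Joseki endpoint URL is not recorded on this page, so $JOSEKI_ENDPOINT below is only a placeholder:

    curl -G "$JOSEKI_ENDPOINT" \
      --data-urlencode 'query=DESCRIBE <http://logd.tw.rpi.edu/s2s/1/1/TestLogdIntlSearchService>'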

Diffing http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch.php and http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch-test.php:

@gemini:/var/www/html/logd.tw.rpi.edu/ws/iogdc/1.1$ diff OGDSearch.php OGDSearch-test.php
74,75c74,75
< $svc->params_config[OGDSearch::CONFIG_FIELD_TITLE] = "OGDSearch";
< $svc->params_config[OGDSearch::CONFIG_FIELD_URI_METADATASET] = "<http://purl.org/twc/vocab/conversion/MetaDataset>";
---
> $svc->params_config[OGDSearch::CONFIG_FIELD_TITLE] = "OGDSearch-test";
> $svc->params_config[OGDSearch::CONFIG_FIELD_URI_METADATASET] = "<http://purl.org/twc/vocab/conversion/MetaDataset-test>";

And then there was Drupal...

The "production demonstration" is currently at http://logd.tw.rpi.edu/demo/international_dataset_catalog_search.

Editing it (http://logd.tw.rpi.edu/node/9903/edit), we can see how the S2S widget is hooked into the page:

...
<script src="/s2s/scripts/core/logd-S2SWidget.js" type="text/javascript"></script>
...
<script src="http://logd.tw.rpi.edu/ws/iogdc/1.1/OGDSearch-test.php?request=TotalDataset"></script>
...
<body onload="javascript:init('http://logd.tw.rpi.edu/s2s/1/1/LogdIntlSearchService');

development mirror: http://logd.tw.rpi.edu/demo/development/iogds_development_mirror
