Script: cr-test-conversion.sh

timrdf edited this page Apr 11, 2012 · 108 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

Motivation

Since csv2rdf4lod is under continual development, it is good to use the latest version (by running git pull). But what if a new version of the converter changes its behavior and produces your data differently? That's a problem, and you need to know about it ASAP. Even better, I need to know about it ASAP. Ideally, I would find and fix the problem before I even release the next version of the converter, so that you never have to worry about it. cr-test-conversion.sh helps you identify these problems so that you can handle them quickly. At the same time, it lets you share your explicit expectations for the converter so that I can verify that it works for you before I release another version.

Ultimately, verifying that the conversion meets your expectations makes your applications more stable.

Running the unit tests

Make sure tdbloader is installed and on your path:

$ which tdbloader
/opt/tdb/TDB-0.8.2/bin/tdbloader

From your conversion cockpit, run:

cr-test-conversion.sh --setup --verbose

This will use tdbloader to load the publish/* dump files into publish/tdb/ and then run the unit tests found in rq/test/ or ../../rq/test/.
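As a rough sketch of what the setup step amounts to, the two Jena TDB commands can be wrapped as below. This is not the script's actual code: load_dumps and ask_answer are hypothetical helper names, and the sed filter assumes that tdbquery prints ASK results in the "Ask => Yes" form seen in the test output elsewhere on this page.

```shell
# Load dump files into a TDB directory (roughly what --setup prepares).
# $1 = TDB directory, remaining arguments = dump files to load.
load_dumps() {
   tdb_dir="$1"; shift
   mkdir -p "$tdb_dir"
   tdbloader --loc="$tdb_dir" "$@"
}

# Run one ASK query file against a TDB directory and print only Yes or No.
# $1 = TDB directory, $2 = query file.
# Assumption: tdbquery reports ASK results as "Ask => Yes" or "Ask => No".
ask_answer() {
   tdbquery --loc="$1" --query="$2" | sed -n 's/^Ask => //p'
}
```

From a conversion cockpit, `load_dumps publish/tdb publish/*.nt` followed by `ask_answer publish/tdb ../../rq/test/ask/present/a-dataset-exists.rq` would then answer a single test by hand.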

Implementation

The script $CSV2RDF4LOD_HOME/bin/util/cr-test-conversion.sh is a start at tackling this challenge. Like virtually all other cr- scripts, it is invoked from any conversion cockpit. When invoked, it applies a variety of SPARQL queries to verify the converted data.

Dependencies

The testing infrastructure currently uses Jena's TDB because it lets us set up a triple store in a local directory of our choosing. See TWC's page for help installing Jena TDB. If you can successfully run tdbloader and tdbquery, then you're good to go. (If you have a burning desire to test using other triple stores, go vote for #150.)

Setting up a unit test

Get into a cr:dataset directory (running cr-pwd-type.sh says cr:dataset) and run:

/srv/logd/data/source/nycopendata-socrata-com/zip-code-breakdowns# cr-test-conversion.sh --rq

Get into a conversion cockpit and run:

/srv/logd/data/source/nycopendata-socrata-com/zip-code-breakdowns/version/2012-Apr-11# cr-test-conversion.sh

Using version-controlled csv2rdf4lod skeletons to report bugs

Version control strategies discusses how csv2rdf4lod-automation can be used within a version control system. When using one, reporting a bug becomes incredibly easy: all one needs to do is commit the .rq and point others to the URL of the test on the SVN web server. For example, someone could say:

Hey, this [1] doesn't work and I need it Real Soon!, it's for my demo.

[1] https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog/version/2011-Jun-27/rq/test/ask/present/thing_2.rq

With just this URL, I can run to my terminal:

$ mkdir hurry-and-fix; cd hurry-and-fix
$ svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog \
               source/data-gov-au/catalog
$ cd source/data-gov-au/catalog/version/2011-Jun-27
$ export CSV2RDF4LOD_PUBLISH=true; export CSV2RDF4LOD_PUBLISH_TDB=true
$ ./convert-catalog.sh

bash-3.2$ cr-test-conversion.sh --verbose
................................................................................
rq/test/ask/absent/subject-uri-follows-sdv-naming.rq (Ask => No)

      <http://logd.tw.rpi.edu/source/data-gov-au/dataset/catalog/data.gov.au/version/2011-Jun-27/thing_2> ?p ?o .

-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/           - - - FAIL - - 
rq/test/ask/present/thing_2-keywords-parsed.rq (Ask => No)

      :thing_2 dcterms:subject "Bicycles", 
                               "Bike paths",
                               "Cycling",
                               "Transport" .

-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/           - - - FAIL - -
rq/test/ask/present/thing_2-keywords-unparsed.rq (Ask => No)

                   #http://logd.tw.rpi.edu/source/data-gov-au/dataset/catalog/version/2011-Jun-27/
      :thing_2 dgtwc:keywords   "Bicycles ,  Bike paths ,  Cycling ,  Transport" ;
               e1:keywords_tags "Bicycles ,  Bike paths ,  Cycling ,  Transport" .

-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/           - - - FAIL - -
rq/test/ask/present/thing_2.rq (Ask => No)

      :thing_2
         e1:data_gov_au_category "Community ,  Health ,  Transport" ;
         dgtwc:categories        "Community ,  Health ,  Transport" ;
         # The following two should be parsed into the three triples below:
         dgtwc:category  "Community", 
                         "Health",
                         "Transport";
         # The following two should be parsed into the three triples below:
         e1:keywords_tags        "Bicycles ,  Bike paths ,  Cycling ,  Transport" ;
         dgtwc:keywords          "Bicycles ,  Bike paths ,  Cycling ,  Transport" ;
         dcterms:subject "Bicycles", 
                         "Bike paths",
                         "Cycling",
                         "Transport" .

--------------------------------------------------------------------------------
1 of 4 passed

And I can see your new concerns!

Exposing RDF conversion unit tests as RDF

By extending Vocabulary of Interlinked Datasets (VoID) and reusing Description of a Project (DOAP), we can model an abstract dataset that is under version control and has unit tests:

<http://logd.tw.rpi.edu/source/worldbank-org/dataset/world-development-indicators>
  a conversion:AbstractDataset, void:Dataset;
  a conversion:VersionControlledDataset;
  doap:repository [
    a doap:SVNRepository;
    doap:location <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/worldbank-org/world-development-indicators/>;
  ];
  a conversion:UnitTestedDataset;
  conversion:testable_by [ 
     a doap:Project;
     doap:developer <http://tw.rpi.edu/instances/MaryamFazel-Zarandi>;
     doap:developer <http://tw.rpi.edu/instances/TimLebo>;
     doap:repository [ 
       a doap:SVNRepository;
       doap:location <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/worldbank-org/world-development-indicators/rq/>
     ];
  ];
.

Sometimes tests can only apply to specific versions, since they have to assume specific values for a specific data element. Although they aren't as broadly applicable, they are still useful. The following RDF encoding states that a versioned dataset is under version control and has unit tests:

<http://logd.tw.rpi.edu/source/data-gov-au/dataset/catalog/version/2011-Jun-27>
  a conversion:VersionedDataset, void:Dataset;
  a conversion:VersionControlledDataset;
  doap:repository [
    a doap:SVNRepository;
    doap:location <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog/>;
  ];
  a conversion:UnitTestedDataset;
  conversion:testable_by [ 
     a doap:Project;
     doap:developer <http://tw.rpi.edu/instances/YongmeiShi>;
     doap:developer <http://tw.rpi.edu/instances/TimLebo>;
     doap:repository [ 
       a doap:SVNRepository;
       doap:location 
  <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog/version/2011-Jun-27/rq/>
     ];
  ];
.

cr-test-conversion.sh --catalog -w will write a listing that types each SPARQL-based unit test as an earl:TestCase. For example, source/worldbank-org/world-development-indicators/rq/test/list.ttl:

@prefix earl: <http://www.w3.org/ns/earl#> .

<ask/absent/impossible_series.rq>      a earl:TestCase .
<ask/absent/impossible.rq>             a earl:TestCase .
<ask/present/has-a-triple.rq>          a earl:TestCase .
<ask/present/has-impossible_series.rq> a earl:TestCase .
<ask/present/has-a-indicator.rq>       a earl:TestCase .
<ask/present/has-a-entry.rq>           a earl:TestCase .
<ask/present/has-a-country.rq>         a earl:TestCase .
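A minimal sketch of how such a listing could be produced by hand. It assumes nothing about how --catalog is actually implemented; list_test_cases is a hypothetical helper and not part of csv2rdf4lod-automation.

```shell
# Print a Turtle listing that types every .rq file under the given test
# directory as an earl:TestCase, using paths relative to that directory.
# $1 = test directory (e.g. rq/test).
list_test_cases() {
   echo '@prefix earl: <http://www.w3.org/ns/earl#> .'
   echo
   ( cd "$1" && find . -type f -name '*.rq' | sort | sed 's|^\./||' ) |
      while IFS= read -r rq; do
         echo "<$rq> a earl:TestCase ."
      done
}
```

Running `list_test_cases rq/test` prints the listing to standard output, much like a dry run; redirecting it to rq/test/list.ttl would save it.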

Test results [vocabularies](RDF vocabularies used):

cr-test-conversion.sh usage

cr-test-conversion.sh --help:

 usage: cr-test-conversion.sh
  --rq                   : Create initial rq/test/ask/{present,absent}/*.rq directory structure.
  --setup                : Run tests, populate the tdb/ beforehand.
  --setup {--verbose, -v}: Run tests, populate the tdb/ beforehand, and show query contents.
                         : Run tests. Needs rq/test or ../../rq/test and publish/tdb/.
  {--verbose, -v}        : Run tests. Needs same as above. Shows the query contents while testing.
  --catalog -w           : Find all rq/test and create rq/test/list.ttl rdf:typing them to earl:TestCase.
  --catalog              : Show dryrun of finding all rq/test; print hypothetical contents of rq/test/list.ttl.
  --show-catalog         : Show all rq/test/list.ttl

Setup

bash-3.2$ cd /source/medicare-gov/catalog

bash-3.2$ ls
version/

bash-3.2$ cr-test-conversion.sh --rq 
Creating rq/test for dataset medicare-gov catalog
rq/test/ask/present
rq/test/ask/present/a-dataset-exists.rq
rq/test/ask/absent
rq/test/ask/absent/impossible.rq

bash-3.2$ ls
version/
rq/

The two sample queries (a-dataset-exists.rq and impossible.rq) take the following form. If you follow this capitalization and structure, the --verbose flag will be a little cleaner when executing the tests.

...
ASK
WHERE {
   GRAPH ?g {
      ...
   }
}
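To add a test of your own in the same form, something like the following could be used. It is only a sketch: write_present_test is a hypothetical helper, the test name and graph pattern are placeholders, and the void: prefix declaration is an assumption about what your query needs.

```shell
# Write an ASK test under rq/test/ask/present/ using the layout shown above.
# $1 = test name (hypothetical), $2 = graph pattern expected to be present.
write_present_test() {
   mkdir -p rq/test/ask/present
   cat > "rq/test/ask/present/$1.rq" <<EOF
PREFIX void: <http://rdfs.org/ns/void#>

ASK
WHERE {
   GRAPH ?g {
      $2
   }
}
EOF
}
```

For example, `write_present_test has-void-dataset '?dataset a void:Dataset .'` run from a cr:dataset directory creates rq/test/ask/present/has-void-dataset.rq.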

(or on another machine, according to Version control strategies: only the essential minimum is needed)

Next, we can hop into a conversion cockpit and prepare to test:

bash-3.2$ cd version/2011-Jul-18/

bash-3.2$ ls
source/
doc/
manual/
convert-catalog.sh
automatic/
publish/
bash-3.2$ export CSV2RDF4LOD_PUBLISH_TDB=true

bash-3.2$ publish/bin/publish.sh
...
 WARN [main] (FactoryGraphTDB.java:241) - No BGP optimizer
Load: publish/medicare-gov-catalog-2011-Jul-18.nt
34,552 triples: loaded in 2.3 seconds [15,254.7 triples/s]

Test!

Source the my-csv2rdf4lod-source-me.sh for the project that you are testing against (see my-csv2rdf4lod-source-me.sh), then reset CSV2RDF4LOD_HOME, CSV2RDF4LOD_CONVERT_MACHINE_URI, and CSV2RDF4LOD_CONVERT_PERSON_URI to point to your copy of the converter.

bash-3.2$ cr-test-conversion.sh
../../rq/test/ask/absent/impossible.rq Ask => No
../../rq/test/ask/present/a-dataset-exists.rq Ask => Yes
--------------------------------------------------------------------------------
2 of 2 passed

If you'd like to see a bit more, use -v or --verbose:

bash-3.2$ cr-test-conversion.sh --verbose
................................................................................
../../rq/test/ask/absent/impossible.rq (Ask => No)

      twi:TimLebo owl:sameAs twi:notTimLebo .

................................................................................
../../rq/test/ask/present/a-dataset-exists.rq (Ask => Yes)

      ?dataset a conversion:Dataset, void:Dataset .

--------------------------------------------------------------------------------
2 of 2 passed
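The outputs on this page suggest the pass/fail rule: a query under ask/present/ is expected to answer Yes, and a query under ask/absent/ is expected to answer No. The sketch below encodes that reading. It is inferred from the sample output, not taken from the script, and expected_answer and verdict are hypothetical names.

```shell
# Expected answer, inferred from the directory that a test file lives in:
# tests under ask/present/ should answer Yes, tests under ask/absent/ No.
expected_answer() {
   case "$1" in
      */ask/present/*) echo "Yes" ;;
      */ask/absent/*)  echo "No" ;;
      *)               echo "unknown" ;;
   esac
}

# Compare an observed answer ($2) with the expectation for test file $1.
verdict() {
   if [ "$(expected_answer "$1")" = "$2" ]; then
      echo "pass"
   else
      echo "FAIL"
   fi
}
```

Under this reading, `verdict ../../rq/test/ask/absent/impossible.rq No` prints pass, while a test under ask/absent/ that answers Yes prints FAIL.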

Example: Testing GovTrack

From a conversion cockpit:

bash-3.2$ find rq
rq
rq/test
rq/test/ask
rq/test/ask/absent
rq/test/ask/absent/9-to-7.rq
rq/test/ask/present
rq/test/ask/present/0-to-2.rq
rq/test/ask/present/2-to-3.rq
rq/test/ask/present/3-to-5.rq
rq/test/ask/present/3-to-7.rq
rq/test/ask/present/5-to-1.rq
rq/test/ask/present/7-to-5.rq

Run export CSV2RDF4LOD_PUBLISH_TDB=true before publishing so that the conversion is loaded into a TDB directory that the tests can query.

[Figure: diagram of enhancements to the geonames zip code dump, http://download.geonames.org/export/zip/US.zip]

bash-3.2$ cr-test-conversion.sh -v
-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/
rq/test/ask/absent/9-to-7.rq (Ask => Yes)           - - - FAIL - - -

      typed_subdivision_order_3:r40040c9reference_199_VA_US geonames:parentFeature <http://logd.tw.rpi.edu/source/geonames-org/dataset/zip-us/us/typed/subdivision_order_2/199_VA_US> .

................................................................................
rq/test/ask/present/0-to-2.rq (Ask => Yes) 

      zip-us-us:point_40040 
         a                       wgs:Point;
         geonames:parentFeature <http://logd.tw.rpi.edu/id/usps-com/zip/23690>;
         wgs:lat                ?lat;
         wgs:long               ?long .

................................................................................
rq/test/ask/present/2-to-3.rq (Ask => Yes) 

      <http://logd.tw.rpi.edu/id/usps-com/zip/23690> geonames:parentFeature typed_place:Yorktown_VA_US .

................................................................................
rq/test/ask/present/3-to-5.rq (Ask => Yes) 

      typed_place:Yorktown_VA_US geonames:parentFeature typed_subdivision_order_1:VA_US .

................................................................................
rq/test/ask/present/3-to-7.rq (Ask => Yes) 

      typed_place:Yorktown_VA_US geonames:parentFeature <http://logd.tw.rpi.edu/source/geonames-org/dataset/zip-us/us/typed/subdivision_order_2/199_VA_US> .

................................................................................
rq/test/ask/present/5-to-1.rq (Ask => Yes) 

      typed_subdivision_order_1:VA_US geonames:parentFeature typed_country:US .

................................................................................
rq/test/ask/present/7-to-5.rq (Ask => Yes) 

      <http://logd.tw.rpi.edu/source/geonames-org/dataset/zip-us/us/typed/subdivision_order_2/199_VA_US> geonames:parentFeature typed_subdivision_order_1:VA_US .

--------------------------------------------------------------------------------
6 of 7 passed