Retrieving CKAN's Dataset Distribution Files
This GitHub wiki documents the technology behind the Linked Data aggregation site http://healthdata.tw.rpi.edu.
- Contributing to TWC Healthdata, if you're trying to do that.
Accessing CKAN listings shows how to find out what datasets are available in a particular instance of CKAN.
- Getting the listing needs to be done just once, and it was done by Tim.
- The listing consists of hub-healthdata-gov/<dataset-id>/dcat.ttl files, each containing the data download URL for one dataset.
This page describes how to retrieve the data for a dataset, using one of the dcat.ttl files in data/source/hub-healthdata-gov and the scripts from csv2rdf4lod-automation. After you retrieve the data, you can develop the enhancement parameters according to the Modeling Guide and commit them back to data/source/hub-healthdata-gov.
If we look at the contents of source/hub-healthdata-gov/food-recalls/dcat.ttl after cloning this project, we see that the RDF dataset that we're about to make was derived from a distribution of one of the datasets listed on our mirror of http://hub.healthdata.gov's CKAN. Each dataset in our CKAN mirror refers back to healthdata.gov's original dataset using prov:alternateOf, so anything we do with our mirrored dataset can be traced back to healthdata.gov.
lebot@healthdata:/srv/twc-healthdata/data/source/hub-healthdata-gov/food-recalls$ cat dcat.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix conversion: <http://purl.org/twc/vocab/conversion/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix : <http://purl.org/twc/health/id/> .
<http://purl.org/twc/health/source/hub-healthdata-gov/dataset/food-recalls>
a void:Dataset;
conversion:source_identifier "hub-healthdata-gov";
conversion:dataset_identifier "food-recalls";
prov:wasDerivedFrom :as_a_csv_b6935a6f-e1ed-4302-8f34-28b4af61cd83;
.
:as_a_csv_b6935a6f-e1ed-4302-8f34-28b4af61cd83
a dcat:Distribution;
dcat:downloadURL <http://www.accessdata.fda.gov/scripts/newpetfoodrecalls/PetFoodRecallProductsList2009.xls>;
.
<http://hub.healthdata.gov/dataset/food-recalls>
a dcat:Dataset;
dcat:distribution :as_a_csv_b6935a6f-e1ed-4302-8f34-28b4af61cd83;
.
Any tool that recognizes DCAT and PROV-O will understand the descriptions above. csv2rdf4lod's cr-retrieve.sh is one of them, and performs the described downloads according to the csv2rdf4lod directory conventions.
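As a rough illustration of what reading such a description involves, the download URLs can be pulled out of a dcat.ttl with standard text tools. This is only a sketch: the function name `dcat_download_urls` is made up here, it assumes each `dcat:downloadURL` and its URL share one line as in the listing above, and it is not how cr-retrieve.sh works internally (a real tool would parse the Turtle properly).

```shell
#!/bin/sh
# Hypothetical helper (not part of csv2rdf4lod): print every dcat:downloadURL
# found in a Turtle file. Assumes the predicate and its <URL> share a line.
dcat_download_urls() {
  sed -n 's/.*dcat:downloadURL[[:space:]]*<\([^>]*\)>.*/\1/p' "$1"
}

# Example: run from a dataset directory such as food-recalls/
if [ -f dcat.ttl ]; then
  dcat_download_urls dcat.ttl
fi
```

Run against the food-recalls description, this would print the single accessdata.fda.gov XLS URL.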
Running cr-retrieve.sh from the dataset directory results in an RDF conversion of the tabular XLS.
lebot@healthdata:/srv/twc-healthdata/data/source/hub-healthdata-gov/food-recalls$ cr-retrieve.sh
Attempting to use URL modification date to name version: 2012-May-08
version : 2012-May-08
url : http://www.accessdata.fda.gov/scripts/newpetfoodrecalls/PetFoodRecallProductsList2009.xls
...
lebot@healthdata:/srv/twc-healthdata/data/source/hub-healthdata-gov/food-recalls$ cat version/2012-May-08/automatic/PetFoodRecallProductsList2009.xls_ALL.csv.e1.ttl
...
:thing_2
dcterms:isReferencedBy <http://localhost/source/hub-healthdata-gov/dataset/food-recalls/version/2012-May-08> ;
void:inDataset <http://localhost/source/hub-healthdata-gov/dataset/food-recalls/version/2012-May-08> ;
e1:fiscal_year "2006" ;
e1:order "0001" ;
e1:recall_completed_dt "10/25/2006" ;
e1:recall_initiation_dt "09/14/2005" ;
e1:district_awareness_dt "09/09/2005" ;
e1:event_create_dt "09/13/2005" ;
e1:classification_dt "10/19/2005" ;
e1:recall_event_id "33405.0" ;
e1:recall_number "V-001-6" ;
e1:pet "Fish" ;
e1:trade_name "Maracide" ;
e1:product_description "Medication Mardel biospheres Maracide for Ick, Velvet, and other external Parasites. Contents 2 fl. oz. (59mL) and Fish Care Guide. Virbac Animal Health. Active Ingredients: Malachite green, Chitosan." ;
e1:best_before_dates " " ;
e1:lot "Lot number WE433" ;
e1:product_distributed_qnty "14,470 units" ;
e1:industry "Animal Drugs, Devices and Diagnostics" ;
e1:product_code "68-" ;
e1:recall_status "Terminated" ;
e1:column_19 "" ;
ov:subjectDiscriminator <http://localhost/source/hub-healthdata-gov/dataset/food-recalls/petfoodrecallproductslist2009.xls-all> ;
dcterms:isReferencedBy <http://localhost/source/hub-healthdata-gov/dataset/food-recalls/petfoodrecallproductslist2009.xls-all> ;
ov:csvRow "2"^^xsd:integer .
...
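The version identifier in the retrieval log above (2012-May-08) has the shape year, abbreviated month, day. A minimal sketch of deriving such a name from an HTTP Last-Modified style date follows. The function name, the use of that header, and the time of day in the sample value are assumptions for illustration, not a description of what cr-retrieve.sh does internally.

```shell
#!/bin/sh
# Hypothetical helper: turn an HTTP date such as
# "Tue, 08 May 2012 14:00:00 GMT" into a version name like 2012-May-08.
# The sample time of day below is invented; only the date matters here.
version_from_http_date() {
  printf '%s\n' "$1" | awk '{ printf "%s-%s-%s\n", $4, $3, $2 }'
}

version_from_http_date "Tue, 08 May 2012 14:00:00 GMT"   # prints 2012-May-08
```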
Since enhancement parameters were not available when csv2rdf4lod-automation converted the table to RDF, the result is a rather poor model of the data. We need to develop the enhancement parameters so that the resulting RDF is much more useful (see the Modeling Guide), i.e. connected to other datasets by sharing URIs for the entities described (e.g. fiscal years) and reusing existing vocabulary to describe those entities. It would also help if the dates were typed as xsd:date so that more intelligent queries, such as range comparisons, can be performed on the data.
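To see why typing matters: the raw literals are in MM/DD/YYYY form, which does not sort chronologically as a string, while xsd:date uses the YYYY-MM-DD lexical form. In csv2rdf4lod the conversion is driven by the enhancement parameters; the sketch below only illustrates the reshaping, with a function name invented for this example.

```shell
#!/bin/sh
# Illustrative only: reshape MM/DD/YYYY into the YYYY-MM-DD lexical form
# used by xsd:date. Input that does not have three /-separated parts
# produces no output.
to_xsd_date() {
  printf '%s\n' "$1" | awk -F/ 'NF == 3 { printf "%s-%02d-%02d\n", $3, $1, $2 }'
}

to_xsd_date "10/25/2006"   # prints 2006-10-25
```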
Fortunately, csv2rdf4lod-automation creates a stub enhancement parameters file, so that a lot of good RDF can come from changing a few lines of the default. See Conversion process phase: tweak enhancement parameters for more complete documentation beyond what we discuss for this example.
In this example, the enhancement files are (from twc-healthdata/data/source/hub-healthdata-gov/food-recalls):
- version/2012-May-08/manual/PetFoodRecallProductsList2009.xls_ALL.csv.e1.params.ttl
- version/2012-May-08/manual/PetFoodRecallProductsList2009.xls_Dogs,_Cats,_Horses.csv.e1.params.ttl
One enhancement parameters file is created for each table that will be converted, and the XLS produced two tables because it had two sheets ("ALL" and "Dogs, Cats, Horses").
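Judging only from the two file names above, the stub's name appears to be the workbook name, an underscore, the sheet name with spaces replaced by underscores, and the suffix .csv.e1.params.ttl. The sketch below reproduces that pattern. It is inferred from this one example rather than taken from csv2rdf4lod's source, and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical: rebuild the stub file name from workbook and sheet names,
# following the pattern seen in this example (spaces become underscores).
params_file_name() {
  workbook=$1
  sheet=$(printf '%s' "$2" | tr ' ' '_')
  printf '%s_%s.csv.e1.params.ttl\n' "$workbook" "$sheet"
}

params_file_name "PetFoodRecallProductsList2009.xls" "Dogs, Cats, Horses"
# prints PetFoodRecallProductsList2009.xls_Dogs,_Cats,_Horses.csv.e1.params.ttl
```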
Before we start modifying enhancement parameters, let's relocate them so that they apply to all versions of the dataset, instead of just the XLS retrieved on 2012-May-08. We also want to add them to the GitHub repository. Committing them before we modify them will help us see what we've changed.
# from twc-healthdata/data/source/hub-healthdata-gov/food-recalls
# move the params up into version/ so they apply to every version of the dataset
mv version/2012-May-08/manual/PetFoodRecallProductsList2009.xls_* version/
git add -f version/*.e1.params.ttl
git commit -m 'initial eparams for food-recall'
git push
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 2.23 KiB, done.
Total 8 (delta 5), reused 0 (delta 0)
To git@github.com:jimmccusker/twc-healthdata.git
bb0f684..d71df5d master -> master
The following enhancement parameters will apply any time we re-retrieve the original tabular data from accessdata.fda.gov. We'll demonstrate that soon, after we've made a few enhancements.
- version/PetFoodRecallProductsList2009.xls_ALL.csv.e1.params.ttl
- version/PetFoodRecallProductsList2009.xls_Dogs,_Cats,_Horses.csv.e1.params.ttl
After developing the enhancement, commit it back to the GitHub repository (to your own, and then submit a pull request to ours):
git add -f PetFoodRecallProductsList2009.xls_ALL.csv.e1.params.ttl
git commit -m 'first cut at eparams for food-recall'
See a first cut at changing PetFoodRecallProductsList2009.xls_ALL.csv.e1.params.ttl
In Step 1, we showed how to retrieve the data file for a dataset and to get some not-so-good RDF conversions from it, since there were no enhancement parameters available to lead to a better RDF model.
In Step 2, we showed how to relocate and tweak the enhancement parameters so that we can produce some better RDF.
Now that the enhancement parameters are available in the GitHub repository, we can wipe everything that we've done and start over, as if we were someone else who wanted to reproduce our work:
# from twc-healthdata/data/source/hub-healthdata-gov/food-recalls/version
cd ..
rm -rf version/2012-May-08/ version/retrieve.sh
find .
.
./dcat.ttl
./version/PetFoodRecallProductsList2009.xls_ALL.csv.e1.params.ttl
./version/PetFoodRecallProductsList2009.xls_Dogs,_Cats,_Horses.csv.e1.params.ttl
All we need to reproduce everything that we just did is 1) the dcat.ttl file describing where to get the data and 2) the newly-developed enhancement parameters to apply when we get the data. We can reproduce everything by running the same command that we did in Step 1:
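Those two prerequisites can be checked before re-running the retrieval. The function below is a sketch with an invented name. It only tests that dcat.ttl and at least one enhancement parameters file under version/ exist, following the layout in the find listing above.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a dataset directory: succeeds only if
# dcat.ttl and at least one version/*.e1.params.ttl file are present.
ready_to_reproduce() {
  dir=${1:-.}
  [ -f "$dir/dcat.ttl" ] || { echo "missing $dir/dcat.ttl" >&2; return 1; }
  for f in "$dir"/version/*.e1.params.ttl; do
    # An unmatched glob stays literal, so test that the file really exists.
    [ -f "$f" ] && return 0
  done
  echo "no enhancement parameters in $dir/version" >&2
  return 1
}

if ready_to_reproduce . 2>/dev/null; then
  echo "ready: run cr-retrieve.sh"
else
  echo "not ready"
fi
```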
bash-3.2$ cr-retrieve.sh
Attempting to use URL modification date to name version: 2012-May-08
version : 2012-May-08
url : http://www.accessdata.fda.gov/scripts/newpetfoodrecalls/PetFoodRecallProductsList2009.xls
...
And we can inspect the resulting RDF, which is better than what we got in Step 1:
vi version/2012-May-08/automatic/PetFoodRecallProductsList2009.xls_ALL.csv.e1.ttl
...
:recall_2 rdfs:label "recall_2" ;
dcterms:identifier "recall_2" ;
coin:slug "recall_2" ;
dcterms:isReferencedBy <http://localhost/source/hub-healthdata-gov/dataset/food-recalls/version/2012-May-08> ;
void:inDataset <http://localhost/source/hub-healthdata-gov/dataset/food-recalls/version/2012-May-08> ;
a food-recalls_vocab:Recall ;
e1:fiscal_year "2006"^^xsd2:gYear ;
e1:order "0001" ;
e1:completed "2006-10-25"^^xsd:date ;
e1:initiated "2005-09-14"^^xsd:date ;
e1:district_awareness "2005-09-09"^^xsd:date ;
e1:event_create "2005-09-13"^^xsd:date ;
e1:classification_date "2005-10-19"^^xsd:date ;
...
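A quick way to confirm that the enhancements took effect is to count typed date literals in the converted file. This is a sketch with a hypothetical function name. It simply counts lines containing ^^xsd:date, which suits the one-property-per-line layout shown above.

```shell
#!/bin/sh
# Hypothetical check: count lines with an xsd:date-typed literal in a
# Turtle file. The raw conversion from Step 1 should yield 0.
count_typed_dates() {
  # grep -c prints 0 but exits non-zero when nothing matches; mask that.
  grep -c '\^\^xsd:date' "$1" || true
}

ttl=version/2012-May-08/automatic/PetFoodRecallProductsList2009.xls_ALL.csv.e1.ttl
if [ -f "$ttl" ]; then
  count_typed_dates "$ttl"
fi
```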
The following diagram illustrates how Contributing to TWC Healthdata and this page work to create the final integrated RDF in the SPARQL endpoint http://healthdata.tw.rpi.edu/sparql, and looks ahead to how lodspeakr is set up and how DataFAQs starts to automate some analysis of the datasets. The pdf version contains links on many of the nodes shown.
- The Benefits of Mass Raw Conversions
- Contributing to TWC Healthdata, if you're trying to do that.
- Monitoring Incremental Integration tries to get an overview of how awesome we are doing with integrating these datasets.