
Design Objective: Capturing and Exposing Provenance

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

See Provenance.

Introduction

Collecting, converting, enhancing, and republishing data raises important concerns about the integrity of the resulting data products and the applications that use them. For TWC, this is especially true because we are a university that aggregates a variety of governmental data from both government and non-government sources -- each with its own degree of authority and trustworthiness. To address this, we strive to provide a transparent process that allows data consumers to inspect the lineage of the data products we produce. The ability to accurately cite original data source organizations is a direct benefit of this transparency, improving the credibility of derived data products and applications. It also gives credit where credit is due to those who have spent a lot of time and energy creating the original data. The following figure illustrates a variety of challenges that arise when aggregating data in a geographically and organizationally distributed system.

Provenance captured during conversion

Our aggregation process captures provenance during the following activities:

  • 1 - Following URL redirects from a dataset listing page to the actual data file download
  • 2 - Retrieving the data file
  • 3 - Extracting files from a zip file
  • 4 - Manual modification of original data file (coarse level)
  • 5 - Associating the RDF dump file from csv2rdf4lod's invocation with its tabular data file and parameter inputs
  • 6 - Enhanced predicate creation lineage
  • 7 - Associating data triples to the tabular cell that caused them
  • 8 - Populating named graph in LOGD SPARQL endpoint
(Figure: diagram of provenance captured during csv2rdf4lod conversion)

1 - Following URL redirects from a dataset listing page to the actual data file download

Data.gov survey: data source modification dates describes an example of how URL redirects are followed from data.gov's details page (e.g. http://www.data.gov/details/1000) to the actual data URL available from the hosting government agency (e.g. http://www.epa.gov/enviro/html/frs_demo/geospatial_data/state_files/state_single_nm.zip). The figure used in that discussion is reproduced below. This trace justifies the dataset's connection to data.gov and shows how we came to find the data that we retrieved. On a more practical level, it also allows us to associate the dataset with data.gov in a general fashion. Since csv2rdf4lod is agnostic to the source of information, it does not hard-code a dataset as being "from data.gov" or any other source you may be interested in curating -- it simply accepts short identifiers for the source, dataset, and version so that it can construct URIs for the entities it creates. Although the source identifier is used to create a URI for the source that can be described in detail later, and although csv2rdf4lod allows direct annotation within the enhancement parameters themselves, manually annotating thousands of datasets this way would take an unreasonable amount of time. With provenance, the connection from a downloaded file (and its resulting dataset) back to how we found out about the file in the first place is captured with minimal effort.

The trace illustrated below is captured using the data.gov-specific script bin/dg-create-dataset-dir.sh, which is run from the directory /source/data-gov/ using csv2rdf4lod's [directory structure convention](Directory Conventions). The script combines bin/util/pcurl.sh and a custom data.gov details page scraper that asserts the irw:refersTo relationship between the details page and the data file download link. This same approach can be applied to dataset listing pages from organizations other than data.gov to produce the same type of connections.
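
For illustration, a minimal Turtle sketch of the kind of trace this produces, using the data.gov example URLs above; the triple's shape is taken from the description here, but the exact form of the script's output may differ.

@prefix irw: <http://www.ontologydesignpatterns.org/ont/web/irw.owl#> .

# The details page refers to the actual data file download (asserted by the scraper).
<http://www.data.gov/details/1000>
   irw:refersTo <http://www.epa.gov/enviro/html/frs_demo/geospatial_data/state_files/state_single_nm.zip> .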

2 - Retrieving the data file

The provenance of data file downloads is captured by csv2rdf4lod's utility bin/util/pcurl.sh, which wraps the Unix curl command to provide the same functionality with one addition: it creates a metadata file for each download requested. The metadata file contains the provenance of the retrieved file, including:

  • Original URL request, its HTTP header, and its reported last-modified dateTime.
  • Final URL redirect, its HTTP header, and its reported last-modified dateTime.
  • The version and md5 of the curl used to retrieve the URL.
  • The dateTime at which the final URL was retrieved.
  • An association between the downloaded file and the final URL from which it was retrieved.
  • The md5 of the downloaded file.

The following commands illustrate how to use pcurl.sh to capture the retrieval of the White House visitor list. First, the source/ directory is created to hold an exact copy of the file we retrieved from the White House (see Conversion cockpit). Although the location of this directory follows csv2rdf4lod's "source - dataset - version" naming convention, pcurl.sh can be used anywhere to request any URL. Next, we go into the source/ directory and call pcurl.sh just as we would call curl. After it completes, we see the retrieved file and its provenance in the same directory. pcurl.sh also accepts the -I parameter to retrieve only the HTTP header information.

$ mkdir -p /source/whitehouse-gov/visitor-records/version/0910/source
$ cd /source/whitehouse-gov/visitor-records/version/0910/source
$ pcurl.sh http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0910.csv
$ ls
  > WhiteHouse-WAVES-Released-0910.csv
  > WhiteHouse-WAVES-Released-0910.csv.pml.ttl

The results of these commands can be found on LOGD at WhiteHouse-WAVES-Released-0910.csv and WhiteHouse-WAVES-Released-0910.csv.pml.ttl. For a more detailed discussion on dataset naming, dataset versioning, and the file directory structure used by csv2rdf4lod, see naming phase and Directory Conventions.
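
As a rough illustration, the provenance in the .pml.ttl file can be pictured as Turtle along these lines. This is a hand-written sketch: the class and property choices (pmlp:Information, pmlp:SourceUsage, pmlp:hasSource, pmlp:hasUsageDateTime) are drawn from PML but are not guaranteed to match pcurl.sh's literal output, and the dateTime value is made up.

@prefix pmlp: <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Sketch: the downloaded file, described as a piece of information.
<WhiteHouse-WAVES-Released-0910.csv> a pmlp:Information .

# Sketch: a usage record tying the final URL to the moment it was retrieved.
[] a pmlp:SourceUsage ;
   pmlp:hasSource <http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0910.csv> ;
   pmlp:hasUsageDateTime "2010-11-01T12:00:00-05:00"^^xsd:dateTime .   # made-up value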

3 - Extracting files from a zip file

The provenance of data files extracted from a zip archive is captured by csv2rdf4lod's utility punzip.sh, which wraps the Unix unzip command in the same way that pcurl.sh wraps curl, as described above: it creates a metadata file for each file extracted. The metadata file contains the provenance of the extracted file, including:

  • The zip archive's last-modified dateTime.
  • The version and md5 of the punzip used to extract the file.
  • The dateTime at which the file was extracted.
  • An association between the extracted file and the zip archive.
  • The md5 of the extracted file.
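
The association between an extracted file and its archive can be sketched in Turtle as below; the dcterms:source property is an illustrative stand-in, since punzip.sh's actual vocabulary may differ. An example of the real output follows.

@prefix pmlp:    <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Sketch: the extracted CSV points back to the zip archive it came from.
<STATE_SINGLE_NM.CSV>
   a pmlp:Information ;
   dcterms:source <state_single_nm.zip> .   # illustrative property choice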

The following commands illustrate how to use punzip.sh to capture the extraction of files from a zip archive belonging to data.gov's dataset 1000 (EPA Facility Registry System Facilities for the State of New Mexico).

$ cd /source/data-gov/1000/version/2010-Aug-30/source
$ ls
  > state_single_nm.zip
  > state_single_nm.zip.pml.ttl
$ punzip.sh state_single_nm.zip
$ ls
  > state_single_nm.zip
  > state_single_nm.zip.pml.ttl
  > Facility_State_File_Documentation_0401_2010.pdf
  > Facility_State_File_Documentation_0401_2010.pdf.pml.ttl
  > STATE_SINGLE_NM.CSV 
  > STATE_SINGLE_NM.CSV.pml.ttl

The results of these commands can be found on LOGD at STATE_SINGLE_NM.CSV.pml.ttl and Facility_State_File_Documentation_0401_2010.pdf.pml.ttl.

4 - Manual modification of original data file (coarse level)

It is a primary objective of our aggregation process to minimize manual modifications throughout. The structure assistance parameters provided by csv2rdf4lod allow us to describe which non-data regions of an original tabular file should be avoided and how certain common interpretations should be applied -- including adding or renaming column headers. This declarative approach reduces manual labor, preserves the state of the original data as we received it, and decreases the risk of human error.

Unfortunately, not all modifications are supported, and some situations require manual effort before curation of a dataset can continue.

For example, because csv2rdf4lod requires 3-star data rather than proprietary 2-star data, $CSV2RDF4LOD_HOME/bin/util/xls2csv.sh can be used to convert Excel formats to CSV. The following commands retrieve (with provenance) an Excel file, convert it to CSV, and record the provenance of the CSV file.

$ cd source/data-gov/4383/version/2011-Nov-29/source
$ pcurl.sh http://explore.data.gov/download/wfna-38ey/XLS -n STELPRDC5087258 -e xls
 (creates STELPRDC5087258.xls and STELPRDC5087258.xls.pml.ttl)
$ cd source/data-gov/4383/version/2011-Nov-29/
$ mkdir manual/
$ xls2csv.sh source/STELPRDC5087258.xls -od manual/
 (creates manual/STELPRDC5087258_tblFMDirectory2011.xls.csv)
$ justify.sh source/STELPRDC5087258.xls manual/STELPRDC5087258_tblFMDirectory2011.xls.csv xls2csv
 (creates manual/STELPRDC5087258_tblFMDirectory2011.xls.csv.pml.ttl)

As an older example, the White House visitors list released 24 September 2010 changed from a comma-separated file to a tab-separated file. Before the cell delimiter enhancement was available, we needed to change the delimiter manually. Because an extra step is involved, we need to account for it in the provenance of our conversion.

The following four Unix commands illustrate how this would be achieved (though the cell delimiter enhancement should be used now). First, the manual/ directory is created to contain any data files that required manual intervention before continuing to the conversion process. Next, the original data file that we obtained from the White House website is used to create a modified analog in the manual/ directory. Finally, we record the association between the original and its modified analog using the justify.sh script, which produces the justifications in manual/WhiteHouse-WAVES-Released-0910.csv.pml.ttl.

$ mkdir manual
$ cat source/WhiteHouse-WAVES-Released-0910.csv | tr '\t' ',' > manual/WhiteHouse-WAVES-Released-0910.csv
$ justify.sh source/WhiteHouse-WAVES-Released-0910.csv manual/WhiteHouse-WAVES-Released-0910.csv tab2comma
$ less manual/WhiteHouse-WAVES-Released-0910.csv.pml.ttl

The results of these commands can be found on LOGD at WhiteHouse-WAVES-Released-0910.csv.pml.ttl.
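
The justification recorded by justify.sh can be pictured as a PML node set whose conclusion (the modified file) is the consequence of an inference step driven by a "tab2comma" engine. The sketch below uses assumed URIs and structure and is not the script's literal output.

@prefix pmlp: <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlj: <http://inference-web.org/2.0/pml-justification.owl#> .

# Sketch: the manual file is justified as a consequence of a "tab2comma" step.
[] a pmlj:NodeSet ;
   pmlj:hasConclusion <manual/WhiteHouse-WAVES-Released-0910.csv> ;
   pmlj:isConsequentOf [
      a pmlj:InferenceStep ;
      pmlj:hasInferenceEngine [ a pmlp:InferenceEngine ; pmlp:hasName "tab2comma" ]
   ] .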

5 - Associating the RDF dump file from csv2rdf4lod's invocation with its tabular data file and parameter inputs

As described in conversion cockpit, the RDF from the conversion is placed in the automatic/ directory, in a file named to correspond to the input file from either source/ or manual/. For example, when converting source/STATE_SINGLE_PW.CSV using the conversion parameters manual/STATE_SINGLE_PW.CSV.e1.params.ttl, the output appears in automatic/STATE_SINGLE_PW.CSV.e1.ttl and automatic/STATE_SINGLE_PW.CSV.e1.void.ttl after pulling the conversion trigger. The first file contains the RDF created from the tabular input, while the .void.ttl file contains metadata describing the conversion process. (The .void.ttl extension is used for historical reasons; it was chosen before other types of metadata and provenance were included.) The metadata includes:

  • RDF property definitions
    • (e.g. e1:frs_facility_detail_report_url a rdf:Property)
  • RDFS class definitions
    • (e.g. ds1008_vocab:EPA_Region a rdfs:Class , owl:Class .)
  • VoID dataset hierarchy
  • Creation and modification dates
    • (dcterms:modified "2011-11-28T15:13:44.627-05:00"^^xsd:dateTime)
  • Preferred namespace prefixes
    • (e.g. [] vann:preferredNamespacePrefix "pmlp"; vann:preferredNamespaceUri "http://inference-web.org/2.0/pml-provenance.owl#")
  • Provenance of the conversion process (inputs, outputs, DOAP of conversion utility, time of invocation)
  • Copy of conversion parameters
  • Count of number of triples in dataset
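
Gathered into one Turtle fragment, the examples above look roughly like this; the e1:, ds1008_vocab:, and dataset URIs are assumptions, as is the triple count.

@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix vann:    <http://purl.org/vocab/vann/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix e1:           <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/vocab/enhancement/1/> . # assumed
@prefix ds1008_vocab: <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/vocab/> .               # assumed

e1:frs_facility_detail_report_url a rdf:Property .    # property definition
ds1008_vocab:EPA_Region a rdfs:Class , owl:Class .    # class definition

<http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Aug-30>   # assumed dataset URI
   a void:Dataset ;
   dcterms:modified "2011-11-28T15:13:44.627-05:00"^^xsd:dateTime ;
   void:triples 12345 .                                # assumed count

[] vann:preferredNamespacePrefix "pmlp" ;
   vann:preferredNamespaceUri "http://inference-web.org/2.0/pml-provenance.owl#" .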

This conversion metadata is an implicit part of the dataset and is kept as a separate subset.

6 - Enhanced predicate creation lineage

See the mashathon example.

The conversion:enhances raw:frs_facility_detail_report_url statement in the following description of e1:frs_facility_detail_report_url allows any system working with the original raw:frs_facility_detail_report_url to find out that newer layers describe the same resources described in the original table. In this example, the enhancement 1 layer changes the range from a Literal to a Resource.

These descriptions reside in the same automatic/*.e1.void.ttl metadata file described in the previous section.

e1:frs_facility_detail_report_url 
  a rdf:Property ;
  ov:csvRow "1"^^xsd:integer ;
  ov:csvCol "1"^^xsd:integer ;
  ov:csvHeader "FRS_FACILITY_DETAIL_REPORT_URL" ;
  conversion:enhancement_layer "1" ;
  dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Aug-30/conversion/enhancement/1> ;
  rdfs:label "FRS_FACILITY_DETAIL_REPORT_URL" ;
  rdfs:range rdfs:Resource ;
  conversion:enhances raw:frs_facility_detail_report_url .

7 - Associating data triples to the tabular cell that caused them

This is a prototype feature that needs motivating use cases to develop further. The [CSV2RDF4LOD environment variable](CSV2RDF4LOD environment variables) CSV2RDF4LOD_CONVERT_PROVENANCE_GRANULAR enables it, but it has not been exercised since the proof of concept Transparent funding: Using provenance to justify mashup values. The demonstration is also discussed on inference-web.org.
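
As a speculative sketch only (the prototype's actual output shape is not documented here), cell-level provenance could tie a reified data triple back to the row and column of the cell that produced it, reusing the ov:csvRow and ov:csvCol properties shown in the previous section; all URIs below are assumptions.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ov:  <http://open.vocab.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Speculative sketch: a reified data triple annotated with its originating cell.
[] a rdf:Statement ;
   rdf:subject   <http://example.org/facility/123> ;                         # assumed
   rdf:predicate <http://example.org/vocab/frs_facility_detail_report_url> ; # assumed
   rdf:object    <http://example.org/report/123> ;                           # assumed
   ov:csvRow "42"^^xsd:integer ;
   ov:csvCol "1"^^xsd:integer .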

8 - Populating named graph in LOGD SPARQL endpoint

See Named graphs that know where they came from for a description of how provenance is loaded into a named graph while that named graph is populated with an RDF file from the web.
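
The gist can be sketched as Turtle stored inside the named graph itself, so that the graph describes where it came from; the graph URI, dump file URL, property choices, and load time below are all assumptions.

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Sketch: statements loaded into the named graph, describing the graph itself.
<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30>   # assumed graph URI
   dcterms:source <http://example.org/STATE_SINGLE_NM.CSV.e1.ttl> ;         # the dump file loaded (assumed URL)
   dcterms:date "2010-08-30T12:00:00-05:00"^^xsd:dateTime .                 # assumed load time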

Where to go from here

  • Provenance for a listing of provenance-related pages and topics.

Historical: Some links to previous material:
