
Design Objective: Capturing and Exposing Provenance


See Provenance.

Introduction

Collecting, converting, enhancing, and republishing data raises important concerns about the integrity of the resulting data products and the applications that use them. For TWC, this is especially true because we are a university that aggregates a variety of governmental data from both government and non-government sources -- each with its own degree of authority and trustworthiness. To address this, we strive to provide a transparent process that allows data consumers to inspect the lineage of the data products we publish. The ability to accurately cite original data source organizations is a direct benefit of this transparency, improving the credibility of derived data products and applications. It also gives credit where credit is due to those who have spent a lot of time and energy creating the original data. The following figure illustrates a variety of challenges that arise when aggregating data in a geographically and organizationally distributed system.

Provenance captured during conversion

Our aggregation process captures provenance during the following activities:

  • 1 - Following URL redirects from a dataset listing page to the actual data file download
  • 2 - Retrieving the data file
  • 3 - Extracting files from a zip file
  • 4 - Manual modification of original data file (coarse level)
  • 5 - Associating the RDF dump file from csv2rdf4lod's invocation with its tabular data file and parameter inputs
  • 6 - Enhanced predicate creation lineage
  • 7 - Associating data triples to the tabular cell that caused them
  • 8 - Populating named graph in LOGD SPARQL endpoint
*Diagram of provenance captured during csv2rdf4lod conversion.*

1 - Following URL redirects from a dataset listing page to the actual data file download

Data.gov survey: data source modification dates describes an example of how URL redirects are followed from data.gov's details page (e.g. http://www.data.gov/details/1000) to the actual data URL available from the hosting government agency (e.g. http://www.epa.gov/enviro/html/frs_demo/geospatial_data/state_files/state_single_nm.zip). The figure used in that discussion is reproduced below. This trace justifies the dataset's connection to data.gov and shows how we came to find the data that we retrieved. On a more practical level, it also allows us to associate the dataset with data.gov in a general fashion. Since csv2rdf4lod is agnostic to the source of information, it does not hard-code a dataset as being "from data.gov" or from any other source you may be interested in curating -- it simply accepts short identifiers for the source, dataset, and version so that it can construct URIs for the entities it creates. Although the source identifier is used to create a URI for the source that can be described in detail later, and csv2rdf4lod allows direct annotation within the enhancement parameters themselves, manually completing such descriptions for thousands of datasets would take an unreasonable amount of time. Captured provenance provides the connection from a downloaded file (and its resulting dataset) back to how we found out about the file in the first place, with minimal effort.


The trace illustrated below is captured using the data.gov-specific script bin/dg-create-dataset-dir.sh, which is run from the directory /source/data-gov/ using csv2rdf4lod's [directory structure convention](Directory Conventions). The script combines bin/util/pcurl.sh and a custom data.gov details page scraper that asserts the irw:refersTo relationship between the details page and the data file download link. This same approach can be applied to dataset listing pages from organizations other than data.gov to produce the same type of connections.
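As a rough illustration of this step, the sketch below assumes that dg-create-dataset-dir.sh accepts the data.gov dataset number as an argument (an assumption -- consult the script's usage message for the exact invocation); dataset 1000 is the EPA example used later on this page.

# illustrative only: the argument form is an assumption, not confirmed usage
$ cd /source/data-gov/
$ dg-create-dataset-dir.sh 1000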


2 - Retrieving the data file

The provenance of data file downloads is captured by csv2rdf4lod's utility bin/util/pcurl.sh, which wraps the unix curl command to provide the same functionality with one addition -- the creation of an additional metadata file for each download requested. The metadata file contains the provenance of the retrieved file including:

  • Original URL request, its HTTP header, and its reported last-modified dateTime.
  • Final URL redirect, its HTTP header, and its reported last-modified dateTime.
  • The version and md5 of the curl used to retrieve the URL.
  • The dateTime at which the final URL was retrieved.
  • An association between the downloaded file and the final URL from which it was retrieved.
  • The md5 of the downloaded file.

The following commands illustrate how to use pcurl.sh to capture the retrieval of the White House visitor list. First, the source/ directory is created to hold an exact copy of the file we retrieved from the White House (see Conversion cockpit). Although the location of this directory follows csv2rdf4lod's "source - dataset - version" naming convention, pcurl.sh can be used anywhere to request any URL. Next, we go into the source directory and call pcurl.sh just as we would call curl. After it completes, we see the retrieved file and its provenance in the same directory. pcurl.sh also accepts the -I parameter to retrieve only the HTTP header information.

$ mkdir -p /source/whitehouse-gov/visitor-records/version/0910/source
$ cd /source/whitehouse-gov/visitor-records/version/0910/source
$ pcurl.sh http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0910.csv
$ ls
  > WhiteHouse-WAVES-Released-0910.csv
  > WhiteHouse-WAVES-Released-0910.csv.pml.ttl

The results of these commands can be found on LOGD at WhiteHouse-WAVES-Released-0910.csv and WhiteHouse-WAVES-Released-0910.csv.pml.ttl. For a more detailed discussion on dataset naming, dataset versioning, and the file directory structure used by csv2rdf4lod, see Getting started with csv2rdf4lod to create verbatim RDF conversions of tabular data.
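Because the provenance file is plain Turtle, any RDF parser can be used to sanity-check it. For example, assuming the Raptor utilities are installed (an assumption; any Turtle-aware tool works just as well), the following command parses the file and reports the number of triples it contains:

# assumes the Raptor RDF toolkit (rapper) is available on the PATH
$ rapper -i turtle -c WhiteHouse-WAVES-Released-0910.csv.pml.ttl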

3 - Extracting files from a zip file

The provenance of data files extracted from a zip archive is captured by csv2rdf4lod's utility bin/util/punzip.sh, which wraps the unix unzip command in the same way that pcurl.sh wraps curl as described above -- it creates an additional metadata file for each file extracted. The metadata file contains the provenance of the extraction, including:

  • The zip archive last-modified dateTime.
  • The version and md5 of the punzip used to extract the file.
  • The dateTime at which the file was extracted.
  • An association between the extracted file and the zip archive.
  • The md5 of the extracted file.

The following commands illustrate how to use punzip.sh to capture the extraction of files from a zip archive of data.gov's dataset 1000 (EPA Facility Registry System Facilities for the State of New Mexico).


$ cd /source/data-gov/1000/version/2010-Aug-30/source
$ ls
  > state_single_nm.zip
  > state_single_nm.zip.pml.ttl
$ punzip.sh state_single_nm.zip
$ ls
  > state_single_nm.zip
  > state_single_nm.zip.pml.ttl
  > Facility_State_File_Documentation_0401_2010.pdf
  > Facility_State_File_Documentation_0401_2010.pdf.pml.ttl
  > STATE_SINGLE_NM.CSV 
  > STATE_SINGLE_NM.CSV.pml.ttl

The results of these commands can be found on LOGD at STATE_SINGLE_NM.CSV.pml.ttl and Facility_State_File_Documentation_0401_2010.pdf.pml.ttl.
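One quick way to exercise this provenance is to recompute the checksum of an extracted file and confirm that the same hash appears in its provenance file. The sketch below assumes GNU md5sum (use md5 on BSD/Mac OS X) and greps for the raw hash value so that it does not depend on the exact predicate used to record it:

# recompute the checksum of the extracted file (assumes GNU md5sum)
$ md5sum STATE_SINGLE_NM.CSV
# confirm that the same hash value appears in the captured provenance
$ grep -i "$(md5sum STATE_SINGLE_NM.CSV | awk '{print $1}')" STATE_SINGLE_NM.CSV.pml.ttl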

4 - Manual modification of original data file (coarse level)

A primary objective of our aggregation process is to minimize the number of manual modifications made along the way. The structure assistance parameters provided by csv2rdf4lod allow us to describe which non-data regions of an original tabular file should be avoided and how certain common interpretations should be applied -- including adding or renaming column headers. This declarative approach reduces the amount of manual labor, preserves the original data exactly as we received it, and decreases the risk of human error.


Unfortunately, not all modifications are supported, and manual effort is sometimes required to continue developing a dataset. As an example, the most recent (24 September 2010) White House visitor list changed from a comma-separated file to a tab-separated one. In previous versions, we were able to apply csv2rdf4lod interpretations directly to the source file, but because the delimiter (tab) is not currently a parameter in csv2rdf4lod, the tabs had to be replaced manually. This extra step needs to be accounted for in the provenance of our conversion.


The following four unix commands illustrate how this is achieved. First, the manual/ directory is created to contain any data files that required manual intervention before continuing with the conversion process. Next, the original data file obtained from the White House website is used to create a modified analog in the manual/ directory. We then record the association between the original file and its modified analog using the justify.sh script, which writes the justification to manual/WhiteHouse-WAVES-Released-0910.csv.pml.ttl; the last command simply displays the result.


$ mkdir manual
$ cat source/WhiteHouse-WAVES-Released-0910.csv | tr '\t' ',' > manual/WhiteHouse-WAVES-Released-0910.csv
$ justify.sh source/WhiteHouse-WAVES-Released-0910.csv manual/WhiteHouse-WAVES-Released-0910.csv tab2comma
$ less manual/WhiteHouse-WAVES-Released-0910.csv.pml.ttl

The results of these commands can be found on LOGD at WhiteHouse-WAVES-Released-0910.csv.pml.ttl.

5 - Associating the RDF dump file from csv2rdf4lod's invocation with its tabular data file and parameter inputs

csv2rdf4lod captures this provenance as an implicit part of the dataset and keeps it as a separate subset.

6 - Enhanced predicate creation lineage

mashathon example

7 - Associating data triples to the tabular cell that caused them

Transparent funding: Using provenance to justify mashup values

8 - Populating named graph in LOGD SPARQL endpoint

Where to go from here

Historical

Some links to previous material:
