Provenance

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

Measure of care and attention

http://altmetrics.org/workshop2011/

How many times has a dataset been converted (results)? More invocations of the converter correlates with more human care and attention paid to the results. This must, of course, be viewed in light of the distribution (and minimum) across all other datasets. The integral from negative infinity up to the point on the curve where dataset d sits can serve as a measure of how much care it has received relative to the rest (cf. percentile). The winner for LOGD is clearly the NITRD conversion, since there were 12 tables, each requiring cell-based conversion and some converter enhancements (and debugging) to accomplish.

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT DISTINCT ?version ?logs
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?version conversion:num_invocation_logs ?logs
  }
}
ORDER BY DESC(?logs)

Grouping the previous results by source (results) shows that LOGD still spends most of its attention on data.gov data (note: nitrd is missing for some reason):

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?contributor (COUNT(*) AS ?logs)
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?version conversion:num_invocation_logs ?numLogs ;
             dcterms:contributor            ?contributor .
  }
}
GROUP BY ?contributor
ORDER BY DESC(?logs)

Proof Markup Language

PML predicate use distribution (results):

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?p (COUNT(*) AS ?count)
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    [] ?p []
  }
  FILTER(REGEX(STR(?p), '^http://inference-web.org/2.0'))
}
GROUP BY ?p
ORDER BY DESC(?count)

PML class use distribution (results):

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?type (COUNT(*) AS ?count)
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    [] a ?type
  }
  FILTER(REGEX(STR(?type), '^http://inference-web.org/2.0'))
}
GROUP BY ?type
ORDER BY DESC(?count)

Attribution

Where am I mentioned in LOGD's csv2rdf4lod instance? (results):

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT DISTINCT ?s ?p ?o
WHERE {
  { GRAPH ?g {
      ?s ?p <http://tw.rpi.edu/instances/TimLebo>
  } }
  UNION
  { GRAPH ?g {
      <http://tw.rpi.edu/instances/TimLebo> ?p ?o
  } }
}
ORDER BY ?s ?o ?p

Identifying a dataset's source data using retrieval and conversion provenance

(example being developed)

Create an OWL property chain to assert dcterms:source:

  • file C was downloaded and used to create a void:subset T of conversion:Dataset D.
  • T dcterms:source C .
  • D dcterms:source C .
dcterms:source owl:propertyChainAxiom (
    :todo
  ) .
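
One way the chain could read, assuming the subset link is expressed with void:subset and the file-level link with dcterms:source (a sketch under those assumptions, not the final axiom):

@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# If D void:subset T and T dcterms:source C, a reasoner infers D dcterms:source C.
dcterms:source owl:propertyChainAxiom ( void:subset dcterms:source ) .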

Crediting data catalog for discovery of a dataset

This can be done by looking at the beginning of the irw:redirects chain in pcurl.sh's provenance.
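
A sketch of such a query, assuming pcurl.sh's provenance records each hop with irw:redirects and that the IRW namespace below is the one intended (both are assumptions here):

PREFIX irw: <http://www.ontologydesignpatterns.org/ont/web/irw.owl#>
SELECT DISTINCT ?catalogPage ?retrievedUrl
WHERE {
  ?catalogPage irw:redirects+ ?retrievedUrl .
  # Keep only the start of each redirect chain, i.e. the catalog page that led to the dataset.
  FILTER NOT EXISTS { ?earlier irw:redirects ?catalogPage }
}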

Using file hashes

Timestamped file hashes are used to describe instances of pmlp:Information.

By comparing the timestamped hash of the source/ file at justify.sh time (source -> manual) with the one recorded at pcurl.sh time, we can identify inconsistencies in the source/ file.
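
A sketch of the comparison, using hypothetical properties (ex:hashAtRetrieval, ex:hashAtJustify) to stand in for however the two timestamped hashes are actually attached:

PREFIX ex: <http://example.org/vocab#>   # hypothetical vocabulary, for illustration only
SELECT ?file ?retrievalHash ?justifyHash
WHERE {
  ?file ex:hashAtRetrieval ?retrievalHash ;   # hash recorded at pcurl.sh time (assumed)
        ex:hashAtJustify   ?justifyHash .     # hash recorded at justify.sh time (assumed)
  FILTER(?retrievalHash != ?justifyHash)      # a mismatch flags an inconsistent source/ file
}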

Data "freshness"

(from Andrea Splendiani on [email protected])

we would need one extra piece of information: how fresh the information is.

Do you know if there is any standard metadata to indicate the last refresh of the endpoint content? Technically speaking, this kind of information should be associated with the data as provenance. In practice, however, 90% of the utility can be reached by having some state information for each big graph in the endpoint, corresponding to the major data sources.

In practice it would be nice to have a standard dictionary so that we can ask the triplestore for the list of graphs/datasets.

For each of these (or for the endpoint itself, if it holds information that is "coherent" source-wise), we would want (see the sketch after this list):

  • update frequency
  • last update
  • data source (type and, where available, a link).
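
One way to publish these three items per named graph today, using Dublin Core and VoID terms (a sketch with an illustrative graph URI; the proposals collected at DatasetDynamics, below, offer richer alternatives):

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Illustrative graph/dataset URI.
<http://example.org/graph/some-big-source> a void:Dataset ;
    dcterms:accrualPeriodicity "monthly" ;                  # update frequency
    dcterms:modified "2011-09-30"^^xsd:date ;               # last update
    dcterms:source <http://example.org/original-download> . # data source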

Matthew Gamble mentions the myriad proposals for capturing this metadata: http://www.w3.org/wiki/DatasetDynamics.

Heterogeneous provenance

Incorporating first-party provenance into the csv2rdf4lod workflow: how do the two connect?

Examples of first-parties that provide some provenance (or could, with an email, some hot cocoa, and perhaps a puppy).

  • NCBI (e.g., the 20-row example)
  • Flu db (aggregating)
  • CHSI (aggregating)
  • Impact teen

Data dictionary mentions it came from a sensor, something was aggregated, etc.

Questions to answer:

  • How does a publisher express it?
  • How would a third party annotate isolated data to claim its source?

Triple-level provenance

http://www.geonames.org/data-sources.html lists a couple of dozen sources.

JWS section

Collecting, converting, enhancing, and republishing data raises important concerns about the integrity of the resulting data products and the applications that use them. For TWC LOGD, this is especially true as a university aggregating a variety of governmental data from both government and non-government sources -- each with its own degree of authority and trustworthiness. To address this, we have incorporated provenance capture to facilitate transparency by allowing data consumers to inspect and query the lineage of the data products we produce. The ability to accurately cite original data source organizations is a direct consequence of this transparency, lending improved credibility to derived data products and applications. This additional metadata also gives credit where credit is due to those who have spent a lot of time and energy creating the original data [EOS, TRANSACTIONS, AMERICAN GEOPHYSICAL UNION 91 p. 297-298].

Provenance in the LOGD workflow begins with the naming of the dataset. Short identifiers for the “source”, “dataset”, and “version” are central to the construction of the dataset’s URI and implicitly place it within the provenance context of who, what, and which. The URL from which each government data file is retrieved is captured at retrieval time and encoded using the Proof Markup Language (PML) [5]. Although relatively simple, this information is critically valuable for associating any subsequent data products with their authoritative source. Through the rest of the LOGD workflow, data products are organized into those produced automatically (and repeatably) and those influenced by manual effort (and less repeatably), with their causal associations captured and encoded using PML. The development of a converter capable of interpreting the tabular structure of CSV formats according to declarative parameters [http://doi.acm.org/10.1145/1839707.1839755] was essential for minimizing the amount of manual modification of the original government files.
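
For illustration, with made-up identifiers, a versioned dataset URI takes the general form:

http://logd.tw.rpi.edu/source/epa-gov/dataset/fish-advisories/version/2011-Jan-05

where epa-gov answers who, fish-advisories answers what, and 2011-Jan-05 answers which.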

In addition to capturing provenance among whole files, the csv2rdf4lod converter provides provenance at the level of individual triples. This ability was motivated by previous analysis of user-based trust in semantic-based mashups [http://dx.doi.org/10.1007/978-3-642-17819-1_21]. It allows inquiry and inspection at the assertion level, such as “How do you know that the UK gave Ethiopia $107,958,576 USD for Education in 2007/8?” The following figure shows one web application leveraging this granular provenance. Clicking the text “oh yeah?” in the table invokes a SPARQL DESCRIBE query on the triple’s subject and predicate, causing provenance fragments from the original CSV’s rows and columns to be combined to identify the original spreadsheet’s URL, the cell that caused the triple, the interpretation parameters applied, and the author of the parameters.
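
The query behind that click is essentially a DESCRIBE over the two terms; a minimal sketch, with placeholder URIs standing in for the clicked triple's actual subject and predicate:

# Placeholders; the application substitutes the subject and predicate of the clicked triple.
DESCRIBE <http://example.org/observation/uk-ethiopia-2007-8>
         <http://example.org/vocab/education_funding_usd>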

Rescued from LOGD chapter

Provenance in the LOGD workflow begins with naming the dataset. Short, descriptive identifiers for the \textbf{source} organization, the \textbf{dataset} itself, and the \textbf{version} retrieved constitute the minimal requirements and frame the aggregation process around {\em who} is providing {\em what} data, and {\em which} concrete form was retrieved. These same values are used to construct the datasets' URIs, implicitly encoding provenance of datasets and their entities directly within their names\footnote{These three attributes are also encoded explicitly using the conversion's RDFS/OWL vocabulary.}\footnote{Although URIs should be ``opaque'', we assert that every available opportunity to aid any data consumer should be exploited. This includes developers that may be viewing RDF serializations and query results.}. The recommended practice of reusing identifiers from the source organization further enables data consumers to associate Linked Data from TWC LOGD with its original source, without relying on the explicit associations that we also provide. In the absence of the source's identifiers, the recommended practice of naming the version according to publish date, last-modified date, and retrieval date provides a preferred order of naming techniques that can also aid consumers in identifying {\em which} dataset was retrieved. Further, because the URIs created during the workflow are resolvable according to Linked Data principles, they cite the curator as the owner, which is expressed each time they are resolved by a Linked Data agent.

When actually retrieving source data files, the requested URL, time requested, and user account initiating retrieval are recorded. This contextual information permits data consumers to reproduce our actions and compare their results to ours. Although the need for manual modifications is minimized by a parameterized and automated converter, the variability of input formats makes manual modification inevitable. In these situations, the adjusted result is stored separately without modifying the original, and the associations between adjusted and original are captured. Providing these intermediate results allows data consumers to compare the input and output of the potentially {\em ad hoc} adjustment process. Further, the user account of the person performing the adjustment is also captured, providing the data consumer an additional consideration when inspecting the workflow. As mentioned earlier, the development and use of a conversion utility guided by external transformation parameters minimizes custom code, reduces human error, provides uniformity, and maximizes reproducibility. Providing the conversion tool, inputs, and parameters on the web and describing its invocation with RDF assertions permits data consumers to reproduce the process and compare our cached results with their own. A salient advantage of using {\tt csv2rdf4lod} is the wealth of metadata, dataset-level provenance, self-description, and optional triple-level provenance\footnote{The triple-level provenance that {\tt csv2rdf4lod} provides is reification-based, so the size of the provenance encoding is a function of the sum, {\em not} the product, of the table's rows and columns.} that it provides with no additional user effort. Human involvement in the conversion is captured by recording the user account initiating the conversion as well as the user account that created and modified the external transformation parameters provided to the conversion utility. Again, these references can be used by a data consumer when evaluating the quality of results or acknowledging contributions to subsequent results.
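
A schematic illustration of that footnote's point (hypothetical URIs and linking properties, not csv2rdf4lod's exact encoding): each reified statement points at shared row and column resources, so full provenance descriptions are needed only once per row and once per column.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/vocab#> .

# Schematic only; ex: names are hypothetical, not csv2rdf4lod's actual vocabulary.
ex:stmt_r42_c7 a rdf:Statement ;
    rdf:subject   ex:uk_to_ethiopia_2007_8 ;
    rdf:predicate ex:education_funding_usd ;
    rdf:object    107958576 ;
    ex:derivedFromRow    ex:row_42 ;    # one shared description per row ...
    ex:derivedFromColumn ex:column_7 .  # ... plus one per column, hence sum, not product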

The final step of the workflow is loading the dataset's RDF dump file into the named graph of a SPARQL endpoint. Because the results of this step establish the first interaction with the data consumer, capturing its provenance is paramount and its encoding is stored {\em directly} in the named graph with the loaded data. This provenance cites the dump file retrieved, which is justified by {\tt csv2rdf4lod} during the conversion process and completes the full trail from a SPARQL query into a named graph, through the workflow, and back to the originating source organization.

Attribution example

From http://hints.cancer.gov/dataset.jsp, http://hints.cancer.gov/agreement.jsp?selected=2007SAS:

HINTS Data Terms of Use

It is of utmost importance to ensure the confidentiality of survey participants. Every effort has been made to exclude identifying information on individual respondents from the computer files. Some demographic information such as sex, race, etc., has been included for research purposes. NCI expects that users of the data set will adhere to the strictest standards of ethical conduct for the analysis and reporting of nationally collected survey data. It is mandatory that all research results be presented/published in a manner that protects the integrity of the data and ensures the confidentiality of participants.

In order for the Health Information National Trends Survey (HINTS) to provide a public-use or another version of data to you, it is necessary that you agree to the following provisions.

   1. You will not present/publish data in which an individual can be identified. Publication of small cell sizes should be avoided.
   2. You will not attempt to link nor permit others to link the data with individually identified records in another database.
   3. You will not attempt to learn the identity of any person whose data are contained in the supplied file(s).
   4. If the identity of any person is discovered inadvertently, then the following should be done;
         1. no use will be made of this knowledge,
         2. the HINTS Program staff will be notified of the incident,
         3. no one else will be informed of the discovered identity.
   5. You will not release nor permit others to release the data in full or in part to any person except with the written approval of the HINTS Program staff.
   6. If accessing the data from a centralized location on a time sharing computer system or LAN, you will not share your logon name and password with any other individuals. You will also not allow any other individuals to use your computer account after you have logged on with your logon name and password.
   7. For all software provided by the HINTS Program, you will not copy, distribute, reverse engineer, profit from its sale or use, or incorporate it in any other software system.
   8. The source of information should be cited in all publications. The appropriate citation is associated with the data file used. Please see Suggested Citations in the Download HINTS Data section of this Web site, or the Readme.txt associated with the ASCII text version of the HINTS data.
   9. Analyses of large HINTS domains usually produce reliable estimates, but analyses of small domains may yield unreliable estimates, as indicated by their large variances. The analyst should pay particular attention to the standard error and coefficient of variation (relative standard error) for estimates of means, proportions, and totals, and the analyst should report these when writing up results. It is important that the analyst realizes that small sample sizes for particular analyses will tend to result in unstable estimates.
  10. You may receive periodic e-mail updates from the HINTS administrators.

http://projects.iq.harvard.edu/datacitation_workshop/pages/attendees

DFID moving example

Sure, the first provenance broke. But subsequent versions of the same dataset were retrieved using the newer URL. TODO: figure out how to recognize this and recover from it. http://logd.tw.rpi.edu/source/dfid-gov-uk/dataset_page/statistics-on-international-development-2009

Formerly known as

http://oas.samhsa.gov/WebOnly.htm#NSDUHtabs

Detailed Tables
National Survey on Drug Use & Health
formerly called the Household Survey on Drug Abuse (NHSDA) 

Workshops

Related work

http://lists.w3.org/Archives/Public/public-rdf-prov/2011Oct/0000.html

How do I refer to the quads that state that a triple was published at a web address yesterday?

Why would you use quads for that? You use triples. You put the triples into a named graph. The name of that graph can be used to refer to the assertion. For example, :G1 below is a name for the graph containing triples that state that {:s1 :p1 :o1.} was published yesterday at <http://example.com/>.

:G1 {
 [] a :Publishing;
    :date "2011-09-30"^^xsd:date;
    :webAddress <http://example.com/>;
    :triples :G2.
}
:G2 {
 :s1 :p1 :o1.
}
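
Building on that example, a query that answers "when and where was :s1 :p1 :o1 published?" by following the graph name (a sketch using the same undeclared prefixes as the example above, and assuming the store exposes both graphs):

SELECT ?date ?webAddress
WHERE {
  GRAPH ?provGraph {
    ?pub a :Publishing ;
         :date ?date ;
         :webAddress ?webAddress ;
         :triples ?dataGraph .
  }
  # The data graph must actually contain the triple in question.
  GRAPH ?dataGraph { :s1 :p1 :o1 . }
}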

http://www.nass.usda.gov/Data_and_Statistics/Citation_Request/index.asp

http://www4.wiwiss.fu-berlin.de/bizer/ldif/ and http://www.assembla.com/code/ldif/git/nodes/ldif/ldif-core/src/main/resources/owl/provenance.owl

Olaf and Jun's extension to PROV-O: http://www.w3.org/mid/CAExK0De7Dyhd4Feus5wwrn3NngLiPNbiumTwY+NekdeZPBZ6PA@mail.gmail.com ; paper: http://events.linkeddata.org/ldow2012/papers/ldow2012-paper-03.pdf

http://www.w3.org/mid/[email protected] SCoRO, the Scholarly Contributions and Roles Ontology

Simon: an ontology just published by the Press Association for representing news. It bears similarities with PROV, which I guess isn't surprising as it's also about "what has happened". http://data.press.net/ontology/

[1] "lift" used here in the sense described by: R. V. Guha, Contexts: A Formalization and Some Applications, Ph.D. Thesis, Stanford University, 1995, linked from http://www-formal.stanford.edu/guha/.

http://nanopub.org/wordpress/?page_id=57

Tracing where and who provenance in Linked Data: A calculus
