
conversion:interpret

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

What we will cover

conversion:interpret overrides particular input values with predetermined replacements. This avoids having to modify the original input file to prepare it for conversion, which helps maintain the fidelity of the original source.

conversion:interpret can be used to decipher codes, interpret symbols as null (for all columns or a specific column), clean up stray values, and apply regular-expression rewrites. Each of these uses is covered below.

Let's get to it

Deciphering codes

Codes are often used to abbreviate longer, more meaningful values. For example, one dataset uses "P", "S", and "H" to stand for "President", "Senate", and "House". Though this may be convenient for those who know the dataset intimately, it makes the data harder to use for everyone else. The following enhancements make these myopic identifiers understandable to a much larger world-wide community:

  conversion:enhance [
     ov:csvCol         1;
     conversion:interpret [
        conversion:symbol         "S";
        conversion:interpretation <http://dbpedia.org/resource/United_States_Senate>;
     ];
     conversion:interpret [
        conversion:symbol         "H";
        conversion:interpretation <http://dbpedia.org/resource/United_States_House_of_Representatives>;
     ];
     conversion:interpret [
        conversion:symbol         "P";
        conversion:interpretation <http://dbpedia.org/resource/President_of_the_United_States>;
     ];
  ];
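
As a sketch of the intended effect (the subject and predicate names below are hypothetical, not drawn from a real dataset), a cell containing "S" in column 1 is promoted from an opaque literal to the DBpedia resource:

    # raw (naive) layer: the code survives as an opaque literal
    ds:thing_1 raw:branch "S" .

    # enhanced layer, given the interpretations above (sketch):
    ds:thing_1 e1:branch <http://dbpedia.org/resource/United_States_Senate> .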

Interpret as null (in all columns)

The most popular use of this enhancement is to omit empty-string values in a table's cells. They are kept in the naive interpretation to stay as faithful to the original data as possible, but they often mean nothing and just clutter the output.

Certain values are used to express that there is no value for a relationship. These can be ignored by setting the "interpret as null" enhancement parameter, so that the null values do not interfere with the actual values. Triples are not asserted for values that should be interpreted as null. The null value can be interpreted for all columns or for a specific column.

Note that this structure is also used by the #Codebook Resource Promotion parameter, but there it is used by an enhancement rather than by the conversion process.

e.g., Dataset 1530

:dataset a void:Dataset;
   conversion:base_uri           "http://logd.tw.rpi.edu"^^xsd:anyURI;
   conversion:source_identifier  "data-gov";
   conversion:dataset_identifier "1530";
   conversion:dataset_version    "2009-May-18";
   conversion:conversion_process [
      conversion:interpret [
         conversion:symbol "-", "- ";
         conversion:interpretation conversion:null;
      ];
   ];
.
For example, the raw (naive) conversion of Dataset 1530 includes triples such as:

    @prefix raw: <http://logd.tw.rpi.edu/source/data-gov/dataset/1530/vocab/raw/> .

    ds1530:thing_1    raw:organization "-" .
    ds1530:thing_2538 raw:closed_date  "- " .

With the interpretation above, each of these becomes:

    (no triple asserted)

Other datasets that benefit from this enhancement include Health Information National Trends Survey 2005 ("#NULL!"), Dataset 10030 (" - "), Dataset 1330 ("?? Total").

An interesting extension to this enhancement would be to add a pattern for what to interpret as null.
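
For instance, a hypothetical parameter (not part of the current vocabulary) might pair conversion:regex, which the Applying regex section below uses for rewrites, with conversion:null:

   conversion:interpret [
      # hypothetical: treat any dash-only cell, optionally padded with spaces, as null
      conversion:regex          "^\\s*-+\\s*$";
      conversion:interpretation conversion:null;
   ];

This is only a sketch of the suggested extension; the converter does not currently accept a pattern here.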

Interpret as null (in a specific column)

The above example showed how to interpret a symbol as null for all columns. The same behavior can be restricted to a specific column by moving the interpretation into that column's enhancement.

 :dataset a void:Dataset;
    conversion:base_uri           "http://logd.tw.rpi.edu"^^xsd:anyURI;
    conversion:source_identifier  "data-gov";
    conversion:dataset_identifier "1530";
    conversion:dataset_version    "2009-May-18";
    conversion:conversion_process [
       conversion:enhance [
          ov:csvCol 1;
          conversion:interpret [
             conversion:symbol "-", "- ";
             conversion:interpretation conversion:null;
          ];
       ];
    ];
 .
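
As a sketch of the difference from the all-columns form (assuming, for illustration, that raw:organization comes from column 1 and raw:closed_date from some other column), only the column-1 symbol is dropped:

    ds1530:thing_1    raw:organization "-" .    # column 1: no triple asserted in the enhanced output
    ds1530:thing_2538 raw:closed_date  "- " .   # another column: the literal comes through unchanged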

Other datasets that benefit from this enhancement include Dataset 1491.

Cleaning up some values

A not-so-elegant use of this enhancement is to tweak a handful of values into other values.

For example, NITRD appended footnote markers to some agency names. We can tidy them up. This obviously only makes sense for a small number of values, and it begs for a more generic value-tweaking mechanism (we just haven't seen enough need for it yet).

      conversion:enhance [
         ov:csvCol         1;
         conversion:interpret [
            conversion:symbol "NIH 2";
            conversion:interpretation "NIH";
         ];
         conversion:interpret [
            conversion:symbol "DOE 2";
            conversion:interpretation "DOE";
         ];
      ];

Applying regex

The "doi:" prefix can be removed when processing the following input with conversion:interpret.

"David Tilman","doi:10.6073/AA/knb-lter-cdr.157002.122"
      conversion:enhance [
         ov:csvCol          2;
         ov:csvHeader       "doi";
         conversion:equivalent_property bibo:doi;
         conversion:interpret [
            conversion:regex          "^doi:";
            conversion:interpretation "";
         ];
         conversion:range   rdfs:Literal;
      ];

results in:

    <http://dx.doi.org/10.6073/AA/knb-lter-cdr.157002.122>
       dcterms:author "David Tilman" ;
       bibo:doi "10.6073/AA/knb-lter-cdr.157002.122" ;

(see also conversion:object_search)

Why does the output RDF have conversion:symbol and conversion:interpretation?

Although this output might look like a bug with conversion:symbol and conversion:interpretation, it is actually intended:

    typed_agency:NIH 
        dcterms:identifier "NIH 2" ;
        a federal_research_and_development_budget_for_networking_and_information_technology_vocab:Agency ;
        conversion:symbol "NIH 2" ;
        conversion:interpretation "NIH" ;

* Subsequent enhancement parameters can point to this dataset and get the symbol/interpretation pairings.
* Subsequent enhancement parameters can also point to this dataset to get dcterms:identifiers during ObjectSameAsLinking.

See also: $CSV2RDF4LOD_HOME/bin/util/symbol-interpretation.awk

Queries

See which input values are interpreted differently according to enhancement parameters (results):

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct *
WHERE {
  graph <http://purl.org/twc/vocab/conversion/ConversionProcess> {
    ?layer
       conversion:conversion_process [
          conversion:interpret [
            conversion:symbol         ?symbol;
            conversion:interpretation ?interp
          ];
       ]
    .
  }
}

For some more details on implementation and utilities, see Codebook enhancements.

What is next
