Dataset composition resulting from naming by source, dataset, and version

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

The identifiers for source, dataset, and version are assigned in Conversion process phase: name.

Understanding the void:subset hierarchy

In addition to these special named graphs, there are many named graphs that fall into four categories. These categories are listed from largest to smallest and correspond to their level within the void:subset hierarchy:

Abstract Dataset named graphs contain all of the data triples and all of the metadata for an Abstract Dataset. An Abstract Dataset incorporates all Versioned Datasets that have been created for it. An example instance of an Abstract Dataset is http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records. The Abstract Dataset named graph is populated with zero or more of its (unversioned) Datasets as needed. We accept requests to populate the Abstract Dataset named graphs in the LOGD triple store.
Versioned Dataset named graphs contain all of the data triples and all of the metadata for a Versioned Dataset. A Versioned Dataset incorporates all data triples and metadata from the layers (e.g. "raw", "e1") that have been created for it. Versioned Datasets exist for each Abstract Dataset. Two example instances of a Versioned Dataset are http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510 and http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0810, corresponding to the May and August releases of the White House Visitor Access Records. The LOGD triple store is populated with Versioned Datasets as needed. Requests to do so are accepted.
Layer Dataset named graphs contain all data triples and all of the metadata for a Layer Dataset. The two most popular Layer Datasets are the "raw" and "e1" layers, while additional enhancements would provide layers "e2", "e3", etc. The term layer is used to reflect the parallel predicates that layer additional descriptions on top of the same entities within the dataset -- each layer provides a new set of predicates that enables backward compatibility and incremental adoption. Layer Datasets exist for each Version of a Dataset. Three example instances of a Layer Dataset are http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510/conversion/raw, http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510/conversion/enhancement/1, and http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0810/conversion/raw. The LOGD triple store is populated with Layer Datasets as needed. Requests to do so are accepted.
Dataset Sample named graphs are the smallest type of named graph. They contain a subset of the data triples and all of the metadata for a Layer Dataset. This subset is intended to provide quick access for overview and/or survey analysis applications. Dataset Samples exist for each Layer of each Version of a Dataset. Three example instances of a Dataset Sample are http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510/conversion/raw/subset/sample, http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510/conversion/enhancement/1/subset/sample, and http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0810/conversion/raw/subset/sample. The LOGD triple store is populated with all available Dataset Samples.
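The four kinds of named graph above nest by URI, so each name can be composed from the source, dataset, and version identifiers. The following is a minimal Python sketch of that composition, not code from csv2rdf4lod itself: the function names and the `BASE` constant are hypothetical, and the path segments are inferred only from the example URIs listed above.

```python
BASE = "http://logd.tw.rpi.edu"  # base URI taken from the example URIs above


def abstract_dataset(source, dataset):
    """Named graph of the Abstract (unversioned) Dataset."""
    return f"{BASE}/source/{source}/dataset/{dataset}"


def versioned_dataset(source, dataset, version):
    """Named graph of one Versioned Dataset."""
    return f"{abstract_dataset(source, dataset)}/version/{version}"


def layer_dataset(source, dataset, version, layer="raw"):
    """Named graph of a Layer Dataset.

    `layer` is "raw" or an enhancement number such as 1 (the "e1" layer).
    """
    suffix = "raw" if layer == "raw" else f"enhancement/{layer}"
    return f"{versioned_dataset(source, dataset, version)}/conversion/{suffix}"


def dataset_sample(source, dataset, version, layer="raw"):
    """Named graph of the Dataset Sample of a Layer Dataset."""
    return f"{layer_dataset(source, dataset, version, layer)}/subset/sample"


if __name__ == "__main__":
    # Print the versioned, layer, and sample graph names for one release.
    for build in (versioned_dataset, layer_dataset, dataset_sample):
        print(build("whitehouse-gov", "visitor-records", "0510"))
```

Each function only appends to the name one level up, which mirrors how every level is a void:subset of the level above it.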

The following queries can be used at http://logd.tw.rpi.edu/sparql to find and describe datasets.
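For scripted use, a query can be passed to the endpoint in the `query` parameter defined by the SPARQL Protocol. The sketch below only builds the request URL and makes no network call; the helper name `sparql_get_url` is hypothetical, and whether the endpoint accepts additional parameters (for example, to choose a result format) is not covered here.

```python
from urllib.parse import urlencode

ENDPOINT = "http://logd.tw.rpi.edu/sparql"  # endpoint named in the text above


def sparql_get_url(query, endpoint=ENDPOINT):
    """Return a SPARQL Protocol GET URL carrying `query` in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


# A compact form of the "What graphs are in the triple store?" query.
LIST_GRAPHS = "SELECT DISTINCT ?g WHERE { GRAPH ?g { [] a [] } }"

if __name__ == "__main__":
    # Prints the URL only; fetching it is left to the caller.
    print(sparql_get_url(LIST_GRAPHS))
```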

What graphs are in the triple store?

SELECT DISTINCT ?g
   WHERE { GRAPH ?g { 
      [] a [] . 
   }
}

(When you know you aren't going to select the variable, do not name it in the graph pattern)

What Dataset Samples are loaded in the triple store?

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>

SELECT distinct (max(?modified) as ?last_modified)
       (count(?sample) as ?num_modifications)
       ?sample
WHERE {
  GRAPH ?sample {
    ?sample a conversion:DatasetSample; 
            dcterms:modified ?modified .
  }
}
GROUP BY ?sample 
ORDER BY DESC(?last_modified) DESC(?num_modifications)

Of the Dataset Samples that are loaded in the triple store, which have their sampled datasets loaded as well?

(IOU: doesn't return anything...)

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX void:       <http://rdfs.org/ns/void#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>

SELECT distinct (max(?modified) as ?last_modified)
       (count(?sample) as ?num_modifications)
       ?sample ?sampled
WHERE {
  GRAPH ?sample {
    ?sample a conversion:DatasetSample; 
            dcterms:modified ?modified .
    ?sampled void:subset ?sample .
  }
  GRAPH ?sampled {
    ?sampled a []
  }
}
GROUP BY ?sample ?sampled
ORDER BY DESC(?last_modified) DESC(?num_modifications)

How up to date are the dataset descriptions?

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix logd:       <http://logd.tw.rpi.edu/vocab/>

SELECT *
WHERE {    
  graph logd:Dataset {      
    logd:Dataset dcterms:modified ?modified .
  }  
}

How do the datasets fit into the void:subset hierarchy?

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:          <http://rdfs.org/ns/void#>

select distinct ?dataset ?subdataset ?size ?dump 
where { 
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {  
                 ?dataset a void:Dataset . 
      optional { ?dataset void:subset ?subdataset } 
      optional { ?subdataset conversion:num_triples ?size } 
      optional { ?subdataset void:dataDump          ?dump } 
  } 
} order by ?dataset ?subdataset

What (unversioned) datasets are at the roots of the void:subset hierarchies?

prefix foaf:       <http://xmlns.com/foaf/0.1/>
prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT DISTINCT ?source_id ?source_homepage ?dataset_id ?dataset_homepage ?dataset (max(?modified) AS ?lastModified)
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
    ?dataset a conversion:Dataset;
             void:subset [ a conversion:VersionedDataset ] ;
             conversion:dataset_identifier ?dataset_id;
             dcterms:modified              ?modified ;
             dcterms:source                ?organization .
    ?organization a foaf:Agent;
                  dcterms:identifier ?source_id .
  }
  graph ?meta {
    ?meta a conversion:MetaDataset .
    optional{ ?organization foaf:homepage ?source_homepage  }
    #exceeds execution time threshold: optional{ ?dataset      foaf:homepage ?dataset_homepage }
  }  
}
GROUP BY ?source_id ?source_homepage ?dataset_id ?dataset_homepage ?dataset
ORDER BY ?dataset

How many verbatim conversions are there?

prefix foaf:       <http://xmlns.com/foaf/0.1/>
prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT (count(?dataset) AS ?count)
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
    ?dataset a conversion:LayerDataset; 
             conversion:conversion_identifier "raw" .
  }
}

What datasets are part of the LOD cloud?

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT DISTINCT ?dataset ?dump  
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {  
    ?dataset a conversion:Dataset;
             void:subset [ a conversion:SameAsDataset; 
                           void:dataDump ?dump ] 
  }  
}

What dataset samples are there, and which are loaded in the triple store?

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

SELECT DISTINCT ?source_id ?dataset_id ?version_id ?layer_id ?sample_uri ?dump_file ?created_date ?loaded_boolean
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset
       a conversion:Dataset;
       conversion:source_identifier  ?source_id;
       conversion:dataset_identifier ?dataset_id;

       void:subset [ a conversion:VersionedDataset;
                     conversion:version_identifier ?version_id;

                     void:subset [ a conversion:LayerDataset;
                                   conversion:conversion_identifier ?layer_id;
                                   dcterms:created                  ?created_date;
                                   void:subset ?sample_uri ]
                   ] .
    ?sample_uri a conversion:DatasetSample;
                void:dataDump ?dump_file .
  }
  optional {
    graph ?sample_uri {
       ?sample_uri a ?loaded_boolean .
       filter(?loaded_boolean = void:Dataset)
    }
  }
} ORDER BY ?source_id ?dataset_id ?version_id ?layer_id

What datasets are from "data-gov"?

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

select distinct ?dataset
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset a conversion:Dataset;
    conversion:source_identifier "data-gov" .
  }
}

What Datasets are at the root of the void:subset hierarchy?

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

select *
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset a conversion:Dataset;
             void:subset [ a conversion:VersionedDataset ] .
  }
}

What VoID data subsets are within data-gov's dataset 1008?

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:       <http://rdfs.org/ns/void#>

select distinct ?dataset ?subdataset ?size ?dump  
where {   
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {    
    ?dataset a void:Dataset ; 
             conversion:source_identifier "data-gov"; 
             conversion:dataset_identifier "1008" .
    optional { ?dataset    void:subset            ?subdataset }    
    optional { ?subdataset conversion:num_triples ?size }    
    optional { ?subdataset void:dataDump          ?dump }  
  }  
} order by ?dataset ?subdataset

Is data-gov's dataset 8 loaded in the SPARQL endpoint?

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

ASK
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
               ?dataset a void:Dataset;
                        conversion:source_identifier  "data-gov";
                        conversion:dataset_identifier "8" .
    optional { ?dataset void:subset ?subdataset }

    optional { ?NOPARENT void:subset ?dataset }
    filter(!bound(?NOPARENT))
  }
  graph ?dataset {
     [] a []
  }
}

Is the raw sample loaded?

prefix ov:         <http://open.vocab.org/terms/>

ask
WHERE {
  graph <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/1st-anniversary/conversion/raw/subset/sample> {
     [] ov:csvRow ?row
  }
}

Is the first enhancement sample loaded?

prefix ov:         <http://open.vocab.org/terms/>

ask
WHERE {
  graph <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/1st-anniversary/conversion/e1/subset/sample> {
     [] ov:csvRow ?row
  }
}
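The two checks above differ only in the graph URI, so the query text can be generated for any named graph. This is a small Python sketch under that assumption; `ask_graph_loaded` and `ASK_TEMPLATE` are hypothetical names, and the test for `ov:csvRow` is copied from the queries above rather than being a general definition of "loaded".

```python
ASK_TEMPLATE = """\
prefix ov: <http://open.vocab.org/terms/>

ask
where {{
  graph <{graph}> {{
    [] ov:csvRow ?row
  }}
}}"""


def ask_graph_loaded(graph_uri):
    """Return an ASK query testing whether `graph_uri` holds any ov:csvRow triple."""
    # Reject characters that may not appear inside a SPARQL IRI reference.
    if any(ch in graph_uri for ch in '<>" {}|\\^`'):
        raise ValueError(f"not a valid IRI: {graph_uri!r}")
    return ASK_TEMPLATE.format(graph=graph_uri)


if __name__ == "__main__":
    print(ask_graph_loaded(
        "http://logd.tw.rpi.edu/source/data-gov/dataset/1623"
        "/version/1st-anniversary/conversion/raw/subset/sample"
    ))
```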

What predicates do the datasets use?

prefix conversion: <http://purl.org/twc/vocab/conversion/>

select distinct ?dataset ?predicate
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset conversion:uses_predicate ?predicate
  }
}

Datasets with wgs:lat

prefix wgs:        <http://www.w3.org/2003/01/geo/wgs84_pos#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:       <http://rdfs.org/ns/void#>

select distinct ?g 
where { 
  graph ?g { 
    ?s wgs:lat ?lat 
  } 
}

3-level vs. 4-level void:subset hierarchy (cf. single vs. multiple CSVs)

Datasets comprising only one CSV create a 3-level hierarchy, while datasets comprising more than one CSV create a 4-level hierarchy. The following query finds all unversioned datasets of either kind:

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

select distinct ?unversioned
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    {
      # Unversioned datasets with single CSV
      ?unversioned void:subset            ?versioned .
      ?versioned   void:subset            ?layer     .
      ?layer       conversion:num_triples ?triples ;
                   void:dataDump          ?dump      .
    }
    union
    {
      # Unversioned datasets with multiple CSVs
      ?unversioned     void:subset            ?versioned       .
      ?versioned       void:subset            ?layer           .
      ?layer           void:dataDump          ?dump ;
                       void:subset            ?multi_component .
      ?multi_component conversion:num_triples ?triples         .
    }
  }
} order by ?unversioned

The same two patterns written with nested blank nodes, counting the unversioned datasets instead of listing them:

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

select (count(distinct ?unversioned) as ?count)
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    { # Unversioned datasets with single CSV
      ?unversioned void:subset [ 
                        void:subset [ 
                             conversion:num_triples ?triples ;
                             void:dataDump          ?dump     
                        ]
                   ]
    }
    union
    { # Unversioned datasets with multiple CSVs
      ?unversioned void:subset [
                        void:subset [ 
                             void:dataDump ?dump ;
                             void:subset [
                                  conversion:num_triples ?triples 
                             ]
                        ]
                   ]
    }
  }
}

(see http://data-gov.tw.rpi.edu/wiki/URI_design_for_RDF_conversion_of_CSV-based_data#VoID_descriptions for a diagram illustrating the different VoID hierarchies between single- and multi-CSV datasets.)

A 3-level example with explicit names

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select ?p ?o
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1008>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21> .

    <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21/conversion/raw> .

    <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21/conversion/raw> ?p ?o .
  }
}

A 4-level example with explicit names

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select ?p ?o
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1033>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary> .
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary/conversion/raw> .
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary/conversion/raw>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/FM_FACILITY_FILE/version/1st-anniversary/conversion/raw> .
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/FM_FACILITY_FILE/version/1st-anniversary/conversion/raw> ?p ?o
  }
}
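The difference between the two shapes can be checked by counting levels along the void:subset links. The sketch below is illustrative only: `subset_levels` is a hypothetical helper, and the edge lists contain just the void:subset triples written out in the two explicit examples above. A real description also holds other subsets (such as Dataset Samples), which are ignored here, and the helper assumes the links contain no cycles.

```python
def subset_levels(root, subset_edges):
    """Count the levels of the void:subset hierarchy rooted at `root`.

    `subset_edges` is an iterable of (parent, child) pairs, one per
    void:subset triple. A dataset with no subsets counts as one level.
    """
    children = {}
    for parent, child in subset_edges:
        children.setdefault(parent, []).append(child)

    def depth(node):
        return 1 + max((depth(c) for c in children.get(node, [])), default=0)

    return depth(root)


D = "http://logd.tw.rpi.edu/source/data-gov/dataset/"

# Dataset 1008: a single CSV.
single_csv = [
    (D + "1008", D + "1008/version/2010-Jul-21"),
    (D + "1008/version/2010-Jul-21",
     D + "1008/version/2010-Jul-21/conversion/raw"),
]

# Dataset 1033: multiple CSVs, one of which is FM_FACILITY_FILE.
multi_csv = [
    (D + "1033", D + "1033/version/1st-anniversary"),
    (D + "1033/version/1st-anniversary",
     D + "1033/version/1st-anniversary/conversion/raw"),
    (D + "1033/version/1st-anniversary/conversion/raw",
     D + "1033/FM_FACILITY_FILE/version/1st-anniversary/conversion/raw"),
]

if __name__ == "__main__":
    print(subset_levels(D + "1008", single_csv))
    print(subset_levels(D + "1033", multi_csv))
```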

All dump files and their triple counts of an (unversioned) Dataset

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

SELECT ?dataDump (sum(?num_triples) AS ?triples)
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {

    <http://logd.tw.rpi.edu/source/data-gov/dataset/1008> 
      void:subset [
        a conversion:VersionedDataset;
        void:subset ?layer ] .

    {
      ?layer conversion:num_triples ?num_triples;
             void:dataDump          ?dataDump.
    }
    UNION
    {
      ?layer void:dataDump ?dataDump;
             void:subset   ?multiple_table .

      ?multiple_table conversion:num_triples ?num_triples .
    }
  }
}
GROUP BY ?dataDump

Getting a dump file of a sample subset of a dataset

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:       <http://rdfs.org/ns/void#>
select distinct ?dataset ?versionedDataset ?layerDataset ?sample ?dump
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset          void:subset ?versionedDataset .
    ?versionedDataset a conversion:VersionedDataset;
                      void:subset ?layerDataset .
    ?layerDataset     a conversion:LayerDataset;
                      void:subset ?sample .
    ?sample           a conversion:DatasetSample;
                      void:dataDump ?dump .
  }
} order by ?dataset ?versionedDataset ?layerDataset ?sample

Getting a dump file of a sample subset of a dataset (#2)

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:       <http://rdfs.org/ns/void#>
SELECT DISTINCT ?source_id ?dataset_id ?sample ?dump
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?sample a conversion:DatasetSample;
            conversion:source_identifier   ?source_id;
            conversion:dataset_identifier  ?dataset_id;
            conversion:version_identifier "1st-anniversary";
            void:dataDump ?dump .
  }
} ORDER BY ?sample

Attributes on Datasets with void:dataDumps

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:          <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT *
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
    { ?dataset a conversion:LayerDataset; void:dataDump ?dump }
    optional { ?dataset conversion:source_identifier  ?source_id }
    optional { ?dataset conversion:dataset_identifier ?dataset_id }
    optional { ?dataset conversion:dataset_version    ?version_id }
  }  
}

Counts of datasets with different sets of attributes

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:          <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT (count(distinct ?dataset1) as ?dumps)
       (count(distinct ?dataset2) as ?to_source)
       (count(distinct ?dataset3) as ?to_dataset)
       (count(distinct ?dataset4) as ?to_version)
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
    {?dataset1 void:dataDump ?dumpfile}
    union

    {?dataset2 void:dataDump ?dumpfile; conversion:source_identifier ?source_id}
    union
           
     {?dataset3 void:dataDump ?dumpfile; conversion:source_identifier ?source_id; conversion:dataset_identifier ?dataset_id}
   union

    {?dataset4 void:dataDump ?dumpfile; conversion:source_identifier ?source_id; conversion:dataset_identifier ?dataset_id; conversion:dataset_version ?version_id}
  }  
}

Datasets (intentionally) without a version

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:          <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT distinct ?dataset3
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
     ?dataset3 void:dataDump ?dumpfile; conversion:source_identifier ?source_id; conversion:dataset_identifier ?dataset_id .
    optional{?dataset3 conversion:dataset_version ?version_id}
    filter(!bound(?version_id))
  }  
} order by ?dataset3

What types are instances of void:Dataset?

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:          <http://rdfs.org/ns/void#>

SELECT DISTINCT ?type
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {  
     ?dataset a void:Dataset ; a ?type .
  }
} order by ?type

Historical note

This was originally developed on LOGD's site, but moved here because they didn't like it.

Dump file validation

Li Ding put together a dump file validation service to make sure the dump files exist.

Related work

Lee Feigenbaum discusses how they used named graphs for versioning.
