
Dataset composition resulting from naming by source, dataset, and version

timrdf edited this page Jun 26, 2012 · 21 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

See also

Understanding the void:subset hierarchy

In addition to these special named graphs, there are many named graphs that fall into four categories. These categories are listed in order of size and correspond to their level within the void:subset hierarchy:


The following queries can be used at http://logd.tw.rpi.edu/sparql to find and describe datasets.

What graphs are in the triple store?

SELECT DISTINCT ?g
   WHERE { GRAPH ?g { 
      [] a [] . 
   }
}

(When you know you aren't going to select a variable, don't name it in the graph pattern; use the anonymous blank node [] instead, as in the query above.)
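The query above can be sent to the endpoint over HTTP. The following is a minimal Python sketch, assuming the endpoint follows the standard SPARQL Protocol (a GET request with a `query` parameter) and honors an `Accept` header for JSON results. The helper name `build_request` is hypothetical, and the request is only built here, not sent.

```python
import urllib.parse
import urllib.request

# Endpoint named on this page; whether it is still reachable is not assumed.
ENDPOINT = "http://logd.tw.rpi.edu/sparql"

# The "What graphs are in the triple store?" query from above.
QUERY = """
SELECT DISTINCT ?g
WHERE { GRAPH ?g { [] a [] . } }
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a SPARQL Protocol GET request (hypothetical helper)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format via content negotiation.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
# To actually run the query: urllib.request.urlopen(request).read()
```

Any of the other queries on this page can be substituted for `QUERY`.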

What Dataset Samples are loaded in the triple store?


PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>

SELECT distinct (max(?modified) as ?last_modified)
       (count(?sample) as ?num_modifications)
       ?sample
WHERE {
  GRAPH ?sample {
    ?sample a conversion:DatasetSample; 
            dcterms:modified ?modified .
  }
}
GROUP BY ?sample 
ORDER BY DESC(?last_modified) DESC(?num_modifications)

Of the Dataset Samples that are loaded in the triplestore, which have their Sampled datasets loaded?

(IOU: this query does not currently return anything.)

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX void:       <http://rdfs.org/ns/void#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>

SELECT distinct (max(?modified) as ?last_modified)
       (count(?sample) as ?num_modifications)
       ?sample ?sampled
WHERE {
  GRAPH ?sample {
    ?sample a conversion:DatasetSample; 
            dcterms:modified ?modified .
    ?sampled void:subset ?sample .
  }
  GRAPH ?sampled {
    ?sampled a []
  }
}
GROUP BY ?sample ?sampled
ORDER BY DESC(?last_modified) DESC(?num_modifications)

How up to date are the dataset descriptions?

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix logd:       <http://logd.tw.rpi.edu/vocab/>

SELECT *
WHERE {    
  graph logd:Dataset {      
    logd:Dataset dcterms:modified ?modified .
  }  
}

How do the datasets fit into the void:subset hierarchy?

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:          <http://rdfs.org/ns/void#>

select distinct ?dataset ?subdataset ?size ?dump 
where { 
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {  
                 ?dataset a void:Dataset . 
      optional { ?dataset void:subset ?subdataset } 
      optional { ?subdataset conversion:num_triples ?size } 
      optional { ?subdataset void:dataDump          ?dump } 
  } 
} order by ?dataset ?subdataset 
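The rows returned by the query above are (dataset, subdataset) pairs, which can be folded into a parent-to-children map on the client side. The following is a minimal Python sketch, assuming the endpoint returns the standard SPARQL JSON results format. The function name `subset_tree` is hypothetical, and the sample response is hand-written for illustration, reusing dataset URIs that appear later on this page.

```python
import json
from collections import defaultdict


def subset_tree(results_json: str) -> dict:
    """Map each ?dataset to the list of its ?subdataset values (hypothetical helper)."""
    doc = json.loads(results_json)
    children = defaultdict(list)
    for binding in doc["results"]["bindings"]:
        parent = binding["dataset"]["value"]
        subs = children[parent]  # registers the dataset even if it has no subsets
        sub = binding.get("subdataset")  # absent when the OPTIONAL did not match
        if sub is not None and sub["value"] not in subs:
            subs.append(sub["value"])
    return dict(children)


# Hand-written illustrative response, not actual endpoint output.
D = "http://logd.tw.rpi.edu/source/data-gov/dataset/1008"
V = D + "/version/2010-Jul-21"
L = V + "/conversion/raw"
SAMPLE = json.dumps({
    "head": {"vars": ["dataset", "subdataset", "size", "dump"]},
    "results": {"bindings": [
        {"dataset": {"type": "uri", "value": D},
         "subdataset": {"type": "uri", "value": V}},
        {"dataset": {"type": "uri", "value": V},
         "subdataset": {"type": "uri", "value": L}},
        {"dataset": {"type": "uri", "value": L}},
    ]},
})

for parent, subs in subset_tree(SAMPLE).items():
    print(parent, "->", subs)
```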

What (unversioned) datasets are at the roots of the void:subset hierarchies?

prefix foaf:       <http://xmlns.com/foaf/0.1/>
prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT DISTINCT ?source_id ?source_homepage ?dataset_id ?dataset_homepage ?dataset (max(?modified) AS ?lastModified)
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
    ?dataset a conversion:Dataset;
             void:subset [ a conversion:VersionedDataset ] ;
             conversion:dataset_identifier ?dataset_id;
             dcterms:modified              ?modified ;
             dcterms:source                ?organization .
    ?organization a foaf:Agent;
                  dcterms:identifier ?source_id .
  }
  graph ?meta {
    ?meta a conversion:MetaDataset .
    optional{ ?organization foaf:homepage ?source_homepage  }
    #exceeds execution time threshold: optional{ ?dataset      foaf:homepage ?dataset_homepage }
  }  
}
GROUP BY ?source_id ?source_homepage ?dataset_id ?dataset_homepage ?dataset
ORDER BY ?dataset

How many verbatim conversions are there?

prefix foaf:       <http://xmlns.com/foaf/0.1/>
prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT (count(?dataset) as ?count)
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
    ?dataset a conversion:LayerDataset; 
             conversion:conversion_identifier "raw" .
  }
}

What datasets are part of the LOD cloud?

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT DISTINCT ?dataset ?dump  
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {  
    ?dataset a conversion:Dataset;
             void:subset [ a conversion:SameAsDataset; 
                           void:dataDump ?dump ] 
  }  
}   

What dataset samples are there, and which are loaded in the triple store?

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

SELECT DISTINCT ?source_id ?dataset_id ?version_id ?layer_id ?sample_uri ?dump_file ?created_date ?loaded_boolean
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset
       a conversion:Dataset;
       conversion:source_identifier  ?source_id;
       conversion:dataset_identifier ?dataset_id;

       void:subset [ a conversion:VersionedDataset;
                     conversion:version_identifier ?version_id;

                     void:subset [ a conversion:LayerDataset;
                                   conversion:conversion_identifier ?layer_id;
                                   dcterms:created                  ?created_date;
                                   void:subset ?sample_uri ]
                   ] .
    ?sample_uri a conversion:DatasetSample;
                void:dataDump ?dump_file .
  }
  optional {
    graph ?sample_uri {
       ?sample_uri a ?loaded_boolean .
       filter(?loaded_boolean = void:Dataset)
    }
  }
} ORDER BY ?source_id ?dataset_id ?version_id ?layer_id

What datasets are from "data-gov"?

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

select distinct ?dataset
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset a conversion:Dataset;
    conversion:source_identifier "data-gov" .
  }
}

What Datasets are at the root of the void:subset hierarchy?

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

select *
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset a conversion:Dataset;
             void:subset [ a conversion:VersionedDataset ] .
  }
}

What VoID data subsets are within data-gov's dataset 1008?

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:       <http://rdfs.org/ns/void#>

select distinct ?dataset ?subdataset ?size ?dump  
where {   
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {    
    ?dataset a void:Dataset ; 
             conversion:source_identifier "data-gov"; 
             conversion:dataset_identifier "1008" .
    optional { ?dataset    void:subset            ?subdataset }    
    optional { ?subdataset conversion:num_triples ?size }    
    optional { ?subdataset void:dataDump          ?dump }  
  }  
} order by ?dataset ?subdataset 

Is data-gov's dataset 8 loaded in the SPARQL endpoint?

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

ASK
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
               ?dataset a void:Dataset;
                        conversion:source_identifier  "data-gov";
                        conversion:dataset_identifier "8" .
    optional { ?dataset void:subset ?subdataset }

    optional { ?NOPARENT void:subset ?dataset }
    filter(!bound(?NOPARENT))
  }
  graph ?dataset {
     [] a []
  }
} 
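An ASK query such as the one above returns a single boolean rather than rows. The following is a minimal Python sketch of reading that answer, assuming the endpoint returns the standard SPARQL JSON results format, in which an ASK result carries a top-level `boolean` member. The function name `parse_ask_result` is hypothetical.

```python
import json


def parse_ask_result(body: str) -> bool:
    """Extract the answer from a SPARQL JSON ASK response (hypothetical helper)."""
    doc = json.loads(body)
    if not isinstance(doc, dict) or "boolean" not in doc:
        # SELECT results carry "results"/"bindings" instead of "boolean".
        raise ValueError("not a SPARQL ASK result: no 'boolean' member")
    return bool(doc["boolean"])


# Hand-written example body, not actual endpoint output.
print(parse_ask_result('{"head": {}, "boolean": true}'))
```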

Is the raw sample loaded?

prefix ov:         <http://open.vocab.org/terms/>

ask
WHERE {
  graph <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/1st-anniversary/conversion/raw/subset/sample> {
     [] ov:csvRow ?row
  }
}

Is the first enhancement sample loaded?

prefix ov:         <http://open.vocab.org/terms/>

ask
WHERE {
  graph <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/1st-anniversary/conversion/e1/subset/sample> {
     [] ov:csvRow ?row
  }
}
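The two sample checks above differ only in the conversion identifier (raw vs. e1) inside the graph name. The following is a minimal Python sketch that generates such a check for any layer. It assumes sample graph names always follow the pattern seen in these two examples, and the function names are hypothetical.

```python
BASE = "http://logd.tw.rpi.edu"


def sample_graph(source: str, dataset: str, version: str, layer: str) -> str:
    """Name of a sample graph, following the pattern of the two examples above."""
    return (
        f"{BASE}/source/{source}/dataset/{dataset}"
        f"/version/{version}/conversion/{layer}/subset/sample"
    )


def sample_loaded_query(source: str, dataset: str, version: str, layer: str) -> str:
    """ASK query text checking whether the sample graph holds any ov:csvRow triple."""
    graph = sample_graph(source, dataset, version, layer)
    return (
        "prefix ov: <http://open.vocab.org/terms/>\n"
        "ask\n"
        "where {\n"
        f"  graph <{graph}> {{\n"
        "    [] ov:csvRow ?row\n"
        "  }\n"
        "}\n"
    )


print(sample_loaded_query("data-gov", "1623", "1st-anniversary", "e1"))
```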

What predicates do the datasets use?

prefix conversion: <http://purl.org/twc/vocab/conversion/>

select distinct ?dataset ?predicate
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset conversion:uses_predicate ?predicate
  }
}

Datasets with wgs:lat

prefix wgs:        <http://www.w3.org/2003/01/geo/wgs84_pos#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:       <http://rdfs.org/ns/void#>

select distinct ?g 
where { 
  graph ?g { 
    ?s wgs:lat ?lat 
  } 
}

3-level vs. 4-level void:subset hierarchy (cf. single vs. multiple CSVs)

Datasets comprising only one CSV create a 3-level hierarchy, while datasets comprising more than one CSV create a 4-level hierarchy. The following query finds all unversioned datasets, covering both cases:

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

select distinct ?unversioned
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    {
      # Unversioned datasets with single CSV
      ?unversioned void:subset            ?versioned .
      ?versioned   void:subset            ?layer     .
      ?layer       conversion:num_triples ?triples ;
                   void:dataDump          ?dump      .
    }
    union
    {
      # Unversioned datasets with multiple CSVs
      ?unversioned     void:subset            ?versioned       .
      ?versioned       void:subset            ?layer           .
      ?layer           void:dataDump          ?dump ;
                       void:subset            ?multi_component .
      ?multi_component conversion:num_triples ?triples         .
    }
  }
} order by ?unversioned 

The same pattern, written with nested blank nodes and counting the unversioned datasets instead of listing them:

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

select (count(distinct ?unversioned) as ?count)
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    { # Unversioned datasets with single CSV
      ?unversioned void:subset [ 
                        void:subset [ 
                             conversion:num_triples ?triples ;
                             void:dataDump          ?dump     
                        ]
                   ]
    }
    union
    { # Unversioned datasets with multiple CSVs
      ?unversioned void:subset [
                        void:subset [ 
                             void:dataDump ?dump ;
                             void:subset [
                                  conversion:num_triples ?triples 
                             ]
                        ]
                   ]
    }
  }
} 

(see http://data-gov.tw.rpi.edu/wiki/URI_design_for_RDF_conversion_of_CSV-based_data#VoID_descriptions for a diagram illustrating the different VoID hierarchies between single- and multi-CSV datasets.)

A 3-level example with explicit names

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select ?p ?o
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1008>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21> .

    <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21/conversion/raw> .

    <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21/conversion/raw> ?p ?o .
  }
}

A 4-level example with explicit names

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select ?p ?o
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1033>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary> .
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary/conversion/raw> .
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary/conversion/raw>
        void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/FM_FACILITY_FILE/version/1st-anniversary/conversion/raw> .
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/FM_FACILITY_FILE/version/1st-anniversary/conversion/raw> ?p ?o
  }
}   
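The two explicit examples above suggest how the dataset names at each level are composed from the source, dataset, version, and conversion identifiers. The following is a minimal Python sketch of that composition. The pattern is inferred only from these two examples and is not claimed to be the general naming rule; the function name `subset_chain` and its `table` parameter are hypothetical.

```python
BASE = "http://logd.tw.rpi.edu"


def subset_chain(source, dataset, version, layer, table=None):
    """Return dataset URIs from the root of the void:subset chain downward.

    Without `table` the chain has 3 levels (single CSV); with `table` a
    fourth, per-table level is appended (multiple CSVs).
    """
    unversioned = f"{BASE}/source/{source}/dataset/{dataset}"
    versioned = f"{unversioned}/version/{version}"
    layer_uri = f"{versioned}/conversion/{layer}"
    chain = [unversioned, versioned, layer_uri]
    if table is not None:
        # In the 4-level example the table name sits directly after the dataset id.
        chain.append(f"{unversioned}/{table}/version/{version}/conversion/{layer}")
    return chain


for uri in subset_chain("data-gov", "1033", "1st-anniversary", "raw", "FM_FACILITY_FILE"):
    print(uri)
```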

All dump files and their triple counts of an (unversioned) Dataset

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

SELECT ?dataDump (sum(?num_triples) as ?triples)
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {

    <http://logd.tw.rpi.edu/source/data-gov/dataset/1008> 
      void:subset [
        a conversion:VersionedDataset;
        void:subset ?layer ] .

    {
      # Single CSV: the layer itself carries the triple count
      ?layer conversion:num_triples ?num_triples;
             void:dataDump          ?dataDump.
    }
    UNION
    {
      # Multiple CSVs: the per-table subsets carry the triple counts
      ?layer void:dataDump ?dataDump;
             void:subset   ?multiple_table .

      ?multiple_table conversion:num_triples ?num_triples .
    }
  }
}
GROUP BY ?dataDump

Getting a dump file of a sample subset of a dataset

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:       <http://rdfs.org/ns/void#>
select distinct ?dataset ?versionedDataset ?layerDataset ?sample ?dump
where {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset          void:subset ?versionedDataset .
    ?versionedDataset a conversion:VersionedDataset;
                      void:subset ?layerDataset .
    ?layerDataset     a conversion:LayerDataset;
                      void:subset ?sample .
    ?sample           a conversion:DatasetSample;
                      void:dataDump ?dump .
  }
} order by ?dataset ?versionedDataset ?layerDataset ?sample

Getting a dump file of a sample subset of a dataset (#2)

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:       <http://rdfs.org/ns/void#>
SELECT DISTINCT ?source_id ?dataset_id ?sample ?dump
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?sample a conversion:DatasetSample;
            conversion:source_identifier   ?source_id;
            conversion:dataset_identifier  ?dataset_id;
            conversion:version_identifier "1st-anniversary";
            void:dataDump ?dump .
  }
} ORDER BY ?sample

Attributes on Datasets with void:dataDumps

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:          <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT *
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
    { ?dataset a conversion:LayerDataset; void:dataDump ?dump }
    optional { ?dataset conversion:source_identifier  ?source_id }
    optional { ?dataset conversion:dataset_identifier ?dataset_id }
    optional { ?dataset conversion:dataset_version    ?version_id }
  }  
}

Counts of datasets with different sets of attributes

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:          <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT (count(distinct ?dataset1) as ?dumps)
       (count(distinct ?dataset2) as ?to_source)
       (count(distinct ?dataset3) as ?to_dataset)
       (count(distinct ?dataset4) as ?to_version)
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
    { ?dataset1 void:dataDump ?dumpfile }
    union
    { ?dataset2 void:dataDump ?dumpfile;
                conversion:source_identifier ?source_id }
    union
    { ?dataset3 void:dataDump ?dumpfile;
                conversion:source_identifier  ?source_id;
                conversion:dataset_identifier ?dataset_id }
    union
    { ?dataset4 void:dataDump ?dumpfile;
                conversion:source_identifier  ?source_id;
                conversion:dataset_identifier ?dataset_id;
                conversion:dataset_version    ?version_id }
  }  
}

Datasets (intentionally) without a version

prefix dcterms:    <http://purl.org/dc/terms/>
prefix void:          <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
 
SELECT distinct ?dataset3
WHERE {    
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {      
     ?dataset3 void:dataDump ?dumpfile; conversion:source_identifier ?source_id; conversion:dataset_identifier ?dataset_id .
    optional{?dataset3 conversion:dataset_version ?version_id}
    filter(!bound(?version_id))
  }  
} order by ?dataset3

What types are instances of void:Dataset?

prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void:          <http://rdfs.org/ns/void#>

SELECT DISTINCT ?type
WHERE {
  graph <http://logd.tw.rpi.edu/vocab/Dataset> {  
     ?dataset a void:Dataset ; a ?type .
  }
} order by ?type

See also

Historical note

  • This was originally developed on LOGD's site, but moved here because they didn't like it.

Dump file validation

Li Ding put together a dump file validation service to make sure the dump files exist.

Related work

  • Lee Feigenbaum discusses how they used named graphs for versioning.