Dataset composition resulting from naming by source, dataset, and version
- The identifiers for source, dataset, and version are assigned in Conversion process phase: name.
In addition to these special named graphs, there are many named graphs that fall into four categories. These categories are listed in order of size and correspond to their level within the void:subset hierarchy:
- Abstract Dataset named graphs contain all of the data triples and all of the metadata for an Abstract Dataset. An Abstract Dataset incorporates all Versioned Datasets that have been created for it. An example instance of an Abstract Dataset is http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records. The Abstract Dataset named graph is populated with zero or more of its (unversioned) Datasets as needed. We accept requests to populate the Abstract Dataset named graphs in the LOGD triple store.
- Versioned Dataset named graphs contain all of the data triples and all of the metadata for a Versioned Dataset. A Versioned Dataset incorporates all data triples and metadata from the layers (e.g. "raw", "e1") that have been created for it. Versioned Datasets exist for each Abstract Dataset. Two example instances of a Versioned Dataset are http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510 and http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0810, corresponding to the May and August releases of the White House Visitor Access Records. The LOGD triple store is populated with Versioned Datasets as needed. Requests to do so are accepted.
- Layer Dataset named graphs contain all of the data triples and all of the metadata for a Layer Dataset. The two most popular Layer Datasets are the "raw" and "e1" layers, while additional enhancements would provide layers "e2", "e3", etc. The term layer is used to reflect the parallel predicates that layer additional descriptions on top of the same entities within the dataset: each layer provides a new set of predicates that enables backward compatibility and incremental adoption. Layer Datasets exist for each Version of a Dataset. Three example instances of a Layer Dataset are http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510/conversion/raw, http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510/conversion/enhancement/1, and http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0810/conversion/raw. The LOGD triple store is populated with Layer Datasets as needed. Requests to do so are accepted.
- Dataset Sample named graphs are the smallest type of named graph. They contain a subset of the data triples and all of the metadata for a Layer Dataset. This subset is intended to provide quick access for overview and/or survey analysis applications. Dataset Samples exist for each Layer of each Version of a Dataset. Three example instances of a Dataset Sample are http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510/conversion/raw/subset/sample, http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0510/conversion/enhancement/1/subset/sample, and http://logd.tw.rpi.edu/source/whitehouse-gov/dataset/visitor-records/version/0810/conversion/raw/subset/sample. The LOGD triple store is populated with all available Dataset Samples.
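The example URIs in the four categories above follow a regular pattern, which can be sketched in Python. This is only an illustration inferred from those examples, not the converter's own code: the function names are hypothetical, and the mapping from a layer label such as "e1" to the path segment `enhancement/1` is an assumption based on the example URIs.

```python
BASE = "http://logd.tw.rpi.edu"  # base URI taken from the examples above


def abstract_dataset_uri(source_id, dataset_id):
    """Named graph of an Abstract Dataset (hypothetical helper)."""
    return f"{BASE}/source/{source_id}/dataset/{dataset_id}"


def versioned_dataset_uri(source_id, dataset_id, version_id):
    """Named graph of one Versioned Dataset of an Abstract Dataset."""
    return f"{abstract_dataset_uri(source_id, dataset_id)}/version/{version_id}"


def layer_dataset_uri(source_id, dataset_id, version_id, layer="raw"):
    """Named graph of one Layer Dataset; layer is "raw", "e1", "e2", ..."""
    if layer == "raw":
        segment = "raw"
    elif layer.startswith("e") and layer[1:].isdigit():
        # Assumption: "e1" maps to "enhancement/1", as in the example URIs.
        segment = f"enhancement/{layer[1:]}"
    else:
        raise ValueError(f"unrecognized layer label: {layer!r}")
    versioned = versioned_dataset_uri(source_id, dataset_id, version_id)
    return f"{versioned}/conversion/{segment}"


def dataset_sample_uri(source_id, dataset_id, version_id, layer="raw"):
    """Named graph of the Dataset Sample of one Layer Dataset."""
    layer_uri = layer_dataset_uri(source_id, dataset_id, version_id, layer)
    return f"{layer_uri}/subset/sample"
```

For example, `dataset_sample_uri("whitehouse-gov", "visitor-records", "0510", "e1")` yields the second Dataset Sample URI listed above.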
The following queries can be used at http://logd.tw.rpi.edu/sparql to find and describe datasets.
SELECT DISTINCT ?g
WHERE { GRAPH ?g {
[] a [] .
}
}
(When you know you aren't going to select a variable, do not name it in the graph pattern; use a blank node such as [] instead, as the query above does.)
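Any of the queries on this page can also be submitted programmatically. The sketch below builds a SPARQL protocol GET URL for the endpoint and flattens a result document in the standard SPARQL JSON results format. It assumes the endpoint accepts the standard `query` parameter and can return JSON; the helper names are hypothetical, and no request is actually sent here.

```python
from urllib.parse import urlencode

ENDPOINT = "http://logd.tw.rpi.edu/sparql"

# The "which named graphs hold typed resources" query from above.
LIST_GRAPHS = """SELECT DISTINCT ?g
WHERE { GRAPH ?g {
  [] a [] .
}
}"""


def build_query_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def bindings_to_rows(results):
    """Flatten a SPARQL JSON results document into a list of
    {variable: value} dicts, dropping the type/datatype details."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


# Hand-written response in the SPARQL JSON results format, for
# illustration only; it is not real output from the endpoint.
example_response = {
    "head": {"vars": ["g"]},
    "results": {"bindings": [
        {"g": {"type": "uri",
               "value": "http://logd.tw.rpi.edu/vocab/Dataset"}},
    ]},
}

url = build_query_url(LIST_GRAPHS)
rows = bindings_to_rows(example_response)
```

The resulting `url` can be fetched with any HTTP client; how the response format is negotiated (an `Accept` header or an endpoint-specific parameter) depends on the server and is not shown.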
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct max(?modified) as ?last_modified
count(?sample) as ?num_modifications
?sample
WHERE {
GRAPH ?sample {
?sample a conversion:DatasetSample;
dcterms:modified ?modified .
}
}
GROUP BY ?sample
ORDER BY DESC(?last_modified) DESC(?num_modifications)
Of the Dataset Samples that are loaded in the triplestore, which have their Sampled datasets loaded?
(To do: this query currently doesn't return anything.)
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct max(?modified) as ?last_modified
count(?sample) as ?num_modifications
?sample ?sampled
WHERE {
GRAPH ?sample {
?sample a conversion:DatasetSample;
dcterms:modified ?modified .
?sampled void:subset ?sample .
}
GRAPH ?sampled {
?sampled a []
}
}
GROUP BY ?sample ?sampled
ORDER BY DESC(?last_modified) DESC(?num_modifications)
prefix dcterms: <http://purl.org/dc/terms/>
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix logd: <http://logd.tw.rpi.edu/vocab/>
SELECT *
WHERE {
graph logd:Dataset {
logd:Dataset dcterms:modified ?modified .
}
}
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void: <http://rdfs.org/ns/void#>
select distinct ?dataset ?subdataset ?size ?dump
where {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset a void:Dataset .
optional { ?dataset void:subset ?subdataset }
optional { ?subdataset conversion:num_triples ?size }
optional { ?subdataset void:dataDump ?dump }
}
} order by ?dataset ?subdataset
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
SELECT DISTINCT ?source_id ?source_homepage ?dataset_id ?dataset_homepage ?dataset max(?modified) AS ?lastModified
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset a conversion:Dataset;
void:subset [ a conversion:VersionedDataset ] ;
conversion:dataset_identifier ?dataset_id;
dcterms:modified ?modified ;
dcterms:source ?organization .
?organization a foaf:Agent;
dcterms:identifier ?source_id .
}
graph ?meta {
?meta a conversion:MetaDataset .
optional{ ?organization foaf:homepage ?source_homepage }
#exceeds execution time threshold: optional{ ?dataset foaf:homepage ?dataset_homepage }
}
} ORDER BY ?dataset
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
SELECT count(?dataset)
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset a conversion:LayerDataset;
conversion:conversion_identifier "raw" .
}
}
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
SELECT DISTINCT ?dataset ?dump
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset a conversion:Dataset;
void:subset [ a conversion:SameAsDataset;
void:dataDump ?dump ]
}
}
prefix dcterms: <http://purl.org/dc/terms/>
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
SELECT DISTINCT ?source_id ?dataset_id ?version_id ?layer_id ?sample_uri ?dump_file ?created_date ?loaded_boolean
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset
a conversion:Dataset;
conversion:source_identifier ?source_id;
conversion:dataset_identifier ?dataset_id;
void:subset [ a conversion:VersionedDataset;
conversion:version_identifier ?version_id;
void:subset [ a conversion:LayerDataset;
conversion:conversion_identifier ?layer_id;
dcterms:created ?created_date;
void:subset ?sample_uri ]
] .
?sample_uri a conversion:DatasetSample;
void:dataDump ?dump_file .
}
optional {
graph ?sample_uri {
?sample_uri a ?loaded_boolean .
filter(?loaded_boolean = void:Dataset)
}
}
} ORDER BY ?source_id ?dataset_id ?version_id ?layer_id
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select distinct ?dataset
where {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset a conversion:Dataset;
conversion:source_identifier "data-gov" .
}
}
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select *
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset a conversion:Dataset;
void:subset [ a conversion:VersionedDataset ] .
}
}
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void: <http://rdfs.org/ns/void#>
select distinct ?dataset ?subdataset ?size ?dump
where {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset a void:Dataset ;
conversion:source_identifier "data-gov";
conversion:dataset_identifier "1008" .
optional { ?dataset void:subset ?subdataset }
optional { ?subdataset conversion:num_triples ?size }
optional { ?subdataset void:dataDump ?dump }
}
} order by ?dataset ?subdataset
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
ASK
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset a void:Dataset;
conversion:source_identifier "data-gov";
conversion:dataset_identifier "8" .
optional { ?dataset void:subset ?subdataset }
optional { ?NOPARENT void:subset ?dataset }
filter(!bound(?NOPARENT))
}
graph ?dataset {
[] a []
}
}
prefix ov: <http://open.vocab.org/terms/>
ask
WHERE {
graph <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/1st-anniversary/conversion/raw/subset/sample> {
[] ov:csvRow ?row
}
}
prefix ov: <http://open.vocab.org/terms/>
ask
WHERE {
graph <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/1st-anniversary/conversion/e1/subset/sample> {
[] ov:csvRow ?row
}
}
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select distinct ?dataset ?predicate
where {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset conversion:uses_predicate ?predicate
}
}
prefix wgs: <http://www.w3.org/2003/01/geo/wgs84_pos#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void: <http://rdfs.org/ns/void#>
select distinct ?g
where {
graph ?g {
?s wgs:lat ?lat
}
}
Datasets comprising only one CSV create a 3-level hierarchy, while datasets comprising more than one CSV create a 4-level hierarchy.
Query for all unversioned datasets:
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select distinct ?unversioned
where {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
{
# Unversioned datasets with single CSV
?unversioned void:subset ?versioned .
?versioned void:subset ?layer .
?layer conversion:num_triples ?triples ;
void:dataDump ?dump .
}
union
{
# Unversioned datasets with multiple CSVs
?unversioned void:subset ?versioned .
?versioned void:subset ?layer .
?layer void:dataDump ?dump ;
void:subset ?multi_component .
?multi_component conversion:num_triples ?triples .
}
}
} order by ?unversioned
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select count(distinct ?unversioned)
where {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
{ # Unversioned datasets with single CSV
?unversioned void:subset [
void:subset [
conversion:num_triples ?triples ;
void:dataDump ?dump
]
]
}
union
{ # Unversioned datasets with multiple CSVs
?unversioned void:subset [
void:subset [
void:dataDump ?dump ;
void:subset [
conversion:num_triples ?triples
]
]
]
}
}
}
(see http://data-gov.tw.rpi.edu/wiki/URI_design_for_RDF_conversion_of_CSV-based_data#VoID_descriptions for a diagram illustrating the different VoID hierarchies between single- and multi-CSV datasets.)
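The difference between the two hierarchies matters when totalling triples: in the single-CSV case the count sits on the layer itself, while in the multi-CSV case it sits on the layer's per-CSV subsets. The Python sketch below shows that logic over plain dictionaries. It is a simplification under stated assumptions: the function name and the dictionary inputs are hypothetical, the counts are made up, the `EXAMPLE_FILE` URI is invented for illustration, and subsets without a count (such as samples) are assumed to be skipped.

```python
def layer_triple_count(layer, num_triples, subsets):
    """Return the triple count for a layer dataset.

    num_triples: dict mapping a dataset URI to its conversion:num_triples
    subsets:     dict mapping a dataset URI to its void:subset children
    """
    if layer in num_triples:
        # Single-CSV dataset: the layer itself carries the count.
        return num_triples[layer]
    # Multi-CSV dataset: add up the counts of the per-CSV subsets.
    # Children with no recorded count are ignored.
    return sum(num_triples[child]
               for child in subsets.get(layer, [])
               if child in num_triples)


root = "http://logd.tw.rpi.edu/source/data-gov/dataset"
single_layer = f"{root}/1008/version/2010-Jul-21/conversion/raw"
multi_layer = f"{root}/1033/version/1st-anniversary/conversion/raw"
csv_a = f"{root}/1033/FM_FACILITY_FILE/version/1st-anniversary/conversion/raw"
# Hypothetical second CSV component, not taken from the real dataset.
csv_b = f"{root}/1033/EXAMPLE_FILE/version/1st-anniversary/conversion/raw"
sample = f"{multi_layer}/subset/sample"

# Illustrative counts only.
num_triples = {single_layer: 1200, csv_a: 300, csv_b: 450}
subsets = {multi_layer: [csv_a, csv_b, sample]}

single_total = layer_triple_count(single_layer, num_triples, subsets)
multi_total = layer_triple_count(multi_layer, num_triples, subsets)
```

With these made-up numbers, `single_total` is read directly from the layer and `multi_total` is the sum over the two CSV components.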
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select ?p ?o
where {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
<http://logd.tw.rpi.edu/source/data-gov/dataset/1008>
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21> .
<http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21>
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21/conversion/raw> .
<http://logd.tw.rpi.edu/source/data-gov/dataset/1008/version/2010-Jul-21/conversion/raw> ?p ?o .
}
}
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select ?p ?o
where {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
<http://logd.tw.rpi.edu/source/data-gov/dataset/1033>
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary> .
<http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary>
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary/conversion/raw> .
<http://logd.tw.rpi.edu/source/data-gov/dataset/1033/version/1st-anniversary/conversion/raw>
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1033/FM_FACILITY_FILE/version/1st-anniversary/conversion/raw> .
<http://logd.tw.rpi.edu/source/data-gov/dataset/1033/FM_FACILITY_FILE/version/1st-anniversary/conversion/raw> ?p ?o
}
}
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?dataDump sum(?num_triples) as ?triples
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
<http://logd.tw.rpi.edu/source/data-gov/dataset/1008>
void:subset [
a conversion:VersionedDataset;
void:subset ?layer ] .
{
?layer conversion:num_triples ?num_triples;
void:dataDump ?dataDump.
}
UNION
{
?layer void:dataDump ?dataDump;
void:subset ?multiple_table .
?multiple_table conversion:num_triples ?num_triples .
}
}
}
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void: <http://rdfs.org/ns/void#>
select distinct ?dataset ?versionedDataset ?layerDataset ?sample ?dump
where {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset void:subset ?versionedDataset .
?versionedDataset a conversion:VersionedDataset;
void:subset ?layerDataset .
?layerDataset a conversion:LayerDataset;
void:subset ?sample .
?sample a conversion:DatasetSample;
void:dataDump ?dump .
}
} order by ?dataset ?versionedDataset ?layerDataset ?sample
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void: <http://rdfs.org/ns/void#>
SELECT DISTINCT ?source_id ?dataset_id ?sample ?dump
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?sample a conversion:DatasetSample;
conversion:source_identifier ?source_id;
conversion:dataset_identifier ?dataset_id;
conversion:version_identifier "1st-anniversary";
void:dataDump ?dump .
}
} ORDER BY ?sample
prefix dcterms: <http://purl.org/dc/terms/>
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
SELECT *
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
{?dataset a conversion:LayerDataset; void:dataDump ?dump }
optional { ?dataset conversion:source_identifier ?source_id }
optional { ?dataset conversion:dataset_identifier ?dataset_id }
optional { ?dataset conversion:dataset_version ?version_id }
}
}
prefix dcterms: <http://purl.org/dc/terms/>
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
SELECT count(distinct ?dataset1) as ?dumps
count(distinct ?dataset2) as ?to_source
count(distinct ?dataset3) as ?to_dataset
count(distinct ?dataset4) as ?to_version
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
{?dataset1 void:dataDump ?dumpfile}
union
{?dataset2 void:dataDump ?dumpfile; conversion:source_identifier ?source_id}
union
{?dataset3 void:dataDump ?dumpfile; conversion:source_identifier ?source_id; conversion:dataset_identifier ?dataset_id}
union
{?dataset4 void:dataDump ?dumpfile; conversion:source_identifier ?source_id; conversion:dataset_identifier ?dataset_id; conversion:dataset_version ?version_id}
}
}
prefix dcterms: <http://purl.org/dc/terms/>
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?dataset3
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset3 void:dataDump ?dumpfile; conversion:source_identifier ?source_id; conversion:dataset_identifier ?dataset_id .
optional{?dataset3 conversion:dataset_version ?version_id}
filter(!bound(?version_id))
}
} order by ?dataset3
prefix conversion: <http://purl.org/twc/vocab/conversion/>
prefix void: <http://rdfs.org/ns/void#>
SELECT DISTINCT ?type
WHERE {
graph <http://logd.tw.rpi.edu/vocab/Dataset> {
?dataset a void:Dataset ; a ?type .
}
} order by ?type
- This was originally developed on LOGD's site, but moved here because they didn't like it.
Li Ding put together a dump file validation service to make sure the dump files exist.
- Lee Feigenbaum discusses how they used named graphs for versioning.