
Aggregating subsets of converted datasets

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

What we will cover

This page describes how to select different subsets of all the RDF produced during conversion, so that smaller portions can be made widely available without the overhead of loading every dataset's data triples.

Let's get to it!

Several types of aggregations are placed in different named graphs for easy access. All queries shown on this page can be executed at http://logd.tw.rpi.edu/sparql. See also Querying datasets created by csv2rdf4lod.
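
The queries can also be submitted from the command line over the standard SPARQL protocol. Here is a minimal sketch with curl; the counting query and the JSON Accept header are illustrative assumptions, not requirements of the endpoint:

% curl -G 'http://logd.tw.rpi.edu/sparql' \
       -H 'Accept: application/sparql-results+json' \
       --data-urlencode 'query=SELECT (COUNT(*) AS ?count) WHERE { GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> { ?s ?p ?o } }'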

Design Pattern: Aggregations are versioned datasets

All aggregations described on this page follow the convention that their results become a new version of a dataset dedicated to that aggregation. All members of the aggregation are collected into a new directory:

  • CSV2RDF4LOD_BASE_URI/source/CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID/dataset/cr-publish-XXX-to-endpoint/version/TODAY/source

where XXX is derived from the script name and TODAY is the current date in the form YYYY-Mon-DD. A sibling publish/ directory is created to hold the aggregation of the files collected in source/, and scripts in publish/bin/ are created and used to load the aggregation into the triple store. This pattern follows the conventions for the conversion cockpit.
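
As a rough sketch of how that directory name is assembled (the values shown are illustrative; the dataset URI additionally carries the CSV2RDF4LOD_BASE_URI prefix):

XXX=dcat                                   # taken from the script name, e.g. cr-publish-dcat-to-endpoint.sh
TODAY=`date +%Y-%b-%d`                     # e.g. 2012-Sep-17
echo "$CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID/dataset/cr-publish-$XXX-to-endpoint/version/$TODAY/source"
# e.g. tw-rpi-edu/dataset/cr-publish-dcat-to-endpoint/version/2012-Sep-17/source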

Aggregation 1/9: DCAT

cr-publish-dcat-to-endpoint.sh creates a new version of the abstract dataset:

  • CSV2RDF4LOD_BASE_URI/source/CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID/dataset/cr-publish-dcat-to-endpoint

(where the [variables](CSV2RDF4LOD environment variables) in capitals are expanded). The script must be run from a [cr:data-root](directory conventions), e.g. /srv/twc-healthdata/data/source. It aggregates all *dcat.ttl files in the [cr:dataset](directory conventions) and [cr:directory-of-versions](directory conventions) directories. These dcat files reference the download URLs of the data files for the dataset that their directory represents. They can be created by cr-create-dataset-dirs-from-ckan.py and are recognized and acted upon by cr-retrieve.sh.

% pwd
/srv/twc-healthdata/data/source

% cr-pwd-type.sh 
cr:data-root

% cr-vars.sh | grep OUR_SOURCE_ID
CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID                        tw-rpi-edu

% cr-publish-dcat-to-endpoint.sh -n
8 . hub-healthdata-gov/third-national-survey-older/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare-cost/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare-cost-report/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare-cost-report-data/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare-cost-report-data-fy2011/dcat.ttl
...
publish/tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.nt
publish/tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.sd_name
publish/tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.void.ttl

% cd tw-rpi-edu/cr-publish-dcat-to-endpoint/version/2012-Sep-17
% ls source/*dcat.ttl | wc -l
     136

% ls -lt publish/
total 432
drwxr-xr-x  5 lebot  staff     170 Sep 17 23:54 bin
-rw-r--r--  1 lebot  staff    1325 Sep 17 23:54 tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.void.ttl
-rw-r--r--  1 lebot  staff  209888 Sep 17 23:54 tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.nt
-rw-r--r--  1 lebot  staff      97 Sep 17 23:54 tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.sd_name

Aggregation 2/9: DROID

PRONOM

Aggregation 3/9: Dataset Conversion Metadata (PROV-O, DCTerms, VoID)

cr-publish-void-to-endpoint.sh creates a new version of the abstract dataset:

  • CSV2RDF4LOD_BASE_URI/source/CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID/dataset/cr-publish-void-to-endpoint

(where the [variables](CSV2RDF4LOD environment variables) in capitals are expanded). The script must be run from a [cr:data-root](directory conventions), e.g. /srv/twc-healthdata/data/source. It aggregates all conversion cockpits' source/*.void.ttl files, which contain provenance of the retrieval, tweaking, conversion (including enhancement parameters), and aggregation processes, as well as the VoID and DC Terms metadata produced by the converter.
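
Running it looks much like the DCAT example above; here is a minimal sketch (the -n dry-run flag is assumed to behave the same way for this script):

% cd /srv/twc-healthdata/data/source        # a cr:data-root
% cr-publish-void-to-endpoint.sh -n         # -n assumed: preview what would be aggregated and published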

Note that $CSV2RDF4LOD_PUBLISH_SUBSET_VOID_NAMED_GRAPH was used to determine the graph before the "Create a new version of the abstract dataset" convention was established; that environment variable is now deprecated. The naming of this script ("void") is inaccurate because it provides much more metadata than just VoID, including provenance. For example, the following query lists the predicates used to describe a particular versioned dataset in the metadata graph:

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?p
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset>  {
    <http://logd.tw.rpi.edu/source/nitrd-gov/dataset/nsf_awards/version/2011-Jan-27> ?p ?o
  }
} order by ?p

The data is loaded into a named graph named after the dataset:

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?p
WHERE {
  GRAPH <http://logd.tw.rpi.edu/source/nitrd-gov/dataset/nsf_awards/version/2011-Jan-27>  {
    ?s ?p ?o
  }
} order by ?p

Note that cr-publish-params-to-endpoint.sh used to load into $CSV2RDF4LOD_PUBLISH_CONVERSION_PARAMS_NAMED_GRAPH until the "Create a new version of the abstract dataset" convention was established.

Datasets that promote their properties to geonames:parentFeature:

prefix geonames:   <http://www.geonames.org/ontology#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select ?dataset count(*) as ?count
where {
  graph <http://purl.org/twc/vocab/conversion/ConversionProcess> {
    ?dataset conversion:conversion_process [
      conversion:enhancement_identifier ?e;
      conversion:enhance [ 
        conversion:subproperty_of geonames:parentFeature
      ]
    ] 
  }
}
group by ?dataset ?e
order by ?count

Superproperties referenced:

prefix geonames:   <http://www.geonames.org/ontology#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select distinct ?superproperty
where {
  graph <http://purl.org/twc/vocab/conversion/ConversionProcess> {
    ?dataset conversion:conversion_process [
      conversion:enhancement_identifier ?e;
      conversion:enhance [ 
        conversion:subproperty_of ?superproperty
      ]
    ] 
  }
}
order by ?superproperty

TODO: demonstrate a query that accesses a dataset's data based on its enhancement params. Requires a conversion using the latest build because the enhancement params URI was changed to the actual dataset URI for easier connection.

Aggregation 4/9: Sitemap

See Ping the Semantic Web

Aggregation 5/9: owl:sameAs links

owl:sameAs links are loaded into $CSV2RDF4LOD_PUBLISH_SUBSET_SAMEAS_NAMED_GRAPH (e.g. http://purl.org/twc/vocab/conversion/SameAsDataset) by $CSV2RDF4LOD_HOME/bin/cr-publish-sameas-to-endpoint.sh.

http://logd.tw.rpi.edu/query/logd-stat-num-outlinks.sparql:

prefix owl: <http://www.w3.org/2002/07/owl#>
SELECT count(*) as ?count
WHERE {
  graph <http://purl.org/twc/vocab/conversion/SameAsDataset> {
    ?s owl:sameAs ?o
  }
  filter( ! (   regex(str(?s),"^http://logd.tw.rpi.edu*") 
             && regex(str(?o),"^http://logd.tw.rpi.edu*") )
  )
}

http://logd.tw.rpi.edu/query/logd-stat-num-outlinks-govtrack.sparql:

prefix owl: <http://www.w3.org/2002/07/owl#>
SELECT count(*) as ?count
WHERE {
  graph <http://purl.org/twc/vocab/conversion/SameAsDataset> {
    ?s owl:sameAs ?o
  }
  filter(regex(str(?o),"^http://www.rdfabout.com/rdf/usgov*"))
}

Queries about links to other bubbles in the LOD cloud can be answered by taking the query above and changing the regex:

http://logd.tw.rpi.edu/query/logd-stat-num-outlinks-geonames.sparql: filter(regex(str(?o),"^http://sws.geonames.org*"))

http://logd.tw.rpi.edu/query/logd-stat-num-outlinks-dbpedia.sparql: filter(regex(str(?o),"^http://dbpedia.org/resource*"))

TODO: links among VersionedDataset URIs vs. links between VersionedDataset URIs (because of owl:sameAs between layers).

TODO: this should return results:

prefix void:       <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>

select distinct ?dataset ?dump 
where {
  graph ?g {
    ?dataset a conversion:SameAsDataset; void:dataDump ?dump .
  }
} order by ?dataset

Aggregation 6/9: MetaDatasets

Some datasets actually describe other datasets. For example, data.gov's dataset 92 describes all of data.gov's other "raw" datasets. All MetaDatasets are loaded (by $CSV2RDF4LOD_HOME/bin/util/cr-virtuoso-load-metadataset.sh) into a special named graph so they can be accessed to augment dataset descriptions.

source/data-gov/92/version/data_gov_catalog.csv.e1.params.ttl adds the types conversion:DatasetCatalog and conversion:MetaDataset in its global enhancement parameters:

<http://logd.tw.rpi.edu/source/data-gov/dataset/92/version/2011-Jul-11/conversion/enhancement/1>
   a conversion:LayerDataset, void:Dataset;

   conversion:base_uri           "http://logd.tw.rpi.edu"^^xsd:anyURI;
   conversion:source_identifier  "data-gov";
   conversion:dataset_identifier "92";
   conversion:version_identifier "2011-Jul-11";

   a conversion:DatasetCatalog, conversion:MetaDataset;

The following query lists datasets that have been converted, but have not been described by another dataset (via the csv2rdf4lod converter, which asserts the ov:csvRow). Thanks to Greg for tweaking the query for efficiency.

PREFIX void:       <http://rdfs.org/ns/void#>
PREFIX ov:         <http://open.vocab.org/terms/>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>

SELECT ?source_id ?dataset_id ?version_id
WHERE {
 GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
   # Datasets that have been converted
   ?converted a conversion:VersionedDataset;
              conversion:source_identifier  ?source_id ;
              conversion:dataset_identifier ?dataset_id ;
              conversion:version_identifier ?version_id .
 }
OPTIONAL {
 GRAPH <http://purl.org/twc/vocab/conversion/MetaDataset>  {
   # But we have no metadata for it.
   ?converted ov:csvRow ?row
 }
}
   filter(!bound(?row))
} order by ?source_id ?dataset_id ?version_id

TODO: replace directory processing with query in logd-load-metadata-graph.sh:

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT distinct ?metadata
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?metadata a conversion:MetaDataset; 
              conversion:conversion_process [] .
  }
}

Aug 2011 query for DatasetCatalogs (which are MetaDatasets). They are typed at the abstract and atomic levels. (results)

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?sourceID ?datasetID
WHERE {
  GRAPH ?g {
    ?abstract a conversion:DatasetCatalog;
        conversion:source_identifier  ?sourceID;
        conversion:dataset_identifier ?datasetID .
  }
} ORDER BY ?sourceID

(results)

PREFIX void:       <http://rdfs.org/ns/void#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?sourceID ?datasetID
WHERE {
  GRAPH ?g {
    {?abstract
        conversion:source_identifier  ?sourceID;
        conversion:dataset_identifier ?datasetID;
        void:subset [ a conversion:VersionedDataset;
                      void:subset [ 
                        # Datasets from one source data file
                        a conversion:DatasetCatalog 
                      ];
                    ] .
    }
    UNION {
      ?abstract
        conversion:source_identifier  ?sourceID;
        conversion:dataset_identifier ?datasetID;
        void:subset [ a conversion:VersionedDataset;
                      void:subset [ 
                        void:subset [ 
                          # Datasets from multiple source data files
                          a conversion:DatasetCatalog;
                        ];
                      ];
                    ] .
    }
  }
} ORDER BY ?sourceID

Aggregation 7/9: rdfs:isDefinedBy

See cr-isdefinedby.

Aggregation 8/9: Turtle-in-comments

cr-publish-tic-to-endpoint.sh

> cr-pwd-type.sh 
cr:data-root

> cr-publish-tic-to-endpoint.sh cr:auto
healthdata-tw-rpi-edu/catalog/version/2012-Sep-19/publish/bin/virtuoso-delete-healthdata-tw-rpi-edu-catalog-2012-Sep-19.sh
healthdata-tw-rpi-edu/catalog/version/2012-Sep-19/publish/bin/virtuoso-load-healthdata-tw-rpi-edu-catalog-2012-Sep-19.sh
healthdata-tw-rpi-edu/catalog/version/retrieve.sh
hub-healthdata-gov/2008-basic-stand-alone-carrier/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-durable/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-home/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-hospice/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-inpatient/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-outpatient/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-prescription/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-skilled/dcat.ttl
...

> find tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/10.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/100.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/101.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/102.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/103.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/104.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/105.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/106.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/107.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/108.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/109.ttl
...
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/10.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/100.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/101.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/102.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/103.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/104.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/105.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/106.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/107.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/108.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/109.ttl.ttl
...
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/bin/ln-to-www-root-tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.sh
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/bin/virtuoso-delete-tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.sh
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/bin/virtuoso-load-tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.sh
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.nt
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.sd_name
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.void.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-latest.ttl

Aggregation 9/9: Full Dump

See One click data dump

Example: LOGD's instance of csv2rdf4lod

The LOGD SPARQL endpoint has three special named graphs:

  • http://logd.tw.rpi.edu/vocab/Dataset contains information about the LOGD datasets that was asserted during conversion to RDF. This includes the VoID subset hierarchy and dataDumps, SCOVO triple counts, references to (and definitions of) the predicates and classes used, and some PML justifications tracing the provenance of the tabular conversions to RDF.
  • http://purl.org/twc/vocab/conversion/MetaDataset contains information about datasets obtained from other sources. For example, it includes data.gov's Dataset 92 because it describes the rest of data.gov's offerings. A second dataset is TWC's own data catalog that describes similar aspects for datasets from other sources.
  • http://purl.org/twc/vocab/conversion/SameAsDataset contains owl:sameAs links among entities within the LOGD datasets as well as into DBpedia, Geonames, and GovTrack. All of the links are co-located in a single graph to help explore the interconnectivity of the LOGD datasets.

Development notes

Starred (*) scripts are exemplars for the pattern; fewer (non-zero) pluses (+) mean the script was developed more recently. (When a new entry appears, add a + to every other existing entry.)

(See this list, too)

  • pr-whois-domain.sh
    • Adapted from pr-neighborlod.sh
    • Set URL explicitly to the URI node dump.
    • Hid SPARQL querying rq and rq2
    • Hid DROIDing
    • Never $worthwhile
    • Hid $worthwhile cleanup
    • FORGETs what version it should create when it's placed in a conversion cockpit.
    • Adds destruction of version before marching on to do it again.
  • pr-aggregate-pingbacks.sh +
    • Adopted Aggregation exemplar, but need to add verification functionality.
    • Had to switch from PATH=$PATH'$HOME/bin/util/cr-situate-paths.sh' to PATH=$PATH'$HOME/bin/install/paths.sh' when moving from csv2rdf4lod to Prizms.
  • opendap-svn-file-hierarchy ++
    • Adapted from pr-neighborlod.sh; took out $rq2 handling, soft link PATH handling.
  • bin/dataset/cr-sparql-sd.sh +++
    • Adopted trimmed version of Aggregation exemplar.
    • Removed idempotency; we only want it to be run once.
  • data-carved-graphs-btes ++++
  • bin/dataset/cr-aggregate-dcat.sh (Aggregation exemplar)++++
    • reused softlink-safe $this logic
    • reused dryrun conditional
    • cleans out version (as opposed to new dataset exemplar, which increments)
  • bin/dataset/pr-neighborlod.sh (New dataset exemplar)++++
    • updated $this and $HOME logic when is a soft link, augments PATH and CLASSPATH in-line.
    • updated the "retrieve from local endpoint" pattern to a variable for the query file.
    • removed check to prevent attempt to make worthwhile version (removes itself if not worthwhile)
    • modifies SPARQL template before execution
    • swaps SPARQL query from subject-based to object-based when the former runs dry.
    • increments version (as opposed to aggregate exemplar, which cleans out the version)
  • bin/secondary/cr-aggregate-eparams.sh +++++
    • pushd conversion root
    • more complete "SDV" naming logic
    • removed all graph naming/clearing clutter
    • handles "$0" when is a soft link
    • includes the #3> <> a conversion:RetrievalTrigger; that pr-enable-dataset.sh needs to list.
  • pr-spobal-ng.sh ++++++
    • removes retrieval attempt if it did not become worthwhile.
    • should be extended to accept the sd:name to process.
  • WCL's asset-alchemyapi/retrieve.sh +++++++
    • accepts the URI to analyze, or uses cache-queries.sh to SPARQL query for those that need to be analyzed.
  • WCL's property-chains/retrieve.sh ++++++++
    • retrieves with cache-queries.sh
    • recursively generates versions until no triples returned
    • cheats on loading (vload) - needs to be cleaned up.
  • bin/cr-pingback.sh ++++++*+++
    • does not depend on CSV2RDF4LOD_HOME;
    • can run from source directory;
    • runs without "cr:auto" argument;
    • CUT a lot of the file aggregation stuff;
    • bails if run within last week -- see cr-publish-droid-to-endpoint.sh or older
  • bin/util/cr-full-dump.sh ++++++++++
    • avoids using aggregate-source-rdf.sh
  • bin/cr-publish-droid-to-endpoint.sh +++++++++++
    • uses # - - - - to delineate the source/* linking;
    • links into source/ via "sdv".ttl
  • bin/cr-publish-isdefinedby-to-endpoint.py +++++++++++++
  • bin/cr-publish-isdefinedby-to-endpoint.sh +++++++++++++
    • checks for $CSV2RDF4LOD_PUBLISH_SPARQL_ENDPOINT;
    • wraps python;
    • uses aggregate-source-rdf.sh --link-as-latest;
    • works from cr:data-root, cr:source, cr:dataset, and cr:directory-of-versions, not just cr:data-root and cr:source
  • bin/cr-publish-cockpit.sh +++++++++++++
    • just hops into cockpit and runs convert-aggregate.sh
  • bin/cr-publish-params-to-endpoint.sh +++++++++++
    • uses aggregate-source-rdf.sh --link-as-latest
  • bin/cr-publish-tic-to-endpoint.sh ++++++++++++
    • links into source/ by for loop tally;
    • processes source into automatic
  • bin/cr-publish-void-to-endpoint.sh ++++++++++++
    • uses --link-as-latest
  • bin/cr-publish-dcat-to-endpoint.sh ++++++++++++
    • uses publish/bin to load/delete graph;
    • uses dryrun.sh $dryrun ending;
    • uses aggregate-source-rdf.sh (not link latest);
    • links into source/"$sdv".ttl with a for loop
  • bin/cr-publish-sameas-to-endpoint.sh +++++++++++++
    • still did its own graph loading with vload;
    • should use aggregate-source-rdf.sh

The pattern for all of these scripts is:

  • dryrun.sh $dryrun beginning
  • Invoke from the conversion data root (i.e., cr-pwd-type.sh cr:data-root)
  • Use [env var](CSV2RDF4LOD environment variables) $CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID to determine the source (organization) directory in which to create the aggregate dataset.
  • Make a cockpit for a new versioned dataset, e.g. source/tw-rpi-edu/dataset/cr-publish-dcat-to-endpoint/version/2012-Sep-07
  • Hard link files into the new cockpit's source/ directory.
  • Aggregate source/* into publish/* with aggregate-source-rdf.sh --link-as-latest source/*
  • Use the publish/bin/* scripts to publish like a normal dataset.
  • dryrun.sh $dryrun ending
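
A condensed sketch of that pattern as a shell script follows. The helper invocations mirror the list above, but the "example" names, the -n handling, and the lack of collision-safe file naming are simplifications and assumptions, not a copy of any one script.

#!/bin/bash
#
# Sketch of the aggregation pattern; "example" names are hypothetical.

[ "$1" == "-n" ] && dryrun="true"
dryrun.sh $dryrun beginning                              # announce a dry run

[ "`cr-pwd-type.sh`" == "cr:data-root" ] || exit 1       # must run from the conversion data root

source_id="$CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID"           # which source (organization) gets the dataset
today=`date +%Y-%b-%d`                                   # e.g. 2012-Sep-07
cockpit="$source_id/dataset/cr-publish-example-to-endpoint/version/$today"   # new versioned dataset

if [ "$dryrun" != "true" ]; then
   mkdir -p "$cockpit/source"
   find . -name '*.example.ttl' | while read -r ttl; do  # hypothetical input files
      ln "$ttl" "$cockpit/source/"                       # hard link into the new cockpit's source/
   done
   pushd "$cockpit" > /dev/null
      aggregate-source-rdf.sh --link-as-latest source/*  # aggregate source/* into publish/*
      for load in publish/bin/virtuoso-load-*.sh; do     # publish like a normal dataset
         bash "$load"
      done
   popd > /dev/null
fi

dryrun.sh $dryrun ending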

Old

Each type of aggregation is described below and is performed by $CSV2RDF4LOD_HOME/bin/util/cr-virtuoso-load-metadata.sh, whose behavior is controlled by changing certain CSV2RDF4LOD environment variables:

  • CSV2RDF4LOD_CONVERT_DATA_ROOT - the [data root](csv2rdf4lod data root) from which to aggregate and publish. See [this](Publishing conversion results with a Virtuoso triplestore), too.
  • CSV2RDF4LOD_PUBLISH_LOD_MATERIALIZATION_WWW_ROOT - the /var/www directory that publishes files to the web.
  • CSV2RDF4LOD_PUBLISH_VIRTUOSO_SPARQL_ENDPOINT - the endpoint that ${CSV2RDF4LOD_HOME}/bin/util/virtuoso/vload populates.

See also
