CSV2RDF4LOD environment variables (considerations for a distributed workflow)

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

Different CSV2RDF4LOD environment variables apply in different situations. Your variable settings can depend on:

the Project you are working on.
the Machine you are working on.
the Dataset you are working on.
who You are (as opposed to your team members).
what you are Doing (e.g., bulk conversion, developing enhancements, testing, etc.).

For example, you could be working on LOGD, LOBD, SWQP, or OrgPedia, each of which has a different CSV2RDF4LOD_BASE_URI (http://logd.tw.rpi.edu, http://health.tw.rpi.edu, etc.).

In simple environments, the my-csv2rdf4lod-source-me.sh created when [installing csv2rdf4lod-automation](Installing csv2rdf4lod automation) does the job. But once you work on several projects, collaborate with others through version control, and use different machines, things can get messy. This page offers recommendations and best practices for managing settings in these more complicated environments.

Templates for each group of environment variables are available here.

Naming conventions

In the simple case of one machine, one project, and one you, stick with my-csv2rdf4lod-source-me.sh. When you get more of any of those, the following naming conventions help organize the settings for CSV2RDF4LOD environment variables according to how, when, or why they should be used. The for-, on-, as-, and when- prefixes sort nicely and indicate the type of environment variables the file contains.

csv2rdf4lod-source-me-for-PROJECTNAME.sh
csv2rdf4lod-source-me-on-MACHINENAME.sh  
csv2rdf4lod-source-me-as-USERNAME.sh
csv2rdf4lod-source-me-when-ACTIVITYNAME.sh
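As a rough illustration of how the prefixes map to kinds of settings, the sketch below classifies a source-me file by its name. The function name source_me_kind is hypothetical and is not part of csv2rdf4lod-automation; it only restates the convention above.

```shell
# Hypothetical helper (not shipped with csv2rdf4lod-automation):
# report which kind of settings a source-me file holds, judging
# only by the naming convention described above.
source_me_kind() {
   case "$(basename "$1")" in
      csv2rdf4lod-source-me-for-*.sh)  echo "project"  ;;
      csv2rdf4lod-source-me-on-*.sh)   echo "machine"  ;;
      csv2rdf4lod-source-me-as-*.sh)   echo "user"     ;;
      csv2rdf4lod-source-me-when-*.sh) echo "activity" ;;
      *)                               echo "unknown"  ;;
   esac
}

# List the source-me scripts in the current directory with their kind.
for f in csv2rdf4lod-source-me-*.sh; do
   [ -e "$f" ] || continue   # the glob matched nothing
   echo "$f: $(source_me_kind "$f")"
done
```

A file that does not follow the convention, such as my-csv2rdf4lod-source-me.sh, is reported as "unknown".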

Including documentation pointers

We recommend including these comments in your source-me scripts so people have pointers to the latest information about what they are for and how to use them.

#3 <#> a <http://purl.org/twc/vocab/conversion/CSV2RDF4LOD_environment_variables> ;
#3     rdfs:seeAlso 
#3     <http://purl.org/twc/page/csv2rdf4lod/distributed_env_vars>,
#3     <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Script:-source-me.sh> .

Using version control

We recommend that you version control all source-me scripts, so that the same configurations can be used to reproduce your conversion results. See Version control strategies: only the essential minimum is needed.

Example

The following source-me scripts are on TWC's SVN, which is public so that others can reproduce the conversions.

csv2rdf4lod-source-me-for-logd.sh (a project)
csv2rdf4lod-source-me-on-gemini.sh (a machine)
csv2rdf4lod-source-me-on-sam.sh (a machine)
csv2rdf4lod-source-me-as-lebot.sh (a person)
csv2rdf4lod-source-me-when-publishing.sh (an activity)

The source-me scripts above can be checked out using the command:

svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source --depth=files

The following source-me scripts are on TWC's SVN, which is private because it contains ports, usernames, and passwords for the endpoint administration.

csv2rdf4lod-source-me-when-publishing-via-virtuoso.sh (an activity)

Grab them:

svn checkout https://scm.escience.rpi.edu/svn/private/projects/logd/config/        /mnt/raid/logd/svn/config/
svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/ /mnt/raid/logd/svn/source/

Note that the second command checks out the entire [data root](csv2rdf4lod data root) as well as the configurations.

Source (in ~/.bashrc) the right mix based on what machine you're on, what project you're working on, and who you are:

alias l='ls -lt'
source /mnt/raid/logd/svn/config/csv2rdf4lod-source-me-on-gemini.sh
source /mnt/raid/logd/svn/config/csv2rdf4lod-source-me-for-logd.sh
source /mnt/raid/logd/svn/config/csv2rdf4lod-source-me-as-lebot.sh
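If the same ~/.bashrc is shared across machines and users, the file names can be derived instead of hard-coded. The following is a minimal sketch, assuming the machine and user scripts are named after the short host name and the login name; CONFIG_DIR, PROJECT, and source_if_present are illustrative names, not part of csv2rdf4lod-automation.

```shell
# Sketch for ~/.bashrc. CONFIG_DIR and PROJECT are placeholders;
# adjust them to wherever your source-me scripts are checked out.
CONFIG_DIR="${CONFIG_DIR:-/mnt/raid/logd/svn/config}"
PROJECT="${PROJECT:-logd}"

# Source a file only if it is readable; otherwise warn and carry on,
# so a missing script does not break the login shell.
source_if_present() {
   if [ -r "$1" ]; then
      . "$1"
   else
      echo "skipping missing $1" >&2
   fi
}

host="$(uname -n)"   # machine name, possibly fully qualified
host="${host%%.*}"   # keep only the short name, e.g. "gemini"
user="$(id -un)"     # login name, e.g. "lebot"

source_if_present "$CONFIG_DIR/csv2rdf4lod-source-me-on-$host.sh"
source_if_present "$CONFIG_DIR/csv2rdf4lod-source-me-for-$PROJECT.sh"
source_if_present "$CONFIG_DIR/csv2rdf4lod-source-me-as-$user.sh"
```

The order matters only when two scripts set the same variable: the script sourced last wins.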

The conversion trigger, too!

The conversion triggers can contain dataset-specific CSV2RDF4LOD environment variables and should also be version controlled. This eliminates the need for the consumer to know "what data files should be converted?".

CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER="true"

Because a conversion trigger is specific to one version, set the variable in source/SSS/DDD/version/2source.sh to apply it to all future versions. See Automated creation of a new Versioned Dataset.

Tracking down where a CSV2RDF4LOD environment variable is being set

$CSV2RDF4LOD_HOME/bin/util/cr-where-was-envvar-set.sh will dig through all of the source-me scripts to show you where a particular environment variable is set:

$ cr-where-was-envvar-set.sh --help
usage: cr-where-was-envvar-set.sh [-rc ~/.bashrc] [ [--list] | [CSV2RDF4LOD_var] [--only] ]
                -rc : the rc (.bashrc, .login, etc.) file used to source all csv2rdf4lod-source-mes.
             --list : show the source-mes that are used to set up the environment.
  [CSV2RDF4LOD_var] : a CSV2RDF4LOD_ environment variable name.
                      All variables are listed by running cr-vars.sh.
                      If not specified, defaults to CSV2RDF4LOD_HOME.
             --only : omit the CSV2RDF4LOD_ variables that are more specific than the one specified.

see https://github.com/timrdf/csv2rdf4lod-automation/wiki/CSV2RDF4LOD-environment-variables
    https://github.com/timrdf/csv2rdf4lod-automation/wiki/Script:-source-me.sh

To see which source-me scripts are used to set up your environment:

$ cr-where-was-envvar-set.sh -rc ~/.bashrc --list
/srv/logd/data/source/csv2rdf4lod-source-me-for-logd.sh
/srv/logd/data/source/csv2rdf4lod-source-me-on-gemini.sh
/srv/logd/data/source/csv2rdf4lod-source-me-as-lebot.sh
/srv/logd/data/source/csv2rdf4lod-source-me-when-publishing.sh
/srv/logd/config/triple-store/virtuoso/csv2rdf4lod-source-me-for-virtuoso-credentials.sh

To show where CSV2RDF4LOD_PUBLISH_VIRTUOSO is set while omitting the more specific variables (CSV2RDF4LOD_PUBLISH_VIRTUOSO_HOME, CSV2RDF4LOD_PUBLISH_VIRTUOSO_PORT, etc.), pass --only. To see all "children" variables as well, omit the --only parameter.

$ cr-where-was-envvar-set.sh -rc ~/.bashrc CSV2RDF4LOD_PUBLISH_VIRTUOSO --only

/srv/logd/data/source/csv2rdf4lod-source-me-for-logd.sh:export CSV2RDF4LOD_PUBLISH_VIRTUOSO="true"
/srv/logd/data/source/csv2rdf4lod-source-me-for-logd.sh:export CSV2RDF4LOD_PUBLISH_VIRTUOSO="false" 
/srv/logd/config/triple-store/virtuoso/csv2rdf4lod-source-me-for-virtuoso-credentials.sh:export CSV2RDF4LOD_PUBLISH_VIRTUOSO="false" #
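To make the idea behind this lookup concrete, here is a rough approximation using awk and grep. It is not the implementation of cr-where-was-envvar-set.sh; the function name where_was_envvar_set is hypothetical, and the sketch assumes the rc file sources each script on its own line with an absolute path (no ~ or variables in the path).

```shell
# Hypothetical approximation, NOT the real cr-where-was-envvar-set.sh.
# Usage: where_was_envvar_set RC_FILE VARIABLE_NAME
where_was_envvar_set() {
   rc="$1"
   var="$2"
   # Find lines of the form "source FILE" or ". FILE" that mention a
   # csv2rdf4lod-source-me script, and print the FILE part.
   awk '($1 == "source" || $1 == ".") && $2 ~ /csv2rdf4lod-source-me/ { print $2 }' "$rc" |
   while read -r script; do
      # Print "file:line" for each export of exactly this variable,
      # which excludes longer names such as ${var}_PORT.
      [ -r "$script" ] && grep -H "^[[:space:]]*export ${var}=" "$script"
   done
   return 0
}
```

Because the pattern ends in "=", the sketch behaves like the --only case: more specific variables that merely share the prefix are not reported.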

cr:dev and refusing to publish

In multi-developer environments, it is useful to have development sandboxes and a single production [data root](csv2rdf4lod automation data root). In the canonical directory structure for a project (e.g., logd), the data root should be at:

/srv/logd/data/source (which is at svn),

while development sandboxes should be at:

/srv/logd/data/dev/lebot/source,
/srv/logd/data/dev/sym/source, or
/srv/logd/data/dev/difrad/source (depending on the developer's user name).

If everybody could publish "as the project" from their development sandbox, it would become very difficult to trace where published data came from. So we want to prevent people from publishing from their sandboxes.

The current check is equivalent to if [[ $(pwd) == */dev/[^/]*/source/* ]], and is encapsulated in $CSV2RDF4LOD_HOME/bin/util/is-pwd-a.sh:

gemini:/srv/logd/data/source/data-rpi-edu/research-centers/version/2011-Oct-18$ is-pwd-a.sh cr:dev
no

gemini:/srv/logd/data/dev/lebot/source/data-rpi-edu/research-centers/version/2011-Oct-18$ is-pwd-a.sh cr:dev
yes

https://github.com/timrdf/csv2rdf4lod-automation/issues/248

https://github.com/timrdf/csv2rdf4lod-automation/commit/03e0d19ba72650bd36b2b780767868b49ac8bb5f
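The sandbox check and a publish guard built on it can be sketched as follows. This is an illustrative re-creation, not the code in is-pwd-a.sh; is_dev_sandbox and publish_guard are hypothetical names, and the pattern is written with the POSIX [!/] form so the sketch also runs in plain sh.

```shell
# Hypothetical re-creation of the cr:dev check; the real logic lives
# in $CSV2RDF4LOD_HOME/bin/util/is-pwd-a.sh.
# Succeeds when the given directory (default: the current directory)
# is inside a developer sandbox such as /srv/logd/data/dev/lebot/source/...
is_dev_sandbox() {
   case "${1:-$(pwd)}" in
      */dev/[!/]*/source/*) return 0 ;;
      *)                    return 1 ;;
   esac
}

# Refuse to publish when invoked from within a sandbox.
publish_guard() {
   if is_dev_sandbox "$1"; then
      echo "refusing to publish from a development sandbox" >&2
      return 1
   fi
   echo "ok to publish"
}
```

Calling publish_guard with no argument checks the current working directory, which mirrors how is-pwd-a.sh is used in the transcript above.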

Situating

cr-situate-classpaths.sh and cr-situate-paths.sh extend CLASSPATH and PATH, respectively:

export CLASSPATH=$CLASSPATH`$CSV2RDF4LOD_HOME/bin/util/cr-situate-classpaths.sh`
export PATH=$PATH`$CSV2RDF4LOD_HOME/bin/util/cr-situate-paths.sh`
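Both lines append a script's output directly to the existing value, which suggests that each script prints only the entries that still need to be added, each prefixed with a colon. That behavior is an assumption, not something this page states. Under that assumption, the pattern can be sketched with a hypothetical stand-in function, missing_path_entries:

```shell
# Hypothetical stand-in for the pattern above; NOT the real
# cr-situate-paths.sh. Prints ":dir" for each argument that is not
# already an entry of $PATH, and nothing for those that are.
missing_path_entries() {
   for dir in "$@"; do
      case ":$PATH:" in
         *":$dir:"*) ;;                  # already on PATH: print nothing
         *)          printf ':%s' "$dir" ;;
      esac
   done
}

# Example use (commented out so that running the sketch changes nothing):
# export PATH="$PATH$(missing_path_entries "$CSV2RDF4LOD_HOME/bin" "$CSV2RDF4LOD_HOME/bin/util")"
```

Printing only the missing entries keeps the export safe to run repeatedly, for example each time ~/.bashrc is sourced, without growing PATH.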

What is next

https://github.com/jimmccusker/twc-healthdata/tree/master/data/source, which follows these conventions.
my-csv2rdf4lod-source-me.sh, the simple place to set environment variables.
Reusing enhancement parameters for multiple versions or datasets
https://github.com/jimmccusker/twc-healthdata/wiki/The-Benefits-of-Mass-Raw-Conversions provides an example of this applied to the healthdata.tw.rpi.edu project.
