
ESGF_Project_Configuration

Stephen Pascoe edited this page Apr 9, 2014 · 5 revisions
Wiki Reorganisation
This page has been classified for reorganisation. It has been given the category REVISE.
This page contains useful content but needs revision. It may contain out of date or inaccurate content.

Configuring a new project for ESGF publication

This tutorial shows how to configure the ESGF publisher for a new project. The publisher is a utility for:

  • Collecting and publishing metadata to support ESGF search
  • Making data visible and available for download

The CSSEF project will be used as a running example. See the [ESGF publication reference](http://www2-pcmdi.llnl.gov/Members/bdrach/.personal/esg-publisher-configuration/) for details.

First a few definitions:

  • Files : physical data files, typically in a self-describing format such as netCDF

  • Datasets : collections of related files

  • Variables : multidimensional array data together with associated attributes. For example: sea level pressure for dataset foo.bar. The data associated with a variable may be contained in several files, typically organized by simulation time.

Configuring a new project for ESGF involves specifying:

  • How datasets will be identified
  • What metadata will be collected
  • What is the source of the metadata
  • How the data is organized

The configuration file is a text file, /esg/config/esgcet/esg.ini .

Project identification

A project is identified by an alphanumeric name. The name appears in two places in esg.ini :

  • In the [DEFAULT] section, option project_options lists all configured projects:

    project_options =
        cssef | CSSEF Project | 1
        test  | Test Project | 2
    
  • The project section delimiter:

    [project:cssef]
    

    CSSEF project options go here

    ...
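As an illustration of how these two entries relate, here is a minimal Python sketch that reads an esg.ini-style fragment with the standard configparser module and checks which listed projects have a matching [project:...] section. The helper parse_project_options and the sample text are assumptions made for this example, not the publisher's own code, and the third field is simply read as an integer without assuming what it means.

```python
import configparser

# Sample fragment written for this example (not a complete esg.ini)
SAMPLE_INI = """
[DEFAULT]
project_options =
    cssef | CSSEF Project | 1
    test  | Test Project | 2

[project:cssef]
"""


def parse_project_options(raw):
    """Split a multi-line, pipe-delimited option into (name, description, number) tuples."""
    rows = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        name, description, number = [field.strip() for field in line.split("|")]
        rows.append((name, description, int(number)))
    return rows


# interpolation=None keeps any %(...)s strings in the file literal
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE_INI)

projects = parse_project_options(parser.defaults()["project_options"])
for name, description, _ in projects:
    status = "section present" if parser.has_section("project:" + name) else "no section"
    print(name, "|", description, "|", status)
```

In this fragment only cssef has a project section, so the loop reports the test project as missing one.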

Dataset identification

Every dataset in ESGF has a unique identifier. A dataset ID is typically built from a set of component identifiers separated by periods. For example, one of the datasets in CSSEF is named 'cssef.LLNL.cesm1-cam5.def01.run0001.cam.h0'. For this project the identifier is built up from the components:

project: cssef

institute: LLNL

model: cesm1-cam5

experiment: def01

subexperiment: run0001

submodel: cam

hfrequency: h0

It's standard practice for the project identifier to be the leading component of dataset identifiers. The configuration option that defines the dataset identifier is dataset_id :

dataset_id = %(project)s.%(institute)s.%(model)s.%(experiment)s.%(subexperiment)s.%(submodel)s.%(hfrequency)s

The publisher fills in the %(category)s format strings as data is scanned. A descriptive dataset name may also be specified; this appears in the ESG web interface:

dataset_name_format = project=%(project_description)s, institute=%(institute)s, experiment=%(experiment_description)s, subexperiment=%(subexperiment)s,  model=%(model_description)s, submodel=%(submodel)s, version=%(version)s
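The %(name)s placeholders follow Python's dictionary-based string formatting, so the substitution can be sketched in a few lines. This is only an illustration, not the publisher's implementation; the template is the dataset_id from the complete CSSEF configuration at the end of this page, and the values are the example components listed above.

```python
# Component values for one CSSEF dataset, taken from the example above
components = {
    "project": "cssef",
    "institute": "LLNL",
    "model": "cesm1-cam5",
    "experiment": "def01",
    "subexperiment": "run0001",
    "submodel": "cam",
    "hfrequency": "h0",
}

dataset_id_template = (
    "%(project)s.%(institute)s.%(model)s.%(experiment)s."
    "%(subexperiment)s.%(submodel)s.%(hfrequency)s"
)

# Each %(key)s is replaced by the matching dictionary value
dataset_id = dataset_id_template % components
print(dataset_id)  # cssef.LLNL.cesm1-cam5.def01.run0001.cam.h0
```

A missing key raises KeyError, which is a convenient way to notice that a component was never scanned.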

Category definition

The metadata items collected by ESGF are called _categories_. These are the items presented for search in the ESGF Web Portal. We have already seen some examples in the previous section: project, model, etc. Every ESGF project has at least the categories project, model, and experiment.

Most projects will define more than just the minimum categories. For CSSEF, the components of the dataset ID will all be categories. In general there will be additional categories not contained in the dataset ID.

categories =
        project | enum | true | true | 0
        institute | enum | true | true | 1
        model | enum | true | true | 2
        experiment | enum | true | true | 3
        subexperiment | enum | true | true | 4
        submodel | enum | true | true | 5
        hfrequency | enum | true | true | 6

Each line defines one category: _name | category_type | is_mandatory | is_thredds_property | display_order_

For example, the subexperiment category is an enumerated value, is mandatory (must always be defined), and will be represented as a property element in the corresponding THREDDS catalog. The display order is only used by the publisher GUI.
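To make the five fields concrete, the sketch below parses the categories option into named records. The Category tuple and the parse_categories function are hypothetical names chosen for this example; the publisher's internal representation may well differ.

```python
from collections import namedtuple

Category = namedtuple(
    "Category",
    ["name", "category_type", "is_mandatory", "is_thredds_property", "display_order"],
)

# The value of the categories option, as listed above
CATEGORIES_OPTION = """
    project | enum | true | true | 0
    institute | enum | true | true | 1
    model | enum | true | true | 2
    experiment | enum | true | true | 3
    subexperiment | enum | true | true | 4
    submodel | enum | true | true | 5
    hfrequency | enum | true | true | 6
"""


def parse_categories(raw):
    """Return one Category per line that has exactly five pipe-separated fields."""
    categories = []
    for line in raw.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue  # skip blank lines and anything that is not a full definition
        name, category_type, mandatory, thredds, order = fields
        categories.append(
            Category(name, category_type, mandatory == "true", thredds == "true", int(order))
        )
    return categories


categories = parse_categories(CATEGORIES_OPTION)
for category in categories:
    print(category.display_order, category.name, category.category_type)
```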

Each enumerated category takes its allowed values from a matching options list. project_options appears in the [DEFAULT] section of esg.ini:

project_options =
        cssef | CSSEF | 1
        ...

experiment_options appears in the project section. Each experiment is defined by three values: _project | experiment_name | experiment_description_

experiment_options =
        cssef | def01 | DEF01
        cssef | nond01 | NOND01

All other enumerations appear in the project section as comma-separated lists of allowed values. Values for a category named foo are listed in the option foo_options:

institute_options = PCMDI, LLNL
model_options = cam4, cam5, cesm1-cam5
subexperiment_options = run0001,run0002,run0003,run0004,run0005
submodel_options = cam, cice, clm2
hfrequency_options = h, h0, h1, h2, h3, h4, h5

For any category, a default value may be specified:

category_defaults =
        institute | LLNL
        model | cesm1-cam5
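The interplay between the option lists and category_defaults can be sketched as follows. This is a simplified illustration under stated assumptions: the functions parse_list, parse_defaults and resolve are hypothetical, and the exact order in which the publisher applies defaults and checks values is not specified here.

```python
def parse_list(raw):
    """Split a comma-separated option value into a list of allowed values."""
    return [value.strip() for value in raw.split(",") if value.strip()]


def parse_defaults(raw):
    """Parse 'category | value' lines into a dictionary."""
    defaults = {}
    for line in raw.splitlines():
        if "|" in line:
            name, value = [part.strip() for part in line.split("|", 1)]
            defaults[name] = value
    return defaults


options = {
    "institute": parse_list("PCMDI, LLNL"),
    "model": parse_list("cam4, cam5, cesm1-cam5"),
    "subexperiment": parse_list("run0001,run0002,run0003,run0004,run0005"),
}
defaults = parse_defaults("""
    institute | LLNL
    model | cesm1-cam5
""")


def resolve(scanned):
    """Fill in defaults for categories that were not scanned, then check enumerations."""
    resolved = dict(defaults)
    resolved.update(scanned)  # scanned values take precedence over defaults
    for name, value in resolved.items():
        if name in options and value not in options[name]:
            raise ValueError("%r is not an allowed value for %s" % (value, name))
    return resolved


print(resolve({"subexperiment": "run0002"}))
```

Passing a value that is not in the corresponding list, such as a subexperiment of run9999, raises ValueError in this sketch.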

Adding projects, models, and experiments to the local database

The local database has a record for each project, model, and experiment. When data is published, the scanned values for these categories are validated against the database, and an exception is raised if a value is not found. To add new values:

  • Add projects to project_options, experiments to experiment_options, and models to model_options as described above.

  • Also, add model descriptions to the file esgcet_models_table.txt. This file is usually located in the same directory as esg.ini, but may be explicitly defined by option initial_models_table in the [initialize] section of esg.ini. The format of each line is:
    _project | model_id | URL | description_

  • Run 'esginitialize -c' to store the options.

For a category to be searchable in the ESGF Web Portal, it must be defined in /esg/config/facets.properties on the index node.

Metadata sources

Metadata can be derived from a number of sources:

  • command line
  • directory names
  • file global attributes
  • dataset IDs specified in a mapfile
  • project handlers

The publisher has two mutually-exclusive modes of scanning metadata, for the purpose of generating dataset identifiers:

  • read-directories : metadata for dataset identification is derived from the directory structure of the data files. The publisher assumes that in any given leaf directory, all files belong to the same dataset. This corresponds to the --read-directories option of esgpublish.

  • read-files : dataset ID metadata is derived by opening and reading metadata from each data file. This corresponds to the --read-files publisher option.

For example, the CSSEF data is laid out in the following directory structure:

/data/project/institute/model/experiment/subexperiment/submodel/hfrequency

The corresponding configuration option is directory_format :

directory_format = /data/%(project)s/%(institute)s/%(model)s/%(experiment)s/%(subexperiment)s/%(submodel)s/%(hfrequency)s

When the data is scanned in read-directories mode, the directory names are pattern-matched against directory_format to generate dataset identifiers. Any directories that do not match the pattern are ignored.

By contrast, in read-files mode the directory_format is ignored. The publisher opens each file to determine the values needed for dataset identification. If the information is missing or doesn't match an enumerated list, an exception is raised.
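The pattern matching used in read-directories mode can be approximated with a regular expression built from directory_format. The sketch below is an assumption about how such matching might work, not the publisher's actual code: the helpers format_to_regex and match_directory are hypothetical, and the sketch requires the whole path to match, with each placeholder covering exactly one path component.

```python
import re

DIRECTORY_FORMAT = (
    "/data/%(project)s/%(institute)s/%(model)s/%(experiment)s/"
    "%(subexperiment)s/%(submodel)s/%(hfrequency)s"
)


def format_to_regex(directory_format):
    """Turn each %(name)s placeholder into a named group matching one path component."""
    pattern = ""
    position = 0
    for match in re.finditer(r"%\((\w+)\)s", directory_format):
        # Literal text between placeholders is escaped and copied through
        pattern += re.escape(directory_format[position:match.start()])
        pattern += "(?P<%s>[^/]+)" % match.group(1)
        position = match.end()
    pattern += re.escape(directory_format[position:])
    return re.compile("^" + pattern + "$")


def match_directory(path, regex):
    """Return the category values for a matching directory, or None if it does not match."""
    found = regex.match(path)
    return found.groupdict() if found else None


regex = format_to_regex(DIRECTORY_FORMAT)
fields = match_directory("/data/cssef/LLNL/cesm1-cam5/def01/run0001/cam/h0", regex)
print(fields)
print(match_directory("/data/cssef/LLNL", regex))  # too shallow, so no match
```

The resulting dictionary has the same keys as the dataset_id template, so it can be fed straight into the % substitution to produce an identifier.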

Project Handlers

Project handlers are the components of the publisher responsible for opening and extracting metadata from files. Even in read-directories mode, files are opened and scanned during publication. The project_handler_name defines the handler to be used:

project_handler_name = basic_builtin

The following handlers are built in:

  • basic_builtin : Generic project handler. No assumptions are made about file-level attributes. Reads attributes title, Conventions, source, and history if present. Also generates creation_time and format attributes. Use this one unless your data resembles CMIP3 or CMIP5 data.

  • ipcc4_builtin : CMIP3

  • ipcc5_builtin : CMIP5

What if your project data has important metadata that is not extracted by any of the built-in handlers? In that case, one option is to write a [customized handler](http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/customizing-the-esg-publisher-with-handlers/).

Data Layout

Some projects such as CMIP5 make the assumption that each data file contains exactly one variable. In this case, the publisher will also aggregate files along the time dimension in order to support DAP-enabled applications such as UV-CDAT and LAS:

variable_per_file = true

The default is false, meaning the publisher assumes that each file contains multiple variables to be published. No aggregations will be generated.

Miscellanea

Here are some other useful options:

# Don't publish these variables:
thredds_exclude_variables = a, a_bnds, b, b_bnds, bounds_lat, bounds_lon, height, lat_bnds, lev_bnds, lon_bnds, p0, time_bnds, lat, lon, longitude, latitude, time, lev, depth, depth_bnds, plev, geo_region, plev_bnds, tau_bnds, longitude_bnds, latitude_bnds, tau, region, layer, pressure1, bnds
# Generate dataset version numbers from the date rather than 1, 2, 3, ...
version_by_date = true
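The effect of thredds_exclude_variables can be pictured as a simple filter over the variable names found in a file. This is a hedged sketch only: the variable names in file_variables are hypothetical, the exclusion list is shortened for readability, and parse_exclusions is a helper invented for this example.

```python
def parse_exclusions(raw):
    """Split the comma-separated option value into a set of variable names."""
    return {name.strip() for name in raw.split(",") if name.strip()}


# A shortened exclusion list; the full option value is shown above
excluded = parse_exclusions("lat, lon, time, lat_bnds, lon_bnds, time_bnds")

# Hypothetical variable names found in one data file
file_variables = ["PSL", "TS", "lat", "lon", "time", "time_bnds"]

# Coordinate and bounds variables are dropped; the rest would be published
published = [name for name in file_variables if name not in excluded]
print(published)  # ['PSL', 'TS']
```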

Here is a complete project configuration for CSSEF:

[project:cssef]

categories =
        project | enum | true | true | 0
        institute | enum | true | true | 1
        model | enum | true | true | 2
        experiment | enum | true | true | 3
        subexperiment | enum | true | true | 4
        submodel | enum | true | true | 5
        hfrequency | enum | true | true | 6

category_defaults =
        institute | LLNL
        model | cesm1-cam5

institute_options = PCMDI, LLNL
model_options = cam4, cam5, cesm1-cam5
experiment_options =
        cssef | def01 | DEF01
        cssef | nond01 | NOND01

subexperiment_options = run0001,run0002,run0003,run0004,run0005
submodel_options = cam, cice, clm2
hfrequency_options = h, h0, h1, h2, h3, h4, h5

directory_format = /data/%(project)s/%(institute)s/%(model)s/%(experiment)s/%(subexperiment)s/%(submodel)s/%(hfrequency)s
dataset_id = %(project)s.%(institute)s.%(model)s.%(experiment)s.%(subexperiment)s.%(submodel)s.%(hfrequency)s
dataset_name_format = project=%(project_description)s, institute=%(institute)s, experiment=%(experiment_description)s, subexperiment=%(subexperiment)s,  model=%(model_description)s, submodel=%(submodel)s, version=%(version)s
project_handler_name = basic_builtin
thredds_exclude_variables = a, a_bnds, b, b_bnds, bounds_lat, bounds_lon, height, lat_bnds, lev_bnds, lon_bnds, p0, time_bnds, lat, lon, longitude, latitude, time, lev, depth, depth_bnds, plev, geo_region, plev_bnds, tau_bnds, longitude_bnds, latitude_bnds, tau, region, layer, pressure1, bnds
variable_per_file = false
version_by_date = false
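For scripts that need to inspect a project section, Python's standard configparser module can read this syntax. The sketch below uses an abbreviated copy of the CSSEF section and a hypothetical helper, load_project_section. One detail matters: interpolation is disabled, because otherwise configparser would try to expand %(project)s and similar placeholders when a value is read, and fail since no such options exist in the file.

```python
import configparser

# Abbreviated copy of the configuration above, for illustration only
SAMPLE_INI = """
[DEFAULT]
project_options =
    cssef | CSSEF | 1

[project:cssef]
institute_options = PCMDI, LLNL
dataset_id = %(project)s.%(institute)s.%(model)s.%(experiment)s.%(subexperiment)s.%(submodel)s.%(hfrequency)s
project_handler_name = basic_builtin
variable_per_file = false
version_by_date = false
"""


def load_project_section(ini_text, project):
    """Return the [project:<name>] section with placeholders left unexpanded."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser["project:" + project]


section = load_project_section(SAMPLE_INI, "cssef")
print(section["project_handler_name"])
print(section.getboolean("variable_per_file"))
print(section["dataset_id"])
```

To read a real installation instead of the sample text, parser.read("/esg/config/esgcet/esg.ini") could replace the read_string call.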