Anusuriya Devaraju, Michael Diepenbroek, Uwe Schindler, Robert Huber
{adevaraju, mdiepenbroek, uschindler, rhuber}@marum.de
20th September 2018
PANGAEA is a data infrastructure for archiving and publishing Earth and environmental datasets. It is hosted by the Alfred Wegener Institute, Helmholtz Center for Polar and Marine Research (AWI) and the Center for Marine Environmental Sciences (MARUM) of the University of Bremen. The infrastructure holds more than 370,000 datasets from individual researchers, projects, data centers and research infrastructures. The datasets include quantitative and textual data as well as binary files such as audio and video. Datasets are archived with their metadata in a relational database. They are published and registered with Digital Object Identifiers (DOIs) and are accessible via the web portal [1]. For advanced interaction, related web services and APIs (e.g., an OAI-PMH metadata provider and an Elasticsearch API) and a data warehouse are also available.
The PANGAEA metadata primarily describes datasets. At present, the only device-related information it contains is device types (e.g., pollen sampling device and barometer) and method types (e.g., X-ray fluorescence and continuous flow analysis (CFA)). These types are used as facets in the faceted search on the PANGAEA web portal. The discovery of PANGAEA datasets could be improved by capturing the persistent identifiers of devices, and their relations to datasets, as part of the metadata. For example, the web portal could display datasets together with the persistent identifiers of the relevant devices, rendered as actionable links. This association between a dataset and its source (device) is vital for the reproducibility of science, because the device information contributes to a better understanding of the lineage and quality of the dataset.
In PANGAEA, data owners may publish datasets at different stages, e.g., raw data, derived data, and data products, depending on their applications. Consequently, users have in some cases misused or misinterpreted the two fields ‘device’ and ‘method types’. Device persistent identifiers may help to avoid this ambiguity, as their landing pages provide more comprehensive descriptions of the devices.
In the PANGAEA database, the ‘term’ table contains the standard terminologies used to describe datasets, including the definitions of parameters (quantities and features), methods and device types. Currently, 1,364 device types are defined in the table.
The metadata service only includes the device type as part of the metadata of a dataset, e.g., see the field ‘agg-device’ in the response returned by the request http://ws.pangaea.de/es/pangaea/panmd/886115. More detailed information about the device type is available through the PANGAEA ElasticSearch Term Index [2]. For example, here is the metadata of the device type ‘Box corer’.
The following table summarizes the metadata related to the device type. The table excludes the common fields (e.g., _index, _score) that originate from Elasticsearch.
Property | Occurrence | Definition | DataType |
_id | 1 | Id of the term | Integer |
name | 1 | Name of the term | String |
abbreviation | 0..1 | Abbreviation of the term | String |
terminology_id | 1 | The id of the terminology/ontology that specifies the term | Integer |
description_uri | 0..1 | The URI of the term | String |
status | 1 | The id of the status of the term (for internal editing purpose) | Integer |
terminology | 1 | The name of the terminology/ontology that specifies the term; identified by the terminology id. | String |
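To make the occurrences in the table concrete, the following is a minimal Python sketch of how a term record might be represented on the client side. It assumes that all properties except _id sit under the `_source` key of an Elasticsearch hit, as is usual for Elasticsearch responses. The class name `Term`, the helper `term_from_hit`, and all sample values are hypothetical and are not taken from the PANGAEA Term Index.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Term:
    """Client-side view of a term, mirroring the table above."""
    id: int                                 # _id, occurrence 1
    name: str                               # occurrence 1
    terminology_id: int                     # occurrence 1
    status: int                             # occurrence 1
    terminology: str                        # occurrence 1
    abbreviation: Optional[str] = None      # occurrence 0..1
    description_uri: Optional[str] = None   # occurrence 0..1


def term_from_hit(hit: Dict[str, Any]) -> Term:
    """Build a Term from one Elasticsearch hit.

    Assumption: every property except _id is stored under "_source".
    A missing mandatory property raises KeyError.
    """
    source = hit["_source"]
    return Term(
        id=int(hit["_id"]),  # Elasticsearch returns _id as a string
        name=source["name"],
        terminology_id=int(source["terminology_id"]),
        status=int(source["status"]),
        terminology=source["terminology"],
        abbreviation=source.get("abbreviation"),
        description_uri=source.get("description_uri"),
    )


# Hypothetical hit: the ids and names are placeholders, not real PANGAEA values.
sample_hit = {
    "_index": "example-index",
    "_score": 1.0,
    "_id": "101",
    "_source": {
        "name": "Box corer",
        "terminology_id": 7,
        "status": 2,
        "terminology": "Example terminology",
    },
}

term = term_from_hit(sample_hit)
print(term.name, term.abbreviation)
```

Modelling the two 0..1 properties as `Optional` fields with a `None` default keeps absent values distinguishable from empty strings, while the mandatory properties fail early if the response does not contain them.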
The following are the required properties to support the PID Instrument use case in PANGAEA:
Property | Occurrence | Definition | DataType |
Identifier | 1 | Unique and persistent identifier of a device | URI |
Name | 1 | The name of the instrument, which will be used to support the full-text search. | String |
Device Type | 0..1 | The device type, taken from a controlled vocabulary, which will be used to support the full-text search. | String |
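The occurrence constraints above could be checked before a device record is attached to a dataset. The sketch below is one possible reading of the table, not PANGAEA's implementation: the class `InstrumentRecord`, its `validate` method, and the example identifier are hypothetical, and ‘URI’ is interpreted narrowly as an absolute http(s) URI.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class InstrumentRecord:
    identifier: str                    # Identifier, occurrence 1, URI
    name: str                          # Name, occurrence 1
    device_type: Optional[str] = None  # Device Type, occurrence 0..1

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record is valid."""
        problems = []
        parsed = urlparse(self.identifier)
        # Assumption: an acceptable identifier is an absolute http(s) URI.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("Identifier must be an absolute http(s) URI")
        if not self.name.strip():
            problems.append("Name must not be empty")
        if self.device_type is not None and not self.device_type.strip():
            problems.append("Device Type, if given, must not be empty")
        return problems


# Placeholder identifier on a reserved example domain, not a real device PID.
record = InstrumentRecord(
    identifier="https://example.org/pid/device-0001",
    name="Example box corer",
    device_type="Box corer",
)
print(record.validate())
```

Returning a list of problems rather than raising on the first one lets a curator see every violated constraint of a record at once.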