rmacroRDM


Tools to handle macroecological trait datasets in R


Trait databases

The purpose of this package is perfectly explained in Morgan Ernest's blog post: "Trait Databases: What is the End Goal?", following the release of their data paper on the life histories of 21,000+ amniotes (reptiles, birds, mammals).

Trait databases are all the rage these days, for good reason. Traits are interesting from evolutionary and ecological perspectives: How and why do species differ in traits, how do traits evolve, how quickly do traits change in response to changing environment, and what impacts do these differences have on community assembly and ecosystem function. They have the potential to link individual performance with local, regional, and even global processes. There’s lots of trait data out there, but most of it has been buried in papers, books, theses, gray literature, field guides, etc.

The future envisioned by Morgan for trait databases is shared:

what we need is a centralized trait database where people can contribute trait data and where that data is easily accessible by anyone who wants to use it for research.

Dreams of programmatically accessible centralised repositories are certainly not unique to the biological trait community, and a whole range of initiatives, such as the ecoretriever and rNBN packages, are beginning to demonstrate their potential as research resources.

But compiling and using such datasets requires good data management, so that contributions can be easily integrated and traceability and specification maintained. It is also key to reproducibility and to the ability to assess the quality of analyses and inferences made from the data.


A need for RDM in macroecology at the researcher and lab level

The problem is that many unreposited data resources are far from machine readable and are poorly specified. Many macroecological datasets are still generated by compiling disparate datasets scraped from PDFs and literature searches, then entered into Excel spreadsheets.

Managing such datasets for reproducibility and reusability clearly imposes extra computational and data management effort, and the responsibility for meeting these demands currently lies with individual researchers. They often lack the time, skills, resources or tools needed for good data management practice, making this one of the weakest links in data access and reusability. Support for good research data management at the level of individual researchers and labs therefore seems key.

As noted by Morgan, for trait databases there isn't yet even an agreed structure to which data should be specified. So standardisation is a key starting point for useful data products. Challenges to reaching consensus on standards stem from difficulties in the very definition of traits, which are well summarised in Tom Webb's blog post, Trait databases: the desirable and the possible:

Basically, I have been trying to imagine what this kind of meta-dataset might look like. And my difficulty in doing this in part boils down to how we define a ‘trait’.

But such questions can be revisited at a later date if enough metadata is retained. It does imply, however, that in situations where standards have not yet been agreed at various levels, a framework that allows flexibility will work best.

Ultimately, without broad buy-in from the trait community to some form of standardised framework, it is doubtful that researchers will use centralised repositories for research and contribute their data to them.



rmacroRDM: macroecological data management package in R

Both of the last two projects I have worked on involved compiling datasets into a large master trait dataset. The basic premise has been pretty consistent: here's a master data sheet, here's a bunch more data in various formats and with varied reference information, here are some more open sources of data; put it all together and prepare it for analysis. So I've ended up with quite a bit of functional code and a somewhat developed framework, and can see how a better developed package could provide really useful functionality.

I'm hoping to compile the code into a package to facilitate data management and standardisation of macroecological trait datasets. The idea is both to make good data management easy and, in return, to enforce a standardised format for the resulting data products. I hope this will in turn facilitate the accumulation of well specified, alignable data which are easier to add to, subset, query, visualise, analyse and share.

features

standardised format

The package helps users compile data through a framework of standardised but flexible data objects. Data objects bring together relevant information to better manage the compilation process.

store record level data

Morgan Ernest highlights the importance of compiling observations rather than species averages:

Having any info is still better than no info, but often we need info on variability across individuals within a species or we want to know how the trait might vary with changes in the environment. For this, we need record-level data.

So the framework compiles data into a master database of observations and allows multiple observations per species per trait.
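
For example, in a long record-level layout, repeat measurements of the same trait for the same species simply occupy separate rows. Column and species names below are illustrative only:

```r
# Toy example of record-level (long) data: repeat measurements of the same
# trait for the same species are kept as separate rows rather than averaged.
records <- data.frame(
  species  = c("species_A", "species_A", "species_B"),
  variable = c("body_mass", "body_mass", "body_mass"),
  value    = c(10.2, 11.5, 25.0),
  ref      = c("ref_1", "ref_2", "ref_2"),
  stringsAsFactors = FALSE
)
records
```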

metadata management

A major focus of the package is the extraction, management and compilation of metadata information throughout.

Metadata are managed at a variety of levels, including observation, variable, species and dataset. Metadata are also stored on the matching of observations from different taxonomies through synonyms, on the matching of variables across datasets, and on taxonomic information.



package details

data format

Structure it like a mini database.

The idea is to maintain a master database to which new datasets are added.

The current framework structure consists of the following (a rough R sketch of the [[master]] object follows its description):


[[master]] object
  • [data]: long data.frame of individual records. Each row represents a unique observation of a trait, and columns store information on record metadata and taxonomic matching, e.g.:

    • species
    • variable
    • value
    • taxonomic matching info
    • original reference
    • observer: if manually mined from the literature. Could also describe a scraping procedure
    • quality control (i.e. confidence in the observation)
    • n: if the value is derived from multiple measurements, the number of observations
    • dataset code
    • etc
  • [spp.list]: master species list to which all data are to be matched. In the demo, taxonomic information is stored in [spp.list] but it could also be stored separately as [taxonomic table], perhaps even a tree?

  • [metadata]: variable [metadata] table, each row a variable in the database. Example metadata columns:

    • code: variable short code to use throughout database
    • type: e.g. continuous, integer, nominal, categorical, binary, proportion, etc.
    • var category: e.g. ecological, morphological, life history, reproductive, etc.
    • descr: longer description for plotting
    • units
    • scores: used for factor/categorical/binary variables in data
    • levels: if variable is factor/categorical/binary
    • method code
  • [vnames]: cross-dataset variable name table. Each row a variable in the master:

    • columns are the corresponding names of the variable across different datasets
  • method table: each row a method referenced in variable metadata:

    • method code
    • method description etc
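
As a rough illustration only (element and column names follow the description above, but this is not necessarily how the package builds the object), a [[master]] object could be represented as a plain R list of data frames:

```r
# Minimal sketch of the [[master]] structure described above; all names are
# illustrative and may differ from the package's actual objects.
master <- list(
  # long table of individual records: one observation per row
  data = data.frame(species = character(0), variable = character(0),
                    value = numeric(0), ref = character(0),
                    observer = character(0), qc = character(0),
                    n = integer(0), data.ID = character(0),
                    stringsAsFactors = FALSE),
  # master species list to which all data are matched
  spp.list = data.frame(species = character(0), order = character(0),
                        family = character(0), stringsAsFactors = FALSE),
  # variable-level metadata: one row per variable in the database
  metadata = data.frame(code = character(0), type = character(0),
                        cat = character(0), descr = character(0),
                        units = character(0), scores = character(0),
                        levels = character(0), method = character(0),
                        stringsAsFactors = FALSE),
  # cross-dataset variable name lookup: one row per master variable,
  # one column per contributing dataset
  vnames = data.frame(master = character(0), dataset_1 = character(0),
                      stringsAsFactors = FALSE)
)
str(master, max.level = 1)
```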

[[m]] match object

Functions in the package prepare new datasets to be added to the master. This involves collating metadata information, managing and recording synonym matches, and compiling and formatting the data in the master data format so it is ready to update the [[master]]. This process is managed through creating, populating and updating an [[m]] match object, whose elements are listed below (with a rough sketch after the list).

  • "data.ID" dataset identifier

  • [spp.list] master species list to which all datasets are to be matched. Tracks any additions (if allowed) during the matching process.

  • [data] dataframe containing the dataset to be added

  • "sub": either "spp.list" or "data". Specifies which [[m]] element contains the smaller set of species. Unmatched species in this subset are then matched, where possible through synonyms, to the larger species set in the other element.

  • "set" Specifies which [[m]] element contains the larger set of species. Automatically determined from m$sub.

  • "status" Records status of [[m]] within the match process. Varies between {"unmatched", "full_match", "incomplete_match: (n unmatched)"}

  • [[meta]] list, structure set through {meta.vars}. Collates observation metadata for each "meta.var".

    • "ref": the reference from which the observation has been sourced. This is the only meta.var that MUST be correctly supplied for matching to proceed.
    • "qc": any quality control information regarding individual datapoints. Ideally, a consistent scoring system used across compiled datasets.
    • "observer": The name of the observer of data points. Used to allow assessment of observer bias, particularly in the case of data sourced manually from the literature.
    • "n": if value is based on a summary of multiple observations, the number of original observations the value is based on.
    • "notes": any notes associated with individual observations.
  • "filename": the filename of the dataset. Using the filename consistently throughout the file system enables automated sourcing of data.

  • "unmatched" stores details of unmatched species if species matching incomplete. Can be used to direct manual matching.
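
Again purely as a sketch (element names follow the list above; the package may construct and populate this object differently), an [[m]] match object could look something like:

```r
# Illustrative skeleton of an [[m]] match object; element names follow the
# README but values and construction are hypothetical.
spp.list    <- data.frame(species = c("species_A", "species_B"),
                          stringsAsFactors = FALSE)
new_dataset <- data.frame(species = "species_A", body_mass = 10.2,
                          stringsAsFactors = FALSE)

m <- list(
  data.ID   = "example_dataset",         # identifier for the incoming dataset
  spp.list  = spp.list,                  # master species list (tracks any allowed additions)
  data      = new_dataset,               # the dataset to be added
  sub       = "data",                    # element holding the smaller species set
  set       = "spp.list",                # the larger set, determined from m$sub
  status    = "unmatched",               # later "full_match" or "incomplete_match: (n unmatched)"
  meta      = list(ref = "example_ref",  # reference source: the one required meta.var
                   qc = NULL, observer = NULL, n = NULL, notes = NULL),
  filename  = "example_dataset.csv",     # filename used consistently across the file system
  unmatched = NULL                       # populated with unmatched species to guide manual matching
)
```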



Proposed functionality:

see vignette for a demo of current functionality

WARNING!! Functionality in the package has developed significantly, rendering parts of this vignette deprecated. However, much of the background information remains relevant. See the temporary vignette for a partial demo of current functionality. Updates to the vignette to follow #15.

Matching and tracking taxonomy metadata

  • integrate taxize #4
    • auto-correct species name input errors with fuzzy matching (see the toy sketch after this list)
    • search for synonyms
  • build project specific database of synonym links
    • use as network of synonyms to track data point matching
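
As a toy illustration of the fuzzy-matching idea (taxize provides far more robust name resolution; the base R helper below is purely hypothetical):

```r
# Hypothetical helper: correct misspelled species names by edit distance
# against a master species list (illustrative; not the taxize workflow itself).
match_fuzzy <- function(names_in, spp, max_dist = 2) {
  sapply(names_in, function(nm) {
    d <- adist(nm, spp, ignore.case = TRUE)   # edit distance to each master name
    if (min(d) <= max_dist) spp[which.min(d)] else NA_character_
  })
}

spp <- c("Vulpes vulpes", "Meles meles", "Lutra lutra")
match_fuzzy(c("Vulpes vulpess", "Lutra luttra", "Canis lupus"), spp)
# misspellings map to the closest master name; "Canis lupus" returns NA
```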

Matching and tracking observation metadata

  • QC
  • observer details
  • ref: as with taxonomic metadata, the sourcing and handling of reference data could also be more formalised and integrated; link to reference databases #5

Matching and tracking variable metadata

  • enforce complete metadata when adding variables and check consistency of units before adding data (a minimal completeness check is sketched after this list).
  • metadata readily available for plotting and extraction for publication.
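
A completeness check along those lines might look something like this (illustrative only; the required fields and the package's actual checks may differ):

```r
# Hypothetical check that required variable metadata fields are complete
# before a new variable is accepted into the master metadata table.
check_metadata <- function(metadata,
                           required = c("code", "type", "descr", "units")) {
  missing_cols <- setdiff(required, names(metadata))
  if (length(missing_cols) > 0)
    stop("metadata table missing columns: ", paste(missing_cols, collapse = ", "))
  incomplete <- metadata$code[!complete.cases(metadata[, required])]
  if (length(incomplete) > 0)
    stop("incomplete metadata for variables: ", paste(incomplete, collapse = ", "))
  invisible(TRUE)
}
```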

Basic Quality control functionality #10

  • some simple tools to help identify errors, outliers, etc. (an illustrative outlier flag is sketched after this list)
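
One very simple approach (illustrative only, not the package's planned QC tooling) is to flag records falling outside the boxplot whiskers for their variable:

```r
# Hypothetical outlier flagging on long record-level data: returns the rows
# whose value lies outside the boxplot whiskers for that variable.
flag_outliers <- function(data) {
  flagged <- lapply(split(data, data$variable), function(d) {
    out <- boxplot.stats(d$value)$out
    d[d$value %in% out, , drop = FALSE]
  })
  do.call(rbind, flagged)
}
```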

Produce analytical datasets

The package is to include functions that allow users to:

  • interrogate the database and extract information on data availability (particularly complete cases resulting from different combinations of variables).
  • allow specifying taxonomic and variable subsets
  • produce a wide analytical dataset, with a selection of functions to summarise duplicate datapoints according to data type (see the sketch after this list).
    • output to include variable subset metadata
    • a list of all references used to create analytical dataset.
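
As a rough idea of the long-to-wide step (base R only; the package's own functions will differ), duplicate species-by-variable records can be summarised and cast to one row per species:

```r
# Toy long-to-wide cast: summarise duplicate species x variable records
# (here with the mean) and count complete-case species for this variable set.
long <- data.frame(
  species  = c("species_A", "species_A", "species_A", "species_B"),
  variable = c("body_mass", "body_mass", "clutch_size", "body_mass"),
  value    = c(10, 12, 3, 25)
)
wide <- tapply(long$value, list(long$species, long$variable), mean)
wide
#>           body_mass clutch_size
#> species_A        11           3
#> species_B        25          NA
sum(complete.cases(wide))   # number of species with complete cases
```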

Explore data through apps #19

Standardisation of data products allows exploratory apps to be built around them, eg:

  • sex roles in birds exploratory app: the app is built around outputs of the rmacroRDM workflow. I'm hoping to adapt it so that rmacroRDM data products containing all the data required to power the app can be uploaded.
  • The data visualised is a random sample of a larger dataset we have been working on. The data will be open on publication. Until then, it is represented by a small sample and species names have been randomised.


Future development

An interesting added feature could be functionality for exploring potential data biases in analytical datasets. #9

  • taxonomic biases (i.e. calculate taxonomic distinctness of subsets of complete-case species for different variable combinations)
  • data gap biases
  • basic covariance structure between variables, which could be related to data gaps to understand how missing values might affect results (a trivial base R starting point is sketched after this list)
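
For the data-gap and covariance points, a trivial base R starting point might be (illustrative values only):

```r
# Toy wide analytical dataset (rows = species, columns = variables, NA = gap):
# summarise missingness per variable and pairwise covariance structure.
wide <- cbind(body_mass   = c(11, 25, 8, 30),
              clutch_size = c(3, NA, 4, 5),
              longevity   = c(12, 20, NA, 25))
colMeans(is.na(wide))                      # proportion of data gaps per variable
cor(wide, use = "pairwise.complete.obs")   # correlations from available observations
```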