This infrastructure code enables comparison of time series from arbitrary data sources using user-defined metrics. The tool is designed to be simple, modular, and extensible. We call our tool WE-Validate because it is geared towards forecast validation using observations and simulations for wind energy ("WE") applications.
The default branch is `main`, and active development is under `dev`. Pull requests from `dev` to `main` will be done regularly.
For Mac users, in Terminal, `cd` to a destination directory, then

```
$ git clone https://github.com/joejoeyjoseph/WE-Validate.git
```
For Windows users, Git for Windows or a Linux Bash shell on Windows are options.
Alternatively, you can use a GitHub client such as GitHub Desktop to clone this repo to your local machine.
This tool is built on Python 3.8. If you do not have Python on your machine, you can install Python directly, or you can use package management software like Anaconda. You can use this tool with your "root" Python, or you can use a package and environment management system like a virtual environment or a conda environment. Then, in Terminal:

```
$ pip install -r requirements.txt
```

This downloads and installs all the Python packages you need for this tool.
If `pip` is not installed on your machine, you can visit the pip website.
We use the YAML format for the configuration file. An example configuration is provided in `config/config.yaml`. Explanations are embedded in `config.yaml` as comments starting with `#`.
In `config.yaml`, first, you need to specify the `location` (assumed to be the WGS84 latitude, `lat`, and longitude, `lon`, coordinates) as well as the evaluation duration in `time` (the `start` and `end` times).
To do a comparison, you will need at least one baseline dataset (called `base`) and one or more datasets to compare against it (called `comp`). For each dataset, you need to declare the data directory (`path`), data parser (`function`), and variable of interest (`var`). The `function` string must match one of the classes in the `inputs` folder.
If the variable of interest is wind speed (`ws`), you can choose a wind turbine power curve by specifying its data directory (`path`), power curve file (`file`), and data parser (`function`); the tool will then compute metrics based on the derived wind power.
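To illustrate the idea, a hypothetical sketch of such a wind-speed-to-power conversion is shown below; the two-column power curve layout and the helper name `derive_power` are assumptions for illustration, not the tool's actual implementation.

```python
# Hypothetical sketch: convert wind speed to power via a turbine power curve.
import numpy as np
import pandas as pd


def derive_power(ws, power_curve_file):
    # ws: datetime-indexed pandas Series of wind speed.
    # power_curve_file: assumed to be a two-column file of wind speed vs. power.
    pc = pd.read_csv(power_curve_file)
    power = np.interp(ws.to_numpy(), pc.iloc[:, 0], pc.iloc[:, 1])
    return pd.Series(power, index=ws.index)
```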
Evaluation at different height levels above ground level is available, as long as the height levels exist in the baseline and comparison datasets.
Beyond the datasets, you can list which metrics to compute. Each must correspond to a metric class in the `metrics` folder. You can also specify the variable names (`var`) and units (`units`) to be displayed in the plots.
Currently, only local datasets are supported. Future versions will fetch data over SFTP (i.e., PNNL DAP) and other protocols.
The main routine in this repo is the `compare` function in `ivalidate.py`. Calling `ivalidate.compare()` runs the default configuration listed in `config.yaml`. Users can also choose a different YAML file for specific data and cases. For example, calling `ivalidate.compare('config_test.yaml')` uses the configuration in `config_test.yaml`, which contains erroneous datasets for testing purposes.
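For example, from a Python session or script at the repo root, these calls look like the following (a minimal sketch, assuming `ivalidate` is importable from the working directory):

```python
# Run the comparison with the default configuration (config/config.yaml).
import ivalidate

ivalidate.compare()

# Run with a user-specified configuration file instead.
ivalidate.compare('config_test.yaml')
```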
Please refer to `/notebooks/demo_notebook.ipynb`, a demo Jupyter Notebook in which we summarize some example cases.
We encourage and welcome contributions to this tool from the wind energy community.
To add a new metric, create a new file in the `metrics` folder. The file name must match the class name. For example, if you wanted to write a script that computes the mean absolute error (MAE), you would name the file `mae.py`, and the class inside would also be called `mae`.
The metric class interface is simple: it defines a single method called `compute`, which takes two variables, `x` (baseline) and `y` (comparison). Both are datetime-indexed pandas Series. The `compute()` method must return a float (a single, scalar number).
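For example, a minimal sketch of an MAE metric following this interface could look like the following (the style of the existing classes in the `metrics` folder may differ):

```python
# metrics/mae.py -- sketch of a metric class; file name and class name match.
import numpy as np


class mae:
    """Mean absolute error between baseline and comparison time series."""

    def compute(self, x, y):
        # x: baseline, y: comparison; both are datetime-indexed pandas Series.
        # Return a single scalar float.
        return float(np.mean(np.abs(x - y)))
```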
Unit tests for the metrics are included in `test_metrics.py`. Travis CI should handle the software testing via `pytest`.
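As an illustration, a `pytest`-style unit test for the `mae` sketch above could look like this (the structure of the actual `test_metrics.py` may differ):

```python
# Hypothetical unit test for the mae metric sketched above.
import pandas as pd

from metrics.mae import mae


def test_mae():
    idx = pd.to_datetime(['2020-01-01 00:00', '2020-01-01 01:00',
                          '2020-01-01 02:00'])
    x = pd.Series([1.0, 2.0, 3.0], index=idx)
    y = pd.Series([2.0, 2.0, 5.0], index=idx)
    # Absolute errors are 1, 0, and 2, so the mean absolute error is 1.0.
    assert mae().compute(x, y) == 1.0
```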
To add a new data format (or source), create a new file in the `inputs` folder. As with metrics, the file name must match the class name. The naming convention is `{data name}_{data format}.py`. For example, if you wanted to parse an HDF5 file with LiDAR data, you might call it `lidar_hdf5.py`, and the class name in the file would also be `lidar_hdf5`.
The input class interface expects a constructor that takes the data path and variable name, as well as a single method called `get_ts()`, which returns the time series as a datetime-indexed pandas DataFrame.
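A hypothetical sketch of such a class is shown below; the exact constructor arguments, file layout, and column handling are assumptions, so check the existing classes in the `inputs` folder for the interface the tool actually expects.

```python
# inputs/lidar_hdf5.py -- hypothetical sketch of an input class.
import pandas as pd


class lidar_hdf5:
    """Parse a LiDAR HDF5 file and return the variable of interest."""

    def __init__(self, path, var):
        # path: data location from the configuration file.
        # var: name of the variable of interest (e.g. wind speed).
        self.path = path
        self.var = var

    def get_ts(self):
        # Read the file and return a datetime-indexed pandas DataFrame
        # containing only the variable of interest.
        df = pd.read_hdf(self.path)
        df.index = pd.to_datetime(df.index)
        return df[[self.var]]
```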
To add a new preprocessor or quality-control routine that operates on each time series, please visit the `qc` folder.
The current implementation is serial; however, future versions may exploit local or distributed parallelism by:
- Loading time series data from files (or cache) in parallel
- Computing metrics for each pair of time series in parallel
The original version of the code was first developed by Caleb Phillips in 2016. Joseph Lee has been building on Phillips's code and developing it further since 2020. For questions and comments regarding the current version of the code, please contact Joseph Lee at <joseph.lee at pnnl.gov>.
Our contributors in alphabetical order: Larry Berg, Caroline Draxl, Joseph Lee, and Will Shaw.