Merge branch 'master' into feat/add_midas_transformer

unit8co · Nov 14, 2023 · 6ea8ba4
2 parents b84eb6d + d206055

Showing 11 changed files with 1,270 additions and 37 deletions.
diff --git a/.github/workflows/merge.yml b/.github/workflows/merge.yml
@@ -87,7 +87,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        example-name: [00-quickstart.ipynb, 01-multi-time-series-and-covariates.ipynb, 02-data-processing.ipynb, 03-FFT-examples.ipynb, 04-RNN-examples.ipynb, 05-TCN-examples.ipynb, 06-Transformer-examples.ipynb, 07-NBEATS-examples.ipynb, 08-DeepAR-examples.ipynb, 09-DeepTCN-examples.ipynb, 10-Kalman-filter-examples.ipynb, 11-GP-filter-examples.ipynb, 12-Dynamic-Time-Warping-example.ipynb, 13-TFT-examples.ipynb, 15-static-covariates.ipynb, 16-hierarchical-reconciliation.ipynb, 18-TiDE-examples.ipynb, 19-EnsembleModel-examples.ipynb]
+        example-name: [00-quickstart.ipynb, 01-multi-time-series-and-covariates.ipynb, 02-data-processing.ipynb, 03-FFT-examples.ipynb, 04-RNN-examples.ipynb, 05-TCN-examples.ipynb, 06-Transformer-examples.ipynb, 07-NBEATS-examples.ipynb, 08-DeepAR-examples.ipynb, 09-DeepTCN-examples.ipynb, 10-Kalman-filter-examples.ipynb, 11-GP-filter-examples.ipynb, 12-Dynamic-Time-Warping-example.ipynb, 13-TFT-examples.ipynb, 15-static-covariates.ipynb, 16-hierarchical-reconciliation.ipynb, 18-TiDE-examples.ipynb, 19-EnsembleModel-examples.ipynb, 20-RegressionModel-examples.ipynb]
     steps:
       - name: "1. Clone repository"
         uses: actions/checkout@v2

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -15,13 +15,15 @@ but cannot always guarantee backwards compatibility. Changes that may **break co
  - 🚀🚀 Optimized `historical_forecasts()` for pre-trained `TorchForecastingModel` running up to 20 times faster than before! [#2013](https://github.com/unit8co/darts/pull/2013) by [Dennis Bader](https://github.com/dennisbader).
   - Added callback `darts.utils.callbacks.TFMProgressBar` to customize at which model stages to display the progress bar. [#2020](https://github.com/unit8co/darts/pull/2020) by [Dennis Bader](https://github.com/dennisbader).
 - Improvements to documentation:
-  - Adapted the example notebooks to properly apply data transformers and avoid look-ahead bias. [#2020](https://github.com/unit8co/darts/pull/2020) by [Samriddhi Singh](https://github.com/SimTheGreat). 
+  - Adapted the example notebooks to properly apply data transformers and avoid look-ahead bias. [#2020](https://github.com/unit8co/darts/pull/2020) by [Samriddhi Singh](https://github.com/SimTheGreat).
+  - New example notebook for the `RegressionModels` explaining features such as (component-specific) lags, `output_chunk_length` in relation to `multi_models`, multivariate support, and more. [#2039](https://github.com/unit8co/darts/pull/2039) by [Antoine Madrona](https://github.com/madtoinou).
 - Improvements to Regression Models:
   - `XGBModel` now leverages XGBoost's native Quantile Regression support that was released in version 2.0.0 for improved probabilistic forecasts. [#2051](https://github.com/unit8co/darts/pull/2051) by [Dennis Bader](https://github.com/dennisbader).
 - Other improvements:
  - Added support for time index time zone conversion with parameter `tz` before generating/computing holidays and datetime attributes. Support was added to all Time Axis Encoders (standalone encoders and forecasting models' `add_encoders`), time series generation utility functions `holidays_timeseries()` and `datetime_attribute_timeseries()`, and `TimeSeries` methods `add_datetime_attribute()` and `add_holidays()`. [#2054](https://github.com/unit8co/darts/pull/2054) by [Dennis Bader](https://github.com/dennisbader).
   - Added new data transformer: `MIDAS`, which uses mixed-data sampling to convert `TimeSeries` from high frequency to low frequency (and back). [#1820](https://github.com/unit8co/darts/pull/1820) by [Boyd Biersteker](https://github.com/Beerstabr) and [Antoine Madrona](https://github.com/madtoinou).
   - Added optional keyword arguments dict `kwargs` to `ExponentialSmoothing` that will be passed to the constructor of the underlying `statsmodels.tsa.holtwinters.ExponentialSmoothing` model. [#2059](https://github.com/unit8co/darts/pull/2059) by [Antoine Madrona](https://github.com/madtoinou).
+  - Added new dataset `ElectricityConsumptionZurichDataset`: The dataset contains the electricity consumption of households in Zurich, Switzerland from 2015-2022 on different grid levels. We also added weather measurements for Zurich which can be used as covariates for modelling. [#2039](https://github.com/unit8co/darts/pull/2039) by [Antoine Madrona](https://github.com/madtoinou) and [Dennis Bader](https://github.com/dennisbader).
 
 **Fixed**
 - Fixed a bug when calling optimized `historical_forecasts()` for a `RegressionModel` trained with unequal component-specific lags. [#2040](https://github.com/unit8co/darts/pull/2040) by [Antoine Madrona](https://github.com/madtoinou).

diff --git a/darts/datasets/__init__.py b/darts/datasets/__init__.py
@@ -5,8 +5,9 @@
 A few popular time series datasets
 """
 
+import os
 from pathlib import Path
-from typing import List
+from typing import List, Literal, Optional
 
 import numpy as np
 import pandas as pd
@@ -813,3 +814,111 @@ def _to_multi_series(self, series: pd.DataFrame) -> List[TimeSeries]:
         Load the WeatherDataset dataset as a list of univariate timeseries, one per weather indicator.
         """
         return [TimeSeries.from_series(series[label]) for label in series]
+
+
+class ElectricityConsumptionZurichDataset(DatasetLoaderCSV):
+    """
+    Electricity Consumption of households & SMEs (low voltage) and businesses & services (medium voltage) in the
+    city of Zurich [1]_, with values recorded every 15 minutes.
+
+    The electricity consumption is combined with weather measurements recorded hourly by three different
+    stations in the city of Zurich [2]_. Missing time stamps are filled with NaN. Before the weather features
+    are added to the electricity consumption, they are resampled to a 15-minute frequency, and missing values
+    are interpolated.
+
+    To simplify the dataset, the measurements from the Zch_Schimmelstrasse and Zch_Rosengartenstrasse weather
+    stations are discarded to keep only the data recorded in the Zch_Stampfenbachstrasse station.
+
+    Both dataset sources are updated continuously, but this dataset only retains values between 2015 and 2022.
+    The time index was converted from CET time zone to UTC.
+
+    Components Descriptions:
+
+    * Value_NE5 : Households & SMEs electricity consumption (low voltage, grid level 7) in kWh
+    * Value_NE7 : Business and services electricity consumption (medium voltage, grid level 5) in kWh
+    * Hr [%Hr] : Relative humidity
+    * RainDur [min] : Duration of precipitation (divided by 4 for conversion from hourly to quarter-hourly records)
+    * T [°C] : Temperature
+    * WD [°] : Wind direction
+    * WVv [m/s] : Wind vector speed
+    * p [hPa] : Air pressure
+    * WVs [m/s] : Wind scalar speed
+    * StrGlo [W/m2] : Global solar irradiation
+
+    Note: before 2018, the scalar speeds were calculated from the 30-minute vector data.
+
+    References
+    ----------
+    .. [1] https://data.stadt-zuerich.ch/dataset/ewz_stromabgabe_netzebenen_stadt_zuerich
+    .. [2] https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte
+    """
+
+    def __init__(self):
+        def pre_process_dataset(dataset_path):
+            """Restrict the time axis and add the weather data"""
+            df = pd.read_csv(dataset_path, index_col=0)
+            # convert time index
+            df.index = pd.DatetimeIndex(pd.to_datetime(df.index, utc=True)).tz_localize(
+                None
+            )
+            # extract pre-determined period
+            df = df.loc[
+                (pd.Timestamp("2015-01-01") <= df.index)
+                & (df.index <= pd.Timestamp("2022-12-31"))
+            ]
+            # download and preprocess the weather information
+            df_weather = self._download_weather_data()
+            # add weather data as additional features
+            df = pd.concat([df, df_weather], axis=1)
+            # interpolate weather data
+            df = df.interpolate()
+            # rain duration is given in minutes -> divide by 4 to convert hourly records to quarter-hourly ones
+            df["RainDur [min]"] = df["RainDur [min]"] / 4
+
+            # round Electricity cols to 4 decimals, other columns to 2 decimals
+            cols_precise = ["Value_NE5", "Value_NE7"]
+            df = df.round(
+                decimals={col: (4 if col in cols_precise else 2) for col in df.columns}
+            )
+
+            # export the dataset
+            df.index.name = "Timestamp"
+            df.to_csv(self._get_path_dataset())
+
+        # hash value for dataset with weather data
+        super().__init__(
+            metadata=DatasetLoaderMetadata(
+                "zurich_electricity_consumption.csv",
+                uri=(
+                    "https://data.stadt-zuerich.ch/dataset/"
+                    "ewz_stromabgabe_netzebenen_stadt_zuerich/"
+                    "download/ewz_stromabgabe_netzebenen_stadt_zuerich.csv"
+                ),
+                hash="c2fea1a0974611ff1c276abcc1d34619",
+                header_time="Timestamp",
+                freq="15min",
+                pre_process_csv_fn=pre_process_dataset,
+            )
+        )
+
+    @staticmethod
+    def _download_weather_data():
+        """Concatenate the yearly csv files into a single dataframe and reshape it"""
+        # download the csv from the url
+        base_url = "https://data.stadt-zuerich.ch/dataset/ugz_meteodaten_stundenmittelwerte/download/"
+        filenames = [f"ugz_ogd_meteo_h1_{year}.csv" for year in range(2015, 2023)]
+        df = pd.concat([pd.read_csv(base_url + fname) for fname in filenames])
+        # retain only one weather station
+        df = df.loc[df["Standort"] == "Zch_Stampfenbachstrasse"]
+        # pivot the df to get all measurements as columns
+        df["param_name"] = df["Parameter"] + " [" + df["Einheit"] + "]"
+        df = df.pivot(index="Datum", columns="param_name", values="Wert")
+        # convert time index from CET to UTC and extract the required time range
+        df.index = pd.DatetimeIndex(pd.to_datetime(df.index, utc=True)).tz_localize(
+            None
+        )
+        df = df.loc[
+            (pd.Timestamp("2015-01-01") <= df.index)
+            & (df.index <= pd.Timestamp("2022-12-31"))
+        ]
+        return df
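
The weather preprocessing added above (pivot the long-format records, convert the time index to naive UTC, upsample to 15 minutes, interpolate, rescale the rain duration) can be sketched on a few synthetic rows. This is only an illustration under stated assumptions: the records, the timestamp format and the variable names are hypothetical, and `resample("15min").asfreq()` stands in for the `pd.concat` with the quarter-hourly consumption frame that the actual code relies on to create the intermediate time stamps.

```python
import pandas as pd

# Hypothetical long-format records mimicking the column layout used in the diff
# (Datum, Standort, Parameter, Einheit, Wert); the values are made up.
records = pd.DataFrame(
    {
        "Datum": [
            "2015-01-01T00:00:00+01:00",
            "2015-01-01T00:00:00+01:00",
            "2015-01-01T01:00:00+01:00",
            "2015-01-01T01:00:00+01:00",
        ],
        "Standort": ["Zch_Stampfenbachstrasse"] * 4,
        "Parameter": ["T", "RainDur", "T", "RainDur"],
        "Einheit": ["°C", "min", "°C", "min"],
        "Wert": [2.0, 0.0, 4.0, 20.0],
    }
)

# One column per measurement, named "<parameter> [<unit>]"
records["param_name"] = records["Parameter"] + " [" + records["Einheit"] + "]"
wide = records.pivot(index="Datum", columns="param_name", values="Wert")

# Parse the offset-aware strings as UTC, then drop the time zone information
wide.index = pd.DatetimeIndex(pd.to_datetime(wide.index, utc=True)).tz_localize(None)

# Assumption: upsample with resample/asfreq instead of concatenating with a
# 15-minute frame; both leave NaN at the new time stamps before interpolation.
quarter_hourly = wide.resample("15min").asfreq().interpolate()

# Hourly rain duration (minutes) spread evenly over four quarter-hour records
quarter_hourly["RainDur [min]"] = quarter_hourly["RainDur [min]"] / 4

print(quarter_hourly)
```

In this toy input the two hourly records become five quarter-hourly rows, with the temperature interpolated linearly between the hourly values.
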
diff --git a/darts/datasets/dataset_loaders.py b/darts/datasets/dataset_loaders.py
@@ -31,8 +31,10 @@ class DatasetLoaderMetadata:
     format_time: Optional[str] = None
     # used to indicate the freq when we already know it
     freq: Optional[str] = None
-    # a custom function to handling non-csv based datasets
+    # a custom function handling non-csv based datasets
     pre_process_zipped_csv_fn: Optional[Callable] = None
+    # a custom function handling csv based datasets
+    pre_process_csv_fn: Optional[Callable] = None
     # multivariate
     multivariate: Optional[bool] = None
 
@@ -49,7 +51,9 @@ class DatasetLoader(ABC):
 
     _DEFAULT_DIRECTORY = Path(os.path.join(Path.home(), Path(".darts/datasets/")))
 
-    def __init__(self, metadata: DatasetLoaderMetadata, root_path: Path = None):
+    def __init__(
+        self, metadata: DatasetLoaderMetadata, root_path: Optional[Path] = None
+    ):
         self._metadata: DatasetLoaderMetadata = metadata
         if root_path is None:
             self._root_path: Path = DatasetLoader._DEFAULT_DIRECTORY
@@ -131,7 +135,13 @@ def _download_dataset(self):
                 "Could not download the dataset. Reason:" + e.__repr__()
             ) from None
 
+        if self._metadata.pre_process_csv_fn is not None:
+            self._metadata.pre_process_csv_fn(self._get_path_dataset())
+
     def _download_zip_dataset(self):
+        if self._metadata.pre_process_csv_fn is not None:
+            logger.warning("Loading a ZIP file does not use the pre_process_csv_fn")
+
         os.makedirs(self._root_path, exist_ok=True)
         try:
             request = requests.get(self._metadata.uri)
@@ -186,7 +196,9 @@ def _format_time_column(self, df):
 
 
 class DatasetLoaderCSV(DatasetLoader):
-    def __init__(self, metadata: DatasetLoaderMetadata, root_path: Path = None):
+    def __init__(
+        self, metadata: DatasetLoaderMetadata, root_path: Optional[Path] = None
+    ):
         super().__init__(metadata, root_path)
 
     def _load_from_disk(
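
The `pre_process_csv_fn` hook added in this file follows a simple pattern: once the CSV is on disk, an optional callable receives its path and may rewrite the file in place. Below is a minimal sketch of that pattern, not the darts implementation: `ToyMetadata`, `toy_download` and `keep_positive_rows` are hypothetical names, and writing a string to disk stands in for the HTTP download.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

import pandas as pd


@dataclass
class ToyMetadata:
    """Hypothetical stand-in for DatasetLoaderMetadata, reduced to two fields."""

    name: str
    # optional hook called with the path of the freshly written CSV
    pre_process_csv_fn: Optional[Callable[[Path], None]] = None


def toy_download(metadata: ToyMetadata, root: Path, raw_csv: str) -> Path:
    """Write `raw_csv` to disk (instead of downloading it), then run the hook."""
    path = root / metadata.name
    path.write_text(raw_csv)
    if metadata.pre_process_csv_fn is not None:
        metadata.pre_process_csv_fn(path)
    return path


def keep_positive_rows(path: Path) -> None:
    """Example hook: drop non-positive values and overwrite the file in place."""
    df = pd.read_csv(path, index_col=0)
    df[df["value"] > 0].to_csv(path)


with tempfile.TemporaryDirectory() as tmp:
    out = toy_download(
        ToyMetadata("toy.csv", pre_process_csv_fn=keep_positive_rows),
        Path(tmp),
        "t,value\n0,1.5\n1,-2.0\n2,3.0\n",
    )
    result = pd.read_csv(out, index_col=0)

print(result)
```

Because the hook overwrites the file, anything that later reads or checks the file sees the processed version. The comment `# hash value for dataset with weather data` in `ElectricityConsumptionZurichDataset` suggests the same holds for the hash there.
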

diff --git a/darts/tests/datasets/test_dataset_loaders.py b/darts/tests/datasets/test_dataset_loaders.py
@@ -10,6 +10,7 @@
     AirPassengersDataset,
     AusBeerDataset,
     AustralianTourismDataset,
+    ElectricityConsumptionZurichDataset,
     ElectricityDataset,
     EnergyDataset,
     ETTh1Dataset,
@@ -40,37 +41,36 @@
     DatasetLoadingException,
 )
 
-datasets = [
-    AirPassengersDataset,
-    AusBeerDataset,
-    AustralianTourismDataset,
-    EnergyDataset,
-    HeartRateDataset,
-    IceCreamHeaterDataset,
-    MonthlyMilkDataset,
-    SunspotsDataset,
-    TaylorDataset,
-    TemperatureDataset,
-    USGasolineDataset,
-    WineDataset,
-    WoolyDataset,
-    GasRateCO2Dataset,
-    MonthlyMilkIncompleteDataset,
-    ETTh1Dataset,
-    ETTh2Dataset,
-    ETTm1Dataset,
-    ETTm2Dataset,
-    ElectricityDataset,
-    UberTLCDataset,
-    ILINetDataset,
-    ExchangeRateDataset,
-    TrafficDataset,
-    WeatherDataset,
-]
-
 _DEFAULT_PATH_TEST = _DEFAULT_PATH + "/tests"
 
-width_datasets = [1, 1, 96, 28, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 7, 7, 7, 7, 370, 262]
+datasets_with_width = [
+    (AirPassengersDataset, 1),
+    (AusBeerDataset, 1),
+    (AustralianTourismDataset, 96),
+    (EnergyDataset, 28),
+    (HeartRateDataset, 1),
+    (IceCreamHeaterDataset, 2),
+    (MonthlyMilkDataset, 1),
+    (SunspotsDataset, 1),
+    (TaylorDataset, 1),
+    (TemperatureDataset, 1),
+    (USGasolineDataset, 1),
+    (WineDataset, 1),
+    (WoolyDataset, 1),
+    (GasRateCO2Dataset, 2),
+    (MonthlyMilkIncompleteDataset, 1),
+    (ETTh1Dataset, 7),
+    (ETTh2Dataset, 7),
+    (ETTm1Dataset, 7),
+    (ETTm2Dataset, 7),
+    (ElectricityDataset, 370),
+    (UberTLCDataset, 262),
+    (ILINetDataset, 11),
+    (ExchangeRateDataset, 8),
+    (TrafficDataset, 862),
+    (WeatherDataset, 21),
+    (ElectricityConsumptionZurichDataset, 10),
+]
 
 wrong_hash_dataset = DatasetLoaderCSV(
     metadata=DatasetLoaderMetadata(
@@ -135,9 +135,9 @@ def tmp_dir_dataset():
 
 class TestDatasetLoader:
     @pytest.mark.slow
-    @pytest.mark.parametrize("dataset_config", zip(width_datasets, datasets))
+    @pytest.mark.parametrize("dataset_config", datasets_with_width)
     def test_ok_dataset(self, dataset_config, tmp_dir_dataset):
-        width, dataset_cls = dataset_config
+        dataset_cls, width = dataset_config
         dataset = dataset_cls()
         assert dataset._DEFAULT_DIRECTORY == tmp_dir_dataset
         ts: TimeSeries = dataset.load()
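
One practical effect of this refactoring: the removed `width_datasets` list has 21 entries while the removed `datasets` list has 25, so `zip(width_datasets, datasets)` silently stopped after 21 pairs and the last four datasets were never parametrized. Keeping each class next to its width avoids that mismatch. The sketch below shows the behaviour with plain Python lists; dataset names are used as strings so that it runs without darts or pytest.

```python
# Two parallel lists that have drifted apart: four datasets, two widths
dataset_names = ["AirPassengers", "AustralianTourism", "ILINet", "Traffic"]
widths = [1, 96]

# zip() stops at the shorter input without raising an error
zipped = list(zip(widths, dataset_names))
covered = {name for _, name in zipped}
silently_skipped = [name for name in dataset_names if name not in covered]

# Pairing each name with its width in one list makes a missing width impossible
names_with_width = [
    ("AirPassengers", 1),
    ("AustralianTourism", 96),
    ("ILINet", 11),
    ("Traffic", 862),
]

print(zipped)            # only the first two datasets are paired
print(silently_skipped)  # the rest would never reach the test
```

On Python 3.10 and later, `zip(..., strict=True)` would raise a `ValueError` for lists of unequal length, but the tuple list also keeps each width visibly attached to its dataset.
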

diff --git a/docs/source/examples.rst b/docs/source/examples.rst
@@ -76,6 +76,16 @@ with Darts using the Optuna library for hyperparameter optimization.
 
    examples/17-hyperparameter-optimization.ipynb
 
+Regression Models
+=================
+
+Regression models example notebook:
+
+.. toctree::
+   :maxdepth: 1
+
+   examples/20-RegressionModel-examples.ipynb
+
 
 Fast Fourier Transform
 ======================

diff --git a/examples/20-RegressionModel-examples.ipynb b/examples/20-RegressionModel-examples.ipynb
diff --git a/examples/static/images/multi_model_ocl2.png b/examples/static/images/multi_model_ocl2.png
diff --git a/examples/static/images/regression_model_train.png b/examples/static/images/regression_model_train.png
diff --git a/examples/static/images/single_model_ocl2.png b/examples/static/images/single_model_ocl2.png
diff --git a/examples/static/images/single_model_ocl3.png b/examples/static/images/single_model_ocl3.png