[QUESTION]Training Loss Much Lower Than Validation Loss in TSMixerModel: Need Help Understanding Why #2558

erl61 · 2024-10-10T15:35:00Z

Issue
I am training a TSMixerModel to forecast multivariate time series. The model performs well overall, but I notice that the training loss is consistently much lower than the validation loss (sometimes by orders of magnitude).

I have already tried different loss functions (MAELoss, MapeLoss), and the issue persists. However, when I forecast using this model, I don’t observe signs of overfitting, and the model predictions look good.

Callback
I use the following setup for logging the losses:

from pytorch_lightning.callbacks import Callback

class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        if not trainer.sanity_checking:
            self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()

Model
This is how I initialize the model:

progress_bar = TFMProgressBar(enable_sanity_check_bar=False, enable_validation_bar=False)

limit_train_batches = 50
limit_val_batches = 50
max_epochs = 30
batch_size = 64

model_tsm = TSMixerModel(
    input_chunk_length=49,
    output_chunk_length=130,
    use_reversible_instance_norm=True,
    optimizer_kwargs={"lr": 1e-4},
    nr_epochs_val_period=1,
    pl_trainer_kwargs={
        "gradient_clip_val": 1,
        "max_epochs": max_epochs,
        "limit_train_batches": limit_train_batches,
        "limit_val_batches": limit_val_batches,
        "accelerator": "auto",
        "callbacks": [progress_bar, loss_logger],
    },
    lr_scheduler_cls=torch.optim.lr_scheduler.ExponentialLR,
    lr_scheduler_kwargs={"gamma": 0.999},
    likelihood=QuantileRegression(),
    loss_fn=None,
    save_checkpoints=True,
    force_reset=True,
    batch_size=64,
    random_state=42,
    add_encoders={"cyclic": {"future": ['month', 'day', 'weekday', 'quarter', 'dayofyear', 'week']}},
    use_static_covariates=True,
    model_name="tsm",
)

Loss curves
Here are the plotted loss curves after training:

loss_df = pd.DataFrame({'epoch':range(0, len(model_tsm.trainer.callbacks[1].train_loss)),
                        'train_loss':model_tsm.trainer.callbacks[1].train_loss,
                        'val_loss':model_tsm.trainer.callbacks[1].val_loss})

plt.plot(loss_df['epoch'],
         loss_df['train_loss'], color='blue', label='train loss: ' + str(loss_df['train_loss'][-1:].item()))


plt.plot(loss_df['epoch'],
         loss_df['val_loss'], color='orange', label='val loss: ' + str(loss_df['val_loss'][-1:].item()))


plt.gcf().set_size_inches(10, 5)
plt.legend()
plt.show()

Data
I create my multivariate time series using from_group_dataframe() as follows:

ts_df = TimeSeries.from_group_dataframe(df, group_cols=['group1', 'group2', 'group3'],
                                time_col='ds', value_cols='y', freq='D')

Question
Why is my training loss significantly lower than the validation loss, sometimes by orders of magnitude? Could it be related to how the data is structured as a list of time series? Is this expected behavior in this scenario, or could there be an issue with scaling or loss calculation?

I appreciate any help or insights!

Thanks!

dennisbader · 2024-10-10T16:46:09Z

Hi @erl61, could you provide a minimal reproducible example including model training (potentially processing of the data), what series you provide to fit and predict?

erl61 · 2024-10-11T15:42:46Z

Hi @dennisbader, here's an example taken from the documentation but applied to my data. Unfortunately, I cannot share the actual data due to an NDA, but my code looks like this:

%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings

warnings.filterwarnings("ignore")
import logging

logging.disable(logging.CRITICAL)

import torch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

from darts import concatenate
from darts.dataprocessing.transformers.scaler import Scaler
from darts.datasets import ETTh1Dataset, ETTh2Dataset
from darts.metrics import mae, mse, mql
from darts.models import TiDEModel, TSMixerModel
from darts.utils.likelihood_models import QuantileRegression
from darts.utils.callbacks import TFMProgressBar

from darts import TimeSeries
from darts.dataprocessing.transformers import StaticCovariatesTransformer
from pytorch_lightning.callbacks import Callback

df = pd.read_table('data.tsv')
df['ds'] = pd.to_datetime(df['ds'])
df = df.dropna()

df[df.isna().any(axis=1)] #no nan
df[(df == np.inf).any(axis=1)] #no inf

ts_df = TimeSeries.from_group_dataframe(df, group_cols=['group1', 'group2', 'group3'],
                                time_col='ds', value_cols='y', freq='D')

static_transformer = StaticCovariatesTransformer()
ts_df_transformed = static_transformer.fit_transform(ts_df)

train, val, test = [], [], []
for trafo in ts_df_transformed:
    train_, temp = trafo.split_after(0.6)
    val_, test_ = temp.split_after(0.5)
    train.append(train_)
    val.append(val_)
    test.append(test_)
    
scaler = Scaler()  # default uses sklearn's MinMaxScaler
train = scaler.fit_transform(train)
val = scaler.transform(val)
test = scaler.transform(test)

# Callbacks

class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        if not trainer.sanity_checking:
            self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()


def create_params(
    input_chunk_length: int,
    output_chunk_length: int,
    full_training=True,
):
    # early stopping: this setting stops training once the validation
    # loss has not decreased by more than 1e-5 for 10 epochs
    early_stopper = EarlyStopping(
        monitor="val_loss",
        patience=10,
        min_delta=1e-5,
        mode="min",
    )

    # PyTorch Lightning Trainer arguments (you can add any custom callback)
    if full_training:
        limit_train_batches = None
        limit_val_batches = None
        max_epochs = 200
        batch_size = 256
    else:
        limit_train_batches = 20
        limit_val_batches = 10
        max_epochs = 40
        batch_size = 64

    # only show the training and prediction progress bars
    progress_bar = TFMProgressBar(
        enable_sanity_check_bar=False, enable_validation_bar=False
    )
    pl_trainer_kwargs = {
        "gradient_clip_val": 1,
        "max_epochs": max_epochs,
        "limit_train_batches": limit_train_batches,
        "limit_val_batches": limit_val_batches,
        "accelerator": "auto",
        "callbacks": [early_stopper, progress_bar, loss_logger],
    }

    # optimizer setup, uses Adam by default
    optimizer_cls = torch.optim.Adam
    optimizer_kwargs = {
        "lr": 1e-4,
    }

    # learning rate scheduler
    lr_scheduler_cls = torch.optim.lr_scheduler.ExponentialLR
    lr_scheduler_kwargs = {"gamma": 0.999}

    # for probabilistic models, we use quantile regression, and set `loss_fn` to `None`
    likelihood = QuantileRegression()
    loss_fn = None

    return {
        "input_chunk_length": input_chunk_length,  # lookback window
        "output_chunk_length": output_chunk_length,  # forecast/lookahead window
        "use_reversible_instance_norm": True,
        "optimizer_kwargs": optimizer_kwargs,
        "pl_trainer_kwargs": pl_trainer_kwargs,
        "lr_scheduler_cls": lr_scheduler_cls,
        "lr_scheduler_kwargs": lr_scheduler_kwargs,
        "likelihood": likelihood,  # use a `likelihood` for probabilistic forecasts
        "loss_fn": loss_fn,  # use a `loss_fn` for deterministic models
        "save_checkpoints": True,  # checkpoint to retrieve the best performing model state,
        "force_reset": True,
        "batch_size": batch_size,
        "random_state": 42,
        "add_encoders": {
            "cyclic": {
                "future": ["hour", "dayofweek", "month"]
            }  # add cyclic time axis encodings as future covariates
        },
    }


input_chunk_length = 7 * 24
output_chunk_length = 24
use_static_covariates = True
full_training = False


# create the models
model_tsm = TSMixerModel(
    **create_params(
        input_chunk_length,
        output_chunk_length,
        full_training=full_training,
    ),
    use_static_covariates=use_static_covariates,
    model_name="tsm",
)

models = {
    "TSM": model_tsm,
}


for model_name, model in models.items():
    model.fit(
        series=train,
        val_series=val,
    )
    # load from checkpoint returns a new model object, we store it in the models dict
    models[model_name] = model.load_from_checkpoint(
        model_name=model.model_name, best=True
    )
    
    
loss_df = pd.DataFrame({'epoch':range(0, len(model_tsm.trainer.callbacks[2].train_loss)),
                        'train_loss':model_tsm.trainer.callbacks[2].train_loss,
                        'val_loss':model_tsm.trainer.callbacks[2].val_loss})

plt.plot(loss_df['epoch'],
         loss_df['train_loss'], color='blue', label='train loss: ' + str(loss_df['train_loss'][-1:].item()))


plt.plot(loss_df['epoch'],
         loss_df['val_loss'], color='orange', label='val loss: ' + str(loss_df['val_loss'][-1:].item()))


plt.gcf().set_size_inches(10, 5)
plt.legend()
plt.show()

My dataset looks like this:

Loss curves:

Could my issue be related to the large number of zeros in the dataset (10% of data) or the scale of the target variable (which ranges from zero to millions, but I use scaler)? Would these factors affect the loss calculation and result in such a significant discrepancy between training and validation losses?

dennisbader · 2024-10-12T09:22:35Z

The nature of your time series data could indeed be the issue.
I would suggest a few things:

- Try running training on a small subset of your series and see whether the loss changes.
- Zero to millions can be a very large range if the value distribution is long-tailed. Say your values in the million range are "outliers" and most of your values are in the range 0-1k; then the scaler would transform most of your values to values close to 0, which could indeed hurt model performance. That would require some data processing (outlier removal, potentially other transformations, ...).
- Also, are the lower values mostly found at the beginning of your series? Then the training set would indeed have much lower values (and errors) than the validation and test sets.
- How many different "group" combinations do you have? By default, the StaticCovariatesTransformer uses an ordinal encoding for categoricals, which would assume that your groups have a numeric relationship. If the number of groups is low, you could use a one-hot encoding instead (via transformer_cat).
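
To make the long-tail point concrete, here is a minimal NumPy-only sketch. The data is synthetic and hypothetical (990 values in 0-1k plus 10 values in the millions), and `min_max` is a hand-written stand-in for a default min-max scaler, not a Darts call. It compares how many points end up in the lowest 1% of the scaled range with and without a `log1p` transform, which is just one possible pre-processing choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical long-tailed target: most values are small, a few reach the millions.
values = np.concatenate([
    rng.uniform(0, 1_000, size=990),   # bulk of the data
    rng.uniform(1e6, 5e6, size=10),    # rare, very large values
])


def min_max(x):
    """Scale x linearly to [0, 1], like a default min-max scaler would."""
    return (x - x.min()) / (x.max() - x.min())


scaled_raw = min_max(values)             # scale the raw values
scaled_log = min_max(np.log1p(values))   # compress the tail first, then scale

# Share of points squeezed into the lowest 1% of the scaled range
share_raw = np.mean(scaled_raw < 0.01)
share_log = np.mean(scaled_log < 0.01)

print(f"raw:   {share_raw:.1%} of points below 0.01")
print(f"log1p: {share_log:.1%} of points below 0.01")
```

With the raw values, the whole bulk of the data collapses into a sliver next to zero, so the loss is dominated by the few large points. Whether a log transform, outlier clipping, or something else is appropriate depends on the actual data.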

erl61 · 2024-10-21T17:29:27Z

@dennisbader Thank you!

Yes, the loss curve starts to look correct when I use a subset of data with stable time series behavior and fewer zeros.

In my data, I do have a long tail, with the majority of the values falling between 0 and 100,000. Over time, the values increase, so the beginning of the series has lower values compared to the end.
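
As a rough illustration of why an upward trend alone can separate the two loss curves, the sketch below uses a synthetic, hypothetical linear series (not the real data), splits it chronologically 60/20/20 as in the code above, and applies a hand-written min-max scaling fitted on the training part only. The "10% relative error" is an assumption made purely to compare absolute errors on both parts.

```python
import numpy as np

# Hypothetical series with an upward trend, standing in for one group
t = np.arange(1000)
series = 100.0 + 2.0 * t

# Chronological split: first 60% for training, next 20% for validation
train, val = series[:600], series[600:800]

# Min-max scaling fitted on the training part only, then applied to both parts
lo, hi = train.min(), train.max()
train_scaled = (train - lo) / (hi - lo)
val_scaled = (val - lo) / (hi - lo)

# Assume the same 10% relative error on both parts:
# the absolute error is larger wherever the scaled values are larger.
mae_train = np.mean(np.abs(0.1 * train_scaled))
mae_val = np.mean(np.abs(0.1 * val_scaled))

print("train range:", train_scaled.min(), train_scaled.max())
print("val range:  ", val_scaled.min(), val_scaled.max())
print("MAE train vs val:", mae_train, mae_val)
```

Here every validation value lies above the training maximum of 1, so an identical relative error already yields a larger absolute loss on the validation set, even without any overfitting.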

I have 650 different group combinations, which reflect the complexity of the business model.

I’m using the Temporal Fusion Transformer model from the pytorch-forecasting package, and it performs well when using EncoderNormalizer, which normalizes each individual time series sequence during training. Is there something similar I can use in Darts?

dennisbader · 2024-10-22T06:54:22Z

We have the use_reversible_instance_norm for our torch models which you could try out (at model creation)
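
For intuition, the idea behind per-window (instance) normalization can be sketched in plain NumPy. This is a conceptual illustration only, not the Darts implementation: the function names `instance_normalize` and `instance_denormalize` are made up for this example, and the actual layer may differ (for instance, by adding learnable affine parameters).

```python
import numpy as np


def instance_normalize(window, eps=1e-5):
    """Normalize one input window by its own mean and std along the time axis."""
    mean = window.mean(axis=0, keepdims=True)
    std = window.std(axis=0, keepdims=True)
    return (window - mean) / (std + eps), mean, std


def instance_denormalize(pred, mean, std, eps=1e-5):
    """Map values back to the original scale using the stored window statistics."""
    return pred * (std + eps) + mean


# Two hypothetical windows of shape (time, components) with the same shape
# but very different levels, e.g. early vs. late in a growing series
early = np.array([[10.0], [12.0], [11.0], [13.0]])
late = early * 1000

norm_early, mean_e, std_e = instance_normalize(early)
norm_late, mean_l, std_l = instance_normalize(late)

# After normalization the model would see (almost) the same input for both windows
print(np.allclose(norm_early, norm_late, atol=1e-4))

# The reverse step restores the original scale
print(np.allclose(instance_denormalize(norm_late, mean_l, std_l), late))
```

Because each window is normalized with its own statistics, the level of the series at a given point in time no longer determines the scale of the inputs the model sees.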

erl61 added the question and triage labels on Oct 10, 2024

dennisbader removed the triage label on Oct 14, 2024
