[QUESTION] Training Loss Much Lower Than Validation Loss in TSMixerModel: Need Help Understanding Why #2558

Open
erl61 opened this issue Oct 10, 2024 · 5 comments
Labels
question Further information is requested

erl61 commented Oct 10, 2024

Issue
I am training a TSMixerModel to forecast multivariate time series. The model performs well overall, but I notice that the training loss is consistently much lower than the validation loss (sometimes by orders of magnitude).

I have already tried different loss functions (MAELoss, MapeLoss), and the issue persists. However, when I forecast using this model, I don’t observe signs of overfitting, and the model predictions look good.

Callback
I use the following setup for logging the losses:

from pytorch_lightning.callbacks import Callback

class LossLogger(Callback):
    """Collects the epoch-level training and validation losses."""

    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    # called automatically at the end of each training epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    # called automatically at the end of each validation epoch; the sanity check run is skipped
    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        if not trainer.sanity_checking:
            self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()

Model
This is how I initialize the model:

progress_bar = TFMProgressBar(enable_sanity_check_bar=False, enable_validation_bar=False)

limit_train_batches = 50
limit_val_batches = 50
max_epochs = 30
batch_size = 64

model_tsm = TSMixerModel(
    input_chunk_length=49,
    output_chunk_length=130,
    use_reversible_instance_norm=True,
    optimizer_kwargs={"lr": 1e-4},
    nr_epochs_val_period=1,
    pl_trainer_kwargs={
        "gradient_clip_val": 1,
        "max_epochs": max_epochs,
        "limit_train_batches": limit_train_batches,
        "limit_val_batches": limit_val_batches,
        "accelerator": "auto",
        "callbacks": [progress_bar, loss_logger],
    },
    lr_scheduler_cls=torch.optim.lr_scheduler.ExponentialLR,
    lr_scheduler_kwargs={"gamma": 0.999},
    likelihood=QuantileRegression(),
    loss_fn=None,
    save_checkpoints=True,
    force_reset=True,
    batch_size=batch_size,
    random_state=42,
    add_encoders={"cyclic": {"future": ["month", "day", "weekday", "quarter", "dayofyear", "week"]}},
    use_static_covariates=True,
    model_name="tsm",
)

Loss curves
Here are the plotted loss curves after training:

# read the losses from the callback instance rather than by position in trainer.callbacks
loss_df = pd.DataFrame({'epoch': range(len(loss_logger.train_loss)),
                        'train_loss': loss_logger.train_loss,
                        'val_loss': loss_logger.val_loss})

plt.plot(loss_df['epoch'], loss_df['train_loss'], color='blue',
         label='train loss: ' + str(loss_df['train_loss'].iloc[-1]))

plt.plot(loss_df['epoch'], loss_df['val_loss'], color='orange',
         label='val loss: ' + str(loss_df['val_loss'].iloc[-1]))

plt.gcf().set_size_inches(10, 5)
plt.legend()
plt.show()

[Image: training and validation loss curves]

Data
I create my multivariate time series using from_group_dataframe() as follows:

ts_df = TimeSeries.from_group_dataframe(df, group_cols=['group1', 'group2', 'group3'],
                                time_col='ds', value_cols='y', freq='D')

Question
Why is my training loss significantly lower than the validation loss, sometimes by orders of magnitude? Could it be related to how the data is structured as a list of time series? Is this expected behavior in this scenario, or could there be an issue with scaling or loss calculation?

I appreciate any help or insights!

Thanks!

erl61 added the question and triage labels Oct 10, 2024
dennisbader (Collaborator) commented Oct 10, 2024

Hi @erl61, could you provide a minimal reproducible example that includes the model training (and any data processing), and show which series you pass to fit and predict?

erl61 (Author) commented Oct 11, 2024

Hi @dennisbader, here's an example taken from the documentation but applied to my data. Unfortunately, I cannot share the actual data due to an NDA, but my code looks like this:

%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings

warnings.filterwarnings("ignore")
import logging

logging.disable(logging.CRITICAL)

import torch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

from darts import concatenate
from darts.dataprocessing.transformers.scaler import Scaler
from darts.datasets import ETTh1Dataset, ETTh2Dataset
from darts.metrics import mae, mse, mql
from darts.models import TiDEModel, TSMixerModel
from darts.utils.likelihood_models import QuantileRegression
from darts.utils.callbacks import TFMProgressBar

from darts import TimeSeries
from darts.dataprocessing.transformers import StaticCovariatesTransformer
from pytorch_lightning.callbacks import Callback

df = pd.read_table('data.tsv')
df['ds'] = pd.to_datetime(df['ds'])
df = df.dropna()

assert not df.isna().any().any()  # no NaN left after dropna()
assert not np.isinf(df.select_dtypes("number")).any().any()  # no +/-inf

ts_df = TimeSeries.from_group_dataframe(df, group_cols=['group1', 'group2', 'group3'],
                                time_col='ds', value_cols='y', freq='D')

static_transformer = StaticCovariatesTransformer()
ts_df_transformed = static_transformer.fit_transform(ts_df)

train, val, test = [], [], []
for trafo in ts_df_transformed:
    train_, temp = trafo.split_after(0.6)
    val_, test_ = temp.split_after(0.5)
    train.append(train_)
    val.append(val_)
    test.append(test_)
    
scaler = Scaler()  # default uses sklearn's MinMaxScaler
train = scaler.fit_transform(train)
val = scaler.transform(val)
test = scaler.transform(test)

# Callbacks

class LossLogger(Callback):
    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    # will automatically be called at the end of each epoch
    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        if not trainer.sanity_checking:
            self.val_loss.append(float(trainer.callback_metrics["val_loss"]))

loss_logger = LossLogger()


def create_params(
    input_chunk_length: int,
    output_chunk_length: int,
    full_training=True,
):
    # early stopping: this setting stops training once the validation
    # loss has not decreased by more than 1e-5 for 10 epochs
    early_stopper = EarlyStopping(
        monitor="val_loss",
        patience=10,
        min_delta=1e-5,
        mode="min",
    )

    # PyTorch Lightning Trainer arguments (you can add any custom callback)
    if full_training:
        limit_train_batches = None
        limit_val_batches = None
        max_epochs = 200
        batch_size = 256
    else:
        limit_train_batches = 20
        limit_val_batches = 10
        max_epochs = 40
        batch_size = 64

    # only show the training and prediction progress bars
    progress_bar = TFMProgressBar(
        enable_sanity_check_bar=False, enable_validation_bar=False
    )
    pl_trainer_kwargs = {
        "gradient_clip_val": 1,
        "max_epochs": max_epochs,
        "limit_train_batches": limit_train_batches,
        "limit_val_batches": limit_val_batches,
        "accelerator": "auto",
        "callbacks": [early_stopper, progress_bar, loss_logger],
    }

    # optimizer setup, uses Adam by default
    optimizer_cls = torch.optim.Adam
    optimizer_kwargs = {
        "lr": 1e-4,
    }

    # learning rate scheduler
    lr_scheduler_cls = torch.optim.lr_scheduler.ExponentialLR
    lr_scheduler_kwargs = {"gamma": 0.999}

    # for probabilistic models, we use quantile regression, and set `loss_fn` to `None`
    likelihood = QuantileRegression()
    loss_fn = None

    return {
        "input_chunk_length": input_chunk_length,  # lookback window
        "output_chunk_length": output_chunk_length,  # forecast/lookahead window
        "use_reversible_instance_norm": True,
        "optimizer_kwargs": optimizer_kwargs,
        "pl_trainer_kwargs": pl_trainer_kwargs,
        "lr_scheduler_cls": lr_scheduler_cls,
        "lr_scheduler_kwargs": lr_scheduler_kwargs,
        "likelihood": likelihood,  # use a `likelihood` for probabilistic forecasts
        "loss_fn": loss_fn,  # use a `loss_fn` for deterministic models
        "save_checkpoints": True,  # checkpoint to retrieve the best performing model state
        "force_reset": True,
        "batch_size": batch_size,
        "random_state": 42,
        "add_encoders": {
            "cyclic": {
                "future": ["hour", "dayofweek", "month"]
            }  # add cyclic time axis encodings as future covariates
        },
    }


input_chunk_length = 7 * 24
output_chunk_length = 24
use_static_covariates = True
full_training = False


# create the models
model_tsm = TSMixerModel(
    **create_params(
        input_chunk_length,
        output_chunk_length,
        full_training=full_training,
    ),
    use_static_covariates=use_static_covariates,
    model_name="tsm",
)

models = {
    "TSM": model_tsm,
}


for model_name, model in models.items():
    model.fit(
        series=train,
        val_series=val,
    )
    # load from checkpoint returns a new model object, we store it in the models dict
    models[model_name] = model.load_from_checkpoint(
        model_name=model.model_name, best=True
    )
    
    
# read the losses from the callback instance rather than by position in trainer.callbacks
loss_df = pd.DataFrame({'epoch': range(len(loss_logger.train_loss)),
                        'train_loss': loss_logger.train_loss,
                        'val_loss': loss_logger.val_loss})

plt.plot(loss_df['epoch'], loss_df['train_loss'], color='blue',
         label='train loss: ' + str(loss_df['train_loss'].iloc[-1]))

plt.plot(loss_df['epoch'], loss_df['val_loss'], color='orange',
         label='val loss: ' + str(loss_df['val_loss'].iloc[-1]))


plt.gcf().set_size_inches(10, 5)
plt.legend()
plt.show()

My dataset looks like this:
[Image: Screenshot 2024-10-11 at 17:39:50]

Loss curves:
[Image: training and validation loss curves]

Could my issue be related to the large number of zeros in the dataset (about 10% of the data) or to the scale of the target variable (which ranges from zero to millions, although I apply a scaler)? Would these factors affect the loss calculation and produce such a large discrepancy between training and validation losses?
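
For context on the loss-calculation part of the question: with `likelihood=QuantileRegression()` and `loss_fn=None`, the model is trained on a quantile (pinball) loss, which is expressed in the units of the scaled target. The snippet below is a plain-Python sketch of the textbook pinball loss for a single point, not Darts' implementation; the function name and the numbers are made up for illustration.

```python
def pinball_loss(y_true, y_pred, q):
    """Textbook quantile (pinball) loss for one point and one quantile q."""
    error = y_true - y_pred
    return max(q * error, (q - 1) * error)

# For a high quantile, predicting too low costs more than predicting too high.
under = pinball_loss(10.0, 8.0, q=0.9)  # prediction below the target
over = pinball_loss(8.0, 10.0, q=0.9)   # prediction above the target

# The loss lives in the units of the target: the same 20% miss on a value
# that is 1000 times larger yields a loss that is 1000 times larger.
small = pinball_loss(10.0, 8.0, q=0.5)
large = pinball_loss(10_000.0, 8_000.0, q=0.5)

print(under, over, small, large)
```

Under this simplified view, two data splits with different typical magnitudes can show very different loss values even when the relative forecast quality is similar.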

dennisbader (Collaborator) commented Oct 12, 2024

The nature of your time series data could indeed be the issue.
I would suggest a few things:

  • Try training on a small subset of your series and check whether the loss behaves differently.
  • Zero to millions can be a very large range if the value distribution is long-tailed. Say the values in the millions are "outliers" and most values lie in the range 0-1k: the scaler would then map most of your values close to 0, which could indeed hurt model performance. That would call for some data processing (outlier removal, potentially another transformation, ...).
  • Are the lower values mostly found at the beginning of your series? If so, the training set would indeed have much lower values (and errors) than the validation and test sets.
  • How many different "group" combinations do you have? By default, the StaticCovariatesTransformer applies an ordinal encoding to categorical static covariates, which implies that your groups have a numeric relationship. If the number of groups is low, you could use a one-hot encoding instead (via transformer_cat).
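
The long-tail point can be illustrated with a small plain-Python sketch. The values are hypothetical, `min_max_scale` is a stand-in for what a min-max scaler does to a single series, and the log transform is just one possible choice of "other transformation", not a recommendation specific to this dataset.

```python
import math
import statistics

def min_max_scale(values):
    """Map values linearly onto [0, 1], like a min-max scaler on one series."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical long-tailed data: mostly 0..1k, with two values in the millions.
values = [0.0, 10.0, 50.0, 200.0, 500.0, 800.0, 1000.0, 2_000_000.0, 5_000_000.0]

# Scaling the raw values squeezes the bulk of the data towards 0.
scaled_raw = min_max_scale(values)

# Compressing the tail first (log1p handles the zeros) spreads the bulk out.
scaled_log = min_max_scale([math.log1p(v) for v in values])

median_raw = statistics.median(scaled_raw)
median_log = statistics.median(scaled_log)
print(f"median after min-max on raw values: {median_raw:.6f}")
print(f"median after log1p + min-max:       {median_log:.6f}")
```

If a transformation like this is applied before training, forecasts have to be mapped back with the inverse (here `math.expm1`) before they are compared with the original data.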

dennisbader removed the triage label Oct 14, 2024
erl61 (Author) commented Oct 21, 2024

@dennisbader Thank you!

Yes, the loss curve starts to look correct when I use a subset of data with stable time series behavior and fewer zeros.

In my data, I do have a long tail, with the majority of the values falling between 0 and 100,000. Over time, the values increase, so the beginning of the series has lower values compared to the end.
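
To make the effect of this upward drift concrete, here is a plain-Python sketch with a made-up series that grows by 5% per step. It assumes a chronological 60/20/20 split and a min-max scaler fitted on the training part only, roughly as in the code above, and it assumes every forecast is off by the same 10% of the true value. None of these numbers come from the real data.

```python
# Hypothetical series that grows by 5% per time step.
series = [100.0 * 1.05 ** t for t in range(100)]

# Chronological split: first 60% for training, next 20% for validation.
train, val = series[:60], series[60:80]

# Min-max statistics come from the training part only.
lo, hi = min(train), max(train)

def scale(values):
    """Apply the training-set min-max scaling to any part of the series."""
    return [(v - lo) / (hi - lo) for v in values]

def mae_for_relative_error(values, rel_error=0.10):
    """MAE in scaled units when every forecast is off by rel_error * true value."""
    scaled_true = scale(values)
    scaled_pred = scale([v * (1 + rel_error) for v in values])
    return sum(abs(t - p) for t, p in zip(scaled_true, scaled_pred)) / len(values)

train_mae = mae_for_relative_error(train)
val_mae = mae_for_relative_error(val)
max_val_scaled = max(scale(val))

print(f"largest scaled validation value: {max_val_scaled:.2f}")
print(f"train MAE: {train_mae:.4f}, val MAE: {val_mae:.4f}")
```

In this toy setting the validation values land outside the [0, 1] range seen in training, and the same relative error produces a larger absolute loss on the validation part, without any overfitting involved.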

I have 650 different group combinations, which reflect the complexity of the business model.

I’m using the Temporal Fusion Transformer model from the pytorch-forecasting package, and it performs well when using EncoderNormalizer, which normalizes each individual time series sequence during training. Is there something similar I can use in Darts?

dennisbader (Collaborator) commented Oct 22, 2024

Our torch models have the use_reversible_instance_norm parameter, which you could try out (it is set at model creation).
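
As a rough illustration of the idea behind reversible instance normalization: each input window is normalized with its own statistics, and the model output is mapped back with the same statistics. The sketch below is a simplified plain-Python version. It leaves out the learnable affine parameters of the original method, it is not Darts' implementation, and the function names are hypothetical.

```python
import statistics

def instance_normalize(window, eps=1e-5):
    """Normalize one input window with its own mean and standard deviation."""
    mean = statistics.mean(window)
    std = statistics.pstdev(window)
    normalized = [(x - mean) / (std + eps) for x in window]
    return normalized, (mean, std)

def instance_denormalize(values, stats, eps=1e-5):
    """Map values back to the original scale using the stored window statistics."""
    mean, std = stats
    return [v * (std + eps) + mean for v in values]

# Two windows with the same shape but very different magnitudes.
small_window = [10.0, 12.0, 11.0, 13.0]
large_window = [x * 100_000 for x in small_window]

norm_small, stats_small = instance_normalize(small_window)
norm_large, stats_large = instance_normalize(large_window)

# After normalization both windows look nearly identical to the model.
print(norm_small)
print(norm_large)

# De-normalizing restores the original scale of the window.
restored = instance_denormalize(norm_large, stats_large)
print(restored)
```

In this sketch the per-window statistics play a role similar to the per-sequence normalization that EncoderNormalizer was described as providing in the previous comment.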
