Suspected bug in predict_insample where available_mask is used #1229

Daniel-Wait · 2024-12-17T14:20:22Z

What happened + What you expected to happen

I inherited an NHITS forecaster trained on 284 unique_ids and 1,347,745 datapoints.
Some unique ids have more datapoints than others and I believe the available_mask is being used for training to account for that.

When I was trying to debug issues, I noticed that predict_insample was not working.

model = NeuralForecast.load(path=nhits_chkpt_path)
model.predict_insample(step_size=1)

  File "<path>/python3.10/site-packages/neuralforecast/core.py", line 622, in predict_insample
    fcsts[:, col_idx : (col_idx + output_length)] = model_fcsts
ValueError: could not broadcast input array from shape (2494088,5) into shape (1347745,5)

I ran some code at a breakpoint after I figured out that 2494088 = 284 * 8782 = self.dataset.n_groups * self.dataset.max_size.
The plot produced by the code snippet below looks quite sensible. I suspect that the array returned by model.predict(trimmed_dataset, step_size=step_size) needs to be filtered down to the dataset size.

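# Run at a breakpoint inside predict_insample, so self, model_fcsts
# and trimmed_dataset are already in scope
import matplotlib.pyplot as plt
import numpy as np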

dataset = self.dataset

# Unique ID to inspect (placeholder value); self.uids holds all unique IDs
target_unique_id = "XYZ"

# Find the index of the specific unique_id
target_index = self.uids.tolist().index(target_unique_id)

# Use indptr to get the data range for this unique_id
start_idx = dataset.indptr[target_index]
end_idx = dataset.indptr[target_index + 1]

# Extract the data for this unique_id
unique_id_data = {
    'data': dataset.temporal.data[start_idx:end_idx],
    'T': dataset.temporal.T[start_idx:end_idx],
    'H': dataset.temporal.H[start_idx:end_idx]
}

# Convert tensor to numpy
data_np = unique_id_data['data'].numpy()

# Extract the 'y' column
y_values = data_np[:, 0]

# Get the forecasts for the target UID
size_train = len(y_values)
idx_des = int(np.where(self.uids.values == target_unique_id)[0][0])
fcsts_des = model_fcsts[idx_des * trimmed_dataset.max_size : (idx_des+1) * trimmed_dataset.max_size]

plt.figure(figsize=(12,8))
plt.suptitle("Predict insample - " + target_unique_id)
plt.fill_between(np.arange(size_train), fcsts_des[-size_train:,4], fcsts_des[-size_train:,1], color='gold', alpha=.3, label='levels-95')
plt.fill_between(np.arange(size_train), fcsts_des[-size_train:,3], fcsts_des[-size_train:,2], color='tab:orange', alpha=.3, label='levels-80')
plt.plot(fcsts_des[-size_train:,0], linewidth=.7, label='forecast', alpha=.7)
plt.plot(y_values, color='k', linestyle='dashed', label='BQ costs')
plt.legend()
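
The trimming suggested above could be sketched roughly as follows, using NumPy only. This is not the library's fix, just an illustration of the idea. It assumes the forecasts come back as one block of max_size rows per series, left-padded so that each series' real rows sit at the end of its block (which is what the fcsts_des[-size_train:] slicing above relies on). The helper name trim_padded_forecasts and the toy shapes are hypothetical.

```python
import numpy as np


def trim_padded_forecasts(padded_fcsts, indptr, max_size):
    """Drop padding rows so each series keeps only its own timestamps.

    padded_fcsts: (n_groups * max_size, n_outputs) array, series back to back,
                  each ASSUMED to be left-padded to max_size rows.
    indptr:       (n_groups + 1,) offsets into the unpadded data.
    """
    sizes = np.diff(indptr)  # true length of each series
    n_groups = len(sizes)
    # Position of every row inside its padded block: 0 .. max_size - 1
    pos = np.tile(np.arange(max_size), n_groups)
    # A row is real data when it falls in the last `size` rows of its block
    keep = pos >= np.repeat(max_size - sizes, max_size)
    return padded_fcsts[keep]


# Toy example: 3 series of lengths 2, 4 and 3, with 5 output columns
indptr = np.array([0, 2, 6, 9])
max_size = 4
padded = np.arange(3 * max_size * 5, dtype=float).reshape(3 * max_size, 5)
trimmed = trim_padded_forecasts(padded, indptr, max_size)
print(padded.shape, trimmed.shape)
```

In the case reported here, the padded array would have 284 * 8782 = 2494088 rows, and the trimmed one would have indptr[-1] = 1347745 rows, which matches the shape that the assignment in core.py expects.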

What actually prompted my investigation was non-overlapping forecast intervals, as highlighted here.
If anyone has additional tips, I'd appreciate it. The y_true of the uid in question has 418 samples, with a slight step midway and some spikes.

Versions / Dependencies

Python via pip
neuralforecast==1.6.4
statsforecast==1.7.8

Reproduction script

Sorry, I would have to write a unit test for you but I'm working. The exact data is sensitive too.
Feel free to chuck this out if I'm being stupid, but I just thought it seemed like a plausible bug.

Issue Severity

Low: It annoys or frustrates me.


marcopeix · 2024-12-18T18:45:29Z

Hello! Yes, insample predictions on multiple series with different lengths is a known issue. We're working on a fix!

Daniel-Wait added the bug label Dec 17, 2024
