Suspected bug in predict_insample where available_mask is used #1229

Open
Daniel-Wait opened this issue Dec 17, 2024 · 1 comment

@Daniel-Wait

What happened + What you expected to happen

I inherited an NHITS forecaster trained on 284 unique_id values and 1,347,745 data points.
Some unique ids have more datapoints than others and I believe the available_mask is being used for training to account for that.

When I was trying to debug issues, I noticed that predict_insample was not working.

model = NeuralForecast.load(path=nhits_chkpt_path)
model.predict_insample(step_size=1)
  File "<path>/python3.10/site-packages/neuralforecast/core.py", line 622, in predict_insample
    fcsts[:, col_idx : (col_idx + output_length)] = model_fcsts
ValueError: could not broadcast input array from shape (2494088,5) into shape (1347745,5)

I ran some code at a breakpoint after I figured out that 2494088 = 284 * 8782 = self.dataset.n_groups * self.dataset.max_size.
The plot produced by the code snippet below looks quite sensible. I suspect that the array returned by model.predict(trimmed_dataset, step_size=step_size) needs to be filtered down to the dataset size.
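To illustrate the kind of filtering I mean, here is a minimal NumPy sketch on toy data. It is not neuralforecast's code or its fix. The helper name `trim_left_padded` is made up for illustration, and it assumes each series occupies the last rows of its max_size block (the same assumption the plotting snippet makes when it takes fcsts_des[-size_train:]).

```python
import numpy as np


def trim_left_padded(padded, indptr, max_size):
    """Drop padding rows from a (n_groups * max_size, k) array.

    Hypothetical helper: assumes every series is padded on the left,
    so its real rows are the LAST `size` rows of its block.
    """
    sizes = np.diff(indptr)  # true length of each series
    n_groups = len(sizes)
    blocks = padded.reshape(n_groups, max_size, -1)
    # Slice with an explicit start so a series of length 0 yields no rows
    kept = [blocks[i, max_size - size:] for i, size in enumerate(sizes)]
    return np.concatenate(kept, axis=0)


# Toy data: 3 series of lengths 2, 4 and 3, with 5 output columns
indptr = np.array([0, 2, 6, 9])
max_size = 4
n_cols = 5
padded = np.arange(3 * max_size * n_cols, dtype=float).reshape(3 * max_size, n_cols)

trimmed = trim_left_padded(padded, indptr, max_size)
print(padded.shape, "->", trimmed.shape)
```

On the toy data the padded array has n_groups * max_size = 12 rows and the trimmed one has indptr[-1] = 9 rows, which mirrors the 2494088 vs 1347745 mismatch in the traceback.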

import matplotlib.pyplot as plt
import numpy as np

dataset = self.dataset

# The unique_id to inspect (its position is looked up in `self.uids` below)
target_unique_id = "XYZ"

# Find the index of the specific unique_id
target_index = self.uids.tolist().index(target_unique_id)

# Use indptr to get the data range for this unique_id
start_idx = dataset.indptr[target_index]
end_idx = dataset.indptr[target_index + 1]

# Extract the data for this unique_id
unique_id_data = {
    'data': dataset.temporal.data[start_idx:end_idx],
    'T': dataset.temporal.T[start_idx:end_idx],
    'H': dataset.temporal.H[start_idx:end_idx]
}

# Convert tensor to numpy
data_np = unique_id_data['data'].numpy()

# Extract the 'y' column
y_values = data_np[:, 0]

# Get the forecasts for the target UID
size_train = len(y_values)
idx_des = target_index  # same position in self.uids as computed above
fcsts_des = model_fcsts[idx_des * trimmed_dataset.max_size : (idx_des+1) * trimmed_dataset.max_size]

plt.figure(figsize=(12,8))
plt.suptitle("Predict insample - " + target_unique_id)
plt.fill_between(np.arange(size_train), fcsts_des[-size_train:,4], fcsts_des[-size_train:,1], color='gold', alpha=.3, label='levels-95')
plt.fill_between(np.arange(size_train), fcsts_des[-size_train:,3], fcsts_des[-size_train:,2], color='tab:orange', alpha=.3, label='levels-80')
plt.plot(fcsts_des[-size_train:,0], linewidth=.7, label='forecast', alpha=.7)
plt.plot(y_values, color='k', linestyle='dashed', label='BQ costs')
plt.legend()

What actually prompted my investigation was non-overlapping forecast intervals, as highlighted here.
If anyone has additional tips, I'd appreciate it. The y_true of the uid in question has 418 samples, with a slight step midway and some spikes.

image

Versions / Dependencies

Python via pip
neuralforecast==1.6.4
statsforecast==1.7.8

Reproduction script

Sorry, I would have to write a unit test for you, but I'm working. The exact data is sensitive too.
Feel free to chuck this out if I'm being stupid, but I just thought it seemed like a plausible bug.

Issue Severity

Low: It annoys or frustrates me.

@marcopeix
Contributor

Hello! Yes, insample predictions on multiple series with different lengths is a known issue. We're working on a fix!
