[BUG] A bug while fine-tuning the model by iteratively training and evaluating using a sliding time window #783

Open
hk63560892 opened this issue Jul 17, 2024 · 3 comments
Labels
bug Something isn't working status/needs-triage

Comments

@hk63560892

Bug description

I found that there are no labels in valid.parquet.

Steps/Code to reproduce bug

When I run this code:
import glob
import os

# `trainer`, `OUTPUT_DIR`, and `wipe_memory` are assumed to be defined in earlier notebook cells.
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # Train on day related to time_index
    print('*'*20)
    print("Launch training for day %s are:" %time_index)
    print('*'*20 + '\n')
    trainer.train_dataset_or_path = train_paths
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step +=1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key])))
    wipe_memory()

the following error appears:


********************
Launch training for day 1 are:
********************


/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
{'train_runtime': 4.0234, 'train_samples_per_second': 3817.691, 'train_steps_per_second': 14.913, 'train_loss': 10.525657145182292, 'epoch': 60.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:04<00:00, 14.92it/s]
TrainOutput(global_step=60, training_loss=10.525657145182292, metrics={'train_runtime': 4.0234, 'train_samples_per_second': 3817.691, 'train_steps_per_second': 14.913, 'total_flos': 0.0, 'train_loss': 10.525657145182292})
Traceback (most recent call last):
  File "", line 17, in
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2932, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/trainer.py", line 515, in evaluation_loop
    metrics_results_detailed = model.calculate_metrics(preds, labels)
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py", line 616, in calculate_metrics
    head.calculate_metrics(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py", line 453, in calculate_metrics
    task.calculate_metrics(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/prediction_task.py", line 489, in calculate_metrics
    result = metric(predictions, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 301, in forward
    self._forward_cache = self._forward_full_state_update(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 316, in _forward_full_state_update
    self.update(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 465, in wrapped_func
    update(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/ranking_metric.py", line 56, in update
    metric = self._metric(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/ranking_metric.py", line 137, in _metric
    if rel_indices.shape[0] > 0:
IndexError: tuple index out of range
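
A note on reading the last frame: `IndexError: tuple index out of range` on a `.shape[0]` access means the shape tuple was empty, i.e. `rel_indices` was zero-dimensional at that point. Why it ended up zero-dimensional is not established in this thread. The sketch below only reproduces the error message with a plain tuple and shows a guarded access; `first_dim` is a hypothetical helper, not part of Transformers4Rec.

```python
def first_dim(shape):
    """Return the size of the first dimension, or 0 for a zero-dimensional shape."""
    return shape[0] if len(shape) > 0 else 0


scalar_shape = ()      # shape of a zero-dimensional (scalar) value
vector_shape = (3,)    # shape of a one-dimensional value with three entries

# Indexing an empty shape tuple reproduces the message from the traceback.
try:
    scalar_shape[0]
except IndexError as exc:
    message = str(exc)

print(message)
print(first_dim(scalar_shape), first_dim(vector_shape))
```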

Expected behavior

I expected labels to be available for evaluation.
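
Before looking at labels, it may be worth ruling out a simpler cause: `glob.glob` returns an empty list instead of raising when a path does not match, and the loop in the reproduction evaluates on window `time_index + 1`, so a final index of 4 requires a `4/valid.parquet` file. The following is a minimal sketch that lists windows with missing files. The helper names `window_paths` and `missing_windows` are hypothetical, the directory layout is assumed from the reproduction code, and the check does not inspect file contents or labels.

```python
import glob
import os
import tempfile


def window_paths(output_dir, start, final):
    """Yield (train_index, eval_index, train_paths, eval_paths) for each window."""
    for time_index in range(start, final):
        train_paths = glob.glob(os.path.join(output_dir, f"{time_index}/train.parquet"))
        eval_paths = glob.glob(os.path.join(output_dir, f"{time_index + 1}/valid.parquet"))
        yield time_index, time_index + 1, train_paths, eval_paths


def missing_windows(output_dir, start, final):
    """Return the relative paths that the sliding-window loop would not find."""
    problems = []
    for train_idx, eval_idx, train_paths, eval_paths in window_paths(output_dir, start, final):
        if not train_paths:
            problems.append(f"{train_idx}/train.parquet")
        if not eval_paths:
            problems.append(f"{eval_idx}/valid.parquet")
    return problems


# Demo on a throwaway directory that only contains days 1 to 3 (empty placeholder files).
with tempfile.TemporaryDirectory() as demo_dir:
    for day in (1, 2, 3):
        os.makedirs(os.path.join(demo_dir, str(day)))
        for name in ("train.parquet", "valid.parquet"):
            open(os.path.join(demo_dir, str(day), name), "wb").close()
    demo_missing = missing_windows(demo_dir, 1, 4)

print(demo_missing)  # ['4/valid.parquet']
```

Calling `missing_windows(OUTPUT_DIR, 1, 4)` on the real output directory would show whether any window in the loop silently received an empty path list.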

Environment details

  • Transformers4Rec version: 23.12
  • Platform: Docker
  • Python version: 3.10
  • Huggingface Transformers version: 4.27.1
  • PyTorch version (GPU?): 2.1.0a0+4136153
  • Tensorflow version (GPU?):
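
The versions above can also be collected from inside the container, which avoids ambiguity about what is actually installed. This is a small sketch using the standard-library `importlib.metadata` module; `installed_versions` is a hypothetical helper, and the package list is an assumption based on the traceback in this report. It does not report the Docker image tag.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names assumed from the packages that appear in the traceback.
report = installed_versions(["transformers4rec", "transformers", "torch", "torchmetrics"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```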

Additional context

@hk63560892 hk63560892 added bug Something isn't working status/needs-triage labels Jul 17, 2024
@rnyak
Contributor

rnyak commented Jul 18, 2024

@hk63560892 could you please share a link to the example notebook you are running, and tell us which Docker image you are using?

@hk63560892
Author

link:
https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/tutorial/03-Session-based-recsys.ipynb
docker:
docker run -it --gpus device=0 -p 8000:8000 -p 8001:8001 -p 8002:8002 -p 8888:8888 -v <path_to_data>:/workspace/data/ nvcr.io/nvidia/merlin/merlin-pytorch:23.XX

Thank you!!

@rnyak
Contributor

rnyak commented Jul 22, 2024

@hk63560892 which Docker image tag are you using? Which 23.XX exactly? We have several tags that start with 23, so please be specific.

Also note that the tutorials have not been maintained for a while, so you may want to refer to the other example notebooks in the examples directory instead.
