[🐛BUG] RuntimeError: Expected all tensors to be on the same device with Random, ADMMSLIM, and SLIMElastic #1984

Closed
lukas-wegmeth opened this issue Jan 26, 2024 · 2 comments
Labels: bug (Something isn't working)

@lukas-wegmeth

Describe the bug
Training the Random, ADMMSLIM, and SLIMElastic models on MovieLens-100K crashes with the exception "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"

CUDA available: True
command line args [--data_set_name MovieLens-100K --model_name Random] will not be used in RecBole
24 Jan 15:52    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 42
state = INFO
reproducibility = True
data_path = ./data_sets/MovieLens-100K
checkpoint_dir = ./data_sets/MovieLens-100K/recbole_checkpoints/
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 50
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 5
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'uni100', 'test': 'uni100'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'MAP', 'Precision', 'GAUC', 'ItemCoverage', 'AveragePopularity', 'GiniIndex', 'ShannonEntropy', 'TailPercentage']
topk = [1, 3, 5, 10, 20]
valid_metric = NDCG@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator = 	
seq_separator =  
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = {}
LABEL_FIELD = label
threshold = None
NEG_PREFIX = neg_
load_col = {'inter': ['user_id', 'item_id', 'rating']}
unload_col = {}
unused_col = {}
additional_feat_suffix = []
rm_dup_inter = None
val_interval = {}
filter_inter_by_user_or_item = True
user_inter_num_interval = [0, inf)
item_inter_num_interval = [0, inf)
alias_of_user_id = None
alias_of_item_id = None
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = {}
normalize_field = []
normalize_all = False
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id
benchmark_filename = None

Other Hyper Parameters: 
worker = 0
wandb_project = recbole
shuffle = True
require_pow = False
enable_amp = False
enable_scaler = False
transform = None
numerical_features = []
discretization = None
kg_reverse_r = False
entity_kg_num_interval = [0, inf)
relation_kg_num_interval = [0, inf)
MODEL_TYPE = ModelType.GENERAL
encoding = utf-8
training_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'dynamic': False, 'candidate_num': 0}
MODEL_INPUT_TYPE = InputType.POINTWISE
eval_type = EvaluatorType.RANKING
single_spec = True
local_rank = 0
device = cuda
valid_neg_sample_args = {'distribution': 'uniform', 'sample_num': 100}
test_neg_sample_args = {'distribution': 'uniform', 'sample_num': 100}


24 Jan 15:52    INFO  MovieLens-100K
The number of users: 944
Average actions of users: 106.04453870625663
The number of items: 1683
Average actions of items: 59.45303210463734
The number of inters: 100000
The sparsity of the dataset: 93.70575143257098%
Remain Fields: ['user_id', 'item_id', 'rating']
24 Jan 15:52    INFO  [Training]: train_batch_size = [2048] train_neg_sample_args: [{'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}]
24 Jan 15:52    INFO  [Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'LS': 'valid_and_test'}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'uni100', 'test': 'uni100'}}]
24 Jan 15:52    INFO  Random()
Trainable parameters: 1
24 Jan 15:52    INFO  epoch 0 training [time: 0.22s, train loss: 0.0000]
24 Jan 15:52    INFO  epoch 1 training [time: 0.19s, train loss: 0.0000]
24 Jan 15:52    INFO  epoch 2 training [time: 0.19s, train loss: 0.0000]
24 Jan 15:52    INFO  epoch 3 training [time: 0.19s, train loss: 0.0000]
24 Jan 15:52    INFO  epoch 4 training [time: 0.19s, train loss: 0.0000]
Traceback (most recent call last):
  File "/mnt/./run_recbole_test.py", line 158, in <module>
    best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 464, in fit
    valid_score, valid_result = self._valid_epoch(
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 283, in _valid_epoch
    valid_result = self.evaluate(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 616, in evaluate
    interaction, scores, positive_u, positive_i = eval_func(batched_data)
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 558, in _neg_sample_batch_eval
    scores[row_idx, col_idx] = origin_scores
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
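
For context, the failing line is an advanced-indexed write, and PyTorch refuses to write a CPU tensor into a CUDA tensor that way. A minimal standalone sketch of the failure mode (assuming a CUDA device is available; the variable names mirror the traceback, not RecBole's actual shapes):

import torch

# The evaluator allocates the score matrix on the GPU.
scores = torch.full((2, 3), -float("inf"), device="cuda:0")
row_idx = torch.tensor([0, 0, 1], device="cuda:0")
col_idx = torch.tensor([0, 1, 2], device="cuda:0")
# Models such as SLIMElastic presumably return their scores on the CPU.
origin_scores = torch.tensor([0.5, 0.2, 0.9])

# Raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:0 and cpu!
scores[row_idx, col_idx] = origin_scores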

To Reproduce
Steps to reproduce the behavior:

import argparse
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.utils import ModelType, get_model, get_trainer, init_seed, init_logger
import torch

parser = argparse.ArgumentParser("Evaluate RecBole")
parser.add_argument('--data_set_name', dest='data_set_name', type=str, required=True)
parser.add_argument('--model_name', dest='model_name', type=str, required=True)
args = parser.parse_args()
print(f"CUDA available: {torch.cuda.is_available()}")
config_dict = {
    # environment settings
    "gpu_id": 0,  # default: 0
    "worker": 0,  # default: 0
    "seed": 42,  # default: "2020"
    "state": "INFO",  # default: "INFO"
    "encoding": "utf-8",  # default: "utf-8"
    "reproducibility": True,  # default: True
    "data_path": "./data_sets/",  # default: "dataset/"
    "checkpoint_dir": f"./data_sets/{args.data_set_name}/recbole_checkpoints/",  # default: "saved/"
    "show_progress": True,  # default: True
    "save_dataset": False,  # default: False
    "dataset_save_path": None,  # default: None
    "save_dataloaders": False,  # default: False
    "dataloaders_save_path": None,  # default: None
    "log_wandb": False,  # default: False
    "wandb_project": "recbole",  # default: "recbole"
    "shuffle": True,  # default: True
    # data settings
    # atomic file format
    "field_separator": "\t",  # default: "\t"
    "seq_separator": " ",  # default: " "
    # basic information
    # common features
    "USER_ID_FIELD": "user_id",  # default: "user_id"
    "ITEM_ID_FIELD": "item_id",  # default: "item_id"
    "RATING_FIELD": "rating",  # default: "rating"
    "TIME_FIELD": "timestamp",  # default: "timestamp"
    "seq_len": {},  # default: {}
    # label for point-wise dataloader
    "LABEL_FIELD": "label",  # default: "label"
    "threshold": None,  # default: None
    # negative sampling prefix for pair-wise dataloader
    "NEG_PREFIX": "neg_",  # default: "neg_"
    # sequential model needed
    "ITEM_LIST_LENGTH_FIELD": "item_length",  # default: "item_length"
    "LIST_SUFFIX": "_list",  # default: "_list"
    "MAX_ITEM_LIST_LENGTH": 50,  # default: 50
    "POSITION_FIELD": "position_id",  # default: "position_id"
    # knowledge-based model needed
    "HEAD_ENTITY_ID_FIELD": "head_id",  # default: "head_id"
    "TAIL_ENTITY_ID_FIELD": "tail_id",  # default: "tail_id"
    "RELATION_ID_FIELD": "relation_id",  # default: "relation_id"
    "kg_reverse_r": False,  # default: False
    "entity_kg_num_interval": "[0, inf)",  # default: "[0, inf)"
    "relation_kg_num_interval": "[0, inf)",  # default: "[0, inf)"
    # selectively loading
    "load_col": {"inter": ["user_id", "item_id", "rating"]},  # default: {inter: [user_id, item_id]}
    "unload_col": {},  # default: {}
    "unused_col": {},  # default: {}
    "additional_feat_suffix": [],  # default: []
    "numerical_features": [],  # default: []
    # filtering
    # remove duplicated user-item interactions
    "rm_dup_inter": None,  # default: None
    # filter by value
    "val_interval": {},  # default: {}
    # remove interaction by user or item
    "filter_inter_by_user_or_item": True,  # default: True
    # filter by number of interactions
    "user_inter_num_interval": "[0, inf)",  # default: "[0, inf)"
    "item_inter_num_interval": "[0, inf)",  # default: "[0, inf)"
    # preprocessing
    "alias_of_user_id": None,  # default: None
    "alias_of_item_id": None,  # default: None
    "alias_of_entity_id": None,  # default: None
    "alias_of_relation_id": None,  # default: None
    "preload_weight": {},  # default: {}
    "normalize_field": [],  # default: []
    "normalize_all": False,  # default: False
    "discretization": None,  # default: None
    # benchmark file
    "benchmark_filename": None,  # default: None
    # training settings
    "epochs": 50,  # default: 300
    "train_batch_size": 2048,  # default: 2048
    "learner": "adam",  # default: "adam"
    "learning_rate": 0.001,  # default: 0.001
    "training_neg_sample_args":
        {
            "distribution": "uniform",  # default: "uniform"
            "sample_num": 1,  # default: 1
            "dynamic": False,  # default: False
            "candidate_num": 0,  # default: 0
        },
    "eval_step": 5,  # default: 1
    "stopping_step": 10,  # default: 10
    "clip_grad_norm": None,  # default: None
    "loss_decimal_place": 4,  # default: 4
    "weight_decay": 0.0,  # default: 0.0
    "require_pow": False,  # default: False
    "enable_amp": False,  # default: False
    "enable_scaler": False,  # default: False
    # evaluation settings
    "eval_args":
        {
            "group_by": "user",  # default: "user"
            "order": "RO",  # default: "RO"
            "split":
                {
                    # "RS": [8, 1, 1] # default: {"RS": [8, 1, 1]}
                    "LS": "valid_and_test"
                },
            "mode":
                {
                    "valid": "uni100",  # default: "full"
                    "test": "uni100",  # default: "full"
                },
        },
    "repeatable": False,  # default: False
    "metrics": ["Recall", "MRR", "NDCG", "Hit", "MAP", "Precision", "GAUC", "ItemCoverage", "AveragePopularity",
                "GiniIndex", "ShannonEntropy", "TailPercentage"],
    # default: ["Recall", "MRR", "NDCG", "Hit", "Precision"]
    "topk": [1, 3, 5, 10, 20],  # default: 10
    "valid_metric": "NDCG@10",  # default: "MRR@10"
    "eval_batch_size": 4096,  # default: 4096
    "metric_decimal_place": 4,  # default: 4,
    # misc settings
    "model": args.model_name,
    "MODEL_TYPE": ModelType.GENERAL
}
config = Config(model=args.model_name, dataset=args.data_set_name, config_dict=config_dict)
init_seed(config['seed'], config['reproducibility'])
init_logger(config)
logger = getLogger()
logger.info(config)
dataset = create_dataset(config)
logger.info(dataset)
train_data, valid_data, test_data = data_preparation(config, dataset)
model = get_model(config["model"])(config, train_data.dataset).to(config['device'])
logger.info(model)
trainer = get_trainer(config["MODEL_TYPE"], config["model"])(config, model)
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)
test_result = trainer.evaluate(test_data)
print(test_result)
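
For reference, the log above was produced by invoking the script as:

python run_recbole_test.py --data_set_name MovieLens-100K --model_name Random

and the same crash occurs with --model_name ADMMSLIM or SLIMElastic.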

Expected behavior
The Random, ADMMSLIM, and SLIMElastic models should train and evaluate on the MovieLens-100K data set without crashing.

Desktop (please complete the following information):

  • OS: Linux
  • RecBole Version: 1.2.0
  • Python Version: 3.10
  • PyTorch Version: 2.1.1
  • cudatoolkit Version: 12.1

I believe this happens during validation and that the same bug was fixed for different models in #1873.
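
If that is right, the usual remedy (and my guess at the shape of the fix, not a verified diff of #1997) is to move the model output onto the score tensor's device before the indexed assignment in recbole/trainer/trainer.py:

scores[row_idx, col_idx] = origin_scores.to(scores.device)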

lukas-wegmeth added the bug label on Jan 26, 2024
@BishopLiu (Collaborator)

@lukas-wegmeth Thank you for your attention to RecBole! We have fixed the bugs mentioned above in #1997.

@lukas-wegmeth (Author)

@BishopLiu Thank you! #1997 fixed this issue for me.
