[🐛BUG] RuntimeError: Expected all tensors to be on the same device with Random, ADMMSLIM, and SLIMElastic #1984

Closed
lukas-wegmeth opened this issue Jan 26, 2024 · 2 comments
Labels: bug (Something isn't working)

@lukas-wegmeth

Describe the bug
Training the Random, ADMMSLIM, and SLIMElastic models on MovieLens-100K crashes with the exception "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"

CUDA available: True
command line args [--data_set_name MovieLens-100K --model_name Random] will not be used in RecBole
24 Jan 15:52    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 42
state = INFO
reproducibility = True
data_path = ./data_sets/MovieLens-100K
checkpoint_dir = ./data_sets/MovieLens-100K/recbole_checkpoints/
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 50
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 5
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'uni100', 'test': 'uni100'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'MAP', 'Precision', 'GAUC', 'ItemCoverage', 'AveragePopularity', 'GiniIndex', 'ShannonEntropy', 'TailPercentage']
topk = [1, 3, 5, 10, 20]
valid_metric = NDCG@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator = 	
seq_separator =  
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = {}
LABEL_FIELD = label
threshold = None
NEG_PREFIX = neg_
load_col = {'inter': ['user_id', 'item_id', 'rating']}
unload_col = {}
unused_col = {}
additional_feat_suffix = []
rm_dup_inter = None
val_interval = {}
filter_inter_by_user_or_item = True
user_inter_num_interval = [0, inf)
item_inter_num_interval = [0, inf)
alias_of_user_id = None
alias_of_item_id = None
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = {}
normalize_field = []
normalize_all = False
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id
benchmark_filename = None

Other Hyper Parameters: 
worker = 0
wandb_project = recbole
shuffle = True
require_pow = False
enable_amp = False
enable_scaler = False
transform = None
numerical_features = []
discretization = None
kg_reverse_r = False
entity_kg_num_interval = [0, inf)
relation_kg_num_interval = [0, inf)
MODEL_TYPE = ModelType.GENERAL
encoding = utf-8
training_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'dynamic': False, 'candidate_num': 0}
MODEL_INPUT_TYPE = InputType.POINTWISE
eval_type = EvaluatorType.RANKING
single_spec = True
local_rank = 0
device = cuda
valid_neg_sample_args = {'distribution': 'uniform', 'sample_num': 100}
test_neg_sample_args = {'distribution': 'uniform', 'sample_num': 100}


24 Jan 15:52    INFO  MovieLens-100K
The number of users: 944
Average actions of users: 106.04453870625663
The number of items: 1683
Average actions of items: 59.45303210463734
The number of inters: 100000
The sparsity of the dataset: 93.70575143257098%
Remain Fields: ['user_id', 'item_id', 'rating']
24 Jan 15:52    INFO  [Training]: train_batch_size = [2048] train_neg_sample_args: [{'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}]
24 Jan 15:52    INFO  [Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'LS': 'valid_and_test'}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'uni100', 'test': 'uni100'}}]
24 Jan 15:52    INFO  Random()
Trainable parameters: 1
24 Jan 15:52    INFO  epoch 0 training [time: 0.22s, train loss: 0.0000]
24 Jan 15:52    INFO  epoch 1 training [time: 0.19s, train loss: 0.0000]
24 Jan 15:52    INFO  epoch 2 training [time: 0.19s, train loss: 0.0000]
24 Jan 15:52    INFO  epoch 3 training [time: 0.19s, train loss: 0.0000]
24 Jan 15:52    INFO  epoch 4 training [time: 0.19s, train loss: 0.0000]
Traceback (most recent call last):
  File "/mnt/./run_recbole_test.py", line 158, in <module>
    best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 464, in fit
    valid_score, valid_result = self._valid_epoch(
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 283, in _valid_epoch
    valid_result = self.evaluate(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 616, in evaluate
    interaction, scores, positive_u, positive_i = eval_func(batched_data)
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 558, in _neg_sample_batch_eval
    scores[row_idx, col_idx] = origin_scores
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
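
For context, the failing line is an advanced-indexed write, and PyTorch refuses to write a CPU tensor into a CUDA tensor that way. A minimal standalone sketch of the failure mode (assuming a CUDA device is available; the variable names mirror the traceback, not RecBole's actual shapes):

import torch

# The evaluator allocates the score matrix on the GPU.
scores = torch.full((2, 3), -float("inf"), device="cuda:0")
row_idx = torch.tensor([0, 0, 1], device="cuda:0")
col_idx = torch.tensor([0, 1, 2], device="cuda:0")
# Models such as SLIMElastic presumably return their scores on the CPU.
origin_scores = torch.tensor([0.5, 0.2, 0.9])

# Raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:0 and cpu!
scores[row_idx, col_idx] = origin_scores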

To Reproduce
Steps to reproduce the behavior:

import argparse
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.utils import ModelType, get_model, get_trainer, init_seed, init_logger
import torch

parser = argparse.ArgumentParser("Evaluate RecBole")
parser.add_argument('--data_set_name', dest='data_set_name', type=str, required=True)
parser.add_argument('--model_name', dest='model_name', type=str, required=True)
args = parser.parse_args()
print(f"CUDA available: {torch.cuda.is_available()}")
config_dict = {
    # environment settings
    "gpu_id": 0,  # default: 0
    "worker": 0,  # default: 0
    "seed": 42,  # default: "2020"
    "state": "INFO",  # default: "INFO"
    "encoding": "utf-8",  # default: "utf-8"
    "reproducibility": True,  # default: True
    "data_path": "./data_sets/",  # default: "dataset/"
    "checkpoint_dir": f"./data_sets/{args.data_set_name}/recbole_checkpoints/",  # default: "saved/"
    "show_progress": True,  # default: True
    "save_dataset": False,  # default: False
    "dataset_save_path": None,  # default: None
    "save_dataloaders": False,  # default: False
    "dataloaders_save_path": None,  # default: None
    "log_wandb": False,  # default: False
    "wandb_project": "recbole",  # default: "recbole"
    "shuffle": True,  # default: True
    # data settings
    # atomic file format
    "field_separator": "\t",  # default: "\t"
    "seq_separator": " ",  # default: " "
    # basic information
    # common features
    "USER_ID_FIELD": "user_id",  # default: "user_id"
    "ITEM_ID_FIELD": "item_id",  # default: "item_id"
    "RATING_FIELD": "rating",  # default: "rating"
    "TIME_FIELD": "timestamp",  # default: "timestamp"
    "seq_len": {},  # default: {}
    # label for point-wise dataloader
    "LABEL_FIELD": "label",  # default: "label"
    "threshold": None,  # default: None
    # negative sampling prefix for pair-wise dataloader
    "NEG_PREFIX": "neg_",  # default: "neg_"
    # sequential model needed
    "ITEM_LIST_LENGTH_FIELD": "item_length",  # default: "item_length"
    "LIST_SUFFIX": "_list",  # default: "_list"
    "MAX_ITEM_LIST_LENGTH": 50,  # default: 50
    "POSITION_FIELD": "position_id",  # default: "position_id"
    # knowledge-based model needed
    "HEAD_ENTITY_ID_FIELD": "head_id",  # default: "head_id"
    "TAIL_ENTITY_ID_FIELD": "tail_id",  # default: "tail_id"
    "RELATION_ID_FIELD": "relation_id",  # default: "relation_id"
    "kg_reverse_r": False,  # default: False
    "entity_kg_num_interval": "[0, inf)",  # default: "[0, inf)"
    "relation_kg_num_interval": "[0, inf)",  # default: "[0, inf)"
    # selectively loading
    "load_col": {"inter": ["user_id", "item_id", "rating"]},  # default: {inter: [user_id, item_id]}
    "unload_col": {},  # default: {}
    "unused_col": {},  # default: {}
    "additional_feat_suffix": [],  # default: []
    "numerical_features": [],  # default: []
    # filtering
    # remove duplicated user-item interactions
    "rm_dup_inter": None,  # default: None
    # filter by value
    "val_interval": {},  # default: {}
    # remove interaction by user or item
    "filter_inter_by_user_or_item": True,  # default: True
    # filter by number of interactions
    "user_inter_num_interval": "[0, inf)",  # default: "[0, inf)"
    "item_inter_num_interval": "[0, inf)",  # default: "[0, inf)"
    # preprocessing
    "alias_of_user_id": None,  # default: None
    "alias_of_item_id": None,  # default: None
    "alias_of_entity_id": None,  # default: None
    "alias_of_relation_id": None,  # default: None
    "preload_weight": {},  # default: {}
    "normalize_field": [],  # default: []
    "normalize_all": False,  # default: False
    "discretization": None,  # default: None
    # benchmark file
    "benchmark_filename": None,  # default: None
    # training settings
    "epochs": 50,  # default: 300
    "train_batch_size": 2048,  # default: 2048
    "learner": "adam",  # default: "adam"
    "learning_rate": 0.001,  # default: 0.001
    "training_neg_sample_args":
        {
            "distribution": "uniform",  # default: "uniform"
            "sample_num": 1,  # default: 1
            "dynamic": False,  # default: False
            "candidate_num": 0,  # default: 0
        },
    "eval_step": 5,  # default: 1
    "stopping_step": 10,  # default: 10
    "clip_grad_norm": None,  # default: None
    "loss_decimal_place": 4,  # default: 4
    "weight_decay": 0.0,  # default: 0.0
    "require_pow": False,  # default: False
    "enable_amp": False,  # default: False
    "enable_scaler": False,  # default: False
    # evaluation settings
    "eval_args":
        {
            "group_by": "user",  # default: "user"
            "order": "RO",  # default: "RO"
            "split":
                {
                    # "RS": [8, 1, 1] # default: {"RS": [8, 1, 1]}
                    "LS": "valid_and_test"
                },
            "mode":
                {
                    "valid": "uni100",  # default: "full"
                    "test": "uni100",  # default: "full"
                },
        },
    "repeatable": False,  # default: False
    "metrics": ["Recall", "MRR", "NDCG", "Hit", "MAP", "Precision", "GAUC", "ItemCoverage", "AveragePopularity",
                "GiniIndex", "ShannonEntropy", "TailPercentage"],
    # default: ["Recall", "MRR", "NDCG", "Hit", "Precision"]
    "topk": [1, 3, 5, 10, 20],  # default: 10
    "valid_metric": "NDCG@10",  # default: "MRR@10"
    "eval_batch_size": 4096,  # default: 4096
    "metric_decimal_place": 4,  # default: 4,
    # misc settings
    "model": args.model_name,
    "MODEL_TYPE": ModelType.GENERAL
}
config = Config(model=args.model_name, dataset=args.data_set_name, config_dict=config_dict)
init_seed(config['seed'], config['reproducibility'])
init_logger(config)
logger = getLogger()
logger.info(config)
dataset = create_dataset(config)
logger.info(dataset)
train_data, valid_data, test_data = data_preparation(config, dataset)
model = get_model(config["model"])(config, train_data.dataset).to(config['device'])
logger.info(model)
trainer = get_trainer(config["MODEL_TYPE"], config["model"])(config, model)
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)
test_result = trainer.evaluate(test_data)
print(test_result)
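
For reference, the log above was produced by invoking the script as:

python run_recbole_test.py --data_set_name MovieLens-100K --model_name Random

and the same crash occurs with --model_name ADMMSLIM or SLIMElastic.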

Expected behavior
The Random, ADMMSLIM, and SLIMElastic models should train and evaluate on the MovieLens-100K data set without crashing.

Desktop (please complete the following information):

  • OS: Linux
  • RecBole Version: 1.2.0
  • Python Version: 3.10
  • PyTorch Version: 2.1.1
  • cudatoolkit Version: 12.1

I believe this happens during validation and that the same bug was fixed for different models in #1873.
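
If that is right, the usual remedy (and my guess at the shape of the fix, not a verified diff of #1997) is to move the model output onto the score tensor's device before the indexed assignment in recbole/trainer/trainer.py:

scores[row_idx, col_idx] = origin_scores.to(scores.device)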

lukas-wegmeth added the bug label on Jan 26, 2024
@BishopLiu (Collaborator)

@lukas-wegmeth Thank you for your attention to RecBole! We have fixed the bugs mentioned above in #1997.

@lukas-wegmeth (Author)

@BishopLiu Thank you! #1997 fixed this issue for me.
