NotImplementedError: Data has duplicate values #18

Open
99sbr opened this issue Mar 25, 2022 · 2 comments

99sbr commented Mar 25, 2022

data_model = ItemColdStartData(
    training_data,
    *training_data.columns,  # userid, itemid
    item_features=content_feature_df,
    seed=seed,
)

print(data_model)

This is where I get the error: NotImplementedError: Data has duplicate values

My DataFrame has multiple entries per user, and I can't drop them. Any help would be appreciated.

[Attached screenshot: 2022-03-25 at 14:29:38]

Owner

evfro commented Mar 26, 2022

Hi!

The problem is not that your data contains multiple entries for a user, but that your data contains multiple entries of the same user-item pair. It's like having multiple ratings for the same movie from the same user. This is not a standard collaborative filtering scenario.
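To make the distinction concrete, here is a minimal pandas-only sketch on made-up data. It assumes the columns are named userid and itemid, as in the comment in the snippet above. It does not reproduce the library's internal check, so treat it as a way to inspect your own data rather than as the exact condition that raises the error.

```python
import pandas as pd

# Hypothetical toy interactions; column names are assumed, not taken from real data.
training_data = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2],
    "itemid": [10, 20, 10, 10, 30],
})

# Several rows for the same user are expected and harmless.
rows_per_user = training_data.groupby("userid").size()

# The same (userid, itemid) pair on more than one row is a duplicate.
# keep=False flags every occurrence, not only the repeats.
dup_mask = training_data.duplicated(subset=["userid", "itemid"], keep=False)
duplicate_pairs = training_data[dup_mask]

print(rows_per_user)
print(duplicate_pairs)
```

If duplicate_pairs comes back empty, the data has no repeated user-item pairs in those two columns.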

You need to deduplicate such entries, e.g., like this:

dedup_data = data.drop_duplicates(subset=['userid', 'movieid'])
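A slightly fuller sketch of the same idea, again on made-up data and with itemid standing in for whatever your item column is called. It also shows one possible way to keep the information that a pair occurred several times, by collapsing repeats into a count. The n_interactions column name is hypothetical, and whether the library can make use of such a column is not confirmed here.

```python
import pandas as pd

# Hypothetical toy interactions; replace the column names with your own.
data = pd.DataFrame({
    "userid": [1, 1, 1, 2],
    "itemid": [10, 10, 20, 10],
})

# Keep the first row of every (userid, itemid) pair.
# User 1 still has two rows afterwards, one per distinct item.
dedup_data = data.drop_duplicates(subset=["userid", "itemid"])

# Alternative: one row per pair, with the number of original rows kept as a count.
pair_counts = (
    data.groupby(["userid", "itemid"], as_index=False)
        .size()
        .rename(columns={"size": "n_interactions"})
)

print(dedup_data)
print(pair_counts)
```

Either result has exactly one row per user-item pair, so users with many interactions are kept; only the repeated pairs are collapsed.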

Author

99sbr commented Mar 26, 2022

Understood, thanks for the help.

I'm facing one more blocker: data_model.prepare() takes a very long time and appears to freeze when I run that step. Any idea why? I know my dataset is big, but are there any optimisations I can apply?
[Attached screenshot: 2022-03-26 at 14:52:36]
