Dealing with bipartite/monopartite datasets can be improved #7

Open
pedroilidio opened this issue Apr 15, 2023 · 0 comments
Labels
discussion There are points open to discussion maintenance Facilitates long term maintenance of the project refactor Enhances code design without breaking performance

Comments

pedroilidio (Owner) commented Apr 15, 2023

Depending on the scenario, we work either with data in the bipartite format or with bipartite datasets converted to the monopartite format. Bipartite means X = [X[0], X[1]]; monopartite means X is a single array formed by concatenating every possible pair of X[0] and X[1] rows.
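To make the two formats concrete, here is a minimal sketch of the conversion using NumPy. The helper name `bipartite_to_monopartite` is hypothetical, and the row-major pairing order (every X[1] row is visited before moving to the next X[0] row) is an assumption, not necessarily what the project does.

```python
import numpy as np


def bipartite_to_monopartite(X):
    """Hypothetical helper: pair every row of X[0] with every row of X[1]."""
    X_rows, X_cols = np.asarray(X[0]), np.asarray(X[1])
    n_rows, n_cols = X_rows.shape[0], X_cols.shape[0]
    # Repeat each X[0] row once per X[1] row, then cycle through X[1] rows.
    left = np.repeat(X_rows, n_cols, axis=0)
    right = np.tile(X_cols, (n_rows, 1))
    # Each output row is one (X[0] row, X[1] row) concatenation.
    return np.hstack([left, right])


# Bipartite format: two feature matrices with independent shapes.
X_bipartite = [np.array([[1, 2], [3, 4]]), np.array([[10], [20], [30]])]
# Monopartite format: 2 * 3 = 6 rows, 2 + 1 = 3 columns.
X_mono = bipartite_to_monopartite(X_bipartite)
```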

As mentioned in #5 (comment), distinguishing between these two formats deserves a more careful solution than the current one:

def _X_is_multipartite(X):
    # TODO: find a better way of deciding.
    return isinstance(X, (tuple, list))
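One concrete weakness of the check above: it also returns True for a plain nested list such as [[1, 2], [3, 4]], which most people would read as a single 2D array. A slightly stricter variant could look at the dimensionality of the elements as well. This is only a sketch; `infer_partiteness` is a hypothetical name, and it assumes that a bipartite X is always a pair of 2D arrays.

```python
import numpy as np


def infer_partiteness(X):
    """Hypothetical stricter check: return 'bipartite' or 'monopartite'."""
    if isinstance(X, (tuple, list)):
        # Assumption: bipartite data is a pair of 2D feature matrices.
        # A nested list of 1D rows fails this test and falls through.
        if len(X) == 2 and all(np.ndim(Xi) == 2 for Xi in X):
            return "bipartite"
    if np.ndim(X) == 2:
        return "monopartite"
    raise ValueError("X must be a 2D array or a pair of 2D arrays.")


pair_result = infer_partiteness([np.ones((2, 2)), np.ones((3, 1))])
nested_list_result = infer_partiteness([[1, 2], [3, 4]])
array_result = infer_partiteness(np.ones((6, 3)))
```

A pair of 2D arrays given as nested Python lists would still be ambiguous, which is part of why a tag or a dedicated container may be the cleaner fix.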

This matters even more because some estimators accept both types of input in predict() (tree-based models in general), while others accept only the bipartite format (the matrix factorization ones, for instance). All of them, however, should yield flattened predictions for better integration with scikit-learn scoring utilities, which I reckon can be quite confusing.
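As a rough illustration of what "flattened predictions" implies, the sketch below normalizes an estimator's output to 1D. The helper name `flatten_predictions` is hypothetical, and it assumes that a 2D output is an interaction matrix with one row per X[0] sample and one column per X[1] sample, flattened in row-major order.

```python
import numpy as np


def flatten_predictions(y_pred):
    """Hypothetical helper: return 1D predictions whatever the input layout."""
    y_pred = np.asarray(y_pred)
    if y_pred.ndim == 2:
        # Row-major flattening: entry (i, j) lands at index i * n_cols + j.
        return y_pred.reshape(-1)
    if y_pred.ndim == 1:
        # Already flat, e.g. predictions made from monopartite input.
        return y_pred
    raise ValueError("Expected 1D or 2D predictions.")


# Interaction matrix for 2 row samples and 3 column samples.
y_pred_2d = np.array([[0.1, 0.9, 0.2], [0.8, 0.3, 0.7]])
y_flat = flatten_predictions(y_pred_2d)
```

Whatever order is chosen, it has to match the order in which the monopartite rows are generated, otherwise the scores are computed on misaligned pairs.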

  1. I suppose an estimator tag would be an appropriate way of signaling that.
  2. Maybe a whole Dataset class would facilitate maintenance in the long term.
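To give the two ideas above something tangible to discuss, here is a minimal sketch of how they could fit together. Everything in it is hypothetical: the `BipartiteDataset` container, the `accepted_formats` class attribute standing in for an estimator tag, and the `X_for` helper. It deliberately uses a plain attribute rather than scikit-learn's own tag mechanism, and it assumes estimators without the attribute take monopartite input.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(eq=False)
class BipartiteDataset:
    """Hypothetical container; not part of the project's API."""

    X_rows: np.ndarray  # features for X[0], shape (n_rows, n_row_features)
    X_cols: np.ndarray  # features for X[1], shape (n_cols, n_col_features)
    y: np.ndarray  # interaction matrix, shape (n_rows, n_cols)

    def as_bipartite(self):
        return [self.X_rows, self.X_cols]

    def as_monopartite(self):
        n_rows, n_cols = self.y.shape
        # Pair every X[0] row with every X[1] row, in row-major order.
        left = np.repeat(self.X_rows, n_cols, axis=0)
        right = np.tile(self.X_cols, (n_rows, 1))
        return np.hstack([left, right])

    def flat_y(self):
        # Same row-major order as as_monopartite().
        return self.y.reshape(-1)


class TreeLikeEstimator:
    # Hypothetical tag: input formats accepted by predict().
    accepted_formats = ("bipartite", "monopartite")


class FactorizationLikeEstimator:
    accepted_formats = ("bipartite",)


def X_for(estimator, dataset):
    """Pick an input format the estimator declares support for."""
    # Assumption: estimators without the tag expect monopartite input.
    formats = getattr(estimator, "accepted_formats", ("monopartite",))
    if "bipartite" in formats:
        return dataset.as_bipartite()
    return dataset.as_monopartite()


dataset = BipartiteDataset(
    X_rows=np.ones((2, 2)),
    X_cols=np.zeros((3, 1)),
    y=np.arange(6).reshape(2, 3),
)
X_factorization = X_for(FactorizationLikeEstimator(), dataset)
X_untagged = X_for(object(), dataset)
```

With something like this, the format decision moves out of `isinstance` checks on X and into explicit declarations, at the cost of one more object for users to construct.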