first implementation of wide&deep model #301
Conversation
Force-pushed from 5ab4b69 to 5610f92.
Blocked by #308.
Can we merge this and flag the optimizer issue?
CI Results: GitHub pull request #301 at commit c7fbae56a07caa1eb6de1a8533d8f16e024e4778 has merge conflicts. Jenkins build `merlin_models/235` checked out the commit ("Merge branch 'main' into tf/wide_and_deep"), found the testbook dependencies already satisfied, and started the test session: platform linux, Python 3.8.10, pytest-7.1.2; 360 items collected / 2 skipped.
This PR uses the feature-column API for feature-crosses, which is not something we want to merge now. From talking to @benfred, it seems we actually support feature-crosses in NVT, although it sounds like the functionality is a bit hidden. I think we need to create an example of how to do it, and then incorporate it in the wide&deep model. It might also make sense to create a tag for a feature-cross that NVT could output.
From reviewing the Wide and Deep paper, they perform feature crosses as part of the model architecture, so I don't think we want to move the responsibility for feature crossing to NVT. It seems that what would be required to move this PR forward is an implementation of feature crossing in Merlin Models that doesn't rely on the feature-column API. What would be involved in creating one? EDIT: To make my reasoning more explicit here, forcing people to do feature processing for a particular model violates the promise that we're making with dataset schemas, which is that you can quickly try out a bunch of models to compare them on the same data and we do the work to make that happen behind the scenes.
I agree with you that the ideal case is to make feature-crosses part of Merlin Models, but this will require extra work. Since we already seem to support it in NVT, I was thinking that would be the minimum scope for having a wide&deep model in Merlin Models. We could create a separate ticket to add feature-crosses to Merlin Models; I'm not sure right now what the priority of that ticket should be. Curious to hear your thoughts (cc @EvenOldridge @benfred).
I'm working on closing out unfinished scope from 22.04, which includes Wide & Deep, so it's not extra work, it's just the work we previously committed to. If we need to put more people on doing that work in order to finish it, maybe it's something @jperez999 and I can help with?
I don’t think we ever made a proper definition of done, so I'm not sure it’s completely fair to say that we committed to a W&D model that contains feature-crosses (as opposed to doing the feature-crossing in NVT). But if you have time to help, it seems totally doable to add. I was under the impression that the string hashing ops were CPU-only in TF, but it seems that’s no longer the case (PR). This would mean we could concat the ids of the columns you want to cross. We still have to figure out a nice API for the user to provide the feature-crosses.
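For illustration, here is a library-free sketch of the concat-and-hash idea being discussed (the TF string ops mentioned above would do the equivalent on-graph; the column values, separator, and function name here are made up):

```python
import hashlib

def hashed_cross(id_a: str, id_b: str, num_bins: int) -> int:
    """Concatenate two categorical ids and hash the result into a fixed
    number of buckets, producing a single cross-feature id."""
    joined = f"{id_a}_X_{id_b}"  # separator is arbitrary, just avoids collisions like ("ab","c") vs ("a","bc")
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_bins

# The crossed id can then index a wide-part embedding/weight table.
bucket = hashed_cross("nyc", "ios", num_bins=1000)
assert 0 <= bucket < 1000
```

Hashing keeps the cross vocabulary bounded regardless of the input cardinalities, at the cost of possible bucket collisions.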
Until we either have Wide&Deep merged or we've explicitly decided not to add it, whatever is required to make it happen seems to me like part of the scope we've committed ourselves to. 🤷🏻 Do we specifically need to use hashing, or can we exhaustively compute the pairwise feature crosses?
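For contrast with hashing, an exhaustive pairwise cross assigns each id pair its own bucket, at the cost of a vocabulary that grows as the product of the two cardinalities. A minimal sketch (function name and cardinalities are illustrative, not from the codebase):

```python
def exhaustive_cross(id_a: int, card_a: int, id_b: int, card_b: int) -> int:
    """Map a pair of categorical ids to a unique id in [0, card_a * card_b),
    i.e. a row-major cartesian index with no collisions."""
    assert 0 <= id_a < card_a and 0 <= id_b < card_b
    return id_a * card_b + id_b

# Two columns with 3 and 4 categories yield a cross vocabulary of 12 ids.
assert exhaustive_cross(2, 3, 1, 4) == 9
```

This is collision-free, which is why it is attractive for low-cardinality pairs; for high-cardinality columns the product blows up and hashing becomes the practical choice.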
The ideal end state of the wide&deep model:
We could either commit to doing all these things in one go or add things gradually. Up until this point we have been informally talking about the gradual approach, but it's totally fine with me if we want to do it all in one go.
I am closing this PR, as a new PR was opened by @Timmy00. The new PR solves the first point:
I agree feature crossing should be done in the modeling side, so that a Data Scientist would not have to preprocess multiple versions of the dataset to try out different feature interactions. |
Fixes #154
Goals ⚽
Testing Details 🔍
Verified that a `ValueError` is raised when wrong key names are provided to `CrossFeatures`.
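To illustrate the behavior being tested, here is a minimal hypothetical sketch of key validation that raises `ValueError` on unknown keys. Note this is NOT the actual Merlin Models `CrossFeatures` implementation, just a stand-in for the contract:

```python
class CrossFeatures:
    """Hypothetical stand-in: accepts only feature keys declared up front."""

    def __init__(self, schema_keys):
        self.schema_keys = set(schema_keys)

    def __call__(self, features: dict) -> dict:
        # Reject any input keys not present in the declared schema.
        unknown = set(features) - self.schema_keys
        if unknown:
            raise ValueError(f"Unknown feature keys: {sorted(unknown)}")
        return features

cf = CrossFeatures(["user_id", "item_id"])
cf({"user_id": 1, "item_id": 2})  # ok
```

Validating keys against the schema up front is what lets the test assert a clean `ValueError` instead of a confusing downstream shape or lookup error.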