Support Training With Sparse Matrices #1629
Conversation
…n't do the same calculation twice
Thanks for submitting this draft! As you mentioned, there would be a separate pass on style, linting, etc. before this could be merged, so I mostly just took a look at the high level structure (except for where I couldn't help myself to comment on individual lines).
I can cc a couple of the original authors of the label model to double check correctness of the new methods once the structure is in a good place. From a structural standpoint, you're right that this doesn't quite match the style of the methods around it. The changes to `label_model.py` are fairly general, mostly breaking large blocks of code into smaller methods, which isn't a bad idea anyway, especially given the size of that file. But the helper methods aren't object-oriented like the other primary classes in `labeling/model/`.
There are two ways that I could see this moving forward:
- Move these helper methods to a sparse_label_model project under `snorkel/contrib` (see the README there for details), where they can be used by others as helpful methods, but are less integrated and not guaranteed the same level of long-term support.
- Create a `SparseLabelModel` class that inherits from `LabelModel` and adds these new functionalities as part of specialized `fit` methods. This keeps the new complexity away from the simple case where L is dense in `label_model.py`, while maintaining the object-centric design used in that module.
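For concreteness, the subclass option might look roughly like the sketch below. The class name, method name, and stub base class are hypothetical stand-ins (the real `LabelModel` lives in `snorkel.labeling.model`); this only shows the shape of the inheritance, with the sparse entry point densifying its input and delegating to the inherited dense path:

```python
from typing import List, Tuple


class LabelModel:
    """Stub standing in for snorkel's LabelModel, for illustration only."""

    def fit(self, L: List[List[int]]) -> None:
        self.L_ = L  # the real model would train on the dense L here


class SparseLabelModel(LabelModel):
    """Hypothetical subclass: accepts sparse (i, j, k) event triples,
    densifies them, and delegates to the inherited dense fit()."""

    def fit_from_sparse_events(
        self, events: List[Tuple[int, int, int]], n_docs: int, n_funcs: int
    ) -> None:
        # -1 is the abstain value in Snorkel's dense L convention
        L = [[-1] * n_funcs for _ in range(n_docs)]
        for i, j, k in events:
            L[i][j] = k  # document i, labeling function j, class k
        self.fit(L)


model = SparseLabelModel()
model.fit_from_sparse_events([(0, 0, 1), (1, 1, 0)], n_docs=2, n_funcs=2)
```

The base class stays untouched, so dense users see no new complexity; only callers with sparse inputs opt into the specialized method.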
Nice! This is moving in the right direction—the class abstraction and subdirectory feel pretty good stylistically—no added complexity in the base case, but similar usage patterns.
I added a few hyper-local suggestions just from quick browsing. There are still some linting issues, of course, and I'll do a deeper pass once you feel like most of the big structural movement is done. Explicitly re-request review once tests are passing. I'm also tagging @fredsala for help reviewing the label model logic.
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1629      +/-   ##
==========================================
- Coverage   97.21%   96.26%    -0.95%
==========================================
  Files          68       72       +4
  Lines        2151     2276     +125
  Branches      345      358      +13
==========================================
+ Hits         2091     2191     +100
- Misses         31       52      +21
- Partials       29       33       +4
```
I think the coverage tool isn't picking up on some of the tests.
Looking really close here! There are a handful of `.idea` files we need to pull out of the commit, and then some small formatting tweaks. I agree, codecov doesn't seem to be giving credit for `@pytest.mark.complex` methods, but I can see that they're actually covered. If you can address the above comments and do one more light pass for typos/spacing/formatting, I think this should be ready to land!
```diff
@@ -129,9 +129,10 @@ dmypy.json
 # Editors
 .vscode/
 .code-workspace*
+.idea/
```
It's cool to add this to .gitignore, but then let's not add all the files in the .idea directory. Ditto with workspace.code-workspace: let's not add it to the repo, but if you want to add that type of file to .gitignore, that's fine.
```
Indexing throughout this module is 0 based, with the assumption that "abstains" are ommited.

When working with larger datasets, it can be convenient to load the data in sparse format. This module
```
Nit: Let's move this summary of this file's purpose above the indexing disclaimer.
```
Case 2:
The user has a list of 3-tuples(i,j,k) such that for document i, labeling function j predicted class k.

The Case 3:
```
Drop "The" for consistency
```
and the user has a list of tuples (i,j) that indicate that event j occoured for example i.

Case 2:
The user has a list of 3-tuples(i,j,k) such that for document i, labeling function j predicted class k.
```
Nit: space before (i,j,k)
```
The Case 3:
user has a list of 3-tuples (i,j,c) where i and j range over [0,num_funcs*num_classes] such that
the events i and j were observed to have co-occur c times.
```
Nit: extra space here and in case 5
```python
rows = []
cols = []
data = []
cliquesets_list = (
```
Remove the parentheses, since it's just a list, not a tuple
```python
if not is_augmented:
    # This is the usual mode
    L_shift = L + 1  # convert to {0, 1, ..., k}
    self._set_constants(L_shift)  # TODO - Why do we need this here ?
```
self._get_augmented_label_matrix uses at least self.cardinality, which is set in this method. Remove the TODO?
This pull request is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Disclaimer
This PR isn't done. It does what it's supposed to do and has tests, but style and code cleanliness might not be there yet.
I'm not totally confident this implementation is a good fit. I'd appreciate it if someone could take a look and let me know if I'm on track before I polish this. Maybe @bhancock8, who replied to the original issue?
Description of proposed changes
Adds support for training and inference with sparse matrices.
This PR adds a few convenience functions to help the user work with sparse matrix representations of L_ind or the objective matrix (does either have a formal name?).
I presume most users will call `train_model_from_sparse_event_cooccurence`, which takes a list of tuples representing the L_ind indices and value (which is always 1), populates a sparse matrix, and runs training.
`train_model_from_sparse_event_cooccurence` calls `train_model_from_known_objective`, which gets a dense numpy representation of O and trains. When I use Snorkel I call this function and calculate O elsewhere; it's faster.
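As a rough illustration of the sparse-to-objective path: assuming O is the usual empirical overlaps matrix L_ind^T L_ind / n from the label model, it can be built directly from (doc, event) index pairs without ever materializing a dense L_ind. The function name and the pure-Python list-of-lists representation below are illustrative only; the PR's actual helpers operate on scipy/numpy types.

```python
from typing import List, Tuple


def objective_from_sparse_events(
    events: List[Tuple[int, int]], n_docs: int, n_events: int
) -> List[List[float]]:
    """Build O = (L_ind^T @ L_ind) / n_docs from a list of (doc, event)
    index pairs where L_ind[doc, event] == 1. Illustrative sketch only."""
    # Group the active event columns by document (one set per row of L_ind)
    active = [set() for _ in range(n_docs)]
    for i, j in events:
        active[i].add(j)
    # Accumulate normalized co-occurrence counts over each document's events
    O = [[0.0] * n_events for _ in range(n_events)]
    for cols in active:
        for a in cols:
            for b in cols:
                O[a][b] += 1.0 / n_docs
    return O


O = objective_from_sparse_events([(0, 0), (0, 1), (1, 1)], n_docs=2, n_events=2)
```

The work here is proportional to the number of observed events per document, which is the point of the sparse path: documents where most labeling functions abstain contribute almost nothing.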
Internally, there is some refactoring in LabelModel to support `train_model_from_known_objective`: constants are set differently, and the tree and clique data calculations are moved a little.
Related issue(s)
Fixes #1625
Test plan
I wrote tests in test_sparse_data_helpers.
Basically, the tests create an L matrix in standard format and then compare the output of normal Snorkel to Sparse Snorkel.
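The core of that comparison is checking that the sparse representation round-trips to the same L matrix. A minimal sketch of the idea, assuming -1 is the abstain value; the helper names here are illustrative, not the actual test code:

```python
from typing import List, Tuple


def dense_to_events(L: List[List[int]]) -> List[Tuple[int, int, int]]:
    """Convert a dense L matrix (-1 = abstain) into sparse (i, j, k) triples."""
    return [
        (i, j, k)
        for i, row in enumerate(L)
        for j, k in enumerate(row)
        if k != -1
    ]


def events_to_dense(
    events: List[Tuple[int, int, int]], n_docs: int, n_funcs: int
) -> List[List[int]]:
    """Rebuild the dense L matrix from (i, j, k) triples."""
    L = [[-1] * n_funcs for _ in range(n_docs)]
    for i, j, k in events:
        L[i][j] = k
    return L


L = [[0, -1, 1], [-1, 2, -1]]
assert events_to_dense(dense_to_events(L), n_docs=2, n_funcs=3) == L
```

If the representations round-trip, then any divergence between normal and Sparse Snorkel outputs has to come from the training path itself, not from the input conversion.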
Checklist
Need help on these? Just ask!
Run `tox -e complex` and/or `tox -e spark` if appropriate.