-
Notifications
You must be signed in to change notification settings - Fork 551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split up Datamodel into predicates, rename to Featurizer #1088
Comments
i think that's a nice idea, nick! i would love to see that. |
If we remove the DataModel class, it's going to be difficult to remain backwards compatible with the settings file loading in api.py. eg the What do you think @fgregg, would this change need to be backwards compatible? |
Thinking about this more, I think it might make sense to combine the Featurizer and Classifier into one object, FeatureScorer, which is a concrete implementation of a Scorer protocol. Dedupe and the higher classes should only operate on the class Scorer(Protocol):
def score(self, pairs: Pairs) -> Scores:
...
class FeatureScorer(Scorer):
def __init__(self, featurizer, classifier):
self.featurizer = featurizer
self.classifier = classifier
def score(self, pairs):
# or
# features = self.featurizer.featurize(pairs)
# not sure if it's worth defining an entire protocol
# for featurizing
features = self.featurizer(pairs)
return self.classifier(features) By doing this, it allows someone (ie me :) ) to plug in their own scorer that operates in a different way. For instance, I have some hardcoded rules that I want to enforce ("if first names don't match, then give a score of 0") and this logic is impossible with the current featurize/classify pipeline we have now. |
i think it's right that the featurizer and classifer are always going to be very coupled. so that makes sense! |
one thing i have been thinking about, increasingly, is blocking as a valuable feature for scoring #1103, so that's a place where you might want to loosen the coupling. |
Branching off from thoughts at #1079 (comment)
Datamodel currently has two responsibilities:
We should split this up. The predicate holder I think could just be a simple python set of predicates. The feature generator could be called Featurizer or something (not Comparator or similar, since that sounds too much like it generates a 0-1 score, which is the Classifier's job).
We could still generate these two things from variable definitions, but they should be stored separately. This would help for instance in labeler.py, when we are passing around a Datamodel willy nilly, when really what we care about is either the predicates for the BlockLearner, or the feature vectors for the RLRLearner. We should only pass around the banana, not the entire gorilla holding the banana.
By doing this, it would also allow us to better abstract the two separate steps of
For instance, I would like to be able to modify one of these steps, and it currently is a bit difficult to get in the middle and modify what's happening. To be precise, I want to add hard rules, such as "if both first name and address match, then mark similarity as 1" (I can add a custom variable for this that helps, but this doesn't guarantee that they get a score of 1).
The text was updated successfully, but these errors were encountered: