-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Model6 #172
Implement Model6 #172
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice solution and basically looks good -- my biggest question is what happens when we have a class with two reference fields targeting the same parent class, e.g.
class Practice
location ~ City
previous_location ~ City
If I'm reading the code right, it looks like those would give rise to the same noisy relation name (for the name of the city e.g.).
Also, at some point we should update the README. I'll file a Github issue.
I'll be offline for the next couple weeks so I'll hand this off to Srinivas to continue the review.
cxx/distributions/beta_bernoulli.cc
Outdated
alpha = hypers[i].first; | ||
beta = hypers[i].second; | ||
if (logps.empty()) { | ||
printf("Warning! All hyperparamters for BetaBernoulli give nans!\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why this is the behavior we want, rather than exiting or letting the nans propagate? Can the nans disappear once more data is incorporated?
also sp: hyperparamEters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think this is the behavior we want because it fulfills the implicit contract of transition_hyperparameters, which I take to be "decrease the logp_score by changing the hyperparameters, if you can". If all the evaluated hyperparamter values give nans, well then you can't.
It's theoretically possible that the nans could go away when more data is incorporated, but I don't think it is likely. A more likely possibility is that the nans go away when the data is reclustered, which is a good reason not to exit.
I'm not sure what it would mean here to let the nans propagate. We aren't doing an early exit because we saw some nans, we are doing an early exit because the logps vector is empty and we can't access an item from an empty vector.
Fixed the typo.
cxx/pclean/pclean.cc
Outdated
@@ -31,6 +31,11 @@ int main(int argc, char** argv) { | |||
("i,iters", "Number of inference iterations", | |||
cxxopts::value<int>()->default_value("10")) | |||
("seed", "Random seed", cxxopts::value<int>()->default_value("10")) | |||
("only_final_emissions", "Only create one layer of emissions", | |||
cxxopts::value<bool>()->default_value("false")) | |||
("query_class_is_clean", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call this record_class
instead of query_class
, for consistency with the schema terminology and to help differentiate the record class vs. the query output?
Also, consider flipping the semantics of both of these (IMO that's more intuitive, that "true" means "add extra noise") -- i.e. latent_attributes_are_noisy
and record_class_is_noisy
/record_attributes_are_noisy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did the s/query_class/record_class/
BOOST_TEST(tschema.contains("Physician:school::School:name")); | ||
BOOST_TEST(tschema.contains("Practice:city::City:name")); | ||
BOOST_TEST(tschema.contains("Practice:city::City:state")); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also contain Physician:practice:city::City:{name, state}
, right? If so could you test that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the stuff before the :: is only a "[Observing_class]:[observing_variable]" not a full path. So there is no Physician:practice:city::City:name.
@@ -218,6 +247,83 @@ BOOST_AUTO_TEST_CASE(test_make_hirm_schmea) { | |||
// "City" moved to the front of the list. | |||
expected_domains = {"City", "School", "Physician", "Practice", "Record"}; | |||
BOOST_TEST(nr5.domains == expected_domains, tt::per_element()); | |||
|
|||
BOOST_TEST(tschema.contains("Physician:school::School:name")); | |||
BOOST_TEST(tschema.contains("Practice:city::City:name")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we have something like
class Practice
location ~ City
previous_location ~ City
? That seems like something we should support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do support it! Assuming these are both observed in the query section, they will generate intermediate noisy relations named "Practice:location::City:name" and "Practice:previous_location::City:name".
@@ -98,6 +195,9 @@ std::vector<std::string> reorder_domains( | |||
|
|||
T_schema PCleanSchemaHelper::make_hirm_schema() { | |||
T_schema tschema; | |||
|
|||
// For every scalar variable, make a clean relation with the name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a check that all scalar variables appear in the query? If we define clean relations that aren't observed and aren't the base relation for a noisy relation that's observed, I'm not sure what happens downstream (and that seems unintended).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should support scalar variables not appearing in the query, because I think we should support model specification being an independent thing from query specification. That is, if you spend a lot of time developing a fancy probabilistic model of some domain, it would be great if you could just hand it to me and I could use it just for the fewer columns that interest me (or that I have available) without having to edit the model down.
It would be great if the IRM/HIRM code could support "totally unobserved" clean relations, because that is the most natural place for it. (For example, the same issue arises when not using schema generated by pclean). I'll create an issue so we can discuss this further and figure out a fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies I'm still working through this (since there's a lot for me to catch up on with the DD and some other PRs). Hopefully should be clear on this later today.
alpha = hypers[i].first; | ||
beta = hypers[i].second; | ||
if (logps.empty()) { | ||
printf("Warning! All hyperparameters for BetaBernoulli give nans!\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess practically inference gets stuck, because the hyperparameters don't move, but there might be some hope for other parameters to move and get unstuck? My thinking is these should be asserts since it would be hard to get out of here (you need the observations to change, so the cluster assignments to change) and highlights some suboptimality of inference that should be fixed somewhere else (either numerical stability, bad preconditioning of some sort, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience!
Model6 behavior is now the default for pclean; use --only_final_emissions to recover Model5.
This also adds --query_class_is_clean (which defaults to true), for not adding emissions noise when the query is directly from the query class.
To make the integration tests pass, I also had to fix a bug in all the distributions where the code had assumed that at least one of the hyperparameters wouldn't give nan.
I didn't end up using the IdentityEmission. I think we should still keep it for clean_string etc. fields.