Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Model6 #172

Merged
merged 15 commits into from
Aug 21, 2024
Merged

Implement Model6 #172

merged 15 commits into from
Aug 21, 2024

Conversation

ThomasColthurst
Copy link
Collaborator

Model6 behavior is now the default for pclean; use --only_final_emissions to recover Model5.

This also adds --query_class_is_clean (which defaults to true), for not adding emissions noise when the query is directly from the query class.

To make the integration tests pass, I also had to fix a bug in all the distributions where the code had assumed that at least one of the hyperparameters wouldn't give nan.

I didn't end up using the IdentityEmission. I think we should still keep it for clean_string etc. fields.

Copy link
Collaborator

@emilyfertig emilyfertig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice solution and basically looks good -- my biggest question is what happens when we have a class with two reference fields targeting the same parent class, e.g.

class Practice
  location ~ City
  previous_location ~ City

If I'm reading the code right, it looks like those would give rise to the same noisy relation name (for the name of the city e.g.).

Also, at some point we should update the README. I'll file a Github issue.

I'll be offline for the next couple weeks so I'll hand this off to Srinivas to continue the review.

alpha = hypers[i].first;
beta = hypers[i].second;
if (logps.empty()) {
printf("Warning! All hyperparamters for BetaBernoulli give nans!\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why this is the behavior we want, rather than exiting or letting the nans propagate? Can the nans disappear once more data is incorporated?

also sp: hyperparamEters

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this is the behavior we want because it fulfills the implicit contract of transition_hyperparameters, which I take to be "decrease the logp_score by changing the hyperparameters, if you can". If all the evaluated hyperparamter values give nans, well then you can't.

It's theoretically possible that the nans could go away when more data is incorporated, but I don't think it is likely. A more likely possibility is that the nans go away when the data is reclustered, which is a good reason not to exit.

I'm not sure what it would mean here to let the nans propagate. We aren't doing an early exit because we saw some nans, we are doing an early exit because the logps vector is empty and we can't access an item from an empty vector.

Fixed the typo.

@@ -31,6 +31,11 @@ int main(int argc, char** argv) {
("i,iters", "Number of inference iterations",
cxxopts::value<int>()->default_value("10"))
("seed", "Random seed", cxxopts::value<int>()->default_value("10"))
("only_final_emissions", "Only create one layer of emissions",
cxxopts::value<bool>()->default_value("false"))
("query_class_is_clean",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this record_class instead of query_class, for consistency with the schema terminology and to help differentiate the record class vs. the query output?

Also, consider flipping the semantics of both of these (IMO that's more intuitive, that "true" means "add extra noise") -- i.e. latent_attributes_are_noisy and record_class_is_noisy/record_attributes_are_noisy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did the s/query_class/record_class/

cxx/pclean/schema_helper.cc Outdated Show resolved Hide resolved
cxx/pclean/schema_helper.cc Outdated Show resolved Hide resolved
cxx/pclean/schema_helper_test.cc Outdated Show resolved Hide resolved
cxx/pclean/schema_helper_test.cc Show resolved Hide resolved
BOOST_TEST(tschema.contains("Physician:school::School:name"));
BOOST_TEST(tschema.contains("Practice:city::City:name"));
BOOST_TEST(tschema.contains("Practice:city::City:state"));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also contain Physician:practice:city::City:{name, state}, right? If so could you test that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the stuff before the :: is only a "[Observing_class]:[observing_variable]" not a full path. So there is no Physician:practice:city::City:name.

@@ -218,6 +247,83 @@ BOOST_AUTO_TEST_CASE(test_make_hirm_schmea) {
// "City" moved to the front of the list.
expected_domains = {"City", "School", "Physician", "Practice", "Record"};
BOOST_TEST(nr5.domains == expected_domains, tt::per_element());

BOOST_TEST(tschema.contains("Physician:school::School:name"));
BOOST_TEST(tschema.contains("Practice:city::City:name"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have something like

class Practice
  location ~ City
  previous_location ~ City

? That seems like something we should support.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do support it! Assuming these are both observed in the query section, they will generate intermediate noisy relations named "Practice:location::City:name" and "Practice:previous_location::City:name".

@@ -98,6 +195,9 @@ std::vector<std::string> reorder_domains(

T_schema PCleanSchemaHelper::make_hirm_schema() {
T_schema tschema;

// For every scalar variable, make a clean relation with the name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a check that all scalar variables appear in the query? If we define clean relations that aren't observed and aren't the base relation for a noisy relation that's observed, I'm not sure what happens downstream (and that seems unintended).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should support scalar variables not appearing in the query, because I think we should support model specification being an independent thing from query specification. That is, if you spend a lot of time developing a fancy probabilistic model of some domain, it would be great if you could just hand it to me and I could use it just for the fewer columns that interest me (or that I have available) without having to edit the model down.

It would be great if the IRM/HIRM code could support "totally unobserved" clean relations, because that is the most natural place for it. (For example, the same issue arises when not using schema generated by pclean). I'll create an issue so we can discuss this further and figure out a fix.

Copy link
Collaborator

@srvasude srvasude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies I'm still working through this (since there's a lot for me to catch up on with the DD and some other PRs). Hopefully should be clear on this later today.

alpha = hypers[i].first;
beta = hypers[i].second;
if (logps.empty()) {
printf("Warning! All hyperparameters for BetaBernoulli give nans!\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess practically inference gets stuck, because the hyperparameters don't move, but there might be some hope for other parameters to move and get unstuck? My thinking is these should be asserts since it would be hard to get out of here (you need the observations to change, so the cluster assignments to change) and highlights some suboptimality of inference that should be fixed somewhere else (either numerical stability, bad preconditioning of some sort, etc).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

@srvasude srvasude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience!

@ThomasColthurst ThomasColthurst merged commit 9a2cccb into master Aug 21, 2024
2 checks passed
@ThomasColthurst ThomasColthurst deleted the 080724-thomaswc-model6 branch August 21, 2024 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants