Implement Model6 #172

ThomasColthurst · 2024-08-16T16:29:40Z

Model6 behavior is now the default for pclean; use --only_final_emissions to recover Model5.

This also adds --query_class_is_clean (default: true), which skips adding emission noise when the queried value comes directly from the query class.

To make the integration tests pass, I also had to fix a bug in all the distributions where the code had assumed that at least one of the hyperparameters wouldn't give nan.

I didn't end up using the IdentityEmission. I think we should still keep it for clean_string etc. fields.

emilyfertig

This is a nice solution and basically looks good -- my biggest question is what happens when we have a class with two reference fields targeting the same parent class, e.g.

class Practice
  location ~ City
  previous_location ~ City

If I'm reading the code right, it looks like those would give rise to the same noisy relation name (for the name of the city e.g.).

Also, at some point we should update the README. I'll file a GitHub issue.

I'll be offline for the next couple weeks so I'll hand this off to Srinivas to continue the review.

emilyfertig · 2024-08-16T21:03:04Z

cxx/distributions/beta_bernoulli.cc

-  alpha = hypers[i].first;
-  beta = hypers[i].second;
+  if (logps.empty()) {
+    printf("Warning! All hyperparamters for BetaBernoulli give nans!\n");


Could you explain why this is the behavior we want, rather than exiting or letting the nans propagate? Can the nans disappear once more data is incorporated?

also sp: hyperparamEters

ThomasColthurst · 2024-08-19

So I think this is the behavior we want because it fulfills the implicit contract of transition_hyperparameters, which I take to be "decrease the logp_score by changing the hyperparameters, if you can". If all the evaluated hyperparameter values give nans, well then you can't.

It's theoretically possible that the nans could go away when more data is incorporated, but I don't think it is likely. A more likely possibility is that the nans go away when the data is reclustered, which is a good reason not to exit.

I'm not sure what it would mean here to let the nans propagate. We aren't doing an early exit because we saw some nans, we are doing an early exit because the logps vector is empty and we can't access an item from an empty vector.

Fixed the typo.
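
The guard discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `BetaHypers` and `pick_non_nan_hypers` are hypothetical names, and picking the largest surviving score is an arbitrary choice made for the example (the real transition_hyperparameters may choose among candidates differently). The point is only that NaN scores are dropped first, and that an empty list of survivors leaves the current hyperparameters untouched instead of indexing into an empty vector.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical stand-in for a distribution's hyperparameter state.
struct BetaHypers {
  double alpha = 1.0;
  double beta = 1.0;
};

// Assumption: scores[i] is the log score of grid[i].
// Drops every candidate whose score is NaN, then adopts the surviving
// candidate with the largest score. Returns false and leaves *current
// unchanged when no candidate survives.
bool pick_non_nan_hypers(const std::vector<std::pair<double, double>>& grid,
                         const std::vector<double>& scores,
                         BetaHypers* current) {
  std::vector<std::size_t> usable;  // indices whose score is not NaN
  for (std::size_t i = 0; i < grid.size() && i < scores.size(); ++i) {
    if (!std::isnan(scores[i])) usable.push_back(i);
  }
  if (usable.empty()) {
    std::printf("Warning! All hyperparameters give nans!\n");
    return false;  // nothing to choose from, so keep the old values
  }
  std::size_t best = usable[0];
  for (std::size_t i : usable) {
    if (scores[i] > scores[best]) best = i;
  }
  current->alpha = grid[best].first;
  current->beta = grid[best].second;
  return true;
}
```

A caller can ignore the return value and carry on with inference, which matches the argument above that a later reclustering might make the nans go away.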

emilyfertig · 2024-08-16T21:57:01Z

cxx/pclean/pclean.cc

@@ -31,6 +31,11 @@ int main(int argc, char** argv) {
      ("i,iters", "Number of inference iterations",
       cxxopts::value<int>()->default_value("10"))
      ("seed", "Random seed", cxxopts::value<int>()->default_value("10"))
+      ("only_final_emissions", "Only create one layer of emissions",
+       cxxopts::value<bool>()->default_value("false"))
+      ("query_class_is_clean",


Can we call this record_class instead of query_class, for consistency with the schema terminology and to help differentiate the record class vs. the query output?

Also, consider flipping the semantics of both of these (IMO that's more intuitive, that "true" means "add extra noise") -- i.e. latent_attributes_are_noisy and record_class_is_noisy/record_attributes_are_noisy.

ThomasColthurst · 2024-08-19

Did the s/query_class/record_class/

cxx/pclean/schema_helper.cc

cxx/pclean/schema_helper_test.cc

emilyfertig · 2024-08-16T23:44:55Z

cxx/pclean/schema_helper_test.cc

+  BOOST_TEST(tschema.contains("Physician:school::School:name"));
+  BOOST_TEST(tschema.contains("Practice:city::City:name"));
+  BOOST_TEST(tschema.contains("Practice:city::City:state"));
+}


This should also contain Physician:practice:city::City:{name, state}, right? If so could you test that?

ThomasColthurst · 2024-08-19

No, the stuff before the :: is only a "[Observing_class]:[observing_variable]" not a full path. So there is no Physician:practice:city::City:name.

emilyfertig · 2024-08-17T00:05:14Z

cxx/pclean/schema_helper_test.cc

@@ -218,6 +247,83 @@ BOOST_AUTO_TEST_CASE(test_make_hirm_schmea) {
  // "City" moved to the front of the list.
  expected_domains = {"City", "School", "Physician", "Practice", "Record"};
  BOOST_TEST(nr5.domains == expected_domains, tt::per_element());
+
+  BOOST_TEST(tschema.contains("Physician:school::School:name"));
+  BOOST_TEST(tschema.contains("Practice:city::City:name"));


What if we have something like

class Practice
  location ~ City
  previous_location ~ City

? That seems like something we should support.

ThomasColthurst · 2024-08-19

We do support it! Assuming these are both observed in the query section, they will generate intermediate noisy relations named "Practice:location::City:name" and "Practice:previous_location::City:name".
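
The naming scheme described in these replies, "[Observing_class]:[observing_variable]::[Class]:[attribute]", can be illustrated with a small helper. `noisy_relation_name` is a hypothetical function written for this example, not a function from schema_helper.cc; it only shows why two reference fields pointing at the same parent class end up with distinct relation names.

```cpp
#include <string>

// Builds "[observing_class]:[observing_variable]::[target_class]:[attribute]".
// The part before "::" names only the observing class and variable, not a
// full path through the schema.
std::string noisy_relation_name(const std::string& observing_class,
                                const std::string& observing_variable,
                                const std::string& target_class,
                                const std::string& attribute) {
  return observing_class + ":" + observing_variable + "::" + target_class +
         ":" + attribute;
}
```

With this helper, `noisy_relation_name("Practice", "location", "City", "name")` gives "Practice:location::City:name", while the `previous_location` field gives "Practice:previous_location::City:name", so the two do not collide.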

emilyfertig · 2024-08-17T00:19:49Z

cxx/pclean/schema_helper.cc

@@ -98,6 +195,9 @@ std::vector<std::string> reorder_domains(

 T_schema PCleanSchemaHelper::make_hirm_schema() {
  T_schema tschema;
+
+  // For every scalar variable, make a clean relation with the name


Could you add a check that all scalar variables appear in the query? If we define clean relations that aren't observed and aren't the base relation for a noisy relation that's observed, I'm not sure what happens downstream (and that seems unintended).

ThomasColthurst · 2024-08-19

I think we should support scalar variables not appearing in the query, because I think we should support model specification being an independent thing from query specification. That is, if you spend a lot of time developing a fancy probabilistic model of some domain, it would be great if you could just hand it to me and I could use it just for the fewer columns that interest me (or that I have available) without having to edit the model down.

It would be great if the IRM/HIRM code could support "totally unobserved" clean relations, because that is the most natural place for it. (For example, the same issue arises when not using schema generated by pclean). I'll create an issue so we can discuss this further and figure out a fix.

srvasude

Apologies, I'm still working through this (there's a lot for me to catch up on with the DD and some other PRs). Hopefully I'll be clear on this later today.

srvasude · 2024-08-20T15:18:23Z

cxx/distributions/beta_bernoulli.cc

-  alpha = hypers[i].first;
-  beta = hypers[i].second;
+  if (logps.empty()) {
+    printf("Warning! All hyperparameters for BetaBernoulli give nans!\n");


I guess that in practice inference gets stuck because the hyperparameters don't move, though there might be some hope that other parameters move and unstick it? My thinking is that these should be asserts: it would be hard to get out of this state (you need the observations to change, which means the cluster assignments have to change), and hitting it highlights some suboptimality of inference that should be fixed somewhere else (numerical stability, bad preconditioning of some sort, etc.).
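
For comparison with the warning-only approach, the assert-based alternative suggested here might look like the sketch below. `has_non_nan_score` and `require_usable_scores` are hypothetical names for illustration. Note that `assert` compiles to nothing when `NDEBUG` is defined, so in such builds only the message would be printed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// Returns true when at least one score is not NaN.
bool has_non_nan_score(const std::vector<double>& scores) {
  for (double s : scores) {
    if (!std::isnan(s)) return true;
  }
  return false;
}

// Fails loudly (in builds without NDEBUG) when every score is NaN, treating
// that state as a bug to fix elsewhere rather than something to continue past.
void require_usable_scores(const std::vector<double>& scores) {
  if (!has_non_nan_score(scores)) {
    std::fprintf(stderr, "All hyperparameters give nans!\n");
    assert(false);
  }
}
```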

srvasude

Thanks for your patience!

ThomasColthurst added 8 commits August 7, 2024 16:28

Add --only_final_emissions

6ac409c

Fix merge conflicts

86fc810

Initial commit of Model 6

f49bb5f

fix some build errors

f7b84ba

Finish prefix_path computation

989fa31

Fix remaining build errors

3d5a438

Fix tests; add tests

8acf4b7

Fix transition_hyperparameter bugs

ca24eaf

ThomasColthurst requested a review from emilyfertig August 16, 2024 16:29

emilyfertig reviewed Aug 17, 2024

emilyfertig requested a review from srvasude August 17, 2024 00:28

ThomasColthurst added 4 commits August 19, 2024 19:12

Respond to reviewer comments

e4782f8

Fix build errors

a8778c3

Fix test errors

6d7008a

Add tests for record_class_is_clean

dba9466

srvasude reviewed Aug 20, 2024

ThomasColthurst added 3 commits August 20, 2024 17:55

Added assert(false) when all hyperparameters give nanas

37fed68

include cassert

0d2df82

Fix integration tests

7c01bd6

srvasude approved these changes Aug 21, 2024

ThomasColthurst merged commit 9a2cccb into master Aug 21, 2024
2 checks passed

ThomasColthurst deleted the 080724-thomaswc-model6 branch August 21, 2024 16:55
