Cleanup interfaces to Distributions and Emissions #87

ThomasColthurst · 2024-07-15T17:25:52Z

I'm still fixing a few tests in this pull request, but this should give you enough of an idea to evaluate it.

1. Replaces util_distribution_variant with distributions/get_distribution and observation_variant.hh.
2. Adds util_parse with a parse_name_and_parameters function that both get_distribution and get_emission can use to extract names with optional parameter values.
3. Replaces DistributionSpec and EmissionSpec with strings.
4. Replaces std::visit calls with std::get<>() when possible.
5. Gets rid of EmissionEnum and ObservationEnum.
6. Replaces observation_string_to_value with a from_string method on Relation. As part of this, strings are converted into ObservationVariants slightly later than before: in the util_io incorporate_observation functions rather than in load_observations.
7. Some of the Sometimes<> and get_emission.hh changes from Add "sometimes_categorical" to get_emission #81.
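
For readers who have not opened util_parse, a minimal sketch of what a name-plus-parameters parser could look like. The grammar that parse_name_and_parameters really accepts is not shown in this thread, so the `name(key=value, ...)` syntax, the `ParsedSpec` struct, and the `parse_spec_sketch` name are all assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical result type: a name plus optional key=value parameters.
struct ParsedSpec {
  std::string name;
  std::map<std::string, std::string> params;
};

// Sketch only. ASSUMED syntax: "name" or "name(key=value, key=value)".
// The real util_parse function may accept a different grammar.
ParsedSpec parse_spec_sketch(const std::string& spec) {
  ParsedSpec out;
  const std::size_t open = spec.find('(');
  if (open == std::string::npos) {
    out.name = spec;  // No parameter list at all.
    return out;
  }
  assert(spec.back() == ')' && "unterminated parameter list");
  out.name = spec.substr(0, open);
  // Text between the parentheses, e.g. "k=5, alpha=0.5".
  const std::string body = spec.substr(open + 1, spec.size() - open - 2);
  std::size_t pos = 0;
  while (pos < body.size()) {
    std::size_t comma = body.find(',', pos);
    if (comma == std::string::npos) comma = body.size();
    std::string item = body.substr(pos, comma - pos);
    // Trim surrounding spaces, then split on the first '='.
    const std::size_t first = item.find_first_not_of(' ');
    const std::size_t last = item.find_last_not_of(' ');
    if (first != std::string::npos) {
      item = item.substr(first, last - first + 1);
      const std::size_t eq = item.find('=');
      assert(eq != std::string::npos && "expected key=value");
      out.params[item.substr(0, eq)] = item.substr(eq + 1);
    }
    pos = comma + 1;
  }
  return out;
}
```

With this sketch, `parse_spec_sketch("bernoulli")` yields a name and no parameters, while a made-up spec such as `"categorical(k=5, alpha=0.5)"` yields the name plus two string-valued parameters that the caller would still need to convert.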

ThomasColthurst · 2024-07-15T17:51:43Z

Tests now pass.

emilyfertig · 2024-07-15T23:09:45Z

I'm still fixing a few tests in this pull request, but this should give you enough of an idea to evaluate it.

Replaces util_distribution_variant with distributions/get_distribution and observation_variant.hh

Thanks, this is a better name/location.

Adds util_parse with parse_name_and_parameters function that both get_distribution and get_emission can use to extract names with optional parameter values.

Nice refactor.

Replaces DistributionSpec and EmissionSpec with strings.

Now having seen this PR, I'm still not on board with this change, for the reasons I listed in a comment on #81 as well as some additional reasons listed in comments below. By my count in the Colab, adding a new distribution requires modifying 5 different places in 3 different files for the solution in this PR, vs. 6 different places in 4 different files if we keep the enums (assuming we still get rid of observation_string_to_value). I don't think this is worth it. (Plus, the enums are more self-documenting in terms of where the code needs to be modified, and the compiler tells you if you add to an enum and don't handle the corresponding case.)

Replaces std::visit calls with std::get<>() when possible.

Why is this an advantage? I see this in CleanRelation; were there other places?

Gets rid of EmissionEnum and ObservationEnum.

Same comment as 3.

Replaces observation_string_to_value with a from_string method on Relation. As part of this, strings are converted into ObservationVariants slightly later than before: in the util_io incorporate_observation functions rather than in load_observations.

Thanks -- we're still doing the same number of string conversions, right?

Some of the Sometimes<> and get_emission.hh changes from Add "sometimes_categorical" to get_emission #81

LG
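
The point above about the compiler flagging unhandled enum cases can be illustrated with a small sketch. GCC and Clang warn under -Wswitch (enabled by -Wall) when a switch over an enum has no default label and misses an enumerator. The `DistributionKind` enum and its enumerators are hypothetical stand-ins, not the project's actual enums.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum; the enumerators are illustrative only.
enum class DistributionKind { kBernoulli, kNormal };

// There is deliberately no default: label. If a new enumerator is added to
// DistributionKind and not handled here, -Wswitch reports it at compile time.
std::string kind_name(DistributionKind kind) {
  switch (kind) {
    case DistributionKind::kBernoulli:
      return "bernoulli";
    case DistributionKind::kNormal:
      return "normal";
  }
  // Only reachable if an out-of-range value was cast into the enum.
  assert(false && "invalid DistributionKind");
  return "unknown";
}
```

A string-keyed lookup has no equivalent compile-time check, so an unhandled name can only be caught at run time, for example by an assert or a test.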

emilyfertig · 2024-07-15T22:37:35Z

cxx/clean_relation.hh

-  const std::variant<DistributionSpec, EmissionSpec> prior_spec;
+  const std::string prior_spec;
+  // Is the codomain an emission?
+  const bool codomain_is_emission;


codomain_is_emission is redundant with ValueType and prior_spec. This now seems more bug-prone, since we have to pass the matching bool when we build a CleanRelation from a spec. It also adds complexity: previously, CleanRelation didn't need to be aware of whether its model was a distribution or an emission. This is an instance of the extra complexity incurred by getting rid of enums, which I don't think is a good tradeoff.

I'm open to replacing codomain_is_emission with something else. Here are some possibilities:

1. We resurrect DistributionSpec and EmissionSpec as just wrappers around a string. (This is clearly the same complexity as before.)

2. If we think we will eventually need or want run-time dynamism, we could allow get_emission and get_distribution to return nullptrs, which would let us run the string prior_spec through both of them to see which one succeeds.

3. We replace CleanRelation's constructor with two static make methods: make_clean_relation_from_distribution and make_clean_relation_from_emission. This is basically #1 with different syntax.

4. We make CleanRelation into a non-concrete base class, and give it two children: CleanDistributionRelation and CleanEmissionRelation. Almost all of the code except make_new_distribution still lives in CleanRelation.

I like 1. (For 2, it doesn't strike me that we'd need/want runtime dynamism, and for 3/4, IMO we don't want to be treating distributions and emissions separately in CleanRelation -- that adds unnecessary complexity, since CleanRelation can/does happily cast Emissions to Distributions when using them as clusters).

If we use 1, then DistributionSpec and EmissionSpec are storing strings that are used as enums, which again, IMO, should be enums.

I changed the code to use #1.
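
A minimal sketch of the first possibility discussed in this thread (spec types as thin wrappers around a string), assuming a layout the thread does not show. The member name `spec`, the `PriorSpec` alias, and the two helper functions are hypothetical; only the DistributionSpec and EmissionSpec type names and the `std::variant<DistributionSpec, EmissionSpec>` member type come from the diff above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Thin wrappers whose only payload is the spec string (assumed layout).
struct DistributionSpec {
  explicit DistributionSpec(std::string s) : spec(std::move(s)) {
    assert(!spec.empty() && "spec string must not be empty");
  }
  std::string spec;
};

struct EmissionSpec {
  explicit EmissionSpec(std::string s) : spec(std::move(s)) {
    assert(!spec.empty() && "spec string must not be empty");
  }
  std::string spec;
};

using PriorSpec = std::variant<DistributionSpec, EmissionSpec>;

// The wrapper type carries what the separate bool flag carried, so the two
// cannot get out of sync.
bool is_emission(const PriorSpec& prior) {
  return std::holds_alternative<EmissionSpec>(prior);
}

// Extracts the underlying string regardless of which wrapper holds it.
const std::string& spec_string(const PriorSpec& prior) {
  return std::visit(
      [](const auto& s) -> const std::string& { return s.spec; }, prior);
}
```

Under this sketch a caller writes `PriorSpec p = EmissionSpec("sometimes_bitflip");` and `is_emission(p)` follows from the type, with no separate flag to keep in agreement.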

emilyfertig · 2024-07-15T22:41:07Z

cxx/distributions/get_distribution_test.cc

+BOOST_AUTO_TEST_CASE(test_get_bernoulli) {
+  std::mt19937 prng;
+  DistributionVariant dv = get_distribution("bernoulli", &prng);
+  BOOST_TEST(dv.index() == 0);


Enums would facilitate a more robust and readable test here, since comparing with an enum is more meaningful than comparing with a variant index (which depends on the arbitrary ordering of the types in the variant).

Upgraded the test to use run-time type information, comparing against the name of the returned type.
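
A minimal sketch of checking the held type rather than the variant index, using stand-in types. `FakeBernoulli`, `FakeNormal`, `FakeVariant`, `fake_get_distribution`, and `holds_pointer_to` are all hypothetical; the real test and DistributionVariant are not reproduced in this thread. The sketch compares `std::type_info` objects directly instead of their `name()` strings, since those strings are implementation-defined.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <typeinfo>
#include <variant>

// Stand-ins for the project's distribution classes.
struct FakeBernoulli {};
struct FakeNormal {};
using FakeVariant = std::variant<FakeBernoulli*, FakeNormal*>;

// Stand-in factory; returns pointers to static objects, so nothing is
// heap-allocated in this sketch.
FakeVariant fake_get_distribution(const std::string& name) {
  static FakeBernoulli bernoulli;
  static FakeNormal normal;
  if (name == "bernoulli") return &bernoulli;
  assert(name == "normal" && "unsupported name in this sketch");
  return &normal;
}

// True when the variant currently holds a pointer to Expected. The answer
// does not depend on where Expected* sits in the variant's type list.
template <typename Expected>
bool holds_pointer_to(const FakeVariant& v) {
  return std::visit(
      [](auto* ptr) {
        using Held = std::remove_pointer_t<decltype(ptr)>;
        return typeid(Held) == typeid(Expected);
      },
      v);
}
```

For a plain type check like this one, `std::holds_alternative<FakeBernoulli*>(v)` gives the same answer without RTTI; the `typeid` form is shown because it mirrors the approach described in the reply.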

emilyfertig · 2024-07-15T22:48:28Z

cxx/irm.cc

-      assert(false && "Unsupported observation type.");
-  }
+  std::mt19937 prng;
+  DistributionVariant dv = get_distribution(distribution_spec, &prng);


Here we're creating a distribution on the heap for the sole purpose of reading its template parameter, right? It looks like it isn't deleted again, so that's a memory leak.

This is another example of additional complexity/error-proneness caused by avoiding enums, which I don't think is worth it.

(Clarification: I support getting rid of ObservationEnum; I don't like that it's redundant with the distribution data type. As of now, I still think it's the least-bad solution we have, and we can continue to think about better ways to work around it.)

Fixed the memory leak.

With either the current or proposed approach, we are dealing with lots of raw pointers to Distributions and Emissions, and the ownership of those is rarely annotated in the code. So I think we will want to think about ways to avoid that in the future (maybe switch to Rust? :) ), but I think it is almost entirely orthogonal to using ObservationEnums or not.

More raw pointers create more bug surface than fewer raw pointers, so I don't think it's orthogonal. (I think often about switching to Rust :) )
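
On the ownership question raised in this thread, one possible direction is a variant of `std::unique_ptr`s, so that a distribution built only to inspect its sample type is destroyed automatically. This is a sketch under assumptions, not the fix that was committed: `BoolDist`, `DoubleDist`, `OwningVariant`, `make_dist`, and `sample_type_is_bool` are hypothetical names, and the instance counters exist only to make a leak observable.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <variant>

// Stand-in distribution types that count live instances.
struct BoolDist {
  static inline int live = 0;
  BoolDist() { ++live; }
  ~BoolDist() { --live; }
};
struct DoubleDist {
  static inline int live = 0;
  DoubleDist() { ++live; }
  ~DoubleDist() { --live; }
};

// Owning variant: whoever holds it owns the distribution inside it.
using OwningVariant =
    std::variant<std::unique_ptr<BoolDist>, std::unique_ptr<DoubleDist>>;

// Stand-in factory keyed by name.
OwningVariant make_dist(const std::string& name) {
  if (name == "bernoulli") return std::make_unique<BoolDist>();
  assert(name == "normal" && "unsupported name in this sketch");
  return std::make_unique<DoubleDist>();
}

// Builds a temporary distribution only to learn its sample type. The
// temporary is destroyed when `dist` goes out of scope, so nothing leaks.
bool sample_type_is_bool(const std::string& name) {
  const OwningVariant dist = make_dist(name);
  return std::holds_alternative<std::unique_ptr<BoolDist>>(dist);
}
```

The same ownership annotation would apply whether the spec is an enum or a string, which is the sense in which the two questions can be separated.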

cxx/util_io.cc

ThomasColthurst · 2024-07-16T14:56:41Z

I'm still fixing a few tests in this pull request, but this should give you enough of an idea to evaluate it.

Replaces util_distribution_variant with distributions/get_distribution and observation_variant.hh

Thanks, this is a better name/location.

Adds util_parse with parse_name_and_parameters function that both get_distribution and get_emission can use to extract names with optional parameter values.

Nice refactor.

Replaces DistributionSpec and EmissionSpec with strings.

Now having seen this PR, I'm still not on board with this change, for the reasons I listed in a comment on #81 as well as some additional reasons listed in comments below. By my count in the Colab, adding a new distribution requires modifying 5 different places in 3 different files for the solution in this PR, vs. 6 different places in 4 different files if we keep the enums (assuming we still get rid of observation_string_to_value). I don't think this is worth it. (Plus, the enums are more self-documenting in terms of where the code needs to be modified, and the compiler tells you if you add to an enum and don't handle the corresponding case.)

Thanks again for the colab. I think it nicely demonstrates the improved modularity that this pull request creates. When adding a new distribution, the natural place to change is distributions/get_distribution.{cc,hh}, and with this PR that's where almost all of the change needs to take place. (Plus a change in observation_variant.hh if and only if you need to add a new observation type.)

This approach has additional error-checking benefits as well. Right now at head, you can create a CleanRelation("R1", EmissionSpec("sometimes_bitflip"), ...) and nothing will complain, even though it does a reinterpret_cast of an Emission* to a Distribution* internally.

Replaces std::visit calls with std::get<>() when possible.

Why is this an advantage? I see this in CleanRelation; were there other places?

I think it is just in CleanRelation. It is an advantage because it is generally much clearer.

Gets rid of EmissionEnum and ObservationEnum.

Same comment as 3.

Replaces observation_string_to_value with a from_string method on Relation. As part of this, strings are converted into ObservationVariants slightly later than before: in the util_io incorporate_observation functions rather than in load_observations.

Thanks -- we're still doing the same number of string conversions, right?

Yes.

Some of the Sometimes<> and get_emission.hh changes from Add "sometimes_categorical" to get_emission #81

LG

emilyfertig · 2024-07-16T16:29:09Z

Thanks again for the colab. I think it nicely demonstrates the improved modularity that this pull request creates. When adding a new distribution, the natural place to change is distributions/get_distribution.{cc,hh}, and with this PR that's where almost all of the change needs to take place. (Plus a change in observation_variant.hh if and only if you need to add a new observation type.)

This approach has additional error-checking benefits as well. Right now at head, you can create a CleanRelation("R1", EmissionSpec("sometimes_bitflip"), ...) and nothing will complain, even though it does a reinterpret_cast of an Emission* to a Distribution* internally.

Creating a CleanRelation("R1", EmissionSpec("sometimes_bitflip"), ...) is a feature, not a bug, and in fact we rely on it in NoisyRelation (see its emission_relation member).

(As posted on chat) If we rename util_distribution_variant to get_distribution (which I think was a good change in the PR), then it looks to me like the improved modularity is that we don't have to modify irm.cc when adding a new data type? (which, if we missed, the compiler would warn us about after we modified ObservationEnum). In exchange for this, we have all of the get_distribution_from_distribution_variant functions. This still doesn't seem to me like a good trade.

Replaces std::visit calls with std::get<>() when possible.

Why is this an advantage? I see this in CleanRelation; were there other places?

I think it is just in CleanRelation. It is an advantage because it is generally much clearer.

Maybe this is subjective, but I disagree that handling all of the cases manually with std::get is cleaner than std::visit here.
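
For readers weighing the two styles being debated, a side-by-side sketch on a small stand-in variant. The `Observation` alias and both function names are hypothetical, and the CleanRelation code under discussion is not reproduced here.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Stand-in variant; the project's ObservationVariant has its own type list.
using Observation = std::variant<bool, double, int>;

// std::visit: one generic callable covers every alternative, and adding a
// type to the variant needs no change here as long as the callable accepts it.
std::string describe_with_visit(const Observation& obs) {
  return std::visit(
      [](const auto& value) { return std::to_string(value); }, obs);
}

// std::get: the caller names the alternative it expects, for example from a
// template parameter it already has.
template <typename T>
std::string describe_with_get(const Observation& obs) {
  assert(std::holds_alternative<T>(obs) && "variant holds a different type");
  return std::to_string(std::get<T>(obs));
}
```

When the expected type is already known statically, the `std::get` form is a single line; when every alternative has to be handled, `std::get` requires one branch per type, while `std::visit` keeps a single callable.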

ThomasColthurst · 2024-07-16T17:34:31Z

Thanks again for the colab. I think it nicely demonstrates the improved modularity that this pull request creates. When adding a new distribution, the natural place to change is distributions/get_distribution.{cc,hh}, and with this PR that's where almost all of the change needs to take place. (Plus a change in observation_variant.hh if and only if you need to add a new observation type.)
This approach has additional error-checking benefits as well. Right now at head, you can create a CleanRelation("R1", EmissionSpec("sometimes_bitflip"), ...) and nothing will complain, even though it does a reinterpret_cast of an Emission* to a Distribution* internally.

Creating a CleanRelation("R1", EmissionSpec("sometimes_bitflip"), ...) is a feature, not a bug, and in fact we rely on it in NoisyRelation (see its emission_relation member).

Sorry, GitHub removed the <bool> from what I meant to say, which is
CleanRelation<bool>("R1", EmissionSpec("sometimes_bitflip"), ...)
That is the object that is badly formed but raises no errors, as opposed to
CleanRelation<std::pair<bool, bool>>("R1", EmissionSpec("sometimes_bitflip"), ...)
which I agree we need to support.

(As posted on chat) If we rename util_distribution_variant to get_distribution (which I think was a good change in the PR), then it looks to me like the improved modularity is that we don't have to modify irm.cc when adding a new data type? (which, if we missed, the compiler would warn us about after we modified ObservationEnum). In exchange for this, we have all of the get_distribution_from_distribution_variant functions. This still doesn't seem to me like a good trade.

Replaces std::visit calls with std::get<>() when possible.

Why is this an advantage? I see this in CleanRelation; were there other places?

I think it is just in CleanRelation. It is an advantage because it is generally much clearer.

Maybe this is subjective, but I disagree that handling all of the cases manually with std::get is cleaner than std::visit here.

ThomasColthurst · 2024-07-16T19:40:26Z

What do you think of this latest commit, which defines a single list of sample types, and then uses some metaprogramming tricks to use that list to define ObservationVariant, DistributionVariant and EmissionVariant? That way we never have to worry about them getting out of sync.

Also, with a little more work, I'm fairly sure I can use the same list of sample types to automatically generate all of the get_distribution_from_distribution_variant functions.
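
A minimal sketch of the "single list of sample types" idea, written in plain C++17 rather than with boost::mp11 so that it stands alone. The `TypeList`, `VariantOf`, `Same`, `DistPtr`, and `EmisPtr` helpers are hypothetical, the list of sample types is illustrative, and the `Distribution` and `Emission` templates are empty stand-ins for the project's classes; how the commit actually shapes EmissionVariant is not shown in this thread.

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Empty stand-ins for the project's class templates.
template <typename T> struct Distribution {};
template <typename T> struct Emission {};

// The single source of truth: one list of sample types (illustrative).
template <typename... Ts> struct TypeList {};
using SampleTypes = TypeList<bool, double, int, std::string>;

// Small metafunctions mapping a sample type to a variant alternative.
template <typename T> struct Same { using type = T; };
template <typename T> struct DistPtr { using type = Distribution<T>*; };
template <typename T> struct EmisPtr { using type = Emission<T>*; };

// Applies metafunction F to every element of a TypeList and collects the
// results in a std::variant.
template <template <typename> class F, typename List> struct VariantOf;
template <template <typename> class F, typename... Ts>
struct VariantOf<F, TypeList<Ts...>> {
  using type = std::variant<typename F<Ts>::type...>;
};

using ObservationVariant = VariantOf<Same, SampleTypes>::type;
using DistributionVariant = VariantOf<DistPtr, SampleTypes>::type;
using EmissionVariant = VariantOf<EmisPtr, SampleTypes>::type;

// All three variants are derived from one list, so they cannot drift apart.
static_assert(std::is_same_v<ObservationVariant,
                             std::variant<bool, double, int, std::string>>);
static_assert(std::variant_size_v<DistributionVariant> ==
              std::variant_size_v<ObservationVariant>);
```

Adding a sample type then means editing `SampleTypes` only. The cost is the indirection through `VariantOf`, which is the readability tradeoff at issue in the replies.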

ThomasColthurst · 2024-07-16T20:27:24Z

The little more work is done, and now the get_distribution_from_distribution_variant functions are generated via metaprogramming.

(P.S. What little I know about metaprogramming in C++ using boost is from https://www.boost.org/doc/libs/1_85_0/libs/mp11/doc/html/simple_cxx11_metaprogramming.html)

emilyfertig · 2024-07-16T20:52:18Z

I don’t have strong feelings about this change in isolation, but I do think this is less readable and that metaprogramming here is overkill (though it’s cool to read about). There’s a lot going on in this PR now and I’m having trouble separating which changes can be made in isolation and which are dependent on one another.

Overall, my view remains that removing enums is a regression in terms of overall complexity, readability, maintainability, and bug surface. If we agree that Srinivas is the tiebreaker, he’s said a couple times that he also prefers enums, so I think we should go with that and move on. If Srinivas feels he needs more information to make a final decision, then I’m happy to discuss further.

srvasude · 2024-07-17T18:08:31Z

In general I feel the same way as Emily. To me, it feels harder to read and reason about (some of that due to the metaprogramming) in a way that makes me prefer the enum-state of the world.

ThomasColthurst added 13 commits July 11, 2024 19:06

Kill enums part 1: add get_distribution and util_parse, remove util_o…

1e53a8b

…bservation and util_distribution_variant

More updates

891a0a1

More cleanup

826b135

Add observation_variant.hh

2a6b2ab

Fix clean_relation_from_spec

201f1b6

Merge with master

d1d217d

Add from_string to Relation base class

93ae299

More fixes

deb2f1d

oh boy more template instantiation fixes

8f39db3

Drop Spec

acc6b40

Fix clean_relation::make_new_distribution

3ce10f6

Fix all build errors

8ce6522

Fix tests

4ebabe7

ThomasColthurst added 2 commits July 15, 2024 18:48

Use HIRM::get_relation instead of hack

0de818f

Fix test_misc

1514715

emilyfertig reviewed Jul 15, 2024

ThomasColthurst added 2 commits July 16, 2024 14:37

Test get_distribution by rtti name instead of index

50baf6b

Fix memory leak

03a2287

Remove commented line

0309e2e

Use list of sample types to define {Observation,Distribution,Emission…

08ea680

…}Variant

ThomasColthurst added 2 commits July 16, 2024 19:51

Use boost:mp11 to make the metaprogramming cleaner

7928f0f

Simply get_distribution_from_distribution_variant using metaprogramming

6aff849

ThomasColthurst requested a review from srvasude July 16, 2024 20:27

ThomasColthurst added 3 commits July 17, 2024 15:12

Add test for test_get_distribution_from_distribution_variant

2693059

Fix templating errors

29db256

Use DistributionSpec/EmissionSpec wrappers instead of bool flag to Cl…

f39999d

…eanRelation constructor

ThomasColthurst mentioned this pull request Jul 25, 2024

Move util_distribution_variant to distributions/get_distribution #104

Merged

ThomasColthurst closed this Jul 25, 2024
