Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup interfaces to Distributions and Emissions #87

Closed
wants to merge 24 commits into from

Conversation

ThomasColthurst
Copy link
Collaborator

I'm still fixing a few tests in this pull request, but this should give you enough of an idea to evaluate it.

  1. Replaces util_distribution_variant with distributions/get_distribution and observation_variant.hh

  2. Adds util_parse with parse_name_and_parameters function that both get_distribution and get_emission can use to extract names with optional parameter values.

  3. Replaces DistributionSpec and EmissionSpec with strings.

  4. Replaces std:visit calls with std:get<>() when possible.

  5. Gets rid of EmissionEnum and ObservationEnum.

  6. Replaces observation_string_to_value with from_string method on Relation. As part of this, strings are converted into ObservationVariant's slightly later than before -- in the util_io incorporate_observation functions rather than in load_observations.

  7. Some of the Sometimes<> and get_emission.hh changes from Add "sometimes_categorical" to get_emission #81

@ThomasColthurst
Copy link
Collaborator Author

Tests now pass.

@emilyfertig
Copy link
Collaborator

emilyfertig commented Jul 15, 2024

I'm still fixing a few tests in this pull request, but this should give you enough of an idea to evaluate it.

  1. Replaces util_distribution_variant with distributions/get_distribution and observation_variant.hh

Thanks, this is a better name/location.

  1. Adds util_parse with parse_name_and_parameters function that both get_distribution and get_emission can use to extract names with optional parameter values.

Nice refactor.

  1. Replaces DistributionSpec and EmissionSpec with strings.

Now having seen this PR, I'm still not on board with this change, for the reasons I listed in a comment on #81 as well as some additional reasons listed in comments below. By my count in the Colab , adding a new distribution requires modifying 5 different places in 3 different files for the solution in this PR, vs. 6 different places in 4 different files if we keep the enums (assuming we still get rid of observation_string_to_value). I don't think this is worth it. (Plus the enums are more self-documenting in terms of where in the code needs to be modified, and the compiler tells you if you add to an enum and don't handle the corresponding case).

  1. Replaces std:visit calls with std:get<>() when possible.

Why is this an advantage? I see this in CleanRelation, were there other places?

  1. Gets rid of EmissionEnum and ObservationEnum.

Same comment as 3.

  1. Replaces observation_string_to_value with from_string method on Relation. As part of this, strings are converted into ObservationVariant's slightly later than before -- in the util_io incorporate_observation functions rather than in load_observations.

Thanks -- we're still doing the same number of string conversions, right?

  1. Some of the Sometimes<> and get_emission.hh changes from Add "sometimes_categorical" to get_emission #81

LG

const std::variant<DistributionSpec, EmissionSpec> prior_spec;
const std::string prior_spec;
// Is the codomain an emission?
const bool codomain_is_emission;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codomain_is_emission is redundant information with ValueType and prior_spec. This now seems more bug prone, since we have to pass the matching bool when we build a CleanRelation from a spec. It also adds complexity in that previously, CleanRelation didn't need to be aware of whether its model was a distribution or emission. This is an instance of the extra complexity incurred by getting rid of enums, which I don't think is a good tradeoff.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to replacing codomain_is_emission with something else. Here are some possibilities:

  1. We resurrect DistributionSpec and EmissionSpec as just wrappers around a string. (This is clearly the same complexity as before.)

  2. If we think we will eventually need or want run time dynamism, we could allow get_emission and get_distribution to return nullptrs, which would let us run the string prior_spec through both of them to see which one succeeds.

  3. We replace CleanRelation's constructor with two static make methods: make_clean_relation_from_distribution and make_clean_relation_from_emission. This is basically Add DirichletCategorical distribution. #1 with different syntax.

  4. We make CleanRelation into a non-concrete base class, and give it two children: CleanDistributionRelation and CleanEmissionRelation. Almost all of the code except make_new_distribution still lives in CleanRelation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like 1. (For 2, it doesn't strike me that we'd need/want runtime dynamism, and for 3/4, IMO we don't want to be treating distributions and emissions separately in CleanRelation -- that adds unnecessary complexity, since CleanRelation can/does happily cast Emissions to Distributions when using them as clusters).

If we use 1, then DistributionSpec and EmissionSpec are storing strings that are used as enums, which again, IMO, should be enums.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the code to use #1.

BOOST_AUTO_TEST_CASE(test_get_bernoulli) {
std::mt19937 prng;
DistributionVariant dv = get_distribution("bernoulli", &prng);
BOOST_TEST(dv.index() == 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enums would facilitate a more robust and readable test here, since comparing with an enum is more meaningful than comparing with a variant index (which depends on the arbitrary ordering of the types in the variant).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgraded to using run time type information to test versus the name of the returned type.

cxx/irm.cc Outdated
assert(false && "Unsupported observation type.");
}
std::mt19937 prng;
DistributionVariant dv = get_distribution(distribution_spec, &prng);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we're creating a distribution on the heap for the sole purpose of reading its template parameter, right? It looks like it isn't deleted again, so that's a memory leak.

This is another example of additional complexity/error prone-ness caused by avoiding enums, which I don't think is worth it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Clarification: I support getting rid of ObservationEnum, I don't like that it's redundant with the distribution data type. As of now, I still think it's the least-bad solution we have, and we can continue to think about better ways to work around it.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the memory leak.

With either the current or proposed approach, we are dealing with lots of raw pointers to Distributions and Emissions, and the ownership of those is rarely annotated in the code. So I think we will want to think about ways to avoid that in the future (maybe switch to Rust? :) ), but I think it is almost entirely orthogonal to using ObservationEnums or not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More raw pointers creates more bug surface than fewer raw pointers, so I don't think it's orthogonal (I think often about switching to Rust :) )

cxx/util_io.cc Show resolved Hide resolved
@ThomasColthurst
Copy link
Collaborator Author

I'm still fixing a few tests in this pull request, but this should give you enough of an idea to evaluate it.

  1. Replaces util_distribution_variant with distributions/get_distribution and observation_variant.hh

Thanks, this is a better name/location.

  1. Adds util_parse with parse_name_and_parameters function that both get_distribution and get_emission can use to extract names with optional parameter values.

Nice refactor.

  1. Replaces DistributionSpec and EmissionSpec with strings.

Now having seen this PR, I'm still not on board with this change, for the reasons I listed in a comment on #81 as well as some additional reasons listed in comments below. By my count in the Colab , adding a new distribution requires modifying 5 different places in 3 different files for the solution in this PR, vs. 6 different places in 4 different files if we keep the enums (assuming we still get rid of observation_string_to_value). I don't think this is worth it. (Plus the enums are more self-documenting in terms of where in the code needs to be modified, and the compiler tells you if you add to an enum and don't handle the corresponding case).

Thanks again for the colab. I think it nicely demonstrates the improved modularity that this pull request creates. When adding a new distribution, the natural place to change is distributions/get_distribution.{cc,hh} and with this, that's where almost all of the change needs to take place. (Plus a change in observation_variant.hh if and only if you need to dd a new observation type.)

This approach has additional error-checking benefits, as well. Right now at head, you can create a CleanRelation("R1", EmissionSpec("sometimes_bitflip"), ...) and nothing will complain, despite that it is doing a reinterpret_cast of an Emission* to a Distribution* internally.

  1. Replaces std:visit calls with std:get<>() when possible.

Why is this an advantage? I see this in CleanRelation, were there other places?

I think it is just in CleanRelation. It is an advantage because it is generally much clearer.

  1. Gets rid of EmissionEnum and ObservationEnum.

Same comment as 3.

  1. Replaces observation_string_to_value with from_string method on Relation. As part of this, strings are converted into ObservationVariant's slightly later than before -- in the util_io incorporate_observation functions rather than in load_observations.

Thanks -- we're still doing the same number of string conversions, right?

Yes.

  1. Some of the Sometimes<> and get_emission.hh changes from Add "sometimes_categorical" to get_emission #81

LG

@emilyfertig
Copy link
Collaborator

Thanks again for the colab. I think it nicely demonstrates the improved modularity that this pull request creates. When adding a new distribution, the natural place to change is distributions/get_distribution.{cc,hh} and with this, that's where almost all of the change needs to take place. (Plus a change in observation_variant.hh if and only if you need to dd a new observation type.)

This approach has additional error-checking benefits, as well. Right now at head, you can create a CleanRelation("R1", EmissionSpec("sometimes_bitflip"), ...) and nothing will complain, despite that it is doing a reinterpret_cast of an Emission* to a Distribution* internally.

Creating a CleanRelation("R1", EmissionSpec("sometimes_bitflip") is a feature not a bug, and in fact we rely on it in NoisyRelation (see its emission_relation member).

(As posted on chat) If we rename util_distribution_variant to get_distribution (which I think was a good change in the PR), then it looks to me like the improved modularity is that we don't have to modify irm.cc when adding a new data type? (which, if we missed, the compiler would warn us about after we modified ObservationEnum). In exchange for this, we have all of the get_distribution_from_distribution_variant functions. This still doesn't seem to me like a good trade.

  1. Replaces std:visit calls with std:get<>() when possible.

Why is this an advantage? I see this in CleanRelation, were there other places?

I think it is just in CleanRelation. It is an advantage because it is generally much clearer.

Maybe this is subjective, but I disagree that handling all of the cases manually with std::get is cleaner than std::visit here.

@ThomasColthurst
Copy link
Collaborator Author

Thanks again for the colab. I think it nicely demonstrates the improved modularity that this pull request creates. When adding a new distribution, the natural place to change is distributions/get_distribution.{cc,hh} and with this, that's where almost all of the change needs to take place. (Plus a change in observation_variant.hh if and only if you need to dd a new observation type.)
This approach has additional error-checking benefits, as well. Right now at head, you can create a CleanRelation("R1", EmissionSpec("sometimes_bitflip"), ...) and nothing will complain, despite that it is doing a reinterpret_cast of an Emission* to a Distribution* internally.

Creating a CleanRelation("R1", EmissionSpec("sometimes_bitflip") is a feature not a bug, and in fact we rely on it in NoisyRelation (see its emission_relation member).

Sorry, github removed the bool from what I meant to say, which is
CleanRelation<bool>("R1", EmissionSpec("sometimes_bitflip"), ...)
That is the object that is badly formed but raises no errors, as opposed to
CleanRelation<std::pair<bool, bool>>("R1", EmissionSpec("sometimes_bitflip"), ...)
which I agree we need to support.

(As posted on chat) If we rename util_distribution_variant to get_distribution (which I think was a good change in the PR), then it looks to me like the improved modularity is that we don't have to modify irm.cc when adding a new data type? (which, if we missed, the compiler would warn us about after we modified ObservationEnum). In exchange for this, we have all of the get_distribution_from_distribution_variant functions. This still doesn't seem to me like a good trade.

  1. Replaces std:visit calls with std:get<>() when possible.

Why is this an advantage? I see this in CleanRelation, were there other places?

I think it is just in CleanRelation. It is an advantage because it is generally much clearer.

Maybe this is subjective, but I disagree that handling all of the cases manually with std::get is cleaner than std::visit here.

@ThomasColthurst
Copy link
Collaborator Author

What do you think of this latest commit, which defines a single list of sample types, and then uses some metaprogramming tricks to use that list to define ObservationVariant, DistributionVariant and EmissionVariant? That way we never have to worry about them getting out of sync.

Also, with a little more work, I'm fairly sure I can use the same list of sample types to automatically generate all of the get_distribution_from_distribution_variant functions.

@ThomasColthurst
Copy link
Collaborator Author

The little more work is done, and now the get_distribution_from_distribution_variant are generated via metaprogramming.

(P.S. What little I know about metaprogramming in C++ using boost is from https://www.boost.org/doc/libs/1_85_0/libs/mp11/doc/html/simple_cxx11_metaprogramming.html)

@emilyfertig
Copy link
Collaborator

I don’t have strong feelings about this change in isolation, but I do think this is less readable and that metaprogramming here is overkill (though it’s cool to read about). There’s a lot going on in this PR now and I’m having trouble separating which changes can be made in isolation and which are dependent on one another.

Overall, my view remains that removing enums is a regression in terms of overall complexity, readability, maintainability, and bug surface. If we agree that Srinivas is the tiebreaker, he’s said a couple times that he also prefers enums, so I think we should go with that and move on. If Srinivas feels he needs more information to make a final decision, then I’m happy to discuss further.

@srvasude
Copy link
Collaborator

In general I feel the same way as Emily. To me, it feels harder to read and reason about (some of that due to the metaprogramming) in a way that makes me prefer the enum-state of the world.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants