-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup interfaces to Distributions and Emissions #87
Conversation
…bservation and util_distribution_variant
Tests now pass. |
Thanks, this is a better name/location.
Nice refactor.
Now having seen this PR, I'm still not on board with this change, for the reasons I listed in a comment on #81 as well as some additional reasons listed in comments below. By my count in the Colab , adding a new distribution requires modifying 5 different places in 3 different files for the solution in this PR, vs. 6 different places in 4 different files if we keep the enums (assuming we still get rid of
Why is this an advantage? I see this in
Same comment as 3.
Thanks -- we're still doing the same number of string conversions, right?
LG |
cxx/clean_relation.hh
Outdated
const std::variant<DistributionSpec, EmissionSpec> prior_spec; | ||
const std::string prior_spec; | ||
// Is the codomain an emission? | ||
const bool codomain_is_emission; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
codomain_is_emission
is redundant information with ValueType
and prior_spec
. This now seems more bug prone, since we have to pass the matching bool when we build a CleanRelation from a spec. It also adds complexity in that previously, CleanRelation didn't need to be aware of whether its model was a distribution or emission. This is an instance of the extra complexity incurred by getting rid of enums, which I don't think is a good tradeoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to replacing codomain_is_emission with something else. Here are some possibilities:
-
We resurrect DistributionSpec and EmissionSpec as just wrappers around a string. (This is clearly the same complexity as before.)
-
If we think we will eventually need or want run time dynamism, we could allow get_emission and get_distribution to return nullptrs, which would let us run the string prior_spec through both of them to see which one succeeds.
-
We replace CleanRelation's constructor with two static make methods: make_clean_relation_from_distribution and make_clean_relation_from_emission. This is basically Add DirichletCategorical distribution. #1 with different syntax.
-
We make CleanRelation into a non-concrete base class, and give it two children: CleanDistributionRelation and CleanEmissionRelation. Almost all of the code except make_new_distribution still lives in CleanRelation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like 1
. (For 2
, it doesn't strike me that we'd need/want runtime dynamism, and for 3/4, IMO we don't want to be treating distributions and emissions separately in CleanRelation -- that adds unnecessary complexity, since CleanRelation can/does happily cast Emissions to Distributions when using them as clusters).
If we use 1
, then DistributionSpec and EmissionSpec are storing strings that are used as enums, which again, IMO, should be enums.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the code to use #1.
BOOST_AUTO_TEST_CASE(test_get_bernoulli) { | ||
std::mt19937 prng; | ||
DistributionVariant dv = get_distribution("bernoulli", &prng); | ||
BOOST_TEST(dv.index() == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enums would facilitate a more robust and readable test here, since comparing with an enum is more meaningful than comparing with a variant index (which depends on the arbitrary ordering of the types in the variant).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upgraded to using run time type information to test versus the name of the returned type.
cxx/irm.cc
Outdated
assert(false && "Unsupported observation type."); | ||
} | ||
std::mt19937 prng; | ||
DistributionVariant dv = get_distribution(distribution_spec, &prng); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we're creating a distribution on the heap for the sole purpose of reading its template parameter, right? It looks like it isn't deleted again, so that's a memory leak.
This is another example of additional complexity/error prone-ness caused by avoiding enums, which I don't think is worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Clarification: I support getting rid of ObservationEnum
, I don't like that it's redundant with the distribution data type. As of now, I still think it's the least-bad solution we have, and we can continue to think about better ways to work around it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the memory leak.
With either the current or proposed approach, we are dealing with lots of raw pointers to Distributions and Emissions, and the ownership of those is rarely annotated in the code. So I think we will want to think about ways to avoid that in the future (maybe switch to Rust? :) ), but I think it is almost entirely orthogonal to using ObservationEnums or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More raw pointers creates more bug surface than fewer raw pointers, so I don't think it's orthogonal (I think often about switching to Rust :) )
Thanks again for the colab. I think it nicely demonstrates the improved modularity that this pull request creates. When adding a new distribution, the natural place to change is distributions/get_distribution.{cc,hh} and with this, that's where almost all of the change needs to take place. (Plus a change in observation_variant.hh if and only if you need to dd a new observation type.) This approach has additional error-checking benefits, as well. Right now at head, you can create a CleanRelation("R1", EmissionSpec("sometimes_bitflip"), ...) and nothing will complain, despite that it is doing a reinterpret_cast of an Emission* to a Distribution* internally.
I think it is just in CleanRelation. It is an advantage because it is generally much clearer.
Yes.
|
Creating a (As posted on chat) If we rename util_distribution_variant to get_distribution (which I think was a good change in the PR), then it looks to me like the improved modularity is that we don't have to modify irm.cc when adding a new data type? (which, if we missed, the compiler would warn us about after we modified ObservationEnum). In exchange for this, we have all of the get_distribution_from_distribution_variant functions. This still doesn't seem to me like a good trade.
Maybe this is subjective, but I disagree that handling all of the cases manually with |
Sorry, github removed the bool from what I meant to say, which is
|
What do you think of this latest commit, which defines a single list of sample types, and then uses some metaprogramming tricks to use that list to define ObservationVariant, DistributionVariant and EmissionVariant? That way we never have to worry about them getting out of sync. Also, with a little more work, I'm fairly sure I can use the same list of sample types to automatically generate all of the get_distribution_from_distribution_variant functions. |
The little more work is done, and now the get_distribution_from_distribution_variant are generated via metaprogramming. (P.S. What little I know about metaprogramming in C++ using boost is from https://www.boost.org/doc/libs/1_85_0/libs/mp11/doc/html/simple_cxx11_metaprogramming.html) |
I don’t have strong feelings about this change in isolation, but I do think this is less readable and that metaprogramming here is overkill (though it’s cool to read about). There’s a lot going on in this PR now and I’m having trouble separating which changes can be made in isolation and which are dependent on one another. Overall, my view remains that removing enums is a regression in terms of overall complexity, readability, maintainability, and bug surface. If we agree that Srinivas is the tiebreaker, he’s said a couple times that he also prefers enums, so I think we should go with that and move on. If Srinivas feels he needs more information to make a final decision, then I’m happy to discuss further. |
In general I feel the same way as Emily. To me, it feels harder to read and reason about (some of that due to the metaprogramming) in a way that makes me prefer the enum-state of the world. |
I'm still fixing a few tests in this pull request, but this should give you enough of an idea to evaluate it.
Replaces util_distribution_variant with distributions/get_distribution and observation_variant.hh
Adds util_parse with parse_name_and_parameters function that both get_distribution and get_emission can use to extract names with optional parameter values.
Replaces DistributionSpec and EmissionSpec with strings.
Replaces std:visit calls with std:get<>() when possible.
Gets rid of EmissionEnum and ObservationEnum.
Replaces observation_string_to_value with from_string method on Relation. As part of this, strings are converted into ObservationVariant's slightly later than before -- in the util_io incorporate_observation functions rather than in load_observations.
Some of the Sometimes<> and get_emission.hh changes from Add "sometimes_categorical" to get_emission #81