-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "sometimes_categorical" to get_emission #81
Conversation
cxx/emissions/get_emission.cc
Outdated
EmissionVariant get_emission(const std::string& emission_name) { | ||
EmissionVariant get_emission( | ||
const std::string& emission_name, | ||
std::map<std::string, std::string> distribution_args) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about how this fits together with NoisyRelation
etc, I think we want to handle this earlier and read the number of categories into an emission spec (around the time the schema is parsed) instead of passing distribution_args
here, otherwise we'll have to grab distribution_args
inside of NoisyRelation
whenever we need a new emission cluster, which will substantially complicate the code. I can add an emission_args
field to EmissionSpec
in #79 (and I think it's fine to leave this code as is for now).
Edit: I see you added emission_args
to EmissionSpec
, could you rename this to emission_args
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cxx/emissions/sometimes.hh
Outdated
Sometimes(){}; | ||
// By default, Sometimes only works with BaseEmissors that assign zero | ||
// probability to dirty == clean. Pass _can_dirty_equal_clean=true to | ||
// override that assumption. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make a note that in particular, this must be set to false
when logp is a density instead of a discrete probability (otherwise we're combining densities and discrete probabilities in ways that don't make sense, e.g. on line 37, IIUC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cxx/emissions/sometimes.hh
Outdated
BetaBernoulli bb; | ||
BaseEmissor be; | ||
Emission<SampleType>* be; // We own be. | ||
bool can_dirty_equal_clean; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional: this is more readable to me when phrased as a statement (e.g. dirty_can_equal_clean
) rather than a question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cxx/emissions/sometimes.hh
Outdated
} | ||
double lp = bb.logp(false); | ||
if (can_dirty_equal_clean) { | ||
lp += bb.logp(true) + be->logp(x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want p_bb(~came_from_be) or (p_bb(came_from_be) and p_be(X=x))
, shouldn't this be log(exp(bb.logp(false)) + exp(bb.logp(true) + be->logp(x)))
? (since the mutually exclusive probabilities combined by or
should be added in probability space and not logp space)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Done.
struct DistributionSpec { | ||
DistributionEnum distribution; | ||
std::map<std::string, std::string> distribution_args = {}; | ||
}; | ||
|
||
struct EmissionSpec { | ||
EmissionEnum emission; | ||
std::string emission_name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep this an enum, consistent with DistributionSpec, so we can treat emission/distribution more consistently (which becomes convenient once NoisyRelation is integrated).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather get rid of DistributionEnum than make this an enum. Having them be enum's just means that the logic for getting a distribution or emission becomes spread out over two functions instead of concentrated in a single get_ function, and we have another place to modify (the list of enum values) when we want to add a new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we have a field that can take one of a limited set of values, that should be an enum and not a string. If we keep ObservationEnum
in #79 (which I introduced for reasons described in the PR description), then we're modifying the logic in two places anyway when we add a new distribution/emission (which seems to me like not much additional burden). I would rather keep this as an enum and continue the discussion on #79, and then maybe replace the distribution/emission enums with strings simultaneously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+Srinivas for his vote on whether the first field of EmissionSpec should be an enum or string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reasons to use enum here, to represent one of a limited set of values:
- Enum comparison is faster than string comparison when we're using this to build a cluster prior of the right type.
- Strings seem more bug prone, where an invalid value could be introduced somewhere along the line and caught only when the if/else bottoms out (this issue doesn't exist for enums).
- The compiler tells us if we failed to account for an enum value in a switch/case statement, unlike for strings.
- The (minimal IMO) extra complexity of using enums instead of strings is hidden from the user.
- If we keep
ObservationEnum
, the gain in simplicity of using string instead of enum for distributions/emissions is erased (which is why I think we should discuss this on Incorporate NoisyRelation into (H)IRM. #79, since that PR better illustrates the rationale for using enum).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also prefer enum here (for the reasons Emily mentioned).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling is that using an enum here is a premature optimization.
We have two functions here: one that takes a string and returns a spec, and another that takes a spec and returns an Emission object. We have the choice between exposing both functions to the user (= one of us programmers, the only users relevant to this discussion), or only exposing their composition. By exposing both functions, we introduce a new data type and the user now has two call two functions instead of one.
It is true that the user can cache the result of the first function and gain some small, probably insignificant speed improvement. But it is my feeling that whenever we can reduce code complexity at the cost of a little speed, we should happily do so.
Note that this way of viewing things make it clear that there is no net gain in terms of code bugy-ness. Any bugs in the composition of the two functions would still be there in one of the two. In particular, any bad string value would still cause problems when converting from the string to spec.
Regarding ObservationEnum, it appears that it is only there to support observation_string_to_value, and I think we can support that functionality in other ways. For example, adding the following lines to Distribution in distributions/base.hh:
ObservationVariant from_string(const string&s) {
SampleType t;
stringstream ss(s);
ss >> t;
return t;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ObservationEnum
is also used at the top of irm.cc
to build a CleanRelation
with the right template parameter.
If you're confident that strings are better here, I'd ask that you let me merge #79 and then you send a follow-up PR that removes DistributionSpec
and EmissionSpec
(and ObservationSpec
if you want). Considering the overall system, I don't see how to make this work nicely with strings (though I see your argument that it's nicer locally in get_emission.hh
). Then we can look at both proposals more concretely.
To add "sometimes_categorical" entailed the following three other changes included in this pull request:
The Sometimes adapter needs to handle base emission classes that don't assign probability zero to clean == dirty,
get_emission needs to take a dictionary of arguments so that (for example) we can specify the number of classes in the categorical emission, and
Sometimes needs to be changed from templating based on the base emissor's type to templating on the sample type. This is because we need to be able to pass parameters to the base emissor's constructor.