Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "sometimes_categorical" to get_emission #81

Closed
wants to merge 10 commits into from

Conversation

ThomasColthurst
Copy link
Collaborator

To add "sometimes_categorical" entailed the following three other changes included in this pull request:

  1. The Sometimes adapter needs to handle base emission classes that don't assign probability zero to clean == dirty,

  2. get_emission needs to take a dictionary of arguments so that (for example) we can specify the number of classes in the categorical emission, and

  3. Sometimes needs to be changed from templating based on the base emissor's type to templating on the sample type. This is because we need to be able to pass parameters to the base emissor's constructor.

EmissionVariant get_emission(const std::string& emission_name) {
EmissionVariant get_emission(
const std::string& emission_name,
std::map<std::string, std::string> distribution_args) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about how this fits together with NoisyRelation etc, I think we want to handle this earlier and read the number of categories into an emission spec (around the time the schema is parsed) instead of passing distribution_args here, otherwise we'll have to grab distribution_args inside of NoisyRelation whenever we need a new emission cluster, which will substantially complicate the code. I can add an emission_args field to EmissionSpec in #79 (and I think it's fine to leave this code as is for now).

Edit: I see you added emission_args to EmissionSpec, could you rename this to emission_args?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Sometimes(){};
// By default, Sometimes only works with BaseEmissors that assign zero
// probability to dirty == clean. Pass _can_dirty_equal_clean=true to
// override that assumption.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make a note that in particular, this must be set to false when logp is a density instead of a discrete probability (otherwise we're combining densities and discrete probabilities in ways that don't make sense, e.g. on line 37, IIUC)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

BetaBernoulli bb;
BaseEmissor be;
Emission<SampleType>* be; // We own be.
bool can_dirty_equal_clean;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: this is more readable to me when phrased as a statement (e.g. dirty_can_equal_clean) rather than a question

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
double lp = bb.logp(false);
if (can_dirty_equal_clean) {
lp += bb.logp(true) + be->logp(x);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want p_bb(~came_from_be) or (p_bb(came_from_be) and p_be(X=x)), shouldn't this be log(exp(bb.logp(false)) + exp(bb.logp(true) + be->logp(x)))? (since the mutually exclusive probabilities combined by or should be added in probability space and not logp space)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Done.

struct DistributionSpec {
DistributionEnum distribution;
std::map<std::string, std::string> distribution_args = {};
};

struct EmissionSpec {
EmissionEnum emission;
std::string emission_name;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep this an enum, consistent with DistributionSpec, so we can treat emission/distribution more consistently (which becomes convenient once NoisyRelation is integrated).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather get rid of DistributionEnum than make this an enum. Having them be enum's just means that the logic for getting a distribution or emission becomes spread out over two functions instead of concentrated in a single get_ function, and we have another place to modify (the list of enum values) when we want to add a new one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we have a field that can take one of a limited set of values, that should be an enum and not a string. If we keep ObservationEnum in #79 (which I introduced for reasons described in the PR description), then we're modifying the logic in two places anyway when we add a new distribution/emission (which seems to me like not much additional burden). I would rather keep this as an enum and continue the discussion on #79, and then maybe replace the distribution/emission enums with strings simultaneously.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+Srinivas for his vote on whether the first field of EmissionSpec should be an enum or string.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasons to use enum here, to represent one of a limited set of values:

  • Enum comparison is faster than string comparison when we're using this to build a cluster prior of the right type.
  • Strings seem more bug prone, where an invalid value could be introduced somewhere along the line and caught only when the if/else bottoms out (this issue doesn't exist for enums).
  • The compiler tells us if we failed to account for an enum value in a switch/case statement, unlike for strings.
  • The (minimal IMO) extra complexity of using enums instead of strings is hidden from the user.
  • If we keep ObservationEnum, the gain in simplicity of using string instead of enum for distributions/emissions is erased (which is why I think we should discuss this on Incorporate NoisyRelation into (H)IRM. #79, since that PR better illustrates the rationale for using enum).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer enum here (for the reasons Emily mentioned).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling is that using an enum here is a premature optimization.

We have two functions here: one that takes a string and returns a spec, and another that takes a spec and returns an Emission object. We have the choice between exposing both functions to the user (= one of us programmers, the only users relevant to this discussion), or only exposing their composition. By exposing both functions, we introduce a new data type and the user now has two call two functions instead of one.

It is true that the user can cache the result of the first function and gain some small, probably insignificant speed improvement. But it is my feeling that whenever we can reduce code complexity at the cost of a little speed, we should happily do so.

Note that this way of viewing things make it clear that there is no net gain in terms of code bugy-ness. Any bugs in the composition of the two functions would still be there in one of the two. In particular, any bad string value would still cause problems when converting from the string to spec.

Regarding ObservationEnum, it appears that it is only there to support observation_string_to_value, and I think we can support that functionality in other ways. For example, adding the following lines to Distribution in distributions/base.hh:

ObservationVariant from_string(const string&s) {
SampleType t;
stringstream ss(s);
ss >> t;
return t;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ObservationEnum is also used at the top of irm.cc to build a CleanRelation with the right template parameter.

If you're confident that strings are better here, I'd ask that you let me merge #79 and then you send a follow-up PR that removes DistributionSpec and EmissionSpec (and ObservationSpec if you want). Considering the overall system, I don't see how to make this work nicely with strings (though I see your argument that it's nicer locally in get_emission.hh). Then we can look at both proposals more concretely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants