Add "sometimes_categorical" to get_emission #81

ThomasColthurst · 2024-07-10T13:58:38Z

To add "sometimes_categorical" entailed the following three other changes included in this pull request:

The Sometimes adapter needs to handle base emission classes that don't assign probability zero to clean == dirty,
get_emission needs to take a dictionary of arguments so that (for example) we can specify the number of classes in the categorical emission, and
Sometimes needs to be changed from templating based on the base emissor's type to templating on the sample type. This is because we need to be able to pass parameters to the base emissor's constructor.

…ts members!

emilyfertig · 2024-07-10T17:50:17Z

cxx/emissions/get_emission.cc

-EmissionVariant get_emission(const std::string& emission_name) {
+EmissionVariant get_emission(
+    const std::string& emission_name,
+    std::map<std::string, std::string> distribution_args) {


Thinking about how this fits together with NoisyRelation etc, I think we want to handle this earlier and read the number of categories into an emission spec (around the time the schema is parsed) instead of passing distribution_args here, otherwise we'll have to grab distribution_args inside of NoisyRelation whenever we need a new emission cluster, which will substantially complicate the code. I can add an emission_args field to EmissionSpec in #79 (and I think it's fine to leave this code as is for now).

Edit: I see you added emission_args to EmissionSpec, could you rename this to emission_args?

emilyfertig · 2024-07-10T18:40:07Z

cxx/emissions/sometimes.hh

-  Sometimes(){};
+  // By default, Sometimes only works with BaseEmissors that assign zero
+  // probability to dirty == clean.  Pass _can_dirty_equal_clean=true to
+  // override that assumption.


Maybe make a note that in particular, this must be set to false when logp is a density instead of a discrete probability (otherwise we're combining densities and discrete probabilities in ways that don't make sense, e.g. on line 37, IIUC)

emilyfertig · 2024-07-10T18:40:37Z

cxx/emissions/sometimes.hh

  BetaBernoulli bb;
-  BaseEmissor be;
+  Emission<SampleType>* be;  // We own be.
+  bool can_dirty_equal_clean;


optional: this is more readable to me when phrased as a statement (e.g. dirty_can_equal_clean) rather than a question

emilyfertig · 2024-07-10T18:45:46Z

cxx/emissions/sometimes.hh

+    }
+    double lp = bb.logp(false);
+    if (can_dirty_equal_clean) {
+      lp += bb.logp(true) + be->logp(x);


If we want p_bb(~came_from_be) or (p_bb(came_from_be) and p_be(X=x)), shouldn't this be log(exp(bb.logp(false)) + exp(bb.logp(true) + be->logp(x)))? (since the mutually exclusive probabilities combined by or should be added in probability space and not logp space)

Good catch! Done.

emilyfertig · 2024-07-10T18:57:19Z

cxx/util_distribution_variant.hh

 struct DistributionSpec {
  DistributionEnum distribution;
  std::map<std::string, std::string> distribution_args = {};
 };

 struct EmissionSpec {
-  EmissionEnum emission;
+  std::string emission_name;


I think we should keep this an enum, consistent with DistributionSpec, so we can treat emission/distribution more consistently (which becomes convenient once NoisyRelation is integrated).

I'd rather get rid of DistributionEnum than make this an enum. Having them be enum's just means that the logic for getting a distribution or emission becomes spread out over two functions instead of concentrated in a single get_ function, and we have another place to modify (the list of enum values) when we want to add a new one.

I think if we have a field that can take one of a limited set of values, that should be an enum and not a string. If we keep ObservationEnum in #79 (which I introduced for reasons described in the PR description), then we're modifying the logic in two places anyway when we add a new distribution/emission (which seems to me like not much additional burden). I would rather keep this as an enum and continue the discussion on #79, and then maybe replace the distribution/emission enums with strings simultaneously.

+Srinivas for his vote on whether the first field of EmissionSpec should be an enum or string.

Reasons to use enum here, to represent one of a limited set of values:

Enum comparison is faster than string comparison when we're using this to build a cluster prior of the right type.

Strings seem more bug prone, where an invalid value could be introduced somewhere along the line and caught only when the if/else bottoms out (this issue doesn't exist for enums).

The compiler tells us if we failed to account for an enum value in a switch/case statement, unlike for strings.

The (minimal IMO) extra complexity of using enums instead of strings is hidden from the user.

If we keep ObservationEnum, the gain in simplicity of using string instead of enum for distributions/emissions is erased (which is why I think we should discuss this on Incorporate NoisyRelation into (H)IRM. #79, since that PR better illustrates the rationale for using enum).

I also prefer enum here (for the reasons Emily mentioned).

My feeling is that using an enum here is a premature optimization.

We have two functions here: one that takes a string and returns a spec, and another that takes a spec and returns an Emission object. We have the choice between exposing both functions to the user (= one of us programmers, the only users relevant to this discussion), or only exposing their composition. By exposing both functions, we introduce a new data type and the user now has two call two functions instead of one.

It is true that the user can cache the result of the first function and gain some small, probably insignificant speed improvement. But it is my feeling that whenever we can reduce code complexity at the cost of a little speed, we should happily do so.

Note that this way of viewing things make it clear that there is no net gain in terms of code bugy-ness. Any bugs in the composition of the two functions would still be there in one of the two. In particular, any bad string value would still cause problems when converting from the string to spec.

Regarding ObservationEnum, it appears that it is only there to support observation_string_to_value, and I think we can support that functionality in other ways. For example, adding the following lines to Distribution in distributions/base.hh:

ObservationVariant from_string(const string&s) {
SampleType t;
stringstream ss(s);
ss >> t;
return t;
}

ObservationEnum is also used at the top of irm.cc to build a CleanRelation with the right template parameter.

If you're confident that strings are better here, I'd ask that you let me merge #79 and then you send a follow-up PR that removes DistributionSpec and EmissionSpec (and ObservationSpec if you want). Considering the overall system, I don't see how to make this work nicely with strings (though I see your argument that it's nicer locally in get_emission.hh). Then we can look at both proposals more concretely.

ThomasColthurst added 7 commits July 8, 2024 17:07

Add can_dirty_equal_clean option to Sometimes

9014abb

Fix merge conflict

11ebb55

Fix cmath inclusion

d1818f9

Merge branch 'master' into 070824-thomaswc-sometimes-same

177789b

Add sometimes_categorical to get_emission

3ccc1ac

Build and test fixes

cef8528

Sometimes::incorporate must pass weight when calling incorporate on i…

26bbc4d

…ts members!

ThomasColthurst requested a review from emilyfertig July 10, 2024 13:58

Fix build errors

91e994b

ThomasColthurst mentioned this pull request Jul 10, 2024

Add NoisyRelation and a Relation virtual class. #74

Merged

emilyfertig reviewed Jul 10, 2024

View reviewed changes

ThomasColthurst added 2 commits July 10, 2024 19:31

Address reviewer comments

d28222e

Fix merge conflict

e007af1

emilyfertig approved these changes Jul 10, 2024

View reviewed changes

ThomasColthurst requested a review from srvasude July 10, 2024 20:54

srvasude approved these changes Jul 11, 2024

View reviewed changes

ThomasColthurst mentioned this pull request Jul 15, 2024

Cleanup interfaces to Distributions and Emissions #87

Closed

ThomasColthurst closed this Jul 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add "sometimes_categorical" to get_emission #81

Add "sometimes_categorical" to get_emission #81

ThomasColthurst commented Jul 10, 2024

emilyfertig Jul 10, 2024

ThomasColthurst Jul 10, 2024

emilyfertig Jul 10, 2024

ThomasColthurst Jul 10, 2024

emilyfertig Jul 10, 2024

ThomasColthurst Jul 10, 2024

emilyfertig Jul 10, 2024

ThomasColthurst Jul 10, 2024

emilyfertig Jul 10, 2024

ThomasColthurst Jul 10, 2024

emilyfertig Jul 10, 2024

ThomasColthurst Jul 10, 2024

emilyfertig Jul 10, 2024

srvasude Jul 10, 2024

ThomasColthurst Jul 11, 2024

emilyfertig Jul 11, 2024

Add "sometimes_categorical" to get_emission #81

Add "sometimes_categorical" to get_emission #81

Conversation

ThomasColthurst commented Jul 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment