Add StringCat distribution for a categorical distribution over a finite set of strings #73

ThomasColthurst · 2024-06-28T20:23:43Z

My thought is that even if the data is stored in the file as "0", "1", "2", etc., this distribution class plus a simple string emission class would do a good job of modelling typos.

emilyfertig

Thanks, this looks good!

emilyfertig · 2024-06-28T22:22:23Z

cxx/distributions/stringcat.cc

+#include "distributions/stringcat.hh"
+
+int StringCat::string_to_index(const std::string& s) const {
+  auto it = std::find(strings.begin(), strings.end(), s);


WDYT of instead building a map<string, int> in the ctor for faster lookup?

I thought about this, but my expectation is that most of the time, the number of strings in the class will be small enough that the speed difference will be minimal. And saving space by not creating a map isn't entirely inconsequential when running on large datasets. (Keeping in mind that the number of distinct values in a column isn't the same as the number of rows -- I'm saying that the first will probably be small, but the second might be large. And because we cluster rows and create Distributions per cluster, that drives the number of instances of this class that get instantiated. In fact, we might want to consider designs where this class doesn't store its own copy of the vector of strings, but that's a pull request for another day.)
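For concreteness, here is a rough sketch of the two lookup strategies discussed in this thread. It is not the code from this PR: the free functions `linear_index`, `build_index`, and `map_index` are hypothetical names, and returning -1 for a missing string is an assumption, since the diff excerpt above does not show how `string_to_index` handles that case.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Linear scan with std::find, as in the excerpt above. No extra memory per
// instance; cost is O(n) string comparisons per lookup.
// Returning -1 for a missing string is an assumption made for this sketch.
int linear_index(const std::vector<std::string>& strings,
                 const std::string& s) {
  auto it = std::find(strings.begin(), strings.end(), s);
  if (it == strings.end()) return -1;
  return static_cast<int>(it - strings.begin());
}

// The reviewer's alternative: build a map<string, int> once (e.g. in the
// constructor). Lookups become O(log n), but every instance then holds a
// second copy of each string plus the map's node overhead.
std::map<std::string, int> build_index(
    const std::vector<std::string>& strings) {
  std::map<std::string, int> index;
  for (int i = 0; i < static_cast<int>(strings.size()); ++i) {
    index.emplace(strings[i], i);  // emplace keeps the first index on duplicates
  }
  return index;
}

int map_index(const std::map<std::string, int>& index, const std::string& s) {
  auto it = index.find(s);
  return it == index.end() ? -1 : it->second;
}
```

With only a handful of distinct strings, the linear scan touches a few short strings per call, which is the case argued for above. The map mainly pays off when a column has many distinct values.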

emilyfertig · 2024-06-28T22:29:59Z

cxx/util_distribution_variant.cc

@@ -5,13 +5,7 @@

 #include <cassert>
 #include <sstream>
-
-#include "distributions/beta_bernoulli.hh"


I looked this up the other day and the style guide says to include these: https://engdoc.corp.google.com/eng/doc/devguide/cpp/styleguide.md?cl=head#Include_What_You_Use (IMO we might as well follow that but I don't feel strongly)

emilyfertig · 2024-06-28T22:38:51Z

cxx/distributions/stringcat_test.cc

+#include <boost/test/included/unit_test.hpp>
+namespace tt = boost::test_tools;
+
+BOOST_AUTO_TEST_CASE(test_simple) {


Could you call sample in the test somewhere?
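One possible shape for such a check, sketched without Boost so that it stands alone: draw repeatedly and verify that every sample is one of the known strings. `ToyStringCat` below is a hypothetical stand-in, not the `StringCat` class from this PR; its fields, the fixed weights, and the `sample` signature are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Hypothetical stand-in for the PR's class: a categorical distribution over
// strings with fixed, non-negative weights (assumed, not taken from the PR).
struct ToyStringCat {
  std::vector<std::string> strings;
  std::vector<double> weights;

  // Draws an index in proportion to the weights and returns its string.
  std::string sample(std::mt19937* prng) const {
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return strings[dist(*prng)];
  }
};

// Returns true if n draws all land inside the distribution's support.
bool all_samples_in_support(const ToyStringCat& sc, int n, unsigned seed) {
  std::mt19937 prng(seed);
  for (int i = 0; i < n; ++i) {
    const std::string s = sc.sample(&prng);
    if (std::find(sc.strings.begin(), sc.strings.end(), s) ==
        sc.strings.end()) {
      return false;
    }
  }
  return true;
}
```

In the Boost test file, the same idea would presumably reduce to a loop around `sample` with a `BOOST_TEST` on membership.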

ThomasColthurst added 2 commits June 28, 2024 17:56

Add StringCat distribution

72e1995

Fix test

8a86769

ThomasColthurst requested a review from emilyfertig June 28, 2024 20:23

emilyfertig approved these changes Jun 28, 2024

ThomasColthurst added 2 commits June 29, 2024 00:15

Add sample test

a9a15ce

Include what you use for util_distribution_variant.cc

cb72995

ThomasColthurst merged commit 4fc2bad into master Jun 30, 2024
1 of 2 checks passed

ThomasColthurst deleted the 062824-thomaswc-stringcat branch June 30, 2024 16:13

ThomasColthurst commented Jun 28, 2024

emilyfertig left a comment

emilyfertig Jun 28, 2024

ThomasColthurst Jun 29, 2024

emilyfertig Jun 28, 2024

ThomasColthurst Jun 30, 2024

emilyfertig Jun 28, 2024

ThomasColthurst Jun 29, 2024
