-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add StringCat distribution for a categorical distribution over a finite set of strings #73
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks good!
#include "distributions/stringcat.hh" | ||
|
||
int StringCat::string_to_index(const std::string& s) const { | ||
auto it = std::find(strings.begin(), strings.end(), s); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT of instead building a map<string, int>
in the ctor for faster lookup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this, but my expectation is that most of the time, the number of strings in the class will be small enough that the speed difference will be minimal. And saving space by not creating a map isn't entirely inconsequential when running on large datasets. (Keeping in mind that the number of distinct values in a column isn't the same as the number of rows -- I'm saying that the first will probably be small, but the second might be large. And because we cluster rows and create Distributions per cluster, that drives the number of instances of this class that get instantiated. In fact, we might want to consider designs where this class doesn't store its own copy of the vector of strings, but that's a pull request for another day.)
@@ -5,13 +5,7 @@ | |||
|
|||
#include <cassert> | |||
#include <sstream> | |||
|
|||
#include "distributions/beta_bernoulli.hh" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked this up the other day and the style guide says to include these: https://engdoc.corp.google.com/eng/doc/devguide/cpp/styleguide.md?cl=head#Include_What_You_Use (IMO we might as well follow that but I don't feel strongly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
#include <boost/test/included/unit_test.hpp> | ||
namespace tt = boost::test_tools; | ||
|
||
BOOST_AUTO_TEST_CASE(test_simple) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you call sample
in the test somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
My thought is that even if the data is stored in file as "0", "1", "2", etc., this distribution class plus a simple string emission class would do a good job of modelling typos.