Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add StringCat distribution for a categorical distribution over a finite set of strings #73

Merged
merged 4 commits into from
Jun 30, 2024

Conversation

ThomasColthurst
Copy link
Collaborator

My thought is that even if the data is stored in file as "0", "1", "2", etc., this distribution class plus a simple string emission class would do a good job of modelling typos.

Copy link
Collaborator

@emilyfertig emilyfertig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good!

#include "distributions/stringcat.hh"

int StringCat::string_to_index(const std::string& s) const {
auto it = std::find(strings.begin(), strings.end(), s);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT of instead building a map<string, int> in the ctor for faster lookup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but my expectation is that most of the time, the number of strings in the class will be small enough that the speed difference will be minimal. And saving space by not creating a map isn't entirely inconsequential when running on large datasets. (Keeping in mind that the number of distinct values in a column isn't the same as the number of rows -- I'm saying that the first will probably be small, but the second might be large. And because we cluster rows and create Distributions per cluster, that drives the number of instances of this class that get instantiated. In fact, we might want to consider designs where this class doesn't store its own copy of the vector of strings, but that's a pull request for another day.)

@@ -5,13 +5,7 @@

#include <cassert>
#include <sstream>

#include "distributions/beta_bernoulli.hh"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked this up the other day and the style guide says to include these: https://engdoc.corp.google.com/eng/doc/devguide/cpp/styleguide.md?cl=head#Include_What_You_Use (IMO we might as well follow that but I don't feel strongly)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#include <boost/test/included/unit_test.hpp>
namespace tt = boost::test_tools;

BOOST_AUTO_TEST_CASE(test_simple) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you call sample in the test somewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ThomasColthurst ThomasColthurst merged commit 4fc2bad into master Jun 30, 2024
1 of 2 checks passed
@ThomasColthurst ThomasColthurst deleted the 062824-thomaswc-stringcat branch June 30, 2024 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants