Add Synthetic CrossCat datasets #175
@@ -0,0 +1,4 @@
col1 ~ bernoulli(id)
Doesn't the relation name here have to match the second column of the .obs file? Here it is "col1" but the .obs file uses "has_col1".
Yeah this was a bad push. Fixed here and elsewhere.
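A mismatch like this could be caught mechanically before review. The following is a minimal sketch, not part of this PR: the helper names are hypothetical, and it assumes schema lines have the form `name ~ distribution(...)` and that .obs lines are delimited fields with the relation name in the second column (whitespace by default; pass `sep=","` if the files are comma-separated).

```python
def schema_relations(schema_text):
    """Collect the relation names declared to the left of '~' in a schema."""
    names = set()
    for line in schema_text.splitlines():
        if "~" not in line:
            continue  # skip blank lines and anything that is not a declaration
        names.add(line.split("~", 1)[0].strip())
    return names


def obs_relations(obs_text, sep=None):
    """Collect the relation names found in the second column of an .obs file.

    Assumption: fields are separated by `sep` (None means any whitespace).
    """
    names = set()
    for line in obs_text.splitlines():
        fields = [f.strip() for f in line.split(sep)]
        if len(fields) >= 2:
            names.add(fields[1])
    return names


def mismatched_relations(schema_text, obs_text, sep=None):
    """Return (observed but not declared, declared but never observed)."""
    declared = schema_relations(schema_text)
    observed = obs_relations(obs_text, sep)
    return sorted(observed - declared), sorted(declared - observed)
```

For the files as originally pushed, this would report "has_col1" as observed but undeclared and "col1" as declared but never observed.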
@@ -0,0 +1,4 @@
col1 ~ stringcat[strings="a:b:c:d",delim=:](id)
In addition to the "has_" issue, there also appears to be an off-by-one error. Here, "col1" is the relation declared with a/b/c/d values, but in the .obs file it is "has_col0" that has those values.
The same applies to the other relations.
Fixed as well.
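An off-by-one like this shows up as values that fall outside a relation's declared string set. The sketch below is illustrative only: the function names are hypothetical, the regular expression covers just the `stringcat[strings="...",delim=X]` form shown in the diff, and it assumes each .obs line carries the observed value in the first column and the relation name in the second.

```python
import re

# Matches declarations like: has_col1 ~ stringcat[strings="a:b:c:d",delim=:](id)
STRINGCAT = re.compile(r'^\s*(\w+)\s*~\s*stringcat\[strings="([^"]*)",delim=(.)\]')


def stringcat_vocab(schema_text):
    """Map each stringcat relation name to its set of allowed strings."""
    vocab = {}
    for line in schema_text.splitlines():
        match = STRINGCAT.match(line)
        if match:
            name, strings, delim = match.groups()
            vocab[name] = set(strings.split(delim))
    return vocab


def out_of_vocab(schema_text, obs_text, sep=None):
    """List (relation, value) pairs whose value is not in the declared strings.

    Assumption: column 1 is the value, column 2 is the relation name.
    Relations that are not stringcat are ignored.
    """
    vocab = stringcat_vocab(schema_text)
    bad = []
    for line in obs_text.splitlines():
        fields = [f.strip() for f in line.split(sep)]
        if len(fields) < 2:
            continue
        value, relation = fields[0], fields[1]
        if relation in vocab and value not in vocab[relation]:
            bad.append((relation, value))
    return bad
```

If the schema and the .obs file are shifted by one column, the stringcat relation receives another column's values and they appear in the returned list.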
NUM_SAMPLES_1 = 33
NUM_SAMPLES_2 = 50
NUM_SAMPLES = 100
What do you think about adding some tests for this program?
I'm adding integration tests in a later PR that will exercise these files and check some basic invariants (like creating two IRMs, etc.).
Add four synthetic CrossCat datasets for each of the data types. Will add unit tests that verify that we get the appropriate clusterings with these later.