Copied (and adapted) existing k-means test cases to new k-means #3
base: master
@huor It seems that |
@huor Are there assumptions about the names in |
The algorithms/methods defined in algorithms.xml and the datasets defined in dataset.xml are used in kmeans.xml.
The information in dataset.xml and kmeans.xml is not redundant. Each dataset defined in dataset.xml is a kind of algorithm-parameter combination, which makes it easier for us to write cases in kmeans.xml.
So, you are saying that, e.g., the fact that test suite |
Regarding the naming conventions: Yes, I see that some names are references. But I am wondering whether you also assume that names are composed of different substrings. Do you ever concatenate strings and then assume that there is a certain algorithm/test suite/method with that composed name?
</list_parameter>
<list_parameter>
<name>dist_metric</name>
<value>squared_dist_norm1</value><value>squared_dist_norm2</value><value>squared_angle</value><value>squared_tanimoto</value>
I found that I had to explicitly include the madlib schema, e.g.: madlib.squared_dist_norm1
(Similar change had to be made in the algorithmspec.xml file for madlib.avg)
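A minimal sketch of the schema qualification described above, assuming the metric names are available as plain strings. The helper name `qualify` and its default schema argument are hypothetical and not part of MADmark; the sketch only illustrates prefixing unqualified names with `madlib.`.

```python
def qualify(name, schema="madlib"):
    """Prefix a function name with the schema unless it is already qualified.

    Hypothetical helper for illustration; not taken from MADmark.
    """
    return name if "." in name else f"{schema}.{name}"


# Values copied from the dist_metric list parameter in the diff above.
dist_metrics = [
    "squared_dist_norm1",
    "squared_dist_norm2",
    "squared_angle",
    "squared_tanimoto",
]

qualified = [qualify(m) for m in dist_metrics]
for name in qualified:
    print(name)
```

Names that already carry a schema, such as madlib.avg, pass through unchanged.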
The MADmark Installation Guide that Jiali sent explains everything pretty well. Here's a summary of how to run select test cases:
After updating the 3 XML files accordingly, run 'cd $MADMARK_HOME/bin; python run.py -g' to generate all your test cases. So, if in kmeans.xml you have a test_suite named "kmeans_baseline" and a total of 4 combinations of parameters to test, you'll be generating four case files in $MADMARK_HOME/testcase:
To run select test cases:
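To illustrate how a suite with list parameters can expand into four cases, here is a rough sketch. It assumes that cases are the Cartesian product of the list-parameter values; that is an assumption about the generator, not something confirmed in this thread. The parameter `k` and the function `expand` are hypothetical, and the sketch does not reproduce how MADmark names or writes its case files.

```python
from itertools import product

# Hypothetical list parameters for one test suite; names and values are
# illustrative only ("k" does not appear in the discussion above).
list_parameters = {
    "dist_metric": ["madlib.squared_dist_norm1", "madlib.squared_dist_norm2"],
    "k": [2, 4],
}


def expand(params):
    """Return one dict per combination of list-parameter values."""
    names = sorted(params)
    value_lists = [params[n] for n in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


cases = expand(list_parameters)
for index, case in enumerate(cases):
    print(index, case)
```

With two parameters of two values each, the expansion yields four cases, which matches the count in the "kmeans_baseline" example.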
…ml, to track changes compared to old kmeans test cases.
…d. Now use CTAS workaround to be compatible with older versions of Greenplum.
, {src_col_data} -- expr_point
, (SELECT centroids FROM {table_name}) -- centroids
, {dist_metric} -- fn_dist
);
I think we still need the "DROP TABLE {table_name};" in order to be able to run different tests using the same dataset (otherwise, after using the dataset table_name once, all other tests that use the same dataset ERROR out since the table_name already exists).
I think a teardown section should be used for that. This is what I did.
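The effect of a teardown step can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for Greenplum, and the table names `points` and `km_result` as well as the function `run_case` are hypothetical. It shows that dropping the result table after each case lets a second case reuse the same table name instead of failing because the table already exists.

```python
import sqlite3

# In-memory SQLite database standing in for the real test database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL)")
conn.executemany("INSERT INTO points VALUES (?)", [(1.0,), (2.0,)])


def run_case(conn, table_name):
    """Create a result table (CTAS), read it, and always drop it afterwards."""
    try:
        conn.execute(
            f"CREATE TABLE {table_name} AS SELECT avg(x) AS centroid FROM points"
        )
        rows = conn.execute(f"SELECT centroid FROM {table_name}").fetchall()
        return rows[0][0]
    finally:
        # Teardown: runs even if the case fails, so the next case can
        # create a table with the same name.
        conn.execute(f"DROP TABLE IF EXISTS {table_name}")


first = run_case(conn, "km_result")
second = run_case(conn, "km_result")  # would fail without the teardown
print(first, second)
```

Putting the drop in a separate teardown step, rather than inline in the test SQL, keeps the cleanup in one place and makes it run even when the test body errors out.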
…a pair. Added declarations for kmeanspp_seeding and kmeans_random_seeding.
More complex arguments may contain quotes, e.g., ARRAY['madlib.squared_dist_norm2','madlib.dist_norm2']. Previously, quotes did not pass through the shell invocation but caused errors.
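One way such quoting problems are commonly avoided is sketched below. This is not the change made in this commit, which is not shown here. The sketch assumes the argument is either passed as a separate list element (so no shell parses it) or escaped with `shlex.quote` before being placed in a shell command string. The command name `some_command` is a placeholder.

```python
import shlex
import subprocess
import sys

# Argument containing single quotes, taken from the commit message above.
arg = "ARRAY['madlib.squared_dist_norm2','madlib.dist_norm2']"

# Option 1: pass arguments as a list. No shell is involved, so the quotes
# reach the child process unchanged. The child here is just Python echoing
# its first argument.
echoed = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", arg],
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()

# Option 2: if a shell command string is unavoidable, escape the argument.
quoted = shlex.quote(arg)
command_line = f"some_command -c {quoted}"  # placeholder command name
# shlex.split parses the string like a POSIX shell would.
roundtrip = shlex.split(command_line)

print(echoed)
print(command_line)
print(roundtrip[-1] == arg)
```

The list form is usually the simpler choice because it sidesteps shell parsing entirely.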
This is still work in progress. Pull request opened to facilitate discussion. Please only merge when requested.