Copied (and adapted) existing k-means test cases to new k-means #3

Open
wants to merge 5 commits into
base: master

Member

fschopp commented Sep 24, 2012

This is still work in progress. Pull request opened to facilitate discussion. Please only merge when requested.

Member Author

fschopp commented Sep 24, 2012

@huor It seems that dataset.xml and kmeans.xml (and probably the casespecs of other modules) contain redundant information. Is it really necessary to list a method in dataset.xml?

Member Author

fschopp commented Sep 25, 2012

@huor Are there assumptions about the names in dataset.xml, algorithms.xml, and the casespecs (such as kmeans.xml)? For instance, is a test_suite assumed to be named <method>_<name>?

Member

huor commented Sep 25, 2012

The algorithms/methods defined in algorithms.xml and the datasets defined in dataset.xml are used in kmeans.xml.
The algorithms/methods defined in algorithms.xml can also be used in dataset.xml.

Member

huor commented Sep 25, 2012

The information in dataset.xml and kmeans.xml is not redundant. Each dataset defined in dataset.xml is a kind of algorithm-parameter combination, which makes it easier for us to write cases in kmeans.xml.

Member Author

fschopp commented Sep 25, 2012

So, you are saying that, e.g., the fact that test suite kmeans_cset_negative_src_relation contains init_cset_rel and that dataset.xml contains the same parameter is an oversight and not necessary?

Member Author

fschopp commented Sep 25, 2012

Regarding the naming conventions: Yes, I see that some names are references. But I am wondering whether you also make assumptions that names are composed of different substrings. Do you ever concatenate strings and then assume that there is a certain algorithm/test suite/method with that composed name?

</list_parameter>
<list_parameter>
<name>dist_metric</name>
<value>squared_dist_norm1</value><value>squared_dist_norm2</value><value>squared_angle</value><value>squared_tanimoto</value>
Contributor

I found that I had to explicitly include the madlib schema, e.g.: madlib.squared_dist_norm1
(A similar change had to be made in the algorithmspec.xml file for madlib.avg.)
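
To illustrate the kind of change described above, here is a minimal Python sketch that prefixes function names with a schema unless they are already qualified. The helper name `qualify` and its default `schema` argument are assumptions made for this example; they are not part of MADmark or MADlib.

```python
def qualify(name, schema="madlib"):
    """Return name prefixed with schema, unless it is already schema-qualified."""
    # A dot in the name is taken to mean a schema is already present.
    return name if "." in name else f"{schema}.{name}"


# The four dist_metric values from the XML fragment under review.
dist_metrics = [
    "squared_dist_norm1",
    "squared_dist_norm2",
    "squared_angle",
    "squared_tanimoto",
]
qualified = [qualify(metric) for metric in dist_metrics]
print(qualified)
```

In the XML files themselves the qualified names would simply be written out by hand; the sketch only shows the naming rule.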

Contributor

geeg commented Sep 27, 2012

The MADmark Installation Guide that Jiali sent explains everything pretty well. Here's a summary of how to run select test cases:

After updating the 3 XML files accordingly, run 'cd $MADMARK_HOME/bin; python run.py -g' to generate all your test cases. So, if in kmeans.xml, you have a test_suite named "kmeans_baseline" and you have a total of 4 combinations of parameters to test, you'll be generating four case files in $MADMARK_HOME/testcase:
kmeans_baseline_0_0
kmeans_baseline_0_1
kmeans_baseline_0_2
kmeans_baseline_0_3
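
The naming pattern in this example can be sketched in Python as follows. This is only an illustration of the file names listed above, not MADmark's actual generator: the function `case_file_names` is hypothetical, and the meaning of the middle index (always 0 in the example) is not explained in this thread, so it is treated as a fixed value here.

```python
def case_file_names(test_suite, n_combinations, middle_index=0):
    """Build case file names of the form <test_suite>_<middle_index>_<i>."""
    # middle_index is an assumption: the thread only shows the value 0.
    return [f"{test_suite}_{middle_index}_{i}" for i in range(n_combinations)]


names = case_file_names("kmeans_baseline", 4)
for name in names:
    print(name)
```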

To run select test cases:

  1. Create a .yaml file in $MADMARK_HOME/schedule, e.g. $MADMARK_HOME/schedule/example.yaml
  2. Add relevant case files that you want to run, e.g. (in example.yaml):
    cases : case_example
    platform : GPDB42
  3. In the case file (which should also be in the $MADMARK_HOME/schedule directory), include whichever test cases you want to run, e.g. (in case_example):
    kmeans_baseline_0_0
    kmeans_baseline_0_1
    kmeans_baseline_0_2
    kmeans_baseline_0_3
  4. Run 'cd $MADMARK_HOME/bin; python run.py -s example.yaml'
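
Steps 1 to 3 can also be scripted. The sketch below writes the two files with the contents shown above; it is a convenience example, not part of MADmark. The function `write_schedule` is hypothetical, and the example writes into a temporary directory rather than `$MADMARK_HOME/schedule` so that it can be run safely anywhere.

```python
import tempfile
from pathlib import Path


def write_schedule(schedule_dir, yaml_name, case_file, platform, cases):
    """Write a case file and the .yaml schedule file that refers to it."""
    schedule_dir = Path(schedule_dir)
    schedule_dir.mkdir(parents=True, exist_ok=True)
    # Case file: one test case name per line (step 3).
    (schedule_dir / case_file).write_text("\n".join(cases) + "\n")
    # Schedule file: the two "key : value" lines from step 2.
    yaml_path = schedule_dir / yaml_name
    yaml_path.write_text(f"cases : {case_file}\nplatform : {platform}\n")
    return yaml_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_schedule(
        tmp,
        "example.yaml",
        "case_example",
        "GPDB42",
        [f"kmeans_baseline_0_{i}" for i in range(4)],
    )
    schedule_text = path.read_text()
    case_text = (Path(tmp) / "case_example").read_text()

print(schedule_text)
print(case_text)
```

To use it for real, pass the schedule directory under `$MADMARK_HOME` instead of the temporary directory, then run step 4 as described.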

Florian Schoppmann added 3 commits September 28, 2012 10:59
…ml, to

track changes compared to old kmeans test cases.
…d. Now

use CTAS workaround to be compatible with older versions of Greenplum.
, {src_col_data} -- expr_point
, (SELECT centroids FROM {table_name}) -- centroids
, {dist_metric} -- fn_dist
);
Contributor

I think we still need the "DROP TABLE {table_name};" in order to be able to run different tests using the same dataset (otherwise, after using the dataset table_name once, all other tests that use the same dataset ERROR out since the table_name already exists).

Member Author

I think a teardown section should be used for that. This is what I did.
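
The idea of dropping the table in a teardown step, so that several tests can reuse the same table name, can be sketched as follows. This is a stand-in under clear assumptions: it uses Python's built-in `sqlite3` instead of Greenplum, the table name `kmeans_out` and the function `run_case` are hypothetical, and it does not show the syntax of a MADmark teardown section.

```python
import sqlite3


def run_case(conn, table_name):
    """Create the output table, run a placeholder test, then always drop the table."""
    # Setup: this fails if a previous case left the table behind.
    conn.execute(f"CREATE TABLE {table_name} (centroids TEXT)")
    try:
        # Placeholder for the actual test body.
        conn.execute(f"INSERT INTO {table_name} VALUES ('placeholder')")
        return conn.execute(f"SELECT count(*) FROM {table_name}").fetchone()[0]
    finally:
        # Teardown: runs even if the test body raises.
        conn.execute(f"DROP TABLE {table_name}")


conn = sqlite3.connect(":memory:")
# Two cases reuse the same table name without a "table already exists" error.
results = [run_case(conn, "kmeans_out") for _ in range(2)]
print(results)
conn.close()
```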

Florian Schoppmann added 2 commits October 2, 2012 15:17
…a pair.

Added declarations for kmeanspp_seeding and kmeans_random_seeding.
More complex arguments may contain quotes, e.g.,
ARRAY['madlib.squared_dist_norm2','madlib.dist_norm2']. Previously, quotes did
not pass through the shell invocation but caused errors.
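
As a general illustration of the quoting problem (the commit's actual fix is not shown here), the Python sketch below demonstrates two common ways to get an argument containing single quotes through a command invocation intact: quoting it with `shlex.quote` for use in a shell command line, or bypassing the shell by passing an argument list to `subprocess.run`. The child command used here is just the Python interpreter echoing its argument; it stands in for whatever program MADmark invokes.

```python
import shlex
import subprocess
import sys

arg = "ARRAY['madlib.squared_dist_norm2','madlib.dist_norm2']"

# Option 1: quote the argument so a POSIX shell passes it through unchanged.
quoted = shlex.quote(arg)
round_trip = shlex.split(quoted)  # what a shell would hand to the program

# Option 2: avoid the shell entirely by passing a list of arguments.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", arg],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.strip()

print(round_trip[0] == arg, received == arg)
```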

4 participants