
Remove compute time from known metafeatures in tests #169

Open

bjschoenfeld opened this issue Apr 4, 2019 · 7 comments

Comments

@bjschoenfeld
Member

Do we use these values for anything? They bury our pull requests in lots of irrelevant changes.

@epeters3
Contributor

@bjschoenfeld which files are storing these compute times, i.e. which files are accumulating lots of irrelevant changes?

@epeters3
Contributor

Is it the files at metalearn/test/data/dataset_metafeatures?

It looks like in our run_tests.py file, the line that updates the test metafeatures is commented out by default. That would seem to mean that if we update the code related to computing metafeatures and then run our tests, the tests won't use up-to-date metafeatures computed by the modified code, and they will give misleading results, since they ran on stale metafeatures.

Since the metafeatures are fast to compute on our test datasets, it seems like it would be nice to by default recompute all the metafeatures each time the tests are run, and to gitignore the test metafeature files. I feel like that would simultaneously solve this issue and eliminate the risk of running tests on stale metafeatures.
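For concreteness, the pattern I'm describing would look roughly like this (a hypothetical sketch only; the helper name is made up, and this is not the actual contents of run_tests.py):

```python
# Hypothetical sketch of the run_tests.py pattern described above.
import unittest

def update_known_metafeatures():
    # Stand-in for the step that regenerates the known-metafeature
    # files under metalearn/test/data/; the name is made up.
    pass

if __name__ == "__main__":
    # The regeneration step is commented out by default, so test runs
    # compare against whatever known values are already on disk.
    # update_known_metafeatures()
    unittest.main()
```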

@emrysshevek
Contributor

Some of those pre-computed metafeatures (the ones for the small datasets) have been checked by hand to make sure they are correct. If we updated our metafeatures every time we ran the tests, we wouldn't be sure whether we were comparing against correct values or against values affected by some of the changes.

@bjschoenfeld
Member Author

> it would be nice to by default recompute all the metafeatures each time the tests are run

This is what we are doing. We compare those results against "known" values in static files, i.e. the "stale" metafeatures. If there are differences between the computed and the known values, the tests fail. This allows us to make changes to the code and check whether there are any material differences in how we compute metafeatures.
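In other words, the pattern is roughly the following (a sketch with a made-up file name and a stub helper, just to illustrate the compare-against-known idea):

```python
import json
import unittest

def compute_metafeatures_for_test_datasets():
    # Stand-in for recomputing metafeatures on the small test datasets;
    # a made-up stub used only to illustrate the comparison.
    return {"NumberOfInstances": {"value": 150}}

class KnownMetafeaturesTest(unittest.TestCase):
    def test_computed_matches_known(self):
        # Hypothetical file name; the real known-value files live under
        # metalearn/test/data/dataset_metafeatures (per the thread above).
        with open("metalearn/test/data/known_metafeatures.json") as f:
            known = json.load(f)
        self.assertEqual(compute_metafeatures_for_test_datasets(), known)
```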

It may be the case that we find a bug in how we compute a particular metafeature, in which case the "known" value is wrong and should be updated with the new value. It is in this case that we run a little script to update the file containing the known metafeatures. When we do this, the compute times for all metafeatures (not just the bug-fixed one) get updated as well. If we do not store the compute times in the known-metafeatures files, those irrelevant diffs go away.
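A minimal sketch of what dropping the times could look like, assuming each metafeature in the known files maps to a dict holding a value and a compute time (the JSON layout and key names here are assumptions, not necessarily our actual format):

```python
import json

def strip_compute_times(path):
    # Assumed layout: {"MetafeatureName": {"value": ..., "compute_time": ...}}.
    with open(path) as f:
        metafeatures = json.load(f)
    for entry in metafeatures.values():
        entry.pop("compute_time", None)  # keep only the value
    with open(path, "w") as f:
        json.dump(metafeatures, f, indent=4, sort_keys=True)
```

Running something like this once over each known-metafeatures file (or folding it into the update script) would keep the timing noise out of future diffs.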

@emrysshevek
Contributor

emrysshevek commented Aug 15, 2019

Is it worth it to add a parameter to `compute` saying whether or not to include the compute times? Then we could just use that when comparing metafeature values.

To that end (though off-topic for this issue), should we include some code to test that our timing mechanisms are working properly? AFAIK, that aspect of our package is completely untested right now.
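Something like this, maybe (a sketch of the idea only; the real change would be a parameter on the package's compute method, and the internals here are assumptions):

```python
import time

def compute_metafeatures(X, Y, metafeature_fns, return_times=True):
    # Free-function sketch of the suggested API; metafeature_fns stands in
    # for however the package actually maps names to metafeature functions.
    results = {}
    for name, fn in metafeature_fns.items():
        start = time.time()
        value = fn(X, Y)
        elapsed = time.time() - start
        results[name] = {"value": value}
        if return_times:
            # Timing is attached only on request, so comparisons against
            # known values can run without timing noise.
            results[name]["compute_time"] = elapsed
    return results
```

A timing test could then assert that, with return_times=True, every entry carries a non-negative compute_time, and with return_times=False, no timing keys appear at all.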

@bjschoenfeld
Member Author

I think this issue combined with those suggestions would make a great PR. :)

@emrysshevek
Contributor

Ok, then I'm thinking of combining this issue with adding a `return_times` param, adding tests for our timing, and addressing #196.
