Should the 1 billion row file be deterministic? #35

Open
datdenkikniet opened this issue Jan 3, 2024 · 5 comments

Comments

@datdenkikniet

datdenkikniet commented Jan 3, 2024

Currently it seems that the 1 billion row file is generated from an unseeded random source, so every run produces a different file. Making the generation deterministic (pseudorandom with a fixed seed) would make sharing the 1 billion row file a little easier, since it would always be the same, and would ensure that everyone is running exactly the same test.

Just using a Random with a predefined seed to pick out stations, and seeding a Random with the hash code of the city name to obtain measurements should do the trick.
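A minimal sketch of the idea, not the project's actual generator: one `Random` with a fixed seed picks the station for each row, and each station gets its own `Random` seeded with the hash code of its name. The class name `DeterministicMeasurements`, the `Station` record, the seed value `42`, and the standard deviation of `10.0` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DeterministicMeasurements {

    // Hypothetical fixed seed: any constant works, as long as everyone uses the same one.
    private static final long STATION_SEED = 42L;

    // Illustrative station type: a name and a mean temperature.
    record Station(String name, double meanTemperature) {}

    static List<String> generate(List<Station> stations, int rows) {
        // A single seeded Random decides which station each row belongs to.
        Random picker = new Random(STATION_SEED);

        // One Random per station, seeded from the station name.
        // String.hashCode() is defined by the Java API, so the seed does not vary between JVMs.
        Random[] perStation = new Random[stations.size()];
        for (int i = 0; i < perStation.length; i++) {
            perStation[i] = new Random(stations.get(i).name().hashCode());
        }

        List<String> lines = new ArrayList<>(rows);
        for (int r = 0; r < rows; r++) {
            int idx = picker.nextInt(stations.size());
            Station station = stations.get(idx);
            // Assumed distribution: Gaussian around the station mean, rounded to one decimal.
            double value = station.meanTemperature() + perStation[idx].nextGaussian() * 10.0;
            double rounded = Math.round(value * 10.0) / 10.0;
            lines.add(station.name() + ";" + rounded);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Station> stations = List.of(
                new Station("Hamburg", 9.7),
                new Station("Bulawayo", 18.9),
                new Station("Palembang", 27.3));
        // Running this twice should print the same five lines both times.
        generate(stations, 5).forEach(System.out::println);
    }
}
```

Because both seeds are fixed by the code and the station list, two people running the same generator with the same row count should end up with identical output, provided the station list is iterated in the same order.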

@gunnarmorling
Owner

PR welcome for this change to the generator. Note that I am already using the same measurements.txt file for evaluating all entries, i.e. fairness is ensured.

@datdenkikniet
Author

datdenkikniet commented Jan 5, 2024

I've now opened a PR that adds this functionality in #149. It also lays a little groundwork to hopefully make #125 easier to use generically, by moving WeatherStation into its own class.

@mtopolnik
Contributor

The evaluation shouldn't use a public test file because that allows the contenders to tightly optimize for the exact keyset in that file. For example, tweaking the hash function to minimize collisions, having special cases for some keys, sizing everything exactly right for the keyset, etc.
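To illustrate the first of those tricks, here is a rough sketch of how a known keyset could be exploited: with the exact keys in hand, a contender can search offline for a hash parameter that produces no collisions at all for that keyset. The class `KeysetTunedHash`, the hash function, and the sample keys are hypothetical and not taken from any actual entry.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeysetTunedHash {

    // Simple multiplicative string hash, parameterised by a multiplier and
    // reduced to a slot in a power-of-two sized table.
    static int slot(String key, int multiplier, int tableSize) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = h * multiplier + key.charAt(i);
        }
        h ^= (h >>> 16);
        return h & (tableSize - 1);
    }

    // Tries odd multipliers in order and returns the first one that maps every
    // key to a distinct slot, or -1 if none is found up to maxMultiplier.
    static int findCollisionFreeMultiplier(List<String> keys, int tableSize, int maxMultiplier) {
        for (int m = 3; m <= maxMultiplier; m += 2) {
            Set<Integer> used = new HashSet<>();
            boolean collisionFree = true;
            for (String key : keys) {
                if (!used.add(slot(key, m, tableSize))) {
                    collisionFree = false;
                    break;
                }
            }
            if (collisionFree) {
                return m;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for the station names found in a public test file.
        List<String> keys = List.of("Hamburg", "Bulawayo", "Palembang", "St. John's", "Cracow");
        int multiplier = findCollisionFreeMultiplier(keys, 16, 10_001);
        System.out.println("collision-free multiplier: " + multiplier);
    }
}
```

A multiplier found this way is only guaranteed to be collision-free for the keys it was tuned on, which is why evaluating against a file with an unpublished keyset removes the advantage.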

@mtopolnik
Contributor

If there's concern that some solution may just get unlucky with a given keyset, the winner can be determined by repeating the test with 2-3 different test files. I very much doubt that this would be a factor, given the large keyset size (10,000); more noise can be expected from all the environmental factors on the test machine.

@datdenkikniet
Author

datdenkikniet commented Jan 5, 2024

You are absolutely correct, and I agree! I do not think that the current test file should be shared or changed. I am asking for determinism so that it becomes much easier to run and compare against the 1 billion row files that other contestants are using, without having to transfer the entire data file.
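One possible way to check that two contestants really have the same file without sending it: each side regenerates the file locally and compares a short digest. This is only a sketch; the class name `FileFingerprint` and the choice of SHA-256 are assumptions, not something the project prescribes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class FileFingerprint {

    // Streams the file through SHA-256 so that a multi-gigabyte file never
    // has to fit in memory, and returns the digest as lowercase hex.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return HexFormat.of().formatHex(digest.digest());
    }

    public static void main(String[] args) throws Exception {
        // Pass the path of the generated file as the first argument.
        System.out.println(sha256(Path.of(args[0])));
    }
}
```

If the generator is deterministic, two people using the same generator version and row count should see the same digest; a mismatch means the files differ somewhere.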


3 participants