Should the 1 billion row file be deterministic? #35

datdenkikniet · 2024-01-03T11:50:30Z

Currently it seems that the 1 billion row file is generated randomly. Making the generation deterministic (pseudorandom with a fixed seed) would make sharing the 1 billion row file a little easier, since every run would produce the same file, and would make sure that everyone is running exactly the same test.

Just using a Random with a predefined seed to pick out stations, and seeding a Random with the hash code of the city name to obtain measurements should do the trick.
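To make the idea concrete, here is a minimal sketch of the two-level seeding described above. It is not the project's actual generator: the class name `DeterministicGenerator`, the seed value, the five station names, the value range, and the `name;value` line format are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not the real generator from the repository.
final class DeterministicGenerator {

    // Assumed fixed seed; any constant works as long as everyone uses the same one.
    static final long STATION_SEED = 42L;

    // Assumed station names, for illustration only.
    static final String[] STATIONS = {
        "Hamburg", "Bulawayo", "Palembang", "St. John's", "Cracow"
    };

    static List<String> generate(int rows) {
        // One Random with a predefined seed decides which station each row uses.
        Random stationPicker = new Random(STATION_SEED);

        // One Random per station, seeded with the hash code of its name,
        // produces that station's measurements.
        Random[] measurementSources = new Random[STATIONS.length];
        for (int i = 0; i < STATIONS.length; i++) {
            measurementSources[i] = new Random(STATIONS[i].hashCode());
        }

        List<String> lines = new ArrayList<>(rows);
        for (int r = 0; r < rows; r++) {
            int idx = stationPicker.nextInt(STATIONS.length);
            // Assumed range: -99.9 to 99.9, stored as tenths of a degree.
            int tenths = measurementSources[idx].nextInt(1999) - 999;
            lines.add(STATIONS[idx] + ";" + formatTenths(tenths));
        }
        return lines;
    }

    // Formats tenths as a one-decimal number without depending on the locale.
    static String formatTenths(int tenths) {
        String sign = tenths < 0 ? "-" : "";
        int abs = Math.abs(tenths);
        return sign + (abs / 10) + "." + (abs % 10);
    }

    public static void main(String[] args) {
        // Running this twice, on any machine, should print the same rows.
        generate(5).forEach(System.out::println);
    }
}
```

Both `String.hashCode()` and the algorithm behind `java.util.Random` are specified by the Java platform, so the same seeds should yield the same sequence on any conforming JVM. The output only stays identical as long as the station list and the order of the `Random` calls do not change.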


gunnarmorling · 2024-01-03T13:41:53Z

PR welcome for this change to the generator. Note that I am already using the same measurements.txt file for evaluating all entries, i.e. fairness is ensured.

datdenkikniet · 2024-01-05T20:54:22Z

I've now opened a PR that adds this functionality in #149. It also lays a bit of groundwork that will hopefully make #125 easier to use generically, by moving WeatherStation into its own class.

mtopolnik · 2024-01-05T21:07:40Z

The evaluation shouldn't use a public test file because that allows the contenders to tightly optimize for the exact keyset in that file. For example, tweaking the hash function to minimize collisions, having special cases for some keys, sizing everything exactly right for the keyset, etc.

mtopolnik · 2024-01-05T21:11:09Z

If there's concern that some solution may just get unlucky with a given keyset, the winner can be determined by repeating the test with 2-3 different test files. I very much doubt that this would be a factor, given the large keyset size (10,000); more noise can be expected from all the environmental factors on the test machine.

datdenkikniet · 2024-01-05T21:12:15Z

You are absolutely correct, and I agree! I do not think that the current test file should be shared or changed. I am asking for determinism so that it becomes a lot easier to compare and run on the 1 billion row files that other contestants are using, without having to transfer the entire data file.

datdenkikniet mentioned this issue Jan 5, 2024

Make generation of measurements file deterministic #149

Open
