Search on a single thread is non-deterministic in 0.29-rc0 with T78 nets #1769
Comments
If you (or someone else) build Lc0 from sources, it would be interesting to |
Is it repeatable with a T80 net? |
Actually it is not! With T80 net everything is fine - I get the same output every time. Can someone confirm this strange behaviour with T78? |
Thanks, this is very useful information. T78 nets use a new part of the CUDA backend, so we now know where to look. |
This may be related to the observation that if we create a transformer out of encoders only, the search goes totally wrong, and differently on each run. The issue is mitigated with mbs=1, which suggests it is related to memory allocation when the encoder layers are evaluated. The issue also seems to be solved in the Ceres backend by this commit: dje-dev/Ceres@da03904. However, that solution does not translate to lc0, imo mostly because cudagraph solves the issue by allocating the tensor memory automatically. |
@diceydust can you check whether #1773 makes T78 results deterministic? |
Since #1773 is now merged, a test can also be done using master. |
Guys, I've no access to my home PC these days. Perhaps someone could provide Lc0-master binaries? Or maybe releasing 29 rc1 is an idea. |
You can find the current master lc0 binary with cuda backend in https://ci.appveyor.com/api/buildjobs/l4tti40aktxcp0g6/artifacts/build%2Flc0.exe |
I have checked it. And the issue is still there. Attaching log files. |
Just my 2 cents. It's definitely a thing. Yesterday I left an unfinished analysis in my files, today I went back to the same position but by taking a different "path" through the alternatives (because some of them I had already analyzed), and when I reached the exact same position, the top move this time was different. This is not great, I'd say. |
@diceydust can you check again now that v0.30.0-rc1 is out? The related code was significantly revamped for the release. |
Just reporting that the issue is still there (in v30.0). I've checked the latest binary with the t1-768x15x24h-swa-4000000.pb.gz net. I did two runs (go depth 100000) from the initial position. After the first run I got: And after the second run:
|
Even with the number of threads set to 1, the search process of lc0 is non-deterministic, meaning that every time I analyze the same position, the output is different in terms of number of visits etc. For example: I have done many runs of 0.29-rc0 (784010 net) from the starting position with the following options:
And after 'go nodes 100000' most often the engine returns d2d4 as the best move. However there were runs where e2e4 was chosen. I attach two log files - two runs from the same position returning different best moves.
I'm pretty sure this is not expected behavior, therefore reporting this as a bug.
1.txt
2.txt
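For anyone who wants to compare such logs quickly, here is a minimal sketch in Python. It is not part of lc0. It assumes each log is a plain-text capture of the UCI session and contains the standard UCI `bestmove <move>` line somewhere in it. The function names `final_bestmove` and `same_bestmove` are hypothetical helpers made up for this illustration.

```python
import re

# Assumption: the log contains standard UCI output, where the engine's
# final answer looks like "bestmove d2d4" or "bestmove d2d4 ponder g8f6".
# The pattern is not anchored to line start, so prefixed log lines also match.
BESTMOVE_RE = re.compile(r"\bbestmove\s+(\S+)")


def final_bestmove(log_text):
    """Return the last bestmove reported in a UCI log, or None if absent."""
    matches = BESTMOVE_RE.findall(log_text)
    return matches[-1] if matches else None


def same_bestmove(log_a, log_b):
    """True only if both logs contain a bestmove and the moves are equal."""
    move_a = final_bestmove(log_a)
    move_b = final_bestmove(log_b)
    return move_a is not None and move_a == move_b
```

To use it on the attachments, read both files (for example with `open("1.txt").read()` and `open("2.txt").read()`) and pass the contents to `same_bestmove`. This only checks the chosen move; a stricter determinism check would also compare the visit counts in the `info` lines.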