Rare crashes with -fno-trapping-math #585

Open
Aloril opened this issue Dec 14, 2018 · 10 comments

Comments

@Aloril
Contributor

Aloril commented Dec 14, 2018

Works fine:
CC=clang-6.0 CXX=clang++-6.0 ./build.sh

Crashes (uses gcc 5.4.0):
./build.sh
./build/release/lc0 --weights=networks/weights_test35b10-35001.pb.gz
position fen rnbqkb1r/1ppppppp/5n2/p7/8/3PB3/PPP1PPPP/RN1QKBNR w KQkq - 0 1 moves g1f3 d7d6 b1c3 e7e5 g2g3 b8c6 f1g2 d6d5 e1g1 d5d4 a2a3 d4e3 f1e1 a5a4 h2h4 e5e4 b2b4 e4f3 d1d2 e3d2 e2e3 b7b6 g3g4 b6b5 g1h1 g7g6 e1d1 g6g5 h4h5 h7h6 h1g1 f3g2 f2f3 c6d4
go movetime 1000000000 nodes 1

Another test case:
./build/release/lc0 --weights=networks/weights_test35b10-35081.pb.gz
position fen r1bqkbnr/pppppp1p/n5p1/8/6P1/5N2/PPPPPP1P/RNBQKB1R w KQkq - 0 1 moves d2d4 d7d5 h1g1
go movetime 1000000000 nodes 1

It also works fine with GCC if build.sh is modified as follows:
if [ -f ${BUILDDIR}/build.ninja ]
then
meson configure ${BUILDDIR} -Db_lto=false --buildtype ${BUILDTYPE} --prefix ${INSTALL_PREFIX:-/usr/local} "$@"
else
meson ${BUILDDIR} -Db_lto=false --buildtype ${BUILDTYPE} --prefix ${INSTALL_PREFIX:-/usr/local} "$@"
fi

@borg323
Member

borg323 commented Dec 14, 2018

This was the crash @Aloril reported on Discord:
Thread 38 "lczero20d1" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff81fff700 (LWP 127980)]
0x0000000000469f06 in void std::__insertion_sort<__gnu_cxx::__normal_iterator<lczero::EdgeAndNode*, std::vector<lczero::EdgeAndNode, std::allocator<lczero::EdgeAndNode> > >, __gnu_cxx::__ops::_Iter_comp_iter<lczero::Search::GetVerboseStats(lczero::Node*, bool) const::{lambda(lczero::EdgeAndNode, lczero::EdgeAndNode)#1}> >(__gnu_cxx::__normal_iterator<lczero::EdgeAndNode*, std::vector<lczero::EdgeAndNode, std::allocator<lczero::EdgeAndNode> > >, __gnu_cxx::__ops::_Iter_comp_iter<lczero::Search::GetVerboseStats(lczero::Node*, bool) const::{lambda(lczero::EdgeAndNode, lczero::EdgeAndNode)#1}>, __gnu_cxx::__ops::_Iter_comp_iter<lczero::Search::GetVerboseStats(lczero::Node*, bool) const::{lambda(lczero::EdgeAndNode, lczero::EdgeAndNode)#1}>) [clone .lto_priv.922] ()

@borg323
Member

borg323 commented Dec 17, 2018

@Aloril can you check which binutils version is installed, and whether you can reproduce the issue using ./build.sh -Ddefault_library=static?

@Aloril
Contributor Author

Aloril commented Dec 18, 2018

binutils: 2.26.1-1ubuntu1~16.04.7
It crashes with the above build command too.

@borg323
Member

borg323 commented Dec 28, 2018

This happened to me as well with gcc 5.5 in fixed-node tests, also under Ubuntu 16.04 with the same binutils version, while gcc 8.1 works fine. I made #625 to disable lto builds for now.

@sergiovieri

Native C++ compiler: c++ (gcc 5.4.0 "c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609")
Segmentation fault when using lto. It crashes instantly with fixed nodes in uci mode, but takes a while in selfplay.

@mooskagh added the "release blocker" (Bugs which block releases) label Sep 10, 2019
@borg323
Member

borg323 commented Sep 10, 2019

The lto breakage was due to -ffast-math. I did some bisecting: it was introduced in 1a5f95f (#580) and fixed in d2bc747 (#661). Adding -ffast-math to current master still produces broken lto executables with gcc 5.5, and broken regular (non-lto) executables with clang (as mentioned in #661).

@borg323
Member

borg323 commented Sep 10, 2019

The root cause is -fno-trapping-math (included in -ffast-math) with both gcc and clang. The most probable explanation is a (rare) division by zero.
Using /fp:fast with msvc seems to be safe (but more tests would be nice).

@mooskagh removed the "release blocker" (Bugs which block releases) label Nov 18, 2019
@Naphthalin
Contributor

Is there any known issue with any gcc/clang compiler we still support since we switched to C++17?

@borg323 changed the title from "lto build on Ubuntu 16.04 using GCC broken, works using clang" to "Rare crashes with -fno-trapping-math" Apr 28, 2020
@borg323
Member

borg323 commented Apr 28, 2020

The underlying issue is still there. I've changed the title to reflect the real issue.

@Naphthalin
Contributor

Is this issue still active, @borg323?
