Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sporadic test-waic failure (possibly only on MacOS) #1368

Open
paciorek opened this issue Dec 13, 2023 · 12 comments · Fixed by #1393
Open

sporadic test-waic failure (possibly only on MacOS) #1368

paciorek opened this issue Dec 13, 2023 · 12 comments · Fixed by #1393
Assignees
Labels

Comments

@paciorek
Copy link
Contributor

paciorek commented Dec 13, 2023

We're seeing this test failure, possibly only on MacOS:

══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Error ('test-waic.R:585:5'): standalone offline WAIC works ───────────────
Error in `FUN(X[[i]], ...)`: no such symbol new_modelValuesConf_UID_172 in package /private/var/folders/3s/vfzpb5r51gs6y328rmlgzm7c0000gn/T/RtmpSNASp7/nimble_generatedCode/P_12_waicClass_offline_12_12_21_42_16.so
Backtrace:

I was able to reproduce this (fairly reliably but not every time I ran the specific test) on my laptop. Here's what seems to be going on:

  • It comes in one of the repeated calls to calculateWAIC in lines 562-572, often line 567 or line 572.
  • the 172 label is in the .cpp/.h file. It is also 172 in the interface function in R. However the symbol in the .so file is an "old" one, with a label like 170 from one of the previous compilations involved in the repeated calls to calculateWAIC.
  • When I look carefully at the compilation process, I see that in the cases where calculateWAIC fails, the P_<ID>_waicClass_offline.o file is not being regenerated. Only the clang++ call that creates the .so file from the .o file is invoked, not the clang++ call that creates the .o file from the .cpp file. As a result the old 170 symbol is in the .so file. [see below for more details]
  • We compile using R CMD SHLIB, which apparently uses make (I haven't looked into details).
  • I'm wondering if there is some weirdness in file timestamps or something else going on with make that causes it to think that the .cpp/.h files haven't changed and therefore that the .o file doesn't need to be regenerated, even though ls -l clearly shows that the .o file is older than the .cpp/.h files.

Here's what we expect our system2 call to R CMD SHLIB to do:

clang++ -arch arm64 -std=gnu++17 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -DR_NO_REMAP   -DEIGEN_MPL2_ONLY=1 -I"/private/tmp/nim-102/nimble/include" -Wno-misleading-indentation -Wno-ignored-attributes -Wno-deprecated-declarations  -I/opt/R/arm64/include    -fPIC  -falign-functions=64 -Wall -g -O2  -c P_11_waicClass_offline.cpp -o P_11_waicClass_offline.o
clang++ -arch arm64 -std=gnu++17 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -L/Library/Frameworks/R.framework/Resources/lib -L/opt/R/arm64/lib -o P_11_waicClass_offline_12_12_14_04_13.so P_11_waicClass_offline.o -L/private/tmp/nim-102/nimble/CppCode -lnimble -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

As indicated above only the second clang++ call occurs in the cases when the error occurs, despite the SHLIBargs in cppProjectClass$compileFile being the same.

Not sure we want to go this route but using R CMD SHLIB --preclean seems to avoid the issue, though since it is sporadic, I can't 100% guarantee it. But in many repeated tries, it no longer had the error. Or perhaps we would generally want to use R CMD SHLIB --clean with the thought that we don't need the various .o files anyway so cleaning them up could make good sense generally.

I'm hoping that because this seems to occur with repeated compilations in this specific situation where the same .cpp/.h/.o file names are used repeatedly that it is not something users would encounter. I don't think we've gotten similar reports though I have a vague feeling we've seen "no such symbol" report(s) before.

Before I do anything more with this, I'll plan to discuss with @perrydv .

@perrydv
Copy link
Contributor

perrydv commented Jan 5, 2024

@paciorek I have spent a while digging into this. I'm as baffled as you are. From a shell I am using symbols P_1_waicClass_offline.o | grep modelValues to see the relevant symbols in the latest compiled result. This also works with a .so file as input. I can see the state that occurs when, after our R CMD SHLIB call, the "new_modelValuesConf_UID_XX" symbol has XX (some number from our counter) that is one behind what is in the .h and .cpp files. Essentially this is just finding a way to look more directly, from outside of R, and see the mismatched state where the .o has evidently not been updated.

Also nimbleOptions(pauseAfterWritingFiles=TRUE) has been helpful in this digging.

I did reproduce this problem on 1.0.1 (current release). I am wondering if it has to do with changes in R or OS X tool chains, unlikely as that seems. I do have ccache running, but we are seeing this on your system and @danielturek 's system and the CI build systems, so unless they all have ccache running, that doesn't seem likely to be the culprit.

I like the "--preclean" option. It would be easy to insert. It would be a big change in that it would touch every compilation call we make. But it also makes sense.

What do you think?

@paciorek
Copy link
Contributor Author

paciorek commented Jan 5, 2024

Yeah, I think your debugging/digging steps were similar to mine.

I'm a bit wary of making the --preclean option universal given we haven't heard any reports of this apart from our testing, though I agree it makes sense so maybe it would just be a good thing to do.

One thought is that we set up a nimble option to cause --preclean to be invoked but have it default to FALSE. For testing we can set it to TRUE (perhaps just for the problematic waic testing but perhaps more broadly) and it's in our back pocket in case we start seeing reports of the problem from users.

@perrydv
Copy link
Contributor

perrydv commented Jan 5, 2024

Here is a sporadically reproducible nearly-minimal example entirely outside of nimble.

As file1.cpp:

#include "file1.h"

SEXP foo_1() {
  Rprintf("hello world\n");
  return R_NilValue;
}

As file1.h:

#define R_NO_REMAP
#include "R.h"
#include "Rinternals.h"
#include "Rdefines.h"

extern "C" SEXP foo_1();

As file2.cpp:

#include "file2.h"

SEXP foo_2() {
  return R_NilValue;
}

as file2.h

#include "R.h"
#include "Rinternals.h"
#include "Rdefines.h"

extern "C" SEXP foo_2();

Then from R, run:

system2("rm", "test*.*")

system2("cp", c("file1.cpp","test.cpp"))
system2("R", c("CMD SHLIB", "test.cpp", "-o test1.so"))
system2("symbols", c("test1.so", "| grep foo"))
dll1 <- dyn.load("test1.so", local=TRUE)
getNativeSymbolInfo("foo_1", dll1) # should be valid

system2("cp", c("file2.cpp","test.cpp"))
system2("R", c("CMD SHLIB", "test.cpp", "-o test2.so"))
system2("symbols", c("test2.so", "| grep foo"))
dll2 <- dyn.load("test2.so", local=TRUE)
getNativeSymbolInfo("foo_2", dll2) # should be valid, sometimes isn't
getNativeSymbolInfo("foo_1", dll2) # should not be valid, sometimes is

The .h and .cpp files might not be fully minimal yet, but close.

@paciorek
Copy link
Contributor Author

paciorek commented Jan 5, 2024

Nice - extracting this out of the nimble context is a big step forward.

One thought here is that I could run this by our SCF sysadmin (Ryan Lovett).

I'm also going to try running on my new M2 Mac and see what happens.

@paciorek
Copy link
Contributor Author

paciorek commented Jan 5, 2024

Ok, I've reproduced this on my M2 Mac.

If I insert a 3-second sleep before the file2.cpp compilation, everything seems fine. So it feels like a timestamp kind of thing.

@paciorek
Copy link
Contributor Author

paciorek commented Jan 6, 2024

I think the following reproduces things using make.

Here's the Makefile:

test.o: test.cpp
	clang++ -arch arm64 -std=gnu++17 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I/opt/R/arm64/include    -fPIC  -falign-functions=64 -Wall -g -O2  -c test.cpp -o test.o

test2.so: test.o
	clang++ -arch arm64 -std=gnu++17 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -L/Library/Frameworks/R.framework/Resources/lib -L/opt/R/arm64/lib -o test2.so test.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

Here's the reprex:

rm test*.*

cp file1.cpp test.cpp
R CMD SHLIB test.cpp -o test1.so
symbols test1.so | grep foo

cp file2.cpp test.cpp
make test2.so
symbols test2.so | grep foo

@paciorek
Copy link
Contributor Author

paciorek commented Jan 6, 2024

Actually, I no longer think the sleep changes things.
However if I manually edit test.cpp instead of using cp then things seem fine.

Ryan is also flummoxed.

@paciorek
Copy link
Contributor Author

paciorek commented Jan 6, 2024

And one more thing -- I can reproduce the problem on Linux.

@paciorek
Copy link
Contributor Author

paciorek commented Jan 6, 2024

I think I've changed my mind and agree with the idea of using --preclean by default given we seem to have isolated this to strange behavior of make.

@perrydv
Copy link
Contributor

perrydv commented Jan 8, 2024

Here's more experimental evidence that the issue has to do with timestamps. The following code runs 100 iterations of the copy-compile scheme I set up above. For each one, it records whether the second compilation was properly done (the .so contains foo_2) or was not done (the .so still contains foo_1). If the Sys.sleep(1) is commented-out, in about 1/3 of iterations (33/100), the result is wrong (compilation not done).

When I extract timestamps using stat, we see that, in all cases when the timestamp of test.cpp (after cp file2.cp test.cpp) is tied with that of the previous test.o, the compilation is not updated (wrong). On the other hand, in many (but not all) cases when those two timestamps do not match, the compilation is updated (right). I speculate that the imperfect match of the latter is due to a "quantum" problem where the steps of the commands take bits of time themselves so observing them is imperfect. But it could just as well be that I am not sure if I am looking at the right one of the available times. I can't find much easy information from search results on getting it below the precision of whole seconds.

When Sys.sleep(1) is uncommented below, the problem never occurs.

I am somewhat baffled that the timing mechanisms used for builds can have this kind of problem/phenomenon. It does seem that --preclean may be the intended way to ensure we force compilation.

stat_test_cpp <- character(100)
stat_test_o <- character(100)
found_foo1 <- logical(100)
found_foo2 <- logical(100)
for(i in 1:100) {
  system2("rm", "test*.*")
  dyn.unload("test1.so")
  dyn.unload("test2.so")

  system2("cp", c("file1.cpp","test.cpp"))
  system2("R", c("CMD SHLIB", "test.cpp", "-o test1.so"))
  dll1 <- dyn.load("test1.so", local=TRUE)

#  Sys.sleep(1.0)
  system2("cp", c("file2.cpp","test.cpp"))
  system2("R", c("CMD SHLIB", "test.cpp", "-o test2.so"))
  dll2 <- dyn.load("test2.so", local=TRUE)
  find_foo2 <- try(getNativeSymbolInfo("foo_2", dll2)) # should be valid, sometimes isn't
  find_foo1 <- try(getNativeSymbolInfo("foo_1", dll2)) # should not be valid, sometimes is

  found_foo2[i] <- !inherits(find_foo2, "try-error")
  found_foo1[i] <- !inherits(find_foo1, "try-error")

  system2("stat","test.cpp > stat_test_cpp.txt")
  stat_test_cpp[i] <- readLines("stat_test_cpp.txt")
  system2("stat","test.o > stat_test_o.txt")
  stat_test_o[i] <- readLines("stat_test_o.txt")
}
sum(found_foo1) # how many times did compilation NOT occur (wrong)
sum(found_foo2) # how many time did compilation occur (right)

# pull out "birthtime" timestamps and see when they match between the two stat calls
stat_test_cpp_split <- lapply(stat_test_cpp, \(x) strsplit(x, "\""))
stat_test_o_split <- lapply(stat_test_o, \(x) strsplit(x, "\""))
matches <- mapply(\(x1, x2) identical(x1[[1]][8], x2[[1]][8]), stat_test_cpp_split, stat_test_o_split)

cbind(matches, found_foo1) # align when the timestamps match and when foo_1 was found (indicating that compilation did not occur)

@paciorek
Copy link
Contributor Author

I'm preparing a branch with --preclean where that is shielded by a nimbleOption but defaults to TRUE.

@paciorek
Copy link
Contributor Author

Just a note that at least for Laplace, this adds an extra, unneeded compilation of nimbleCppADbaseClass.cpp that, for the crossed random effects example in test-ADlaplace.R adds an additional 20 seconds of C++ compilation time, taking the total time from something like 60 sec. to something like 80 sec on my Linux machine.

g++ -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG -DR_NO_REMAP   -DEIGEN_MPL2_ONLY=1 -I"/tmp/nim-devel/nimble/include" -Wno-misleading-indentation -Wno-ignored-attributes -Wno-deprecated-declarations      -fpic  -g -O2 -ffile-prefix-map=/build/r-base-H0vbME/r-base-4.3.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2  -c nimbleCppADbaseClass.cpp -o nimbleCppADbaseClass.o

So at some point we might want to look more carefully at managing what is recompiled more directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants