Check results for new CI, allow some benchmarks to fail #211
Conversation
This PR adds a yaml file of the expected results for the DaCapo tests and a script to check it.
I am taking this approach to figure out the expected results:
Yes, this sounds reasonable. It's important that CI doesn't just fail all the time, or we develop a habit of ignoring it instead of noticing real regressions. For each (intermittently) failing benchmark, we should have an issue for it and reference it in a comment in the yaml. Once the issues are fixed, the yaml should be updated.
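As a rough sketch of what that could look like (the keys, benchmark names, and issue reference below are illustrative assumptions, not the actual schema used in this PR):

```yaml
# ci-expected-results.yml -- illustrative sketch only; the real schema may differ.
SemiSpace:
  lusearch: pass
  # Intermittently runs out of disk space on CI runners;
  # tracked by a (hypothetical) issue, e.g. mmtk/mmtk-openjdk#NNN.
  kafka: ignore
GenImmix:
  lusearch: pass
  kafka: ignore
```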
So I guess if a PR fixes a bug then we would have to update the expected results manually?
Yes.
I am still testing this. There are a few issues I have seen. I did not mark those tests as 'ignore' yet; I feel those are possibly related to the CI environment rather than the code we test.

Script quit after one plan

kafka 2500 608 0.

The log for the first plan (semispace):

===== DaCapo evaluation-git-04132797 kafka starting =====
Trogdor is running the workload....
Starting 1000000 requests...
5%
10%
15%
20%
25%
30%
35%
40%
45%
50%[2023-08-18 01:06:02,073] ERROR Error while appending records to foo1-8 in dir /tmp/runbms-hmqvnk5o/scratch/kafka-logs (kafka.server.LogDirFailureChannel)
java.io.IOException: No space left on device
at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
at java.base/sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:280)
at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:92)
at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
at kafka.log.LogSegment.append(LogSegment.scala:158)
at kafka.log.LocalLog.append(LocalLog.scala:436)
at kafka.log.UnifiedLog.append(UnifiedLog.scala:929)
at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:740)
at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1167)
at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1155)
at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:947)
at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
at scala.collection.mutable.HashMap.map(HashMap.scala:35)
at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:935)
at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:593)
at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:665)
at kafka.server.KafkaApis.handle(KafkaApis.scala:175)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
at java.base/java.lang.Thread.run(Thread.java:829)
[2023-08-18 01:06:02,104] ERROR Shutdown broker because all log dirs in /tmp/runbms-hmqvnk5o/scratch/kafka-logs have failed (kafka.log.LogManager)

The log for the second plan:

===== DaCapo evaluation-git-04132797 kafka starting =====
Trogdor is running the workload....
Starting 1000000 requests...
5%
10%
15%
20%
25%
30%
35%
40%
45%
50%[2023-08-18 01:06:14,999] ERROR Error while appending records to foo1-1 in dir /tmp/runbm

All plans failed

h2o 2500 340 0.......

No extra logging about the error, just something like this for all the plans:

Using scaled threading model. 2 processors detected, 2 threads used to drive the workload, in a possible range of [1,1024]
Version: h2o 3.38.0.3
Nominal stats: NEP: 2, NES: 3, NET: 9, NEW: 1, NMH: 102, NML: 1, NMR: 1212, NMS: 20, NMT: 100, NMU: 130, NOA: 142, NOL: 152, NOM: 16, NOS: 16
See actions/runner-images#2840. We can apply these workarounds. This shouldn't be a problem once we switch to DaCapo Chopin RC2 (there's a kafka issue in DaCapo). @steveblackburn
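One workaround commonly suggested in that thread is to delete large pre-installed toolchains at the start of the job, before the benchmarks run. A minimal sketch of such a workflow step (the step name and placement in the job are assumptions; the paths are the ones usually suggested in actions/runner-images#2840):

```yaml
# Illustrative workflow step; name and placement are assumptions.
- name: Free up disk space
  run: |
    # Remove large pre-installed toolchains that the benchmarks do not need.
    sudo rm -rf /usr/share/dotnet
    sudo rm -rf /opt/ghc
    sudo rm -rf /usr/local/share/boost
    sudo rm -rf "$AGENT_TOOLSDIRECTORY"
    # Report the remaining space so failed runs are easier to diagnose.
    df -h
```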
It seems that many jobs are often skipped because of running out of disk space (the failure shows at the 'Extract OpenJDK' step): https://github.com/mmtk/mmtk-openjdk/actions/runs/5934085396
Should we try and get this merged? I think applying the workarounds mentioned by @caizixian should allow us to merge this.
I added a step to free up space, as Zixian posted. It did not solve the problem. The runs may still fail due to running out of disk space. The tests are too flaky at the moment.
Feel free to push any changes or attempts to the PR if you are aware of a workaround.
The out-of-disk error should be fixed by #241.
It seems like even semispace could randomly fail. I feel that if we ignore any benchmark that may randomly fail, we may end up ignoring a lot more than we want.
Hm. SemiSpace failing is odd. The CI uploads artifacts -- I'll take a look. EDIT: That's annoying. It just says
@qinsoon You will have to rename it to
@qinsoon you can run DaCapo benchmarks with
Sure. I haven't used plugins before. I am not sure if I used it the right way. 37593c6
@qinsoon
LGTM
This PR adds a yaml file that records our expected results for DaCapo tests. It also adds a script to check the results for each CI run. We should generally expect a CI run to succeed (which means the results match the expectations) once this PR is merged.
However, as we may have random failures that we haven't noticed yet, it is possible that a CI run still fails. In such a case, we should update ci-expected-results.yml, create a GitHub issue, and update #246.
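A minimal sketch of what the checking step could look like (the yaml schema, file names, and result format here are assumptions for illustration; the actual script in this PR may differ):

```python
# check_expected.py -- illustrative sketch only; schema and file names are assumptions.
import sys
import yaml  # requires PyYAML


def main(expected_path: str, actual_path: str) -> int:
    with open(expected_path) as f:
        expected = yaml.safe_load(f)  # e.g. {plan: {benchmark: "pass" | "ignore"}}
    with open(actual_path) as f:
        actual = yaml.safe_load(f)    # e.g. {plan: {benchmark: "pass" | "fail"}}

    mismatches = []
    for plan, benchmarks in expected.items():
        for bm, expectation in benchmarks.items():
            if expectation == "ignore":
                # Known flaky benchmark; an issue is referenced in the yaml comment.
                continue
            result = actual.get(plan, {}).get(bm, "missing")
            if result != expectation:
                mismatches.append((plan, bm, expectation, result))

    for plan, bm, expectation, result in mismatches:
        print(f"{plan}/{bm}: expected {expectation}, got {result}")
    return 1 if mismatches else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

The idea is that a benchmark marked ignore never fails the run, while any other mismatch between the expected and actual results makes the CI job fail, so a green CI run means the results are as expected.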