@batch slows down other non-@batched loops with allocations on macOS ARM #89
On an Intel laptop:

julia> @benchmark with_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 17.436 μs … 73.138 ms ┊ GC (min … max): 0.00% … 2.37%
Time (median): 20.561 μs ┊ GC (median): 0.00%
Time (mean ± σ): 92.966 μs ± 2.244 ms ┊ GC (mean ± σ): 1.94% ± 0.08%
▂█▄▃▃▃▃▄▁
▁▁▆█████████▆▄▃▃▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁ ▂
17.4 μs Histogram: frequency by time 39.1 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark without_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.267 μs … 2.121 ms ┊ GC (min … max): 0.00% … 96.71%
Time (median): 19.250 μs ┊ GC (median): 0.00%
Time (mean ± σ): 22.940 μs ± 62.979 μs ┊ GC (mean ± σ): 8.50% ± 3.09%
▁▂▆██▇▇▇▇▇▆▅▄▃▃▃▂▁▁▁▁▁ ▁▂▁▂▂▁ ▁ ▁ ▁▁▁ ▃
███████████████████████▇███████████████████▇██▇▆▆▇▇▇▇▇▅▆▆▆▇ █
16.3 μs Histogram: log(frequency) by time 40.2 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.

Not as extreme, but the problem still exists. |
One workaround is to set a minbatch:

julia> function with_minbatch()
# Just some loop with @batch with basically no runtime
@batch minbatch=100 for i in 1:2
nothing
end
# This is just to make sure that the allocation in the next loop is not optimized away
v = [[]]
# Note that there is no @batch here
for i in 1:1000
# Just an allocation
v[1] = []
end
end
with_minbatch (generic function with 1 method)
julia> @benchmark with_minbatch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.096 μs … 2.231 ms ┊ GC (min … max): 0.00% … 98.34%
Time (median): 17.549 μs ┊ GC (median): 0.00%
Time (mean ± σ): 20.675 μs ± 63.241 μs ┊ GC (mean ± σ): 9.52% ± 3.09%
▁▃▅▇█▇▆▅▅▃▃▂▁▁ ▁ ▁▁▂▂▁ ▂
████████████████▇▆▆▆▆▄▂▅▂▄▅▅▆▄▄▅▄▃▄▂▆▇▇███████▇▆▆▅▄▄▅▃▅▇▆▆▅ █
16.1 μs Histogram: log(frequency) by time 33.2 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.

This means we'd need at least 100 iterations per thread. |
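A minimal sketch of the minbatch semantics used above (the loop length of 150 is illustrative, not from the thread): with minbatch=n, each thread must receive at least n iterations, so a loop shorter than 2n runs on a single thread and never wakes Polyester's worker tasks.

using Polyester

# With minbatch=100, each worker must get at least 100 iterations, so this
# 150-iteration loop runs serially and no worker tasks are woken.
function double_short!(y, x)
    @batch minbatch=100 for i in eachindex(x, y)
        y[i] = 2 * x[i]
    end
    y
end

x = rand(150); y = similar(x)
double_short!(y, x)  # 150 < 2 * 100, so this stays on a single thread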
Thanks for the quick reply. |
function with_batch_sleep()
# Just some loop with @batch with basically no runtime
@batch for i in 1:2
nothing
end
ThreadingUtilities.sleep_all_tasks()
# This is just to make sure that the allocation in the next loop is not optimized away
v = [[]]
# Note that there is no @batch here
for i in 1:1000
# Just an allocation
v[1] = []
end
end

I get:

julia> @benchmark with_batch_sleep()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.542 μs … 1.063 ms ┊ GC (min … max): 0.00% … 90.75%
Time (median): 18.041 μs ┊ GC (median): 0.00%
Time (mean ± σ): 19.843 μs ± 31.948 μs ┊ GC (mean ± σ): 4.85% ± 2.99%
█▇ ▂▁▃▁ ▁
▆████████▇▇▆▆▃▅▅▅█▅▅▄▃▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▄▃▃▁▁▄▅ █
16.5 μs Histogram: log(frequency) by time 61.4 μs <
Memory estimate: 47.02 KiB, allocs estimate: 1003.
julia> versioninfo()
Julia Version 1.9.0-DEV.1073
Commit 0b9eda116d* (2022-08-01 14:27 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 |
It's ridiculous that this is slow:

julia> function with_thread()
Threads.@threads for i in 1:2
nothing
end
# This is just to make sure that the allocation in the next loop is not optimized away
v = [[]]
# Note that there is no @batch here
for i in 1:1000
# Just an allocation
v[1] = []
end
end
with_thread (generic function with 1 method)
julia> @benchmark with_thread()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 20.250 μs … 1.244 ms ┊ GC (min … max): 0.00% … 90.11%
Time (median): 52.875 μs ┊ GC (median): 0.00%
Time (mean ± σ): 55.113 μs ± 38.817 μs ┊ GC (mean ± σ): 2.25% ± 3.08%
▂ ▄█
▂▁▁▁▁▁▂▂▁▁▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▄▄▃▅█▇██▇▆▅▄▃▃▃▃▄▄▃▃▃▂▂▂▂▂▂▂▂▂ ▃
20.2 μs Histogram: frequency by time 73.5 μs <
Memory estimate: 49.17 KiB, allocs estimate: 1025.

=/ |
I think |
Amazing, thank you! Interestingly, with |
Polyester/ThreadingUtilities block excess threads for a few milliseconds while looking for work to do. Base threading does as well, but not for as long. Going to sleep more quickly can help other things, like here; presumably, something wants to run on these threads periodically. You can change ThreadingUtilities' default behavior: |
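A minimal sketch of this per-call-site workaround, using ThreadingUtilities.sleep_all_tasks() (the only such API confirmed in this thread) to put the workers to sleep immediately instead of letting them spin; timings and loop sizes are illustrative:

using Polyester, ThreadingUtilities

# An allocation-heavy serial loop, as in the examples above.
function alloc_loop()
    v = [[]]
    for i in 1:1000
        v[1] = []
    end
end

@batch for i in 1:2
    nothing
end
t_spinning = @elapsed alloc_loop()    # workers may still be spinning here

ThreadingUtilities.sleep_all_tasks()  # put Polyester's worker tasks to sleep now
t_sleeping = @elapsed alloc_loop()

@show t_spinning t_sleeping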
I think we can close this issue once someone adds a section to the README (preferably close to the top, as it's an important gotcha). PRs welcome :). |
I would create a PR, but I still don't fully understand the problem that you explained in your last comment. How is the longer sleep threshold of Polyester problematic here? What is the consequence of threads falling asleep with a shorter threshold? Why is Polyester/ThreadingUtilities not doing that by default? |
It could be simple and merely suggest trying it when you see unexpected regressions.
I am not sure why. This suggests that maybe only occasionally the loop wants to use another thread, perhaps related to GC, and when this happens, it has to wait for ThreadingUtilities' tasks to go to sleep.
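A small probe of that GC hypothesis (a sketch; the thread itself contains no such measurement): time a full collection while Polyester's workers are still spinning versus after they have gone to sleep.

using Polyester, ThreadingUtilities

@batch for i in 1:2
    nothing
end
t_gc_spinning = @elapsed GC.gc()      # collection while workers may still be spinning

ThreadingUtilities.sleep_all_tasks()
t_gc_sleeping = @elapsed GC.gc()      # collection after the workers went to sleep

@show t_gc_spinning t_gc_sleeping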
If the threads are awake when you assign them work, e.g. through @batch, they can start on it much sooner than if they first have to be woken up.

Consider these benchmarks on an Intel (Cascade Lake [i.e., a Skylake-AVX512 clone]) CPU:

julia> function batch()
# Just some loop with @batch with basically no runtime
@batch for i in 1:2
nothing
end
end
batch (generic function with 1 method)
julia> function batch_sleep()
# Just some loop with @batch with basically no runtime
@batch for i in 1:2
nothing
end
ThreadingUtilities.sleep_all_tasks()
end
batch_sleep (generic function with 1 method)
julia> @benchmark batch()
BenchmarkTools.Trial: 10000 samples with 656 evaluations.
Range (min … max): 183.107 ns … 242.933 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 185.463 ns ┊ GC (median): 0.00%
Time (mean ± σ): 188.037 ns ± 7.939 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▆█▇▄▅▄▄▅▄▂▂ ▁ ▁ ▂
█████████████▇▆▇▇▇▇▆▇▅▅▆▇▇▆▇▆▅▅▆▇▇▆▅▆▄▄▃▄▁▅▆▃▃▁▃▄▃▁▁▄▁▃▃███▇█ █
183 ns Histogram: log(frequency) by time 229 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep()
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.652 μs … 4.131 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.772 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.792 μs ± 62.787 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▆▁
▂▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▄▆████▇▅▅▄▃▃▄▆██▆▆▅▄▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
1.65 μs Histogram: frequency by time 1.96 μs <
Memory estimate: 28 bytes, allocs estimate: 0.

On a Zen 3 CPU:

julia> @benchmark batch()
BenchmarkTools.Trial: 10000 samples with 38 evaluations.
Range (min … max): 880.237 ns … 10.576 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 920.763 ns ┊ GC (median): 0.00%
Time (mean ± σ): 927.305 ns ± 100.294 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▅▃▂▁▁ ▂▁ ▁
▃▃▃▁▁▁▁▁▁▄▃▁▅▆▇██████████▇▇▇▇▇▇▆▇▆▆▆▅▅▅▆▅▄▆▄▇██▅▅▃▄▅▄▅▁▅▄▃▄▃▃ █
880 ns Histogram: log(frequency) by time 1.03 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep()
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.543 μs … 6.771 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.343 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.195 μs ± 329.152 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▁▃▆▇█▆▄▂
▁▂▂███▅▃▄▄▄▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▃▃▄▅████████▇▆▄▃▂▂▂▁▁▁▁ ▃
2.54 μs Histogram: frequency by time 3.65 μs <
Memory estimate: 24 bytes, allocs estimate: 0.

And finally, on my M1:

julia> @benchmark batch()
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.796 μs … 6.125 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.604 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.610 μs ± 70.689 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▃█▃ ▁
▃▁▁▁▁▁▁▃▁▁▁▁▃▁▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▄▁▁▁▁▁▄▆▁▁▁▆████▄▆▄▃▃▆▃▁▁▃▃▁█ █
1.8 μs Histogram: log(frequency) by time 2.87 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep()
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.218 μs … 8.690 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.597 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.606 μs ± 93.115 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂█▂ ▁ ▁
▃▁▁▃▁▃▃▄▁▁▃▃▃▃▁▃▁▄▄█▅▄▁▅▃▅▅▄▄▆████▆▅▄▃▅▇███▇▆█▄▄▅▄▄▅█▃▄▁▁▄ █
2.22 μs Histogram: log(frequency) by time 2.89 μs <
Memory estimate: 28 bytes, allocs estimate: 0.

The M1 is much slower than the x86 CPUs here. I don't know whether it's a problem with how ThreadingUtilities works on the M1, but I have known for a while that threading has substantially higher overhead there than on my x86-64 CPUs.
It cannot currently. |
It's interesting that some evaluations take over a second longer in my initial example, even though the sleep timeout is just a millisecond. It seems like there is something preventing the threads from going to sleep, right? |
Has anyone posted this to JuliaLang/julia yet, since it affects Threads.@threads as well? |
Unfortunately, there still doesn't seem to be a good solution after over a year.

using Polyester
using ThreadingUtilities
function with_batch()
# Just some loop with @batch with basically no runtime
@batch for i in 1:2
nothing
end
# This is just to make sure that the allocation in the next loop is not optimized away
v = [[]]
# Note that there is no @batch here
for i in 1:1000
# Just an allocation
v[1] = []
end
end
function with_batch_sleep()
@batch for i in 1:2
nothing
end
ThreadingUtilities.sleep_all_tasks()
v = [[]]
for i in 1:1000
v[1] = []
end
end
function batch_without_allocations()
@batch for i in 1:1000
i^3
end
end
function batch_sleep_without_allocations()
@batch for i in 1:1000
i^3
end
ThreadingUtilities.sleep_all_tasks()
end
While Is there any better way by now? |
It seems that this is fixed in 1.10?
I ran the code in the very first post. |
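A quick way to check this (a sketch, assuming the with_batch/without_batch definitions from the first post are in scope): compare worst-case times, since the symptom is a rare ~1 s outlier rather than a shift in the median.

using BenchmarkTools

b_with    = @benchmark with_batch()
b_without = @benchmark without_batch()

# On an affected setup the worst case of with_batch() is around a second;
# once fixed, both maxima should be back down in the microsecond range.
println("max with_batch:    ", maximum(b_with).time / 1e6, " ms")
println("max without_batch: ", maximum(b_without).time / 1e6, " ms")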
Some of my simulations are regularly stopping for about a second when using @batch on macOS ARM. I could reduce this problem to the minimal example below, but I am now clueless about how to continue.
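(Definitions restated from the full code listing earlier in this thread; without_batch is assumed to be the identical function without @batch.)

using Polyester

function with_batch()
    # Just some loop with @batch with basically no runtime
    @batch for i in 1:2
        nothing
    end
    # This is just to make sure that the allocation in the next loop is not optimized away
    v = [[]]
    # Note that there is no @batch here
    for i in 1:1000
        # Just an allocation
        v[1] = []
    end
end

# Assumed counterpart used in the benchmarks: the same loop, but without @batch.
function without_batch()
    for i in 1:2
        nothing
    end
    v = [[]]
    for i in 1:1000
        v[1] = []
    end
end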
Benchmarking yields the following:
About one execution out of 2000 takes over one second, which causes the mean to be 30x higher than without any @batch loops. This is consistent with what I see in simulations, where most time steps are fast, but then some take over a second. This problem is specific to macOS ARM. The same Julia version on an x86 machine works as expected.