[Proof of Concept] Polyester.@batch for threading #104
base: main

Conversation
```diff
@@ -158,8 +158,8 @@ function map_pairwise_serial!(
     show_progress::Bool=false
 ) where {F,N,T}
     p = show_progress ? Progress(length(cl.ref), dt=1) : nothing
-    for i in eachindex(cl.ref)
+    @batch for i in eachindex(cl.ref)
         output = inner_loop!(f, output, i, box, cl)
```
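For context, a minimal, self-contained sketch of what `Polyester.@batch` does in a loop like the one changed above (independent of CellListMap's internals; `work` below is a made-up stand-in for `inner_loop!`):

```julia
using Polyester  # provides the @batch macro

# Hypothetical stand-in for the per-cell kernel.
work(i) = sum(sin, 1:i)

function run_batched!(out)
    # @batch threads the loop using Polyester's lightweight tasks,
    # similar in spirit to Threads.@threads but with lower per-loop overhead.
    @batch for i in eachindex(out)
        out[i] = work(i)
    end
    return out
end

run_batched!(zeros(1000))
```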
This insertion of `@batch` is in the serial implementation, so I'm unsure whether that is what you intended to do.
That was the easiest way to try this out.
In a few tests I don't see the performance difference you report. I might have to dig into your example to see.

One thing that might be important (I didn't look closely at your example) is the density of particles. My tests use a density of 10^5/(100^3), which is the atomic density of water, in the range of the applications I'm interested in. The tradeoffs in other density ranges are different.
How is the density relevant? Because of the cutoff? I used a cutoff of 3 times the spacing of a grid of particles. For more realistic benchmarks, I randomly perturbed the particle positions.
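A minimal sketch of the kind of setup described here (a regular grid with spacing Δ, a cutoff of 3Δ, and randomly perturbed positions); the sizes and perturbation amplitude are assumptions, not the exact benchmark code:

```julia
using StaticArrays

# Regular grid of n^3 particles with spacing Δ, each position perturbed
# by up to `perturbation` in every coordinate.
function perturbed_grid(n, Δ; perturbation = 0.5Δ)
    x = [SVector{3,Float64}(Δ * i, Δ * j, Δ * k) for i in 1:n, j in 1:n, k in 1:n]
    return vec(x) .+ [perturbation .* (2 .* rand(SVector{3,Float64}) .- 1) for _ in 1:n^3]
end

Δ = 0.1
x = perturbed_grid(20, Δ)
cutoff = 3Δ   # cutoff of 3 times the grid spacing, as described above
```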
Yes, of course, in relation to the cutoff. A typical molecular dynamics simulation has a density of 10^5/100^3 particles and uses a cutoff of around 8 to 15, in the same units. In some analyses of galaxy distributions (not my specialty), people use densities of 10^5/20.274^3 for a cutoff of 5.0 (much denser systems).

These define how many particles end up in each cell, and then what is worth doing in the inner loops that compute the interactions inside each cell or among neighboring cells. One parameter is the side of the cell relative to the cutoff (`lcell` in CellListMap); for very dense systems a cell side smaller than the cutoff is beneficial. In CellListMap I also project the particles onto the vector connecting cell centers and then sort the projected distances to further avoid computing spurious distances.
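To make the two regimes above concrete, a rough back-of-the-envelope count of particles per cell, assuming a cubic cell with side equal to the cutoff (i.e. `lcell = 1`); the numbers are taken from the densities and cutoffs quoted above:

```julia
# Expected number of particles in a cubic cell of side `cell_side`
# for a homogeneous system with the given number density.
particles_per_cell(density, cell_side) = density * cell_side^3

ρ_md     = 1e5 / 100^3      # ≈ 0.1: water-like atomic density
ρ_galaxy = 1e5 / 20.274^3   # ≈ 12:  the (much denser) galaxy-distribution example

particles_per_cell(ρ_md, 12.0)      # ≈ 173 particles per cell for a cutoff of 12
particles_per_cell(ρ_md, 8.0)       # ≈ 51  for a cutoff of 8
particles_per_cell(ρ_galaxy, 5.0)   # ≈ 1500 for a cutoff of 5
```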
I've experimented with Polyester a few times and, indeed, if the mapped loop is tight it is faster. However, my understanding is that Polyester does not compose well if the function is called from within other threaded code. I think it will just fill the threads and compete for the resources. That's why I never stuck with it.

Concerning the benchmark: is that using CellListMap? That was not clear to me. One thing is that the way the inner loop of the …
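Regarding the composability concern raised above, the pattern in question is roughly the one below, i.e. a `@batch` loop reached from inside code that is already threaded with Base threads (a hypothetical illustration, not CellListMap code):

```julia
using Polyester
using Base.Threads: @threads, nthreads

# Inner kernel threaded with Polyester.
function inner!(out, data)
    @batch for i in eachindex(data)
        out[i] = sin(data[i])
    end
    return out
end

# Outer code that is itself threaded with Base threads; the worry stated
# above is that the inner @batch loops then compete for the same threads
# rather than composing the way nested Base tasks do.
function outer!(outs, datas)
    @threads for k in eachindex(datas)
        inner!(outs[k], datas[k])
    end
    return outs
end

datas = [rand(10_000) for _ in 1:nthreads()]
outs = [similar(d) for d in datas]
outer!(outs, datas)
```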
Yes, …
Now that I think about it, probably what …
But wouldn't we see a similar speedup on a single thread then?
Polyester.jl: …
I don't see the difference in performance you report here, really. For example, I'm running the examples from this page: https://m3g.github.io/CellListMap.jl/stable/LowLevel/#Histogram-of-distances

The forces example, which is probably the most similar to what you are computing in your tests, results in:

Current stable CellListMap 0.9.4:

```julia
julia> @btime map_pairwise!(
           $((x,y,i,j,d2,forces) -> calc_forces!(x,y,i,j,d2,mass,forces)),
           $forces,$box,$cl
       );
  34.576 ms (76 allocations: 13.74 MiB)
```

and with your PR:

```julia
julia> @btime map_pairwise!(
           $((x,y,i,j,d2,forces) -> calc_forces!(x,y,i,j,d2,mass,forces)),
           $forces,$box,$cl
       );
  34.173 ms (82 allocations: 13.75 MiB)
```

In other examples there might be a slight improvement in performance with this PR. Could you test one of these examples on your machine?
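For reference, the forces benchmark from the linked docs page has roughly the following shape (a sketch paraphrased from the CellListMap documentation; the sizes and constants here are only indicative and may differ from the docs):

```julia
using CellListMap, StaticArrays

N = 100_000
sides = SVector(250.0, 250.0, 250.0)
cutoff = 10.0
box = Box(sides, cutoff)

x = [sides .* rand(SVector{3,Float64}) for _ in 1:N]
cl = CellList(x, box)
mass = rand(N)

# Gravity-like pairwise force, accumulated for both particles of the pair.
function calc_forces!(x, y, i, j, d2, mass, forces)
    G = 9.8 * mass[i] * mass[j] / d2
    df = (G / sqrt(d2)) * (y - x)
    forces[i] += df
    forces[j] -= df
    return forces
end

forces = [zero(SVector{3,Float64}) for _ in 1:N]
map_pairwise!(
    (x, y, i, j, d2, forces) -> calc_forces!(x, y, i, j, d2, mass, forces),
    forces, box, cl,
)
```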
Off-topic: can you please send me the link to your talk at JuliaCon about this when it is available? Thanks.
https://www.youtube.com/live/V7FWl4YumcA?si=XIKhTj8dfmon0rwk&t=4669

But we only quickly mentioned the neighborhood search.
Well, I'll have to investigate what's going on. The issue is that I don't see these performance differences at all on the hardware I have, with my test functions. For instance, this is what I get with the Polyester version:

```julia
julia> using CellListMap

julia> x, box = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(positions=x, unitcell=box.input_unit_cell.matrix, output=0.0, cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> out += inv(d2)), $sys)
  18.444 ms (87 allocations: 16.14 KiB)
75140.7942005164
```

With the current version:

```julia
julia> using CellListMap

julia> x, box = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(positions=x, unitcell=box.input_unit_cell.matrix, output=0.0, cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> out += inv(d2)), $sys)
  18.692 ms (87 allocations: 16.14 KiB)
77202.5687995922
```
This is not using the …
Oh, right... sorry. But in that case I get the opposite result; the current version is faster.

With Polyester:

```julia
julia> x, box = CellListMap.xatomic(10^4);

julia> y, _ = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(xpositions=x, ypositions=y, unitcell=2*box.input_unit_cell.matrix, output=zeros(length(x)), cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  46.463 ms (43 allocations: 7.25 KiB)

julia> Threads.nthreads()
6
```

With the current version:

```julia
julia> x, box = CellListMap.xatomic(10^4);

julia> y, _ = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(xpositions=x, ypositions=y, unitcell=2*box.input_unit_cell.matrix, output=zeros(length(x)), cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  23.328 ms (93 allocations: 16.33 KiB)
```

So I really will have to dig further into your example to understand what's going on.
Your benchmark on a Threadripper 3990X:

```julia
julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  8.930 ms (286 allocations: 54.41 KiB)

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  6.020 ms (53 allocations: 8.59 KiB)

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  3.438 ms (698 allocations: 132.58 KiB)
```
Well, definitely not what I see here. This is another machine (a fairly standard one).

With this PR:

```julia
julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); out end), $sys);
  63.056 ms (1 allocation: 448 bytes)
```

With current stable:

```julia
julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); out end), $sys);
  21.881 ms (117 allocations: 19.95 KiB)
```

With:

```julia
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD EPYC 7282 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 8 default, 0 interactive, 4 GC (on 16 virtual cores)
```

(with only …
I skipped the reduce in all versions and only returned 0.
Have you considered using Polyester.jl for threading instead of `Threads`?

I always use the code in this PR when I want to do fair benchmarks of my implementations against CellListMap.jl. I get consistently better performance with this PR than with `main`.

Here is a very cheap benchmark that runs one time step of an n-body simulation. This is the code:
https://github.com/trixi-framework/PointNeighbors.jl/blob/main/benchmarks/n_body.jl#L24-L33
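The linked file is not reproduced here; as a generic illustration of the kind of kernel such a benchmark exercises, here is a sketch of a gravity-like acceleration update over precomputed neighbor lists, threaded with `@batch` (all names and sizes below are made up, not taken from the linked benchmark):

```julia
using Polyester, StaticArrays

# One interaction step: accumulate accelerations from all neighbors of each particle.
# `neighbors[i]` is assumed to hold the indices of particles within the search radius of `i`.
function accelerations!(dv, x, mass, neighbors)
    @batch for i in eachindex(x)
        acc = zero(eltype(dv))
        for j in neighbors[i]
            i == j && continue
            r = x[j] - x[i]
            d2 = sum(abs2, r)
            acc += mass[j] / (d2 * sqrt(d2)) * r   # softening omitted for brevity
        end
        dv[i] = acc
    end
    return dv
end

n = 1_000
x = [rand(SVector{3,Float64}) for _ in 1:n]
mass = rand(n)
neighbors = [collect(1:n) for _ in 1:n]   # trivial all-pairs "neighbor list" for this sketch
dv = [zero(SVector{3,Float64}) for _ in 1:n]
accelerations!(dv, x, mass, neighbors)
```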
This is on an AMD Ryzen Threadripper 3990X using 128 threads: the results show a minimum speedup of 1.6x, and a speedup of 4.5x for the largest problem size.
Here is a real-life benchmark, running one particle interaction loop of the Weakly Compressible SPH method: this shows a speedup of 2x for most problem sizes.