`@batch` uses only the first 64 threads #83

Keluaa · 2022-06-29T13:06:18Z

I am currently working with a 128 cores (64 cores x 2 sockets) CPU, and I noticed that Polyester.jl will only use the first 64 threads available :

julia> using Polyester

julia> Threads.nthreads()
128

julia> tids = zeros(Int, Threads.nthreads());

julia> @batch for _ in 1:Threads.nthreads()
           tids[Threads.threadid()] += 1
       end

julia> all(tids .== 1)
false

julia> [tids[1:64]' ; tids[65:end]']
2x64 Matrix{Int64}
2  2  2  2  2  2  2  2 ... 2  2  2  2  2  2  2  2
0  0  0  0  0  0  0  0 ... 0  0  0  0  0  0  0  0

julia> tids .= 0;

julia> Threads.@threads for _ in Threads.nthreads()
           tids[Threads.threadid()] += 1
       end

julia> all(tids .== 1)
true

julia> [tids[1:64]' ; tids[65:end]']
2x64 Matrix{Int64}
1  1  1  1  1  1  1  1 ... 1  1  1  1  1  1  1  1
1  1  1  1  1  1  1  1 ... 1  1  1  1  1  1  1  1

Notice how the threads 65 to 128 were ignored by @batch but not by Threads.@threads.
I get similar results with hyperthreading with 256 threads.

I am quite sure that this also affects LoopVectorization.jl since I am getting the same time for some very simple benchmarks :

julia> a = rand(Float64, 10000); b = rand(Float64, 10000); c = rand(Float64, 10000);

julia> @btime @tturbo for i in 1:10000
           c[i] = a[i] * b[i]
       end

Gives ~3.0 µs for 64 or 128 threads.

I suppose that it is related to the behaviors mentioned in #22. In any case I would be happy to help resolving this issue.

The text was updated successfully, but these errors were encountered:

chriselrod · 2022-06-29T16:00:05Z

Hmm.
What do you get for

using Polyester
Polyester.num_threads()
length(Polyester.ThreadingUtilities.TASKS)

?

chriselrod · 2022-06-29T16:03:38Z

The two most likely places this can be going wrong:

Multiple sockets not recognized, so if you have 2x 64 cores, it thinks you only have 1x 64 cores. That's why I wanted to see Polyester.num_threads() above.
PolyesterWeave uses 64-bit masks to indicate which ThreadingUtilities.TASKS are available for use. If you have >64 TASKS, it needs to use a tuple of such masks. Maybe it's only using the first element of this tuple.

chriselrod · 2022-06-29T16:04:23Z

Sorry about repeatedly closing this issue. Apparently C-ret to comment now chooses "Close with comment" instead of "Comment"?

Keluaa · 2022-06-30T06:56:30Z

I get :

julia> using Polyester

julia> Polyester.num_threads()
static(128)

julia> length(Polyester.ThreadingUtilities.TASKS)
127

So it seems that it correctly detects the number of threads.
I also tried it with 64 threads and I got static(64) and 63. With hyperthreading enabled I get static(256) and 255.
Everything seems to be okay here.

Keluaa · 2022-08-02T09:55:52Z

I finally found some time to try to debug this issue.

The source of the problem seems to be the function free_threads! of PolyesterWeave, as it operates only on the first element of the worker bit masks tuple. Therefore at the first call of @batch, all 128 threads are running, but only the first 64 are freed, and so the next calls will only use the first 64 threads available.

I also spotted some weird behaviors caused by this line: since Nr is unsigned, if it is less than nthread (which is always is, if I understood the code correctly) it will overflow, making the next threads iterate over a bigger range than they should because of this line.

Should I create pull requests for my fixes?

chriselrod · 2022-08-02T12:20:32Z

The source of the problem seems to be the function free_threads! of PolyesterWeave, as it operates only on the first element of the worker bit masks tuple.

It is called in a loop here:

Polyester.jl/src/batch.jl

Lines 159 to 172 in 2926208

    
           for (threadmask, nthread, torelease) ∈ 
        
               zip(threadmask_tuple, nthread_tuple, torelease_tuple) 
        
             tm = mask(UnsignedIteratorEarlyStop(threadmask, nthread)) 
        
             tid = 0x00000000 
        
             while tm ≠ zero(tm) 
        
               # assume(tm ≠ zero(tm))  
        
               tz = trailing_zeros(tm) % UInt32 
        
               tz += 0x00000001 
        
               tm >>>= tz 
        
               tid += tz 
        
               # @show tid, ThreadingUtilities._atomic_state(tid) 
        
               ThreadingUtilities.wait(tid) 
        
             end 
        
             free_threads!(torelease)

I also spotted some weird behaviors caused by this line: since Nr is unsigned,

Yeah, that definitely looks like a bug/like I wrote it assuming Nr was signed. I guess just cast it into a signed integeger, Nr%Int.
PRs welcome!

Keluaa · 2022-08-02T13:56:45Z

I just tried to run the original code where I first encountered this issue, and my fixes didn't work. Turns out that somehow I was using the per=thread option while debugging (at least that's what I think happened), and without it the number of threads is limited by the number of cores, but Polyester.num_cores() == 64 instead of 128.

The node has a CPU (AMD EPYC 7H12) with 2 sockets, each with 64 cores on them, but because of hyperthreading there is 2 threads per core. Therefore Polyester.num_threads() should return 256 and Polyester.num_cores() should return 128, meaning that the second socket is not recognized, in contrary to what I said in my first reply.

I am on a x86_64 architecture, and by looking at CPUSummary.jl, it seems that the number of sockets is hardcoded, and since CpuId.cpucores() only returns the number of cores for the current socket (as per the docs), we get 64 cores...

It is called in a loop here:

But the problem is in the function itself:

https://github.com/JuliaSIMD/PolyesterWeave.jl/blob/3eb4d66fffe82c2097e142ff8dc3ad008e504ea3/src/request.jl#L20-L23

It only considers the first element of the bit mask tuple since worker_pointer() always returns the same value. Therefore the loop applies the masks torelease to the same worker mask repeatedly.

chriselrod · 2022-08-02T14:12:05Z

At one point in time, it was important to support WINE, but I believe this is no longer the case.
using Hwloc causes Julia to crash on WINE, which meant that CPUSummary had to switch from Hwloc.jl (which can correctly get the number of sockets) to CpuId.jl, for which I do not know how to do that.

We can switch back. Feel free to make a PR reverting main to version 0.1.8 or so.

chriselrod · 2022-08-02T14:13:32Z

since worker_pointer() always returns the same value. Therefore the loop applies the masks torelease to the same worker mask repeatedly.

Ah, you're right. That's definitely a bug.

chriselrod closed this as completed Jun 29, 2022

chriselrod reopened this Jun 29, 2022

chriselrod closed this as completed Jun 29, 2022

chriselrod reopened this Jun 29, 2022

chriselrod mentioned this issue Jul 22, 2022

Support more than 64 threads JuliaLinearAlgebra/Octavian.jl#153

Open

Keluaa mentioned this issue Aug 3, 2022

Fix 64+ threads support #87

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`@batch` uses only the first 64 threads #83

`@batch` uses only the first 64 threads #83

Keluaa commented Jun 29, 2022

chriselrod commented Jun 29, 2022

chriselrod commented Jun 29, 2022

chriselrod commented Jun 29, 2022 •

edited

Loading

Keluaa commented Jun 30, 2022

Keluaa commented Aug 2, 2022

chriselrod commented Aug 2, 2022

Keluaa commented Aug 2, 2022

chriselrod commented Aug 2, 2022

chriselrod commented Aug 2, 2022

@batch uses only the first 64 threads #83

@batch uses only the first 64 threads #83

Comments

Keluaa commented Jun 29, 2022

chriselrod commented Jun 29, 2022

chriselrod commented Jun 29, 2022

chriselrod commented Jun 29, 2022 • edited Loading

Keluaa commented Jun 30, 2022

Keluaa commented Aug 2, 2022

chriselrod commented Aug 2, 2022

Keluaa commented Aug 2, 2022

chriselrod commented Aug 2, 2022

chriselrod commented Aug 2, 2022

`@batch` uses only the first 64 threads #83

`@batch` uses only the first 64 threads #83

chriselrod commented Jun 29, 2022 •

edited

Loading