
Deadlock if fewer threads (<args.nthrds) started #28

Open
mjaggi-cavium opened this issue Jun 7, 2018 · 12 comments

@mjaggi-cavium

This is similar to an earlier issue I posted some time back.

After a run of about 20 minutes, a deadlock is observed when main() is unable to start all 'n' threads (args.nthrds). All cores on which threads were started sit at 100%. The first child thread is waiting for ready_lock, while the others are waiting for sync_lock.

This behaviour is observed when the number of cores is 200+ (4 hardware threads per core).

I am not sure why not all nthrds start; it could be an RT throttling issue.
Comments/suggestions?

@geoffreyblake
Contributor

Hi, does the issue still happen with the "-s" flag which disables the SCHED_FIFO setting?
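For context, a minimal sketch of how a benchmark thread typically opts in or out of SCHED_FIFO; the flag variable and priority value here are illustrative assumptions, not lockhammer's actual code:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int use_sched_fifo = 1;  /* assumption: cleared when -s is passed */

static void maybe_promote_to_fifo(void)
{
    if (!use_sched_fifo)
        return;  /* -s: stay on the default SCHED_OTHER policy */

    struct sched_param sp = { .sched_priority = 1 };
    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (rc != 0)
        fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);
}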

@zoybai
Contributor

zoybai commented Jun 8, 2018

Hi Manish, which workload were you running when you hit this deadlock? Thanks

@mjaggi-cavium
Author

I am running ./runall.sh.
The issue is not seen when running a single instance of the workload.

@mjaggi-cavium
Author

> Hi, does the issue still happen with the "-s" flag which disables the SCHED_FIFO setting?

No. The test completes without hanging.

@lucasclucasdo
Contributor

When running with -s, what sort of effective parallelism do you see? It should be close to the number of requested cores. If it is significantly lower, then the improvement may be because -s mode cannot recreate the high-contention case, rather than -s fixing a problem with FIFO mode itself.

@mjaggi-cavium
Author

> It should be close to the number of requested cores

Yes.

With -s I have never seen the number of threads created be less than nthrds, so the main thread is not starved.

@lucasclucasdo
Contributor

lucasclucasdo commented Jun 11, 2018

The number of threads created will be the same, but "effective parallelism" (output by the tool) tells you how many of those threads are actually running at the same time. So you could have 200 cores and 200 threads, but if each thread runs to completion on one core before the next one starts, the effective parallelism can theoretically be just 1 even though thread creation equals requested threads.
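For illustration, one plausible way such a metric can be computed (the tool's actual formula may differ; busy_secs is an assumed per-thread measurement):

/* Ratio of total per-thread busy time to wall-clock time:
 * ~1.0 means the threads ran one after another (fully serialized),
 * ~nthrds means they genuinely overlapped. */
double effective_parallelism(const double *busy_secs, int nthrds,
                             double wall_secs)
{
    double total = 0.0;
    for (int i = 0; i < nthrds; i++)
        total += busy_secs[i];
    return total / wall_secs;
}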

@mjaggi-cavium
Author

AFAIK:

  • The main thread and child thread 0 always run on hw-thread 0.
  • All child threads start on hw-thread 0, then set their own affinity and later get scheduled onto their specific affined cores (see the sketch below).
  • If all child threads run on hw-thread 0 first under SCHED_FIFO, and child thread 0 always runs on hw-thread 0, wouldn't there be a point at which the main thread is starved?
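A minimal sketch of that start-then-affine pattern (function and struct names are illustrative, not lockhammer's actual code):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

struct worker_arg { int cpu; };

static void *worker(void *p)
{
    struct worker_arg *a = p;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(a->cpu, &set);
    /* Until this call takes effect, the thread competes for the CPU
     * it was spawned on; under SCHED_FIFO that early contention is
     * exactly what can starve threads that have not yet run. */
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);

    /* ... report ready, spin on sync_lock, run the workload ... */
    return NULL;
}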

@lucasclucasdo
Contributor

lucasclucasdo commented Jun 11, 2018

It's more likely that the child threads get starved, but the scheduler should be waking up cores to steal and run the child threads, since there will be balance problems otherwise (one core with two runnable FIFO processes and another core with nothing). One thing I've been thinking about trying is spawning a bunch of extra threads to make the balance issue look worse, causing the scheduler to step in sooner, and then affining each thread to whichever unloaded core it ends up on first (or exiting if the core it ends up on already has a waiting lockhammer process).

Anyway, that's not relevant to the question I'm asking, which is: does safe mode successfully achieve the requested contention level? I'm guessing not, since FIFO mode was added precisely to avoid this problem, which is why I'm asking. In other words, safe mode might "solve" the issue you're seeing, but it probably does so by making the test a useless measure of performance in the high-core-count contention case (because it likely fails to achieve that contention). How does the "effective parallelism" metric compare to the requested thread count at the high thread counts where you were previously seeing the scheduling issue?

Edit: slight correction, the main thread should be free to run anywhere, not just hw-thread 0 (if that's not the case, it's a bug).

@lucasclucasdo
Contributor

I created a test branch which sched_yield()s the thread on core 0 if all child threads are not yet ready. Unfortunately I cannot replicate this issue on the systems I have access to, so please try this branch and see if it helps:

https://github.com/codeauroraforum/synchronization-benchmarks/tree/lh-yieldwait
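For reference, a yielding wait in that branch plausibly looks something like this (a sketch only; the branch's actual wait64_yield may differ):

#include <sched.h>
#include <stdint.h>

/* Same spin as a plain wait64(), but giving up the CPU on every miss
 * so other runnable threads on this core (e.g., children that have
 * not started yet) get a chance to run. */
static inline void wait64_yield(volatile uint64_t *lock, uint64_t val)
{
    while (__atomic_load_n(lock, __ATOMIC_ACQUIRE) != val)
        sched_yield();
}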

@mjaggi-cavium
Author

Tried this, and replaced the below as well:

/* Spin until the "marshal" sets the appropriate bit */
wait64_yield(&sync_lock, (nthrds * 2) | 1);

I think I missed one point: the affinity of the main thread spans all cores, so wherever it gets rescheduled, if there is contention there, not all threads will start. So I believe we need to put sched_yield in all the atomic wait functions.

@lucasclucasdo
Contributor

If we yield the other threads, then we need to add another sync step without a yield to make sure everyone has actually both started and is running. E.g., the current scheme is:

  1. Start up threads
  2. Wait for all threads to start up
  3. Threads are FIFO and unyielding, so if they've reported started then they must still be running
  4. Send a start signal, since we know the threads are all started up (because they told us) and currently running (because they must be, by definition)

If we yield the startup threads, it should be (a sketch follows the list):

  1. Start up threads
  2. Wait with yielding for all threads to start up
  3. Threads have all started up, but may not currently be running due to yielding while startup was ongoing
  4. Wait without yielding for all started-up threads to get rescheduled and report back in
  5. Send a start signal, since we've confirmed all threads are started up (because they told us) and currently running (because they also told us)
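A minimal sketch of that two-phase handshake (counter names and structure are illustrative, not a definitive implementation):

#include <sched.h>
#include <stdatomic.h>

static atomic_int started;  /* phase 1: thread has checked in after creation */
static atomic_int running;  /* phase 2: thread is on-CPU again after yielding */
static atomic_int go;       /* start signal set by the main thread */

static void worker_handshake(int nthrds)
{
    /* Phase 1: check in, and yield freely while siblings start up. */
    atomic_fetch_add(&started, 1);
    while (atomic_load(&started) < nthrds)
        sched_yield();

    /* Phase 2: check in again without yielding, so a counted thread
     * is known to be both started and currently running. */
    atomic_fetch_add(&running, 1);
    while (atomic_load(&running) < nthrds)
        ;

    /* Main sets go once it observes running == nthrds. */
    while (!atomic_load(&go))
        ;
}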

That said, I still think this is more of a scheduler balance problem. At high core counts, a single core with an extra runnable-but-not-running process (i.e., the main thread) doesn't look like too bad of a balance problem, so sleeping hardware threads are not woken up to execute the main software thread for a long time, in the hope that one of the many low-utilization hardware threads already running can take care of it shortly. Of course they can't, because they're all running FIFO threads that are busy spinning.
