
Deadlock in switch_to Due to Premature Task Placement in Run Queue (SMP) #197

Open
hky1999 opened this issue Nov 1, 2024 · 2 comments · May be fixed by #198

hky1999 commented Nov 1, 2024

Description

In run_queue.rs, the switch_to function in ArceOS currently spins in a loop until the on_cpu flag of next_task becomes false.

This design is intended to ensure that the scheduling of next_task occurs atomically on the current CPU, preventing simultaneous access to the same task on different CPUs.

fn switch_to(&mut self, prev_task: CurrentTask, next_task: AxTaskRef) {
        // Make sure that IRQs are disabled by kernel guard or other means.
        #[cfg(all(not(test), feature = "irq"))] // Note: irq is faked under unit tests.
        assert!(
            !axhal::arch::irqs_enabled(),
            "IRQs must be disabled during scheduling"
        );
        debug!(
            "context switch: {} -> {}",
            prev_task.id_name(),
            next_task.id_name()
        );
        #[cfg(feature = "preempt")]
        next_task.set_preempt_pending(false);
        next_task.set_state(TaskState::Running);
        if prev_task.ptr_eq(&next_task) {
            return;
        }

        // Task must be scheduled atomically, wait for next task's scheduling process to complete.
        // If the owning (remote) CPU is still in the middle of schedule() with
        // this task (next task) as prev, wait until it's done referencing the task.
        //
        // Pairs with the `clear_prev_task_on_cpu()`.
        //
        // Note:
        // 1. This should be placed after the judgement of `TaskState::Blocked`,
        //    because the task may have been woken up by other cores.
        // 2. This can be placed at the front of `switch_to()`.
        #[cfg(feature = "smp")]
        while next_task.on_cpu() {
            // Wait for the task to finish its scheduling process.
            core::hint::spin_loop();
        }

        // Claim the task as running, we do this before switching to it
        // such that any running task will have this set.
        #[cfg(feature = "smp")]
        next_task.set_on_cpu(true);
        
        // ...
}

However, this design leads to a deadlock scenario under specific conditions in an SMP setup.

If:

  • Core 0 attempts to yield from task 0 to task 1
  • Core 1 simultaneously attempts to yield from task 1 to task 0

Core 0 will spin waiting for task 1's on_cpu flag to be cleared on Core 1, while Core 1 spins waiting for task 0's on_cpu flag to be cleared on Core 0; neither switch can complete, resulting in a deadlock.
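The circular wait can be modeled outside the kernel. Below is a standalone sketch in plain std Rust (not ArceOS code; the flag names and the bounded loop are illustrative assumptions) in which each "core" spins on the other task's on_cpu flag before it would clear the flag of the task it is leaving:

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// `true` means the task is still running on (or being switched away from) its core.
static TASK0_ON_CPU: AtomicBool = AtomicBool::new(true); // task 0 runs on core 0
static TASK1_ON_CPU: AtomicBool = AtomicBool::new(true); // task 1 runs on core 1

/// Models `switch_to`: spin until `next` is off its old core, then switch and
/// clear `prev`'s flag. The loop is bounded so the demo terminates instead of hanging.
fn switch_on_core(next_on_cpu: &AtomicBool, prev_on_cpu: &AtomicBool) -> bool {
    for _ in 0..10_000_000u64 {
        if !next_on_cpu.load(Ordering::Acquire) {
            // Only after the wait would the real code switch stacks and later
            // clear the previous task's `on_cpu` flag.
            prev_on_cpu.store(false, Ordering::Release);
            return true;
        }
        std::hint::spin_loop();
    }
    false // gave up: the flag was never cleared
}

fn main() {
    // Core 0: task 0 -> task 1.  Core 1: task 1 -> task 0.
    let core0 = thread::spawn(|| switch_on_core(&TASK1_ON_CPU, &TASK0_ON_CPU));
    let core1 = thread::spawn(|| switch_on_core(&TASK0_ON_CPU, &TASK1_ON_CPU));
    // Each core waits on a flag that only the other core's completed switch would clear.
    println!(
        "core0 switched: {}, core1 switched: {}",
        core0.join().unwrap(),
        core1.join().unwrap()
    );
}

Running this prints core0 switched: false, core1 switched: false: neither flag is ever cleared, because clearing one's own flag is gated on first observing the other flag cleared.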

Root Cause

The deadlock is caused by prematurely placing a task in the target core's run queue (rq), which allows it to be immediately picked up and popped by switch_to on that core, before the core it last ran on has cleared its on_cpu flag.

Both cores then sit in the blocking spin-loop described above, each waiting for the other.

Previous Solution

In the previous solution, the polling was placed in unblock_task, which would insert the target task into the target CPU's run queue only once the on_cpu flag had been safely unset on the previous CPU.

This ensures the task is placed into a run queue only when it is no longer active on another core.
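A minimal sketch of that placement (helper names such as add_task are assumed for illustration, not the exact ArceOS API; the point is only where the polling sits):

pub(crate) fn unblock_task(&mut self, task: AxTaskRef, resched: bool) {
    #[cfg(feature = "smp")]
    while task.on_cpu() {
        // Wait until the core that last ran `task` has finished switching
        // away from it and cleared its `on_cpu` flag.
        core::hint::spin_loop();
    }
    // Only now is the task safe to expose to this CPU's scheduler; no other
    // core can still be referencing it inside `switch_to()`.
    task.set_state(TaskState::Ready);
    self.inner.add_task(task); // assumed enqueue helper
    if resched {
        // Optionally request a reschedule on this run queue's CPU (omitted).
    }
}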

Complication with migrate_current

The reason I placed the polling in switch_to instead is that we need to support migrate_current, as I explained in #183 (comment)

Moving the polling to unblock_task introduces complexity for the migrate_current method.

Since migrate_current is intended to place the current task in another CPU's run queue, it would require waiting for the current task's own on_cpu flag to become false, which can never happen while that task is still running on the current CPU.
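A sketch of why that wait cannot make progress (illustrative only, not real ArceOS code):

pub fn migrate_current(&mut self) {
    let curr = self.current_task.clone();
    // `curr` is the task executing this very function, so its `on_cpu` flag is
    // still `true`, and nobody will clear it until *after* we have switched
    // away from it. If migration reused the unblock-style polling, this loop
    // would spin forever on the current CPU.
    while curr.on_cpu() {
        core::hint::spin_loop(); // never makes progress
    }
    // unreachable: enqueue `curr` on the target CPU's run queue
}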

hky1999 commented Nov 1, 2024

Additional Notes on Linux Task Migration

The task migration process in Linux is intricate and involves several critical functions to ensure safe and atomic migration between CPUs.
The main workflow is handled in __set_cpus_allowed_ptr, which controls the permitted CPUs for a task.
If the task’s current CPU is no longer allowed, it will be migrated.

The migration relies on migration_cpu_stop, called via stop_one_cpu_nowait.

Here’s a brief outline of the core steps:

  • Triggering Migration: If a task’s cpus_allowed mask changes to disallow its current CPU, __set_cpus_allowed_ptr initiates migration.

  • Scheduling the Stop Task: The core migration logic uses stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop, &pending->arg, &pending->stop_work), which triggers a "stop task" on the current CPU. This stop task safely interrupts the target CPU and ensures the task is detached from its run queue.

  • Executing migration_cpu_stop: The migration_cpu_stop function performs the actual migration by moving the task to a permissible CPU. It works within a “stop machine” framework to ensure that the migration occurs atomically and avoids race conditions.

/*
 * This is how migration works:
 *
 * 1) we invoke migration_cpu_stop() on the target CPU using
 *    stop_one_cpu().
 * 2) stopper starts to run (implicitly forcing the migrated thread
 *    off the CPU)
 * 3) it checks whether the migrated task is still in the wrong runqueue.
 * 4) if it's in the wrong runqueue then the migration thread removes
 *    it and puts it into the right queue.
 * 5) stopper completes and stop_one_cpu() returns and the migration
 *    is done.
 */

/*
 * move_queued_task - move a queued task to new rq.
 *
 * Returns (locked) new rq. Old rq's lock is released.
 */
static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
				   struct task_struct *p, int new_cpu)
{
	lockdep_assert_held(&rq->lock);

	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
	set_task_cpu(p, new_cpu);
	rq_unlock(rq, rf);

	rq = cpu_rq(new_cpu);

	rq_lock(rq, rf);
	BUG_ON(task_cpu(p) != new_cpu);
	activate_task(rq, p, 0);
	check_preempt_curr(rq, p, 0);

	return rq;
}

hky1999 commented Nov 2, 2024

I don't think we need a complex mechanism like Linux's stop task to handle task migration between run queues, nor do we need a separate workflow similar to gc_task to accomplish this.

Since we already store a weak pointer to the prev_task in the TaskInner structure, why not let the next_task complete the migration of the prev_task after scheduling finishes, just like the current clear_prev_task_on_cpu design?

This approach is similar to @guoweikang's idea, but less radical. @guoweikang suggests that all prev_task insertion operations be handled by the next_task after scheduling completes; I believe it is sufficient to have the next_task handle only the migration of the prev_task, which keeps the change small and avoids refactoring every insertion path.

To implement this approach, we first need to upgrade the weak pointer to prev_task to an Arc.
This is necessary because, if prev_task needs to be migrated, it won’t immediately be placed in any run queue.
Additionally, it will be removed from CurrentTask in the switch_to function by the call CurrentTask::set_current(prev_task, next_task), leaving no active references to it until next_task completes the migration to another run queue.
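A sketch of what the field change might look like (the surrounding layout of TaskInner is an assumption; only the Weak-to-strong-reference change matters):

use core::cell::UnsafeCell;

pub(crate) struct TaskInner {
    // ... other fields unchanged ...
    /// Previously something like `Weak<AxTask>`; now an owning `AxTaskRef`,
    /// so a migrating task stays alive even though it sits in no run queue
    /// and is no longer referenced by `CurrentTask`. It is taken (reset to
    /// `None`) by `finish_prev_task_switch()`.
    #[cfg(feature = "smp")]
    prev_task: UnsafeCell<Option<AxTaskRef>>,
    // ...
}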

Also, we may need a Migrating task state to mark a task as waiting to be migrated by the next task.
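Something along these lines, where the existing variants are inferred from the states mentioned in this thread and the real definition may differ:

#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub(crate) enum TaskState {
    Running,
    Ready,
    Blocked,
    Exited,
    /// The task has scheduled itself away and is waiting for the next task on
    /// its old CPU to enqueue it on the target CPU's run queue.
    Migrating,
}

An is_migrating() helper mirroring the existing is_running() check would accompany it.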

  • migrate_current
    pub fn migrate_current(&mut self) {
        let curr = &self.current_task;
        trace!("task migrate: {}", curr.id_name());
        assert!(curr.is_running());

        curr.set_state(TaskState::Migrating);

        self.inner.resched();
    }
  • finish_prev_task_switch
    #[cfg(feature = "smp")]
    pub(crate) unsafe fn finish_prev_task_switch(&self) {
        let prev_task = (*self.prev_task.get())
            // Takes the `Arc` reference to the previous task, setting it to None in `self`.
            .take()
            .expect("Invalid prev_task pointer");

        // Clears the `on_cpu` field of previous task running on this CPU.
        prev_task.set_on_cpu(false);

        // Migrate the previous task to the run queue of correct CPU if necessary.
        // If `prev_task` doesn't need to be migrated, its `AxTaskRef` will be dropped automatically.
        if prev_task.is_migrating() {
            select_run_queue::<kernel_guard::NoOp>(&prev_task).migrate_task(prev_task);
        }

        // No need to manually set the `prev_task` to `None` since `take()` already did it.
    }
  • migrate_task (implemented for AxRunQueueRef)
    pub fn migrate_task(&mut self, task: AxTaskRef) {
        let task_id_name = task.id_name();
        let cpu_id = self.inner.cpu_id;
        debug!("task migrate: {} to run_queue {}", task_id_name, cpu_id);
        assert!(self
            .inner
            .put_task_with_state(task, TaskState::Migrating, false))
    }
