Busy loop in cpu_backend/task_queue.cpp keeps 1 thread at 100% CPU when queue is empty #80

Closed
sayap opened this issue Sep 9, 2024 · 5 comments · Fixed by #83

sayap (Contributor) commented Sep 9, 2024

Hi, I noticed that the busy loop in cpu_backend/task_queue.cpp keeps one thread at 100% CPU when the queue is empty:

    while (true) {
        mutex.lock();
        if (tasks.empty()) {
            if (exit_flag.load(std::memory_order_seq_cst)) {
                return;
            }
            mutex.unlock();
            continue;  // retries immediately: this busy-wait pins one core at 100%
        }
        ...
    }

On a laptop, this busy thread will keep the fans on all the time.

If I change the code to use a std::condition_variable, all the threads stay at near 0% CPU when the queue is empty.

I suppose std::condition_variable also works on Windows, but I don't have a machine ready to test that. If it doesn't work on Windows, we can add a short sleep before continue as a workaround. From my testing, a short sleep of 0.1 ms keeps the threads at near 0% CPU without impacting the queue processing speed. However, this feels much less elegant compared to using a condition variable 😅
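
For reference, a minimal sketch of what the condition-variable version could look like, assuming a std::queue of std::function tasks and free-standing declarations in place of the class members used above (an illustration only, not the exact change in sayap@9c81098):

    // Sketch only: workers block until a task is enqueued or exit is
    // requested, instead of spinning on an empty queue.
    #include <atomic>
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>

    std::mutex mutex;
    std::condition_variable cv;
    std::queue<std::function<void()>> tasks;
    std::atomic<bool> exit_flag{false};

    void processTasks() {
        while (true) {
            std::unique_lock<std::mutex> lock(mutex);
            // Sleeps at ~0% CPU until notified; no busy loop.
            cv.wait(lock, [] { return !tasks.empty() || exit_flag.load(); });
            if (tasks.empty()) {
                return;  // woken by exit_flag with nothing left to do
            }
            auto task = std::move(tasks.front());
            tasks.pop();
            lock.unlock();
            task();  // run the task outside the lock
        }
    }

    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> guard(mutex);
            tasks.push(std::move(task));
        }
        cv.notify_one();  // wake one sleeping worker
    }

    void shutdown() {
        exit_flag.store(true);
        cv.notify_all();  // wake all workers so they can observe exit_flag
    }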

chenht2022 (Contributor) commented:

You are right, this problem should be fixed.

I tried using std::condition_variable, but the delay in waking a thread up is not negligible. When individual tasks are relatively small, such as computing an MoE layer for a single token, std::condition_variable can lead to significant performance degradation.

A better practice, as you mentioned, is a short sleep that yields the core when idle waiting is detected. The modified code is as follows; it will be integrated into the repository.

    void TaskQueue::processTasks() {
        auto start = std::chrono::steady_clock::now();
        while (true) {
            mutex.lock();
            if (tasks.empty()) {
                if (exit_flag.load(std::memory_order_seq_cst)) {
                    return;
                }
                mutex.unlock();
                auto now = std::chrono::steady_clock::now();
                auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
                // Keep spinning for up to 50 ms after the last task, then back
                // off with 1 ms sleeps so an idle queue stops burning a core.
                if (duration > 50) {
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                }
                continue;
            }
            ...
            start = std::chrono::steady_clock::now();
        }
    }
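
To see why this sidesteps the wakeup latency: under polling, the producer never has to wake anyone. A hypothetical enqueue side (the task type and member names are assumptions, mirroring the loop above) would just push and unlock:

    // Hypothetical enqueue under the polling scheme above: no notification
    // is needed, because a worker is either spinning (sees the task
    // immediately) or in a 1 ms back-off sleep (sees it within about 1 ms).
    void TaskQueue::enqueue(std::function<void()> task) {
        mutex.lock();
        tasks.push(std::move(task));
        mutex.unlock();
    }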

sayap (Contributor, Author) commented Sep 10, 2024

Actually, I have been running a build with the cond var change for a few days (see sayap@9c81098), and there is no performance degradation during prefill/decode. This is on kernel 6.10.

chenht2022 (Contributor) commented Sep 11, 2024

I tried your code, and indeed there is no performance degradation; there might have been a mistake in my previous testing method.

It would be great if you could submit a Pull Request to our project.

By the way, in cpu_backend/backend.cpp there is a group of threads (Backend::worker_thread) that also uses a "busy loop + short sleep" synchronization method, which likewise creates unnecessary background overhead on the CPU. I'm not sure whether changing them all to a condition variable would cause performance degradation, because there are at least dozens of threads in a server environment. Maybe a hybrid "busy loop + condition variable" method can achieve both high performance and resource savings; a sketch of that idea is below. I will study it later.
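
A rough sketch of that hybrid: spin briefly for low latency, then park on a condition variable to save CPU. This is an illustration only, not code from the repository; all names here are hypothetical, and the work queue is reduced to an atomic counter:

    #include <atomic>
    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    std::mutex m;
    std::condition_variable cv;
    std::atomic<int> pending{0};    // stand-in for "tasks in the queue"
    std::atomic<bool> stop{false};

    // Try to claim one unit of work; lock-free so the spin phase stays cheap.
    bool try_take_work() {
        int n = pending.load(std::memory_order_acquire);
        return n > 0 && pending.compare_exchange_strong(n, n - 1);
    }

    void worker_loop() {
        using clock = std::chrono::steady_clock;
        while (!stop.load(std::memory_order_acquire)) {
            // Phase 1: spin briefly so small tasks are picked up with low latency.
            auto spin_until = clock::now() + std::chrono::microseconds(100);
            bool got = false;
            while (clock::now() < spin_until) {
                if ((got = try_take_work())) break;
            }
            if (got) {
                // ... run the task ...
                continue;
            }
            // Phase 2: nothing arrived during the spin window; park on the cv
            // so dozens of idle workers cost ~0% CPU instead of dozens of cores.
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return pending.load() > 0 || stop.load(); });
        }
    }

    void submit_work() {
        {
            // Publish under the mutex so a worker that just checked the
            // wait predicate cannot miss the wakeup.
            std::lock_guard<std::mutex> lk(m);
            pending.fetch_add(1, std::memory_order_release);
        }
        cv.notify_one();
    }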

sayap (Contributor, Author) commented Sep 11, 2024

Cool, I created pull request #83. Haven't tested it on Windows though.

sayap (Contributor, Author) commented Sep 11, 2024

> By the way, in cpu_backend/backend.cpp there is a group of threads (Backend::worker_thread) that also uses a "busy loop + short sleep" synchronization method, which likewise creates unnecessary background overhead on the CPU.

Ah, I see. When I was hunting for what was causing the 100% CPU, I did see a lot of 1 ms sleeps in the strace output; this is probably the reason. I don't think these 1 ms sleeps occupy too much CPU time, but yeah, it is probably better to replace them with a cond var as well 😁
