Remove thread pool, switch to tokio #10

Open
wants to merge 8 commits into
base: main
Conversation

Member

@seanlinsley seanlinsley commented Oct 17, 2022

Internally we use both tokio and async-std because of differing crate dependencies, and on top of that this crate runs its own thread pool. We've seen that removing async-std resolves an unbounded memory growth issue (related to TLS caching), but doing so introduced a significant performance regression. Hopefully moving sidekiq_server to tokio as well will allow tokio's cooperative task scheduling to work as intended. If not, we may still have to use tokio's unconstrained 😅

This PR removes the middleware layer as it significantly increased the number of compiler errors. It could certainly be reintroduced in the future, though we aren't using it internally at the moment.

In order to work around the many lifetime and thread safety errors that come from passing around async closures, I settled on a trait API that users of this crate can implement for every job they need to process. This happens to be very similar to the original Sidekiq library in Ruby, as each job is a class. In the future the job handlers could be given a &mut self so they can manage temporary state in their owned struct.
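
As a rough illustration of the trait-per-job shape described above, here is a std-only sketch. The PR itself uses `#[async_trait]`; this version hand-writes the boxed future that such a method roughly expands to, so it runs without extra crates. `RequireArgs`, `Registry`, `PerformFuture`, `block_on`, and the fields on `Job` are hypothetical names for illustration, not this crate's API.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-ins for the crate's real `Job` and `JobHandlerResult`.
pub struct Job {
    pub class: String,
    pub args: Vec<String>,
}
pub type JobHandlerResult = Result<(), String>;

// The future a handler returns: boxed and `Send`, but not required to be `Sync`.
pub type PerformFuture<'a> = Pin<Box<dyn Future<Output = JobHandlerResult> + Send + 'a>>;

// Hand-written form of `async fn perform(&self, job: &Job)` in a trait.
pub trait JobHandler: Send + Sync {
    fn perform<'a>(&'a self, job: &'a Job) -> PerformFuture<'a>;
}

// One struct per job class (hypothetical example job).
pub struct RequireArgs;

impl JobHandler for RequireArgs {
    fn perform<'a>(&'a self, job: &'a Job) -> PerformFuture<'a> {
        Box::pin(async move {
            if job.args.is_empty() {
                return Err(format!("{}: no arguments", job.class));
            }
            Ok(())
        })
    }
}

// Hypothetical registry that maps a job's class name to its handler.
#[derive(Default)]
pub struct Registry {
    handlers: HashMap<String, Arc<dyn JobHandler>>,
}

impl Registry {
    pub fn register(&mut self, class: &str, handler: Arc<dyn JobHandler>) {
        self.handlers.insert(class.to_string(), handler);
    }

    pub async fn dispatch(&self, job: &Job) -> JobHandlerResult {
        match self.handlers.get(&job.class) {
            Some(handler) => handler.perform(job).await,
            None => Err(format!("no handler registered for {}", job.class)),
        }
    }
}

// Tiny busy-polling executor so the sketch runs without an async runtime.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

A caller would register one handler per class with `registry.register("RequireArgs", Arc::new(RequireArgs))` and then await `registry.dispatch(&job)` from whatever runtime is in use.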

When reviewing this, it's important to consider whether the async tasks are being handled correctly for an environment where many jobs are running in parallel. That applies especially to run_queue_once, which performs a blocking call to Redis.

error!("report alive failed: '{}'", e);
}
select! {
recv(signal) -> signal => {
biased;

Member Author

@seanlinsley seanlinsley Oct 17, 2022

The select! loop sets biased to preserve the original behavior where each entry is attempted in order. https://docs.rs/tokio/latest/tokio/macro.select.html#fairness
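
To make the ordering point concrete: per the linked fairness docs, `select!` by default picks a random branch to check first, while `biased;` polls branches top to bottom. The sketch below is a std-only illustration of that in-order polling, not tokio's implementation; `select_in_order`, `Picked`, `block_on`, and the two demo functions are hypothetical names.

```rust
use std::future::{pending, poll_fn, ready, Future};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

#[derive(Debug, PartialEq)]
pub enum Picked<A, B> {
    First(A),
    Second(B),
}

// Polls `first` before `second` on every wake-up, so when both are ready
// the earlier branch always wins.
pub async fn select_in_order<A, B>(first: A, second: B) -> Picked<A::Output, B::Output>
where
    A: Future,
    B: Future,
{
    let mut first = Box::pin(first);
    let mut second = Box::pin(second);
    poll_fn(move |cx| {
        if let Poll::Ready(value) = first.as_mut().poll(cx) {
            return Poll::Ready(Picked::First(value));
        }
        if let Poll::Ready(value) = second.as_mut().poll(cx) {
            return Poll::Ready(Picked::Second(value));
        }
        Poll::Pending
    })
    .await
}

// Tiny busy-polling executor so the sketch runs without an async runtime.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

// Both branches are ready: the first one listed is chosen.
pub fn both_ready() -> Picked<&'static str, &'static str> {
    block_on(select_in_order(ready("shutdown signal"), ready("job")))
}

// Only the second branch is ready: it is chosen after the first is checked.
pub fn only_second_ready() -> Picked<(), &'static str> {
    block_on(select_in_order(pending::<()>(), ready("job")))
}
```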


use rand::{Rng, distributions};

use threadpool::ThreadPool;
use async_channel::{bounded, Receiver, Sender};

Member Author

Tokio provides an mpsc channel, but the code previously relied on an mpmc channel (to allow cloning), so I added this crate. https://docs.rs/async-channel/latest/async_channel/
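
For readers wondering why a cloneable receiver matters: several workers pull from one queue, and each job must reach exactly one of them. The sketch below is a std-only, thread-based analogy, assuming a single `std::sync::mpsc` receiver shared behind a mutex; it is not how async-channel is implemented, and `fan_out` is a hypothetical name.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Sends every job through one queue to `workers` threads and returns the
// jobs that were processed, sorted so the result is deterministic.
pub fn fan_out(jobs: Vec<u32>, workers: usize) -> Vec<u32> {
    let (job_tx, job_rx) = mpsc::channel::<u32>();
    // std's Receiver cannot be cloned, so the workers share it behind a lock.
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (done_tx, done_rx) = mpsc::channel::<u32>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let done_tx = done_tx.clone();
            thread::spawn(move || loop {
                // The guard is a temporary, so the lock is released at the
                // end of this statement, before the job is handled.
                let next = job_rx.lock().unwrap().recv();
                match next {
                    Ok(job) => done_tx.send(job).unwrap(),
                    Err(_) => break, // every sender is gone: no more jobs
                }
            })
        })
        .collect();
    drop(done_tx);

    for job in jobs {
        job_tx.send(job).unwrap();
    }
    drop(job_tx); // lets the workers exit once the queue is drained

    for handle in handles {
        handle.join().unwrap();
    }

    let mut processed: Vec<u32> = done_rx.iter().collect();
    processed.sort();
    processed
}
```

With an mpmc channel each worker would simply hold its own clone of the receiver, which removes the shared lock from the picture.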

Err(ErrorKind::JobHandlerError(Box::new("a".parse::<i8>().unwrap_err())).into())
#[async_trait]
pub trait JobHandler: Send + Sync {
async fn perform(&self, job: &Job) -> JobHandlerResult;

Member Author

@msepga has a different approach using a BoxFuture. That fixed the lifetime issues I had as of the first commit on this branch, but when integrating it into our own code, the futures generated by other crates caused compile errors because they aren't Sync.

Contributor

@msepga msepga Oct 18, 2022

In principle this should be equivalent to Box<dyn Sync + ...> given that JobHandler requires Sync here, though we can merge as-is regardless 👍

Member Author

I'm not sure what you mean by JobFuture -- there isn't a type like that around.

My understanding is that the BoxFuture version and my attempt in the first commit were running into issues because they required that the returned futures be Sync. This implementation only requires that the JobHandler is Sync.

Contributor

@msepga msepga Oct 19, 2022

Typo, should be JobHandler in my comment 🙂 So, BoxFuture doesn't require the future to be Sync; the linked implementation only required the handler closure to be Sync.

With the definition of BoxFuture:

pub type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a, Global>>;

Looking at the type now:

Box<dyn Send + Sync + FnMut(Job) -> BoxFuture<'static, Result<JobSuccessType>>>
    | The handler itself must: |    | The BoxFuture must:                    |
    | - impl Send              |    |  - Be Send                             |
    | - impl Sync              |    |  - Be 'static                          |
    | - Take a `Job` argument  |    |  - Return Result<JobSuccessType>       |
    | - Return `BoxFuture<..>`----> |                                        |
    `--------------------------'    |  The BoxFuture *does not* have to be   |
                                    |  Sync                                  |
                                    |                                        |
                                    `----------------------------------------'

Here's a playground link that illustrates how the handler can be Sync without the returned future being Sync. You'll notice that uncommenting line 25 will fail to compile, because the returned future doesn't implement Sync, even if the handler is required to do so.

We don't have to let this block the merge BTW, I just figured I'd leave a footnote in case it could help clear up how Box<dyn ... + Fn> is equivalent to the JobHandler trait here.
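
For anyone reading without the playground open, the same point can be checked with a small std-only sketch. This is not the linked playground code; the `BoxFuture` alias is re-declared locally without the allocator parameter, and `Job`, `Handler`, `make_handler`, `check`, and `block_on` are hypothetical stand-ins.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Same shape as the `BoxFuture` alias quoted above, minus the allocator.
pub type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// Hypothetical stand-ins for the crate's job and handler types.
pub struct Job(pub u32);
pub type Handler = Box<dyn Send + Sync + FnMut(Job) -> BoxFuture<'static, Result<u32, String>>>;

fn assert_send<T: Send + ?Sized>(_: &T) {}
fn assert_sync<T: Sync + ?Sized>(_: &T) {}

pub fn make_handler() -> Handler {
    Box::new(|job: Job| -> BoxFuture<'static, Result<u32, String>> {
        Box::pin(async move {
            // `Cell<u32>` is Send but not Sync, and it lives across the
            // await below, so this future is Send without being Sync.
            let counter = Cell::new(job.0);
            std::future::ready(()).await;
            counter.set(counter.get() + 1);
            Ok(counter.get())
        })
    })
}

// Tiny busy-polling executor so the sketch runs without an async runtime.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

pub fn check() -> Result<u32, String> {
    let mut handler = make_handler();
    assert_send(&handler);
    assert_sync(&handler); // the handler itself is Sync
    let fut = handler(Job(41));
    assert_send(&fut);
    // assert_sync(&fut); // would not compile: the boxed future is only Send
    block_on(fut)
}
```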

src/worker.rs Outdated
let mut result: Result<Option<(String, String)>> = Ok(None);
task::block_in_place(|| {
Handle::current().block_on(async {
result = self.redis.brpop(queues, 10).await.map_err(From::from);

Member Author

As far as I can tell, this is the correct way to run blocking operations so that they don't block other async tasks from running. Note that we intentionally want to preserve this blocking brpop because it allows workers to start working on new jobs as soon as they're available.

Member Author

Also note that the server and each individual worker have their own Redis connection. If they shared the same connection this wouldn't work properly.

Member

@lfittl lfittl left a comment

I'll defer to @msepga on a detailed review, but conceptually this makes sense to me.

It became overly complicated since tokio cancels the futures for branches
that aren't reached, and for the blocking Redis request that could mean
jobs are lost in-transit over the network.

Instead, simplify the loop to rely on the blocking operation to prevent
the loop from executing too quickly. Also, reduce the blocking duration
to 1 second so workers are able to respond to shutdown requests faster.
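
The loop described in this commit message can be sketched with std only, assuming a channel with `recv_timeout` as a stand-in for the blocking Redis pop and an atomic flag as the shutdown request. `run_worker`, `demo`, and the 50 ms interval are hypothetical; the commit itself uses a 1 second blocking duration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

// Runs until shutdown is requested or the queue is closed. The blocking
// receive paces the loop, and its timeout bounds how long a shutdown
// request can go unnoticed.
pub fn run_worker<F>(
    jobs: mpsc::Receiver<String>,
    shutdown: Arc<AtomicBool>,
    poll_interval: Duration,
    mut handle: F,
) where
    F: FnMut(String),
{
    loop {
        if shutdown.load(Ordering::SeqCst) {
            break;
        }
        match jobs.recv_timeout(poll_interval) {
            Ok(job) => handle(job),
            Err(mpsc::RecvTimeoutError::Timeout) => continue,
            Err(mpsc::RecvTimeoutError::Disconnected) => break,
        }
    }
}

// Processes two jobs on a worker thread, then asks the worker to stop.
pub fn demo() -> Vec<String> {
    let (job_tx, job_rx) = mpsc::channel();
    let (done_tx, done_rx) = mpsc::channel();
    let shutdown = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&shutdown);

    let worker = thread::spawn(move || {
        run_worker(job_rx, flag, Duration::from_millis(50), move |job| {
            done_tx.send(job).unwrap();
        })
    });

    job_tx.send("a".to_string()).unwrap();
    job_tx.send("b".to_string()).unwrap();
    // Wait for both jobs to be handled before requesting shutdown.
    let processed = vec![done_rx.recv().unwrap(), done_rx.recv().unwrap()];

    shutdown.store(true, Ordering::SeqCst);
    worker.join().unwrap(); // returns within roughly one poll interval
    drop(job_tx);
    processed
}
```
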
loop {
select! {

Member Author

I decided not to use tokio's select! here since it cancels the future of any branch that wasn't executed. For the Redis brpop that could mean the future is canceled after Redis has sent us data over the network but before it's been received (meaning that job would be lost). I did have a select! implementation in this PR originally, but I removed it in favor of a simple loop because it's much more understandable this way.
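
The hazard can be reproduced in a few lines of std-only Rust. This is a simplified model rather than the Redis client's behavior: `pop` is a hypothetical future that removes the job from a shared queue on its first poll and only hands it to the caller on the next one, and dropping it in between plays the role of `select!` cancelling a branch.

```rust
use std::collections::VecDeque;
use std::future::{poll_fn, Future};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

type Queue = Arc<Mutex<VecDeque<&'static str>>>;

// Returns `Pending` once, standing in for "still waiting on network I/O".
async fn wait_one_poll() {
    let mut polled = false;
    poll_fn(move |_cx| {
        if polled {
            Poll::Ready(())
        } else {
            polled = true;
            Poll::Pending
        }
    })
    .await
}

// Hypothetical stand-in for the pop: the job leaves the queue first, and
// the caller only receives it after one more poll.
async fn pop(queue: Queue) -> Option<&'static str> {
    let job = queue.lock().unwrap().pop_front();
    wait_one_poll().await;
    job
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Returns (was the future still pending when dropped, jobs left in the queue).
pub fn cancel_mid_pop() -> (bool, usize) {
    let queue: Queue = Arc::new(Mutex::new(VecDeque::from(vec!["job-1"])));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    let mut fut = Box::pin(pop(Arc::clone(&queue)));
    // First poll: the job is taken off the queue, but the future is not done.
    let was_pending = fut.as_mut().poll(&mut cx).is_pending();
    // Dropping the future here is what a losing `select!` branch experiences.
    drop(fut);

    // The job is neither in the queue nor in the caller's hands.
    let remaining = queue.lock().unwrap().len();
    (was_pending, remaining)
}
```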

Contributor

@msepga msepga left a comment

LGTM 👍

Left a nit regarding br_pop

src/worker.rs Outdated
Member

lfittl commented Dec 8, 2022

@seanlinsley I happened to run across this by coincidence - should we merge this into main, given we're using this branch in app already?

Member Author

I intentionally didn't merge this because this version is several times slower for our workload than the thread pool, and this PR doesn't maintain backwards compatibility. There are some people who have starred / are watching this repo so it's possible other people are relying on this code.
