-
What I have done in the past is implement a future that wraps the spawned task and tracks the data. You can use that to instrument waking and track the lag between a wake notification and poll. If needed, I can sketch it up.
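For reference, one way that idea can look is the sketch below: a wrapper future that hands the inner future an instrumented waker and, on the next poll, reports how long ago the wake happened. This is only an illustration with my own naming (`WakeLag`, `InstrumentedWake`), not the commenter's actual code, and it relies on `std::task::Wake` (Rust 1.51+):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::time::Instant;

// Records when the wrapped task was last woken, then forwards the wake
// to the runtime's real waker.
struct InstrumentedWake {
    inner: Waker,
    woken_at: Mutex<Option<Instant>>,
}

impl Wake for InstrumentedWake {
    fn wake(self: Arc<Self>) {
        self.wake_by_ref();
    }
    fn wake_by_ref(self: &Arc<Self>) {
        *self.woken_at.lock().unwrap() = Some(Instant::now());
        self.inner.wake_by_ref();
    }
}

pub struct WakeLag<F> {
    fut: F,
    label: &'static str,
    state: Option<Arc<InstrumentedWake>>,
}

impl<F> WakeLag<F> {
    pub fn wrap(fut: F, label: &'static str) -> Self {
        Self { fut, label, state: None }
    }
}

impl<F: Future> Future for WakeLag<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        // Safety: `fut` is never moved out of the pinned struct.
        let this = unsafe { self.get_unchecked_mut() };

        // If the waker handed out on the previous poll recorded a wake,
        // the time elapsed since then is the wake-to-poll lag.
        if let Some(state) = &this.state {
            if let Some(woken_at) = state.woken_at.lock().unwrap().take() {
                eprintln!("{}: wake-to-poll lag {:?}", this.label, woken_at.elapsed());
            }
        }

        // Hand the inner future an instrumented waker for the next wake.
        let state = Arc::new(InstrumentedWake {
            inner: cx.waker().clone(),
            woken_at: Mutex::new(None),
        });
        this.state = Some(state.clone());
        let waker = Waker::from(state);
        let mut cx = Context::from_waker(&waker);
        unsafe { Pin::new_unchecked(&mut this.fut) }.poll(&mut cx)
    }
}

// Usage: tokio::spawn(WakeLag::wrap(some_task(), "some_task"));
```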
-
With help from @Darksonn, I started measuring slow poll times in our app with a wrapper like this:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::task::{Context, Poll};
use std::time::Instant;
use tracing::warn; // assuming the tracing crate, given the structured fields below

// 5 ms, in nanoseconds.
const THRESHOLD: u128 = 5000 * 1000;

pub struct Measured<F> {
    fut: F,
    label: &'static str,
    threshold: u128,
    count: AtomicUsize,
}

impl<F> Measured<F> {
    pub fn wrap(fut: F, label: &'static str) -> Self {
        Self {
            fut,
            label,
            threshold: THRESHOLD,
            count: AtomicUsize::new(0),
        }
    }

    pub fn into_inner(self) -> F {
        self.fut
    }

    pub fn wrap_t(fut: F, label: &'static str, threshold: u128) -> Self {
        Self {
            fut,
            label,
            threshold,
            count: AtomicUsize::new(0),
        }
    }
}

impl<F: Future> Future for Measured<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, ctx: &mut Context<'_>) -> Poll<F::Output> {
        let label = self.label;
        let t = self.threshold;
        let count = self.count.fetch_add(1, Ordering::SeqCst);
        // Safety: we never move `fut` out of the pinned struct.
        let fut = unsafe { Pin::map_unchecked_mut(self, |me| &mut me.fut) };
        let start = Instant::now();
        let res = fut.poll(ctx);
        let measured = start.elapsed();
        if measured.as_nanos() > t {
            warn!({ nanos = measured.as_nanos() as u64, label, count = count },
                "future took a long time!"
            );
        }
        res
    }
}
```

This helped us figure out the slowest ones. It also helped to put a few of them (like the TLS handshake and some fs IO) in `spawn_blocking`.

However, I'm still getting many "slow polls" that take over 5-10ms to complete, even locally in an environment where there is virtually no load. I compiled our app in release mode and I've been able to trigger some slow polls. The biggest slowness I've identified is when accepting connections. If I run a load testing tool with keepalives (and connection reuse), the slow polls only log when the initial connections are created. If I disable keepalives, they log throughout the load test. Of course, it's mostly the polls on futures happening while we accept a connection, TLS-handshake the connection, etc. that are slow. I don't understand why, though. For example, here's a rough structure (with too much useless detail) of what's happening during the TLS handshake; it consistently shows up in the slow-poll logs:
This is fairly simple and all "measured" for slow polls; I don't think any of this should be slow. The top-level future triggers often (with 10+ ms polls), but none of its children trigger much, which leaves me unsure where to look next. Putting the TLS handshake in a `spawn_blocking` really helped, but we're still seeing 80+ ms of event-loop "lag" (99th percentile) from our measurement.

These are the logs I get when sending ~1200 requests in 10s (with a concurrency of 60). Each request creates a new connection (keepalives disabled) which our app needs to accept.
(the many …) Any ideas what I should be measuring next? What to look for? Would using the … ?
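As a concrete illustration of the `spawn_blocking` point above, something along these lines keeps CPU-heavy work off the worker threads; `expensive_step` here is a hypothetical stand-in, not the app's actual code:

```rust
use tokio::task::spawn_blocking;

// Hypothetical stand-in for a CPU-heavy synchronous step, e.g. the
// compute-bound part of a TLS handshake or a large serde parse.
fn expensive_step(input: Vec<u8>) -> Vec<u8> {
    input
}

async fn run_off_the_event_loop(input: Vec<u8>) -> Vec<u8> {
    // spawn_blocking moves the closure onto tokio's dedicated blocking
    // thread pool, so a multi-millisecond computation never stalls the
    // worker threads that poll all the other futures.
    spawn_blocking(move || expensive_step(input))
        .await
        .expect("blocking task panicked")
}
```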
-
I tried minimally reproducing this and I was able to with the following code:

```rust
// main.rs
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};
use std::{convert::Infallible, net::SocketAddr};

mod future;

async fn handle(_: Request<Body>) -> Result<Response<Body>, Infallible> {
    future::Measured::wrap_t(
        async move { Ok(Response::new("Hello, World!".into())) },
        "handle",
        1000 * 1000, // 1 ms threshold
    )
    .await
}

// `core_threads` is the tokio 0.2 name; on tokio 1.x this is `worker_threads`.
#[tokio::main(core_threads = 2)]
async fn main() {
    let addr = SocketAddr::from(([127, 0, 0, 1], 3214));
    let make_svc = make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(handle)) });
    let server = Server::bind(&addr).serve(make_svc);
    if let Err(e) = future::Measured::wrap(server, "server").await {
        eprintln!("server error: {}", e);
    }
}
```

```rust
// future.rs
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::task::{Context, Poll};
use std::time::Instant;

// 5 ms, in nanoseconds.
const THRESHOLD: u128 = 5000 * 1000;

pub struct Measured<F> {
    fut: F,
    label: &'static str,
    threshold: u128,
    count: AtomicUsize,
}

impl<F> Measured<F> {
    pub fn wrap(fut: F, label: &'static str) -> Self {
        Self {
            fut,
            label,
            threshold: THRESHOLD,
            count: AtomicUsize::new(0),
        }
    }

    pub fn into_inner(self) -> F {
        self.fut
    }

    pub fn wrap_t(fut: F, label: &'static str, threshold: u128) -> Self {
        Self {
            fut,
            label,
            threshold,
            count: AtomicUsize::new(0),
        }
    }
}

impl<F: Future> Future for Measured<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, ctx: &mut Context<'_>) -> Poll<F::Output> {
        let label = self.label;
        let t = self.threshold;
        let count = self.count.fetch_add(1, Ordering::SeqCst);
        // Safety: we never move `fut` out of the pinned struct.
        let fut = unsafe { Pin::map_unchecked_mut(self, |me| &mut me.fut) };
        let start = Instant::now();
        let res = fut.poll(ctx);
        let measured = start.elapsed();
        if measured.as_nanos() > t {
            eprintln!(
                "future took a long time! nanos = {}, label = {}, count = {}",
                measured.as_nanos(),
                label,
                count
            );
        }
        res
    }
}
```

I then blasted the hyper server with this:
I started seeing the following logs:
I was mostly trying to get an idea of a baseline for polling latency.
-
I was wondering if there's a way to measure the event loop lag. I've been discussing it in the tokio Discord, but I figured this might be useful here.
Our current implementation for (roughly) measuring lag on the event loop is this:
We had considered (and it has also been suggested) running this every 1-2 seconds instead of every 5, and recording the lag as a histogram. We're storing this in a Prometheus cluster.
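A minimal sketch of this kind of probe (the shape is an assumption, not the implementation referenced above): spawn a task that sleeps for a fixed interval and reports how much longer than the interval it actually took to run again.

```rust
use std::time::{Duration, Instant};

// Any extra time beyond `interval` before the task runs again is (roughly)
// how long the runtime was too busy to get back to this task.
pub fn spawn_lag_probe(interval: Duration) {
    tokio::spawn(async move {
        loop {
            let start = Instant::now();
            tokio::time::sleep(interval).await; // `tokio::time::delay_for` on tokio 0.2
            let lag = start.elapsed().saturating_sub(interval);
            // In the real setup this would feed a Prometheus histogram instead.
            eprintln!("event loop lag: {:?}", lag);
        }
    });
}

// Usage: spawn_lag_probe(Duration::from_secs(5));
```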
It's also been suggested that this is inaccurate because of the work-stealing nature of the multi-threaded event loop. That is to say: one thread might have lag in processing tasks while others might be fast.
Even with these inaccuracies, I've sometimes been able to measure a lag of up to 190ms on our event loop. That seems enormous. For context: we're running a reverse proxy, based on hyper. We handshake TLS a few times per second and we cache some things on disk (using tokio's async fs API). We also reload a configuration (at most once per second), which reads a file and parses 5MB of JSON with serde.
We're interested in measuring the lag precisely because we're seeing some weird behaviour from time to time (hyper bodies returning very slowly). That's a story for another discussion.
It would also be interesting to measure the number of tasks currently running / pending / etc. on the event loop, to fully troubleshoot this slowness.