
Excessive resource usage with TLS or external connection #300

Open · masnax opened this issue Jul 4, 2024 · 2 comments

masnax (Contributor) commented Jul 4, 2024
It seems that as soon as each cluster member begins regularly querying for the leader info, the CPU usage of go-dqlite balloons.

This was noticed in microcluster, where we have been forced to set our cluster heartbeats to fire only every 30 seconds to avoid excessive CPU usage.

The plaintext implementation appears fine, but when the app is started with WithExternalConn or WithTLS, the service becomes very inefficient.

Take the following example using WithTLS in the dqlite-demo example daemon. The only change is adding the following loop, which makes each peer query the dqlite leader every 3 seconds:

// Following the call to http.Serve:
go func() {
	// Initially sleep to give the daemon time to set up.
	time.Sleep(5 * time.Second)
	for {
		// Get a client connected to the current leader.
		c, err := app.Leader(context.Background())
		if err != nil {
			panic(err)
		}

		// Fetch the leader's node info.
		c.Leader(context.Background())

		// Sleep 3s before trying again.
		time.Sleep(3 * time.Second)
	}
}()

Now launch 3 dqlite-demo peers with TLS enabled:

DNS=$(hostname)
IP=$(hostname -I | cut -f 1 -d ' ')
CN=example.com
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 \
  -nodes -keyout cluster.key -out cluster.crt -subj "/CN=$CN" \
  -addext "subjectAltName=DNS:$DNS,IP:$IP"

for i in $(seq 1 3) ; do
  if [ ${i} = 1 ] ; then
    dqlite-demo --api 127.0.0.1:800${i} --db 127.0.0.1:900${i} --cert "cluster.crt" --key "cluster.key" &
    continue
  fi

  sleep 1
  dqlite-demo --api 127.0.0.1:800${i} --db 127.0.0.1:900${i} --join 127.0.0.1:9001 --cert "cluster.crt" --key "cluster.key" &
done

Finally, see the result:

health-check -c -p 163326 | head -20
CPU usage (in terms of 1 CPU):
     PID Process                USR%   SYS% TOTAL%   Duration
  163326 dqlite-demo            4.02   0.90   4.92      60.00  (light load)

Page Faults:
     PID Process                 Minor/sec    Major/sec    Total/sec
  163326 dqlite-demo                 55.20         0.00        55.20

Context Switches:
     PID Process                Voluntary   Involuntary     Total
                               Ctxt Sw/Sec  Ctxt Sw/Sec  Ctxt Sw/Sec
  163333 dqlite-demo                205.45         0.03       205.48 (quite high)
  163328 dqlite-demo                201.15         0.00       201.15 (quite high)
  163326 dqlite-demo                162.78         0.03       162.81 (quite high)
  163330 dqlite-demo                105.57         0.03       105.60 (quite high)
  163329 dqlite-demo                 86.23         0.03        86.27 (moderate)
  163380 dqlite-demo                 71.87         0.08        71.95 (moderate)
  163331 dqlite-demo                 70.47         0.03        70.50 (moderate)
  163335 dqlite-demo                 69.53         0.08        69.62 (moderate)
  163378 dqlite-demo                 57.15         0.02        57.17 (moderate)

Meanwhile, top zig-zags between 0 and 10% CPU usage at each "heartbeat" interval:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 163326 root      20   0 2813608  33852  10752 S   9.0   0.1   0:08.66 dqlite-demo

For comparison, I ran the same 3-second dqlite leader check continually on every node of a 3-member LXD cluster. Oddly, LXD still uses less CPU, with comparable context switches, and never goes above 2% CPU usage. This is on top of LXD's already-present continual cluster heartbeats. It's worth noting that LXD does not use the main dqlite.App implementation.

CPU usage (in terms of 1 CPU):
     PID Process                USR%   SYS% TOTAL%   Duration
  185737 lxd                    0.48   1.03   1.52      60.00  (very light load)

Page Faults:
     PID Process                 Minor/sec    Major/sec    Total/sec
  185737 lxd                          5.33         0.00         5.33

Context Switches:
     PID Process                Voluntary   Involuntary     Total
                               Ctxt Sw/Sec  Ctxt Sw/Sec  Ctxt Sw/Sec
  185755 lxd                        288.58         0.02       288.60 (quite high)
  185840 lxd                        217.03         0.02       217.05 (quite high)
  185762 lxd                        166.83         0.17       167.00 (quite high)
  185758 lxd                        152.43         0.10       152.53 (quite high)
  185757 lxd                        132.81         0.02       132.83 (quite high)
  185778 lxd                        132.01         0.02       132.03 (quite high)
  185837 lxd                         93.98         0.03        94.02 (moderate)
  185836 lxd                         65.27         0.00        65.27 (moderate)
  185834 lxd                         48.82         0.02        48.83 (moderate)
masnax (Contributor, Author) commented Aug 21, 2024

Just to add: an easier reproducer than adding the goroutine mentioned above is to set a lower roles adjustment frequency, e.g. app.WithRolesAdjustmentFrequency(time.Second * 5).
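For reference, here is a minimal sketch of starting a node that way, assuming go-dqlite's app package; the data directory, address, and certificate paths are placeholders and correspond to the cluster.crt/cluster.key generated earlier:

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
	"time"

	"github.com/canonical/go-dqlite/app"
)

func main() {
	// Load the shared cluster keypair (same cluster.crt/cluster.key as above).
	cert, err := tls.LoadX509KeyPair("cluster.crt", "cluster.key")
	if err != nil {
		log.Fatal(err)
	}
	pem, err := os.ReadFile("cluster.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(pem)

	// Start a node with TLS and an aggressive 5s roles adjustment frequency,
	// which should be enough to observe the CPU spikes described above.
	node, err := app.New("/tmp/dqlite-demo-1",
		app.WithAddress("127.0.0.1:9001"),
		app.WithTLS(app.SimpleTLSConfig(cert, pool)),
		app.WithRolesAdjustmentFrequency(5*time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer node.Close()

	// Block until the node is ready to serve requests.
	if err := node.Ready(context.Background()); err != nil {
		log.Fatal(err)
	}

	select {} // keep running so CPU usage can be observed
}
```

This is a sketch of a single node; additional peers would join via app.WithCluster, as the dqlite-demo --join flag does.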

masnax (Contributor, Author) commented Aug 21, 2024

When deploying a dqlite cluster with TLS, RolesAdjustmentFrequency set to 5s, and WithMaxConcurrentLeaderConns set to 1, I see the following CPU spikes with top. Although they are very brief, they occur every couple of seconds before the CPU returns to idle:

 top -d 0.5 -p $(pgrep -f "dqlite-demo.*8001")
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  0.2 sy,  0.0 ni, 98.4 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  32078.2 total,  23412.6 free,   2042.6 used,   6623.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  28138.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  21330 root      20   0 2666584  33396  10752 S  30.0   0.1   0:21.51 dqlite-demo
