
Excessive resource usage with TLS or external connection #300

Open · masnax opened this issue Jul 4, 2024 · 2 comments

masnax (Contributor) commented Jul 4, 2024
It seems that as soon as each cluster member begins regularly querying for the leader info, the CPU usage of go-dqlite balloons.

This was noticed in microcluster, where we have been forced to set our cluster heartbeats to fire only every 30 seconds to avoid excessive CPU usage.

The plaintext implementation appears fine, but when the app is started with WithExternalConn or WithTLS, the service becomes very inefficient.

Take the following example using WithTLS in the dqlite-demo example daemon. The only change is adding the following loop, which makes each peer query the dqlite leader every 3 seconds:

// Following the call to http.Serve:
go func() {
	// Initially sleep to give the daemon time to set up.
	time.Sleep(5 * time.Second)
	for {
		// Get a client connected to the current leader.
		c, err := app.Leader(context.Background())
		if err != nil {
			panic(err)
		}

		// Fetch the leader's node info.
		c.Leader(context.Background())

		// Sleep 3s before trying again.
		time.Sleep(3 * time.Second)
	}
}()

Now launch 3 dqlite-demo peers with TLS enabled:

DNS=$(hostname)
IP=$(hostname -I | cut -f 1 -d ' ')
CN=example.com
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 \
  -nodes -keyout cluster.key -out cluster.crt -subj "/CN=$CN" \
  -addext "subjectAltName=DNS:$DNS,IP:$IP"

for i in $(seq 1 3) ; do
  if [ ${i} = 1 ] ; then
    dqlite-demo --api 127.0.0.1:800${i} --db 127.0.0.1:900${i} --cert "cluster.crt" --key "cluster.key" &
    continue
  fi

  sleep 1
  dqlite-demo --api 127.0.0.1:800${i} --db 127.0.0.1:900${i} --join 127.0.0.1:9001 --cert "cluster.crt" --key "cluster.key" &
done

Finally, see the result:

health-check -c -p 163326 | head -20
CPU usage (in terms of 1 CPU):
     PID Process                USR%   SYS% TOTAL%   Duration
  163326 dqlite-demo            4.02   0.90   4.92      60.00  (light load)

Page Faults:
     PID Process                 Minor/sec    Major/sec    Total/sec
  163326 dqlite-demo                 55.20         0.00        55.20

Context Switches:
     PID Process                Voluntary   Involuntary     Total
                               Ctxt Sw/Sec  Ctxt Sw/Sec  Ctxt Sw/Sec
  163333 dqlite-demo                205.45         0.03       205.48 (quite high)
  163328 dqlite-demo                201.15         0.00       201.15 (quite high)
  163326 dqlite-demo                162.78         0.03       162.81 (quite high)
  163330 dqlite-demo                105.57         0.03       105.60 (quite high)
  163329 dqlite-demo                 86.23         0.03        86.27 (moderate)
  163380 dqlite-demo                 71.87         0.08        71.95 (moderate)
  163331 dqlite-demo                 70.47         0.03        70.50 (moderate)
  163335 dqlite-demo                 69.53         0.08        69.62 (moderate)
  163378 dqlite-demo                 57.15         0.02        57.17 (moderate)

Meanwhile, top zig-zags between 0 and 10% CPU usage at each "heartbeat" interval:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 163326 root      20   0 2813608  33852  10752 S   9.0   0.1   0:08.66 dqlite-demo

For comparison, I ran the same 3-second dqlite leader check continually on every node of a 3-member LXD cluster. Oddly, LXD still uses less CPU, with comparable context switches, and never goes above 2% CPU usage. This is on top of LXD's already-present continual cluster heartbeats. It's worth noting that LXD does not use the main dqlite.App implementation.

CPU usage (in terms of 1 CPU):
     PID Process                USR%   SYS% TOTAL%   Duration
  185737 lxd                    0.48   1.03   1.52      60.00  (very light load)

Page Faults:
     PID Process                 Minor/sec    Major/sec    Total/sec
  185737 lxd                          5.33         0.00         5.33

Context Switches:
     PID Process                Voluntary   Involuntary     Total
                               Ctxt Sw/Sec  Ctxt Sw/Sec  Ctxt Sw/Sec
  185755 lxd                        288.58         0.02       288.60 (quite high)
  185840 lxd                        217.03         0.02       217.05 (quite high)
  185762 lxd                        166.83         0.17       167.00 (quite high)
  185758 lxd                        152.43         0.10       152.53 (quite high)
  185757 lxd                        132.81         0.02       132.83 (quite high)
  185778 lxd                        132.01         0.02       132.03 (quite high)
  185837 lxd                         93.98         0.03        94.02 (moderate)
  185836 lxd                         65.27         0.00        65.27 (moderate)
  185834 lxd                         48.82         0.02        48.83 (moderate)
masnax (Contributor, Author) commented Aug 21, 2024

Just to add: an easier reproducer than adding the goroutine mentioned above is to set a lower roles adjustment frequency, e.g. app.WithRolesAdjustmentFrequency(time.Second * 5).
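For reference, here is a minimal sketch of starting a node that way, assuming go-dqlite's app package; the data directory, address, and certificate paths are placeholders and correspond to the cluster.crt/cluster.key generated earlier:

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
	"time"

	"github.com/canonical/go-dqlite/app"
)

func main() {
	// Load the shared cluster keypair (same cluster.crt/cluster.key as above).
	cert, err := tls.LoadX509KeyPair("cluster.crt", "cluster.key")
	if err != nil {
		log.Fatal(err)
	}
	pem, err := os.ReadFile("cluster.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(pem)

	// Start a node with TLS and an aggressive 5s roles adjustment frequency,
	// which should be enough to observe the CPU spikes described above.
	node, err := app.New("/tmp/dqlite-demo-1",
		app.WithAddress("127.0.0.1:9001"),
		app.WithTLS(app.SimpleTLSConfig(cert, pool)),
		app.WithRolesAdjustmentFrequency(5*time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer node.Close()

	// Block until the node is ready to serve requests.
	if err := node.Ready(context.Background()); err != nil {
		log.Fatal(err)
	}

	select {} // keep running so CPU usage can be observed
}
```

This is a sketch of a single node; additional peers would join via app.WithCluster, as the dqlite-demo --join flag does.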

masnax (Contributor, Author) commented Aug 21, 2024

When deploying a dqlite cluster with TLS, RolesAdjustmentFrequency set to 5s, and WithMaxConcurrentLeaderConns set to 1, I see the following CPU spikes with top. Although they are very brief, they occur every couple of seconds before the CPU returns to idle:

 top -d 0.5 -p $(pgrep -f "dqlite-demo.*8001")
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  0.2 sy,  0.0 ni, 98.4 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  32078.2 total,  23412.6 free,   2042.6 used,   6623.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  28138.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  21330 root      20   0 2666584  33396  10752 S  30.0   0.1   0:21.51 dqlite-demo
