scx_layered: Consume from local LLCs for dispatch #919

Merged
merged 3 commits into from
Nov 15, 2024

Conversation

hodgesds
Contributor

When dispatching, consume from DSQs in the local LLC first before trying remote DSQs. This should still be fair, as the layer iteration order is maintained.

Contributor

@likewhatevs likewhatevs left a comment


LGTM

@hodgesds
Contributor Author

hodgesds commented Nov 11, 2024

Some stress-ng results; I'm not 100% sure the llc-affinity test is the best benchmark.

main branch:

$ stress-ng -c 175 -t 32 -M 
stress-ng: info:  [707913] setting to a 32 secs run per stressor
stress-ng: info:  [707913] dispatching hogs: 175 cpu
stress-ng: metrc: [707913] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [707913]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [707913] cpu             5669917     32.00   5582.11      0.25    177174.78        1015.68        99.68          4224
stress-ng: info:  [707913] skipped: 0
stress-ng: info:  [707913] passed: 175: cpu (175)
stress-ng: info:  [707913] failed: 0
stress-ng: info:  [707913] metrics untrustworthy: 0
stress-ng: info:  [707913] successful run completed in 32.03 secs
$ stress-ng --llc-affinity 175 -t 32 -M
stress-ng: info:  [766348] setting to a 32 secs run per stressor
stress-ng: info:  [766348] dispatching hogs: 175 llc-affinity
stress-ng: info:  [766352] llc-affinity: using LLC cache size of 16384K
stress-ng: metrc: [766348] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [766348]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [766348] llc-affinity       1050     32.04   3894.67    146.29        32.77           0.26        72.07         18304
stress-ng: metrc: [766348] miscellaneous metrics:
stress-ng: metrc: [766348] llc-affinity        1405.43 MB pec sec memory write rate (harmonic mean of 175 instances)
stress-ng: metrc: [766348] llc-affinity        1357.90 MB per sec memory read rate (harmonic mean of 175 instances)
stress-ng: metrc: [766348] llc-affinity          31.10 CPU affinity changes per sec (harmonic mean of 175 instances)
stress-ng: info:  [766348] skipped: 0
stress-ng: info:  [766348] passed: 175: llc-affinity (175)
stress-ng: info:  [766348] failed: 0
stress-ng: info:  [766348] metrics untrustworthy: 0
stress-ng: info:  [766348] successful run completed in 32.10 secs
$ stress-ng --cache 175 --cache-enable-all  -M -t 32
stress-ng: info:  [872189] setting to a 32 secs run per stressor
stress-ng: info:  [872189] dispatching hogs: 175 cache
stress-ng: info:  [872190] cache: cldemote is not available, ignoring this option
stress-ng: info:  [872190] cache: cache flags used: prefetch clflush fence sfence clflushopt clwb
stress-ng: metrc: [872189] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [872189]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [872189] cache           8625597     32.03   2423.43      0.70    269286.86        3558.22        43.25         16896
stress-ng: metrc: [872189] miscellaneous metrics:
stress-ng: metrc: [872189] cache            1263481.16 cache ops per second (harmonic mean of 175 instances)
stress-ng: metrc: [872189] cache         1057726110.56 shared cache reads per second (harmonic mean of 175 instances)
stress-ng: metrc: [872189] cache         1407460135.42 shared cache writes per second (harmonic mean of 175 instances)
stress-ng: info:  [872189] skipped: 0
stress-ng: info:  [872189] passed: 175: cache (175)
stress-ng: info:  [872189] failed: 0
stress-ng: info:  [872189] metrics untrustworthy: 0
stress-ng: info:  [872189] successful run completed in 32.06 secs

PR:

$ stress-ng -c 175 -t 32 -M 
stress-ng: info:  [725451] setting to a 32 secs run per stressor
stress-ng: info:  [725451] dispatching hogs: 175 cpu
stress-ng: metrc: [725451] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [725451]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [725451] cpu             5670612     32.00   5582.23      0.22    177196.07        1015.79        99.68          4224
stress-ng: info:  [725451] skipped: 0
stress-ng: info:  [725451] passed: 175: cpu (175)
stress-ng: info:  [725451] failed: 0
stress-ng: info:  [725451] metrics untrustworthy: 0
stress-ng: info:  [725451] successful run completed in 32.03 secs
$ stress-ng --llc-affinity 175 -t 32 -M
stress-ng: info:  [756667] setting to a 32 secs run per stressor
stress-ng: info:  [756667] dispatching hogs: 175 llc-affinity
stress-ng: info:  [756668] llc-affinity: using LLC cache size of 16384K
stress-ng: metrc: [756667] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [756667]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [756667] llc-affinity       1059     32.04   3811.80    146.25        33.05           0.27        70.59         18304
stress-ng: metrc: [756667] miscellaneous metrics:
stress-ng: metrc: [756667] llc-affinity        1462.30 MB pec sec memory write rate (harmonic mean of 175 instances)
stress-ng: metrc: [756667] llc-affinity        1406.11 MB per sec memory read rate (harmonic mean of 175 instances)
stress-ng: metrc: [756667] llc-affinity          31.51 CPU affinity changes per sec (harmonic mean of 175 instances)
stress-ng: info:  [756667] skipped: 0
stress-ng: info:  [756667] passed: 175: llc-affinity (175)
stress-ng: info:  [756667] failed: 0
stress-ng: info:  [756667] metrics untrustworthy: 0
stress-ng: info:  [756667] successful run completed in 32.12 secs
$ sudo stress-ng --cache 175 --cache-enable-all  -M -t 32
stress-ng: info:  [880556] setting to a 32 secs run per stressor
stress-ng: info:  [880556] dispatching hogs: 175 cache
stress-ng: info:  [880563] cache: cldemote is not available, ignoring this option
stress-ng: info:  [880563] cache: cache flags used: prefetch clflush fence sfence clflushopt clwb
stress-ng: metrc: [880556] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [880556]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [880556] cache           8625576     32.03   2420.20      0.73    269304.05        3562.93        43.19         16896
stress-ng: metrc: [880556] miscellaneous metrics:
stress-ng: metrc: [880556] cache            1301351.51 cache ops per second (harmonic mean of 175 instances)
stress-ng: metrc: [880556] cache         1351852226.88 shared cache reads per second (harmonic mean of 175 instances)
stress-ng: metrc: [880556] cache         1390292086.63 shared cache writes per second (harmonic mean of 175 instances)
stress-ng: info:  [880556] skipped: 0
stress-ng: info:  [880556] passed: 175: cache (175)
stress-ng: info:  [880556] failed: 0
stress-ng: info:  [880556] metrics untrustworthy: 0
stress-ng: info:  [880556] successful run completed in 32.06 secs

When dispatching consume from DSQs in the local LLC first before trying
remote DSQs. This should still be fair as the layer iteration order will
be maintained.

Signed-off-by: Daniel Hodges <[email protected]>
@hodgesds
Contributor Author

hodgesds commented Nov 12, 2024

More tests on a partially saturated machine (80 instances, 176 CPUs).

TL;DR (best of three runs each):
- cache ops/sec (max): 55496899.98 (main) vs 109916623.28 (PR), a 1.98x improvement
- cache reads/sec (max): 477780191.75 (main) vs 356949190.61 (PR), a 0.75x reduction
- cache writes/sec (max): 24614546.65 (main) vs 43032553.41 (PR), a 1.75x improvement

main branch:

$ sudo stress-ng --cache-no-affinity --cache-level 3 --cache 80 -t 35 -M 
stress-ng: info:  [535188] setting to a 35 secs run per stressor
stress-ng: info:  [535188] dispatching hogs: 80 cache
stress-ng: info:  [535189] cache: cache flags used: none
stress-ng: info:  [535189] cache: use --cache-enable-all to enable all cache flags for heavier cache stressing
stress-ng: metrc: [535188] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [535188]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [535188] cache          90342995     35.01   1999.98      0.36   2580706.03       45164.00        71.43         16896
stress-ng: metrc: [535188] miscellaneous metrics:
stress-ng: metrc: [535188] cache           55496899.98 cache ops per second (harmonic mean of 80 instances)
stress-ng: metrc: [535188] cache          477780191.75 shared cache reads per second (harmonic mean of 80 instances)
stress-ng: metrc: [535188] cache           24388481.50 shared cache writes per second (harmonic mean of 80 instances)
stress-ng: info:  [535188] skipped: 0
stress-ng: info:  [535188] passed: 80: cache (80)
stress-ng: info:  [535188] failed: 0
stress-ng: info:  [535188] metrics untrustworthy: 0
stress-ng: info:  [535188] successful run completed in 37.31 secs
$ sudo stress-ng --cache-no-affinity --cache-level 3 --cache 80 -t 35 -M 
stress-ng: info:  [537384] setting to a 35 secs run per stressor
stress-ng: info:  [537384] dispatching hogs: 80 cache
stress-ng: info:  [537408] cache: cache flags used: none
stress-ng: info:  [537408] cache: use --cache-enable-all to enable all cache flags for heavier cache stressing
stress-ng: metrc: [537384] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [537384]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [537384] cache          82070345     35.01   1717.47      0.27   2343979.90       47778.05        61.32         16896
stress-ng: metrc: [537384] miscellaneous metrics:
stress-ng: metrc: [537384] cache           43714815.35 cache ops per second (harmonic mean of 80 instances)
stress-ng: metrc: [537384] cache          423490707.01 shared cache reads per second (harmonic mean of 80 instances)
stress-ng: metrc: [537384] cache           23517242.81 shared cache writes per second (harmonic mean of 80 instances)
stress-ng: info:  [537384] skipped: 0
stress-ng: info:  [537384] passed: 80: cache (80)
stress-ng: info:  [537384] failed: 0
stress-ng: info:  [537384] metrics untrustworthy: 0
stress-ng: info:  [537384] successful run completed in 35.78 secs
$ sudo stress-ng --cache-no-affinity --cache-level 3 --cache 80 -t 35 -M 
stress-ng: info:  [541929] setting to a 35 secs run per stressor
stress-ng: info:  [541929] dispatching hogs: 80 cache
stress-ng: info:  [541930] cache: cache flags used: none
stress-ng: info:  [541930] cache: use --cache-enable-all to enable all cache flags for heavier cache stressing
stress-ng: metrc: [541929] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [541929]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [541929] cache          85706319     35.01   1950.65      0.33   2448015.37       43929.85        69.66         16896
stress-ng: metrc: [541929] miscellaneous metrics:
stress-ng: metrc: [541929] cache           52360653.33 cache ops per second (harmonic mean of 80 instances)
stress-ng: metrc: [541929] cache          448703556.35 shared cache reads per second (harmonic mean of 80 instances)
stress-ng: metrc: [541929] cache           24614546.65 shared cache writes per second (harmonic mean of 80 instances)
stress-ng: info:  [541929] skipped: 0
stress-ng: info:  [541929] passed: 80: cache (80)
stress-ng: info:  [541929] failed: 0
stress-ng: info:  [541929] metrics untrustworthy: 0
stress-ng: info:  [541929] successful run completed in 35.37 secs

PR:

$ sudo stress-ng --cache-no-affinity --cache-level 3 --cache 80 -t 35 -M 
stress-ng: info:  [553389] setting to a 35 secs run per stressor
stress-ng: info:  [553389] dispatching hogs: 80 cache
stress-ng: info:  [553390] cache: cache flags used: none
stress-ng: info:  [553390] cache: use --cache-enable-all to enable all cache flags for heavier cache stressing
stress-ng: metrc: [553389] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [553389]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [553389] cache         131989009     35.00   2793.68      0.68   3771112.66       47234.16        99.80         16896
stress-ng: metrc: [553389] miscellaneous metrics:
stress-ng: metrc: [553389] cache          108346730.27 cache ops per second (harmonic mean of 80 instances)
stress-ng: metrc: [553389] cache          356949190.61 shared cache reads per second (harmonic mean of 80 instances)
stress-ng: metrc: [553389] cache           40803440.87 shared cache writes per second (harmonic mean of 80 instances)
stress-ng: info:  [553389] skipped: 0
stress-ng: info:  [553389] passed: 80: cache (80)
stress-ng: info:  [553389] failed: 0
stress-ng: info:  [553389] metrics untrustworthy: 0
stress-ng: info:  [553389] successful run completed in 35.02 secs
$ sudo stress-ng --cache-no-affinity --cache-level 3 --cache 80 -t 35 -M 
stress-ng: info:  [559376] setting to a 35 secs run per stressor
stress-ng: info:  [559376] dispatching hogs: 80 cache
stress-ng: info:  [559380] cache: cache flags used: none
stress-ng: info:  [559380] cache: use --cache-enable-all to enable all cache flags for heavier cache stressing
stress-ng: metrc: [559376] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [559376]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [559376] cache         136236026     35.00   2794.04      0.63   3892456.97       48748.61        99.81         16896
stress-ng: metrc: [559376] miscellaneous metrics:
stress-ng: metrc: [559376] cache          109916623.28 cache ops per second (harmonic mean of 80 instances)
stress-ng: metrc: [559376] cache          352940483.83 shared cache reads per second (harmonic mean of 80 instances)
stress-ng: metrc: [559376] cache           43032553.41 shared cache writes per second (harmonic mean of 80 instances)
stress-ng: info:  [559376] skipped: 0
stress-ng: info:  [559376] passed: 80: cache (80)
stress-ng: info:  [559376] failed: 0
stress-ng: info:  [559376] metrics untrustworthy: 0
stress-ng: info:  [559376] successful run completed in 35.02 secs
$ sudo stress-ng --cache-no-affinity --cache-level 3 --cache 80 -t 35 -M 
stress-ng: info:  [566322] setting to a 35 secs run per stressor
stress-ng: info:  [566322] dispatching hogs: 80 cache
stress-ng: info:  [566323] cache: cache flags used: none
stress-ng: info:  [566323] cache: use --cache-enable-all to enable all cache flags for heavier cache stressing
stress-ng: metrc: [566322] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [566322]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [566322] cache         133128101     35.00   2792.18      0.70   3803657.29       47666.93        99.75         16896
stress-ng: metrc: [566322] miscellaneous metrics:
stress-ng: metrc: [566322] cache          107637140.48 cache ops per second (harmonic mean of 80 instances)
stress-ng: metrc: [566322] cache          342616850.44 shared cache reads per second (harmonic mean of 80 instances)
stress-ng: metrc: [566322] cache           41565459.58 shared cache writes per second (harmonic mean of 80 instances)
stress-ng: info:  [566322] skipped: 0
stress-ng: info:  [566322] passed: 80: cache (80)
stress-ng: info:  [566322] failed: 0
stress-ng: info:  [566322] metrics untrustworthy: 0
stress-ng: info:  [566322] successful run completed in 35.01 secs

These tests were run with the following config:

[
  {
    "name": "hodgesd",
    "comment": "hodgesd user",
    "matches": [
      [
        {
          "UIDEquals": 224791
        }
      ]
    ],
    "kind": {
      "Confined": {
        "slice_us": 800,
        "util_range": [
          0.25,
          0.6
        ],
        "growth_algo": "Sticky",
        "preempt": false,
        "preempt_first": false,
        "exclusive": false
      }
    }
  },
  {
    "name": "stress-ng",
    "comment": "stress-ng slice",
    "matches": [
      [
        {
          "CommPrefix": "stress-ng"
        }
      ],
      [
        {
          "PcommPrefix": "stress-ng"
        }
      ]
    ],
    "kind": {
      "Confined": {
        "util_range": [
          0.60,
          0.80
        ],
        "slice_us": 1000,
        "preempt": false,
        "idle_smt": true,
        "preempt_first": true,
        "growth_algo": "Topo",
        "exclusive": false
      }
    }
  },
  {
    "name": "normal",
    "comment": "the rest",
    "matches": [
      []
    ],
    "kind": {
      "Grouped": {
        "util_range": [
          0.25,
          0.6
        ],
        "preempt": false,
        "weight": 500,
        "preempt_first": false,
        "slice_us": 800,
        "exclusive": false,
        "growth_algo": "Sticky"
      }
    }
  }
]

Contributor

@JakeHillion JakeHillion left a comment


As discussed offline, I'm not sure these test results are consistent, and I had some issues running them for longer on main. However, the change makes sense!

Contributor

@JakeHillion JakeHillion left a comment


NACK to the flag without good reason. Normally I want flags for most changes, but in this case the change makes logical sense and doesn't significantly change our dispatching. The current implementation, which duplicates the entire loops, is not good for code readability/maintenance at all.

If the flag is necessary (we can't settle on a single behaviour), can we implement it by doing divmod in the loop instead? This gives the same optionality without duplicating the bodies.

@hodgesds
Contributor Author

> …for code readability/maintenance at all.

> If the flag is necessary (we can't settle on a single behaviour), can we implement it by doing divmod in the loop instead? This gives the same optionality without duplicating the bodies.

I mostly want the flag for doing more testing on actual workloads. It's bothersome to have to rebuild different versions of the scheduler to test the performance differences, and synthetic benchmarks only go so far.

@JakeHillion
Contributor

> If the flag is necessary (we can't settle on a single behaviour), can we implement it by doing divmod in the loop instead? This gives the same optionality without duplicating the bodies.

> I mostly want the flag for doing more testing on actual workloads. It's bothersome to have to rebuild different versions of the scheduler to test the differences in performance and synthetic benchmarks only go so far in testing.

Alternatively, could you pull out the body from both loops into a static inline function and then have the if around the two loops? It'll mean doing the if checks multiple times unnecessarily, but should make it way easier to maintain the code (and we were doing those unnecessary checks until this week anyway).

@likewhatevs
Contributor

> I mostly want the flag for doing more testing on actual workloads. It's bothersome to have to rebuild different versions of the scheduler to test the differences in performance and synthetic benchmarks only go so far in testing.

Agreed, I'm a fan of more flags. It's that or more bisecting/patching to debug issues.

@hodgesds
Contributor Author

Some results from doing 75 rounds of tests:

[two result charts attached]

@hodgesds hodgesds added this pull request to the merge queue Nov 15, 2024
Merged via the queue into sched-ext:main with commit 79125ef Nov 15, 2024
23 checks passed
@hodgesds hodgesds deleted the layered-dispatch-local branch November 15, 2024 14:37