perf(memory): use thread-local sequence-based memory eviction policy #16087

Merged
merged 31 commits into from
May 27, 2024

Conversation

MrCroxx
Contributor

@MrCroxx MrCroxx commented Apr 2, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Motivation

Resolves #15305

The previous memory eviction strategy was epoch-based, which could lead to excessively aggressive eviction in the following scenarios:

  1. Uneven data volume between epochs caused by sudden increase or decrease in traffic.
  2. Large data volume in recent epochs due to frequent access to certain data.
  3. Large epoch interval.

This PR introduces a sequence-based memory eviction mechanism. Cache eviction is no longer driven by epochs but by a sequence allocated on each cache insert or access, which gives finer granularity.

The sequence needs to be shared globally. Although a single atomic variable is enough to hold it, the cache invalidation caused by frequent fetch_add calls has a non-negligible overhead. Therefore, this PR introduces a thread-local sequence that tolerates global reordering within a bounded range.

When an entry is inserted into or accessed from the managed LRU cache, the cache acquires a sequence from the Sequencer. The Sequencer uses a thread-local variable to grant the sequence. The thread-local variable is synchronized with the global sequence if (a) the pre-allocated local sequences (step) are exhausted, or (b) the local sequence lags behind the global sequence by more than the threshold (lag). When evicting, the memory controller calculates the ratio of memory to evict and normalizes it into a watermark sequence against the global sequence. The out-of-order threshold is max(lag, step).
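
The synchronization rule above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `LocalSequencer` type, its field names, and passing the global counter by reference are all assumptions made to keep the example self-contained. In practice one instance would presumably live in a `thread_local!` and the global counter would be a shared static.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-thread sequencer (names are illustrative only).
pub struct LocalSequencer {
    next: u64, // next sequence to hand out
    end: u64,  // exclusive end of the pre-allocated local range
    step: u64, // how many sequences to reserve per refresh
    lag: u64,  // how far behind the global counter we may fall
}

impl LocalSequencer {
    pub const fn new(step: u64, lag: u64) -> Self {
        Self { next: 0, end: 0, step, lag }
    }

    /// Hands out one sequence, touching the shared atomic only when needed.
    pub fn alloc(&mut self, global: &AtomicU64) -> u64 {
        // (a) the locally reserved range is used up
        let exhausted = self.next == self.end;
        // (b) other threads have pushed the global counter too far ahead
        let lagging = self.next + self.lag < global.load(Ordering::Relaxed);
        if exhausted || lagging {
            // Reserve a fresh range of `step` sequences in one atomic operation.
            self.next = global.fetch_add(self.step, Ordering::Relaxed);
            self.end = self.next + self.step;
        }
        let seq = self.next;
        self.next += 1;
        seq
    }
}
```

In this sketch `lag` is compared directly against sequence numbers, so a `lag` smaller than `step` would force a refresh on almost every call. With `step = 4` and `lag = 8`, for example, a thread that has handed out 0, 1, 2 keeps using its local range until the global counter passes 11, and only then jumps ahead to a fresh range.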

CN node memory (before vs after):

[image: CN node memory, before vs. after]

Changes

  • Remove the forked lru dependency. Use the customized implementation in risingwave_common::lru.
  • Add thread-local sequencer implementation in risingwave_common::sequencer.
  • Add factor configuration for each eviction policy to control the eviction speed.

Configurations

    #[serde(default = "default::developer::memory_controller_eviction_factor_aggressive")]
    pub memory_controller_eviction_factor_aggressive: f64,

    #[serde(default = "default::developer::memory_controller_eviction_factor_graceful")]
    pub memory_controller_eviction_factor_graceful: f64,

    #[serde(default = "default::developer::memory_controller_eviction_factor_stable")]
    pub memory_controller_eviction_factor_stable: f64,

    #[serde(default = "default::developer::memory_controller_sequence_tls_step")]
    pub memory_controller_sequence_tls_step: u64,

    #[serde(default = "default::developer::memory_controller_sequence_tls_lag")]
    pub memory_controller_sequence_tls_lag: u64,

        pub fn memory_controller_threshold_aggressive() -> f64 {
            0.9
        }

        pub fn memory_controller_threshold_graceful() -> f64 {
            0.8
        }

        pub fn memory_controller_threshold_stable() -> f64 {
            0.7
        }

        pub fn memory_controller_eviction_factor_aggressive() -> f64 {
            2.0
        }

        pub fn memory_controller_eviction_factor_graceful() -> f64 {
            1.5
        }

        pub fn memory_controller_eviction_factor_stable() -> f64 {
            1.0
        }

        pub fn memory_controller_sequence_tls_step() -> u64 {
            128
        }

        pub fn memory_controller_sequence_tls_lag() -> u64 {
            32
        }
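
To make the interaction between the thresholds, the factors, and the watermark concrete, here is a rough sketch. Only the default values come from the configuration above. The `Policy` enum, the `EvictionConfig` struct, and in particular the formula in `next_watermark` (excess over the stable threshold, scaled by the factor, then mapped onto the sequence range) are assumptions for illustration and may differ from what the memory controller in this PR actually computes.

```rust
/// Hypothetical eviction policies, ordered by memory pressure.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Policy {
    None,
    Stable,
    Graceful,
    Aggressive,
}

/// Thresholds and factors; the defaults mirror the values listed above.
pub struct EvictionConfig {
    pub threshold_stable: f64,
    pub threshold_graceful: f64,
    pub threshold_aggressive: f64,
    pub factor_stable: f64,
    pub factor_graceful: f64,
    pub factor_aggressive: f64,
}

impl Default for EvictionConfig {
    fn default() -> Self {
        Self {
            threshold_stable: 0.7,
            threshold_graceful: 0.8,
            threshold_aggressive: 0.9,
            factor_stable: 1.0,
            factor_graceful: 1.5,
            factor_aggressive: 2.0,
        }
    }
}

impl EvictionConfig {
    /// Picks a policy from the fraction of memory in use (0.0..=1.0).
    pub fn policy(&self, usage: f64) -> Policy {
        if usage >= self.threshold_aggressive {
            Policy::Aggressive
        } else if usage >= self.threshold_graceful {
            Policy::Graceful
        } else if usage >= self.threshold_stable {
            Policy::Stable
        } else {
            Policy::None
        }
    }

    fn factor(&self, policy: Policy) -> f64 {
        match policy {
            Policy::None => 0.0,
            Policy::Stable => self.factor_stable,
            Policy::Graceful => self.factor_graceful,
            Policy::Aggressive => self.factor_aggressive,
        }
    }

    /// ASSUMED formula: move the watermark forward by a fraction of the
    /// sequences issued since the last watermark.
    pub fn next_watermark(&self, used: u64, total: u64, last_watermark: u64, global_seq: u64) -> u64 {
        if used == 0 || total == 0 {
            return last_watermark;
        }
        let usage = used as f64 / total as f64;
        let factor = self.factor(self.policy(usage));
        // Memory above the stable threshold, as a share of what is in use.
        let excess = (used as f64 - total as f64 * self.threshold_stable).max(0.0);
        let ratio = (excess / used as f64 * factor).clamp(0.0, 1.0);
        let range = global_seq.saturating_sub(last_watermark) as f64;
        last_watermark + (range * ratio).round() as u64
    }
}
```

Under this assumed formula and the default values, a node at 100% memory usage would advance the watermark 60% of the way from the previous watermark to the current global sequence, while a node below the stable threshold would leave it unchanged.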

Micro Benchmarks for Components

Sequencer microbench:

primitive            1 threads 10000000 loops: 0ns per iter
atomic               1 threads 10000000 loops: 1ns per iter
atomic skip 8        1 threads 10000000 loops: 1ns per iter
atomic skip 16       1 threads 10000000 loops: 1ns per iter
atomic skip 32       1 threads 10000000 loops: 1ns per iter
atomic skip 64       1 threads 10000000 loops: 1ns per iter
sequencer(64,8)      1 threads 10000000 loops: 2ns per iter
sequencer(64,16)     1 threads 10000000 loops: 1ns per iter
sequencer(64,32)     1 threads 10000000 loops: 1ns per iter
sequencer(128,8)     1 threads 10000000 loops: 1ns per iter
sequencer(128,16)    1 threads 10000000 loops: 1ns per iter
sequencer(128,32)    1 threads 10000000 loops: 1ns per iter
coarse               1 threads 10000000 loops: 3ns per iter

primitive            4 threads 10000000 loops: 0ns per iter
atomic               4 threads 10000000 loops: 20ns per iter
atomic skip 8        4 threads 10000000 loops: 5ns per iter
atomic skip 16       4 threads 10000000 loops: 5ns per iter
atomic skip 32       4 threads 10000000 loops: 4ns per iter
atomic skip 64       4 threads 10000000 loops: 4ns per iter
sequencer(64,8)      4 threads 10000000 loops: 4ns per iter
sequencer(64,16)     4 threads 10000000 loops: 5ns per iter
sequencer(64,32)     4 threads 10000000 loops: 5ns per iter
sequencer(128,8)     4 threads 10000000 loops: 3ns per iter
sequencer(128,16)    4 threads 10000000 loops: 3ns per iter
sequencer(128,32)    4 threads 10000000 loops: 3ns per iter
coarse               4 threads 10000000 loops: 10ns per iter

primitive            8 threads 10000000 loops: 0ns per iter
atomic               8 threads 10000000 loops: 43ns per iter
atomic skip 8        8 threads 10000000 loops: 18ns per iter
atomic skip 16       8 threads 10000000 loops: 12ns per iter
atomic skip 32       8 threads 10000000 loops: 9ns per iter
atomic skip 64       8 threads 10000000 loops: 6ns per iter
sequencer(64,8)      8 threads 10000000 loops: 8ns per iter
sequencer(64,16)     8 threads 10000000 loops: 7ns per iter
sequencer(64,32)     8 threads 10000000 loops: 7ns per iter
sequencer(128,8)     8 threads 10000000 loops: 5ns per iter
sequencer(128,16)    8 threads 10000000 loops: 4ns per iter
sequencer(128,32)    8 threads 10000000 loops: 5ns per iter
coarse               8 threads 10000000 loops: 16ns per iter

primitive            16 threads 10000000 loops: 0ns per iter
atomic               16 threads 10000000 loops: 125ns per iter
atomic skip 8        16 threads 10000000 loops: 35ns per iter
atomic skip 16       16 threads 10000000 loops: 24ns per iter
atomic skip 32       16 threads 10000000 loops: 18ns per iter
atomic skip 64       16 threads 10000000 loops: 12ns per iter
sequencer(64,8)      16 threads 10000000 loops: 23ns per iter
sequencer(64,16)     16 threads 10000000 loops: 15ns per iter
sequencer(64,32)     16 threads 10000000 loops: 15ns per iter
sequencer(128,8)     16 threads 10000000 loops: 16ns per iter
sequencer(128,16)    16 threads 10000000 loops: 10ns per iter
sequencer(128,32)    16 threads 10000000 loops: 9ns per iter
coarse               16 threads 10000000 loops: 41ns per iter

primitive            32 threads 10000000 loops: 0ns per iter
atomic               32 threads 10000000 loops: 384ns per iter
atomic skip 8        32 threads 10000000 loops: 72ns per iter
atomic skip 16       32 threads 10000000 loops: 51ns per iter
atomic skip 32       32 threads 10000000 loops: 34ns per iter
atomic skip 64       32 threads 10000000 loops: 21ns per iter
sequencer(64,8)      32 threads 10000000 loops: 138ns per iter
sequencer(64,16)     32 threads 10000000 loops: 64ns per iter
sequencer(64,32)     32 threads 10000000 loops: 28ns per iter
sequencer(128,8)     32 threads 10000000 loops: 137ns per iter
sequencer(128,16)    32 threads 10000000 loops: 63ns per iter
sequencer(128,32)    32 threads 10000000 loops: 16ns per iter
coarse               32 threads 10000000 loops: 184ns per iter

lru microbench:

lru - 1024           1 threads 1000000 loops: 35ns per iter, total evicted: 999424
rw  - 1024           1 threads 1000000 loops: 26ns per iter, total evicted: 999424

lru - 1024           4 threads 1000000 loops: 35ns per iter, total evicted: 3997696
rw  - 1024           4 threads 1000000 loops: 27ns per iter, total evicted: 3997696

lru - 1024           8 threads 1000000 loops: 44ns per iter, total evicted: 7995392
rw  - 1024           8 threads 1000000 loops: 34ns per iter, total evicted: 7995392

lru - 1024           16 threads 1000000 loops: 46ns per iter, total evicted: 15990784
rw  - 1024           16 threads 1000000 loops: 51ns per iter, total evicted: 15990784

lru - 1024           32 threads 1000000 loops: 56ns per iter, total evicted: 31981568
rw  - 1024           32 threads 1000000 loops: 81ns per iter, total evicted: 31981568

lru - 1024           64 threads 1000000 loops: 90ns per iter, total evicted: 63963136
rw  - 1024           64 threads 1000000 loops: 149ns per iter, total evicted: 63963136

Benchmark

benchmark (nexmark, vs nightly-20240511):

http://metabase.risingwave-cloud.xyz/question/2219-nexmark-rw-compare?risingwave_tag_1=nightly-20240512&rw_label_1=daily&risingwave_metrics=avg-source-output-rows-per-second&risingwave_tag_2=git-8bc7ee189094e72c65db0725f05263ca3ec08be3&rw_label_2=benchmark-xx-tls

benchmark (nexmark, vs main without this PR):

http://metabase.risingwave-cloud.xyz/question/2219-nexmark-rw-compare?risingwave_tag_1=git-91b7ee29ce4d846f9c2ee6d9f56264bab414250a&rw_label_1=benchmark-xx-main&risingwave_metrics=avg-source-output-rows-per-second&risingwave_tag_2=git-8bc7ee189094e72c65db0725f05263ca3ec08be3&rw_label_2=benchmark-xx-tls

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@MrCroxx MrCroxx self-assigned this Apr 2, 2024
Contributor

@github-actions github-actions bot left a comment

license-eye has totally checked 4973 files.

Valid Invalid Ignored Fixed
2138 1 2834 0
Click to see the invalid file list
  • src/common/benches/bench_sequencer.rs

src/common/benches/bench_sequencer.rs (resolved)
@MrCroxx MrCroxx marked this pull request as ready for review April 7, 2024 06:41
@MrCroxx MrCroxx requested a review from a team as a code owner April 7, 2024 06:41
@TennyZhuang TennyZhuang changed the title perf(memory): use thread-local squence-based memory eviction policy perf(memory): use thread-local sequence-based memory eviction policy Apr 7, 2024
@BugenZhao
Member

Hi, would you mind sharing more information on the motivation and methodology in the PR description?

@MrCroxx MrCroxx enabled auto-merge May 13, 2024 02:43
@MrCroxx MrCroxx disabled auto-merge May 13, 2024 02:59
@MrCroxx MrCroxx requested a review from hzxa21 May 13, 2024 02:59
Contributor

@st1page st1page left a comment

LGTM

Collaborator

@hzxa21 hzxa21 left a comment

LGTM

}

pub fn put(&mut self, key: K, mut value: V) -> Option<V> {
unsafe {
Member

@fuyufjh fuyufjh May 13, 2024

So many unsafe blocks in this file 🥵 Please add some comments explaining why unsafe is necessary, e.g. why LinkedList can't satisfy this use case.

Member

@fuyufjh fuyufjh May 13, 2024

Also, I tend to wrap each linked-list operation in its own unsafe { ... } instead of simply wrapping the whole function body, which makes it hard to reason about safety.

Contributor Author

The single-threaded LRU-with-sequence implementation is largely ported from our original modified lru repo, foyer, and our original block cache implementation (itself ported from RocksDB).

An LRU cache is a classic multi-indexer problem, which cannot be solved easily and cheaply in safe Rust. It requires mutability with shared pointers and O(1) node lookup by address or reference (which cannot be achieved with the std linked list).
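
To illustrate the multi-index shape described above, here is a small safe-Rust sketch that keeps one index by key and one by sequence. It is not the implementation in this PR: `SeqLru` and its methods are hypothetical, it assumes every sequence number is unique, and it pays O(log n) per operation through a `BTreeMap` where a pointer-based list gets O(1). Evicting everything below a watermark sequence is likewise an assumption about how the watermark is consumed.

```rust
use std::collections::{BTreeMap, HashMap};
use std::hash::Hash;

/// Hypothetical safe stand-in for a sequence-ordered LRU.
pub struct SeqLru<K, V> {
    by_key: HashMap<K, (u64, V)>, // key -> (sequence of last access, value)
    by_seq: BTreeMap<u64, K>,     // sequence -> key, oldest first
}

impl<K: Hash + Eq + Clone, V> SeqLru<K, V> {
    pub fn new() -> Self {
        Self { by_key: HashMap::new(), by_seq: BTreeMap::new() }
    }

    /// Inserts or replaces `key`, stamping it with `seq`; returns the old value.
    pub fn put(&mut self, key: K, value: V, seq: u64) -> Option<V> {
        let old = self.by_key.insert(key.clone(), (seq, value));
        if let Some((old_seq, _)) = &old {
            // Both indexes must be kept in step on every mutation.
            self.by_seq.remove(old_seq);
        }
        self.by_seq.insert(seq, key);
        old.map(|(_, v)| v)
    }

    /// Looks up `key` and re-stamps it with `seq` (it becomes most recent).
    pub fn get(&mut self, key: &K, seq: u64) -> Option<&V> {
        let entry = self.by_key.get_mut(key)?;
        self.by_seq.remove(&entry.0);
        entry.0 = seq;
        self.by_seq.insert(seq, key.clone());
        Some(&entry.1)
    }

    /// Evicts every entry whose sequence is below `watermark`;
    /// returns how many entries were removed.
    pub fn evict_below(&mut self, watermark: u64) -> usize {
        // `split_off` keeps sequences < watermark in place and returns the rest.
        let keep = self.by_seq.split_off(&watermark);
        let evicted = std::mem::replace(&mut self.by_seq, keep);
        for key in evicted.values() {
            self.by_key.remove(key);
        }
        evicted.len()
    }

    pub fn len(&self) -> usize {
        self.by_key.len()
    }
}
```

Every operation has to update both maps and clone the key, which is the bookkeeping cost that a design with shared node pointers avoids.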

Member

It requires mutability with shared pointers

Yeah, got this. But is it possible to reduce the size of unsafe block? As mentioned in the 2nd comment.

Member

@fuyufjh fuyufjh left a comment

Overall the idea LGTM

src/common/src/sequence.rs (resolved)
src/compute/src/memory/controller.rs (resolved)
Member

I heard that some stateless queries in NexMark were negatively affected by this PR for some "unknown" cause. Have we found the reason now?

Contributor Author

Not yet. But the regression hasn't appeared in recent weeks.

@lmatz
Contributor

lmatz commented May 13, 2024

Is it necessary to also run the 4X (32C 64G) nexmark once? https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3664#018f689a-0b1d-4fc3-9b82-912358895ccc
Considering that the 32-thread case seems to incur more overhead

@MrCroxx MrCroxx requested a review from fuyufjh May 23, 2024 08:05
@MrCroxx
Contributor Author

MrCroxx commented May 23, 2024

Is it necessary to also run the 4X (32C 64G) nexmark once? buildkite.com/risingwave-test/nexmark-benchmark/builds/3664#018f689a-0b1d-4fc3-9b82-912358895ccc Considering that the 32-thread case seems to incur more overhead

What's the hardware configuration of the longevity test? I've run the longevity test and there is no regression.

Signed-off-by: MrCroxx <[email protected]>
@lmatz
Contributor

lmatz commented May 23, 2024

Each MV in longevity uses 3 as the parallelism.
The machine is 32C 64GB, and it hosts 3 CNs (32 CPUs each) that are free to compete with each other

Member

@fuyufjh fuyufjh left a comment

Overall LGTM

grafana/risingwave-dev-dashboard.dashboard.py (outdated, resolved)
grafana/risingwave-dev-dashboard.dashboard.py (outdated, resolved)
src/stream/src/cache/managed_lru.rs (outdated, resolved)
@MrCroxx
Contributor Author

MrCroxx commented May 27, 2024

#16087 (comment)

Discussed offline. Separating the unsafe blocks barely helps reduce the blast radius. Let's keep it as it is for now.

@MrCroxx MrCroxx added this pull request to the merge queue May 27, 2024
Merged via the queue into main with commit 240f0b9 May 27, 2024
27 of 28 checks passed
@MrCroxx MrCroxx deleted the xx/thread-local-sequence branch May 27, 2024 11:10
Successfully merging this pull request may close these issues.

too aggressive and early cache eviction
6 participants