
Add new execution code-path for unified scheduler #31239

Closed
wants to merge 56 commits into from

Conversation


@ryoqun ryoqun commented Apr 18, 2023

Problem

Both current tx execution mechanisms for block verification (ReplayStage/blockstore_processor) and block generation (BankingStage) aren't ideal for the upcoming unified scheduler. See #30746 for this terminology.

Summary of Changes

Refer to #31239 (comment) to untangle this complex object graph...

Introduce a third tx execution code path and plumb everything through; it is planned to be used for both block verification and block generation by the unified scheduler.

Primary design choices are (not everything is coded in this pr):

  • this new execution mechanism allows async (non-blocking) tx buffering for execution (this is very desirable for optimal scheduling across lock-conflicting ledger entries, etc; see the toy sketch below the summary)
  • a concurrent and isolated worker thread pool for each active fork.
  • extremely minimized overhead for non-batched, individual-tx-level task queuing. (to the extent that i went to great lengths just to remove a single Arc::upgrade() per task execution.)

Also provide a minimally-viable impl just to make the plumbing actually workable for demonstration purposes. That will be replaced with the actual full-fledged impl in coming prs.

This pr touches various parts but i tried very hard to make these code changes understandable. For details, see the doc comments (rare from me, lol!) in the pr changes.
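As a rough illustration of the async buffering design choice above: scheduling a tx boils down to a channel send into a per-fork worker, so the caller never blocks on execution. A minimal toy sketch (hypothetical names and types; the real interface passes SanitizedTransactions and is shown later in this review):

use crossbeam_channel::unbounded;
use std::thread;

// Hypothetical stand-in for the real task type (SanitizedTransaction + index).
struct Task {
    payload: String,
    index: usize,
}

fn main() {
    let (task_sender, task_receiver) = unbounded::<Task>();

    // Dedicated worker thread per scheduler; execution happens here, not in the caller.
    let worker = thread::spawn(move || {
        for task in task_receiver {
            println!("executing task #{}: {}", task.index, task.payload);
        }
    });

    // "schedule_execution" from the caller's point of view is just a channel send:
    // enqueue and return immediately, so the (single-threaded) replay loop can keep
    // buffering lock-conflicting entries without stalling on execution.
    for (index, payload) in ["tx a", "tx b", "tx c"].iter().enumerate() {
        task_sender
            .send(Task { payload: payload.to_string(), index })
            .unwrap();
    }

    drop(task_sender); // closing the channel lets the worker drain and exit
    worker.join().unwrap();
}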

todos

  • 1/3: this pr
  • 2/3: flesh out solana-scheduler with actual scheduling code (along with proper docs, explaining its algo.)
  • 3/3: flesh out solana-scheduler-pool with actual multi-threaded code.

@ryoqun ryoqun requested a review from apfitzge April 18, 2023 05:21

codecov bot commented Apr 18, 2023

Codecov Report

Merging #31239 (465205e) into master (e74bc4e) will increase coverage by 0.0%.
The diff coverage is 91.7%.

@@           Coverage Diff            @@
##           master   #31239    +/-   ##
========================================
  Coverage    81.5%    81.5%            
========================================
  Files         729      732     +3     
  Lines      206580   207349   +769     
========================================
+ Hits       168445   169141   +696     
- Misses      38135    38208    +73     

Member Author
naming must first be settled on...


};

// Send + Sync is needed to be a field of BankForks
#[cfg_attr(any(test, feature = "test-in-workspace"), automock)]
Member Author
here

Comment on lines 433 to 441
let mocked_scheduler = setup_mocked_scheduler_with_extra(
    [WaitReason::TerminatedFromBankDrop].into_iter(),
    Some(|mocked: &mut MockInstalledScheduler| {
        mocked
            .expect_schedule_execution()
            .times(1)
            .returning(|_, _| ());
    }),
);
Member Author
hope there's no anti-mock guys around.. lol

Comment on lines 54 to 67
// Calling this is illegal as soon as schedule_termination is called on &self.
fn schedule_execution(&self, sanitized_tx: &SanitizedTransaction, index: usize);

// This optionally signals scheduling termination request to the scheduler.
// This is subtle but important, to break circular dependency of Arc<Bank> => Scheduler =>
// SchedulingContext => Arc<Bank> in the middle of the tear-down process, otherwise it would
// prevent Bank::drop()'s last resort scheduling termination attempt indefinitely
fn schedule_termination(&mut self);

#[must_use]
fn wait_for_termination(&mut self, reason: &WaitReason) -> Option<ResultWithTimings>;

// suppress false clippy complaints arising from mockall-derive:
// warning: the following explicit lifetimes could be elided: 'a
Member Author
intentionally, these aren't doc-comments... xD not intending proper wordz

"process_batches()/schedule_batches_for_execution({} batches)",
batches.len()
);
schedule_batches_for_execution(bank, batches)
Member Author
execution diversion starts from here. (meaning all locking verification is equally done for both if branches)

Member Author
i tried this writing style for the first time. but i kinda like putting all code relevant to a topic (scheduling in this case) into a single file, opening a bunch of layered structs, thanks to rust's open classes, at the cost of pub(crate) impurification.

fn mode(&self) -> SchedulingMode;
}

// This file will be populated with actual implementation later.
Member Author
stay tuned!

Member Author

ryoqun commented Apr 18, 2023

tagging since this is a monorepo-wide big change: @carllin (i trust your eyes for cutting-through review), @bw-solana (this should be complementary to your concurrent replay), @ilya-bobyr (saw your recent work in blockstore_processor; tldr: txs could be executed differently)

happy to explain context/impl details if interested (i'll try very hard to put answers into code/comments for future posterity, lol).

Contributor

@apfitzge apfitzge left a comment

Just some very initial comments from first scan through.

At a high-level, I'm still trying to understand these cyclic dependencies but will need another pass at least.

Should probably loop more than just me in on the review requests as well 😄

ancestor_hashes_replay_update_sender,
purge_repair_slot_counter,
);
// If the bank was corrupted, abort now to prevent further normal processing
Contributor
What does a bank being corrupted mean? How does that happen?

Member Author
corrupt here refers to any BlockstoreProcessorError; this comment was copied from the existing one. anyway, i don't think corrupt adds any value here, so i removed it...: 16a1e9c

ledger-tool/src/main.rs (resolved)
@@ -126,23 +126,33 @@ pub struct LocalCluster {
}

impl LocalCluster {
pub fn new_with_equal_stakes(
pub fn config_with_equal_stakes(
Contributor
Move to ClusterConfig impl?

Member Author
done: eed7db0

Member Author
also created a pr for this: #32330

@@ -8282,6 +8291,8 @@ impl TotalAccountsStats {

impl Drop for Bank {
    fn drop(&mut self) {
        self.drop_scheduler();

Contributor
nit: prefer not to have the empty line

Member Author
done: 3ba582b

//!
//! Albeit being at this abstract interface level, it's generally assumed that each
//! `InstalledScheduler` is backed by multiple threads for performant transaction execution and
//! there're multiple independent schedulers inside a single instance of `InstalledSchedulerPool`.
Contributor
Is this always the case, or is it only necessary if --replay-slots-concurrently is enabled?

Member Author

@ryoqun ryoqun Apr 23, 2023

in general, the current situation around concurrency level is complicated.

direct to your question: firstly, it is always the case. as soon as a new fork is created by BankForks::insert(), a new PooledScheduler is spawned, which means a new group of threads independent from the other ongoing forks (sharing no synchronization primitives with them). This is true regardless of --replay-slots-concurrently. And multiple slots can be replayed concurrently because .schedule_execution() is assumed to be async/non-blocking, i.e. the single-threaded replay stage can still buffer txes into multiple unified schedulers.

So, when --replay-slots-concurrently is enabled, the unified scheduler can take advantage of it in that it can replay ledger entries fully concurrently, including the buffering part (note: in principle!). on the other hand, blockstore_processor uses a shared rayon thread group. While the thread group is populated with N (core count) threads, it's still possible for any blocking call there to starve the other fork. also, I've observed non-optimal rayon behavior when .install() is called from multiple threads against a shared thread group.

the caveat for the full concurrency of the unified scheduler is that it only holds while there's no is_complete() bank at a given moment among all of the ongoing forks. that's because process_replay_results still isn't concurrent across multiple forks even if --replay-slots-concurrently is enabled. so, as soon as all buffered schedule_execution()-ed transactions of one fork are executed, the other !is_complete() forks can be starved sub-optimally.

Anyway, these are all edge-case runtime behaviors... I'm planning to revisit this in the distant future... or @bw-solana might be planning to further enhance the replay stage's concurrency. xD

Member Author

@ryoqun ryoqun Apr 23, 2023

(while writing the above, I noticed I'd be better off pulling a small assert from the dirty topic branch: 9769288, pending ci for its correctness... lol)

Member Author

@ryoqun ryoqun Apr 24, 2023

oh, this concurrency bla bla is also affected by this pending feature activation: #31239 (comment)


#[must_use]
fn wait_for_scheduler(&self, reason: WaitReason) -> Option<ResultWithTimings> {
debug!(
Contributor

mix of inlined formatting and regular formatting

Member Author
you meant like this?:

$ git diff HEAD~
diff --git a/runtime/src/installed_scheduler_pool.rs b/runtime/src/installed_scheduler_pool.rs
index 83408d1c861..c56c651f238 100644
--- a/runtime/src/installed_scheduler_pool.rs
+++ b/runtime/src/installed_scheduler_pool.rs
@@ -242,8 +242,7 @@ impl Bank {
     #[must_use]
     fn wait_for_scheduler(&self, reason: WaitReason) -> Option<ResultWithTimings> {
         debug!(
-            "wait_for_scheduler(slot: {}, reason: {reason:?}): started...",
-            self.slot()
+            "wait_for_scheduler(slot: {self.slot()}, reason: {reason:?}): started...",
         );
 
         let mut scheduler_guard = self.scheduler.write().expect("not poisoned");

actually, this is syntax error:

error: invalid format string: expected `'}'`, found `'.'`
   --> runtime/src/installed_scheduler_pool.rs:245:44
    |
245 |             "wait_for_scheduler(slot: {self.slot()}, reason: {reason:?}): started...",
    |                                       -    ^ expected `}` in format string
    |                                       |
    |                                       because of this opening brace
    |
    = note: if you intended to print `{`, you can escape it using `{{`

error: could not compile `solana-runtime` due to previous error

Contributor
ha I was unclear, but actually meant the opposite direction

         debug!(
            "wait_for_scheduler(slot: {}, reason: {:?}): started...",
            self.slot(),
            reason,
         );

It's fine to leave it, I always just find it harder to read if some params are inlined and others are not, and either choose to go all inlined or all not.

Contributor
I was also under the impression that it is recommended to use either all inlined or all non-inlined arguments, but not a mix.
But I can not find any recommendations on this point right now.
One reason might be that there is a clippy rule that insists that all arguments should be inlined, when they all can be inlined.
But as soon as there is at least one non-inlinable argument, it stops complaining.
This rule has been downgraded to "pedantic" due to rust-analyzer not being robust enough just yet.

Member Author
done: 1dc0c2c

@@ -0,0 +1,22 @@
[package]
name = "solana-scheduler-pool"
description = "The solana scheduler pool"
Contributor
Suggested change
description = "The solana scheduler pool"
description = "The Solana scheduler pool"

@@ -0,0 +1,10 @@
[package]
name = "solana-scheduler"
Contributor
Not sure if we want to be so generic with the name if we've got other scheduler impls in the codebase

Member Author

@ryoqun ryoqun Apr 19, 2023
hehe, how about solana-unified-scheduler-logic/solana-unified-scheduler-pool then?

Contributor
I like those better 😄

scheduler-pool/src/lib.rs (resolved)
// this will be replaced with more proper implementation...
// not usable at all, especially for mainnet-beta
#[derive(Debug)]
struct Scheduler {
Contributor
Confused by Scheduler being here, rather than in the scheduler crate, even if this is temporary

Member Author

@ryoqun ryoqun Apr 19, 2023
good call. well, maybe this struct should be named something like PooledScheduler. as you pointed out, this is an unintended name conflict... and this isn't temporary. the struct is something which will be gluing/managing scheduling logic and threads together (in the future). solana-scheduler is meant to hold pure logic not depending on the mundane world of solana-ledger (and even solana-runtime!)... so, i wonder if i should also rename the crate to solana-unified-scheduler-logic or something else to avoid the ambiguity of calling everything a scheduler... xD

Contributor

@apfitzge apfitzge Apr 19, 2023

If you've got the separation in a branch, could you reference that to me? or just explain a bit more what you mean in terms of separating "pure logic" from threads. I suppose I'm lacking imagination, and am not yet sold on the idea that they need/should be separated.

If we do go down the route that they should indeed be separate, then I do like the name solana-unified-scheduler-logic.

Member Author

@ryoqun ryoqun Apr 20, 2023

If you've got the separation in a branch, could you reference that to me?

sure!: https://github.com/ryoqun/solana/tree/unified-scheduler-for-block-verification-wip-pre-strip

that branch is full of dirty code, though. however, it's runnable against mainnet-beta. (fully-performant code)

just explain a bit more what you mean in terms of separting "pure logic" from threads.

sure, again! some more reasoning:

I think crate-level separation provides good discipline over time for designing things, assuming creating crates is cheap in terms of human time. as you know, solana-core/solana-runtime/solana-ledger are too tangled up right now (ref: https://discord.com/channels/428295358100013066/439194979856809985/1097774577817505842). so, I'm afraid of adding another lump of code there, and to the brand-new solana-unified-scheduler-pool as well.

discipline here means a well-cut and well-defined interface between, and an explicit and discrete set of dependencies for, each respective crate.

in a way, this is a kind of crate-safety, as in type-safety.

also, that strict discipline makes it easy to bench the scheduling logic on its own. i'll put a bunch of benches in solana-unified-scheduler-logic, let alone unit tests for the scheduler logic.

lastly, i think the scheduler logic should be agnostic to the actual execution details, up to the information communicated across the interface but not more. in this way, evaluating, reasoning about, and reviewing the logic should be easy. that will eventually lead to a performant impl (imo).

as for the interface, we just need a bunch of crossbeam_channel::Channels to define such an api between the crates (with no extra runtime overhead compared to keeping the code in a single crate)

callee:

https://github.com/ryoqun/solana/blob/96013125e3c6f34d093447b5fe260fa08b34435a/scheduler/src/lib.rs#L2139-L2175

caller:

https://github.com/ryoqun/solana/blob/96013125e3c6f34d093447b5fe260fa08b34435a/scheduler-pool/src/lib.rs#L738-L755

another datapoint is that solana-unified-scheduler-logic's Cargo.toml is so small like this:

https://github.com/ryoqun/solana/blob/unified-scheduler-for-block-verification-wip-pre-strip/scheduler/Cargo.toml

and the other from solana-unified-scheduler-pool:

https://github.com/ryoqun/solana/blob/unified-scheduler-for-block-verification-wip-pre-strip/scheduler-pool/Cargo.toml

if this explanation makes sense, i'll transcribe it as source comments in this pr or a coming pr.

If we do go down the route that they should indeed be separate, then I do like the name solana-unified-scheduler-logic.

gocha
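To make the channel-based boundary above a bit more concrete, here's a minimal sketch with made-up message types (a toy, not the API in the linked branches; the real one carries tasks and lock info rather than strings):

use crossbeam_channel::unbounded;

type TaskId = usize;

// pool -> logic: new work and completion notifications
#[allow(dead_code)]
enum ToLogic {
    NewTask { id: TaskId, write_locked_accounts: Vec<String> },
    TaskFinished(TaskId),
}

// logic -> pool: a task whose lock conflicts are resolved and which may run now
enum ToPool {
    RunTask(TaskId),
}

fn main() {
    let (to_logic_tx, to_logic_rx) = unbounded::<ToLogic>();
    let (to_pool_tx, to_pool_rx) = unbounded::<ToPool>();

    // The pool side (threads, Bank, execute_batch) announces a task...
    to_logic_tx
        .send(ToLogic::NewTask { id: 0, write_locked_accounts: vec!["alice".into()] })
        .unwrap();

    // ...and the logic side, which knows nothing about solana-ledger or solana-runtime,
    // decides it is runnable and hands it back over the other channel.
    if let Ok(ToLogic::NewTask { id, .. }) = to_logic_rx.recv() {
        to_pool_tx.send(ToPool::RunTask(id)).unwrap();
    }
    assert!(matches!(to_pool_rx.recv().unwrap(), ToPool::RunTask(0)));
}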

Member Author

@ryoqun ryoqun Apr 20, 2023

(this separate-or-not discussion is currently blocking all the renames and crate template creation according to #31239 (comment)).

Contributor
I think after examining the dependency graph the crate separation makes sense. Appreciate the detailed comments for justification.

@apfitzge apfitzge self-requested a review April 18, 2023 18:54
Contributor

apfitzge commented Apr 18, 2023

Tried to map out some of my thoughts this morning about dependencies, which I always find benefit from some visualization:

graph TD
    Bank["Arc&lt;Bank&gt;"]

    subgraph solana-runtime
        BankForks;
        subgraph cyclic-ref
        Bank;
        SchedulingContext;
        InstalledScheduler{{InstalledScheduler}};
        end
        InstalledSchedulerPool{{InstalledSchedulerPool}};
    end

    subgraph solana-ledger
        ExecuteBatch(["execute_batch()"]);
    end


    subgraph solana-scheduler-pool
        SchedulerPool;
        SchedulerPoolScheduler[Scheduler];
    end

    subgraph solana-scheduler
        WithSchedulingMode{{WithSchedulingMode}};
    end

    SchedulingContext -- refs --> Bank;
    BankForks -- owns --> Bank;
    BankForks -- owns --> InstalledSchedulerPool;
    Bank -- refs --> InstalledScheduler;

    SchedulerPool -. impls .-> InstalledSchedulerPool;
    SchedulerPoolScheduler -. impls .-> InstalledScheduler;
    InstalledScheduler -- refs --> SchedulingContext;
    SchedulingContext -. impls .-> WithSchedulingMode;

    SchedulerPoolScheduler -- refs --> SchedulerPool;
    SchedulerPool -- owns --> SchedulerPoolScheduler;
    SchedulerPoolScheduler -. calls .-> ExecuteBatch;

    solana-scheduler-pool -- deps --> solana-scheduler;
    solana-scheduler-pool -- deps --> solana-ledger;
    solana-scheduler-pool -- deps --> solana-runtime;
    solana-ledger -- deps --> solana-runtime;
    solana-runtime -- deps --> solana-scheduler;

As mentioned in my review comment, I'm trying to wrap my head around the cyclical-ness of the traits & structs.

From this diagram (I may have missed something), it seems like the main reason we need to define these new crates for implementing the scheduler(pool) is the dependency on execute_batch and TransactionStatusSender (not shown in the diagram).

I'm wondering if moving execute_batch and TransactionStatusSender types into runtime would significantly simplify this? Do we need dyn then?

edit: @ryoqun thanks for the edits marking ref/own/dep on each connection!

@@ -2640,14 +2640,44 @@ impl ReplayStage {
.expect("Bank fork progress entry missing for completed bank");

let replay_stats = bank_progress.replay_stats.clone();
let r_replay_stats = replay_stats.read().unwrap();
let mut replay_stats = replay_stats.write().unwrap();
Contributor
Is this only mutable for the accumulate on line 2651? Maybe we can keep the write lock nested down below and then grab a read lock after

Member Author
yep. thanks for the nice suggestion! well, i overlooked thinking through proper locking granularity...: 3a7810c

Member Author

ryoqun commented Apr 19, 2023

Tried to map out some of my thoughts this morning about dependencies, which I always find benefit from some visualization:
...

super appreciate that diagram. yeah, you depicted it correctly. I further improved it.

I'm wondering if moving execute_batch and TransactionStatusSender types into runtime would significantly simplify this? Do we need dyn then?

Unfortunately, it won't be that easy to remove dyn that way. execute_batch is such a beast that it would pull in even the spl-token crates. so, i resorted to dyn. Also, solana-scheduler-pool will need to depend on PohRecorder in the future. That's another reason for this complex arrangement... ;)

Comment on lines 73 to 74
pub type SchedulerPoolArc = Arc<dyn InstalledSchedulerPool>;
pub(crate) type InstalledSchedulerPoolArc = Option<SchedulerPoolArc>;
Contributor
It seems a bit confusing that a SchedulerPoolArc is an Arc<InstalledSchedulerPool>, and at the same time InstalledSchedulerPoolArc is an Option<SchedulerPoolArc>.
I wonder if it could be because the Installed part in the InstalledScheduler name is somewhat "artificial".
It seems like it is used to introduce the Scheduler API and nothing else. Scheduler always implements InstalledScheduler:

impl InstalledScheduler for Scheduler

So there is no actual action that makes a certain scheduler "installed" as opposed to some other state, at the type level.

The InstalledSchedulerPool API deals with SchedulerBoxes, which are internally InstalledSchedulers. But it is a bit strange that this API does not rather operate on InstalledSchedulerBoxes.

pub trait InstalledSchedulerPool: Send + Sync + Debug {
    fn take_from_pool(&self, context: SchedulingContext) -> SchedulerBox;
    fn return_to_pool(&self, scheduler: SchedulerBox);
}

pub type SchedulerBox = Box<dyn InstalledScheduler>;
pub(crate) struct InstalledSchedulerBox(Option<SchedulerBox>);

Maybe adding a new name to use for the Scheduler API could remove some of this back and forth?
For example call the API TransactionExecutor or TransactionScheduler.
And then the implementation can be called Scheduler.

Or call the API Scheduler, and give a different name to the implementation.
SchedulerImpl, or SimpleScheduler, or PooledScheduler as you suggested.

Member Author
really appreciate this in-depth feedback from someone with the least context. i think it makes the code more readable/accessible in the end.

yeah, the Installed prefix was chosen artificially just for the sake of trait/impl naming. Other options were: Injected/Dyn/Like/Pluggable. However, I settled on Installed after long thought (didn't bother to create a wall of text... lol).

Also, I'm reluctant to use different nouns like Executor (that's taken by ExecutorCache) and Transaction (the scheduler will work on something called Tasks; generic to both plain Transactions and/or Bundles (the common word for grouped txs in mev jargon)).

anyway, as you pointed out, there was still inconsistency in the usage of Installed.. how about this?: a8d66fa

Contributor
I might be missing some context.
It still feels strange to have the word Installed here, even when the same object is used while both being installed into a Bank and not.
In other words, an InstalledSchedulerPoolArc, on the first glance, would then need to refer to a pool of schedulers that are actually "installed". But, IIUC, they are not. It is holding schedulers that are not installed into any bank, and are waiting to be installed.

When I was trying to build a mental model of the abstractions interacting here, also reading your comments on the Bank/Scheduler circular references affecting their lifetimes, I wondered if choosing an ownership model could simplify both references and names.

Consider this:

  1. Rename the InstalledScheduler trait to be just Scheduler.
  2. Remove reference to the Bank from the Scheduler.
  3. Only Scheduler::execute requires a Bank, so just pass it in as an argument.
  4. When Bank will be using the scheduler it currently has installed, it will just pass self as the first argument to execute().

I wonder if it would also remove the need for tracking the shutdown process, simplifying the Scheduler trait to this:

pub trait Scheduler: Send + Sync + Debug {
    fn id(&self) -> SchedulerId;
    fn pool(&self) -> InstalledSchedulerPoolArc;

    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);

    // suppress false clippy complaints arising from mockall-derive:
    //   warning: the following explicit lifetimes could be elided: 'a
    #[allow(clippy::needless_lifetimes)]
    fn context<'a>(&'a self) -> Option<&'a SchedulingContext>;

    fn replace_context(&mut self, context: SchedulingContext);
}

Bank now holds Box<dyn Scheduler> and returns it into a pool when it is frozen.

Member Author

@ryoqun ryoqun Apr 26, 2023
In other words, an InstalledSchedulerPoolArc, on the first glance, would then need to refer to a pool of schedulers that are actually "installed". But, IIUC, they are not. It is holding schedulers that are not installed into any bank, and are waiting to be installed.

how about Installable?

When I was trying to build a mental model of the abstractions interacting here, also reading your comments on the Bank/Scheduler circular references affecting their lifetimes, I wondered if choosing an ownership model could simplify both references and names.

really appreciate you going so deep and the suggestion. I think i misled you. I think i can improve the docs with this feedback from someone with non-perfect context

Only Scheduler::execute requires a Bank, so just pass it in as an argument.
...

    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);

admittedly, this isn't documented anywhere in this pr; i considered this approach, but we can't take it.

we need to make {schedule_execution/execute}() non-blocking without executing txes; soon, it will just crossbeam::Channel::send() SanitizedTransactions. Also, Schedulers are expected to be inherited across banks in the case of the block_generation scheduling_mode. so, we can't naively tie any SanitizedTransaction to a particular bank at the time of scheduling. Also, if we were to tie them, we would be forced to use sync::Weak to avoid bank leaks just because of buffered sanitized txes, which in turn incurs a perf. penalty for each tx, which isn't acceptable.

all in all, SchedulingContext and the complicated dance around it were born.

That said, i'm still open to better trait design. :)

again, I'll write a proper in-source doc explaining this.

Contributor
[...] I think i misled you. I think i can improve the docs with this feedback from someone with non-perfect context

Glad my attempts to understand the PR brought at least some value :))

Only Scheduler::execute requires a Bank, so just pass it in as an argument.
...

    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);

admittedly, this isn't documented anywhere in this pr; i considered this approach, but we can't take it.

we need to make {schedule_execution/execute}() non-blocking without executing txes; soon, it will just crossbeam::Channel::send() SanitizedTransactions. Also, Schedulers are expected to be inherited across banks in the case of the block_generation scheduling_mode. so, we can't naively tie any SanitizedTransaction to a particular bank at the time of scheduling. Also, if we were to tie them, we would be forced to use sync::Weak to avoid bank leaks just because of buffered sanitized txes, which in turn incurs a perf. penalty for each tx, which isn't acceptable.

all in all, SchedulingContext and the complicated dance around it were born.

That said, i'm still open to better trait design. :)

again, I'll write a proper in-source doc explaining this.

I should have checked. So the dependency is not via a Bank directly, but via a SchedulingContext. I am still missing a complete picture, but just to be clear, you are saying that this:

    fn execute(&self, context: &SchedulingContext, sanitized_tx: &SanitizedTransaction, index: usize);

Is not going to work, correct?
Is it because the caller of execute() does not have a SchedulingContext at hand, and you do not want to add it into the Bank?

Currently, SchedulingContext just wraps a reference to a Bank, adding a mode to it:

#[derive(Clone, Debug)]
pub struct SchedulingContext {
    mode: SchedulingMode,
    // Intentionally not using Weak<Bank> for performance reasons
    bank: Arc<Bank>,
}

Maybe if you put mode into the Scheduler, but keep the Bank as the execute() argument, it would work, ownership-wise?
And if, in the future, the scheduling machinery is going to change, you can replace mode with a channel or anything else, still living inside a Scheduler, but the Bank is kept outside.
Bank and any other data are only combined when execute() is invoked, and the reference to Bank is unnecessary for the other method calls.
This way, there is no circular dependency between the Bank that (temporarily) owns the Scheduler and the Scheduler itself.

Something like this:

pub trait Scheduler: Send + Sync + Debug {
    fn id(&self) -> SchedulerId;
    fn pool(&self) -> InstalledSchedulerPoolArc;
    
    fn mode(&self) -> SchedulingMode;
    fn set_mode(&mut self, v: SchedulingMode);

    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);
}

Depending on how mode is used in a specific implementation, mode()/set_mode() might be unnecessary in this trait. If, for example, setter and getter of the mode both know the concrete type of the Scheduler they work with.

Member Author

@ryoqun ryoqun May 8, 2023
also, the above benches demonstrate why std::sync::Weak can't be used due to perf. concerns, justifying another related design decision: the inevitable introduction of the cyclic ref to Bank.

Contributor
❗ arc downgrade + upgrade takes 1us? Did I do that math right? ~100_000_000 ns overhead per iter with 10_000 txs, 10 upgrade/downgrade per tx = 100_000 upgrade/downgrade per iter.

~4GHz machine => ~4000 cycles per upgrade downgrade?

That seems really high to me, maybe my understanding of how Arc works is not matched with reality.

Member Author

@ryoqun ryoqun May 9, 2023
❗ arc downgrade + upgrade takes 1us?

well, you did that math wrong... it's actually 100ns (for the contended testcase), thanks to the super busy cpu cache-line for the Arc<Bank>:

100_000_000 ns overhead per iter with 10_000 txs, 100 (see my out-of-tree patch comment) upgrade/downgrade per tx = 1_000_000 upgrade/downgrade per iter.

that said, 100ns is still too costly, considering this perf. penalty increases as the core count goes up (I'm targeting 100 cores).

as shown by blocking::bench_{with,without}_arc_mutation and nonblocking::bench_with_01_thread_{with,without}_arc_mutation, the uncontended-case overhead is ~14_000_000 ns, so ~14ns per upgrade/downgrade (that's reasonable as a baseline number for a few CAS ops). that's a ~6x slowdown in the contended case, which is completely avoidable with a proper impl like this pr.
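For reference, a minimal standalone sketch of how the per-task downgrade+upgrade cost can be measured (std only, a single uncontended thread; this is not one of the blocking::/nonblocking:: benches referenced above, and the contended numbers come from many threads hammering the same Arc):

use std::sync::{Arc, Weak};
use std::time::Instant;

fn main() {
    // Stand-in for Arc<Bank>; only the refcount traffic matters here, not the payload.
    let bank = Arc::new(vec![0u8; 1024]);
    let iterations: u64 = 1_000_000;

    let start = Instant::now();
    for _ in 0..iterations {
        let weak: Weak<Vec<u8>> = Arc::downgrade(&bank);
        // upgrade() per task is exactly the hot-path cost this pr tries to avoid
        let strong = weak.upgrade().expect("bank still alive");
        std::hint::black_box(&strong);
    }
    let elapsed = start.elapsed();
    println!(
        "{} downgrade+upgrade pairs in {:?} (~{} ns each, uncontended)",
        iterations,
        elapsed,
        elapsed.as_nanos() / iterations as u128
    );
}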

Member Author
@ilya-bobyr

(hi, really thanks for another dense code-review round. I'll address the comments in order.)

So, is the InstalledScheduler trait naming and api okay to your eyes by now?

I'm planning to start an intense coding/writing session once it's okay, as laid out at: #31239 (comment)

Member Author

@ryoqun ryoqun Jun 9, 2023
btw, I'm clinging to AttachedScheduler now that we're settling on BankWithScheduler while InstalledSchedulerPool is remaining as-is... lol

impl Scheduler {
    fn spawn(pool: Arc<SchedulerPool>, initial_context: SchedulingContext) -> Self {
        Self {
            id: thread_rng().gen::<SchedulerId>(),
Contributor
Is there a particular reason these need to be random, rather than just sequentially numbered?
For example, using a shared AtomicU64.
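For what it's worth, a minimal sketch of that sequential alternative (hypothetical helper; assumes a process-wide counter is acceptable):

use std::sync::atomic::{AtomicU64, Ordering};

type SchedulerId = u64;

// Hypothetical replacement for thread_rng().gen::<SchedulerId>()
static NEXT_SCHEDULER_ID: AtomicU64 = AtomicU64::new(0);

fn next_scheduler_id() -> SchedulerId {
    // Relaxed is enough: only uniqueness is needed, not ordering with other memory ops.
    NEXT_SCHEDULER_ID.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    assert_eq!(next_scheduler_id(), 0);
    assert_eq!(next_scheduler_id(), 1);
}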

}

fn schedule_termination(&mut self) {
    drop::<Option<SchedulingContext>>(self.context.take());
Contributor
I wonder why an explicit type here is necessary.
Or was it added for documentation purposes?

Member Author
originally i thought this was a nice idea; now, not so much... lol: 9f82921

Comment on lines 163 to 166
WaitReason::ReinitializedForRecentBlockhash => {
    // rustfmt...
    true
}
Contributor
minor
Why not remove the comment and allow rustfmt to turn the block into an expression?

Member Author
i occasionally prevent rustfmt from formatting so that each match arm is styled in the same manner.

but, the code is gone as part of 81075ac

Comment on lines 188 to 195
fn scheduling_context(&self) -> Option<&SchedulingContext> {
    self.context.as_ref()
}

fn replace_scheduler_context(&mut self, context: SchedulingContext) {
    self.context = Some(context);
    *self.result_with_timings.lock().expect("not poisoned") = None;
}
Contributor

@ilya-bobyr ilya-bobyr Apr 19, 2023
Naming between these two functions is a bit inconsistent.
They both operate on the same field: context.
The first function talks about "scheduling context", while the second talks about "scheduler context".

Member Author
oops, this is leftover from rename... fn replace_scheduling_context is correct.

Member Author
btw, Scheduler <=> Scheduling is so nuanced. i named the context SchedulingContext to indicate its lifecycle irrelevance to any particular Scheduler (i.e. reinitialize carries over the same SchedulingContext). i'll comment on this aspect as well later. thanks for pointing out!

Member Author
i'll comment this aspect as well later. thanks for pointing out!

done: b39e2c9

Comment on lines 114 to 122
fn scheduler_id(&self) -> SchedulerId {
    self.id
}

fn scheduler_pool(&self) -> SchedulerPoolArc {
    self.pool.clone()
}

fn schedule_execution(&self, transaction: &SanitizedTransaction, index: usize) {
Contributor
minor
Are these methods expected to be part of an API that has multiple other purposes?
If not, their names could probably be shortened a bit to id(), pool() and execute() respectively.

Member Author
i named it schedule_execution to indicate its non-blocking/async nature, like its counterpart schedule_termination (termination will be non-blocking as well in the future).

Member Author

@ryoqun ryoqun Apr 26, 2023
as for id() and pool(), i thought they were too short at first, but i'll rethink it later. (i really like shorter names)

Member Author
as for id() and pool(), i thought they were too short at first, but i'll rethink it later. (i really like shorter names)

i changed my mind: 21b7250

Comment on lines 142 to 145
// so, we're NOT scheduling at all; rather, just execute the tx straight off. we don't need
// to solve inter-tx locking deps only in the case of a single-threaded fifo like this....
if result.is_ok() || !fail_fast {
    *result = execute_batch(
Contributor
As a person lacking some context, I must admit that the logic here is still not completely clear to me, even after reading the comment.
Is the idea that a certain scheduler may want to execute transactions even if one of the previous transactions in the block has failed?

I think adding a comment to the

let fail_fast = match context.mode() {

line might clarify this. You clearly already made an effort there when you said

// this should be false, for (upcoming) BlockGeneration variant.

but it might still be lacking a higher level explanation as to the whole reasoning for the fail_fast existence in the first place.

Member Author
under block generation mode, it's normal for some txes to fail. I'll write a comment. :)
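A sketch of the kind of comment that could go next to fail_fast (hypothetical shape, assuming just the two scheduling modes; not the actual diff):

#[derive(Clone, Copy, Debug)]
enum SchedulingMode {
    BlockVerification,
    // BlockGeneration, // upcoming
}

fn fail_fast(mode: SchedulingMode) -> bool {
    match mode {
        // Block verification replays a block someone else produced: a single failed tx
        // already invalidates the whole block, so there is no point executing the rest.
        SchedulingMode::BlockVerification => true,
        // Block generation (upcoming) routinely sees failing txs (e.g. an unfunded fee
        // payer) and must keep executing the remaining ones to pack the block, so it
        // would return false here.
    }
}

fn main() {
    assert!(fail_fast(SchedulingMode::BlockVerification));
}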

Comment on lines +225 to +226
let debug = format!("{pool:#?}");
assert!(!debug.is_empty());
Contributor
Wouldn't this always be true?
pool is an Arc<SchedulerPool>.
SchedulerPool is a struct with a default implementation for Debug.
And Arc forwards its Debug implementation to the contained type.

So debug will contain a debug output for a SchedulerPool instance. Which should never be an empty string.

Member Author
yep, this is done as a sanity check and for coverage of the Debug impl. i did a similar thing for Bank. maybe i need to comment both as such

Contributor
Are you talking about the "Codecov Report", or by "coverage" are you saying you want to actually exercise the Debug code?
For SchedulerPool, Debug is written by a macro. So there seems to be no need for checking it.

Comment on lines 3686 to 3689
// This is needed until we activate fix_recent_blockhashes because intra-slot
// recent_blockhash updates necessitates synchronization for consistent tx check_age
// handling.
self.wait_for_reusable_scheduler();
Member Author
here

Comment on lines 765 to 814
let mut processing_states: Vec<State> = vec![State::Blocked; dependency_graph.len()];
let mut signature_indices: HashMap<&Signature, usize> =
    HashMap::with_capacity(dependency_graph.len());
signature_indices.extend(
    pending_transactions
        .iter()
        .enumerate()
        .map(|(idx, tx)| (tx.signature(), idx)),
);

let mut is_done = false;
while !is_done {
    is_done = true;
    for idx in 0..processing_states.len() {
        match processing_states[idx] {
            State::Blocked => {
                is_done = false;

                // if all the dependent txs are executed, this transaction can be
                // scheduled for execution.
                if dependency_graph[idx]
                    .iter()
                    .all(|idx| matches!(processing_states[*idx], State::Done))
                {
                    self.inner_scheduelr.schedule_execution(Arc::new((
                        pending_transactions[idx].clone(),
                        idx,
                    )));
                    // this idx can be scheduled and moved to processing
                    processing_states[idx] = State::Processing;
                }
            }
            State::Processing => {
                is_done = false;
            }
            State::Done => {}
        }
    }

    if is_done {
        break;
    }

    let mut executor_responses: Vec<_> = vec![receiver.recv().unwrap()];
    executor_responses.extend(receiver.try_iter());
    for r in &executor_responses {
        processing_states[*signature_indices.get(r).unwrap()] = State::Done;
    }
}
Ok(())
Contributor
<bike-shedding>

I wonder if instead of going through all dependency_graph.len() entries of the vec every time, and doing a search for the signature via a hash map it would make sense to just have a few mutating vectors.

Something like

            #[derive(Clone, Copy, PartialEq, Eq)]
            enum State {
                Blocked,
                Processing,
                Done,
            }

            let mut state = vec![State::Blocked; dependency_graph.len()];
            let mut blocked = pending_transactions
                .iter()
                .enumerate()
                .map(|(idx, tx)| (idx, tx.signature()))
                .collect::<Vec<_>>();

            let mut pending = Vec::with_capacity(dependency_graph.len());

            while !(blocked.is_empty() && pending.is_empty()) {
                for blocked_i in 0..blocked.len() {
                    if blocked_i >= blocked.len() {
                        break;
                    }

                    let (tx_idx, _signature) = blocked[blocked_i];
                    // if all the dependent txs are executed, this transaction can be
                    // scheduled for execution.
                    if dependency_graph[tx_idx]
                        .iter()
                        .all(|dependency_idx| state[*dependency_idx] == State::Done)
                    {
                        self.inner_scheduelr.schedule_execution(Arc::new((
                            pending_transactions[tx_idx].clone(),
                            tx_idx,
                        )));

                        pending.push(blocked.swap_remove(blocked_i));
                        state[tx_idx] = State::Processing;
                    }
                }

                assert!(
                    !pending.is_empty(),
                    "If there is no pending work but `blocked` is not empty we \
                     are in a deadlock"
                );

                let done_tx_sig = receiver.recv()
                    .expect("`pending` is non-empty");
                let pending_i = pending
                    .iter()
                    .position(|&(_idx, signature)| signature == &done_tx_sig)
                    .expect("`pending` must contain the completed transaction");

                let done_tx_idx = pending.swap_remove(pending_i).0;
                state[done_tx_idx] = State::Done;
            }
            Ok(())
        }

Considering it is testing code, not sure if it really matters.

Member Author
I wonder if instead of going through all dependency_graph.len() entries of the vec every time, and doing a search for the signature via a hash map it would make sense to just have a few mutating vectors.

Something like ...

yeah, it looks at least competitive with the current algo. cc: @buffalu — a possible algo improvement for fast replay.

the .position part is a linear scan but should fare well against the HashMap, as pending_transactions shouldn't be that large to begin with.

also, i think reviving receiver.try_iter() might be worth it.

... Considering it is testing code, not sure if it really matters.

np. thanks for the suggestion. I'll leave this code as-is since, as you pointed out, it is bench code (the bench isn't trying to eval this code; it just wants a small scheduler logic).

but i just cc-ed people who might care about it. :)

Comment on lines 80 to 103
pub fn self_arc(&self) -> Arc<Self> {
    self.weak_self
        .upgrade()
        .expect("self-referencing Arc-ed pool")
}
}

impl InstalledSchedulerPool for SchedulerPool {
    fn take_from_pool(&self, context: SchedulingContext) -> InstalledSchedulerBox {
        let mut schedulers = self.schedulers.lock().expect("not poisoned");
        // pop is intentional for filo, expecting relatively warmed-up scheduler due to having been
        // returned recently
        let maybe_scheduler = schedulers.pop();
        if let Some(mut scheduler) = maybe_scheduler {
            scheduler.replace_context(context);
            scheduler
        } else {
            Box::new(PooledScheduler::<_, DefaultScheduleExecutionArg>::spawn(
                self.self_arc(),
                context,
                DefaultTransactionHandler,
            ))
        }
    }
Contributor
Arc<Self> is object safe: https://doc.rust-lang.org/reference/items/traits.html#object-safety

This is one of the things that Rust can do that C++ can not.

@ilya-bobyr
Contributor

Maybe there is a way to name BankWithScheduler using something other than the actual list of the contained items? TxReadyBank, ReadyBank, SchedulableBank, just Scheduler(?), though perhaps a better name can be found. In which case, those bank.bank_cloned() calls might change into scheduler.bank_cloned(), or something of that sort.[1]

Yeah, i know i named BankWithScheduler poorly. ;) After all, I was kinda tired of all the code changes to move the scheduler out of the bank at that time... I'll think about naming a bit now.

hi, I'm back! sadly, i couldn't come up with a definitively better name. ExtendedBank, BankContext, BankContainer, BankWrapper or WrappedBank..

These all seem overly generic. It looks to me like you are trying to create a pretty specific API.
One that allows transaction scheduling and that is it.
All these names look like a God pattern in disguise, especially the ExtendedBank.
They say nothing about the intent of the API - and that is bad.

Also, i like SchedulableBank among your suggestions. but none of these rings my bell..

You can use it if you like it :P
I do not have any other suggestions at the moment.
Ultimately, it is your choice, of course.

@ilya-bobyr at the same time, i really dislike bank.bank_cloned() and bank.bank(). so i tried to remove them entirely: 74b9439 maybe a bit controversial. what do you think?

Looking at just the diff, it seems nice that bank.bank_clone() is gone.
I would need to re-read the whole change once again to give a more holistic opinion.
It is not a big deal either way. And you like it more, I assume.
So, I call it a win :)

Comment on lines +251 to +255
pub fn clone_with_scheduler(&self) -> BankWithScheduler {
    BankWithScheduler {
        inner: self.inner.clone(),
    }
}
Contributor
Should this be impl Clone for BankWithScheduler?

Member Author
That would shadow <Arc<Bank> as Clone>::clone(), given the existence of impl Deref for BankWithScheduler... so i can't do it. maybe i need to comment about this deliberate subtlety.
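A minimal sketch of that subtlety, assuming (as the comment above implies) Deref<Target = Arc<Bank>>; the types here are stand-ins, not the real ones:

use std::ops::Deref;
use std::sync::Arc;

struct Bank;

struct BankWithScheduler {
    bank: Arc<Bank>,
}

impl Deref for BankWithScheduler {
    type Target = Arc<Bank>;
    fn deref(&self) -> &Arc<Bank> {
        &self.bank
    }
}

fn main() {
    let wrapper = BankWithScheduler { bank: Arc::new(Bank) };

    // Today: BankWithScheduler has no Clone impl, so method resolution falls through
    // Deref and this clones the inner Arc<Bank> (a cheap refcount bump).
    let _inner: Arc<Bank> = wrapper.clone();

    // If `impl Clone for BankWithScheduler` were added, this same call site would
    // resolve to the wrapper's own clone() and return a BankWithScheduler instead,
    // silently changing the meaning of every existing `.clone()` that relies on Deref.
}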

Contributor

@ilya-bobyr ilya-bobyr Jun 29, 2023
I had a suspicion that Deref abuse would bring all the bad interactions that C++ implicit conversions bring.
Here is one indication of that :P

Maybe remove the Deref? I think Rust style is to avoid implicit conversions except for a very few clearly defined cases that are almost part of the language in the first place.
Sorry if this suggestion goes completely against what you were doing.
Just making sure you consider it as a possibility :)

Member Author
I had a suspicion that Deref abuse would bring all the bad interactions that C++ implicit conversions bring.
Here is one indication of that :P

yeah, i was reluctant about the use of Deref as well because of these unwanted effects...

Maybe remove the Deref? I think Rust style is to avoid implicit conversions except for a very few clearly defined cases that are almost part of the language in the first place.

... that said, i still think removing Deref isn't desirable, considering the yet another rather large code churn doing so would cause..

also, it seems there are other funny Derefs already (just run $ git grep -A5 "impl.*Deref"). so, I'm not the first.. xD

Sorry if this suggestion goes completely against what you were doing.
Just making sure you consider it as a possibility :)

thanks for raising these concerns. I'll go as-is with some good (hopefully) docs: 3843793

Member Author

ryoqun commented Jun 29, 2023

Also, i like SchedulableBank among your suggestions. but none of these rings my bell..

You can use it if you like it :P I do not have any other suggestions at the moment. Ultimately, it is your choice, of course.

I've settled on BankWithScheduler, seems this name resonates well with ::clone_{with,without}_scheduler()s. ref: 3843793

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jul 14, 2023
@github-actions github-actions bot closed this Jul 24, 2023
@ryoqun ryoqun mentioned this pull request Oct 29, 2023