Updates
- Tue Sep 10 20:14:30 PDT 2024: Reran experiment + figure after bug fix
Sequential Halving is a simple yet powerful multi-armed bandit algorithm used in SOTA RL algorithms such as Gumbel MuZero and EfficientZero V2.
And it turns out SH can be slightly improved with a simple 1-line code change!
Sequential Halving works by allocating a fixed budget of samples across a sequence of sampling rounds. In round one, every arm is sampled. After each round, the worst half of the arms are eliminated. This process continues until you have only one arm (the best arm estimate)[1].
SeqHalving is great because it performs well across many scenarios[1]. Also, it's easy to use: there aren't any hyperparameters to tune when using it.
It turns out that Sequential Halving can be improved with a simple change: Rerank all the arms between rounds without doing any elimination. The sampling round size is still halved at the end of each round, but all arms are eligible to participate in subsequent rounds.
Doing so increases the chances of finding the true best arm. I doubt this is new/novel, and it's only a slight improvement. But still, it's useful to know!