Hubspot cell balancer & normalizer #126

szabowexler · 2024-12-03T20:21:47Z

Overview

HubSpot is experimenting internally with a cellular architecture for some of our most critical hbase tables. At a high level, this involves using the first two bytes of a row key to partition the set of rows into n (fixed to 360 in this early draft) distinct partitions. Invariants that establish that regions are entirely contained by a single cell (so no merging across cell lines) combined with an update to the balancer that makes region to server assignment cell aware allow us to trade off performance (maximized by a uniform distribution of data to regions to servers) against isolation (maximized by placing all cell data on a minimal set of servers with no overlap with other cells), with a minimal (or even zero) impact on the size of the cluster required, or the operational burden of running it. There are two major parts to this PR, and I'll address them in separate sections. At a high level, here are the outcomes this PR enables:

All regions contain the data for exactly one cell, regardless of how long the cluster runs, and how many merge or split operations occur
Subject to configuration, the number of regions per server will generally be uniform
Subject to configuration, the number of cells per server will generally be uniform
Subject to configuration, the number of cells per server will be capped offering the ability to control the performance to isolation tradeoff dynamically

In testing, the cluster we use has 550-600 region servers, 31,000-34,000 regions, and is about 130TB compressed.

Normalizer

We assume that the initial set of regions in the cluster are cell-aligned (that is, all of the data within every region comes from a single cell, but a single cell may (and probably will) require more than one region to represent it). Then, we update the normalizer such that routine cluster operations (merging regions that are too small, or splitting regions that are too large) cannot break it. Because every region starts contained within a cell, we do not need to adjust the split logic. We do need to adjust merging. In particular: when computing runs of small regions that may be merged, cell aware tables should stop when a run crosses a boundary and subsequent regions no longer belong to the same cell. These changes are contained within the SimpleRegionNormalizer, and are relatively simple.

Balancer

The cell-aware balancer represents the bulk of the complexity for this change. A quick review of hbase's balancer. hbase models the balance of a cluster (or table on a cluster) as an objective function, which is a scaled sum of constituent parts. A static threshold is set that determines when the table is unbalanced enough to require rebalancing. When a rebalance is required, a stochastic greedy algorithm is used which approximates a method of gradient descent - generator functions are randomly selected to propose iterative cluster state changes, which are accepted only if they improve the global balance state for the cluster (differentiating the approach from e.g. simulated annealing which might permit a temporary worsening, to avoid getting caught in a local optimum).

So, this PR introduces a new cost function (the HubSpotCellCostFunction) and a new generator (HubSpotCellBasedCandidateGenerator) to propose iterative mutative steps for the cluster.

Cell cost function

Our cost function captures two guiding principles for a cellular table, with s servers, r regions, and c cells:

The number of regions per server should be uniform (i.e. one of ⌊r/s⌋ or ⌊r/s⌋ + 1)
The number of cells per server should be maximized, but no greater than a configurable bound

Our cost function has two parts, one for each of these principles. To capture (1), it includes the raw count of the number of servers with a count of regions outside of the two permissible values listed above. To capture (2), we consider the concept of a server-cell (i.e. the unique instance of a given cell on a given server), and consider a desired capacity defined by the max cells per server scaled by s. In our cluster (with 585 servers), if the bound on cells per server were 36 (10%), then the capacity of server cells would be 21,060. The imbalance of the cluster (from a cell to server perspective) is represented as the number of server-cells overloaded on the cluster, scaled by that capacity. A server-cell is overloaded if the server (excluding that cell) already has the configurable bound per server of cells. In a concrete example: if a server has 50 regions (each of which is for a different cell), 50 distinct cells, and the bound is 36, then that server will have 14 excess server-cells. The sum of all excess server-cells over all servers, scaled by the overall capacity of server-cells, provides a normalized (ranging from 0-1) metric describing imbalance.

This construction relies on the step generator acting to maximize the count of distinct cells per server, and the stochastic harness to discard proposed solutions that might go over the configured limit.

Cell step generator

The step generator is by far the most complex component of this change. It is highly stochastic: at every step when "pick" is used, we utilize a reservoir random online sampling approach to do an efficient single-pass random selection (subject to some filtering/optimizing criterion). With that in mind, the step generator operates as follows.

Consider a cellular table, with s servers, r regions, and c cells:

If there are any "underloaded" servers (count of region is less than ⌊r/s⌋):
i.If there are any servers with excess supply (count of region is more than ⌊r/s⌋), pick a server with the most regions, and move a region from it to the underloaded server
If there are any "overloaded" servers (count of region is greater than ⌊r/s⌋ + 1)
i. If there are any servers with excess capacity (count of region is exactly ⌊r/s⌋), pick one such server and move a region from the overloaded server
If all servers have exactly the target cell count, return NOOP
If any servers have more than the target cell count:
i. pick a server with the most cells
ii. pick the least frequent cell (by representing region) on that server
iii. pick another server and cell such that the other server's cell is present on the picked server from (i)
iv. swap our selected server/cell pairs
If any servers have less than the target cell count:
i. pick a server with the fewest cells
ii. pick the most frequent cell (by representing region) on that server
iii. pick another server and cell such that the other cell is not present on our selected server
iv. swap our selected server/cell pairs

This is not recommended for target cell per server counts approaching 1 (anything <4 is likely to not converge), but should work very well for bounds that represent 30-70% of a given server's region capacity.

Future work

The cell candidate generator is currently combining two distinct priorities: (1) that all servers have the same number of regions, and (2) that those regions obey certain properties. This overlaps other balancer functions, and is not fully aligned with the intent of the design of the balancer. At the same time, it dramatically simplifies the logic of cell distribution to be able to assume that when the candidate generator begins shuffling cells it knows exactly how many regions exist per server, and the only way this can be achieved is to force that balance to cope with other action generators mutating the cluster state between rounds of step generation by the cell balancer
Cell based balancing is really just balancing subject to specific constraints/groupings on the first n bytes of rowkeys. It should be possible to generalize this concept to effectively be a "prefix-based" balance, where enabling this mode, setting the number of bytes, and defining the grouping mode and constraints are passed via TableDescriptor. We are already starting to think about the concept of balancer conditionals (see initial work here), and a future implementation of this, well generalized, might build on that concept. In keeping with this concept, the normalizer currently offers restrictions on how regions can be split (with e.g. the DelimitedKeyPrefixRegionSplitRestriction ), but there is no corresponding restriction on merges. A simple prefix-based merge restriction could also be implemented to simplify how we maintain groupings and isolation.
When a given server has reached saturation (either at ⌊r/s⌋ cells, or the configured ceiling whichever is lower), if it has excess capacity for regions the current logic randomizes the distribution of cells for those regions. In a fictitious case where a server has 50 regions and a cap of 36 cells, it's possible that we have one region for each of 35 cells, and 15 regions for the remaining cell. This may be highly undesirable from a performance standpoint (and we would be better served by a distribution with 2 regions for each of 14 cells, and 1 region for the remaining 22). This would require that the logic for redistributing cell-regions takes into account that we want to minimize the distinct regions per cell per server, which is not present today.

…xactly one cell

Prioritize balance by region and THEN evening out cell isolation

… if it goes negative

Remove all the debugging changes, generally make ready for real review & merge

ndimiduk

Sorry mate, this is dense, I've done what I can this afternoon.

One macro-comment is that we probably don't want this merged into this repo, but to ship it as an extension like we do a coprocessor or custom filter (if we have custom filters). Instead of patching the balancer or normalizer, you'd swap in your own implementation via configuration injection, i.e.,

hbase/hbase-common/src/main/resources/hbase-default.xml

Line 1637 in 9cf8134

<name>hbase.master.normalizer.class</name>

.

ndimiduk · 2024-12-06T09:48:21Z