Skip to content

Commit

Permalink
Apply suggestions from code review
Browse files Browse the repository at this point in the history
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
  • Loading branch information
Naarcha-AWS and natebower authored Jan 9, 2025
1 parent 578bf32 commit da62200
Showing 1 changed file with 17 additions and 18 deletions.
35 changes: 17 additions & 18 deletions _benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ grand_parent: User guide

# Randomizing queries

By default, OpenSearch Benchmark runs identical queries for multiple benchmark iterations. However, running the same queries repeatedly isnt ideal for every test. For example, simulating real-world caching with many iterations of the same query isn’t accurate, as it results in one cache miss, followed by many hits. OpenSearch Benchmark lets you randomize queries in a configurable way.
By default, OpenSearch Benchmark runs identical queries for multiple benchmark iterations. However, running the same queries repeatedly isn't ideal in every test scenario. For example, simulating real-world caching with many iterations of the same query results in one cache miss followed by many hits. OpenSearch Benchmark lets you randomize queries in a configurable way.

For example, changing `"gte"` and `"lt"` in the following `nyc_taxis` operation creates distinct queries, resulting in unique cache entries:

Expand All @@ -30,24 +30,24 @@ For example, changing `"gte"` and `"lt"` in the following `nyc_taxis` operation
```


You cant completely randomize the values, as then the cache would not get any hits. To get cache hits, the cache must sometimes see the same values. To account for the same values while randomizing, OpenSearch Benchmark generates a number `N` of value pairs for each randomized operation at the beginning of the benchmark. OpenSearch Benchmark stores these values in a saved list where each pair gets an index from `1` to `N`.
You can't completely randomize the values because the cache would not get any hits. To get cache hits, the cache must sometimes see the same values. To account for the same values while randomizing, OpenSearch Benchmark generates a number `N` of value pairs for each randomized operation at the beginning of the benchmark. OpenSearch Benchmark stores these values in a saved list where each pair gets an index from `1` to `N`.

Every time OpenSearch sends a query, OpenSearch Benchmark decides whether to use a pair of values from this saved list in the query. It does this a configurable fraction of the time, called repeat frequency (`rf`). If OpenSearch pair has encountered the value pair before, this might cause for a cache hit. For example, if `rf` = 0.7, the cache hit ratio could be up to 70%. This ratio could cause a hit, depending on the benchmarks duration and cache size.
Every time OpenSearch sends a query, OpenSearch Benchmark decides whether to use a pair of values from this saved list in the query. It does this a configurable fraction of the time, called _repeat frequency_ (`rf`). If OpenSearch has encountered the value pair before, this might cause a cache hit. For example, if `rf` = 0.7, the cache hit ratio could be up to 70%. This ratio could cause a hit, depending on the benchmark's duration and cache size.

Saved value pairs are based on the `Zipf` distribution, which empirically matches usage traces for many real caches. For example, if the benchmark draws pair `i` with a probability proportional to `1 / i^alpha`, `alpha` is another parameter controlling how spread out the distribution is. Pairs with low indexes appear with a much higher frequency than those with high indexes.
Saved value pairs are based on the `Zipf` distribution, which empirically matches usage traces for many real caches. For example, if the benchmark draws pair `i` with a probability proportional to `1 / i^alpha`, `alpha` is another parameter controlling the spread of the distribution. Pairs with low indexes appear with a much higher frequency than those with high indexes.

Otherwise, the other `1-rf` fraction of the time, we generate a totally new random pair of values. Because OpenSearch Benchmark has not seen these value pairs before, the pairs should miss the cache.
Otherwise, the other `1-rf` fraction of the time, a new random pair of values is generated. Because OpenSearch Benchmark has not encountered these value pairs before, the pairs should miss the cache.


## Usage

To use this feature on a workload you must make some changes to your workload's `workload.py` and supply some CLI flags when running OSB.
To use this feature in a workload, you must make some changes to `workload.py` and supply some CLI flags when running OpenSearch Benchmark.

### Modifying `workload.py`

Specify how to generate the saved value pairs for each operation by registering a standard value source for that operation. This Python function accepts no arguments and returns a dictionary. The keys mirror those in the input query, but are randomized. Finally, change the `register()` method to register this function with the operation name and field name, which is randomized.
Specify how to generate the saved value pairs for each operation by registering a "standard value source" for that operation. This Python function accepts no arguments and returns a dictionary. The keys mirror those in the input query but are randomized. Finally, change the `register()` method so that it registers this function with the operation name and field name, which are randomized.

For example, to randomize the `"total_amount"` field in the `"range"` operation from earlier, a standard value source might look like the following function:
For example, a standard value source used to randomize the `"total_amount"` field in the preceding `"range"` operation might appear similar to the following function:

```py
def random_money_values(max_value):
Expand All @@ -62,18 +62,18 @@ def range_query_standard_value_source():
return random_money_values(120.00)
```

Similarly, you can randomize the registration behavior with the following function:
Similarly, you can randomize the registration behavior using the following function:

```py
def register(registry):
registry.register_standard_value_source("range", "total_amount", range_query_standard_value_source)
```

There may already be code in this function. Leave it there if so. If `workload.py` does not exist or lacks a `register(registry)` function, you can create them.
This function may already contain code. Retain it if so. If `workload.py` does not exist or lacks a `register(registry)` function, you can create them.

#### Randomizing non-range queries

By default, OSB assumes the query to be randomized is a `"range"` query with values `"gte"`/`"gt"`, `"lte"`/`"lt"`, and optionally `"format"`. If this isnt the case, you can configure it to use a different query type name and different values.
By default, OpenSearch Benchmark assumes that the query to be randomized is a `"range"` query with values `"gte"`/`"gt"`, `"lte"`/`"lt"`, and, optionally, `"format"`. If this isn't the case, you can configure it to use a different query type name and different values.

For example, to randomize the following workload operation:

Expand Down Expand Up @@ -106,15 +106,14 @@ The first argument, `"bbox"`, is the operation's name.

The second argument, `"geo_bounding_box"`, is the query type name.

The third argument is a list of lists: `[[“top_left”], [“bottom_right”]]`. The outer lists entries specify parameters for randomization, because there might be different versions of the same name that represent roughly the same parameters, for example, `"gte"` or `"gt"`. Here, theres only one option for each parameter name. At least one version of each parameter's name must be present in the original query for it to be randomized.
The third argument is a list of lists: `[[“top_left”], [“bottom_right”]]`. The outer list's entries specify parameters for randomization because there might be different versions of the same name that represent roughly the same parameters, for example, `"gte"` or `"gt"`. Here, there's only one option for each parameter name. At least one version of each parameter's name must be present in the original query in order for it to be randomized.

The last argument is a list of optional parameters. If an optional parameter is present in the random standard value source, OpenSearch Benchmark puts the parameter into the randomized version of the query. If its not in the source, its ignored. There are no optional parameters in the following example, but the typical use case would be `"format"` in a range query.
The last argument is a list of optional parameters. If an optional parameter is present in the random standard value source, OpenSearch Benchmark inserts the parameter into the randomized version of the query. If it's not in the source, it's ignored. There are no optional parameters in the following example, but the typical use case would be `"format"` in a range query.

If there’s no registration, it uses the default; equivalent to registering `registry.register_query_randomization_info(<operation_name>, “range”, [[“gte”, “gt”], [“lte”, “lt”]], [“format”])`.

The `dict` returned by the standard value source should match the parameter names you are randomizing. For example, the standard value source for the earlier example is:

The `dict` returned by the standard value source should match the parameter names you are randomizing. For example, the standard value source for the earlier example is:
The `dict` returned by the standard value source should match the parameter names you are randomizing. For example, the following is the standard value source for the preceding example:

```py
def bounding_box_source():
Expand All @@ -136,10 +135,10 @@ def bounding_box_source():

Use the following CLI flags to customize randomization:

- `--randomization-enabled` turns randomization on and off. If randomization is not enabled, none of the randomization flags will have an effect.
- `--randomization-enabled` turns randomization on and off. If randomization is not enabled, none of the randomization flags will be applied.

- `--randomization-repeat-frequency` or `-rf` sets the fraction of pairs drawn from the saved value pairs generated at the start of the benchmark. The value should be between `0.0` and `1.0`. Default is `0.3`.
- `--randomization-repeat-frequency` or `-rf` sets the fraction of pairs drawn from the saved value pairs generated at the start of the benchmark. The value should be between `0.0` and `1.0`. Default is `0.3`.

- `--randomization-n` sets the number `N` of value pairs generated for each operation. Default is `5000`.

- `--randomization-alpha` sets the parameter `alpha` controlling the shape of the `Zipf` distribution. The value should be `>=0`. Lower values spread out the distribution more. Default is `1.0`.
- `--randomization-alpha` sets the `alpha` parameter, which controls the spread of the `Zipf` distribution. The value should be `>=0`. Lower values increase the spread of the distribution. Default is `1.0`.

0 comments on commit da62200

Please sign in to comment.