Coordinated omission

Pierre Laporte edited this page Jul 13, 2018 · 1 revision

Coordinated Omission (CO) is a frequently discussed topic in performance tests. This page defines CO and gives pointers on how to avoid it.

Definitions

Before CO is defined, some terms related to latency must be described.

  • The "response time" denotes the duration between an action trigger and its response. Typically, the duration between a hyperlink click and the display of the new page. For Gatling, that trigger is the expected send time of a request. It is usually referred to as the latency from the user point of view.

  • The "queue time" denotes the duration between a trigger and the actual send time of the request. This duration is completely defined by technical concepts. In Gatling, this means the duration between the expected request send time and the time the first byte is sent on the network.

  • The "service time" denotes the server-centric duration of processing the request. It means the elapsed time between the moment a complete request is received and the response is sent back to the client.

A rough approximation is to represent the response time as follows: Response time = Queue time + Service time. It could be broken down further, especially to account for network times, but that is not relevant to Coordinated Omission.

<------------------------- Response time ------------------------->
<--- Queue time ---><--------------- Service time ---------------->
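As a minimal sketch (the helper name and timestamps below are hypothetical, not part of Gatling), the decomposition above can be expressed as:

```python
def response_time(expected_send, actual_send, response_received):
    """Decompose the response time as queue time + service time.

    All timestamps are in milliseconds. Network times are ignored,
    as in the rough approximation above."""
    queue_time = actual_send - expected_send        # injector-side delay
    service_time = response_received - actual_send  # server-side processing
    return queue_time + service_time

# A request expected at t=0 ms, actually sent at t=3 ms and answered
# at t=10 ms, has a 10 ms response time (of which 3 ms were queued).
```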

Detect coordinated omission

CO is when service time is measured and reported as the response time. It happens mostly in load injectors that do not account for queue time.

The easiest way to detect CO is to ask this question: When should request #12345 be sent? If you cannot answer this question precisely and consistently, then your results are most likely suffering from CO.

In the case of Gatling, this question is answered as follows. Assume a workload that injects exactly 10000 users per second during one minute. Gatling will fire 10 new users every millisecond, precisely. Therefore, request #12345 will be sent at t=1.234s.
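This reasoning can be sketched in a few lines (hypothetical helper, not Gatling code):

```python
def expected_send_time(request_index, users_per_second):
    """Expected send time, in seconds, of the n-th request (1-based)
    under a constant-rate injection profile that fires a batch of
    users at every millisecond tick."""
    users_per_ms = users_per_second // 1000
    tick = (request_index - 1) // users_per_ms  # millisecond tick of the batch
    return tick / 1000.0

# With 10000 users per second, users #12341 to #12350 all start at
# the 1234th millisecond tick, so request #12345 fires at t=1.234s.
```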

The same reasoning can be done with ramp-up injection profiles. The maths are a little more elaborate, but the core information is still there. The expected send time of every request is known in advance.
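As an illustration of the ramp-up maths (a sketch under the assumption of a linear ramp of the arrival rate; not Gatling's actual implementation), the expected send time is the root of the cumulative arrival count:

```python
import math

def ramp_send_time(n, start_rate, end_rate, ramp_duration):
    """Expected send time (seconds) of the n-th request during a linear
    ramp from start_rate to end_rate users/second over ramp_duration seconds.

    Cumulative arrivals: N(t) = start_rate*t + (end_rate-start_rate)*t**2/(2*d).
    The expected send time is the positive root of N(t) = n."""
    a = (end_rate - start_rate) / (2.0 * ramp_duration)
    b = float(start_rate)
    if a == 0.0:  # constant rate: simply n / rate
        return n / b
    return (-b + math.sqrt(b * b + 4.0 * a * n)) / (2.0 * a)

# Ramping from 0 to 100 users/s over 10 s injects 500 users in total,
# so the 500th request fires at t=10 s and the 125th at t=5 s.
```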

If you run a load injector like cassandra-stress without enabling its CO fix, you cannot predict request send times. It means that what you are measuring is really the service time, not the response time.

In other words, you need a predictable throughput in order not to run into CO.

"Coordinates" to what?

The term "coordinated omission" indeed implies that the client behavior is linked to something. But it does not say what. It is actually on purpose.

Let’s return to the example of a load testing framework that does not measure queue time. Should a GC pause occur in the injector, requests will be delayed and will not arrive at the expected time on the server. Consequently, this delay will first reduce the throughput, then increase it back. If queue time is ignored by the injector, that effect will not be reported.
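A toy simulation (all numbers are made up for illustration) shows how ignoring queue time hides an injector pause:

```python
def simulate_injector_pause():
    """Requests are scheduled every millisecond; the server always
    answers in 5 ms; the injector freezes from t=100 ms to t=200 ms."""
    service, pause_start, pause_end = 5.0, 100.0, 200.0
    measured, actual = [], []
    for i in range(1000):
        expected = float(i)  # expected send time, in ms
        # a frozen injector only sends the delayed requests at pause end
        sent = pause_end if pause_start <= expected < pause_end else expected
        received = sent + service
        measured.append(received - sent)    # what a CO-suffering injector reports
        actual.append(received - expected)  # true response time, incl. queue time
    return max(measured), max(actual)

print(simulate_injector_pause())  # (5.0, 105.0): the worst reported latency
                                  # stays at 5 ms, yet one request actually
                                  # waited 105 ms for its response
```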

CO can also happen when the client code coordinates with server events. For instance, consider a load testing framework that issues synchronous requests. Whenever a GC pause happens on the server side, the client is blocked: it completely stops sending new queries since all its threads are waiting for responses. This has the nice effect of letting the server absorb the load before resuming normal operations, but it means that the injected throughput was not what was configured. Making decisions based on such results is dangerous.
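The throughput deficit of such a synchronous, closed-model client can be illustrated with another toy simulation (all numbers hypothetical):

```python
def requests_sent_by_closed_client():
    """A single synchronous client aiming at 1000 requests/second
    (one request per millisecond). The server normally answers in 1 ms
    but pauses for 500 ms at t=1000 ms, so the request in flight at
    that moment only completes at t=1500 ms. Returns the number of
    requests actually sent during the first two seconds."""
    t, count = 0.0, 0
    while t < 2000.0:
        # the request in flight when the pause starts absorbs the pause
        service = 500.0 if t <= 1000.0 < t + 1.0 else 1.0
        t += service  # the client blocks for the whole service time
        count += 1
    return count

print(requests_sent_by_closed_client())  # 1501, instead of the 2000 configured
```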

Is it a big deal?

In short, yes.

Frequently, a user migrating from a CO-suffering load injector to Gatling will experience a lot of request errors. A naive approach is to configure Gatling with the same throughput that was advertised by the CO-impacted injector. Since Gatling keeps sending queries during server-side pauses, a large queue of requests forms on the server side, and the server may never recover.

What if I want to measure service time?

Set the `-Dgatling.dse.plugin.measure_service_time` system property.

Going further

This article gives much more information about response time and service time.