
Brave 4.12


Brave 4.12 re-introduces support for Spring WebMVC 2.5 and reduces overhead under load

Spring WebMVC 2.5

Brave once supported Spring WebMVC 2.5, but this fell off the radar as many applications updated to Spring 3 or later. Through your repeated requests, we realized Spring WebMVC 2.5 is still important.

Now, brave-instrumentation-spring-webmvc is usable in XML-driven Spring WebMVC 2.5 apps. We've also introduced zipkin-reporter-spring-beans, which lets you more flexibly configure things like Kafka topics. A number of small changes were made to ensure older libraries work, including Maven invoker tests and a new example.
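
For orientation, the XML beans wire up the same objects you would otherwise build in code: a sender (HTTP, or Kafka with a configurable topic) and an async reporter feeding Brave. Here is a minimal programmatic sketch, assuming zipkin-reporter 1.x and Brave 4.x class names; the service name and endpoint are placeholders.

```java
import brave.Tracing;
import zipkin.Span;
import zipkin.reporter.AsyncReporter;
import zipkin.reporter.urlconnection.URLConnectionSender;

public class TracingConfiguration {
  // Roughly what zipkin-reporter-spring-beans configures via XML: a sender
  // (HTTP here; the Kafka sender is where the topic is set) plus an async
  // reporter that flushes spans off the request path, handed to Brave.
  public static Tracing tracing() {
    URLConnectionSender sender =
        URLConnectionSender.create("http://localhost:9411/api/v1/spans");
    AsyncReporter<Span> reporter = AsyncReporter.create(sender);
    return Tracing.newBuilder()
        .localServiceName("webmvc25-example") // placeholder service name
        .reporter(reporter)
        .build();
  }
}
```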

Thanks for your patience, and remember: asking for what you want is the best start at getting it. You can find us on Gitter or watch our repo to see what others are asking for.

Less overhead under load

Through coordinated effort in zipkin-reporter, Brave 4.12 performs much better under heavy load. This means that by simply upgrading, you will have a lot less overhead when you get a surge of requests. Thanks to @tramchamploo and @wu-sheng for keeping us honest.

Long story on overhead under heavy load

In the past, particularly in the zipkin-reporter project, users like @tramchamploo raised concerns about locking and the number of spans one can send to Zipkin. @adriancole dismissed some of these concerns, due to the unlikelihood of anyone querying spans at that volume and the relatively good JMH scores on the related components.

This was unfortunate, because the data collection cost exists when a system is under load regardless of whether anyone queries the data, and JMH micro-benchmarks don't reflect how systems behave under load. The problem went dormant until the next person, @wu-sheng, noticed the overhead, and our focus changed.

Lately, users have asked for a "firehose mode" where 100% of data is collected and reported to something other than Zipkin, such as a stats aggregator, independently of the sampling mechanism. To gauge whether this is affordable, we benchmarked our example apps with wrk and found something quite odd: an order-of-magnitude latency spike when tracing 100% of requests under load.
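
To make the idea concrete, a firehose reporter would just be another implementation of the reporter interface that sees every span. The sketch below is hypothetical (no such mode ships in Brave 4.12) and assumes the zipkin-reporter 1.x Reporter interface; the stats aggregation is a stand-in.

```java
import java.util.concurrent.atomic.LongAdder;
import zipkin.Span;
import zipkin.reporter.Reporter;

// Hypothetical "firehose": sees 100% of spans and feeds a local aggregator
// instead of (or alongside) sending them to Zipkin. Illustration only; this
// is not an API shipped in Brave 4.12.
public class StatsFirehoseReporter implements Reporter<Span> {
  private final LongAdder spansSeen = new LongAdder();

  @Override public void report(Span span) {
    spansSeen.increment(); // stand-in for a real metrics/aggregation call
  }

  public long count() {
    return spansSeen.sum();
  }
}
```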

Here's an example app tested with tracing disabled via a forced "not sampled" decision:

$ wrk -t4 -c64 -d30s http://localhost:8081 -H'X-B3-Sampled: 0'
Running 30s test @ http://localhost:8081
  4 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.93ms   32.24ms 541.71ms   94.02%
    Req/Sec     1.38k   482.57     2.42k    82.08%
  80567 requests in 30.08s, 11.15MB read
  Non-2xx or 3xx responses: 596
Requests/sec:   2678.03
Transfer/sec:    379.67KB

Here's the same app with 100% sampled. Notice the order-of-magnitude difference in latency:

$ wrk -t4 -c64 -d30s http://localhost:8081
Running 30s test @ http://localhost:8081
  4 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   135.80ms  143.86ms 945.27ms   84.16%
    Req/Sec   207.60    150.83   630.00     69.07%
  19370 requests in 30.07s, 2.63MB read
Requests/sec:    644.11
Transfer/sec:     89.42KB

We expect overhead to increase when sending to Zipkin, as we are collecting data like IP addresses, and that has a cost. What we expected, though, was a percentage increase, not an order of magnitude. Discovering this latency spike was well timed because, thanks to Sheng, we were already thinking about encoding.

The order-of-magnitude bump wasn't due to encoding on the client threads; rather, it was caused by a separate thread which bundles encoded spans into messages. The problem was that this bundling happened under a lock shared with the client threads. While "performance people" can often spot this immediately, it can also be detected with contended, integrated benchmarks, even benchmarks against silly hello-world example apps. The fix wasn't hard, but the story is still worth telling.
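
To illustrate the pattern (a simplified sketch, not the actual zipkin-reporter code): client threads enqueue encoded spans, and a flush thread turns them into a message. Doing the bundling while still holding the shared lock stalls every client thread; draining under the lock and bundling outside it removes the contention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Simplified illustration of the contention described above.
public class SpanBuffer {
  private final ReentrantLock lock = new ReentrantLock();
  private final List<byte[]> pending = new ArrayList<>();

  // Called on request (client) threads.
  public void add(byte[] encodedSpan) {
    lock.lock();
    try {
      pending.add(encodedSpan);
    } finally {
      lock.unlock();
    }
  }

  // Called on the flush thread. The original problem: bundling spans into a
  // message while still holding the lock, which blocks every client thread
  // trying to add(). The fix: only drain under the lock, bundle outside it.
  public byte[] drainAndBundle() {
    List<byte[]> drained;
    lock.lock();
    try {
      drained = new ArrayList<>(pending);
      pending.clear();
    } finally {
      lock.unlock(); // release before the comparatively slow bundling work
    }
    return bundle(drained);
  }

  private byte[] bundle(List<byte[]> spans) {
    int size = 0;
    for (byte[] span : spans) size += span.length;
    byte[] message = new byte[size];
    int pos = 0;
    for (byte[] span : spans) {
      System.arraycopy(span, 0, message, pos, span.length);
      pos += span.length;
    }
    return message;
  }
}
```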

The moral of the story is: don't make the same mistake @adriancole did. If you get a complaint about performance in a performance-sensitive library, look at it from multiple angles before dismissing it. Most importantly, use realistic benchmarks before prioritizing. If you don't have time to do that, ask the requestor to, or file a bug so that someone else (even if that someone is you, later) can.

If you like this sort of work, please join us! For example, we are looking for a disruptor-based alternative to further reduce overhead. Even if you aren't writing Zipkin code, comments help, as we all have a lot to learn.