
Performance of Atomics

Atomics, being lock-free, are generally very fast. Incrementing an Atomic counter has nearly the same performance as incrementing a typical integer when accessed from a single core. However, the performance becomes more complex when multiple CPU cores are involved, especially if they are accessing the same Atomic or if the Atomic resides in a shared cache line.
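
As a rough single-threaded illustration (not from the HPCC4j codebase, and not a substitute for a proper JMH benchmark), the sketch below times a plain long increment against an AtomicLong increment. On a single core, the two loops typically finish in the same ballpark because the Atomic's cache line never has to move between cores.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SingleCoreIncrement {
    public static void main(String[] args) {
        final long iterations = 100_000_000L;

        // Plain long increment on a single thread.
        long plainCounter = 0;
        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            plainCounter++;
        }
        long plainNanos = System.nanoTime() - start;

        // AtomicLong increment on the same thread; the cache line stays local to this core.
        AtomicLong atomicCounter = new AtomicLong();
        start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            atomicCounter.incrementAndGet();
        }
        long atomicNanos = System.nanoTime() - start;

        // Print the counters as well so the JIT cannot discard the loops entirely.
        System.out.printf("plain:  %d in %d ms%n", plainCounter, plainNanos / 1_000_000);
        System.out.printf("atomic: %d in %d ms%n", atomicCounter.get(), atomicNanos / 1_000_000);
    }
}
```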

When multiple CPU cores modify the same Atomic, each core must first gain exclusive access to the cache line storing the Atomic. This often results in L1 or L2 cache misses, with each core waiting for the cache line to be transferred from a sibling core. A few cache misses here and there may not significantly affect performance, but when many threads all need the same cache line, the cores can end up stalled, repeatedly passing the same cache line back and forth (often described as cache line contention, or "ping-ponging").
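
The sketch below (again illustrative only, and not taken from HPCC4j) makes that contention visible: every thread increments the same shared AtomicLong, so past a certain thread count the total runtime tends to grow rather than shrink as threads are added.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SharedCounterContention {
    public static void main(String[] args) throws InterruptedException {
        final long incrementsPerThread = 10_000_000L;
        int maxThreads = Runtime.getRuntime().availableProcessors();

        for (int threads = 1; threads <= maxThreads; threads *= 2) {
            AtomicLong shared = new AtomicLong();
            Thread[] workers = new Thread[threads];
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                workers[t] = new Thread(() -> {
                    for (long i = 0; i < incrementsPerThread; i++) {
                        // Every increment competes for the same cache line.
                        shared.incrementAndGet();
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join();
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%2d threads: %5d ms for %d increments%n",
                    threads, elapsedMs, shared.get());
        }
    }
}
```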

This exact scenario occurred with the FileUtility in HPCC4j: we were incrementing an AtomicLong after reading each record. This worked fine with only a few threads, but while debugging a recent problem we noticed something strange. Read performance was initially normal, yet once the number of reading threads passed a certain threshold, per-thread read performance plummeted.

Using the OpenTelemetry tracing we had recently incorporated into HPCC4j, we observed that read performance decreased as more threads were started. The bottleneck appeared after the data had been streamed from the remote servers, which pointed to the record processing code. Since each record processing thread operates independently and does not share resources, a bottleneck that worsened as threads were added was unexpected.

Upon further investigation, we found the issue: a shared AtomicLong used to track record counts. By keeping a per-thread record count and only updating the shared AtomicLong once per data partition, we achieved a 6x speed improvement in this scenario.
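
A minimal sketch of that pattern is shown below. The class and method names (PartitionRecordCounter, PartitionReader, readRecord) are hypothetical stand-ins rather than the actual FileUtility code; the point is simply that each thread accumulates into a plain local long and only touches the shared AtomicLong once per partition.

```java
import java.util.concurrent.atomic.AtomicLong;

public class PartitionRecordCounter {
    private final AtomicLong totalRecordCount = new AtomicLong();

    // Called by each reading thread for the partition it owns.
    void processPartition(PartitionReader reader) {
        long localCount = 0;                     // thread-local, no shared cache line
        while (reader.hasMoreRecords()) {
            reader.readRecord();
            localCount++;                        // cheap plain increment
        }
        totalRecordCount.addAndGet(localCount);  // single contended update per partition
    }

    long getTotalRecordCount() {
        return totalRecordCount.get();
    }

    // Minimal stand-in interface so the sketch compiles; a real reader would wrap
    // the stream of records for one data partition.
    interface PartitionReader {
        boolean hasMoreRecords();
        Object readRecord();
    }
}
```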

For more details, see the PR here: https://github.com/hpcc-systems/hpcc4j/pull/756