Performance Notes
Atomics, being lockless, are generally very fast; incrementing an Atomic counter costs about the same as incrementing a plain integer when it is only accessed from a single core. The performance picture becomes more complicated, however, when more than one CPU core is using the same Atomic, or an Atomic that sits in the same cache line.
When multiple CPU cores modify the same Atomic, each core must first gain access to the cache line storing it, which often results in an L1 or L2 cache miss while that core waits for the cache line to be moved into its local cache from one of its sibling cores. A few L1 or L2 cache misses here and there won't affect performance much, but when a large number of threads all need the same cache line, your CPU cores can end up stalled, bouncing that cache line back and forth.
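To make the effect concrete, here is a minimal sketch (not from the HPCC4j codebase; the class and variable names are hypothetical) in which every thread increments the same shared AtomicLong. As the thread count grows, each increment has to pull the counter's cache line away from a sibling core, so total throughput scales far worse than the thread count would suggest:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical microbenchmark: all threads hammer one shared AtomicLong,
// so the cache line holding its value bounces between cores on every increment.
public class SharedCounterContention {
    private static final AtomicLong counter = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        final int iterationsPerThread = 10_000_000;
        for (int threads : new int[] {1, 2, 4, 8}) {
            counter.set(0);
            Thread[] workers = new Thread[threads];
            long start = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(() -> {
                    for (int j = 0; j < iterationsPerThread; j++) {
                        counter.incrementAndGet(); // contended update on the shared cache line
                    }
                });
                workers[i].start();
            }
            for (Thread w : workers) {
                w.join();
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%d thread(s): %d increments in %d ms%n",
                    threads, counter.get(), elapsedMs);
        }
    }
}
```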
This is exactly the scenario we recently ran into with the FileUtility in HPCC4j. We were incrementing an AtomicLong after reading each record, which didn't pose a problem while there were only a few threads. However, when using the FileUtility to debug a recent issue, we noticed a strange phenomenon: read performance looked normal until we increased the number of reading threads past a threshold, after which per-thread read performance fell off a cliff.
Based on insights from the OpenTelemetry Tracing we have recently been incorporating into HPCC4j, we could see that read performance decreased as more reading threads were started, and that the bottleneck occurred after the data was streamed from the remote servers, which indicated that the issue must be somewhere in the record processing code. Assuming adequate CPU cores are available, a bottleneck in the record processing code that gets worse with more threads is unexpected, because each record processing thread works independently and doesn't share resources. After looking through the record processing code in the HPCC4j FileUtility, we found the offending shared AtomicLong used to track record counts. Keeping record counts per thread and only updating the shared AtomicLong after each data partition resulted in a 6x speed increase in this scenario.
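The actual change is in the linked PR below; the sketch that follows only illustrates the general pattern with hypothetical names (readPartition, processRecord). Each reading thread tallies records in a plain local variable and touches the shared AtomicLong once per data partition, so the contended cache line is updated a handful of times instead of once per record:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the mitigation pattern (hypothetical names, not the HPCC4j code).
public class PerThreadRecordCounter {
    // shared across all reading threads
    private final AtomicLong totalRecordCount = new AtomicLong();

    // per-partition work loop run by each reading thread
    void readPartition(Iterable<byte[]> partition) {
        long localCount = 0; // thread-local tally, no cache-line sharing
        for (byte[] record : partition) {
            processRecord(record);
            localCount++; // cheap local increment
        }
        // one contended update per partition instead of one per record
        totalRecordCount.addAndGet(localCount);
    }

    void processRecord(byte[] record) {
        // record processing would happen here
    }

    long totalRecords() {
        return totalRecordCount.get();
    }
}
```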
See the PR that addressed the issue for more information: https://github.com/hpcc-systems/hpcc4j/pull/756