UCS: Introduce lightweight rwlock #10355

Open · wants to merge 1 commit into base: master

Conversation

@Artemy-Mellanox (Contributor) commented Dec 5, 2024:

[ RUN      ] test_rwlock.perf <> <>
[     INFO ] 8.22035 ms builtin 1 threads 1 writers per 256 
[     INFO ] 9.91718 ms builtin 1 threads 25 writers per 256 
[     INFO ] 15.6226 ms builtin 1 threads 128 writers per 256 
[     INFO ] 8.53102 ms builtin 1 threads 250 writers per 256 
[     INFO ] 14.8544 ms builtin 2 threads 1 writers per 256 
[     INFO ] 20.3778 ms builtin 2 threads 25 writers per 256 
[     INFO ] 25.2576 ms builtin 2 threads 128 writers per 256 
[     INFO ] 18.0413 ms builtin 2 threads 250 writers per 256 
[     INFO ] 15.9894 ms builtin 4 threads 1 writers per 256 
[     INFO ] 25.9471 ms builtin 4 threads 25 writers per 256 
[     INFO ] 30.6886 ms builtin 4 threads 128 writers per 256 
[     INFO ] 23.7099 ms builtin 4 threads 250 writers per 256 
[     INFO ] 109.585 ms builtin 64 threads 1 writers per 256 
[     INFO ] 498.465 ms builtin 64 threads 25 writers per 256 
[     INFO ] 981.667 ms builtin 64 threads 128 writers per 256 
[     INFO ] 891.406 ms builtin 64 threads 250 writers per 256 
[       OK ] test_rwlock.perf (2699 ms)
[ RUN      ] test_rwlock.pthread <> <>
[     INFO ] 15.7748 ms pthread 1 threads 1 writers per 256 
[     INFO ] 17.1273 ms pthread 1 threads 25 writers per 256 
[     INFO ] 26.9475 ms pthread 1 threads 128 writers per 256 
[     INFO ] 15.2211 ms pthread 1 threads 250 writers per 256 
[     INFO ] 29.4559 ms pthread 2 threads 1 writers per 256 
[     INFO ] 215.037 ms pthread 2 threads 25 writers per 256 
[     INFO ] 104.325 ms pthread 2 threads 128 writers per 256 
[     INFO ] 41.5187 ms pthread 2 threads 250 writers per 256 
[     INFO ] 35.4409 ms pthread 4 threads 1 writers per 256 
[     INFO ] 196.073 ms pthread 4 threads 25 writers per 256 
[     INFO ] 122.957 ms pthread 4 threads 128 writers per 256 
[     INFO ] 62.3138 ms pthread 4 threads 250 writers per 256 
[     INFO ] 73.8066 ms pthread 64 threads 1 writers per 256 
[     INFO ] 198.412 ms pthread 64 threads 25 writers per 256 
[     INFO ] 165.476 ms pthread 64 threads 128 writers per 256 
[     INFO ] 118.79 ms pthread 64 threads 250 writers per 256 
[       OK ] test_rwlock.pthread (1439 ms)

#ifndef UCS_HAS_CPU_RELAX
static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
{
sched_yield();
Contributor:

  1. sched_yield is very different from _mm_pause;
     maybe better to use some kind of builtin, or just do nothing?
  2. IMO it is better to define cpu_relax for each arch than to introduce the macro UCS_HAS_CPU_RELAX defined in the x86 header file (see the sketch below).
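
A minimal sketch of what a per-arch definition could look like (the inline-asm hints below are assumptions for illustration; only the generic sched_yield() fallback appears in the PR):

    /* Hypothetical per-arch variants, not part of the PR as posted */
    #if defined(__x86_64__)
    static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
    {
        __asm__ __volatile__("pause" ::: "memory"); /* same effect as _mm_pause() */
    }
    #elif defined(__aarch64__)
    static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
    {
        __asm__ __volatile__("yield" ::: "memory");
    }
    #else
    static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
    {
        /* no architecture-specific pause hint available; plain busy-wait */
    }
    #endif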

Comment on lines +30 to +33
#define UCS_RWLOCK_WAIT 0x1 /* Writer is waiting */
#define UCS_RWLOCK_WRITE 0x2 /* Writer has the lock */
#define UCS_RWLOCK_MASK (UCS_RWLOCK_WAIT | UCS_RWLOCK_WRITE)
#define UCS_RWLOCK_READ 0x4 /* Reader increment */
Contributor:

Can we define these as an enum and use explicit bit constants such as UCS_BIT, or "1 << 0", "1 << 1", etc.?
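
A sketch of the suggested form, assuming the existing UCS_BIT macro (the anonymous enum is illustrative; the values match the macros in the hunk above):

    enum {
        UCS_RWLOCK_WAIT  = UCS_BIT(0), /* Writer is waiting */
        UCS_RWLOCK_WRITE = UCS_BIT(1), /* Writer has the lock */
        UCS_RWLOCK_MASK  = UCS_RWLOCK_WAIT | UCS_RWLOCK_WRITE,
        UCS_RWLOCK_READ  = UCS_BIT(2)  /* Reader increment */
    };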

* Read-write lock.
*/
typedef struct {
volatile int l;
Contributor:

"l" is stub name, maybe "state", "flags" etc

{
int x;

while (1) {
Contributor:

for (;;)

ucs_cpu_relax();
}

x = __atomic_fetch_add(&lock->l, UCS_RWLOCK_READ, __ATOMIC_ACQUIRE);
Contributor:

maybe we can use the atomic operations defined in atomic.h?

Contributor:

Maybe we should replace the deprecated __sync* functions in atomic.h with the new __atomic variants?

if ((x < UCS_RWLOCK_WRITE) &&
(__atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) {
return 0;
Contributor:

return 1 on success and 0 on failure, same as ucs spinlock
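
A sketch of the suggested convention applied to the try-lock shown above (the function name and the initial load are assumptions; the point is returning 1 on success and 0 on failure):

    static inline int ucs_rwlock_write_trylock(ucs_rwlock_t *lock)
    {
        int x = __atomic_load_n(&lock->l, __ATOMIC_RELAXED);

        if ((x < UCS_RWLOCK_WRITE) &&
            __atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            return 1; /* write lock acquired */
        }

        return 0; /* a writer holds the lock or readers are active */
    }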

Comment on lines +164 to +166
w();
} else {
r();
Contributor:

w and r are stub names

Comment on lines +187 to +190
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};

Contributor:

  1. I think we also want to measure the overhead of read lock+unlock regardless of concurrency, since that is the reason we added the lightweight rwlock.
  2. Can you please post example output in the PR description?

Contributor (author):

  1. You mean a write percentage of 0?
  2. Didn't I do that?

sleep();
EXPECT_FALSE(write_taken); /* first read lock still holding lock */

int read_taken = 0;
Contributor:

bool


ucs_rwlock_read_lock(&lock); /* second read lock should pass */

int write_taken = 0;
Contributor:

bool

std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

void measure_one(int num, int writers, const std::function<void()> &r,
Contributor:

maybe introduce a typedef with using:
using run_func_t = std::function<void()>;

Contributor:

I also propose more explicit names (see the sketch after this list):
r -> reader
w -> writer
num -> thread_count
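
Combining the typedef and renaming suggestions, the helper's signature could read as follows (illustrative only; the body stays as in the PR):

    #include <functional>
    #include <string>

    using run_func_t = std::function<void()>;

    // illustrative renaming of the parameters discussed above
    void measure_one(int thread_count, int writers, const run_func_t &reader,
                     const run_func_t &writer, const std::string &name);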

void measure_one(int num, int writers, const std::function<void()> &r,
const std::function<void()> &w, const std::string &name)
{
std::vector<std::thread> tt;
Contributor:

minor: maybe name it threads

{
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};
Contributor:

maybe use percent to set the number of writers? It will be easier to understand

std::vector<int> writers_percent = {1, 25, 50, 75, 90};


UCS_TEST_F(test_rwlock, perf) {
ucs_rwlock_t lock = UCS_RWLOCK_STATIC_INITIALIZER;
measure(
Contributor:

This is a good performance test, but it must also change some state to check the lock's correctness; that is the whole purpose of a litmus test. For instance, these functions may perform some simple math and we check the invariant:

    // Invariant: counter2 is 2 times bigger than counter1
    int counter1 = 1;
    int counter2 = 2;
    measure(
            [&]() {
                ucs_rwlock_read_lock(&lock);
                UCS_ASSERT_EQ(counter1 * 2, counter2);
                ucs_rwlock_read_unlock(&lock);
            },
            [&]() {
                ucs_rwlock_write_lock(&lock);
                counter1 += counter1;
                counter2 += counter2;
                if (counter2 > 100000) {
                    counter1 = 1;
                    counter2 = 2;
                }
                ucs_rwlock_write_unlock(&lock);
            },

Contributor (author):

I'm not sure this is a good idea. This test doesn't guarantee lock correctness; it will very easily give a false negative result.

Contributor:

Well, testing the correctness of MT algorithms is a hard topic. You're right about false-negative results, and that's OK; that is the nature of litmus tests: when they pass, it does not guarantee that your algorithm is 100% correct, but when they fail, it is obviously broken. Personally, I catch tons of MT issues with the help of litmus tests.
If you can propose some other way of testing correctness, let's discuss it. What I'm proposing is a well-established industry practice, and I'm sure it's better to have it than no testing at all.

Moreover, the example that I provided just scratches the surface. We should also consider adding tests with nested locks, try-locks, etc.

UCS_TEST_F(test_rwlock, pthread) {
pthread_rwlock_t plock;
pthread_rwlock_init(&plock, NULL);
measure(
Contributor:

same here, update/verify some state


static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_READ, __ATOMIC_RELAXED);
Contributor:

I'm not sure that relaxed memory ordering is enough here (BTW, the litmus test should fail in this case).
I think we need __ATOMIC_RELEASE to ensure that memory operations performed before releasing the lock are completed.
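
A sketch of the suggested change (whether RELEASE is really required here is exactly what the rest of this thread debates):

    static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
    {
        /* RELEASE keeps the critical section's memory accesses from being
         * reordered past the unlock */
        __atomic_fetch_sub(&lock->l, UCS_RWLOCK_READ, __ATOMIC_RELEASE);
    }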

Contributor (author):

BTW, the litmus test doesn't detect it.

Contributor:

Great, so it can be RELAXED then.
Maybe we can also add a test case with a nested mutex; I still believe there might be an issue here when you have 2 unlocks one after another and the ordering must be preserved.

But I'm glad that you see the usefulness of the litmus test - you can judge whether a concurrent algorithm is correct based on it.

int x;

while (1) {
while (lock->l & UCS_RWLOCK_MASK) {
@iyastreb (Contributor) commented Dec 6, 2024:

I'm not sure why we read the atomic without memory-order guarantees; normally it should be something like

__atomic_load_n(&lock->l, __ATOMIC_RELAXED)

Same in other read cases
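
For reference, the wait loop from the hunk above rewritten with an explicit atomic load, as suggested:

    while (__atomic_load_n(&lock->l, __ATOMIC_RELAXED) & UCS_RWLOCK_MASK) {
        ucs_cpu_relax();
    }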

Contributor (author):

Why? AFAIU, relaxed memory order means no memory-order guarantees.

Contributor:

Well, my point here is to use a uniform API for the loads/stores of this variable, so it does not need to be volatile. Then we can specify an appropriate memory order, whether it's relaxed or acquire.
BTW, I also see a performance improvement after replacing volatile accesses with __atomic_load_n.

* Read-write lock.
*/
typedef struct {
volatile int l;
Contributor:

If we do all the loads with __atomic_load_n then volatile shouldn't be needed
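
In that case the field could be a plain int (a sketch; "state" follows the naming suggestion earlier in the review):

    typedef struct {
        int state; /* accessed only through __atomic builtins, no volatile needed */
    } ucs_rwlock_t;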


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Contributor:

I think __ATOMIC_RELEASE is usually used on all release paths to make sure that whatever happened under the lock is visible; any reason for not doing so in this PR?

}
};

UCS_TEST_F(test_rwlock, lock) {
Contributor:

Do we have a way to run this (or some) tests with maximal optimizations (maybe an inline compiler pragma, ...)? I suspect we end up running it without optimizations, which might affect correctness-related tests.



/**
* Read-write lock.
Contributor:

spinlock

#include <errno.h>

/**
* The ucs_rwlock_t type.
Contributor:

Suggestion: ucs_rw_spinlock_t, and likewise for all the APIs?

}

while (lock->l > UCS_RWLOCK_WAIT) {
ucs_cpu_relax();
Contributor:

Would we benefit from either using sched_yield(), or calling ucs_cpu_relax() X times before retrying to read lock->l? See the sketch below.
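
One possible shape of that suggestion, as a sketch (the helper name, the spin limit, and the combination with __atomic_load_n are assumptions):

    #include <sched.h>

    #define UCS_RWLOCK_SPIN_LIMIT 128 /* illustrative value */

    /* hypothetical wait helper: spin with a CPU hint, occasionally yield */
    static inline void ucs_rwlock_write_wait(ucs_rwlock_t *lock)
    {
        unsigned spins = 0;

        while (__atomic_load_n(&lock->l, __ATOMIC_RELAXED) > UCS_RWLOCK_WAIT) {
            if (++spins < UCS_RWLOCK_SPIN_LIMIT) {
                ucs_cpu_relax();
            } else {
                sched_yield();
                spins = 0;
            }
        }
    }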

}


static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
Contributor:

Add assertions/checks to detect underflow using the returned value, on debug builds only (and also overflow on the other path); see the sketch below.
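
A sketch of the read-unlock path with such a debug-only check, assuming the existing ucs_assert macro (compiled out in release builds):

    static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
    {
        int prev = __atomic_fetch_sub(&lock->l, UCS_RWLOCK_READ, __ATOMIC_RELAXED);

        /* debug-only: catch an unlock without a matching read lock */
        ucs_assert(prev >= UCS_RWLOCK_READ);
    }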


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Contributor:

detect underflow on debug builds?

* Read-write lock.
*/
typedef struct {
volatile int l;
Contributor:

use unsigned int to have defined behavior on overflow/underflow (and detect it)?

}


static inline void ucs_rwlock_write_lock(ucs_rwlock_t *lock)
Contributor:

maybe annotate for coverity with something like /* coverity[lock] */, to help it with sanity checks?

Contributor:

same for lock paths
