Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCS: Introduce lightweight rwlock #10355

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions src/ucs/Makefile.am
Original file line number Diff line number Diff line change
Expand Up @@ -135,6 +135,7 @@ noinst_HEADERS = \
time/timerq.h \
time/timer_wheel.h \
type/serialize.h \
type/rwlock.h \
type/float8.h \
async/async.h \
async/pipe.h \
Expand Down
8 changes: 8 additions & 0 deletions src/ucs/arch/cpu.h
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@

#include <ucs/sys/compiler_def.h>
#include <stddef.h>
#include <sched.h>

BEGIN_C_DECLS

Expand Down Expand Up @@ -176,6 +177,13 @@ static inline int ucs_cpu_prefer_relaxed_order()
const char *ucs_cpu_vendor_name();
const char *ucs_cpu_model_name();

#ifndef UCS_HAS_CPU_RELAX
static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
{
sched_yield();
yosefe marked this conversation as resolved.
Show resolved Hide resolved
}
#endif

END_C_DECLS

#endif
11 changes: 11 additions & 0 deletions src/ucs/arch/x86_64/cpu.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,9 @@
#ifdef __AVX__
# include <immintrin.h>
#endif
#ifdef __SSE2__
# include <emmintrin.h>
#endif

BEGIN_C_DECLS

Expand Down Expand Up @@ -132,6 +135,14 @@ ucs_memcpy_nontemporal(void *dst, const void *src, size_t len)
ucs_x86_memcpy_sse_movntdqa(dst, src, len);
}

#ifdef __SSE2__
static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
{
_mm_pause();
}
#define UCS_HAS_CPU_RELAX
#endif

END_C_DECLS

#endif
Expand Down
120 changes: 120 additions & 0 deletions src/ucs/type/rwlock.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
/*
* Copyright (c) NVIDIA CORPORATION & AFFILIATES, 2024. ALL RIGHTS RESERVED.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyrights would need 2025

*
* See file LICENSE for terms.
*/

#ifndef UCS_RWLOCK_H
#define UCS_RWLOCK_H

#include <ucs/arch/cpu.h>
#include <errno.h>

/**
* The ucs_rwlock_t type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: ucs_rw_spinlock_t and for all apis?

*
* Readers increment the counter by UCS_RWLOCK_READ (4)
* Writers set the UCS_RWLOCK_WRITE bit when lock is held
* and set the UCS_RWLOCK_WAIT bit while waiting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe detail something in the line of "the wait bit is meant for all subsequent read lock to let any write lock go first"

*
* 31 2 1 0
* +-------------------+-+-+
* | readers | | |
* +-------------------+-+-+
* ^ ^
* | |
* WRITE: lock held ----/ |
* WAIT: writer pending --/
*/

#define UCS_RWLOCK_WAIT 0x1 /* Writer is waiting */
#define UCS_RWLOCK_WRITE 0x2 /* Writer has the lock */
#define UCS_RWLOCK_MASK (UCS_RWLOCK_WAIT | UCS_RWLOCK_WRITE)
#define UCS_RWLOCK_READ 0x4 /* Reader increment */
yosefe marked this conversation as resolved.
Show resolved Hide resolved

#define UCS_RWLOCK_STATIC_INITIALIZER {0}


/**
* Read-write lock.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spinlock

*/
typedef struct {
volatile int l;
yosefe marked this conversation as resolved.
Show resolved Hide resolved
iyastreb marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use unsigned int to have defined behavior on overflow/underflow (and detect it)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't expect overflow by design so signed int is more suitable i this case IMO

} ucs_rwlock_t;


static inline void ucs_rwlock_read_lock(ucs_rwlock_t *lock)
yosefe marked this conversation as resolved.
Show resolved Hide resolved
{
int x;

while (1) {
yosefe marked this conversation as resolved.
Show resolved Hide resolved
while (lock->l & UCS_RWLOCK_MASK) {
Copy link
Contributor

@iyastreb iyastreb Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we read atomic without mem order guarantees, normally it should be something

__atomic_load_n(&lock->l, __ATOMIC_RELAXED)

Same in other read cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? AFAIU relaxed mem order means - no mem order guarantees

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, my point here is to use the uniform API to intercept loads/stores of this variable, so it does not need to be volatile. And then we can specify an appropriate mem order, whether it's relaxed or acquire.
Btw I also see performance improvement after replacing volatile with __atomic_loads

ucs_cpu_relax();
}

x = __atomic_fetch_add(&lock->l, UCS_RWLOCK_READ, __ATOMIC_ACQUIRE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can use the atomic operations defined in atomic.h?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the we replace deprecated __sync* fuctions from atomic.h with new __atomic variants?

if (!(x & UCS_RWLOCK_MASK)) {
return;
}

__atomic_fetch_sub(&lock->l, UCS_RWLOCK_READ, __ATOMIC_RELAXED);
}
}


static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
yosefe marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add assertions/checks to detect underflow using returned value, only on debug builds (and also overflow on the other path)

{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_READ, __ATOMIC_RELAXED);
iyastreb marked this conversation as resolved.
Show resolved Hide resolved
}


static inline void ucs_rwlock_write_lock(ucs_rwlock_t *lock)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe annotate for coverity with something like /* coverity[lock] */, to help it with sanity checks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for lock paths

{
int x;

while (1) {
x = lock->l;
if ((x < UCS_RWLOCK_WRITE) &&
(__atomic_compare_exchange_n(&lock->l, &x, UCS_RWLOCK_WRITE, 0,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) {
return;
}
yosefe marked this conversation as resolved.
Show resolved Hide resolved

if (!(x & UCS_RWLOCK_WAIT)) {
__atomic_fetch_or(&lock->l, UCS_RWLOCK_WAIT, __ATOMIC_RELAXED);
}

while (lock->l > UCS_RWLOCK_WAIT) {
ucs_cpu_relax();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we benefit from either using sched_yeld(), or X times ucs_cpu_relax() before retrying to read lock->l?

}
}
}


static inline int ucs_rwlock_write_trylock(ucs_rwlock_t *lock)
{
int x;

x = lock->l;
if ((x < UCS_RWLOCK_WRITE) &&
(__atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) {
return 0;
yosefe marked this conversation as resolved.
Show resolved Hide resolved
}

return -EBUSY;
}


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it is usually __ATOMIC_RELEASE to be used on all release paths to contain and be sure that what happened under lock is visible, any reason for not doing so in this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

detect underflow on debug builds?

}


static inline void ucs_rwlock_init(ucs_rwlock_t *lock)
{
lock->l = 0;
}

#endif
143 changes: 143 additions & 0 deletions test/gtest/ucs/test_type.cc
Original file line number Diff line number Diff line change
Expand Up @@ -11,9 +11,13 @@ extern "C" {
#include <ucs/type/serialize.h>
#include <ucs/type/status.h>
#include <ucs/type/float8.h>
#include <ucs/type/rwlock.h>
}

#include <time.h>
#include <thread>
#include <chrono>
#include <vector>

class test_type : public ucs::test {
};
Expand Down Expand Up @@ -138,6 +142,145 @@ UCS_TEST_F(test_type, pack_float) {
UCS_FP8_PACK_UNPACK(TEST_LATENCY, 200000000));
}

class test_rwlock : public ucs::test {
protected:
void sleep()
{
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

void measure_one(int num, int writers, const std::function<void()> &r,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe introduce a typedef with using:
using run_func_t = std::function<void()>;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also propose more explicit names:
r -> reader
w -> writer
num -> thread_count

const std::function<void()> &w, const std::string &name)
{
std::vector<std::thread> tt;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: maybe name it threads


tt.reserve(num);
auto start = std::chrono::high_resolution_clock::now();
for (int c = 0; c < num; c++) {
tt.emplace_back([&]() {
unsigned seed = time(0);
for (int i = 0; i < 1000000 / num; i++) {
if ((rand_r(&seed) % 256) < writers) {
w();
} else {
r();
yosefe marked this conversation as resolved.
Show resolved Hide resolved
}
}
});
}


for (auto &t : tt) {
t.join();
}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end - start;

UCS_TEST_MESSAGE << elapsed.count() * 1000 << " ms " << name << " "
<< std::to_string(num) << " threads "
<< std::to_string(writers) << " writers per 256 ";
}

void measure(const std::function<void()> &r,
const std::function<void()> &w, const std::string &name)
{
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use percent to set the number of writers? It will be easier to understand

std::vector<int> writers_percent = {1, 25, 50, 75, 90};


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. i think we also want to measure overhead of read lock+unlock regardless of concurrency since it is the reason we added lightweight rwlock
  2. can you pls post example output in the PR description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. you mean write percent 0?
  2. didn't i do that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yes ( isee you added it)
  2. yes, pls update with percent 0

for (auto t : threads) {
for (auto writers : writers_per_256) {
measure_one(t, writers, r, w, name);
}
}
}
};

UCS_TEST_F(test_rwlock, lock) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a way to run this/some tests with maximal optimizations (maybe inline compiler pragma, ..) ? I suspect we end-up running it without optimisations which might affect correctness-related tests?

ucs_rwlock_t lock = UCS_RWLOCK_STATIC_INITIALIZER;

ucs_rwlock_read_lock(&lock);
EXPECT_EQ(-EBUSY, ucs_rwlock_write_trylock(&lock));

ucs_rwlock_read_lock(&lock); /* second read lock should pass */

int write_taken = 0;
yosefe marked this conversation as resolved.
Show resolved Hide resolved
std::thread w([&]() {
ucs_rwlock_write_lock(&lock);
write_taken = 1;
ucs_rwlock_write_unlock(&lock);
});
sleep();
EXPECT_FALSE(write_taken); /* write lock should wait for read lock release */

ucs_rwlock_read_unlock(&lock);
sleep();
EXPECT_FALSE(write_taken); /* first read lock still holding lock */

int read_taken = 0;
yosefe marked this conversation as resolved.
Show resolved Hide resolved
std::thread r1([&]() {
ucs_rwlock_read_lock(&lock);
read_taken = 1;
ucs_rwlock_read_unlock(&lock);
});
sleep();
EXPECT_FALSE(read_taken); /* read lock should wait while write lock is waiting */

ucs_rwlock_read_unlock(&lock);
sleep();
EXPECT_TRUE(write_taken); /* write lock should be taken */
w.join();

sleep();
EXPECT_TRUE(read_taken); /* read lock should be taken */
r1.join();

EXPECT_EQ(0, ucs_rwlock_write_trylock(&lock));
read_taken = 0;
std::thread r2([&]() {
ucs_rwlock_read_lock(&lock);
read_taken = 1;
ucs_rwlock_read_unlock(&lock);
});
sleep();
EXPECT_FALSE(read_taken); /* read lock should wait for write lock release */

ucs_rwlock_write_unlock(&lock);
sleep();
EXPECT_TRUE(read_taken); /* read lock should be taken */
r2.join();
}

UCS_TEST_F(test_rwlock, perf) {
ucs_rwlock_t lock = UCS_RWLOCK_STATIC_INITIALIZER;
measure(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good performance test, but it must change some state to guarantee the lock correctness. This is the whole purpose of litmus test. For instance, these functions may perform some simple math calculations and we check the invariant:

    // Invariant: counter2 is 2 times bigger than counter1
    int counter1 = 1;
    int counter2 = 2;
    measure(
            [&]() {
                ucs_rwlock_read_lock(&lock);
                UCS_ASSERT_EQ(counter1 * 2, counter2);
                ucs_rwlock_read_unlock(&lock);
            },
            [&]() {
                ucs_rwlock_write_lock(&lock);
                counter1 += counter1;
                counter2 += counter2;
                if (counter2 > 100000) {
                    counter1 = 1;
                    counter2 = 2;
                }
                ucs_rwlock_write_unlock(&lock);
            },

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a good idea. This test doesn't guarantee lock correctness, it will very easily give a false negative result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, testing correctness of MT algorithms is hard topic. You're right about false negatives results, that's ok. That's actually a nature of litmus tests: when they pass, it does not guarantee that your algorithm is 100% correct, but when they fail - it's obviously broken. Personally I catch tons of MT issues with the help of litmus tests.
If you propose some other way of testing correctness - let's discuss that. What I'm proposing is well established industry practise, and I'm sure it's better to have it than no testing at all.

Moreover, the example that I provided is just scratching the surface. We should also consider adding tests with nested locks, try-locks, etc

[&]() {
ucs_rwlock_read_lock(&lock);
ucs_rwlock_read_unlock(&lock);
},
[&]() {
ucs_rwlock_write_lock(&lock);
ucs_rwlock_write_unlock(&lock);
},
"builtin");
}

UCS_TEST_F(test_rwlock, pthread) {
pthread_rwlock_t plock;
pthread_rwlock_init(&plock, NULL);
measure(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, update/verify some state

[&]() {
pthread_rwlock_rdlock(&plock);
pthread_rwlock_unlock(&plock);
},
[&]() {
pthread_rwlock_wrlock(&plock);
pthread_rwlock_unlock(&plock);
},
"pthread");
pthread_rwlock_destroy(&plock);
}

class test_init_once: public test_type {
protected:
test_init_once() : m_once(INIT_ONCE_INIT), m_count(0) {};
Expand Down
Loading