UCS: Introduce lightweight rwlock #10355

Open · wants to merge 1 commit into base: master

Conversation

@Artemy-Mellanox (Contributor) commented Dec 5, 2024:

[ RUN      ] test_rwlock.perf <> <>
[     INFO ] 8.22035 ms builtin 1 threads 1 writers per 256 
[     INFO ] 9.91718 ms builtin 1 threads 25 writers per 256 
[     INFO ] 15.6226 ms builtin 1 threads 128 writers per 256 
[     INFO ] 8.53102 ms builtin 1 threads 250 writers per 256 
[     INFO ] 14.8544 ms builtin 2 threads 1 writers per 256 
[     INFO ] 20.3778 ms builtin 2 threads 25 writers per 256 
[     INFO ] 25.2576 ms builtin 2 threads 128 writers per 256 
[     INFO ] 18.0413 ms builtin 2 threads 250 writers per 256 
[     INFO ] 15.9894 ms builtin 4 threads 1 writers per 256 
[     INFO ] 25.9471 ms builtin 4 threads 25 writers per 256 
[     INFO ] 30.6886 ms builtin 4 threads 128 writers per 256 
[     INFO ] 23.7099 ms builtin 4 threads 250 writers per 256 
[     INFO ] 109.585 ms builtin 64 threads 1 writers per 256 
[     INFO ] 498.465 ms builtin 64 threads 25 writers per 256 
[     INFO ] 981.667 ms builtin 64 threads 128 writers per 256 
[     INFO ] 891.406 ms builtin 64 threads 250 writers per 256 
[       OK ] test_rwlock.perf (2699 ms)
[ RUN      ] test_rwlock.pthread <> <>
[     INFO ] 15.7748 ms pthread 1 threads 1 writers per 256 
[     INFO ] 17.1273 ms pthread 1 threads 25 writers per 256 
[     INFO ] 26.9475 ms pthread 1 threads 128 writers per 256 
[     INFO ] 15.2211 ms pthread 1 threads 250 writers per 256 
[     INFO ] 29.4559 ms pthread 2 threads 1 writers per 256 
[     INFO ] 215.037 ms pthread 2 threads 25 writers per 256 
[     INFO ] 104.325 ms pthread 2 threads 128 writers per 256 
[     INFO ] 41.5187 ms pthread 2 threads 250 writers per 256 
[     INFO ] 35.4409 ms pthread 4 threads 1 writers per 256 
[     INFO ] 196.073 ms pthread 4 threads 25 writers per 256 
[     INFO ] 122.957 ms pthread 4 threads 128 writers per 256 
[     INFO ] 62.3138 ms pthread 4 threads 250 writers per 256 
[     INFO ] 73.8066 ms pthread 64 threads 1 writers per 256 
[     INFO ] 198.412 ms pthread 64 threads 25 writers per 256 
[     INFO ] 165.476 ms pthread 64 threads 128 writers per 256 
[     INFO ] 118.79 ms pthread 64 threads 250 writers per 256 
[       OK ] test_rwlock.pthread (1439 ms)

#ifndef UCS_HAS_CPU_RELAX
static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
{
sched_yield();
Contributor:

  1. sched_yield is very different from _mm_pause;
     maybe better to use some kind of builtin, or just do nothing?
  2. IMO it is better to define cpu_relax for each arch than to introduce the macro UCS_HAS_CPU_RELAX defined in the x86 header file (see the sketch below).
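
A minimal sketch of what a per-arch definition could look like (the inline-asm hints below are assumptions for illustration; only the generic sched_yield() fallback appears in the PR):

    /* Hypothetical per-arch variants, not part of the PR as posted */
    #if defined(__x86_64__)
    static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
    {
        __asm__ __volatile__("pause" ::: "memory"); /* same effect as _mm_pause() */
    }
    #elif defined(__aarch64__)
    static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
    {
        __asm__ __volatile__("yield" ::: "memory");
    }
    #else
    static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
    {
        /* no architecture-specific pause hint available; plain busy-wait */
    }
    #endif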

Comment on lines +30 to +33
#define UCS_RWLOCK_WAIT 0x1 /* Writer is waiting */
#define UCS_RWLOCK_WRITE 0x2 /* Writer has the lock */
#define UCS_RWLOCK_MASK (UCS_RWLOCK_WAIT | UCS_RWLOCK_WRITE)
#define UCS_RWLOCK_READ 0x4 /* Reader increment */
Contributor:

Can we define these as an enum and use explicit bit constants such as UCS_BIT, or "1 << 0", "1 << 1", etc.?
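
A sketch of the suggested form, assuming the existing UCS_BIT macro (the anonymous enum is illustrative; the values match the macros in the hunk above):

    enum {
        UCS_RWLOCK_WAIT  = UCS_BIT(0), /* Writer is waiting */
        UCS_RWLOCK_WRITE = UCS_BIT(1), /* Writer has the lock */
        UCS_RWLOCK_MASK  = UCS_RWLOCK_WAIT | UCS_RWLOCK_WRITE,
        UCS_RWLOCK_READ  = UCS_BIT(2)  /* Reader increment */
    };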

* Read-write lock.
*/
typedef struct {
volatile int l;
Contributor:

"l" is stub name, maybe "state", "flags" etc

{
int x;

while (1) {
Contributor:

for (;;)

ucs_cpu_relax();
}

x = __atomic_fetch_add(&lock->l, UCS_RWLOCK_READ, __ATOMIC_ACQUIRE);
Contributor:

maybe we can use the atomic operations defined in atomic.h?

Contributor:

Maybe we should replace the deprecated __sync* functions in atomic.h with the new __atomic variants?

if ((x < UCS_RWLOCK_WRITE) &&
(__atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) {
return 0;
Contributor:

return 1 on success and 0 on failure, same as ucs spinlock
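
A sketch of the suggested convention applied to the try-lock shown above (the function name and the initial load are assumptions; the point is returning 1 on success and 0 on failure):

    static inline int ucs_rwlock_write_trylock(ucs_rwlock_t *lock)
    {
        int x = __atomic_load_n(&lock->l, __ATOMIC_RELAXED);

        if ((x < UCS_RWLOCK_WRITE) &&
            __atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            return 1; /* write lock acquired */
        }

        return 0; /* a writer holds the lock or readers are active */
    }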

Comment on lines +164 to +166
w();
} else {
r();
Contributor:

w and r are stub names

Comment on lines +187 to +190
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};

Contributor:

  1. I think we also want to measure the overhead of read lock+unlock regardless of concurrency, since that is the reason we added the lightweight rwlock.
  2. Can you please post example output in the PR description?

Contributor (author):

  1. You mean a write percentage of 0?
  2. Didn't I do that?

sleep();
EXPECT_FALSE(write_taken); /* first read lock still holding lock */

int read_taken = 0;
Contributor:

bool


ucs_rwlock_read_lock(&lock); /* second read lock should pass */

int write_taken = 0;
Contributor:

bool

std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

void measure_one(int num, int writers, const std::function<void()> &r,
Contributor:

maybe introduce a typedef with using:
using run_func_t = std::function<void()>;

Contributor:

I also propose more explicit names (see the sketch after this list):
r -> reader
w -> writer
num -> thread_count
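
Combining the typedef and renaming suggestions, the helper's signature could read as follows (illustrative only; the body stays as in the PR):

    #include <functional>
    #include <string>

    using run_func_t = std::function<void()>;

    // illustrative renaming of the parameters discussed above
    void measure_one(int thread_count, int writers, const run_func_t &reader,
                     const run_func_t &writer, const std::string &name);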

void measure_one(int num, int writers, const std::function<void()> &r,
const std::function<void()> &w, const std::string &name)
{
std::vector<std::thread> tt;
Contributor:

minor: maybe name it threads

{
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};
Contributor:

maybe use percent to set the number of writers? It will be easier to understand

std::vector<int> writers_percent = {1, 25, 50, 75, 90};


UCS_TEST_F(test_rwlock, perf) {
ucs_rwlock_t lock = UCS_RWLOCK_STATIC_INITIALIZER;
measure(
Contributor:

This is a good performance test, but it must also change some state to check the lock's correctness; that is the whole purpose of a litmus test. For instance, these functions may perform some simple math and we check the invariant:

    // Invariant: counter2 is 2 times bigger than counter1
    int counter1 = 1;
    int counter2 = 2;
    measure(
            [&]() {
                ucs_rwlock_read_lock(&lock);
                UCS_ASSERT_EQ(counter1 * 2, counter2);
                ucs_rwlock_read_unlock(&lock);
            },
            [&]() {
                ucs_rwlock_write_lock(&lock);
                counter1 += counter1;
                counter2 += counter2;
                if (counter2 > 100000) {
                    counter1 = 1;
                    counter2 = 2;
                }
                ucs_rwlock_write_unlock(&lock);
            },

Contributor (author):

I'm not sure this is a good idea. This test doesn't guarantee lock correctness; it will very easily give a false negative result.

Contributor:

Well, testing the correctness of MT algorithms is a hard topic. You're right about false-negative results, and that's OK; that is the nature of litmus tests: when they pass, it does not guarantee that your algorithm is 100% correct, but when they fail, it is obviously broken. Personally, I catch tons of MT issues with the help of litmus tests.
If you can propose some other way of testing correctness, let's discuss it. What I'm proposing is a well-established industry practice, and I'm sure it's better to have it than no testing at all.

Moreover, the example that I provided just scratches the surface. We should also consider adding tests with nested locks, try-locks, etc.

UCS_TEST_F(test_rwlock, pthread) {
pthread_rwlock_t plock;
pthread_rwlock_init(&plock, NULL);
measure(
Contributor:

same here, update/verify some state


static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_READ, __ATOMIC_RELAXED);
Contributor:

I'm not sure that relaxed memory ordering is enough here (BTW, the litmus test should fail in this case).
I think we need __ATOMIC_RELEASE to ensure that memory operations performed before releasing the lock are completed.
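
A sketch of the suggested change (whether RELEASE is really required here is exactly what the rest of this thread debates):

    static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
    {
        /* RELEASE keeps the critical section's memory accesses from being
         * reordered past the unlock */
        __atomic_fetch_sub(&lock->l, UCS_RWLOCK_READ, __ATOMIC_RELEASE);
    }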

Contributor (author):

BTW, the litmus test doesn't detect it.

Contributor:

Great, so it can be RELAXED then.
Maybe we can also add a test case with a nested mutex; I still believe there might be an issue here when you have 2 unlocks one after another and the ordering must be preserved.

But I'm glad that you see the usefulness of the litmus test - you can judge whether a concurrent algorithm is correct based on it.

int x;

while (1) {
while (lock->l & UCS_RWLOCK_MASK) {
@iyastreb (Contributor) commented Dec 6, 2024:

I'm not sure why we read the atomic without memory-order guarantees; normally it should be something like

__atomic_load_n(&lock->l, __ATOMIC_RELAXED)

Same in other read cases
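
For reference, the wait loop from the hunk above rewritten with an explicit atomic load, as suggested:

    while (__atomic_load_n(&lock->l, __ATOMIC_RELAXED) & UCS_RWLOCK_MASK) {
        ucs_cpu_relax();
    }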

Contributor (author):

Why? AFAIU, relaxed memory order means no memory-order guarantees.

Contributor:

Well, my point here is to use a uniform API for the loads/stores of this variable, so it does not need to be volatile. Then we can specify an appropriate memory order, whether it's relaxed or acquire.
BTW, I also see a performance improvement after replacing volatile accesses with __atomic_load_n.

* Read-write lock.
*/
typedef struct {
volatile int l;
Contributor:

If we do all the loads with __atomic_load_n then volatile shouldn't be needed
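
In that case the field could be a plain int (a sketch; "state" follows the naming suggestion earlier in the review):

    typedef struct {
        int state; /* accessed only through __atomic builtins, no volatile needed */
    } ucs_rwlock_t;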


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Contributor:

I think __ATOMIC_RELEASE is usually used on all release paths to make sure that whatever happened under the lock is visible; any reason for not doing so in this PR?

}
};

UCS_TEST_F(test_rwlock, lock) {
Contributor:

Do we have a way to run this (or some) tests with maximal optimizations (maybe an inline compiler pragma, ...)? I suspect we end up running it without optimizations, which might affect correctness-related tests.



/**
* Read-write lock.
Contributor:

spinlock

#include <errno.h>

/**
* The ucs_rwlock_t type.
Contributor:

Suggestion: ucs_rw_spinlock_t, and likewise for all the APIs?

}

while (lock->l > UCS_RWLOCK_WAIT) {
ucs_cpu_relax();
Contributor:

Would we benefit from either using sched_yield(), or calling ucs_cpu_relax() X times before retrying to read lock->l? See the sketch below.
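
One possible shape of that suggestion, as a sketch (the helper name, the spin limit, and the combination with __atomic_load_n are assumptions):

    #include <sched.h>

    #define UCS_RWLOCK_SPIN_LIMIT 128 /* illustrative value */

    /* hypothetical wait helper: spin with a CPU hint, occasionally yield */
    static inline void ucs_rwlock_write_wait(ucs_rwlock_t *lock)
    {
        unsigned spins = 0;

        while (__atomic_load_n(&lock->l, __ATOMIC_RELAXED) > UCS_RWLOCK_WAIT) {
            if (++spins < UCS_RWLOCK_SPIN_LIMIT) {
                ucs_cpu_relax();
            } else {
                sched_yield();
                spins = 0;
            }
        }
    }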

}


static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
Contributor:

Add assertions/checks to detect underflow using the returned value, on debug builds only (and also overflow on the other path); see the sketch below.
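
A sketch of the read-unlock path with such a debug-only check, assuming the existing ucs_assert macro (compiled out in release builds):

    static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
    {
        int prev = __atomic_fetch_sub(&lock->l, UCS_RWLOCK_READ, __ATOMIC_RELAXED);

        /* debug-only: catch an unlock without a matching read lock */
        ucs_assert(prev >= UCS_RWLOCK_READ);
    }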


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Contributor:

detect underflow on debug builds?

* Read-write lock.
*/
typedef struct {
volatile int l;
Contributor:

use unsigned int to have defined behavior on overflow/underflow (and detect it)?

}


static inline void ucs_rwlock_write_lock(ucs_rwlock_t *lock)
Contributor:

maybe annotate for coverity with something like /* coverity[lock] */, to help it with sanity checks?

Contributor:

same for lock paths
