Reliable storage lock #2014
Conversation
Force-pushed from 776bc2d to e140bca.
Evidence of real s3 storage tests passing here.
Force-pushed from 3add5e2 to 7f4fac4.
cpp/arcticdb/entity/key.cpp (Outdated)
@@ -60,6 +60,7 @@ KeyData get_key_data(KeyType key_type) {
     STRING_REF(KeyType::APPEND_REF, aref, 'a')
     STRING_KEY(KeyType::MULTI_KEY, mref, 'm')
     STRING_REF(KeyType::LOCK, lref, 'x')
+    STRING_REF(KeyType::SLOW_LOCK, lref, 'x')
This clashes with the character and prefix for LOCK above, which is very odd. Is this intentional? Doesn't it mean we'd list out both sorts of lock when we iterate the type?
They all have to be unique, otherwise bad things will happen. Also, I'm not sure why we're characterising this as a slow lock, since if anything it will be faster than the other lock, which required variable (and ultimately unknowable) wait periods to ensure that it really was the holder of the lock.
My preference would be that the eventual lock key (the copy destination) is just a normal KeyType::LOCK, and that the one that gets copied to it (the first one written) is called a PENDING_LOCK or a LOCK_ATTEMPT or something like that.
Agreed, renaming it to ATOMIC_LOCK (with alref and 'A') as it relies on storage-specific atomic operations. I will eventually replace it with the If-Match atomic AWS S3 operations, which would make it more meaningful.
I thought the letter here is just for users to easily identify key types from their name. Listing works by putting all keys from the same key type in a directory (e.g. "tdata"), I think, so I don't think this would cause issues.
If this is indeed an issue (notice that both the existing TOMBSTONE and LOCK use the same 'x'), we should fix them as well.
@@ -57,7 +57,8 @@ class S3ClientWrapper {
     virtual S3Result<std::monostate> put_object(
         const std::string& s3_object_name,
         Segment&& segment,
-        const std::string& bucket_name) = 0;
+        const std::string& bucket_name,
+        bool if_none_match = false) = 0;
I hate bools on public APIs. Could we have two different public methods without the bool that push down to a single private method with the bool, or something?
Also, bools tend to multiply out of control; it's preferable to use an enum, so that callers have to put a name on them (WriteConditions::IF_NONE_MATCH is far more self-explanatory at the call site than 'true').
Thanks, will use a PutHeader::IF_NONE_MATCH enum.
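For illustration, a minimal sketch of how the enum-based signature could look, assuming the PutHeader name from the reply above (S3Result and Segment are the codebase types from the diff; nothing here is the final API):

enum class PutHeader { NONE, IF_NONE_MATCH };

// Sketch of the wrapper interface with a named condition; the call site now
// reads, e.g.:
//   client->put_object(name, std::move(segment), bucket, PutHeader::IF_NONE_MATCH);
class S3ClientWrapper {
public:
    virtual S3Result<std::monostate> put_object(
            const std::string& s3_object_name,
            Segment&& segment,
            const std::string& bucket_name,
            PutHeader header = PutHeader::NONE) = 0;
};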
@@ -83,7 +83,8 @@ S3Result<Segment> MockS3Client::get_object(
 S3Result<std::monostate> MockS3Client::put_object(
     const std::string &s3_object_name,
     Segment &&segment,
-    const std::string &bucket_name) {
+    const std::string &bucket_name,
+    bool if_none_match[[maybe_unused]]) {
We should extend the mock client so it can give us precondition-failed response codes, shouldn't we?
Ah, agreed, I'll fix that. We didn't use it for tests, but thinking about it more, it might have been better to test with a MockS3Client rather than with the InMemoryStore, though it would require some extra work to protect the in-memory nature of it with some mutexes.
Will add this to a combined ticket for reworking the lock to use If-Match.
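For illustration, a self-contained sketch of the precondition behaviour the mock could emulate; FakeObjectStore and all names here are hypothetical stand-ins, not the real MockS3Client API:

#include <map>
#include <mutex>
#include <string>

struct FakeObjectStore {
    enum class PutResult { OK, PRECONDITION_FAILED };

    PutResult put(const std::string& key, std::string value, bool if_none_match) {
        std::lock_guard<std::mutex> guard(mutex_);  // the mutex the comment above anticipates
        if (if_none_match && objects_.count(key) > 0) {
            // Mirror AWS behaviour: a conditional put on an existing key fails with 412.
            return PutResult::PRECONDITION_FAILED;
        }
        objects_[key] = std::move(value);
        return PutResult::OK;
    }

private:
    std::mutex mutex_;
    std::map<std::string, std::string> objects_;
};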
@@ -14,4 +16,11 @@ struct PilotedClock {
     }
 };

 struct PilotedClockNoAutoIncrement {
I think there's already a clock type in the codebase that works like this: ManualClock?
@@ -1,3 +1,5 @@
 #pragma once
Yuck
// provide a slower but more reliable lock than the StorageLock. It should be completely consistent unless a process
// holding a lock get's paused for times comparable to the lock timeout.
// It lock follows the algorithm described here:
// https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/
Maybe we should save a copy of this blog somewhere; it will suck to have no docs of how this works if it vanishes.
Given that we'll soon replace the locking with If-Match based locks, we can probably skip this.
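For reference while the blog is still the only write-up, here is a rough sketch of the acquisition step as this thread describes it; the helper functions are hypothetical stand-ins for the real storage operations:

#include <cstdint>
#include <optional>

uint64_t max_existing_lock_epoch();      // ListObjects over the lock key prefix
bool lock_is_unexpired(uint64_t epoch);  // compare the timestamp in the lock segment to the timeout
bool put_if_none_match(uint64_t epoch);  // atomic If-None-Match PUT of the next lock key

// Returns the acquired epoch on success, std::nullopt otherwise.
std::optional<uint64_t> try_take_lock_sketch() {
    const uint64_t latest = max_existing_lock_epoch();
    if (lock_is_unexpired(latest)) {
        return std::nullopt;  // another process still holds the lock
    }
    // The atomic step: If-None-Match guarantees only one writer can create the
    // key for epoch latest + 1; every other contender gets 412 Precondition Failed.
    if (put_if_none_match(latest + 1)) {
        return latest + 1;
    }
    return std::nullopt;  // lost the race
}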
Force-pushed from 3b41a46 to 614d8c2.
// The ReliableStorageLock is a storage lock which relies on atomic If-None-Match Put and ListObject operations to
// provide a slower but more reliable lock than the StorageLock. It should be completely consistent unless a process
// holding a lock get's paused for times comparable to the lock timeout.
get's should be gets
}

#include "arcticdb/util/reliable_storage_lock.tpp"
We normally put all the templated definitions straight in the .hpp file. It's an arbitrary choice, so I think we should be consistent with the rest of the codebase here.
The other alternative is to put them in reliable_storage_lock-inl.hpp, which is how we usually indicate that this really is implementation, just templated implementation. You can look at any of the other -inl.hpp files for the usual mechanism.
As in, normally we include them at the bottom of the header file and ensure that they're not included anywhere else; that way you get file separation, but you don't need to know about the additional file in your cpp files, you just include the header as normal.
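To make the mechanism concrete, a sketch of the -inl.hpp convention being described (the file layout and the guard macro name are illustrative, not copied from the codebase):

// --- reliable_storage_lock.hpp ---
#pragma once

template <class ClockType>
class ReliableStorageLock {
public:
    void extend_lock();  // declared here, defined in the -inl.hpp
};

#define ARCTICDB_RELIABLE_STORAGE_LOCK_H_
#include "arcticdb/util/reliable_storage_lock-inl.hpp"  // the only include site

// --- reliable_storage_lock-inl.hpp ---
#ifndef ARCTICDB_RELIABLE_STORAGE_LOCK_H_
#error "Include reliable_storage_lock.hpp instead of this file directly"
#endif

template <class ClockType>
void ReliableStorageLock<ClockType>::extend_lock() {
    // ... templated implementation lives here ...
}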
template <class ClockType>
ReliableStorageLock<ClockType>::ReliableStorageLock(const std::string &base_name, const std::shared_ptr<Store> store, timestamp timeout) :
    base_name_(base_name), store_(store), timeout_(timeout) {
    auto s3_timeout = ConfigsMap::instance()->get_int("S3Storage.RequestTimeoutMs", 200000) * ONE_MILLISECOND;
It seems quite weird to have references to S3 in this layer of the code. Also, didn't we discover some quite odd properties of the S3 request timeout, in the sense that it does not prevent S3 requests taking longer than the value set, but instead defines a window of time in which we expect a certain number of bytes to be transferred?
Yeah, we'll want to use it for Azure etc., which already has similar operations. It would be better to pass it in as a policy from the storage, where we just expect a void() function that does whatever we need.
Agreed, it would be best to pass some sort of MaxTimeoutPolicy which allows us to configure the timeout of the storage. And yeah, the RequestTimeout was indeed some period of time to determine a threshold for transfer speed.
I've just removed the check for now and will add to the follow-up ticket the ability to configure and enforce a certain maximum timeout for all stores.
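A sketch of the suggestion, purely illustrative (MaxTimeoutPolicy and its field are hypothetical names; timestamp follows the nanosecond convention used elsewhere in this thread):

#include <cstdint>
#include <functional>

using timestamp = std::int64_t;  // nanoseconds

// The storage layer (S3, Azure, ...) supplies its own bound, so the lock
// never has to name a specific backend or read its config keys.
struct MaxTimeoutPolicy {
    std::function<timestamp()> max_request_timeout;
};

// The constructor could then validate timeout_ against policy.max_request_timeout()
// instead of reading "S3Storage.RequestTimeoutMs" at this layer.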
namespace lock {

using Epoch = uint64_t;
I know it's the nomenclature that the blog used, but I find this Epoch name really confusing. I think most people would expect it to be some sort of timestamp, but it's a counter (and then we have timestamps in the segment itself, to add to the confusion).
Thanks, used AcquiredLockId
        return;
    }
    auto lock_stream_id = get_stream_id(held_lock_epoch);
    auto expiration = ClockType::nanos_since_epoch(); // Write current time to mark lock as expired as of now
Managing the expiry with this process's clock isn't right, is it? The blog uses the last modification time on S3. This is probably something we can improve afterwards.
Yeah, you're right. The difference is not big though: just 2x the clock skew (if we e.g. expect clock skew to be 100ms for a process, this would entail a 200ms total skew between the two processes, whereas if we use the S3 time as the source of truth we would suffer only the 100ms skew).
Let's do this together with the If-Match rework (which will also require passing the ETag from S3 into the Store API, and we can make a similar change to use the last-modified time).
I think that's reasonable on the basis that clock drift can still cause locking failures even if we make the change I suggested.
Clock::time_ = 10;
ASSERT_EQ(lock2.try_take_lock(), std::nullopt);
Clock::time_ = 19;
ASSERT_EQ(lock1.try_take_lock(), std::nullopt);
The std::nullopt assertions should all be for both lock1 and lock2, right? We should document that the lock API is not re-entrant.
Thanks, agreed, done.
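A sketch of the agreed assertions, reusing lock1, lock2 and the piloted Clock from the test fragment above (the exact time values are illustrative):

Clock::time_ = 10;
ASSERT_EQ(lock1.try_take_lock(), std::nullopt);  // the holder can't re-acquire: the API is not re-entrant
ASSERT_EQ(lock2.try_take_lock(), std::nullopt);
Clock::time_ = 19;
ASSERT_EQ(lock1.try_take_lock(), std::nullopt);
ASSERT_EQ(lock2.try_take_lock(), std::nullopt);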
    that->lock_lost_ = true;
});
auto value_before_sleep = cnt_;
// std::cout<<"Taken a lock with "<<value_before_sleep<<std::endl;
Remove the print statements
cpp/vcpkg.json (Outdated)
{
    "name": "aws-sdk-cpp",
    "$version reason": "Minimum version in the baseline that works with aws-c-io above.",
    "version>=": "1.11.405",
Why isn't the version down in the overrides section like all the other packages?
I needed to enforce fairly new versions of all aws packages, because otherwise newer versions of aws-sdk-cpp would not compile. To do this I used version>=, so in the future, if aws packages fix their vcpkg dependencies, we would not have to update all aws overrides when we want to upgrade the aws-sdk-cpp.
I think if we use an override we don't allow vcpkg to resolve dependency versions for us, but given it's currently broken either way, I'll just add an override only for aws-sdk-cpp. I'll leave the other aws packages with version>= so we don't have to upgrade all the overrides when we only want to upgrade the aws-sdk-cpp.
TEST(ReliableStorageLock, StressMultiThreaded) {
    // It is hard to use a piloted clock for these tests because the folly::FunctionScheduler we use for the lock
    // extensions doesn't support a custom clock. Thus this test will need to run for about 2 minutes.
It's useful to keep test_unit_arcticdb running fast. Might be better to add this stress test to a new cmake target. Can be done as a follow-up.
I agree it's useful to separate slower stress tests. I remember other stress tests, e.g. some of the tests in test_slab_alloc.cpp, were also quite slow; we should probably move them together.
Adding a note to fix in the If-Match cleanup.
read_df = lib.read(symbol).data
expected_df = pd.DataFrame({"col": [num_processes]})
assert_frame_equal(read_df, expected_df)
Also assert on the version number after all these writes?
I've noticed when running these that the persistent storage might not get cleaned up after a previous test run, especially when running locally. So I think it might be possible to do these writes on a library which already has the counter written with some arbitrary version, i.e. it's possible we get a version 200 because the test was run before.
I think the counter being equal to 100 is essentially the same assertion as writing 100 versions, so there isn't much benefit and it makes local testing slightly more annoying. This is fine for the CI, I think. What do you think?
Would it be simple to do a force delete of the symbol at the start of the test?
Ah, didn't know about the force_delete_symbol API. Done, thanks!
const auto EXTENDS_PER_TIMEOUT = 5u;

inline StreamDescriptor lock_stream_descriptor(const StreamId &stream_id) {
    return StreamDescriptor{stream_descriptor(
I don't think it needs the other constructor, as the function already returns a stream descriptor.
}

template <class ClockType>
StreamId ReliableStorageLock<ClockType>::get_stream_id(Epoch e) const {
e is commonly used for error, no harm in calling it 'epoch'
Used lock_id, because I renamed Epoch to AcquiredLockId.
Force-pushed from 614d8c2 to b647dd1.
Some of the discussed follow-up fixes will be done when addressing #2028.
Force-pushed from e21f7a9 to 1bc9ddb.
Force-pushed from 6843bd9 to dc7204a.
cpp/CMakeLists.txt (Outdated)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wextra")
# Required to be able to include headers from glog since glog 0.7
# See: https://github.com/google/glog/pull/1030
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wextra -DGLOG_USE_GLOG_EXPORT")
You don't need to change this, but setting CMAKE_CXX_FLAGS is an outdated approach. Unfortunately most of our code is set up like this :(
The correct thing to do is to use target_compile_definitions.
Force-pushed from d7ed925 to 39b460d.
Force-pushed from ea698e2 to 92c203a.
Introduces a new `ReliableStorageLock` and `ReliableStorageLockGuard` to be used as a slower but more reliable alternative to the existing `StorageLock`. It uses the new If-None-Match atomic put operations in S3. This commit:
- Upgrades the aws-sdk-cpp in vcpkg (which needed a few additions because of some problematic dependencies)
- Adds `write_if_none_match` capability to `AsyncStore`'s S3 and to `InMemoryStore`
- Logic for `ReliableStorageLock`
- C++ tests using the `InMemoryStore`
A follow-up commit will introduce a python integration test with real aws s3.
Adds a real s3 storage test (currently to be run with the persistent storage tests mark) for the lock.
Currently all backends are unsupported apart from S3 (for which only some providers like AWS support it). Unfortunately it's impossible to differentiate between aws and e.g. vast backends apart from looking at the endpoint, which can be subject to rerouting etc. This commit also reworks the Guard to work only with acquired locks.
- Addressing review comments on ReliableStorageLock PR
- Mostly renames
- Longer timeout before lock cleanup
- Fixes mamba dependencies
This will be useful if we want to construct a ReliableStorageLockGuard but decide later what to do in case a lock is lost. Also changes ReliableStorageLockGuard to hold a copy of the lock instead of just a reference. The lock is a cheap object to copy, so it makes it easier to avoid errors where the lock is destructed before the guard for it.
Because aws-sdk-cpp transforms PreconditionFailed errors into S3Error::Unknown, we do string munging on the exception name to identify it. Also fixes some block version ref key conflicts.
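A hedged sketch of the error-name matching that commit message describes (is_precondition_failed is an illustrative helper; GetErrorType and GetExceptionName are standard aws-sdk-cpp AWSError accessors):

#include <aws/s3/S3Errors.h>

// aws-sdk-cpp maps the 412 PreconditionFailed response onto S3Errors::UNKNOWN,
// so we fall back to matching the exception name carried on the error object.
inline bool is_precondition_failed(const Aws::S3::S3Error& error) {
    return error.GetErrorType() == Aws::S3::S3Errors::UNKNOWN
        && error.GetExceptionName() == "PreconditionFailed";
}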
Force-pushed from 92c203a to a025892.
What does this implement or fix?
Introduces a reliable storage lock.
Introduces a new ReliableStorageLock and ReliableStorageLockGuard to be used as a slower but more reliable alternative to the existing StorageLock. It uses the new If-None-Match atomic put operations in S3.
First commit (Reliable storage lock):
- Upgrades the aws-sdk-cpp in vcpkg (which needed a few additions because of some problematic dependencies)
- Adds write_if_none_match capability to AsyncStore's S3 and to InMemoryStore
- Logic for ReliableStorageLock
- C++ tests using the InMemoryStore
Second commit (Real S3 storage python tests for ReliableStorageLock):
- Adds a ReliableStorageLockManager which exposes functions to acquire and free the lock in python. (The guard structure is unusable for python.)
- Adds a real S3 storage test using the real_s3_version_store fixture.
Any other comments?