High level Rust API for Golem's transactional host functions #9

vigoo · 2024-04-10T16:27:51Z

afsalthaj · 2024-04-11T03:23:37Z

README.md


-This crate contains couple of Rust macros that facilitate writing Golem Cloud backends in Rust:
-1. Derives `From<>` and `Into<>` typeclasses between wit-bindgen derived data types and custom domain model data types.
-2. Generates wit file from rust code.


Just out of curiosity. So this PR is removing these features and just exposing transactional host functions? Or is it a mix of different things that help users write golem backend?

Ok, i think I got the answer

Just moved the macro into golem-rust-macro.

vigoo · 2024-04-11T16:09:20Z

Possible improvement: #10

jdegoes · 2024-04-11T17:23:53Z

golem-rust/src/transaction.rs

+///
+/// Implement this trait and use it within a `transaction` block.
+/// Operations can also be constructed from closures using `operation`.
+pub trait Operation<In, Out, Err>: Clone {


This is a very data-oriented API, which makes it more difficult for the user to interleave other things, such as logging, whose behavior does not need to be compensated in the event of a rollback.

I wonder if we could achieve a less-invasive API by executing the actions immediately, and only storing the rollbacks.

Let's imagine two possible APIs:

transaction(move |tx| { // action is executed now, but compensation is stored in tx: tx.execute(|| { ... }, || { ... }); });

Or:

let txn = Transaction::new(); txn.begin_op(); // Arbitrary rust code: do_something(); txn.end_op(|| { my_compensation(); });

With type classes and macros, we could improve upon this a lot:

pub trait Operation { type Out; type Err; fn execute(&self) -> Result<Out, Err>; fn compensate(&self, out: Out) -> Result<(), Err>; } #[golem_operation] #[golem_compensation(my_compensation)] pub fn my_operation() -> Result<bool, Error> { } pub fn my_compensation(b: bool) -> Result<(), Error> { } ... log("some logging stuff here"); txn.execute(my_operation); log("some logging stuff there");

You don't need to separate the operation definition from the execution in my version either:

tx.add(operation::<String, (), ()>( move |input: String| { log1.borrow_mut().push(format!("op1 execute {input}")); Ok(()) }, move |input: String| { log2.borrow_mut().push(format!("op1 rollback {input}")); }, "input".to_string() ));

Of course it can be simplified like the execute above, but we can only do that with this version and not my proposed "advanced one" (in #10)
Note that the reason I added an input is also this advanced proposal. If we don't want to do that, there is no need to pass an input to the operation and it can just be two closures.

The begin_op/end_op version cannot catch the errors happening in between, so there is no way to perform the rollback.

The macro to generate Operation instances is probably a good idea (although a bit magical to me :) and seems easy to do.

What I mean by separate is that you cannot put code around the execution of the operation, because you are building a data structure, which will be executed later. It would be nice to "execute now" the operation, and keep track of the compensation, because it allows arbitrary code before and after the compensated operation.

Ah I think I just named it confusingly. My add immediately executes the operation, and you can definitely have code around it. That's the whole reason transaction get's a f: impl FnOnce(&mut Transaction) -> Out and not a list of operations.

I did not know this. Maybe we can rename to execute?

Let's create a ticket for the macro for operation. I think it can become bearable syntax with the macro, added to top-level functions.

Added a first version of the macro.

I'm going to extend Operation and the macro as well to have a way to write compensations that require the output of the action - with the current one it only gets the input, which is the only thing we can do when we want to run compensations on non-domain failures too. But for the simpler saga pattern it is easier if you don't need to derive the compensation from the input (and probably it is not even possible in many cases) so I am going to support both.

jdegoes · 2024-04-11T17:36:13Z

golem-rust/src/transaction.rs

+    }
+}
+
+pub fn transaction<Out>(f: impl FnOnce(&mut Transaction) -> Out) -> Out {


We have to carefully think through what we want the user API to be here.

I think the requirements are something like:

The transaction is attempted.

If there is an error, rollback is attempted.

If the transaction succeeds, then the user gets back the successful result of the transaction.

If the transaction fails, due to an error, then the user gets back the failure of the transaction, along with knowledge about whether or not all compensation actions succeeded.

So in my mind, our transaction function looks something like:

pub fn transaction<Out>(f: impl FnOnce(&mut Transaction) -> Out) -> TransactionResult<Out> { ... } enum TransactionResult<Out> { Success(Out), FailedWithFullRollback, FailedWithPartialRollback(<error>), }

Moreover, I am not sure we need set_oplog_index if we are using the closure. Because we can pop out of the closure at any time to short circuit and return the rollback error, if any.

Does this make sense?

I was thinking (and implemented accordingly) that in case of a failure we want to use Golem's capabilities to recover and retry. This means that if there was an error this function never returns. After the rollback operations are finished, it "crashes" with set-oplog-index and retries from the beginning of the transactions.

I thought that was the primary reason why we added set-oplog-index - otherwise would not this be a fully "user-land" feature which is not golem-specific at all?

I think this is a useful semantic (transactions must succeed). But also in some cases (the checkout example), transactions can fail, and if they do, we want to know about that and not auto-retry. Probably both semantics are useful, and we need a terminology to describe the distinction, and then an API to separate them.

I see, did not think of that case. So why would someone use golem-rust to define operations etc. if they don't want anything Golem specific (recovery) to happen instead of just writing their business logic that calls A, B and C and does things in case of failure?

Well, we are talking business failure versus infrastructure failure: a checkout API needs to explicitly handle the case that the payment processor fails, whereupon it must revert prior steps in the process (including reservations held on inventory); the payment processor will not just succeed if tried more times; it could have failed because the user entered payment information incorrectly, or because they have reached their account or balance limit, etc.

I think Golem Rust can contain a lightweight implementation of the saga pattern, which can be used for scenarios like these. But upon further reflection, I think your "transaction never fails" semantics are indeed useful. For example: when you are writing to a database and also to a message broker, the transaction can always succeed eventually and you do want always-retry semantics. But for other cases, I think our value add is the saga pattern itself, which becomes reliable on Golem, but there's nothing Golem specific about that Saga pattern implementation.

Does this make sense or am I overlooking something?

Do you think the proposed advanced version with the kv store the etc makes sense as well? As a third level of guarantees?

Yes, but I would not do that at this point in time (I think it does not require Golem-specific support, only wasi-keyvalue, and of course the runtime API).

However, I have been thinking about the use of atomic region for a transaction: I do not think this is correct.

Let's take the checkout example:

reserve inventory

process the payment

store the success

create the shipping label & tracking number

dispatch the order

send the email confirmation

The user wants transactional guarantees around all 6 steps. What does that mean?

It means that the execution of these steps must be durable, so that if steps 1 - 2 are executed, and a fault occurs, then after recovery, step 3 will be the next step. Moreover, and unique to Golem, they must be fine-grained, such that if step (1) is actually 20 steps, and a fault occurs somewhere inside those 20 steps, then we'll recover wherever we left off when the fault occurred.

It also means that if there is a domain / business error, then compensations are applied automatically in reverse order. An example is that processing the payment fails, because of insufficient balance or overextended credit. If processing the payment falls, then the inventory must be released, which is the compensation for reserve inventory.

Overall, and around the whole 6 step process, the user expects: the steps execute completely, successfully, and exactly-once (up to the limit of idempotent APIs); OR the process fails somewhere in the middle, for a business / domain error, in which case, the transaction is rolled back / aborted, and control returns to the user, who can handle the domain / business error in a case-specific way.

In order to achieve these goals, we must not use atomic around the transaction, because that prevents durability around the individual operations that make up the steps.

One can argue atomic should be used around operations. That emulates the Temporal semantic (common to other durable execution platforms). However, I don't think we should do that. If a user wants to do that, they can opt-in. It's better to have fine-grained durability by default, IMO.

This makes sense and it also means that if the worker gets restarted during compensation, the compensation process will be continued as expected (because the failure and the already performed compensations are already persisted).

I think it is needed in the other version (which I called RetryingTransaction now) where after the compensation the transaction is retried with set-oplog-index. This means that for domain errors the transaction is treated as atomic because we drop all the previous oplog entries for the retry. By using the atomic region we say that it should behave the same in case of a non-domain error.

I think the key-value-store based version cannot be done with WASI only, for example you need to mark the key-value read part as non-persistent otherwise you always get an empty list of compensations during retry.

jdegoes · 2024-04-12T09:03:06Z

@vigoo

I think the changes we should make:

Alter semantics to match those discussed for the checkout use case
Introduce two types of transactions, but extensible for a third:
- Fallible transactions: steps like payment processing can fail and these rollback and return control to the user to handle the failure
- Infallible transactions: failures are always retried according to whatever policy is defined in Golem
Tweak the naming a bit (e.g. execute instead of add, etc.)
Ensure Operation is designed to be macro-friendly (see 5)
Document (but not implement) a Rust attribute macro that can make it easy to define operations using ordinary Rust functions, albeit processed with macros

vigoo · 2024-04-12T09:10:09Z

Yes these are all done, including the actual implementation of the macro. I'm trying to do one more change regarding the compensation function's parameters.

vigoo · 2024-04-12T10:17:42Z

Ready for new review

vigoo · 2024-04-12T11:19:26Z

golem-rust/src/transaction.rs

+    fn compensate(&self, input: Self::In) -> Result<(), Self::Err>;
+
+    /// Executes a compensation action for the operation which ended up with the given result.
+    fn compensate_with_result(


We can get rid of this variant (and other things with _with_result postfix if we make a single compensate that gets result: Option<Result<Out, Err>> - and it would only be called with None in the (not-yet-implemented) advanced case of trying to run compensations after a non-domain level error.

jdegoes · 2024-04-12T12:19:32Z

golem-rust/src/transaction.rs

+/// operation's compensation actions are executed in reverse order and the transaction
+/// returns with a failure.
+pub fn fallible_transaction<Out, Err: Clone + 'static>(
+    f: impl FnOnce(&mut FallibleTransaction<Err>) -> TransactionResult<Out, Err>,


Suggested change

f: impl FnOnce(&mut FallibleTransaction<Err>) -> TransactionResult<Out, Err>,

f: impl FnOnce(&mut FallibleTransaction<Err>) -> Result<Out, Err>,

I don't the user should be able to generate a TransactionResult. I think they can return Result and then we will wrap that in TransactionResult as appropriate.

I did this because the actual transaction result is coming from a call within this function (a tx.execute) currently. But I will try to restructure it so it is not necessary (moving the error handling into the outer level)

vigoo · 2024-04-12T15:00:02Z

The latest changes allow the golem_operation macro's compensation function to be any of the following shapes:

#[golem_operation(compensation=compensate_something)
fn do_something(input_1, ... input_n) -> Result<Out, Err> { ... }

// Not taking any input
fn compensate_something() -> Result<(), Err>

// Only using the operation's result
fn compensate_something(result: Option<Result<Out, Err>>) -> Result<(), Err>

// Also taking the inputs
fn compensate_something(result: Option<Result<Out, Err>>, input_1, ..., input_N) -> Result<(), Err>

We cannot have variants with no result but one or more inputs because that would lead to overlapping instances.

jdegoes · 2024-04-12T15:42:06Z

golem-rust/src/transaction/mod.rs

+    fn compensate(
+        &self,
+        input: Self::In,
+        result: Option<Result<Self::Out, Self::Err>>,


I don't understand this. If the operation fails, then compensate should not be called, because there's nothing to compensate. So can't this be simplified to:

Suggested change

result: Option<Result<Self::Out, Self::Err>>,

result: Option<Self::Out>,

To eliminate the Option, could we just serialize Out, to make it:

Suggested change

result: Option<Result<Self::Out, Self::Err>>,

result: Self::Out,

Finally, how are you making this work without Out being Clone?

Ok. I was assuming that we want the user to be able to perform compensation in case of an error too - so currently the first compensation action executed is the one for the failed operation, passing the failure cause to it.
I don't have a real-world example for it, it just felt right.

So if we say a failed operation never has to be compensated, then you are right in all points:

we don't have to pass a Result

and we can serialize Outs of the successful steps so there is no need for the Option either. In my version None would have been passed when we know an operation was started, but it never finished, to allow the user to try to compensate it somehow (check if it actually happened and do something conditionally, for example)

Should I make this simplification?

The Clone constraint is there, just not on the type. Probably it should. (It's on the execute function)

vigoo added 4 commits April 10, 2024 18:19

Converted to a workspace and added a new golem-rust crate

3d074f8

Fixes

f5f718e

Fix CI

071c7e9

Added WIT dir

dbe03d1

afsalthaj reviewed Apr 11, 2024

View reviewed changes

justcoon approved these changes Apr 11, 2024

View reviewed changes

jdegoes reviewed Apr 11, 2024

View reviewed changes

vigoo added 4 commits April 12, 2024 09:26

Multiple transaction modes and renames

24891fb

golem_operation macro

4070420

Do not mark normal transaction as atomic operation

333d9a6

Clippy

9cdb5dc

vigoo added 4 commits April 12, 2024 11:54

Compensation with result

562ee7f

Macro support for compensation_with_result

3da44f6

Macro generates extension methods

bf986ce

Fix macro

3e3b424

Empty commit

7140dba

vigoo commented Apr 12, 2024

View reviewed changes

jdegoes reviewed Apr 12, 2024

View reviewed changes

vigoo added 5 commits April 12, 2024 15:22

Not using TransactionFailure in user-defined functions

8610644

Got rid compensate_with_result

47e62f3

Flexible compensation function invocation

47c49f7

Removed commented out code

a2ebee4

Fix comment

a68b456

jdegoes reviewed Apr 12, 2024

View reviewed changes

vigoo added 2 commits April 12, 2024 18:54

Simplify compensation to not get called for failed operations

948dfaa

Adjust doc comments

baffd04

jdegoes approved these changes Apr 12, 2024

View reviewed changes

jdegoes merged commit 93cbc82 into main Apr 12, 2024
2 checks passed

jdegoes deleted the golem-rust-api-1 branch April 12, 2024 17:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

High level Rust API for Golem's transactional host functions #9

High level Rust API for Golem's transactional host functions #9

vigoo commented Apr 10, 2024

afsalthaj Apr 11, 2024

afsalthaj Apr 11, 2024

vigoo Apr 11, 2024

vigoo commented Apr 11, 2024

jdegoes Apr 11, 2024

vigoo Apr 11, 2024 •

edited

Loading

jdegoes Apr 11, 2024

vigoo Apr 11, 2024

jdegoes Apr 11, 2024

vigoo Apr 12, 2024

jdegoes Apr 11, 2024

vigoo Apr 11, 2024

jdegoes Apr 11, 2024

vigoo Apr 11, 2024

jdegoes Apr 11, 2024

vigoo Apr 11, 2024

jdegoes Apr 12, 2024

vigoo Apr 12, 2024

vigoo Apr 12, 2024

jdegoes commented Apr 12, 2024

vigoo commented Apr 12, 2024

vigoo commented Apr 12, 2024

vigoo Apr 12, 2024

jdegoes Apr 12, 2024

vigoo Apr 12, 2024

vigoo commented Apr 12, 2024

jdegoes Apr 12, 2024

vigoo Apr 12, 2024

vigoo Apr 12, 2024

	f: impl FnOnce(&mut FallibleTransaction<Err>) -> TransactionResult<Out, Err>,
	f: impl FnOnce(&mut FallibleTransaction<Err>) -> Result<Out, Err>,

	result: Option<Result<Self::Out, Self::Err>>,
	result: Option<Self::Out>,

	result: Option<Result<Self::Out, Self::Err>>,
	result: Self::Out,

High level Rust API for Golem's transactional host functions #9

High level Rust API for Golem's transactional host functions #9

Conversation

vigoo commented Apr 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vigoo commented Apr 11, 2024

Choose a reason for hiding this comment

vigoo Apr 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jdegoes commented Apr 12, 2024

vigoo commented Apr 12, 2024

vigoo commented Apr 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vigoo commented Apr 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vigoo Apr 11, 2024 •

edited

Loading