Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High level Rust API for Golem's transactional host functions #9

Merged
merged 20 commits into from
Apr 12, 2024

Conversation

vigoo
Copy link
Contributor

@vigoo vigoo commented Apr 10, 2024


This crate contains couple of Rust macros that facilitate writing Golem Cloud backends in Rust:
1. Derives `From<>` and `Into<>` typeclasses between wit-bindgen derived data types and custom domain model data types.
2. Generates wit file from rust code.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity. So this PR is removing these features and just exposing transactional host functions? Or is it a mix of different things that help users write golem backend?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i think I got the answer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moved the macro into golem-rust-macro.

@vigoo
Copy link
Contributor Author

vigoo commented Apr 11, 2024

Possible improvement: #10

///
/// Implement this trait and use it within a `transaction` block.
/// Operations can also be constructed from closures using `operation`.
pub trait Operation<In, Out, Err>: Clone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very data-oriented API, which makes it more difficult for the user to interleave other things, such as logging, whose behavior does not need to be compensated in the event of a rollback.

I wonder if we could achieve a less-invasive API by executing the actions immediately, and only storing the rollbacks.

Let's imagine two possible APIs:

transaction(move |tx| {
  // action is executed now, but compensation is stored in tx:
  tx.execute(|| { ... }, || { ... });
});

Or:

let txn = Transaction::new();

txn.begin_op();

// Arbitrary rust code:
do_something();

txn.end_op(|| { my_compensation(); });

With type classes and macros, we could improve upon this a lot:

pub trait Operation {
  type Out;
  type Err;

  fn execute(&self) -> Result<Out, Err>;
  fn compensate(&self, out: Out) -> Result<(), Err>;
}

#[golem_operation]
#[golem_compensation(my_compensation)]
pub fn my_operation() -> Result<bool, Error> {
}

pub fn my_compensation(b: bool) -> Result<(), Error> {
}

...
log("some logging stuff here");
txn.execute(my_operation);
log("some logging stuff there");

Copy link
Contributor Author

@vigoo vigoo Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to separate the operation definition from the execution in my version either:

tx.add(operation::<String, (), ()>(
            move |input: String| {
                log1.borrow_mut().push(format!("op1 execute {input}"));
                Ok(())
            },
            move |input: String| {
                log2.borrow_mut().push(format!("op1 rollback {input}"));
            },
           "input".to_string()
));

Of course it can be simplified like the execute above, but we can only do that with this version and not my proposed "advanced one" (in #10)
Note that the reason I added an input is also this advanced proposal. If we don't want to do that, there is no need to pass an input to the operation and it can just be two closures.

The begin_op/end_op version cannot catch the errors happening in between, so there is no way to perform the rollback.

The macro to generate Operation instances is probably a good idea (although a bit magical to me :) and seems easy to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean by separate is that you cannot put code around the execution of the operation, because you are building a data structure, which will be executed later. It would be nice to "execute now" the operation, and keep track of the compensation, because it allows arbitrary code before and after the compensated operation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think I just named it confusingly. My add immediately executes the operation, and you can definitely have code around it. That's the whole reason transaction get's a f: impl FnOnce(&mut Transaction) -> Out and not a list of operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know this. Maybe we can rename to execute?

Let's create a ticket for the macro for operation. I think it can become bearable syntax with the macro, added to top-level functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a first version of the macro.

I'm going to extend Operation and the macro as well to have a way to write compensations that require the output of the action - with the current one it only gets the input, which is the only thing we can do when we want to run compensations on non-domain failures too. But for the simpler saga pattern it is easier if you don't need to derive the compensation from the input (and probably it is not even possible in many cases) so I am going to support both.

}
}

pub fn transaction<Out>(f: impl FnOnce(&mut Transaction) -> Out) -> Out {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to carefully think through what we want the user API to be here.

I think the requirements are something like:

  1. The transaction is attempted.
  2. If there is an error, rollback is attempted.
  3. If the transaction succeeds, then the user gets back the successful result of the transaction.
  4. If the transaction fails, due to an error, then the user gets back the failure of the transaction, along with knowledge about whether or not all compensation actions succeeded.

So in my mind, our transaction function looks something like:

pub fn transaction<Out>(f: impl FnOnce(&mut Transaction) -> Out) -> TransactionResult<Out> {  ... }

enum TransactionResult<Out> {
  Success(Out),
  FailedWithFullRollback,
  FailedWithPartialRollback(<error>),
}

Moreover, I am not sure we need set_oplog_index if we are using the closure. Because we can pop out of the closure at any time to short circuit and return the rollback error, if any.

Does this make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking (and implemented accordingly) that in case of a failure we want to use Golem's capabilities to recover and retry. This means that if there was an error this function never returns. After the rollback operations are finished, it "crashes" with set-oplog-index and retries from the beginning of the transactions.

I thought that was the primary reason why we added set-oplog-index - otherwise would not this be a fully "user-land" feature which is not golem-specific at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a useful semantic (transactions must succeed). But also in some cases (the checkout example), transactions can fail, and if they do, we want to know about that and not auto-retry. Probably both semantics are useful, and we need a terminology to describe the distinction, and then an API to separate them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, did not think of that case. So why would someone use golem-rust to define operations etc. if they don't want anything Golem specific (recovery) to happen instead of just writing their business logic that calls A, B and C and does things in case of failure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we are talking business failure versus infrastructure failure: a checkout API needs to explicitly handle the case that the payment processor fails, whereupon it must revert prior steps in the process (including reservations held on inventory); the payment processor will not just succeed if tried more times; it could have failed because the user entered payment information incorrectly, or because they have reached their account or balance limit, etc.

I think Golem Rust can contain a lightweight implementation of the saga pattern, which can be used for scenarios like these. But upon further reflection, I think your "transaction never fails" semantics are indeed useful. For example: when you are writing to a database and also to a message broker, the transaction can always succeed eventually and you do want always-retry semantics. But for other cases, I think our value add is the saga pattern itself, which becomes reliable on Golem, but there's nothing Golem specific about that Saga pattern implementation.

Does this make sense or am I overlooking something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the proposed advanced version with the kv store the etc makes sense as well? As a third level of guarantees?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I would not do that at this point in time (I think it does not require Golem-specific support, only wasi-keyvalue, and of course the runtime API).

However, I have been thinking about the use of atomic region for a transaction: I do not think this is correct.

Let's take the checkout example:

  1. reserve inventory
  2. process the payment
  3. store the success
  4. create the shipping label & tracking number
  5. dispatch the order
  6. send the email confirmation

The user wants transactional guarantees around all 6 steps. What does that mean?

  1. It means that the execution of these steps must be durable, so that if steps 1 - 2 are executed, and a fault occurs, then after recovery, step 3 will be the next step. Moreover, and unique to Golem, they must be fine-grained, such that if step (1) is actually 20 steps, and a fault occurs somewhere inside those 20 steps, then we'll recover wherever we left off when the fault occurred.
  2. It also means that if there is a domain / business error, then compensations are applied automatically in reverse order. An example is that processing the payment fails, because of insufficient balance or overextended credit. If processing the payment falls, then the inventory must be released, which is the compensation for reserve inventory.
  3. Overall, and around the whole 6 step process, the user expects: the steps execute completely, successfully, and exactly-once (up to the limit of idempotent APIs); OR the process fails somewhere in the middle, for a business / domain error, in which case, the transaction is rolled back / aborted, and control returns to the user, who can handle the domain / business error in a case-specific way.

In order to achieve these goals, we must not use atomic around the transaction, because that prevents durability around the individual operations that make up the steps.

One can argue atomic should be used around operations. That emulates the Temporal semantic (common to other durable execution platforms). However, I don't think we should do that. If a user wants to do that, they can opt-in. It's better to have fine-grained durability by default, IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense and it also means that if the worker gets restarted during compensation, the compensation process will be continued as expected (because the failure and the already performed compensations are already persisted).

I think it is needed in the other version (which I called RetryingTransaction now) where after the compensation the transaction is retried with set-oplog-index. This means that for domain errors the transaction is treated as atomic because we drop all the previous oplog entries for the retry. By using the atomic region we say that it should behave the same in case of a non-domain error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key-value-store based version cannot be done with WASI only, for example you need to mark the key-value read part as non-persistent otherwise you always get an empty list of compensations during retry.

@jdegoes
Copy link
Contributor

jdegoes commented Apr 12, 2024

@vigoo

I think the changes we should make:

  1. Alter semantics to match those discussed for the checkout use case
  2. Introduce two types of transactions, but extensible for a third:
    • Fallible transactions: steps like payment processing can fail and these rollback and return control to the user to handle the failure
    • Infallible transactions: failures are always retried according to whatever policy is defined in Golem
  3. Tweak the naming a bit (e.g. execute instead of add, etc.)
  4. Ensure Operation is designed to be macro-friendly (see 5)
  5. Document (but not implement) a Rust attribute macro that can make it easy to define operations using ordinary Rust functions, albeit processed with macros

@vigoo
Copy link
Contributor Author

vigoo commented Apr 12, 2024

Yes these are all done, including the actual implementation of the macro. I'm trying to do one more change regarding the compensation function's parameters.

@vigoo
Copy link
Contributor Author

vigoo commented Apr 12, 2024

Ready for new review

fn compensate(&self, input: Self::In) -> Result<(), Self::Err>;

/// Executes a compensation action for the operation which ended up with the given result.
fn compensate_with_result(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get rid of this variant (and other things with _with_result postfix if we make a single compensate that gets result: Option<Result<Out, Err>> - and it would only be called with None in the (not-yet-implemented) advanced case of trying to run compensations after a non-domain level error.

/// operation's compensation actions are executed in reverse order and the transaction
/// returns with a failure.
pub fn fallible_transaction<Out, Err: Clone + 'static>(
f: impl FnOnce(&mut FallibleTransaction<Err>) -> TransactionResult<Out, Err>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f: impl FnOnce(&mut FallibleTransaction<Err>) -> TransactionResult<Out, Err>,
f: impl FnOnce(&mut FallibleTransaction<Err>) -> Result<Out, Err>,

I don't the user should be able to generate a TransactionResult. I think they can return Result and then we will wrap that in TransactionResult as appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this because the actual transaction result is coming from a call within this function (a tx.execute) currently. But I will try to restructure it so it is not necessary (moving the error handling into the outer level)

@vigoo
Copy link
Contributor Author

vigoo commented Apr 12, 2024

The latest changes allow the golem_operation macro's compensation function to be any of the following shapes:

#[golem_operation(compensation=compensate_something)
fn do_something(input_1, ... input_n) -> Result<Out, Err> { ... }

// Not taking any input
fn compensate_something() -> Result<(), Err>

// Only using the operation's result
fn compensate_something(result: Option<Result<Out, Err>>) -> Result<(), Err>

// Also taking the inputs
fn compensate_something(result: Option<Result<Out, Err>>, input_1, ..., input_N) -> Result<(), Err>

We cannot have variants with no result but one or more inputs because that would lead to overlapping instances.

fn compensate(
&self,
input: Self::In,
result: Option<Result<Self::Out, Self::Err>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. If the operation fails, then compensate should not be called, because there's nothing to compensate. So can't this be simplified to:

Suggested change
result: Option<Result<Self::Out, Self::Err>>,
result: Option<Self::Out>,

To eliminate the Option, could we just serialize Out, to make it:

Suggested change
result: Option<Result<Self::Out, Self::Err>>,
result: Self::Out,

Finally, how are you making this work without Out being Clone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I was assuming that we want the user to be able to perform compensation in case of an error too - so currently the first compensation action executed is the one for the failed operation, passing the failure cause to it.
I don't have a real-world example for it, it just felt right.

So if we say a failed operation never has to be compensated, then you are right in all points:

  • we don't have to pass a Result
  • and we can serialize Outs of the successful steps so there is no need for the Option either. In my version None would have been passed when we know an operation was started, but it never finished, to allow the user to try to compensate it somehow (check if it actually happened and do something conditionally, for example)

Should I make this simplification?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Clone constraint is there, just not on the type. Probably it should. (It's on the execute function)

@jdegoes jdegoes merged commit 93cbc82 into main Apr 12, 2024
2 checks passed
@jdegoes jdegoes deleted the golem-rust-api-1 branch April 12, 2024 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Design a Rust version of the high level transaction API
4 participants