-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
High level Rust API for Golem's transactional host functions #9
Conversation
|
||
This crate contains couple of Rust macros that facilitate writing Golem Cloud backends in Rust: | ||
1. Derives `From<>` and `Into<>` typeclasses between wit-bindgen derived data types and custom domain model data types. | ||
2. Generates wit file from rust code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity. So this PR is removing these features and just exposing transactional host functions? Or is it a mix of different things that help users write golem backend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, i think I got the answer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just moved the macro into golem-rust-macro
.
Possible improvement: #10 |
golem-rust/src/transaction.rs
Outdated
/// | ||
/// Implement this trait and use it within a `transaction` block. | ||
/// Operations can also be constructed from closures using `operation`. | ||
pub trait Operation<In, Out, Err>: Clone { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very data-oriented API, which makes it more difficult for the user to interleave other things, such as logging, whose behavior does not need to be compensated in the event of a rollback.
I wonder if we could achieve a less-invasive API by executing the actions immediately, and only storing the rollbacks.
Let's imagine two possible APIs:
transaction(move |tx| {
// action is executed now, but compensation is stored in tx:
tx.execute(|| { ... }, || { ... });
});
Or:
let txn = Transaction::new();
txn.begin_op();
// Arbitrary rust code:
do_something();
txn.end_op(|| { my_compensation(); });
With type classes and macros, we could improve upon this a lot:
pub trait Operation {
type Out;
type Err;
fn execute(&self) -> Result<Out, Err>;
fn compensate(&self, out: Out) -> Result<(), Err>;
}
#[golem_operation]
#[golem_compensation(my_compensation)]
pub fn my_operation() -> Result<bool, Error> {
}
pub fn my_compensation(b: bool) -> Result<(), Error> {
}
...
log("some logging stuff here");
txn.execute(my_operation);
log("some logging stuff there");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to separate the operation definition from the execution in my version either:
tx.add(operation::<String, (), ()>(
move |input: String| {
log1.borrow_mut().push(format!("op1 execute {input}"));
Ok(())
},
move |input: String| {
log2.borrow_mut().push(format!("op1 rollback {input}"));
},
"input".to_string()
));
Of course it can be simplified like the execute
above, but we can only do that with this version and not my proposed "advanced one" (in #10)
Note that the reason I added an input is also this advanced proposal. If we don't want to do that, there is no need to pass an input to the operation and it can just be two closures.
The begin_op
/end_op
version cannot catch the errors happening in between, so there is no way to perform the rollback.
The macro to generate Operation
instances is probably a good idea (although a bit magical to me :) and seems easy to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean by separate is that you cannot put code around the execution of the operation, because you are building a data structure, which will be executed later. It would be nice to "execute now" the operation, and keep track of the compensation, because it allows arbitrary code before and after the compensated operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I think I just named it confusingly. My add
immediately executes the operation, and you can definitely have code around it. That's the whole reason transaction get's a f: impl FnOnce(&mut Transaction) -> Out
and not a list of operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know this. Maybe we can rename to execute
?
Let's create a ticket for the macro for operation. I think it can become bearable syntax with the macro, added to top-level functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a first version of the macro.
I'm going to extend Operation
and the macro as well to have a way to write compensations that require the output of the action - with the current one it only gets the input, which is the only thing we can do when we want to run compensations on non-domain failures too. But for the simpler saga pattern it is easier if you don't need to derive the compensation from the input (and probably it is not even possible in many cases) so I am going to support both.
golem-rust/src/transaction.rs
Outdated
} | ||
} | ||
|
||
pub fn transaction<Out>(f: impl FnOnce(&mut Transaction) -> Out) -> Out { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to carefully think through what we want the user API to be here.
I think the requirements are something like:
- The transaction is attempted.
- If there is an error, rollback is attempted.
- If the transaction succeeds, then the user gets back the successful result of the transaction.
- If the transaction fails, due to an error, then the user gets back the failure of the transaction, along with knowledge about whether or not all compensation actions succeeded.
So in my mind, our transaction function looks something like:
pub fn transaction<Out>(f: impl FnOnce(&mut Transaction) -> Out) -> TransactionResult<Out> { ... }
enum TransactionResult<Out> {
Success(Out),
FailedWithFullRollback,
FailedWithPartialRollback(<error>),
}
Moreover, I am not sure we need set_oplog_index
if we are using the closure. Because we can pop out of the closure at any time to short circuit and return the rollback error, if any.
Does this make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking (and implemented accordingly) that in case of a failure we want to use Golem's capabilities to recover and retry. This means that if there was an error this function never returns. After the rollback operations are finished, it "crashes" with set-oplog-index
and retries from the beginning of the transactions.
I thought that was the primary reason why we added set-oplog-index
- otherwise would not this be a fully "user-land" feature which is not golem-specific at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a useful semantic (transactions must succeed). But also in some cases (the checkout example), transactions can fail, and if they do, we want to know about that and not auto-retry. Probably both semantics are useful, and we need a terminology to describe the distinction, and then an API to separate them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, did not think of that case. So why would someone use golem-rust
to define operations etc. if they don't want anything Golem specific (recovery) to happen instead of just writing their business logic that calls A, B and C and does things in case of failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we are talking business failure versus infrastructure failure: a checkout API needs to explicitly handle the case that the payment processor fails, whereupon it must revert prior steps in the process (including reservations held on inventory); the payment processor will not just succeed if tried more times; it could have failed because the user entered payment information incorrectly, or because they have reached their account or balance limit, etc.
I think Golem Rust can contain a lightweight implementation of the saga pattern, which can be used for scenarios like these. But upon further reflection, I think your "transaction never fails" semantics are indeed useful. For example: when you are writing to a database and also to a message broker, the transaction can always succeed eventually and you do want always-retry semantics. But for other cases, I think our value add is the saga pattern itself, which becomes reliable on Golem, but there's nothing Golem specific about that Saga pattern implementation.
Does this make sense or am I overlooking something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think the proposed advanced version with the kv store the etc makes sense as well? As a third level of guarantees?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I would not do that at this point in time (I think it does not require Golem-specific support, only wasi-keyvalue, and of course the runtime API).
However, I have been thinking about the use of atomic region for a transaction: I do not think this is correct.
Let's take the checkout example:
- reserve inventory
- process the payment
- store the success
- create the shipping label & tracking number
- dispatch the order
- send the email confirmation
The user wants transactional guarantees around all 6 steps. What does that mean?
- It means that the execution of these steps must be durable, so that if steps 1 - 2 are executed, and a fault occurs, then after recovery, step 3 will be the next step. Moreover, and unique to Golem, they must be fine-grained, such that if step (1) is actually 20 steps, and a fault occurs somewhere inside those 20 steps, then we'll recover wherever we left off when the fault occurred.
- It also means that if there is a domain / business error, then compensations are applied automatically in reverse order. An example is that processing the payment fails, because of insufficient balance or overextended credit. If processing the payment falls, then the inventory must be released, which is the compensation for reserve inventory.
- Overall, and around the whole 6 step process, the user expects: the steps execute completely, successfully, and exactly-once (up to the limit of idempotent APIs); OR the process fails somewhere in the middle, for a business / domain error, in which case, the transaction is rolled back / aborted, and control returns to the user, who can handle the domain / business error in a case-specific way.
In order to achieve these goals, we must not use atomic around the transaction, because that prevents durability around the individual operations that make up the steps.
One can argue atomic should be used around operations. That emulates the Temporal semantic (common to other durable execution platforms). However, I don't think we should do that. If a user wants to do that, they can opt-in. It's better to have fine-grained durability by default, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense and it also means that if the worker gets restarted during compensation, the compensation process will be continued as expected (because the failure and the already performed compensations are already persisted).
I think it is needed in the other version (which I called RetryingTransaction
now) where after the compensation the transaction is retried with set-oplog-index
. This means that for domain errors the transaction is treated as atomic because we drop all the previous oplog entries for the retry. By using the atomic region we say that it should behave the same in case of a non-domain error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the key-value-store based version cannot be done with WASI only, for example you need to mark the key-value read part as non-persistent otherwise you always get an empty list of compensations during retry.
I think the changes we should make:
|
Yes these are all done, including the actual implementation of the macro. I'm trying to do one more change regarding the compensation function's parameters. |
Ready for new review |
golem-rust/src/transaction.rs
Outdated
fn compensate(&self, input: Self::In) -> Result<(), Self::Err>; | ||
|
||
/// Executes a compensation action for the operation which ended up with the given result. | ||
fn compensate_with_result( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get rid of this variant (and other things with _with_result
postfix if we make a single compensate
that gets result: Option<Result<Out, Err>>
- and it would only be called with None
in the (not-yet-implemented) advanced case of trying to run compensations after a non-domain level error.
golem-rust/src/transaction.rs
Outdated
/// operation's compensation actions are executed in reverse order and the transaction | ||
/// returns with a failure. | ||
pub fn fallible_transaction<Out, Err: Clone + 'static>( | ||
f: impl FnOnce(&mut FallibleTransaction<Err>) -> TransactionResult<Out, Err>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f: impl FnOnce(&mut FallibleTransaction<Err>) -> TransactionResult<Out, Err>, | |
f: impl FnOnce(&mut FallibleTransaction<Err>) -> Result<Out, Err>, |
I don't the user should be able to generate a TransactionResult
. I think they can return Result
and then we will wrap that in TransactionResult
as appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this because the actual transaction result is coming from a call within this function (a tx.execute
) currently. But I will try to restructure it so it is not necessary (moving the error handling into the outer level)
The latest changes allow the
We cannot have variants with no |
golem-rust/src/transaction/mod.rs
Outdated
fn compensate( | ||
&self, | ||
input: Self::In, | ||
result: Option<Result<Self::Out, Self::Err>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this. If the operation fails, then compensate should not be called, because there's nothing to compensate. So can't this be simplified to:
result: Option<Result<Self::Out, Self::Err>>, | |
result: Option<Self::Out>, |
To eliminate the Option, could we just serialize Out
, to make it:
result: Option<Result<Self::Out, Self::Err>>, | |
result: Self::Out, |
Finally, how are you making this work without Out
being Clone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I was assuming that we want the user to be able to perform compensation in case of an error too - so currently the first compensation action executed is the one for the failed operation, passing the failure cause to it.
I don't have a real-world example for it, it just felt right.
So if we say a failed operation never has to be compensated, then you are right in all points:
- we don't have to pass a
Result
- and we can serialize
Out
s of the successful steps so there is no need for theOption
either. In my versionNone
would have been passed when we know an operation was started, but it never finished, to allow the user to try to compensate it somehow (check if it actually happened and do something conditionally, for example)
Should I make this simplification?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Clone
constraint is there, just not on the type. Probably it should. (It's on the execute
function)
Resolves golemcloud/golem#166