Allow user to specify MaxHistoryEvents and constrain history loading to include the latest generation only #1119

Open
wants to merge 2 commits into main

Conversation

davidmrdavid
Collaborator

This PR makes two small (diff-wise) but significant changes to the Azure Storage provider.

New MaxHistoryEvents configuration

First, it introduces a new MaxHistoryEvents configuration, which lets the user specify the maximum number of history events an orchestration may accumulate before it is automatically terminated. Many incidents, and sometimes outages, result from orchestrators having too large a history, which destabilizes the application. This configuration should give users peace of mind: even if they accidentally generate a large history, the framework will protect itself and terminate the offending orchestrations. This is an opt-in feature.
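For illustration, here is a minimal sketch of how the opt-in might look on the provider settings. The MaxHistoryEvents property name comes from this PR; the other settings and the exact property surface are assumptions and may differ by package version.

var settings = new AzureStorageOrchestrationServiceSettings
{
    TaskHubName = "MyTaskHub",
    StorageConnectionString = Environment.GetEnvironmentVariable("AzureWebJobsStorage"),

    // Opt-in: terminate any orchestration whose history grows beyond 10,000 events.
    MaxHistoryEvents = 10000,
};
var service = new AzureStorageOrchestrationService(settings);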

ContinueAsNew bug fix

Second, it fixes a longstanding bug that was only recently discovered: in many cases, an orchestrator that was "continued as new" will still load fragments of its old history (those that were not yet overwritten by the new history). This is because we don't always know the latest executionID of an orchestrator (which is what allows us to distinguish between generations), so we loaded more data than necessary. The end result was this: if an orchestrator instanceID ever generated a history of 10k records, it did not matter that later generations only generated 1k records; we would always load 10k rows (the 1k of the current generation, plus 9k stale records that had not yet been overwritten).

The fix for this bug is simple: we first read only the sentinel row of the History table, which reports the latest executionID. Then, using that executionID as an extra filter, we read the history as normal. This extra table read is a new cost to our users, but it is necessary for correctness. I'm hoping the cost is low enough to be acceptable.
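A rough sketch of that two-step read, in the same TableQuery filter style as the code below. The sentinel row key, the property names, and the variable names here are illustrative assumptions, not the PR's exact implementation.

// Step 1: read only the sentinel row for this instance to learn the latest ExecutionId.
// Assumption: the sentinel row uses a fixed, well-known row key ("sentinel").
string instanceFilter = TableQuery.GenerateFilterCondition(
    "PartitionKey", QueryComparisons.Equal, sanitizedInstanceId);
string sentinelFilter = TableQuery.CombineFilters(
    instanceFilter,
    TableOperators.And,
    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "sentinel"));
var sentinelQuery = new TableQuery<DynamicTableEntity>().Where(sentinelFilter).Take(1);

// Step 2: read the history as usual, but add the ExecutionId from the sentinel row
// as an extra filter so only the latest generation's rows are returned.
string executionIdFilter = TableQuery.GenerateFilterCondition(
    "ExecutionId", QueryComparisons.Equal, latestExecutionId);
string historyFilter = TableQuery.CombineFilters(
    instanceFilter, TableOperators.And, executionIdFilter);
var historyQuery = new TableQuery<DynamicTableEntity>().Where(historyFilter);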

I did try to make this work without the extra table read by issuing a paginated query, reading the History table "in parts", and stopping as soon as the executionID of the results changed, which tells us that we are now reading a stale history. However, with that approach we would stop before reading the sentinel row (the sentinel row is the last row of the table), which DTFx needs for its CheckpointCompletedTimestamp, used to detect out-of-order messages. In the end, I found it simpler and safer to make that extra request, but this is up for debate.

@AnatoliB
Collaborator

How do we expect the user to choose a good MaxHistoryEvents value? How would they know that their value is neither too high (in which case the limit never matters) nor too low (in which case it terminates orchestrations prematurely)? Can we provide any guidance?

An alternative idea to consider: measure the time required to read the history, and let the user configure a maximum time instead.

@davidmrdavid
Collaborator Author

davidmrdavid commented Jun 21, 2024

@AnatoliB - I also considered measuring time instead, but I think that's a much more unstable statistic: the time to read from Azure Storage can vary greatly depending on network availability, IO resource constraints, throttling from Azure Storage, etc. So, as a signal, it can differ greatly between measurements (one load of a given history might be slow, but it might be fast again a bit later).

So I think measuring the history size provides more predictable behavior for users, and for ourselves. It also corresponds more closely to the terminology we use in our best-practices doc: keep history sizes small, not "fast".

As for guidance, I think that'll be a matter of documentation. For instance, the IntelliSense could suggest a default (in my experience, anything in the low thousands is fine, ideally below 1k, but apps are generally healthy enough up to ~10k events). I'd be happy to add that here, and in the DF-level PR (which will need to re-export this feature).

return tableEntitiesResponseInfo;

Collaborator Author

will remove this extra whitespace before merging. Did not notice it

@cgillum (Collaborator) left a comment

It seems the downside of doing the extra sentinel load only occurs when we don't know the execution ID, correct? If that's the case, it should only happen under a small number of conditions, such as when the new events don't contain execution IDs (for example, when the new events are only EventRaised events). If that's true, then I'm not really worried about the extra load. This new logic wouldn't impact the most common scenarios.

Some code specific comments:


// Fail the orchestrator by creating Terminate event
ExecutionTerminatedEvent terminateEvent = new ExecutionTerminatedEvent(-1, $"Orchestrator exceeded maximum history size: {this.settings.MaxHistoryEvents}");
history.Events.Add(terminateEvent);
Collaborator

I think it would be better to add this logic to CompleteTaskOrchestrationWorkItemAsync rather than doing it here. That way you can terminate the orchestration without first having to load and execute the excessive history, which the current placement requires. There would be some logging benefits as well, since the termination would happen on the same logical thread as the orchestration execution rather than on a background thread, as is the case for history loading.

@@ -226,22 +225,36 @@ public override async Task<OrchestrationHistory> GetHistoryEventsAsync(string in

return new OrchestrationHistory(historyEvents, checkpointCompletionTime, eTagValue, trackingStoreContext);
}

#nullable enable
async Task<TableEntitiesResponseInfo<DynamicTableEntity>> GetHistoryEntitiesResponseInfoAsync(string instanceId, string expectedExecutionId, IList<string> projectionColumns, CancellationToken cancellationToken = default(CancellationToken))
{
var sanitizedInstanceId = KeySanitation.EscapePartitionKey(instanceId);
string filterCondition = TableQuery.GenerateFilterCondition(PartitionKeyProperty, QueryComparisons.Equal, sanitizedInstanceId);
Collaborator

Looks like we're always overwriting this variable value with this change. To make the intent clearer, I suggest we rename this variable to something like instanceIdFilterCondition and then declare filterCondition below (on line 256) as a new variable.
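A sketch of the suggested shape, reusing names from the snippet above; the executionID filter shown here is illustrative.

string instanceIdFilterCondition = TableQuery.GenerateFilterCondition(
    PartitionKeyProperty, QueryComparisons.Equal, sanitizedInstanceId);

// ...later, combine it with the executionID filter into a separate, new variable:
string filterCondition = TableQuery.CombineFilters(
    instanceIdFilterCondition,
    TableOperators.And,
    TableQuery.GenerateFilterCondition("ExecutionId", QueryComparisons.Equal, expectedExecutionId));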

@AnatoliB
Collaborator

As for guidance, I think that'll be a matter of documentation. For instance, the IntelliSense could suggest a default (in my experience, anything in the low thousands is fine, ideally below 1k, but apps are generally healthy enough up to ~10k events). I'd be happy to add that here, and in the DF-level PR (which will need to re-export this feature).

@davidmrdavid I'm sure we can recommend a number, but my concern is about being confident that using this number is better than having no limit at all. By suggesting any specific number, we could easily be off by orders of magnitude. If the number is significantly higher than what a specific configuration can manage (considering all the factors: I/O performance, memory, etc.), the limit will not matter. If the number is significantly lower, it will unnecessarily hurt a potentially healthy app. So, why would the customer want to set any specific number we suggest? What value are they getting from it? Where would their peace of mind come from?

As a customer, I would want to set this limit only if it is connected to a metric that is truly meaningful to me: e.g. time, memory consumption, etc. Based on my use case, I can often claim that, for example, spending more than 5 minutes on loading history feels way too long. We can address timing fluctuations by measuring averages, medians, or percentiles instead of acting on any single measurement.

I'm not strictly against introducing a limit in the proposed form, but I do worry that it's easy to measure but difficult to meaningfully apply. At the very least, we need to provide enough guidance for the user to understand when 1k is better than 10k or 100k, and not just suggest a seemingly arbitrary number.

@davidmrdavid
Collaborator Author

I see your point and agree that history size is much less tangible to customers than measuring time. Let me make a few comments on your points to help provide more context on how I view this.

We can address timing fluctuations by measuring averages, medians, or percentiles instead of acting on any single measurement.

Yes, that's true. However, for me to measure averages/medians/etc., I need to be able to save multiple history timing samples per instanceID. Today, we don't have the storage for that; we only have the History and Instances tables for long-term storage. In theory, I suppose we could augment either table to start storing this sort of data, but that would come at an extra customer cost (more columns in some table) as well as being a more intrusive change. In general, if we had a good place to store these measurements (so we can calculate averages, medians, and other aggregate statistics), I'd be more open to making this configuration work in terms of time.

At the very least, we need to provide enough guidance for the user to understand when 1k is better than 10k or 100k, and not just suggest a seemingly arbitrary number.

Yeah, I see your point. One possible reason why 1k is particularly good is that Azure Storage can only return 1k rows per request when querying tables. Therefore, any history larger than 1k rows requires multiple reads to load into memory. So perhaps that's a principled reason to suggest 1k rows as having some special significance? I still believe users ultimately need to handle this on a case-by-case basis, but Azure Storage already gives special significance to 1k rows, so perhaps we can rely on that precedent.

So, why would the customer want to set any specific number we suggest?

One way to help them decide what number to choose is to write a detector that uses telemetry to determine what history sizes take too long to load, and have the detector dynamically suggest a number depending on the size of those histories and how long they took to load. We already have a "slow histories" detector, so we could easily augment that to provide personalized guidance.

@jviau
Collaborator

jviau commented Jun 24, 2024

I think we could avoid the extra query for the execution ID if we create a special ExecuteQueryAsync. Azure Storage has pagination, but we are buffering it all into a list. With a special implementation, we could start the query, grab the first execution ID that appears, then stop loading more pages once the execution ID changes.
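A rough sketch of that idea, assuming the older TableQuery/ExecuteQuerySegmentedAsync API used elsewhere in this file; the method name and details are illustrative, not an actual implementation.

// Hypothetical helper: page through the history and stop once the ExecutionId changes
// from the first one observed, i.e. once we cross into a stale generation.
async Task<List<DynamicTableEntity>> ReadLatestGenerationAsync(
    CloudTable historyTable, TableQuery<DynamicTableEntity> query)
{
    var results = new List<DynamicTableEntity>();
    string firstExecutionId = null;
    TableContinuationToken token = null;
    do
    {
        TableQuerySegment<DynamicTableEntity> segment =
            await historyTable.ExecuteQuerySegmentedAsync(query, token);
        foreach (DynamicTableEntity entity in segment.Results)
        {
            string executionId = entity.Properties["ExecutionId"].StringValue;
            firstExecutionId = firstExecutionId ?? executionId;
            if (executionId != firstExecutionId)
            {
                // Stale generation reached. Caveat from the discussion above: stopping here
                // also skips the sentinel row at the end of the partition, which DTFx needs.
                return results;
            }
            results.Add(entity);
        }
        token = segment.ContinuationToken;
    } while (token != null);

    return results;
}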
