Get instance id for desired control-queue(s) #1069

Open
wants to merge 7 commits into
base: main

@pasaini-microsoft pasaini-microsoft commented Apr 19, 2024

Motivation

#1079

Issue: No way of targeting an orchestrator instance to a desired control-queue.

  • We have been facing issues where DTF orchestrations get stuck at random. Because customer load on our service is irregular, it was hard to tell up front whether an orchestration would be processed or would get stuck.
  • Often, customers reached out with incidents complaining that their requests had not completed for a long time.
  • This is why we needed orchestration instances that observe the health of each queue, by targeting one instance at each desired control-queue.

Motivation:

  • The motivation was to reduce the TTD (time to detect) for finding out whether orchestrations are stuck or waiting forever in a control-queue, irrespective of the cause.

Issue: No way to direct load to lightly loaded control-queues.

  • We have faced a few situations where some control-queues are overwhelmed with orchestration instances while the others are happily processing almost nothing.

Motivation:

  • The motivation was to target new orchestration instances at a set of control-queues that are not heavily loaded.

Proposal

API to generate instance id for a set of control-queues.

  • This API receives a set of control-queues and a prefix for the instance id.
  • Implementation detail: allow a special form of instance id that ends with an unsigned integer suffix after the delimiter '!', and use that value explicitly to route to a control-queue (say suffixNumber % partitionCount). If this pattern is not used, routing falls back to the current default, hash(instance-id) % partition-count.
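As a rough illustration of the proposal, the sketch below builds one instance id per requested control-queue by appending the '!' delimiter and the queue number to a prefix. `InstanceIdSketch` and `CreateInstanceIdsForQueues` are hypothetical names chosen for this example, not the API added by this PR; only the `prefix!number` shape comes from the description above.

```csharp
using System;
using System.Collections.Generic;

public static class InstanceIdSketch
{
    // Delimiter from the proposal: the unsigned integer after '!' selects the queue.
    private const char PlacementSeparator = '!';

    // Hypothetical helper: returns one instance id per requested control-queue number.
    public static IDictionary<uint, string> CreateInstanceIdsForQueues(
        string prefix, IEnumerable<uint> controlQueueNumbers)
    {
        if (string.IsNullOrEmpty(prefix))
        {
            throw new ArgumentException("A non-empty prefix is required.", nameof(prefix));
        }

        if (controlQueueNumbers == null)
        {
            throw new ArgumentNullException(nameof(controlQueueNumbers));
        }

        var result = new SortedDictionary<uint, string>();
        foreach (uint queueNumber in controlQueueNumbers)
        {
            // Duplicate queue numbers collapse to a single entry.
            result[queueNumber] = $"{prefix}{PlacementSeparator}{queueNumber}";
        }

        return result;
    }
}
```

Calling `CreateInstanceIdsForQueues("health-probe", new uint[] { 0, 2 })` yields `health-probe!0` and `health-probe!2`. Keeping the prefix unique across runs is left to the caller in this sketch.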

Collaborator

@davidmrdavid davidmrdavid left a comment

Some thoughts

Comment on lines +2090 to +2091
var partitionId = index % totalPartitions;
return (uint)partitionId;
Collaborator

So this allows for the following scenario:
Assume we have 4 partitions, and a user creates instanceID "abc!7". They will still land on some queue (7 % 4 = 3), but it won't be the 7th queue, because that doesn't exist.

From first principles, I would think we'd want to error out in this case. But it seems this behavior is consistent with Netherite. I would prefer to throw in this case, but curious to know what others think.

Collaborator

@pasaini-microsoft - now that I think about it, it may be a good idea to keep the behavior you implemented. If in the future, we make the partitionCount something that users can change 'on the fly' (not possible today, but I'm working to make this happen), then this behavior would be resilient to changes in the number of partitions.

Let's keep this behavior for now, but let's also try to emit a warning when the total number of partitions is less than the customer's specified target number. That will help notify the customer that something possibly unintuitive is taking place. Thanks

Collaborator

I am actually in favor of not generating a warning. The reason is that the warning would actually fire almost constantly in all the applications where I have used this.

The expected use is that applications want to distribute things over the queues but don't actually know or care how many queues there are (e.g. like partition keys in Azure Storage).

Collaborator

@jviau jviau Jul 10, 2024

I agree with @sebastianburckhardt - I feel this will be more noisy than helpful. I would only consider it if this led to undefined behavior. But it doesn't; it is by design, hence the % operator. We need to make sure this behavior is well documented.

Collaborator

@davidmrdavid davidmrdavid Aug 20, 2024

Ok, let's ignore the warning, I'm convinced. Agreed on the need to document it. We can do that documentation in an azure-docs PR.
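Since the thread settles on documenting the wrap-around rather than warning about it, here is a minimal sketch of the arithmetic from the commented lines (`index % totalPartitions`). `PlacementMath.ToPartition` is a hypothetical name used only for this example; the modulo itself is the only part taken from the diff.

```csharp
using System;

public static class PlacementMath
{
    // Maps the number after '!' to a partition; values >= totalPartitions wrap around.
    public static uint ToPartition(uint index, int totalPartitions)
    {
        if (totalPartitions <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(totalPartitions));
        }

        return index % (uint)totalPartitions;
    }
}
```

With a suffix of 7, `ToPartition(7, 4)` gives 3, while `ToPartition(7, 8)` and `ToPartition(7, 16)` both give 7. Placement is always defined, but the same suffix can map to a different queue when the partition count differs, which is the behavior the documentation would need to spell out.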

Comment on lines 62 to 69
controlQueueNumberToNameMap = new Dictionary<string, int>();

for (int i = 0; i < partitionCount; i++)
{
var controlQueueName = AzureStorageOrchestrationService.GetControlQueueName(settings.TaskHubName, i);
controlQueueNumberToNameMap[controlQueueName] = i;
}
}
Collaborator

Are we still using this in the new tests? No, right?

Collaborator

@davidmrdavid davidmrdavid left a comment

Have we tested this for external events as well?

Comment on lines +322 to +326
/// <summary>
/// Whether to allow instanceIDs to use special syntax to land on a specific partition.
/// If enabled, when an instanceID ends with suffix '!nnn', where 'nnn' is an unsigned number, the instance will land on the partition/queue for that number.
/// </summary>
public bool EnableExplicitPartitionPlacement { get; set; } = false;
Collaborator

Something to consider - it is not safe to change this from false to true (or vice-versa) while an orchestrator with the special syntax is in-flight. If we do that, any pre-existing messages for that orchestrator may now be considered to be "in the wrong queue".

Let's call this out in the intellisense
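To make the hazard concrete, the sketch below resolves the same instance id with the setting off and on. It assumes the default path is hash(instance-id) % partition-count, as the PR description states. The hash is injected, and `CharSumHash` is a deliberately trivial stand-in for the demo, not the hash the library uses. `PlacementSketch` and `ResolvePartition` are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class PlacementSketch
{
    // Hypothetical resolver: explicit placement when enabled and the id ends with "!<digits>",
    // otherwise the hash-based default.
    public static uint ResolvePartition(
        string instanceId,
        int partitionCount,
        bool enableExplicitPartitionPlacement,
        Func<string, uint> hash)
    {
        if (partitionCount <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(partitionCount));
        }

        if (enableExplicitPartitionPlacement)
        {
            int separator = instanceId.LastIndexOf('!');
            if (separator >= 0
                && uint.TryParse(
                    instanceId.Substring(separator + 1),
                    NumberStyles.None, // digits only: no sign, no whitespace
                    CultureInfo.InvariantCulture,
                    out uint index))
            {
                return index % (uint)partitionCount;
            }
        }

        return hash(instanceId) % (uint)partitionCount;
    }

    // Stand-in hash for the demo only: sum of the UTF-16 code units.
    public static uint CharSumHash(string value)
    {
        uint sum = 0;
        foreach (char c in value)
        {
            sum += c;
        }

        return sum;
    }
}
```

With 4 partitions and the stand-in hash, `"abc!7"` resolves to partition 2 when the setting is off (382 % 4) and to partition 3 when it is on (7 % 4). Messages already enqueued under one setting would be looked for in a different queue after the flip, which is the in-flight risk described above.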


int placementSeparatorPosition = instanceId.LastIndexOf('!');

// if the instance id ends with !nnn, where nnn is an unsigned number, it indicates explicit partition placement
Collaborator

Can we add a test that documents the behavior if the customer uses an instanceID with multiple '!' in there? Say instanceID "A!1!B!3" should probably map to partition "3", right?

Collaborator

Let's also add a test that checks that instanceID myinstanceID!NotANumber does not trigger any errors / that it correctly ignores the explicit placement logic.
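The two cases requested above could be pinned down with a sketch like the following. It assumes the suffix is taken after the last '!' (as `LastIndexOf('!')` in the diff suggests) and is accepted only when it consists of digits. `SuffixParser.TryGetExplicitIndex` is a hypothetical helper written for this example, not the PR's code.

```csharp
using System;
using System.Globalization;

public static class SuffixParser
{
    // Returns true and the parsed number when the id ends with "!<digits>"; false otherwise.
    public static bool TryGetExplicitIndex(string instanceId, out uint index)
    {
        index = 0;
        if (string.IsNullOrEmpty(instanceId))
        {
            return false;
        }

        // Only the text after the LAST '!' is considered, so earlier '!' characters are ignored.
        int separator = instanceId.LastIndexOf('!');
        if (separator < 0)
        {
            return false;
        }

        return uint.TryParse(
            instanceId.Substring(separator + 1),
            NumberStyles.None, // digits only: rejects signs, whitespace, and empty suffixes
            CultureInfo.InvariantCulture,
            out index);
    }
}
```

Under these assumptions, `"A!1!B!3"` yields 3, while `"myinstanceID!NotANumber"`, `"abc!"`, and `"abc!-1"` all return false without throwing, so the caller would fall back to the default placement.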


4 participants