
Akka.Cluster.Sharding: make entity passivation aware of _all_ messages being sent to entities, not just messages sent via the ShardRegion #7395

Open · Aaronontheweb opened this issue Nov 21, 2024 · 2 comments
Labels: akka-cluster-sharding, discussion, DX (developer experience issues - papercuts, footguns, and other non-bug problems)


Is your feature request related to a problem? Please describe.

This is the entity passivation code inside the Shard class - it passivates entity actors based on how long ago they processed their last message. We do this in order to free up memory from unused entity actors:

private void PassivateIdleEntities()
{
    // entities whose last recorded message is older than the deadline get passivated
    var deadline = DateTime.UtcNow - _settings.PassivateIdleEntityAfter;
    var refsToPassivate = _lastMessageTimestamp
        .Where(i => i.Value < deadline)
        .Select(i => _entities.Entity(i.Key))
        .Where(i => i != null).ToList();

    if (refsToPassivate.Count > 0)
    {
        Log.Debug("{0}: Passivating [{1}] idle entities", _typeName, refsToPassivate.Count);
        foreach (var r in refsToPassivate)
            Passivate(r!, _handOffStopMessage);
    }
}
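
For context, a minimal sketch of where the window comes from - hedged: the HOCON key matches Akka.NET's cluster-sharding reference config as I understand it, and treat the 120s default shown as approximate:

using Akka.Actor;
using Akka.Configuration;

// Sketch: tune (or disable) idle-entity passivation via HOCON.
var config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding {
        # entities that go this long without a ShardRegion-routed message get passivated;
        # set this to `off` to disable automatic passivation
        passivate-idle-entity-after = 120s
    }");

var system = ActorSystem.Create("sharded-system", config);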

As the sketch above shows, the passivation window is configurable and the feature can be disabled altogether - but that's not really what this issue is about. The problem is that the Shard actor only uses data from the Cluster.Sharding system itself to keep entity actors alive:

private IActorRef GetOrCreateEntity(EntityId id)
{
    var child = _entities.Entity(id);
    if (child != null)
        return child;

    var name = Uri.EscapeDataString(id);
    var a = Context.ActorOf(_entityProps(id), name);
    Context.WatchWith(a, new EntityTerminated(a));
    Log.Debug("{0}: Started entity [{1}] with entity id [{2}] in shard [{3}]", _typeName, a, id, _shardId);
    _entities.AddEntity(id, a);
    TouchLastMessageTimestamp(id); // only invoked from the Shard's own message-handling paths
    EntityCreated(id);
    return a;
}

private void TouchLastMessageTimestamp(EntityId id)
{
    if (_passivateIdleTask != null)
    {
        // the timestamp is refreshed only for messages routed through the ShardRegion -
        // traffic arriving via the entity's IActorRef directly is invisible here
        _lastMessageTimestamp[id] = DateTime.UtcNow;
    }
}

This can result in weird scenarios where "busy" entity actors still get killed, such as:

  1. Queueing a large batch of messages to an entity actor upfront, where working through the queue takes the entity actor longer than the passivation window (2 minutes by default);
  2. An entity actor that receives traffic from non-ShardRegion sources (i.e. DistributedPubSub) can die - see the sketch after this list;
  3. Entity actors that receive messages directly from other actors (via their IActorRef, rather than the ShardRegion) can die.
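
To make scenario (2) concrete, here's a minimal sketch - DeviceEntity and DeviceUpdate are hypothetical names, not from this issue - of an entity whose real traffic bypasses the Shard's bookkeeping entirely:

using Akka.Actor;
using Akka.Cluster.Tools.PublishSubscribe;

// Hypothetical sharded entity that also receives traffic outside the ShardRegion.
// Messages published via DistributedPubSub are delivered straight to the entity's
// IActorRef, so TouchLastMessageTimestamp never runs and the entity can be
// passivated while it is still busy.
public sealed class DeviceEntity : ReceiveActor
{
    public DeviceEntity()
    {
        var mediator = DistributedPubSub.Get(Context.System).Mediator;
        mediator.Tell(new Subscribe("device-updates", Self));

        Receive<SubscribeAck>(_ => { });
        Receive<DeviceUpdate>(update =>
        {
            // processed constantly, yet invisible to the Shard's idle tracking
        });
    }
}

public sealed class DeviceUpdate { } // hypothetical message type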

I think we can probably broaden the definition of "passivate" to include all sources of message traffic other than recurring messages (i.e. IWithTimers or Context.System.Scheduler) and enforce that inside Akka.Cluster.Sharding. This should align the behavior of automatic entity passivation more closely with what users expect, without forcing them to make distinctions between which messages count and which ones don't; a sketch of how recurring messages could opt out follows.
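
A sketch of that exclusion, assuming the proposal reuses the existing INotInfluenceReceiveTimeout marker (FlushTick and BufferedEntity are illustrative names, not real APIs):

using System;
using Akka.Actor;

// Hypothetical: a recurring timer message that opts out of keeping the entity
// alive, the same way INotInfluenceReceiveTimeout already opts messages out of
// resetting ReceiveTimeout.
public sealed class FlushTick : INotInfluenceReceiveTimeout { }

public sealed class BufferedEntity : ReceiveActor, IWithTimers
{
    public ITimerScheduler Timers { get; set; }

    public BufferedEntity()
    {
        Receive<FlushTick>(_ => { /* flush buffered writes */ });
    }

    protected override void PreStart()
    {
        // recurring self-traffic: under the proposal, this should NOT count as activity
        Timers.StartPeriodicTimer("flush", new FlushTick(), TimeSpan.FromSeconds(5));
    }
}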

Describe the solution you'd like

Two ideas for this:

  1. A passivation protocol - similar to how ReceiveTimeout works, but with two parties: a passivator (A) and a passivatee (B). A tells B to set a ReceiveTimeout for some duration, and B does all of the accounting. The normal INotInfluenceReceiveTimeout and timeout-window code that's already in the ActorCell applies, and A gets a notification when B hits its receive timeout. The only real things we'd need to add are a private message type, handled automatically via the ActorCell, that allows someone else to set the ReceiveTimeout, plus a notification message type back (see the sketch after this list).
  2. Optional and not exclusive: maybe we should add an override to ActorBase that allows user-defined actors to customize what happens when they receive a PoisonPill - that way we can avoid / address issues like #6321 (Akka.Cluster.Sharding / Akka.Persistence: PoisonPill message gets processed, kills actor before Persist() callback executed) - so if a passivating actor needs to do some work prior to shutting down, it can be given some time to do that. The only problem: there's no guarantee you get the time you need when shutting down, so this might add complexity the framework doesn't need. It did occur to me, though.
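
To make idea (1) concrete, here's a rough sketch of the two message types it implies - all names are illustrative, and none of this exists in Akka.NET today:

using System;
using Akka.Actor;

// Hypothetical protocol for idea (1): the Shard (A) asks the entity (B) to run the
// existing ReceiveTimeout accounting on its behalf. The ActorCell would handle
// SetPassivationTimeout automatically, the way it handles ReceiveTimeout today.
internal sealed class SetPassivationTimeout
{
    public SetPassivationTimeout(TimeSpan timeout, IActorRef notifyWhenIdle)
    {
        Timeout = timeout;
        NotifyWhenIdle = notifyWhenIdle;
    }

    public TimeSpan Timeout { get; }

    // The passivator (the Shard, in the sharding case).
    public IActorRef NotifyWhenIdle { get; }
}

// Sent back to NotifyWhenIdle when the entity's receive timeout fires;
// the Shard would treat this as "safe to passivate".
internal sealed class PassivationTimeoutExpired
{
    public PassivationTimeoutExpired(IActorRef idleEntity)
    {
        IdleEntity = idleEntity;
    }

    public IActorRef IdleEntity { get; }
}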

Describe alternatives you've considered

I also considered having the entity actors ping the Shard every time they receive a message, but that would get incredibly noisy and would significantly harm the throughput of the sharding system. Better to push the decision-making and state tracking to the edges, where they belong.

Aaronontheweb (Member, Author) commented:

Gonna mark this down as a "developer experience" issue because that's kind of how I feel about it. One knock-on effect if we implement this, though: an increase in memory usage for users who were used to the old behavior and relied on the sharding system ignoring non-sharding messages when deciding what to passivate.

to11mtm (Member) commented Dec 1, 2024:

> The passivation window is configurable and this feature can be disabled altogether - but that's not really what this is about. The problem is that the Shard actor only uses data from the Cluster.Sharding system itself to keep entity actors alive.

Two things going on in my head:

First, timing. Akka clusters are sometimes fun math problems. I know of at least one HOCON commit on a project where the timing settings were explained in detail - how I arrived at the timings and how they were then padded for full resiliency.

The second is 'intent'. Most of the systems I worked with had an 'explicit ack' model (i.e. a web API talking into the cluster insisted on an acknowledgement back for successful processing, and even then there was a 'buffer' layer involved)...

... Actually @Aaronontheweb this is one of those 'scheduled message' hobby-horse things for me, now that I think about it 😅. I ran into an issue like this in the past, and what I personally found was that for various scheduled messages/etc. you just always schedule to the shard (sketch below). That also had the benefit of giving you control over whether the messages should keep the actor alive or not. "Akka is a Toolkit".
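
A minimal sketch of that pattern, with hypothetical names ("devices", ShardEnvelope, Tick, device-42): scheduling through the ShardRegion means every delivery flows through the Shard, which refreshes the entity's idle timestamp.

using System;
using Akka.Actor;
using Akka.Cluster.Sharding;

public static class KeepAliveScheduling
{
    // Assumes sharding has already been started elsewhere for a "devices" region.
    public static void ScheduleTickThroughRegion(ActorSystem system)
    {
        var region = ClusterSharding.Get(system).ShardRegion("devices");

        // Schedule to the region, NOT to the entity's own IActorRef, so the message
        // is routed through the Shard and counts as entity activity.
        system.Scheduler.ScheduleTellRepeatedly(
            TimeSpan.FromSeconds(30),  // initial delay
            TimeSpan.FromSeconds(30),  // interval
            region,                    // receiver
            new ShardEnvelope("device-42", new Tick()),
            ActorRefs.NoSender);
    }
}

public sealed record ShardEnvelope(string EntityId, object Message); // hypothetical extractor envelope
public sealed record Tick; // hypothetical recurring message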

Kinda also worth noting, on a similar note: another 'DevEx' problem I have seen frequently is misunderstanding/mis-grokking of default vs. shutdown messages from the region and how to deal with them.


All of that aside, I feel like there's some merit to "Akka.NET is a toolkit" while remembering "Akka.NET can give you a finger-guard". I'm not certain whether that means extension-method APIs to make common cases easier to handle, 'custom' actors exposing similar APIs in a DSL, or even simply better docs + examples.

I suppose my other questions would be around 'how' to do it and the complexity of flipping it on/off, as well as the risk of heisenbugs, since I can only assume this adds at least one more layer of complexity.
