Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate ways to impose partial ordering of messages sent to a chain #11

Open
5 tasks
Tracked by #31
soareschen opened this issue Jul 28, 2023 · 2 comments
Open
5 tasks
Tracked by #31

Comments

@soareschen
Copy link
Collaborator

soareschen commented Jul 28, 2023

Summary

Find out if it is possible for a chain context to determine if an opaque message requires waiting for another message to be sent first before it can be sent.

Problem Definition

In relayer-next informalsystems/hermes#2478, we employ massively parallel relaying by spawning many async tasks for each relaying operation. This concurrency architecture is based on the assumption that each relaying operation can be executed in parallel independent of each other.

However, there are several cases of relaying operations depending on each others, and that makes it difficult to parallelize such operations. Following are some of the use cases:

Ordered Channels

When relaying packets from ordered channels, a packet needs to wait for all other packets with a lower sequence to be relayed first. If multiple async tasks are generating IBC messages for the ordered channel at the same time, they must coordinate to ensure that IBC messages for the packets with lower sequence must be sent first.

Solomachine Relaying #2924

In solomachine relaying, the Cosmos chain rely on using a sequence stored in the client state to verify the IBC messages. In order for a proof to be valid, the signed data must have an exact match of the sequence stored in the client state. If multiple async tasks are generating IBC messages from a solomachine to a Cosmos chain, they must coordinate to ensure that the IBC messages with a lower sequence must be sent first.

Naive Workaround

A naive workaround for the above problem is to limit the parallelism, and allow only one async task at a time for relaying operations that are dependent on each other. However, this can significantly limit the performance of the relaying, as only one operation can be performed per chain height.

Note that even with the naive workaround, it is still possible to have concurrent relaying on unrelated operations. For example, there can be 3 async tasks that perform relaying of unordered packets, ordered packets, and solomachine packets, all to the same chain at the same time. And that is fine, because these operations are not dependent of each other.

Action Item

This issue would track the work to find a general way to resolve the task dependency problem. At a high level, we would need to design synchronization primitives such that the async tasks can wait for other tasks to complete some operation. Furthermore, the synchronization cannot just end after the previous message is sent as a transaction, as that would be too late to achieve any concurrency. Instead, we would need dedicated tasks to sequentialize the other tasks for generating messages, but then collect the ordered messages to be sent in the same batch.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@soareschen
Copy link
Collaborator Author

Another complication to consider is that multiple partial ordering subsystems may affect each other and introduce deadlock.

Consider a case where a worker task for sending an ordered channel packet with sequence 2 across a solomachine channel, followed by a worker task of the same ordered channel sending packet sequence 1. This would result in a deadlock where the first worker acquired the lock for solomachine, the second work acquired the lock for the ordered channel, but none could acquire both locks to proceed.

@soareschen
Copy link
Collaborator Author

soareschen commented Jul 29, 2023

A possible way to solve this is to take advantage of context-generic programming and employ message polymorphism to differentiate different messages.

Recall that similar thing has already done for the HasEventType::Event abstract type. There are traits such as HasWriteAckEvent which allows extraction of a HasWriteAckEvent::WriteAckEvent type from a HasEventType::Event type. On the other hand, we currently only have a HasMessageType::Message type that is used to represent all possible messages.

In the new design, we will introduce new abstract message types such as HasRecvPacketMessage<Counterparty>::RecvPacketMessage, which would be different for each counterparty. Following that, the message sender traits will also be changed from CanSendIbcMessages<Target> to CanSendIbcMessages<Target, Message>. With the changes, a packet relayer that needs to send a RecvPacket message would require the relay context to implement a trait like CanSendIbcMessages<Target, TargetChain::RecvPacketMessage>.

With this new design, we will be able to provide different concrete message types and concrete message senders depending on the situation. At the same time, if there is no need to handle special case, it is also possible to default all abstract messages back to the same concrete message type. For instance, in a Cosmos-to-Cosmos implementation, both abstract types <CosmosChain as HasMessageType>::Message and <CosmosChain as HasRecvPacketMessage<CosmosChain>>::RecvPacketMessage would have the same concrete type, which is CosmosMessage. In such case, there is also only one message sender that needs to be implemented, which is CanSendIbcMessages<Target, CosmosMessage>.

On the other hand, we will also able special case the implementation of solomachine-to-cosmos relaying, such that <CosmosChain as HasRecvPacketMessage<Solomachine>>::RecvPacketMessage would have the concrete type SolomachineToCosmosMessage. In this case, the packet relayer for solomachine-to-cosmos relaying would send messages to a special sink implemented for CanSendIbcMessages<Target, SolomachineMessage>. We can then spawn a dedicated worker task that sequentialize and collects the SolomachineMessages, and order them based on the sequence number. The nice thing about this design is that there is no need for dynamic dispatch or dynamic cast, as everything is static typed!

Similarly, for ordered packet relaying, there will be special traits like HasOrderedRecvPacketMessage<Counterparty>::OrderedRecvPacketMessage, which distinguishes from the unordered RecvPacket message. In this case, even for cosmos-to-cosmos relaying, the implementation <CosmosChain as HasOrderedRecvPacket<CosmosChain>>::OrderedRecvPacket would have a special concrete type like CosmosOrderedRecvPacketMessage. This would in turn allow sending of ordered packet messages to special sinks as implemented for CanSendIbcMessages<Target, CosmosOrderedRecvPacketMessage>

Lastly, the constraint composition allows us to take care of doubly ordered messages such as <CosmosChain as HasOrderedRecvPacketMessage<SolomachineChain>>::OrderedRecvPacket, which would have a concrete type like SolomachineToCosmosOrderedRecvPacketMessage. We would then have to implement dedicated worker tasks for that, but also coordinate with the more generalized worker task for SolomachineToCosmosMessage in order to avoid any possible deadlock.

All in all, the relayer implementation is going to be more complicated because of this. However, this may be an unavoidable complexity, as regardless of which programming technique we use, we would still need to write special cases when handling the sending of different kinds of messages. Nevertheless, the use of context-generic programming provides the flexibility for the implementation to either use the same concrete type for all abstract message types when there is no special case, or to use as many concrete type for each abstract message type as the number of special cases needed.

A key downside of this approach is that it requires the full generality as provided by context-generic programming. This means that it is no longer possible to use the all-in-one traits to represent everything uniformly. The all-in-one traits do not allow customization or handling of special cases. If we modify the all-in-one traits to support the special cases, it would no longer be a special case, and instead become a general case that needs to be handled by all implementations, even if they do not have a special case. However, this may be a good opportunity for us to abandon the all-in-one traits, which was originally intended to ease development, but may be proven to be too restrictive.

@soareschen soareschen transferred this issue from informalsystems/hermes Sep 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🩹 Triage
Development

No branches or pull requests

1 participant