Implementing a TLA+ Specification: EWD998Chan #75

lemmy · 2023-04-10T21:50:00Z

Implementation Process `0d341ee`

I implemented the EWD998 algorithm based on the TLA+ specification EWD998Chan. The implementation took two hours, during which I did not run or debug the code. Copy-pasting snippets from the specification to the Java code made the translation straightforward, with the exception of developing the low-level network receiver thread, which is abstracted away in the specification.

Disclaimer: Initially, the code utilized a custom wire format, which I later replaced with JSON to eliminate error-prone ad-hoc parsing.

[Divergence] Debugging and Error Resolution `2e24bf9`

When running the implementation for the first time, I discovered a bug. A node that receives a token must retain the token across the while loop. Resolving this translation error took approximately 15 minutes and required executing the code using a debugger.

Adding JSON Logging `69f2708`

In order to add JSON logging, I identified the appropriate places to log where the system is in a consistent state. Logging occurs at two points in the implementation:

When a message is received. A message is conceptually received, not when the receiver thread accepts the message from the ServerSocket, but when the message is taken out of the inbox. Note that logging in the receiver thread would break the happen-before relationship of some log statements due to multithreading, leading to inconsistencies.
When a message is sent.

Trace Validation `5e0b9f5`

The trace validation mapping, as suggested by Ron Pressler, is not well-suited for validating a trace that is extremely partial. EWD998's trace neither has the values of all variables nor logs every state of the system.

Using non-determinism to axiomatically define the missing values tends to become impractical, as it quickly results in a TLA+ fragment that TLC cannot handle. For example, consider the following sub-formula of the next-state relation where the first conjunct defines an infinite set of functions, from which the second conjunction filters an infinite subset:

SomeSubAction ==
    /\ ...
    /\ inbox' \in [Node -> Seq(Msg)]
    /\ \E idx \in DOMAIN inbox[Initiator]': 
            inbox[Initiator][idx]' = Trace[TLCGet("level")].msg
    /\ ...

A constructive definition, on the other hand, is cumbersome and, thus, error-prone in TLA+. My workaround involves defining TLC state and action constraints that match each line in the trace to the current state of a prefix of some behavior defined by the original, high-level specification. A trace is accepted iff TLC can match every line in the trace to the corresponding state in a (prefix of) a high-level spec behavior.

Absence of Global Clock `61d6be7`

A fundamental problem of distributed systems is the absence of a global clock, which is why the trace/log file may be causally unordered. For example, the receiver of a message might log the receipt of a message before the sender logs the corresponding send event. The canonical solution in distributed systems is a vector clock, which is modeled in EWD998ChanID to interface with ShiViz (https://bestchai.bitbucket.io/shiviz).

Updating JSON Logging for ShiViz Compatibility `7897628`

I included the node's identity in the JSON logging to ensure compatibility with ShiViz. The following regex was used:

^{"event":"(<|>|d)","node":(?<host>[0-9]?),"pkt":{("snd":(?<src>[0-9]?),"rcv":(?<rcv>[0-9]?),"msg":(?<event>({"type":"tok","q":-?[0-9]*,"color":"(white|black)"}|{"type":"(pl|trm|w|d)"})),|)"vc":(?<clock>.*)}}

Introducing Stuttering Steps for Token Passing `98c4d40`

In the high-level TLA+ specification EWD998Chan, token passing is modeled as a single atomic step. However, in real-world systems, token passing is not atomic and is transmitted via the network. To address this issue, I introduced explicit stuttering steps, where variables remain unchanged, at the price of the length of the behavior increasing.

Handling Unlogged Deactivation Events `ad8988f`

Trace validation faces challenges such as the absence of node deactivation events in the logs, which leads to a rejected trace when a seemingly active node passes the token. The limitations of TLC in supporting the composition of actions (as discussed in section 7.3, pages 76ff, of Lamport's Specifying Systems) hinder the TraceSpec from defining its next-state relation by conjoining Deactivate \cdot PassToken. However, this limitation can be overcome by defining a suitable action that deactivates the node when it passes the token.

Related: tlaplus/tlaplus#805

Improving Trace Validation Coverage `55e9a64`

The previous step demonstrated how manually composed actions can handle missing state changes in the trace or log file, and I see no reason why this approach won't generalize beyond this specific example. In the case of the EWD998 implementation, it is possible to consistently log when a node deactivates, thereby increasing the coverage of trace validation by including more system state.

[Divergence] Handling Termination Messages in Trace Validation `7f157c8`

Trace validation starts to reject traces due to the presence of termination ("trm") messages in the implementation, which are not included in the EWD998 model. This discrepancy between the specifications and the actual code could have been easily identified through a manual comparison between the two. To resolve this issue, I define the TraceNextConstraint to allow for "trm" messages and base it on finite stuttering, similar to the way the IsRecvToken is handled.

[Divergence] SendMsg Action `ff7a49a`

Trace validation detects another divergence between the high-level specification and the implementation. The SendMsg action in the high-level specification does not allow a message to be sent from a node to itself. Although self-messaging is safe and does not violate the properties of the specification, I choose to modify the implementation to match the specification more closely.

[Divergence] Single node configuration `75673c4`

Trace validation identified a divergence between the specification and the code implementation. In a system with only one node, the token does not appear to be moving. To handle this edge case, I have adjusted the IsInitiateToken predicate to account for this scenario.

Logging Exceptions `27d721c`

In the code implementation, all exceptions are logged as special events in the log file. To maintain consistency with the specification, I have ensured that exceptions are logged to the trace and cause the trace validation to fail.

[Divergence] Trace Validation Divergence `5665744`

Upon performing trace validation, I discovered another divergence after generating approximately 10,000 states within ~30 seconds. The initiator initiates another token round even when the previous token round is conclusive. This issue occurs when the system is in a state with all nodes being white and inactive except for an active, but white, initiator that owns the token.

[Divergence] Final Implementation `4da7373`

I had to revert the (bogus) bugfix attempt above. To address the issue, I have now implemented the EWD998 specification correctly, ensuring that the divergence is eliminated and the system functions as expected. Trace validation no longer finds divergences.

Note to self: Random stuff at /Users/markus/src/TLA/_specs/examples/specifications/ewd998 [mku-TraceValidation]

The implementation took two hours and did not involve running the code. Disclaimer: The code first used a home-grown wire format that we replaced with Json to remove error-prone ad-hoc parsing.

The resolution of this translation error took approximately 15 minutes and required the execution of the code using a debugger. [Bug]

Notice that logging happes at two points in time when the system is in a consistent state; a) when a message is received^1 and b) when a message is sent. 1 A message is conceptually received, not when the receiver thread accepts the message from the ServerSocket, but when the message is taken out of the inbox. While it might be tempting to log in the receiver thread, it would break the happen-before relationship of some log statements. For example, it could appear as if a node first received a payload message that activated the node, and it then passes the token (which a node is only allowed to do when it is inactive).

The approach by Ron Pressler [1] does not work well for validating a trace that is extremely partial, i.e., it only contains messages that cross the network. EWD998's trace neither has the values of all variables nor logs every state of the system. Using non-determinism to define the missing values and states is not practical, as we end up in a TLA+ fragment that TLC cannot handle, such as the following sub-formula of the next-state relation, where the first conjunct defines an infinite set of functions from which the second conjunction filters an infinite subset of functions. ```tla SomeAct == /\ ... /\ inbox' \in [Node -> Seq(Msg)] /\ \E idx \in DOMAIN inbox[Initiator]': inbox[Initiator][idx]' = Trace[TLCGet("level")].msg /\ ... ``` A workaround is to define TLC state and action constraints that match each line in the trace to the current state of a prefix of some behavior defined by the original spec. The trace can only be mapped to a behavior iff TLC can match every line in the trace to a state in the prefix. If not, a TLC postcondition will be violated the unmatched logline. [1] https://www.youtube.com/watch?v=TP3SY0EUV2A

…al clock, which is why the trace/log file is unordered. For example, the receiver of a message might log the receipt of a message before the sender logs the send event. The canonical solution in distributed systems is a vector clock. Note that EWD998ChanID models a vector clock to interface with https://bestchai.bitbucket.io/shiviz/

… ShiViz (https://bestchai.bitbucket.io/shiviz). Regex: ^{"event":"(<|>|d)","node":(?<host>[0-9]?),"pkt":{("snd":(?<src>[0-9]?),"rcv":(?<rcv>[0-9]?),"msg":(?<event>({"type":"tok","q":-?[0-9]*,"color":"(white|black)"}|{"type":"(pl|trm|w|d)"})),|)"vc":(?<clock>.*)}}

…98Chan models token passing as a single, atomic step, but in a real system, passing the token is not atomic. The token is transmitted via the network. To handle this problem, this technique introduces (explicit) stuttering steps, where the variables remain unchanged, but the length of the behavior increases.

…nodes deactivating are not logged in the implementation. This leads to a rejected trace when a seemingly active node passes the token. TLC's limitation in supporting the composition of actions (compare section 7.3 (page 76ff) of Lamport's Specifying Systems) prevents the TraceSpec from defining its next-state relation by conjoining Deactivate \cdot PassToken. However, this limitation can be overcome by defining a suitable action that deactivates the node when it passes the token.

… missing state changes in the trace/log file, and we see no reason why this approach won't generalize beyond this particular example. With the EWD998 implementation, however, we can (consistently) log when a node deactivates, thereby increasing the coverage of trace validation by including more system state.

…rmination ("trm") messages in the implementation, which are not included in the EWD998 model. This represents the first discrepancy that has been discovered between the specifications and the actual code, but it is something that could have easily been spotted through a manual comparison between the two. To resolve this issue, the TraceNextConstraint has been defined to allow for "trm" messages and is based on finite stuttering, similar to the way the IsRecvToken is handled.

The SendMsg action in the high-level specification does not allow a message to be sent from a node to itself. While self-messaging is safe and does not violate the properties of the specification, the implementation will be changed instead.

…nerating approximately 10k states. The initiator initiates another token round even when the previous token round is conclusive. The system is in this state after a token round with all nodes white and inactive except for an active but white initiator. This is a first attempt at a bugfix. ``` << "Failed matching the trace to (a prefix of) a behavior:", [ pkt |-> [ vc |-> [ 1 |-> 342, 3 |-> 300, 0 |-> 216, 2 |-> 341, 5 |-> 313, 4 |-> 313, 6 |-> 320, 7 |-> 413 ], msg |-> [type |-> "pl"], snd |-> 0, rcv |-> 7 ], event |-> "<" ], "TLA+ debugger breakpoint hit count 2576" >> FALSE Error: Assumption line 212, col 5 to line 215, col 84 of module EWD998ChanTrace is false. ... ```

after generating approximately 10k states." and finally implement the EWD998 specification.

Signed-off-by: Markus Alexander Kuppe <[email protected]>

… EWD998Chan where possible. Signed-off-by: Markus Alexander Kuppe <[email protected]>

…allowed ('config' was unexpected) [Build]

lemmy added 19 commits April 10, 2023 14:29

Implementation of EWD998 based on the TLA+ spec EWD998Chan.

0d341ee

The implementation took two hours and did not involve running the code. Disclaimer: The code first used a home-grown wire format that we replaced with Json to remove error-prone ad-hoc parsing.

This node possesses the token, and therefore must retain it unchanged.

2e24bf9

The resolution of this translation error took approximately 15 minutes and required the execution of the code using a debugger. [Bug]

Run EWD998 trace validation as part of a GitHub action (CI/CD).

b501a91

Trace validation detects another divergence.

ff7a49a

The SendMsg action in the high-level specification does not allow a message to be sent from a node to itself. While self-messaging is safe and does not violate the properties of the specification, the implementation will be changed instead.

A system with *one* node won't see the token moving.

75673c4

Minor optimization: Prevent TLC from generating all inital states.

3ec089e

Log exceptions to the trace and make them fail a trace validation.

27d721c

Add a few leads/ideas for future improvements.

bb52145

Revert "Trace validation found another divergence within ~30 seconds

4da7373

after generating approximately 10k states." and finally implement the EWD998 specification.

Ignore Toolbox files

3dc8465

lemmy added the enhancement label Apr 10, 2023

lemmy force-pushed the mku-ewd998_impl branch 5 times, most recently from a3304ed to fcae11a Compare April 12, 2023 03:54

Add EWD998ChanTrace.tla spec to CI.

15a360d

Signed-off-by: Markus Alexander Kuppe <[email protected]>

lemmy force-pushed the mku-ewd998_impl branch from fcae11a to 15a360d Compare April 12, 2023 04:05

Refactor (shorten) EWD998ChanTrace by reusing existing sub-actions of…

72e6509

… EWD998Chan where possible. Signed-off-by: Markus Alexander Kuppe <[email protected]>

lemmy force-pushed the mku-ewd998_impl branch from 448d36e to 72e6509 Compare April 24, 2023 19:07

Merge branch 'tlaplus:master' into mku-ewd998_impl

7a77cf3

Merge branch 'master' into mku-ewd998_impl

f6914f9

lemmy mentioned this pull request Dec 13, 2023

e2e testing approach eatonphil/raft-rs#1

Open

lemmy added 2 commits April 5, 2024 14:49

Merge branch 'master' into mku-ewd998_impl

c8c9159

jsonschema.exceptions.ValidationError: Additional properties are not …

88faa37

…allowed ('config' was unexpected) [Build]

lemmy force-pushed the mku-ewd998_impl branch from 9d13fc4 to 51cd716 Compare April 6, 2024 00:39

Mirror PR into markdown note.

115c50c

lemmy force-pushed the mku-ewd998_impl branch from 51cd716 to 115c50c Compare April 6, 2024 00:46

lemmy marked this pull request as ready for review April 6, 2024 00:59

lemmy merged commit 9de4025 into tlaplus:master Apr 6, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementing a TLA+ Specification: EWD998Chan #75

Implementing a TLA+ Specification: EWD998Chan #75

lemmy commented Apr 10, 2023 •

edited

Loading

Implementing a TLA+ Specification: EWD998Chan #75

Implementing a TLA+ Specification: EWD998Chan #75

Conversation

lemmy commented Apr 10, 2023 • edited Loading

Implementation Process 0d341ee

[Divergence] Debugging and Error Resolution 2e24bf9

Adding JSON Logging 69f2708

Trace Validation 5e0b9f5

Absence of Global Clock 61d6be7

Updating JSON Logging for ShiViz Compatibility 7897628

Introducing Stuttering Steps for Token Passing 98c4d40

Handling Unlogged Deactivation Events ad8988f

Improving Trace Validation Coverage 55e9a64

[Divergence] Handling Termination Messages in Trace Validation 7f157c8

[Divergence] SendMsg Action ff7a49a

[Divergence] Single node configuration 75673c4

Logging Exceptions 27d721c

[Divergence] Trace Validation Divergence 5665744

[Divergence] Final Implementation 4da7373

lemmy commented Apr 10, 2023 •

edited

Loading

Implementation Process `0d341ee`

[Divergence] Debugging and Error Resolution `2e24bf9`

Adding JSON Logging `69f2708`

Trace Validation `5e0b9f5`

Absence of Global Clock `61d6be7`

Updating JSON Logging for ShiViz Compatibility `7897628`

Introducing Stuttering Steps for Token Passing `98c4d40`

Handling Unlogged Deactivation Events `ad8988f`

Improving Trace Validation Coverage `55e9a64`

[Divergence] Handling Termination Messages in Trace Validation `7f157c8`

[Divergence] SendMsg Action `ff7a49a`

[Divergence] Single node configuration `75673c4`

Logging Exceptions `27d721c`

[Divergence] Trace Validation Divergence `5665744`

[Divergence] Final Implementation `4da7373`