Name	Name	Last commit message	Last commit date
parent directory ..
README_files/figure-gfm	README_files/figure-gfm
impl	impl
AsyncTerminationDetection.cfg	AsyncTerminationDetection.cfg
AsyncTerminationDetection.tla	AsyncTerminationDetection.tla
AsyncTerminationDetection_proof.tla	AsyncTerminationDetection_proof.tla
EWD998.cfg	EWD998.cfg
EWD998.gif	EWD998.gif
EWD998.pdf	EWD998.pdf
EWD998.tla	EWD998.tla
EWD998Chan.cfg	EWD998Chan.cfg
EWD998Chan.tla	EWD998Chan.tla
EWD998ChanID.cfg	EWD998ChanID.cfg
EWD998ChanID.tla	EWD998ChanID.tla
EWD998ChanID_export.cfg	EWD998ChanID_export.cfg
EWD998ChanID_export.tla	EWD998ChanID_export.tla
EWD998ChanID_shiviz.cfg	EWD998ChanID_shiviz.cfg
EWD998ChanID_shiviz.tla	EWD998ChanID_shiviz.tla
EWD998ChanTrace.cfg	EWD998ChanTrace.cfg
EWD998ChanTrace.md	EWD998ChanTrace.md
EWD998ChanTrace.ndjson	EWD998ChanTrace.ndjson
EWD998ChanTrace.tla	EWD998ChanTrace.tla
EWD998Chan_opts.tla	EWD998Chan_opts.tla
EWD998PCal.cfg	EWD998PCal.cfg
EWD998PCal.tla	EWD998PCal.tla
EWD998Small.cfg	EWD998Small.cfg
EWD998_anim.tla	EWD998_anim.tla
EWD998_opts.R	EWD998_opts.R
EWD998_opts.gif	EWD998_opts.gif
EWD998_opts.tla	EWD998_opts.tla
EWD998_optsSC.tla	EWD998_optsSC.tla
EWD998_opts_pt3.gif	EWD998_opts_pt3.gif
EWD998_opts_pt4.gif	EWD998_opts_pt4.gif
EWD998_proof.tla	EWD998_proof.tla
README.Rmd	README.Rmd
README.md	README.md
SmokeEWD998.cfg	SmokeEWD998.cfg
SmokeEWD998.tla	SmokeEWD998.tla
SmokeEWD998_SC.cfg	SmokeEWD998_SC.cfg
SmokeEWD998_SC.tla	SmokeEWD998_SC.tla
SmokeEWD998_SC1651624910.csv	SmokeEWD998_SC1651624910.csv
TLAPS.tla	TLAPS.tla
Utils.tla	Utils.tla

EWD998

Markus A Kuppe (2022-05-03)

Statistics
Simulation
Analytical worst-case complexity
Randomized State Exploration (“Smoke Testing”)

The specification in this directory models termination detection in a ring as given by Shmuel Safra in EWD 998. EWD 998 is an extension of a simpler algorithm described in EWD 840. Compared to EWD 840, this algorithm supports asynchronous message delivery. However, it still assumes reliable message channels and non-faulty nodes. For TLA+ learners, it is best to study EWD840 first and then read the TLA+ spec of EWD998.tla.

For readers familiar with TLA+:

EWD998.tla refines the abstraction AsyncTerminationDetection under the refinement mapping given in EWD998.tla. This mapping has been model-checked with TLC. Additionally, the spec AsyncTerminationDetection_proof proves the main safety properties for module AsyncTerminationDetection, and refinement of SyncTerminationDetection.tla. Elsewhere, SyncTerminationDetection is refined by EWD840.tla.

SyncTerminationDetection < EWD840
                         < AsyncTerminationDetection 
                              < EWD998
                                  < EWD998Chan
                                      < EWD998ChanID

The spec EWD998Chan.tla refines EWD998.tla by adding inboxes. EWD998ChanID.tla refines EWD998Chan.tla and replace natural numbers as node identifiers with hostnames (arbitrary strings actually). EWD998ChanID.tla also shows how to model vector clocks in TLA+, which EWD998ChanID_shiviz.tla exploits to visualize traces with Shiviz. The spec EWD998ChanID_export.tla demonstrates how to export TLC traces to Json. Below is an animation of the termination detection process for six nodes. It was created with EWD998_anim.tla.

Statistics

Because of its simplicity, EWD998 has been chosen to explore the idea of inferring statistical properties by simulating TLA+ specs (with TLC’s simulator). In a nutshell, we randomly generates traces for real-world configurations (node numbers) or workloads (global state of the system when terminated), and predict the runtime behavior of the system. Such properties cannot be expressed with TLA+ because they are true or false of a set of behaviors; not a single behavior. If random simulation is good enough in practice to predict statistical properties, it could help short-circuit modeling of an algorithm that currently requires empirical analysis of an algorithm’s implementation. In other words, traditionally we would model EWD998 in TLA+ and check safey and liveness properties up to proving correctness. Then, we would implement EWD998 in a suitable programming language and analyze its statistical properties by running the implementation. If the statistical properties are unsatisfactory, we rinse and repeat and go back and optimize the algorithm.

In the case of EWD998, we pretend that we wish to study the efficiency of four different variants of the original termination detection algorithm:

“pt1”: An active node may pass the token if the node is black/tainted.
“pt2”: An active node may pass the token if the token is black/tainted.
“pt3”: Return the token to the initiator, thus, abort the token round, iff the node is black.
“pt4”: Return the token to the initiator, thus, abort the token round, iff the token is black.

pt3 and pt4 can be seen as “aborting” an inconclusive token round by directly returning the token to the initiator. The two variants come at the cost of all nodes knowing the identify of the initiator. However, this could be addressed by stamping the initiator’s id onto the token under the assumption that the underlying network allows for all nodes to send a token to the initiator (i.e. not a ring).

The way we are going to measure efficiency is by measuring the number of steps between termination of Environment and the detection of termination by System (see EWD998.tla for their definitions).

Simulation

In this section, we outline the simluation of the TLA+ specs and discuss our results.

The spec EWD998_opts.tla extends module EWD998 (EWD998Chan actually) and, depending on the given “feature flags” F, enables the variants pt1 to pt4 in the sub-action PassTokenOpts:

PassTokenOpts(n) ==
  /\ n # 0
  /\ \/ ~ active[n]
     \/ /\ "pt1" \in F
        /\ color[n] = "black"
     \/ /\ "pt2" \in F
        /\ \E j \in 1..Len(inbox[n]) : inbox[n][j].type = "tok" /\ inbox[n][j].color = "black"
  /\ \E j \in 1..Len(inbox[n]) : 
          /\ inbox[n][j].type = "tok"
          /\ LET tkn == inbox[n][j]
             IN  inbox' = [inbox EXCEPT ![CASE "pt3" \in F /\ color[n] = "black" -> 0
                                            [] "pt4" \in F /\ tkn.color ="black" -> 0
                                            [] OTHER    ->  n-1] = 
                                       Append(@, [tkn EXCEPT !.q = tkn.q + counter[n],
                                                             !.color = IF color[n] = "black"
                                                                       THEN "black"
                                                                       ELSE tkn.color]),
                                    ![n] = RemoveAt(@, j) ]
  /\ color' = [ color EXCEPT ![n] = "white" ]
  /\ UNCHANGED <<active, counter>>

The TLC action constraint AtTermination sets the first element of the list equal to TLCGet to the ordinal of the state in which all nodes are inactive and no messages are in-flight.

AtTermination ==
    IF EWD998!Termination # EWD998!Termination'
    THEN TLCSet(1, TLCGet("level"))
    ELSE TRUE

The TLC state constraint AtTerminationDetected resets the first element of the list TLCGet, after writing a record to the CSV file defined by the environment variable IOEnv.Out. AtTerminationDetected also writes the occurrences of all sub-actions in the current behavior to the CSV file (note new TLCGet("stats").behavior.actions), as well as the values of the constants F and N.

AtTerminationDetected ==
    EWD998!terminationDetected =>
    /\ LET o == TLCGet("stats").behavior.actions
       IN /\ CSVWrite("%1$s#%2$s#%3$s#%4$s#%5$s#%6$s#%7$s#%8$s#%9$s",
               << F, N, TLCGet("level"), TLCGet("level") - TLCGet(1),
                 o["InitiateProbe"],o["PassTokenOpts"], \* Note "Opts" suffix!
                 o["SendMsg"],o["RecvMsg"],o["Deactivate"]
                 >>,
               IOEnv.Out)
          /\ TLCSet(1, 0)

There is more boilerplate and re-definitions (SomeRingOfNodes, InitSim, …) to help TLC in EWD99_opts.tla, but none are relevant with regards to statistics. If we wouldn’t be studying pt1 and pt2, we could rewrite InitSim and SpecOpts to start in states where all nodes are inactive and no Environment steps are possible.

We then generate a reasonable number of traces for all elements of (SUBSET {pt1,pt2,pt3,pt4}) \X {7,29,43}. This is done in the TLA+ “script” EWD998_optsSC.tla. The average number of token passes are plotted in the barplots below.

Note that the underlying R code can be found in README.Rmd, from which this file is generated on a host with all libraries installed via the command Rscript -e "rmarkdown::render('README.Rmd')".

Average length of suffix for which `terminated /\ ~terminationDetected` holds

All three barplots above confirm that the average length of the suffix, for which terminated /\ ~terminationDetected is true, is consistently higher for feature flag subsets that include the pt4 variant and highest for the pt3 variant. Additionally, we see that the signal gets stronger for larger numbers of nodes. This could indicate that the small-scope hypothesis does not transfer to empirical measurements.

Plotting the average occurrences of the spec’s sub-actions shows how the InitiateProbe and, consequently (for each InitiateProbe action there are N PassToken actions), PassToken actions increase for variants pt3 and pt4. The higher number of SendMsg/RecvMsg actions for the original spec is likely statistical noise because it’s limited to N=7.

Average number of sub-actions per behavior

To conclude this section, differential analysis along multiple dimensions (number of nodes, spec variants) appears to be useful for predicting statistical properties of algorithmic variants. The occurrences of sub-actions alone, hints at properties that can be studied further by instrumenting a spec as shown above. However, even wall-clock time to simulate a fixed number of behaviors over the two dimensions number of nodes and spec variants alone, indicated the inferior behavior of pt3–running the simulator took forever to the point that it seemed stuck for pt3 with >100 nodes.

Running statistical analysis at the algorithm/spec level should generally provide a more robust signal because many sources of noise that usually distort measurements at the implementation level do not exist. Related, we don’t fully take advantage of studying properties above the code/implementation level, because we still let the simulator generate behaviors satisfying the original behavior spec that allows Environment steps. It would be faster to generate behavior by defining the set of initial states to be all states in which terminated /\ ~terminationDetected holds. Unfortunately, this is infeasible because the TLC simulator generates all initial states upfront. For increasing numbers of nodes, the cardinality of this set becomes astronomical (under a suitable state constraint to bound the message counters). The Randomization.tla module, otherwise used to validate inductive invariants, is a possible workaround.

Analytical worst-case complexity

With the help of the animation of the original algorithm below, we can deduce that the number of token passes is at most 3N with N the number of nodes (we start termination detection in a state where all nodes have terminated and are black, and the token is at node N-2). The original algorithm requires two inconclusive token rounds to reset all nodes to white and a final, conclusive round to detect termination. In short, the number of required token passes after termination is linear in the size of the ring.

On the other hand, the animation below reveals that the variant pt3 causes the inconclusive token rounds to be aborted N times, which gives us ((n+1)^2 + (n+1) / 2) - 1 as the overall worst-case number of token passes.

Variant pt4 has a slightly better worst-case behavior, but it still requires ((N+1)/2)^2+N when N is odd, and ((N+2)/2)^2+N-1 token passes when N is even.

(pt1 and pt2 have no relevance on the number of token passes between termination and termination detection, because they apply to active nodes only. More importantly, with pt1 and pt2 active, the enablement of sub-action PassTokenOpts changes increasing the probability of the simulator choosing PassTokenOpts to extend the current trace.)

Randomized State Exploration (“Smoke Testing”)

Violations found checking SmokeInit /\ [Next]_vars /\ WF_vars(System) => []Inv /\ ATDSpec with randomized state exploration (“simulation”) starting from a small, randomly chosen subset of the initial states.

README.md is generated from README.Rmd on a host with all libraries installed via:

Rscript -e "rmarkdown::render('README.Rmd')"

Install required libraries and R packages (on macOS) with:

brew install pandoc r
Rscript -e "install.packages(c('rmarkdown', 'ggplot2','dplyr', 'here'), repos='http://cran.us.r-project.org')"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ewd998

ewd998

README.md

EWD998

Statistics

Simulation

Average length of suffix for which `terminated /\ ~terminationDetected` holds

Average number of sub-actions per behavior

Analytical worst-case complexity

Randomized State Exploration (“Smoke Testing”)

README.md is generated from README.Rmd on a host with all libraries installed via:

Install required libraries and R packages (on macOS) with:

Files

ewd998

Directory actions

More options

Directory actions

More options

Latest commit

History

ewd998

Folders and files

parent directory

README.md

EWD998

Statistics

Simulation

Average length of suffix for which terminated /\ ~terminationDetected holds

Average number of sub-actions per behavior

Analytical worst-case complexity

Randomized State Exploration (“Smoke Testing”)

README.md is generated from README.Rmd on a host with all libraries installed via:

Install required libraries and R packages (on macOS) with:

Average length of suffix for which `terminated /\ ~terminationDetected` holds