Anduril (old codename: FIR) uses static causal analysis and a novel feedback-driven algorithm to quickly search the enormous fault space for the root-cause fault and timing.
- Requirements
- 0. Install and configure dependencies
- 1. Clone the repository
- 2. Run the main experiments
- 3. Artifact evaluation
OS and JDK:
- Anduril is developed and tested under Ubuntu 18.04 to 20.04 with JDK 8.
- Other systems and newer JDKs may also work.
-
Hardware:
- The basic workflow of Anduril described in this README can be done on a single node.
- Our experiment node uses the CloudLab `c220g5` node type, which has two Intel Xeon Silver 4114 10-core CPUs at 2.20 GHz, 192 GB ECC DDR4-2666 memory, and one 1 TB 7200 RPM 6G SAS HDD.
Software:
- Git (>= 2.16.2, version control)
- Apache Maven (>= 3.6.3, for Anduril compilation)
- Apache Ant (>= 1.10.9, artifact testing only, for ZooKeeper compilation)
- JDK 8 (OpenJDK recommended)
- protobuf (== 2.5.0, artifact testing only, for HDFS compilation)
```bash
sudo apt-get update
sudo apt install git maven ant vim openjdk-8-jdk
sudo update-alternatives --set java $(sudo update-alternatives --list java | grep "java-8")
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
```
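After switching alternatives, it is worth confirming that the selected JVM is actually a JDK 8. A small sketch of such a check follows; the version string is hard-coded for illustration (in practice, capture it with `ver=$("$JAVA_HOME/bin/java" -version 2>&1 | head -n1)`):

```shell
# Sketch: check that a `java -version` banner names a 1.8 (JDK 8) JVM.
is_jdk8() {
  case "$1" in
    *'"1.8'*) return 0 ;;   # e.g. openjdk version "1.8.0_422"
    *)        return 1 ;;
  esac
}

ver='openjdk version "1.8.0_422"'   # hard-coded sample banner
is_jdk8 "$ver" && echo "JDK 8 selected"
```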
If you do not have root permissions, install the dependencies this way:
Rootless installation
```bash
DEP=$HOME/anduril-dep   # modify this path to where you want the dependencies installed
mkdir -p $DEP
cd $DEP
wget https://builds.openlogic.com/downloadJDK/openlogic-openjdk/8u422-b05/openlogic-openjdk-8u422-b05-linux-x64.tar.gz
tar xzvf openlogic-openjdk-8u422-b05-linux-x64.tar.gz
wget https://dlcdn.apache.org/maven/maven-3/3.9.9/binaries/apache-maven-3.9.9-bin.tar.gz
tar xzvf apache-maven-3.9.9-bin.tar.gz
wget https://dlcdn.apache.org//ant/binaries/apache-ant-1.10.14-bin.tar.gz
tar xzvf apache-ant-1.10.14-bin.tar.gz
export PATH=$PATH:$DEP/openlogic-openjdk-8u422-b05-linux-x64/bin:$DEP/apache-maven-3.9.9/bin:$DEP/apache-ant-1.10.14/bin:$DEP/protobuf-build/bin
export JAVA_HOME=$DEP/openlogic-openjdk-8u422-b05-linux-x64
echo "export PATH=$DEP/openlogic-openjdk-8u422-b05-linux-x64/bin:$DEP/apache-maven-3.9.9/bin:$DEP/apache-ant-1.10.14/bin:$DEP/protobuf-build/bin:\$PATH" >> ~/.bashrc
echo "export JAVA_HOME=$DEP/openlogic-openjdk-8u422-b05-linux-x64" >> ~/.bashrc
```
Install protobuf, which is needed for HDFS compilation:
```bash
DEP=$HOME/anduril-dep   # modify this path to where you want the dependencies installed
cd $DEP
wget https://github.com/OrderLab/Anduril/raw/main/systems/protobuf-2.5.0.zip
unzip protobuf-2.5.0.zip
cd protobuf-2.5.0/
autoreconf -f -i -Wall,no-obsolete
./configure --prefix=$DEP/protobuf-build
make -j4
make install
export PATH=$DEP/protobuf-build/bin:$PATH
echo "export PATH=$DEP/protobuf-build/bin:\$PATH" >> ~/.bashrc
protoc --version
```
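When re-running these setup steps, the `export PATH` lines can silently append duplicates. A small sketch of an idempotent variant, assuming the same `$DEP` layout as the snippets above:

```shell
# Sketch: only prepend a directory to PATH if it is not already there.
DEP=${DEP:-$HOME/anduril-dep}

on_path() {
  # Match the directory as a full PATH component, not a substring.
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

on_path "$DEP/protobuf-build/bin" || PATH="$DEP/protobuf-build/bin:$PATH"
on_path "$DEP/protobuf-build/bin" && echo "protobuf bin on PATH"
```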
```bash
git clone https://github.com/OrderLab/Anduril.git
```
This repository contains the evaluated systems, so it is fairly large (around 3.5 GB); make sure you have enough disk space.

There are 22 cases in total. Even though some cases target the same system (e.g., four of the cases are in ZooKeeper), the patched versions may differ significantly, so the compilation, static analysis, and dynamic experiment configurations also differ.
The first step is to compile the system code into classes so that they can be consumed by our static analyzer. The system code lives in `systems/case_name`. Switch to that directory and then run the compilation commands. Besides the system code, we may also need to compile the tests in the system code directory, which serve as the workload for that case.
Since the compilation commands differ across cases, we provide a `compile.sh` script in each case directory that you can invoke. For example:

```bash
cd systems/zookeeper-3006
./compile.sh
```
We also provide a script to compile all cases:

```bash
cd systems
./compile-all.sh
```
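Conceptually, a compile-all style script just walks the `systems/<case>/compile.sh` layout described above. The sketch below illustrates that loop; the two case directories and stub `compile.sh` scripts it creates are placeholders standing in for the real ones in the repository:

```shell
# Sketch: run each case's compile.sh in turn, collecting failures.
work=$(mktemp -d)
for c in zookeeper-3006 hdfs-4233; do
  mkdir -p "$work/systems/$c"
  printf '#!/bin/sh\nexit 0\n' > "$work/systems/$c/compile.sh"
  chmod +x "$work/systems/$c/compile.sh"
done

failed=""
for dir in "$work"/systems/*/; do
  name=$(basename "$dir")
  (cd "$dir" && ./compile.sh) || failed="$failed $name"
done
echo "failed cases:${failed:-none}"
```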
In the second step, the goal is to filter out the important log entries in the failure log.

In `experiments/case_name`, there is a script that runs the workload and collects the logs. We run it twice:

```bash
./run-original-experiment.sh > good-run-log.txt
./run-original-experiment.sh > good-run-log-2.txt
```

Then move the two logs to `ground_truth/case_name`, together with the failure log named `bad-run-log.txt`. There is a script to filter out suspicious log entries:
```bash
# Assume good-run-log.txt, good-run-log-2.txt, and bad-run-log.txt are present
./make_diff.sh
```
The outputs are `diff_log_original.txt`, `diff_log_dd.txt`, and `diff_log_dd_set.txt` in the directory `ground_truth/case_name`. An example of the format:

```
# First column is the class, second is the line number
LeaderRequestProcessor 77
MBeanRegistry 128
ZooKeeperCriticalThread 48
PrepRequestProcessor 965
ClientCnxn$SendThread 1181
AppenderDynamicMBean 209
...
```
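The internals of `make_diff.sh` are not shown here, but the underlying idea can be sketched in a few lines: keep the log entries that appear in the bad run and in neither good run. The log entries below are fabricated samples in the "Class line" format shown above:

```shell
# Sketch of the diffing idea: entries unique to the bad run survive.
tmp=$(mktemp -d)
printf 'MBeanRegistry 128\nPrepRequestProcessor 965\n' > "$tmp/good.txt"
printf 'MBeanRegistry 128\nZooKeeperCriticalThread 48\n' > "$tmp/bad.txt"

# comm -13 prints lines unique to the second (bad-run) file
suspicious=$(comm -13 <(sort "$tmp/good.txt") <(sort "$tmp/bad.txt"))
echo "$suspicious"   # → ZooKeeperCriticalThread 48
```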
Before running static analysis, compile our toolkit first. In `tool/`:

```bash
mvn install -DskipTests
```
The scripts are in the directory `tool/bin`. For case `case_name`, `analyze-${case_name}.sh` will output the causal graph `tree.json` in the directory where you run the script, along with the instrumented class files. A post-processing step then runs on the generated instrumented class files through the scripts in `tool/move`:

```bash
tool/bin/analyze-${case_name}.sh
tool/move/${case_name}.sh
```
For the state-of-the-art baselines:

Static analysis of Fate:

```bash
fate= tool/bin/analyze-${case_name}.sh
tool/move/${case_name}.sh
```

Static analysis of Crashtuner:

```bash
crashtuner= tool/bin/analyze-${case_name}.sh
tool/move/${case_name}.sh
```
All the evaluation happens in the `evaluation/case_name` directory. First, copy in the causal graph produced by static analysis:

```bash
cd evaluation/case_name
cp $DIR_WHERE_YOU_PERFORM_STATIC_ANALYSIS/tree.json .
./update.sh
```
If it is FIR:

```bash
cp fir-trial.sh single-trial.sh
```

Fate:

```bash
cp fate-trial.sh single-trial.sh
```

Crashtuner:

```bash
cp crashtuner-trial.sh single-trial.sh
```
The configuration file is `config.properties`. There is one extra file called `config-template`; we can make the 6 corresponding `config.properties` files from it by appending extra configuration. For example, in `zookeeper-2247`, `config-template` contains:

```properties
flakyAgent.avoidBlockMode=true
flakyAgent.probability=0.05
flakyAgent.timePriorityTable=time.bin
flakyAgent.timeFeedbackMode=min_times
flakyAgent.trialTimeout=90
flakyAgent.recordOnthefly=true
```
The `config.properties` for Full Feedback can be generated through:

```bash
cp config-template config.properties
echo "flakyAgent.feedback=true" >> config.properties
echo "flakyAgent.augFeedback=true" >> config.properties
echo "flakyAgent.occurrenceSize=1" >> config.properties
```

You can refer to `fir-evaluation.sh` for all 6 policies in FIR.
There is one extra file called `config-sota`:

```properties
flakyAgent.trialTimeout=90
flakyAgent.recordOnthefly=true
```

The `config.properties` for either Fate or Crashtuner can be generated through:

```bash
cp config-sota config.properties
```
You can refer to `fate-evaluation.sh` or `crashtuner-evaluation.sh` to see what happens.
If your configuration contains `flakyAgent.timeFeedback=true` or `flakyAgent.augFeedback=true`, a time table is needed:

```bash
./make-depps.sh   # if run from evaluation/case_name
./run-instrumnted-experiment.sh > record-inject
java -jar reporter-1.0-SNAPSHOT-jar-with-dependencies.jar -t trials/ -s tree.json
```
The driver will run the experiments and output the trials into `trials/`. For the trial with index `i`, `injection-$i.json` records the fault injection point while `$i.out` records the system output.

FIR:

```bash
./driver.sh num_trials
```

SOTA:

```bash
./driver-sota.sh num_trials
```
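The driver's core loop can be pictured as running `single-trial.sh` once per trial and capturing the output. The sketch below illustrates that shape with a stub `single-trial.sh` standing in for the real per-case script; it is not the repository's actual driver:

```shell
# Sketch: run num_trials trials, storing each trial's output as trials/$i.out.
tmp=$(mktemp -d) && cd "$tmp"
printf '#!/bin/sh\necho "trial finished"\n' > single-trial.sh
chmod +x single-trial.sh

num_trials=3
mkdir -p trials
i=0
while [ "$i" -lt "$num_trials" ]; do
  ./single-trial.sh > "trials/$i.out"
  i=$((i + 1))
done
ls trials
```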
There are two options. If `check-${case_name}.sh` is in the evaluation directory, use:

```bash
./check-${case_name}.sh trials
```

Otherwise, the check is incorporated into our reporter framework and can be run with:

```bash
java -jar reporter-1.0-SNAPSHOT-jar-with-dependencies.jar -t trials/ -s tree.json
```

We will unify the two soon!
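A per-case check script typically scans the trial outputs for the failure symptom and reports the first reproducing trial. A grep-based sketch of that idea follows; the symptom string and trial outputs are fabricated for illustration:

```shell
# Sketch: find the first trial whose output contains the failure symptom.
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p trials
echo "server started normally"         > trials/0.out
echo "ERROR NullPointerException at X" > trials/1.out

first=$(grep -l "NullPointerException" trials/*.out | sort | head -n1)
echo "first reproducing trial: $first"
```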
The scripts are stored in `evaluation/scripts`. We need three scripts: `fir-evaluation.sh`, `fate-evaluation.sh`, and `crashtuner-evaluation.sh`. `fir-evaluation.sh` is for the first 6 columns, while `fate-evaluation.sh` and `crashtuner-evaluation.sh` are for the SOTA baselines.

Suppose you want to get the row of `case_name`: copy the three scripts into the folder `evaluation/case_name`. The three scripts can be run on three different machines. Before running a script, a few fields need to be edited:
In `fir-evaluation.sh`, `case_name` should be changed to the case you are evaluating. `fir-evaluation.sh` will run the 6 experiments shown in Table II sequentially, and `p1`-`p6` designate how many trials each experiment runs. For example, if you set `p1` to `20`, the first experiment, Full Feedback, would run for 20 trials. A rule of thumb is to set each value to two times the corresponding number in Table II; if it exceeds 2000, decrease it to 2000, or the run cannot finish in one day.

```bash
#!/usr/bin/env bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
case_name=zookeeper-2247
p1=1
p2=1
p3=1
p4=1
p5=1
p6=1
```
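The rule of thumb above (double the Table II number, capped at 2000) can be expressed as a tiny helper. The `cap` function and the loop below are illustrative, not code from `fir-evaluation.sh`:

```shell
# Sketch: per-experiment trial budgets p1..p6, capped at 2000 trials.
cap() { if [ "$1" -gt 2000 ]; then echo 2000; else echo "$1"; fi; }

p1=2400; p2=20   # sample budgets
for p in "$p1" "$p2"; do
  echo "would run $(cap "$p") trials"
done
```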
As for `fate-evaluation.sh` or `crashtuner-evaluation.sh`, there is only one experiment, so only `p1` exists:

```bash
#!/usr/bin/env bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
case_name=zookeeper-2247
p1=1
tool_dir="${SCRIPT_DIR}/../.."
R='\033[0;31m'
G='\033[0;32m'
RESET='\033[0m'
function compile_before_analysis() {
    mvn clean
    mvn install -DskipTests
}
```
Also note that for some cases, the three scripts are already there; you can run them directly, and they serve as good examples for other experiments. They traverse the entire pipeline in Section I, so you can just run a script to get the results:

```bash
./fir-evaluation.sh
./fate-evaluation.sh
./crashtuner-evaluation.sh
```
The index of the first trial in which the case is reproduced will be printed in green:

```bash
echo -e "${G}Full Feedback result:"
./check-${case_name}.sh trials
echo -e "${RESET}"
```
Same idea as Table I. Edit and run `parameter-evaluation.sh` in `evaluation/artifact-evaluation`.