Model-based Player Experience (PX) Testing

1. Summary

This artifact provides the experiments that were carried out in this paper:

  • Saba Gholizadeh Ansari, I. S. W. B. Prasetya, Davide Prandi, Fitsum Meshesha Kifetew, Frank Dignum, Mehdi Dastani, Gabriele Keller, Model-based Player Experience Testing with Emotion Pattern Verification, in the 26th International Conference on Fundamental Approaches to Software Engineering (FASE), 2023.

Experiment subject: a game called Lab Recruits, see: https://github.com/iv4xr-project/labrecruits.

The experiments aim to demonstrate that: (1) emotion heatmaps can be automatically generated for the subject, and (2) a form of player experience requirements can be automatically checked. This is done by first generating test suites from an EFSM model of the subject, and then using a computational model of emotion (see the paper) to produce emotion traces that capture the intensity of different types of emotions throughout the test runs. The traces are then analyzed to produce visualisations and to check the requirements.

Artifact content

The artifact contains the original experiment's raw results, as well as scripts to allow data to be generated from scratch.

Unzipping the artifact should create this directory structure:

px-mbt
 |-eplaytesting-pipeline
 |  |- ...
 |  |- FASE23Dataset
 |-SBTtest
 |-MCtest
 |-Combinedtest
 |-traces
 |-fixedtraces
 |-iv4xrDemo
 |  |-gym
 |-mavenrepo
 |-otherNeededSoftware
  • eplaytesting-pipeline: contains the Java project implementing the method in the paper by Ansari et al. It also contains scripts to re-run the experiments in the paper.

  • eplaytesting-pipeline/FASE23Dataset contains raw results, in the form of test suites and emotion traces, that were generated by the original experiments in the paper. More precisely it contains:

    • Base_LR_Level_and_EFSM : the base game-level used in the experiment, along with its EFSM model.
    • MutatedLevels: 10 mutants of the base game-level.
    • GeneratedTestSuites: the test suites used in the paper, generated by search-based-testing (SBT) and model-checking-based (MC) generators.
    • Sec5.2_Experiment_produced_emotion_traces: emotion traces produced by running the above test suites.
    • Sec5.3_Experiment_produced_emotion_traces_from_10mutants: emotion traces produced by the mutants.
  • SBTtest, MCtest, Combinedtest: if you want to generate the test suites anew (rather than using those pre-provided above), they will be placed in these folders.

  • traces: if you want to re-run the test suites, the produced emotion traces will be placed in this folder.

  • iv4xrDemo: contains the System under Test executable.

2. Hardware Requirement

Experiments were run on a machine with the following specification:

  • 8th generation Intel Core i7 processor
  • 32 GB RAM
  • 1.8 GHz CPU

We recommend at least:

  • Intel Core i5
  • 8 GB RAM.

3. Setup: steps to take in the FASE 2023 VM

This assumes you have the VM already installed. Else, you can get it from here: https://zenodo.org/record/7446277#.Y7Q2Z-zMJTY

  1. Unzip this artifact on your computer, and share its root folder px-mbt with the VM. Mount it to your home directory in the VM. For example, if your home is /home/fase2023 then mount the artifact-root folder to the location /home/fase2023/px-mbt in the VM.

  2. Go into the folder px-mbt. If you cannot enter px-mbt because permission is denied, try this: sudo adduser $USER vboxsf. Restart the VM and then try again.

  3. From px-mbt run the script:

> ./prep.sh

This should install the support software (Maven and some Python libraries) needed to run the experiments. After this we are ready to run the experiments.

Note: The System under Test (experiment subject): the game Lab Recruits

The subject is a game called Lab Recruits, see: https://github.com/iv4xr-project/labrecruits. Windows, Mac, and Linux executables are included in this artifact. The game requires a level (a 'game world'). The level used by the paper is in the folder eplaytesting-pipeline/FASE23Dataset/Base_LR_Level_and_EFSM .

If you want to try playing it, just run the game and load the level-definition file LabRecruits_level.csv in that folder. Playing with graphics is only possible on Windows and Mac.

Caution

The original experiments were run on Windows. The game Lab Recruits uses graphics and is not really meant to be played on Linux. We do provide an executable for Linux, though without graphics. The experiments are set by default to run without graphics; if you want to see the graphics, you need a Windows or Mac machine.

4. Test instructions (5-10 min)

Go to the folder eplaytesting-pipeline and do:

>  ./fase23clean.sh
>  ./fase23quickcheck.sh

Check the following:

  • At the end the following is printed in the console:
... <snip>
** #test-cases: 2
** #emotion patterns investigated: 11
** #patterns witnessed: 3
** #patterns not appearing: 8
... <snip>
  • ../MCtest contains test cases (files named xxtestxx.ser).
  • ../traces contains two traces (.csv files) and six PNG images.
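
The two file checks above can also be scripted. The following is a minimal sketch and is not part of the artifact: the function name `quickcheck_ok` is hypothetical, the expected counts (two .csv traces, six .png images, at least one .ser file) are taken from the list above, and the .ser files are searched recursively because the exact layout inside MCtest is an assumption.

```python
from pathlib import Path


def quickcheck_ok(mctest_dir, traces_dir):
    """Return a dict mapping each quick-check item to True or False."""
    mctest = Path(mctest_dir)
    traces = Path(traces_dir)
    return {
        # Searched recursively: the layout inside MCtest is an assumption.
        "test cases (.ser) present": any(mctest.rglob("*.ser")),
        "two trace files (.csv)": len(list(traces.glob("*.csv"))) == 2,
        "six images (.png)": len(list(traces.glob("*.png"))) == 6,
    }
```

From eplaytesting-pipeline one could call `quickcheck_ok("../MCtest", "../traces")` and inspect which entries are False.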

5. Replication

Running the full set of experiments takes a very long time (xxx hours), so we will first give the instructions for running just a subset of them. Also keep in mind that the original experiments were run on the Windows version of the SUT. Its Linux version is not guaranteed to have exactly the same behavior.

5a. Limited Replication of Experiments in Section 5.2 (about 5 hours)

This only uses the LTL model checker to generate a test suite. The model checker is deterministic so it will always produce the same test suite. From the generated test suite we will produce emotion heatmaps and we will check a set of emotion requirements.

Go to the folder eplaytesting-pipeline and do:

>  ./fase23clean.sh
>  ./fase23RunMC.sh
  • You can compare the produced heatmaps in ../traces to the results shown in Figure 4 and Figure 5 (the MC heatmaps only) in Section 5.2. You need to flip the images vertically first.

  • The last step of the run checks which of the Player Experience (PX) requirements in Table 3 in Section 5.2 are confirmed by the tests. The results are printed to the console, among the last lines of output, and look like this:

** nD;S: 2 (VALID)
** nF;S: 1 (SAT)
** J;nS: 0 (UNSAT)
** J;D: 0 (UNSAT)
** J;F;S: 1 (SAT)
** D;H;P: 0 (UNSAT)
** D;H;S: 0 (UNSAT)
** D;H;nD;S: 0 (UNSAT)
** F;D;H;F;J: 0 (UNSAT)
** H;F;D;D;D;H;F;J: 0 (UNSAT)
** F;D;D;H;F;P: 0 (UNSAT)

Note that a VALID verdict implies SAT. Also note that some requirements expect UNSAT as verdict. Only compare the TS_MC part of Table 3.
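
To compare the console output with Table 3 more easily, the verdict lines can be extracted mechanically. This is a minimal sketch, not part of the artifact: the function name `parse_verdicts` is hypothetical, and the regular expression assumes the line format is exactly as in the sample output above.

```python
import re

# Matches lines such as "** nD;S: 2 (VALID)".
VERDICT_LINE = re.compile(r"^\*\* ([A-Za-z;]+): (\d+) \((VALID|SAT|UNSAT)\)$")


def parse_verdicts(console_text):
    """Map each emotion pattern to its (count, verdict) pair."""
    verdicts = {}
    for line in console_text.splitlines():
        match = VERDICT_LINE.match(line.strip())
        if match:
            pattern, count, verdict = match.groups()
            verdicts[pattern] = (int(count), verdict)
    return verdicts
```

For example, `parse_verdicts("** nD;S: 2 (VALID)\n** #test-cases: 2")` returns `{'nD;S': (2, 'VALID')}`; summary lines such as `** #test-cases: 2` are ignored.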

5b. Full Replication of Experiments in Section 5.2 (additional 3.5 hours)

Full replication of Section 5.2 consists of the Limited Replication from step 5a above, plus running a test suite generated by a Search-Based Testing (SBT) algorithm. If you already did 5a, you only need to add the SBT part below to obtain the full replication; otherwise, you also have to do 5a.

The SBT algorithm, however, is stochastic, so it produces a different test suite every time. We provide two scripts: the first uses the SBT test suite from the original experiment, and the second generates a fresh suite.

Go to the folder eplaytesting-pipeline and do this first:

> ./fase23clean.sh

Then you can choose between one of these.

  • > ./fase23RunSBTwithOrg.sh This uses a pre-generated SBT test suite.

  • > ./fase23RunSBTregen.sh This generates a fresh test suite.

After this, and assuming you already did the "Limited Replication" in 5a, you now only need to validate:

  • You can compare the produced heatmaps in ../traces to the results reported in Figure 4 and Figure 5 (the SBT heatmaps) in Section 5.2. You need to flip the images vertically first.

  • The last step of the run checks which of the Player Experience (PX) requirements in Table 3 in Section 5.2 are confirmed by the tests. The results are printed to the console, among the last lines of output. Note that a VALID verdict implies SAT. Also note that some requirements expect UNSAT as verdict. Only compare the TS_SB part of the table.

5c. Full Replication of Experiment in Section 5.3 (about 100 hours)

This experiment is a mutation test, intended to assess the ability of our approach to find PX errors. The experiment is very time consuming. It uses both the MC and SBT test suites, which have to be run on each of the mutants. There are 10 mutants in total; each takes about 10 hours to test since we now use both test suites.

The mutated game levels can be found in the folder FASE23Dataset/MutatedLevels. For each mutant do the following:

  1. Go to the folder eplaytesting-pipeline.
  2. Do > ./fase23clean.sh
  3. Copy the selected mutant file (a .csv file) to ../Combinedtest/model/
  4. Run > ./fase23RunAllOrg.sh
  5. Check the last console output to see which PX requirements are confirmed by the tests. The mutant is killed if one of the requirements is violated (e.g. a requirement expects SAT but we get an UNSAT result, or it expects UNSAT but we get SAT or VALID). You can compare the result with Table 5 in Section 5.3 (each mutant has its own column in the table).
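
The kill criterion in step 5 can be written down as a small check. This is a minimal sketch under stated assumptions: the function names `violates` and `is_killed` are hypothetical, each requirement is assumed to expect either SAT or UNSAT, and the expected verdicts must be taken from the paper (the values in the usage note are placeholders, not the paper's).

```python
def violates(expected, observed):
    """Return True if an observed verdict contradicts the expected one.

    expected is "SAT" or "UNSAT"; observed is "VALID", "SAT" or "UNSAT".
    VALID implies SAT, so it only contradicts an expected UNSAT.
    """
    if expected == "SAT":
        return observed == "UNSAT"
    if expected == "UNSAT":
        return observed in ("SAT", "VALID")
    raise ValueError(f"unsupported expectation: {expected!r}")


def is_killed(expected_by_pattern, observed_by_pattern):
    """A mutant is killed if at least one requirement is violated."""
    return any(
        violates(expected, observed_by_pattern[pattern])
        for pattern, expected in expected_by_pattern.items()
    )
```

With placeholder expectations `{"nD;S": "SAT", "J;D": "UNSAT"}`, the observation `{"nD;S": "UNSAT", "J;D": "UNSAT"}` kills the mutant, while `{"nD;S": "VALID", "J;D": "UNSAT"}` does not.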

6. General use of this artifact

This section describes the general use of the artifact. We first explain the workflow of automated PX testing, and then explain how to perform each step in the workflow. This allows users of this artifact to select their own subset of test cases and then perform the rest of PX testing. We also explain how to formulate your own PX requirements to try.

Workflow

Generally the workflow is as follows:

  • Step 1: generate test suites from an EFSM model of a Lab Recruits level. The used EFSM model and level can be found/viewed in this folder: eplaytesting-pipeline/FASE23Dataset/Base_LR_Level_and_EFSM. This process is offline (does not require runs on the SUT).

  • Step 2: run (selected) test cases on the SUT. This uses the iv4xr agent framework along with the OCC Computational Model of Emotion. Running test cases produces emotion traces, showing the intensity of various emotions throughout the runs, as calculated by the OCC model.

  • Step 3: apply Player Experience (PX) analyses on the resulting emotion traces (from Step 2). Two analyses are provided in this artifact: (a) producing emotion heatmaps, as in Figures 4 and 5 in the paper, and (b) checking the expected presence or absence of emotion patterns, as in Table 3 in the paper.

You can also do these analyses on traces copied from the original experiments that are also packaged in this artifact (in the folder eplaytesting-pipeline/FASE23Dataset/Sec5.2_Experiment_produced_emotion_traces), or on traces you generate yourself from step-2 above.

For convenience, experiments are run using the Java JUnit runner. We only borrow JUnit's runner; we do not use it for unit testing.

Step-1: generating test suites

As said, this stage is offline (does not require runs on the SUT). Test suites are generated from a supplied EFSM model of the SUT.

Pre-generated test suites are provided in the folder eplaytesting-pipeline/FASE23Dataset/GeneratedTestSuites. These are the test suites originally used in the paper.

If you want to try to generate the test suites yourself, go to eplaytesting-pipeline. From there, run these:

> mvn test -Dtest=eu.iv4xr.ux.pxtestingPipeline.SBtest_Generation
> mvn test -Dtest=eu.iv4xr.ux.pxtestingPipeline.MCtest_Generation

The first uses a search-based testing algorithm (SBT-MOSA) to generate a test suite. The second uses an LTL model checker to do the same. The resulting test suites will be placed in SBTtest and MCtest respectively. The model checker is deterministic (it always produces the same test suite). The SBT generator is stochastic, so it may produce a different test suite every time. In the original experiment we ran it 10 times; we include one of these suites in this artifact.

Note: the above are NOT JUnit tests, we just use its runner for convenience.

Each test case in the suites consists of these files:

  • xxxtest_k.txt : a text-readable representation of the test case (as a sequence of steps).
  • xxxtest_k.ser : a binary representation of the test case. This will be loaded when we want to run the test-case.

Step-2: selecting, running the test cases and generating the emotion traces

To do player experience analyses we first need to run test cases on the SUT. The OCC Computational Model of Emotion is hooked into the tool that runs the test cases, so that running them generates emotion trace files, which record the intensity of every emotion type throughout the gameplay, as calculated by the OCC model. The player experience analyses are later done offline on the generated traces.

To run test cases and generate emotion traces, follow these steps:

  1. Select some or all of the test cases from SBTtest and/or MCtest and copy them to the folder Combinedtest. Keep in mind that running them all will take quite some time (hours!); you can start by selecting just two or three test cases.

You can run a subset of the SBT suite and a subset of the MC suite separately (we did that in the original experiment) if you want to compare their results, or just make a mixed selection.

Else, if you want to use some or all of the original test suites used in the paper, they can be found in the folder eplaytesting-pipeline/FASE23Dataset/GeneratedTestSuites. Copy them to the folder Combinedtest.

  2. Copy the target game level (run this from the px-mbt root; the glob must not be quoted, or the shell will not expand it):
> cp eplaytesting-pipeline/FASE23Dataset/Base_LR_Level_and_EFSM/* Combinedtest/model/
  3. Run the tests you selected above. Go to eplaytesting-pipeline, then do:
> mvn test -Dtest=eu.iv4xr.ux.pxtestingPipeline.RunOCC

The test cases will be run on the SUT without graphics. If you want to see the graphics, set the flag withGraphics in the class RunOCC (the one that you run above) to true.

The resulting trace-files are placed in the traces folder. They are named data_goalQuestCompleted_xxx.csv
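
Selecting a few test cases and copying them to Combinedtest (step 1 above) can be scripted. This is a minimal sketch, not part of the artifact: the function name `copy_test_cases` is hypothetical, and it assumes each test case is a .ser file with an optional .txt companion of the same name, sitting directly in the suite folder.

```python
import shutil
from pathlib import Path


def copy_test_cases(suite_dir, dest_dir, limit):
    """Copy the first `limit` test cases (.ser plus .txt, if present)."""
    suite, dest = Path(suite_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # Sort by file name so that the selection is reproducible.
    for ser_file in sorted(suite.glob("*.ser"))[:limit]:
        shutil.copy2(ser_file, dest / ser_file.name)
        txt_file = ser_file.with_suffix(".txt")
        if txt_file.exists():
            shutil.copy2(txt_file, dest / txt_file.name)
        copied.append(ser_file.stem)
    return copied
```

From the px-mbt root, `copy_test_cases("MCtest", "Combinedtest", 3)` would copy three MC test cases; a random selection instead of the first few is an equally valid choice.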

Note-1: the above is NOT a JUnit test, we just use its runner for convenience.

Note-2: the interface to the SUT does not allow synchronous control over the SUT. Because of this the test agent might get stuck, e.g. around a corner. If you suspect this might be an issue, you can run a testcase-run fixer. It checks whether the trace files look strange, and re-runs the test cases whose traces do. To run the fixer:

> mvn test -Dtest=eu.iv4xr.ux.pxtestingPipeline.TestcasesExecRepair

The resulting fixed trace can be found in the folder fixedtraces.

Step-3: run PX analyses

The analyses work on the emotion traces produced in Step-2 above. They should be in the folder traces. Alternatively, you can copy emotion traces from the original experiment from eplaytesting-pipeline/FASE23Dataset/Sec5.2_Experiment_produced_emotion_traces to the folder traces.

Emotion heatmaps

To produce emotion heatmaps from the traces (as in Figures 4 and 5 in the paper), go to eplaytesting-pipeline, then do:

> python3 ./mkHeatmaps.py

This will produce six heatmaps, one for each emotion type. The heatmap of an emotion type E, e.g. fear, aggregates the data from all traces in the folder traces, showing the maximum intensity of fear at every 'square' area in the game level, over all runs in the test suite (that were used to produce the traces).
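
The aggregation described above (maximum intensity per square, over all runs) can be illustrated as follows. This is a sketch of the idea only, not the code of mkHeatmaps.py: the function name `max_intensity_grid`, the `cell_size` parameter and the input shape (already-extracted `(x, y, intensity)` tuples) are assumptions, as are non-negative intensities.

```python
import math
from collections import defaultdict


def max_intensity_grid(samples, cell_size=1.0):
    """Aggregate (x, y, intensity) samples into per-cell maxima.

    Samples from all runs are pooled; each cell keeps the highest
    intensity seen in it. Intensities are assumed to be >= 0.
    """
    grid = defaultdict(float)  # unseen cells start at intensity 0.0
    for x, y, intensity in samples:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[cell] = max(grid[cell], intensity)
    return dict(grid)
```

For example, the samples `[(0.2, 0.7, 0.1), (0.9, 0.1, 0.4), (1.5, 0.5, 0.3)]` yield `{(0, 0): 0.4, (1, 0): 0.3}`: the two samples falling in cell (0, 0) are reduced to their maximum.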

Checking PX properties

Section 4.3 in the paper discusses the use of patterns such as "H;nD;H" to capture Player Experience (PX) requirements (see the paper for the explanation on the meaning of such a pattern). To check the PX requirements listed in Table 3, go to eplaytesting-pipeline, then do:

> mvn test -Dtest=eu.iv4xr.ux.pxtestingPipeline.CheckPXProperties

This will check those PX requirements on the emotion traces in the folder traces. The results are printed to the console e.g.

** nD;S: 2 (VALID)
** nF;S: 1 (SAT)
** J;nS: 0 (UNSAT)
** J;D: 0 (UNSAT)
** J;F;S: 1 (SAT)
** D;H;P: 0 (UNSAT)
** D;H;S: 0 (UNSAT)
** D;H;nD;S: 0 (UNSAT)
** F;D;H;F;J: 0 (UNSAT)
** H;F;D;D;D;H;F;J: 0 (UNSAT)
** F;D;D;H;F;P: 0 (UNSAT)
** #test-cases: 2
** #emotion patterns investigated: 11
** #patterns witnessed: 3
** #patterns not appearing: 8

When the verdict of a property is VALID, the property occurs in (is satisfied by) all traces. SAT means that it occurs in at least one, but not all, of the traces. UNSAT means that it does NOT occur in any trace. The number printed before the verdict is the number of traces in which the pattern occurs.

If you want to play with other PX properties, you can specify your own in the class CheckPXProperties. The notation is explained in Section 4.3 in the paper.
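
The relation between the printed count and the verdict can be stated as a small function. This is a sketch of the rule as described here, not the code of CheckPXProperties; the function name `verdict` is hypothetical.

```python
def verdict(num_satisfying, num_traces):
    """Derive the verdict from how many traces satisfy a pattern."""
    if num_traces > 0 and num_satisfying == num_traces:
        return "VALID"  # the pattern occurs in all traces
    if num_satisfying > 0:
        return "SAT"    # the pattern occurs in some, but not all, traces
    return "UNSAT"      # the pattern occurs in no trace
```

With the two test cases of the sample output, a count of 2 gives VALID, 1 gives SAT, and 0 gives UNSAT.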

Running the mutation test

There are 10 mutants provided to assess how well your selected test cases and choice of PX requirements (as your test oracles) perform in killing them.

The 10 mutants are the same as those used in the paper. They can be found in the folder eplaytesting-pipeline/FASE23Dataset/MutatedLevels. Each file there is a mutated level-definition of the original level (called LabRecruits_level.csv) in eplaytesting-pipeline/FASE23Dataset/Base_LR_Level_and_EFSM.

To do the mutation test, repeat Steps 2.2 to 3 above for each mutant:

  1. Copy a mutated level xWave-the-flagxxx.csv to the folder Combinedtest/model.
  2. Redo Step 2.3.
  3. Redo Step 3, Checking PX properties. The console output shows which PX properties are satisfied (they are all expected to be at least satisfied).

That was it 😄

Other things you might wonder:

  • How does this "OCC" model work? See:

    An Appraisal Transition System for Event-Driven Emotions in Agent-Based Player Experience Testing, Ansari, Prasetya, Dastani, Dignum, Keller. In International Workshop on Engineering Multi-Agent Systems (EMAS), 2021.

  • The above OCC provides the general part of emotion 'simulation'. An SUT-specific part called 'Player Characterization' needs to be added too. See the paper above, or the FASE23 paper for explanation of the role of this Characterization. For the experiment, the used Player Characterization can be found in the class eu.iv4xr.ux.pxtestingPipeline.PlayerOneCharacterization.

License

Apache License Version 2.0