reproducible evaluation and fault injection of large-scale distributed systems





Reproducing experimental results is a core tenet of the scientific method. Unfortunately, the increasing complexity of the systems we build, deploy, and evaluate makes it difficult to reproduce results, which is one of the greatest impediments to progress in science in general and in distributed systems in particular.

This difficulty stems not only from the growing complexity of the systems under study, but also from the inherent difficulty of capturing and controlling all the variables that can affect experimental results.

We argue that this can only be addressed with a systematic approach to all the stages of the evaluation process. Angainor is a step in this direction.

Our goal is to address the following challenges: i) precisely describe the environment and variables affecting the experiment, ii) minimize the number of (uncontrollable) variables affecting the experiment and iii) have the ability to subject the system under evaluation to controlled fault patterns.

The architecture and main design decisions of the platform will be detailed in an upcoming paper.


How to use

A very early alpha prototype is available to try. Many (most) features are still missing, but general feedback is welcome.



  1. Have a Docker client/daemon up and running on your machine. Check the Docker documentation for instructions.

  2. Build the Docker images and push them to a local registry
    make all push
  3. Check config.yaml and adjust it according to your system. The defaults provided should work in most cases.

  4. Initialize a cluster with only the local node.
    ./bin/lsds cluster init

    If bash is not your default shell, prefix all commands with bash, as in

    bash bin/lsds cluster init
  5. Start the cluster
    ./bin/lsds cluster up
  6. Check the status of the cluster with
    ./bin/lsds cluster status
  7. Let’s run a simple deployment with an nginx server and a siege client.
    ./bin/lsds benchmark --app examples/nginx/nginx.yaml --name hello-world --run-time 120

    which will run the experiment for 120 seconds.

  8. To run a more interesting scenario with churn
    ./bin/lsds benchmark --app examples/nginx/nginx.yaml --name hello-churn --churn examples/nginx/churn.yaml
  9. Results will become available in <data>-<experiment-name>.

  10. To shut down the cluster, run
    ./bin/lsds cluster down
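
The steps above can be chained into a single script. This is only a sketch built from the commands documented here: the `run` helper and the `DRY_RUN` switch are illustrative additions, and the application name, YAML path, and run time are the example values used above.

```shell
#!/usr/bin/env bash
# Sketch: run the documented local-deployment steps end to end.
# Set DRY_RUN=1 to preview the commands instead of executing them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"          # print the command that would run
  else
    "$@"                 # actually execute it
  fi
}

deploy_local() {
  run ./bin/lsds cluster init
  run ./bin/lsds cluster up
  run ./bin/lsds cluster status
  run ./bin/lsds benchmark --app examples/nginx/nginx.yaml --name hello-world --run-time 120
  run ./bin/lsds cluster down
}

# Preview the full sequence without touching the cluster:
DRY_RUN=1 deploy_local
```

Running it with `DRY_RUN` unset executes the same sequence for real; results should then appear as described in step 9.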

In a cluster

  1. Have a cluster ready with Docker running on every host, and make sure that every host is accessible through ssh.
  2. Adjust config.yaml to match your cluster settings, with one entry per cluster machine.
  3. To run an experiment in the cluster follow steps 4-9 of the local deployment.
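
Since step 1 requires ssh access to every host, it can be worth verifying this up front. The sketch below is not part of the tool; it just loops over a host list (the names shown are hypothetical; use the hosts from your config.yaml) and tests a non-interactive ssh login to each.

```shell
#!/usr/bin/env bash
# Sketch: check that every cluster host accepts a non-interactive ssh login.

check_hosts() {
  local failed=0
  for h in "$@"; do
    # BatchMode prevents hanging on a password prompt; ConnectTimeout bounds the wait.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true 2>/dev/null; then
      echo "$h: ok"
    else
      echo "$h: unreachable"
      failed=1
    fi
  done
  return "$failed"
}

# Example with hypothetical host names:
# check_hosts node1 node2 node3 || echo "fix ssh access before running experiments"
```

If any host reports `unreachable`, fix key-based ssh access to it before starting an experiment.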


If you find any issues or would like to contribute new features, open a new issue and we will get in touch as soon as possible.


This project is led by Miguel Matos, Senior Researcher at INESC-ID and Assistant Professor at Instituto Superior Técnico, Universidade de Lisboa, Portugal, in collaboration with researchers at Université de Neuchâtel, Switzerland. Check the CONTRIBUTORS file for the full list of people involved in the project.


This work was partially supported by Fundo Europeu de Desenvolvimento Regional (FEDER) through Programa Operacional Regional de Lisboa and by Fundação para a Ciência e Tecnologia (FCT) through projects with reference UID/CEC/50021/2013 and LISBOA-01-0145-FEDER-031456.