Self-healing distributed systems
Examiner: Prof. Dr. Theo Ungerer
Co-examiner: Prof. Dr. Bernhard Bauer
The growing complexity of distributed systems demands new ways of control. This work addresses self-healing in distributed environments. The term "self-healing" denotes a relatively new area of research and is used in a fairly broad way, but it can be understood as dynamic fault tolerance. This work proposes generic concepts and algorithms for building self-healing systems.
Detecting node failures in distributed environments is a non-trivial problem, and failure detectors are an important component of many fault-tolerant distributed systems. In this work, a new failure detection algorithm is proposed that offers high flexibility and good performance. Furthermore, an approach is presented to reduce the message overhead of failure detectors.
New grouping algorithms are introduced to enable scalable self-monitoring. They allow monitoring relations to be established autonomously in complex, large-scale distributed systems.
A failure recovery engine based on automated planning, which manages a distributed system according to user-defined objectives, is proposed. It is able to generate and execute plans to autonomously recover a system from unwanted states.
Finally, ideas for a generic self-healing architecture for highly complex distributed systems are presented. The design is based on psychological and sociological concepts.
PDF: satzger_diss.pdf (3907 KB)