public enum SWIM
Scalable Weakly-consistent Infection-style Process Group Membership Protocol
As you swim lazily through the milieu,
The secrets of the world will infect you.
Implementation of the SWIM protocol in abstract terms, not dependent on any specific runtime.
The actual implementation resides in
This implementation mostly follows the original terminology directly, with the notable exception of the original
“confirm” wording being represented as `SWIM.Status.dead` instead, as we found the “confirm” wording to be
confusing in practice.
Extensions & Modifications
This implementation includes a few notable extensions and modifications, some already documented in the initial SWIM paper, some in the Lifeguard extensions paper, and some being simple adjustments we found practical in our environments.
The “random peer selection” is not completely ad-hoc random, but follows a stable order, randomized on peer insertion.
- Unlike the completely random selection in the original paper, this has the benefit of consistently going “around” all peers participating in the cluster, enabling a more efficient spread of membership information among peers, by allowing us to avoid continuously (yet randomly) selecting the same few peers.
- This optimization is described in the original SWIM paper, and followed by some implementations.
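The stable-order selection can be sketched as follows. This is an illustrative model only (the type and method names here, such as `ProbeTargetRing`, are hypothetical, not the actual `SWIM.Instance` API): peers are inserted at a random position once, then probed in a stable round-robin order, so every peer is visited before any peer repeats.

```swift
// Sketch of stable-order peer selection (illustrative names): peers are
// shuffled once on insertion, then probed in a stable round-robin order.
struct ProbeTargetRing {
    private var peers: [String] = []
    private var index = 0

    // Insert the new peer at a random position; the overall order then stays stable.
    mutating func insert(_ peer: String) {
        peers.insert(peer, at: Int.random(in: 0...peers.count))
    }

    mutating func remove(_ peer: String) {
        if let i = peers.firstIndex(of: peer) {
            peers.remove(at: i)
            if i < index { index -= 1 }
        }
    }

    // Next peer to ping; wraps around once all peers have been probed,
    // guaranteeing we go "around" the whole cluster before repeating anyone.
    mutating func nextProbeTarget() -> String? {
        guard !peers.isEmpty else { return nil }
        if index >= peers.count { index = 0 }
        defer { index += 1 }
        return peers[index]
    }
}
```

Because insertion position is the only randomness, each full pass over the ring probes every member exactly once, regardless of where new members landed.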
Introduction of an `.unreachable` status, that is ordered after `.suspect`.
- This is because the decision to move an unreachable peer to `.dead` status is a large and important decision, in which user code may want to participate, e.g. by attempting “shoot the other peer in the head” or other patterns, before triggering the `.dead` status (which usually implies a complete removal of information about that peer's existence from the cluster), after which no further communication with the given peer will ever be possible.
The `.unreachable` status is optional and disabled by default.
- Other SWIM implementations handle this problem by storing dead members for a period of time after declaring them dead, also deviating from the original paper; so we conclude that this use case is quite common and allow addressing it in various ways.
- The original paper does not keep information about dead peers in memory; it only gossips the information that a member is now dead, but does not keep tombstones for later reference.
Implementations of extensions documented in the Lifeguard paper (linked below):
- Local Health Aware Probe - which replaces the static timeouts in probing with a dynamic one, taking into account recent communication failures of our member with others.
- Local Health Aware Suspicion - which improves the way `.suspect` states and their timeouts are handled, effectively relying on more information about unreachability.
- Buddy System - enables members to directly and immediately notify suspect peers about them being suspected, such that they have more time and a chance to refute these suspicions more quickly, rather than relying on completely random gossip for that suspicion information to reach such suspect peer.
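The Local Health Aware Probe idea can be sketched as a small counter that scales probe timeouts. This is a hedged illustration of the mechanism described in the Lifeguard paper, not the actual `SWIM.Instance` implementation; the names (`LocalHealth`, `recordFailedProbe`) and the base timeout value are assumptions made for the example.

```swift
// Sketch of Lifeguard's "Local Health Multiplier" (LHM): the multiplier grows
// when our own probes fail (a sign that *we* may be slow or partitioned) and
// shrinks on successful probes. Probe timeouts are scaled by it, so a node
// that is itself struggling becomes more lenient before suspecting others.
struct LocalHealth {
    private(set) var multiplier = 0
    let maxMultiplier = 8            // illustrative upper bound
    let baseProbeTimeout: Double = 0.3 // seconds; illustrative default

    mutating func recordFailedProbe() {
        multiplier = min(multiplier + 1, maxMultiplier)
    }

    mutating func recordSuccessfulProbe() {
        multiplier = max(multiplier - 1, 0)
    }

    // Dynamic timeout: base * (1 + LHM), as suggested by the Lifeguard paper.
    var probeTimeout: Double {
        baseProbeTimeout * Double(1 + multiplier)
    }
}
```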
SWIM serves as a low-level distributed failure detector mechanism.
It also maintains its own membership in order to monitor and select peers to ping with periodic health checks,
however this membership is not directly the same as the high-level membership exposed by the
SWIM provides a weakly consistent view on the process group membership. Membership in this context means that we have some knowledge about the node, acquired either by communicating with the peer directly, for example when initially connecting to the cluster, or because some other peer shared information about it with us. To avoid moving a peer “back” into an alive or suspect state because of older statuses that get replicated, we need to be able to put them into temporal order. For this reason each peer has an incarnation number assigned to it.
This number is monotonically increasing and can only be incremented by the respective peer itself and only if it is suspected by another peer in its current incarnation.
The ordering of statuses is as follows:
alive(N) < suspect(N) < alive(N+1) < suspect(N+1) < dead
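The ordering above can be sketched as a `Comparable` conformance over (incarnation, status) pairs. This is an illustrative model only: the real `SWIM.Status` carries more information (such as the set of suspecting peers), and the type names here are hypothetical.

```swift
// Illustrative model of the status ordering:
// alive(N) < suspect(N) < alive(N+1) < suspect(N+1) < dead
enum PeerStatus: Comparable {
    case alive(incarnation: UInt64)
    case suspect(incarnation: UInt64)
    case dead

    // Order primarily by incarnation, then by "suspicion level";
    // dead sorts after everything and can never be superseded.
    private var rank: (UInt64, Int) {
        switch self {
        case .alive(let i): return (i, 0)
        case .suspect(let i): return (i, 1)
        case .dead: return (UInt64.max, 2)
        }
    }

    static func < (lhs: PeerStatus, rhs: PeerStatus) -> Bool {
        lhs.rank < rhs.rank
    }
}

// alive(N+1) supersedes suspect(N): the peer has refuted the suspicion.
assert(PeerStatus.suspect(incarnation: 0) < PeerStatus.alive(incarnation: 1))
assert(PeerStatus.alive(incarnation: 1) < PeerStatus.dead)
```

When applying gossip, a node only replaces its stored status for a peer if the incoming status is greater in this ordering, which is what prevents stale `suspect(N)` gossip from overriding a fresher `alive(N+1)`.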
A member that has been declared dead can never return from that status and has to be restarted to join the cluster.
Note that such “restarted node” from SWIM’s perspective is simply a new node which happens to occupy the same host/port,
as nodes are identified by their unique identifiers (
The information about dead nodes will be kept for a configurable amount of time, after which it will be removed to prevent the state on each node from growing too big. The timeout value should be chosen to be big enough to prevent faulty nodes from re-joining the cluster and is usually in the order of a few days.
SWIM uses an infection style gossip mechanism to replicate state across the cluster. The gossip payload contains information about other node’s observed status, and will be disseminated throughout the cluster by piggybacking onto periodic health check messages, i.e. whenever a node is sending a ping, a ping request, or is responding with an acknowledgement, it will include the latest gossip with that message as well. When a node receives gossip, it has to apply the statuses to its local state according to the ordering stated above. If a node receives gossip about itself, it has to react accordingly.
If it is suspected by another peer in its current incarnation, it has to increment its incarnation in response. If it has been marked as dead, it SHOULD shut itself down (i.e. terminate the entire node / service), to avoid “zombie” nodes staying around even though they are already ejected from the cluster.
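The two reactions to gossip about oneself can be sketched as follows. This is a hedged illustration, not the actual `SWIM.Instance` API; all type and method names here (`LocalPeer`, `onGossipAboutSelf`) are hypothetical.

```swift
// Illustrative statuses as they might arrive in gossip.
enum GossipedStatus {
    case alive(incarnation: UInt64)
    case suspect(incarnation: UInt64)
    case dead
}

enum SelfGossipReaction: Equatable {
    case none
    case incrementedIncarnation(UInt64)
    case shutDown
}

// Sketch of reacting to gossip that concerns the local peer itself.
struct LocalPeer {
    var incarnation: UInt64 = 0

    mutating func onGossipAboutSelf(status: GossipedStatus) -> SelfGossipReaction {
        switch status {
        case .suspect(let incarnation) where incarnation == self.incarnation:
            // Suspected in our current incarnation: refute by incrementing it;
            // the resulting alive(N+1) supersedes suspect(N) once gossiped.
            self.incarnation += 1
            return .incrementedIncarnation(self.incarnation)
        case .dead:
            // Declared dead: terminate to avoid lingering as a "zombie" node.
            return .shutDown
        default:
            // Stale suspicions (older incarnations) and alive gossip need no reaction.
            return .none
        }
    }
}
```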
SWIM Protocol Logic Implementation
See `SWIM.Instance` for a detailed discussion on the implementation.
Emitted whenever a membership change happens.
Use `isReachabilityChange` to detect whether this is a change from an alive to an unreachable/dead state or not, and thus whether it is worth emitting to user-code.
public struct MemberStatusChangedEvent : Equatable
extension SWIM.MemberStatusChangedEvent: CustomStringConvertible
`SWIM.Member` represents an active participant of the cluster.
Incarnation numbers serve as sequence numbers and are used to determine which observation is “more recent” when comparing gossiped information.
public typealias Incarnation = UInt64
A sequence number which can be used to associate messages in order to establish a request/response relationship between ping/pingRequest and their corresponding ack/nack messages.
public typealias SequenceNumber = UInt32
The SWIM membership status reflects how a node is perceived by the distributed failure detector.
Modification: Unreachable status (opt-in)
If the unreachable status extension is enabled, it is set when a classic SWIM implementation would have declared a node `.dead`; yet since we allow for the higher level membership to decide when and how to eject members from a cluster, only the `.unreachable` state is set and a `Cluster.ReachabilityChange` cluster event is emitted. In response to this, a high-level membership protocol MAY confirm the node as dead by issuing `Instance.confirmDead`, which will promote the node to `.dead` in SWIM terms.

The `.unreachable` status is only used if enabled explicitly by setting `settings.unreachable` to enabled. Otherwise, the implementation performs its failure checking as usual and directly marks detected-to-be-failed members as `.dead`.
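The opt-in flow can be sketched as follows. This is an illustrative model under assumed names (`FailureDetector`, `onProbeTimedOutBeyondSuspicion`), not the actual `SWIM.Instance` / `Instance.confirmDead` implementation.

```swift
// Minimal sketch of the opt-in unreachability flow: a peer that would
// classically be declared dead is first marked unreachable, and only an
// explicit confirmation from the higher-level membership promotes it to dead.
enum NodeStatus { case alive, suspect, unreachable, dead }

struct FailureDetector {
    var unreachabilityEnabled: Bool
    var status: NodeStatus = .alive

    // Called when SWIM-level probing concludes the peer is "most likely dead".
    mutating func onProbeTimedOutBeyondSuspicion() {
        status = unreachabilityEnabled ? .unreachable : .dead
        // When moving to .unreachable, a reachability-change event would be
        // emitted here for the higher-level membership layer to act on.
    }

    // The high-level membership confirms the death, e.g. after attempting
    // "shoot the other node in the head" style remediation.
    mutating func confirmDead() {
        status = .dead
    }
}
```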
- `alive -> suspect`
- `alive -> suspect`, with next `SWIM.Incarnation`; e.g. during flaky network situations, we suspect and un-suspect a node depending on probing
- `suspect -> unreachable | alive`; if in SWIM terms a node is “most likely dead”, we declare it `.unreachable` instead, and await high-level confirmation to mark it `.dead`
- `unreachable -> alive | suspect`, with next `SWIM.Incarnation`
- `alive | suspect | unreachable -> dead`
public enum Status : Hashable
extension SWIM.Status: Comparable