SWIM

public enum SWIM

Scalable Weakly-consistent Infection-style Process Group Membership Protocol

As you swim lazily through the milieu,
The secrets of the world will infect you.

Implementation of the SWIM protocol in abstract terms, not dependent on any specific runtime. The actual implementation resides in SWIM.Instance.

Terminology

This implementation mostly follows the original terminology. The notable exception is that the original “confirm” wording is represented as SWIM.Status.dead, because we found “confirm” to be confusing in practice.

Extensions & Modifications

This implementation includes a few notable extensions and modifications: some are already documented in the initial SWIM paper, some come from the Lifeguard extensions paper, and some are simple adjustments we found practical in our environments.

  • The “random peer selection” is not completely ad-hoc random, but follows a stable order, randomized on peer insertion.

    • Unlike the completely random selection in the original paper, this has the benefit of consistently going “around” all peers participating in the cluster. It enables a more efficient spread of membership information among peers, because we avoid continuously (yet randomly) selecting the same few peers.
    • This optimization is described in the original SWIM paper, and followed by some implementations.
  • Introduction of an .unreachable status, that is ordered after .suspect and before .dead.

    • This is because moving an unreachable peer to .dead status is a large and important decision in which user code may want to participate, e.g. by attempting to “shoot the other peer in the head” or by applying other patterns before triggering the .dead status. The .dead status usually implies a complete removal of all information about that peer's existence from the cluster, after which no further communication with the given peer will ever be possible.
    • The .unreachable status is optional and disabled by default.
    • Other SWIM implementations handle this problem by storing dead members for a period of time after declaring them dead, also deviating from the original paper; so we conclude that this use case is quite common and allow addressing it in various ways.
  • Preservation of .unreachable information

    • The original paper does not keep information about dead peers in memory: it only gossips the information that a member is now dead, and does not keep tombstones for later reference.
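As a rough illustration of the first item above (a stable probing order that is randomized on peer insertion), the following sketch walks all known peers round-robin. The type RoundRobinPeers and its methods are hypothetical names made up for this example; they are not the package's API, and the real selection logic resides in SWIM.Instance.

```swift
/// Hypothetical illustration type; not part of the SWIM package API.
struct RoundRobinPeers<Peer: Equatable> {
    private(set) var order: [Peer] = []
    private var nextIndex = 0

    /// Inserts a peer at a random position, so the order is randomized once, on insertion.
    mutating func insert(_ peer: Peer) {
        guard !order.contains(peer) else { return }
        let position = Int.random(in: 0...order.count)
        order.insert(peer, at: position)
        // If we inserted before the cursor, shift it so the same peer is still "next".
        if position < nextIndex { nextIndex += 1 }
    }

    /// Returns the next peer to probe, walking the stable order and wrapping around.
    mutating func next() -> Peer? {
        guard !order.isEmpty else { return nil }
        if nextIndex >= order.count { nextIndex = 0 }
        let peer = order[nextIndex]
        nextIndex += 1
        return peer
    }
}
```

With three peers inserted and no further insertions, six calls to next() visit every peer exactly twice, in the same order both times. The order itself differs between runs because the insertion position is random.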

Implementations of extensions documented in the Lifeguard paper (linked below):

  • Local Health Aware Probe - which replaces the static timeouts in probing with a dynamic one, taking into account recent communication failures of our member with others.
  • Local Health Aware Suspicion - which improves the way .suspect states and their timeouts are handled, effectively relying on more information about unreachability. See: suspicionTimeout
  • Buddy System - enables members to directly and immediately notify suspect peers that they are being suspected, so that they have more time and a chance to refute these suspicions more quickly, rather than relying on completely random gossip for that suspicion information to reach the suspect peer.
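The idea behind Local Health Aware Probe can be sketched as a saturating counter that scales a base timeout. This follows the general approach of the Lifeguard paper, where the timeout is multiplied by the counter plus one. The type name, its methods, and the choice of which events raise or lower the counter are assumptions made for this example, not the settings or API of this package.

```swift
/// Hypothetical sketch of a local health multiplier; names are illustrative only.
struct LocalHealthMultiplier {
    /// Upper bound of the counter (an assumed, caller-chosen value).
    let maximum: Int
    private(set) var value: Int = 0

    init(maximum: Int) { self.maximum = maximum }

    /// Record an event suggesting that *we* may be unhealthy, e.g. a probe we sent timed out.
    mutating func recordFailure() { value = min(maximum, value + 1) }

    /// Record a successful probe, which lowers the counter again (never below zero).
    mutating func recordSuccess() { value = max(0, value - 1) }

    /// Scales a base timeout in milliseconds: base * (value + 1).
    func scaled(baseMillis: Int) -> Int { baseMillis * (value + 1) }
}
```

A healthy member keeps the base timeout, while a member that has recently seen failures waits proportionally longer before suspecting others.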

SWIM serves as a low-level distributed failure detector. It also maintains its own membership in order to monitor and select peers to ping with periodic health checks; however, this membership is not the same as the high-level membership exposed by the Cluster.

SWIM Membership

SWIM provides a weakly consistent view on the process group membership. Membership in this context means that we have some knowledge about the node, acquired either by communicating with the peer directly (for example when initially connecting to the cluster) or because some other peer shared information about it with us. To avoid moving a peer “back” into alive or suspect state because of older statuses that get replicated, we need to be able to put them into temporal order. For this reason each peer has an incarnation number assigned to it.

This number is monotonically increasing and can only be incremented by the respective peer itself and only if it is suspected by another peer in its current incarnation.

The ordering of statuses is as follows:

alive(N) < suspect(N) < alive(N+1) < suspect(N+1) < dead

A member that has been declared dead can never return from that status and has to be restarted to join the cluster. Note that such a “restarted node” is, from SWIM’s perspective, simply a new node which happens to occupy the same host/port, as nodes are identified by their unique identifiers (ClusterMembership.Node.uid).
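The ordering above can be expressed as a Comparable conformance. The enum below is a hypothetical stand-in written for this example only (it is not SWIM.Status and it leaves out the optional .unreachable status).

```swift
/// Hypothetical stand-in for a SWIM status, used only to illustrate the ordering.
enum SketchStatus: Comparable {
    case alive(incarnation: UInt64)
    case suspect(incarnation: UInt64)
    case dead

    static func < (lhs: SketchStatus, rhs: SketchStatus) -> Bool {
        switch (lhs, rhs) {
        case (.dead, _):
            return false // dead is never "older" than anything
        case (_, .dead):
            return true // every non-dead status is ordered before dead
        case (.alive(let l), .alive(let r)), (.suspect(let l), .suspect(let r)):
            return l < r // same kind: the higher incarnation wins
        case (.alive(let l), .suspect(let r)):
            return l <= r // alive(N) < suspect(N)
        case (.suspect(let l), .alive(let r)):
            return l < r // suspect(N) < alive(N+1)
        }
    }
}
```

When gossip arrives, a receiver following this sketch would only replace its locally stored status if the gossiped one compares as greater.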

The information about dead nodes will be kept for a configurable amount of time, after which it will be removed to prevent the state on each node from growing too big. The timeout value should be chosen to be big enough to prevent faulty nodes from re-joining the cluster and is usually in the order of a few days.

SWIM Gossip

SWIM uses an infection-style gossip mechanism to replicate state across the cluster. The gossip payload contains information about other nodes’ observed status and is disseminated throughout the cluster by piggybacking onto periodic health check messages: whenever a node sends a ping or a ping request, or responds with an acknowledgement, it includes the latest gossip with that message as well. When a node receives gossip, it has to apply the statuses to its local state according to the ordering stated above. If a node receives gossip about itself, it has to react accordingly.

If it is suspected by another peer in its current incarnation, it has to increment its incarnation in response. If it has been marked as dead, it SHOULD shut itself down (i.e. terminate the entire node / service), to avoid “zombie” nodes staying around even though they have already been ejected from the cluster.
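The two rules for gossip about oneself can be sketched as a small decision function. All names below (GossipedStatus, SelfReaction, reactToGossipAboutSelf) are hypothetical and exist only for this example; they are not part of the package's API.

```swift
/// Hypothetical representation of a status gossiped about the local node.
enum GossipedStatus {
    case alive(UInt64)
    case suspect(UInt64)
    case dead
}

/// Hypothetical description of how the local node should react.
enum SelfReaction: Equatable {
    case ignore
    case refute(newIncarnation: UInt64)
    case shutDown
}

func reactToGossipAboutSelf(_ gossiped: GossipedStatus, myIncarnation: UInt64) -> SelfReaction {
    switch gossiped {
    case .suspect(let incarnation) where incarnation == myIncarnation:
        // Suspected in our current incarnation: refute by moving to the next one.
        return .refute(newIncarnation: myIncarnation + 1)
    case .dead:
        // We were ejected from the cluster; avoid lingering as a "zombie" node.
        return .shutDown
    default:
        // Alive, or a suspicion that does not match our current incarnation: nothing to do.
        return .ignore
    }
}
```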

SWIM Protocol Logic Implementation

See also

SWIM.Instance for a detailed discussion on the implementation.

Further Reading

  • Emitted whenever a membership change happens.

    Use isReachabilityChange to detect whether this is a change from an alive to an unreachable/dead state, and therefore whether it is worth emitting to user code.

    Declaration

    Swift

    public struct MemberStatusChangedEvent : Equatable
    extension SWIM.MemberStatusChangedEvent: CustomStringConvertible

SWIM Member

  • A SWIM.Member represents an active participant of the cluster.

    It associates a specific SWIMAddressablePeer with its SWIM.Status and a number of other SWIM specific state information.

    Declaration

    Swift

    public struct Member
    extension SWIM.Member: Hashable, Equatable
    extension SWIM.Member: CustomStringConvertible
  • Incarnation numbers serve as sequence numbers and are used to determine which observation is “more recent” when comparing gossiped information.

    Declaration

    Swift

    public typealias Incarnation = UInt64
  • A sequence number which can be associated with messages in order to establish a request/response relationship between ping/pingRequest and their corresponding ack/nack messages.

    Declaration

    Swift

    public typealias SequenceNumber = UInt32
  • Typealias for the underlying membership representation.

    Declaration

    Swift

    public typealias Membership = Dictionary<Node, SWIM.Member>.Values
  • Message sent in reply to a .ping.

    The ack may be delivered directly in a request-response fashion between the probing and pinged members, or indirectly, as a result of a pingRequest message.

    Declaration

    Swift

    public enum PingResponse

Gossip

  • A piece of “gossip” about a specific member of the cluster.

    A gossip will only be spread a limited number of times, as configured by settings.gossip.gossipedEnoughTimes(_:members:).

    Declaration

    Swift

    public struct Gossip : Equatable
  • A GossipPayload is used to spread gossips about members.

    Declaration

    Swift

    public enum GossipPayload
  • The SWIM.Instance encapsulates the complete algorithm implementation of the SWIM protocol.

    Please refer to SWIM for an in-depth discussion of the algorithm and extensions implemented in this package.

    See also

    SWIM for a complete and in depth discussion of the protocol.

    Declaration

    Swift

    public final class Instance : SWIMProtocol
    extension SWIM.Instance: CustomDebugStringConvertible

SWIM Settings

  • Settings generally applicable to the SWIM implementation as well as any shell running it.

    Declaration

    Swift

    public struct Settings
  • The SWIM membership status reflects how a node is perceived by the distributed failure detector.

    Modification: Unreachable status (opt-in)

    If the unreachable status extension is enabled, it is set when a classic SWIM implementation would have declared a node .dead. Since we allow the higher-level membership to decide when and how to eject members from a cluster, only the .unreachable state is set and a Cluster.ReachabilityChange cluster event is emitted. In response to this, a high-level membership protocol MAY confirm the node as dead by issuing Instance.confirmDead, which will promote the node to .dead in SWIM terms.

    The additional .unreachable status is only used if enabled explicitly by setting settings.unreachable to enabled. Otherwise, the implementation performs its failure checking as usual and directly marks members detected as failed as .dead.

    • alive -> suspect
    • alive -> suspect, with next SWIM.Incarnation, e.g. during flaky network situations, we suspect and un-suspect a node depending on probing
    • suspect -> unreachable | alive, if in SWIM terms a node is “most likely dead”, we declare it .unreachable instead and await high-level confirmation to mark it .dead.
    • unreachable -> alive | suspect (optionally with the next SWIM.Incarnation)
    • alive | suspect | unreachable -> dead

    See also

    SWIM.Incarnation

    Declaration

    Swift

    public enum Status : Hashable
    extension SWIM.Status: Comparable
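
The opt-in behaviour described for the unreachable status can be sketched as two small transition functions. The enum and function names below are hypothetical and chosen for this example only; the real decision logic resides in SWIM.Instance and is driven by settings.unreachable and Instance.confirmDead.

```swift
/// Hypothetical, simplified status without incarnation numbers.
enum SketchState: Equatable {
    case alive, suspect, unreachable, dead
}

/// What happens when a suspicion is not refuted in time.
/// With the extension enabled, the member only becomes .unreachable;
/// otherwise it is marked .dead directly, as in classic SWIM.
func onSuspicionTimeout(_ current: SketchState, unreachableEnabled: Bool) -> SketchState {
    guard current == .suspect else { return current }
    return unreachableEnabled ? .unreachable : .dead
}

/// What happens when the high-level membership confirms the member as dead.
/// Any status (alive, suspect, unreachable) may be promoted to .dead.
func onConfirmDead(_ current: SketchState) -> SketchState {
    return .dead
}
```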