
ADR-0006: Cluster lifecycle management

Status: proposed

Date: 2026-03-09

Group: cluster-management

Depends-on: ADR-0002, ADR-0005

Context

With bare-metal provisioning in place (ADR-0005), tenant organizations need dedicated Kubernetes clusters with full lifecycle management: provisioning, upgrades, certificate rotation, node rotation, auto-scaling, and self-healing. At 50,000 physical servers, this potentially means hundreds of clusters, all of which must be managed without manual intervention.
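As a rough illustration of that scale, the sketch below estimates the cluster count for a few average cluster sizes. Only the 50,000-server total comes from this ADR. The average sizes and the function name `estimate_cluster_count` are assumptions for illustration, and the calculation ignores servers reserved for Seed clusters or other platform components.

```python
import math

TOTAL_SERVERS = 50_000  # figure stated in the Context section


def estimate_cluster_count(total_servers: int, avg_nodes_per_cluster: int) -> int:
    """Return the number of clusters needed at a given average cluster size."""
    if avg_nodes_per_cluster <= 0:
        raise ValueError("avg_nodes_per_cluster must be positive")
    return math.ceil(total_servers / avg_nodes_per_cluster)


# Hypothetical average cluster sizes; not taken from the ADR.
for avg_nodes in (100, 250, 500):
    print(f"{avg_nodes} nodes/cluster -> {estimate_cluster_count(TOTAL_SERVERS, avg_nodes)} clusters")
```

Under these assumed sizes the count stays in the range of 100 to 500 clusters, which is consistent with "hundreds of clusters".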

Options

Option 1: Gardener

  • Pros: full lifecycle management built in (upgrades, certificate rotation, node rotation, auto-scaling, self-healing); proven at scale (SAP runs thousands of clusters); control planes run as pods on a Seed cluster, which is resource-efficient; native metal-stack integration; European origin (SAP/NeoNephos/LF Europe); Apache 2.0 license

  • Cons: its own concept model (Garden/Seed/Shoot) comes with a learning curve; the Seed cluster is a critical component

Option 2: Kamaji

  • Pros: efficient shared control planes, CNCF Sandbox, fast provisioning

  • Cons: young project, smaller community, central etcd requires careful capacity planning, less mature tooling

Option 3: vCluster

  • Pros: lowest overhead, seconds to provision, works within existing cluster

  • Cons: no kernel isolation (container breakout affects other tenants), fails EUCS SEAL-4 isolation requirements, sync mechanism adds attack surface

Decision

Gardener. Its native metal-stack integration means the full stack (provisioning → cluster lifecycle) is proven in production together. The project is mature and proven at scale (SAP runs thousands of clusters), and its European governance aligns with sovereignty requirements.

Consequences

  • Seed cluster(s) must be operated in an HA configuration

  • Gardener Extensions model provides the plugin mechanism for per-cluster services (DNS, certs, monitoring)

  • Tenant isolation is at the Shoot cluster level: each organization gets dedicated clusters

  • The platform team needs Gardener expertise
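To make Shoot-level tenant isolation more concrete, here is a minimal sketch that builds a Shoot-like manifest for one tenant cluster as a Python dictionary. It is not a validated or complete Shoot specification. The helper name `build_shoot`, the organization and cluster names, the Kubernetes version, the maintenance window, and the `metal` provider type are assumptions for illustration. Required fields such as cloud profile, region, networking, and machine type are deliberately left out.

```python
import json


def build_shoot(org: str, cluster: str, k8s_version: str,
                min_nodes: int, max_nodes: int) -> dict:
    """Build an illustrative, incomplete Shoot manifest for one tenant cluster."""
    if not 0 < min_nodes <= max_nodes:
        raise ValueError("need 0 < min_nodes <= max_nodes")
    return {
        "apiVersion": "core.gardener.cloud/v1beta1",
        "kind": "Shoot",
        "metadata": {
            "name": cluster,
            # Assumption: one Gardener project namespace per organization.
            "namespace": f"garden-{org}",
        },
        "spec": {
            "provider": {
                "type": "metal",  # assumed provider type for metal-stack
                "workers": [{
                    "name": "default",
                    # Bounds within which the worker pool may scale.
                    "minimum": min_nodes,
                    "maximum": max_nodes,
                }],
            },
            "kubernetes": {"version": k8s_version},
            "maintenance": {
                # Opt in to automated Kubernetes and machine image updates.
                "autoUpdate": {
                    "kubernetesVersion": True,
                    "machineImageVersion": True,
                },
                # Hypothetical nightly window, format HHMMSS+ZZZZ.
                "timeWindow": {"begin": "220000+0000", "end": "230000+0000"},
            },
        },
    }


# Hypothetical tenant and version values.
shoot = build_shoot("acme", "prod", "1.31.1", min_nodes=3, max_nodes=10)
print(json.dumps(shoot, indent=2))
```

Because each organization gets its own Shoot objects in its own namespace, upgrade policy and scaling bounds can be set per cluster without affecting other tenants.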