This article is a written companion to my KubeCon EU 2026 talk of the same name. It covers four stories from five years of running a Kubernetes platform at PostFinance, a systemic Swiss financial institution: SLOs as a reliability driver, open-source monitoring tools, continuous end-to-end testing, and an interactive debugging session tracking down rare 502 errors.

The interactive visualizations below (hash ring, race condition sequence diagram) are ported from the Slidev presentation so you can explore them at your own pace.


Context

PostFinance operates ~35 Kubernetes clusters in an air-gapped environment with strict regulatory requirements. The platform has been in production for 5+ years, initially built on kubeadm/Debian and now undergoing a migration to Talos managed via TOPF.

In banking, every failed request is a potential denied payment. This shapes how we approach reliability — even single-digit errors out of millions matter.


Part 1: SLOs as a Driver

From “it feels slow” to data-driven reliability

For months, developers complained that “the cluster feels slow today.” We had basic Grafana dashboards, but no clear targets. Without a number and a timeline, “slow” is subjective and easy to ignore.

Defining API Server SLOs

We defined three SLOs for the Kubernetes API server (following the SRE book approach):

  • Availability — less than 0.1% of requests return 5xx or 429
  • Latency (read) — GET/LIST within threshold (varies by subresource & scope)
  • Latency (write) — POST/PUT/PATCH/DELETE within 1s

Writing the PromQL queries by hand would have been tedious, but sloth made it tractable:

slos:
  - name: apiserver-availability
    objective: 99.9
    sli:
      events:
        # sloth substitutes {{.window}} into every generated recording rule
        error_query: sum(rate(apiserver_request_total{code=~"5..|429"}[{{.window}}]))
        total_query: sum(rate(apiserver_request_total[{{.window}}]))

From these definitions, sloth generates all recording rules, multi-window burn-rate alerts, and error budget calculations automatically.
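The latency SLOs follow the same pattern, with the SLI dividing slow requests by total requests using the `apiserver_request_duration_seconds` histogram. A sketch for the write-latency SLO (label matchers and the 1s bucket are illustrative, not our exact config):

```yaml
slos:
  - name: apiserver-write-latency
    objective: 99.0
    sli:
      events:
        # slow requests = all requests minus those under the 1s bucket
        error_query: |
          sum(rate(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[{{.window}}]))
          -
          sum(rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[{{.window}}]))
        total_query: sum(rate(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[{{.window}}]))
```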

SLOs Reveal the Truth

Kubernetes SLO dashboard

Once SLOs were live, “the cluster feels slow” became “we burned 40% of our error budget during Tuesday’s upgrade.” Disruptions now correlated clearly with our own actions — and that gave us both the data and the motivation to fix them.

Fix #1: etcd Topology

Our initial topology had each API server connecting to all 3 etcd members (a variant of the external etcd topology). When one etcd node was upgraded, all 3 API servers were impacted.

Complex junction road — our initial apiserver-etcd topology in a nutshell

We switched to a stacked topology: each API server talks only to its local etcd member. An etcd upgrade now impacts one API server instead of all three. This already improved matters, but we still saw degraded API server availability during cluster maintenance, so we kept digging.

Stacked topology — each API server connects only to its local etcd member

Fix #2: etcd Leadership Migration

Before upgrading a node, we now migrate etcd leadership to another member:

# transfer leadership away from the node about to be upgraded
etcdctl move-leader $NEW_LEADER_ID

This avoids a leader election during the maintenance window — a small improvement, but not the whole story.

etcd leadership hot potato

Fix #3: The Real Culprit — --goaway-chance

The biggest issue was that one control-plane node was doing all the work while the other two sat idle. Not only was load poorly distributed; more critically, the two idle API server instances never populated their caches. When the busy API server was shut down for maintenance, the remaining two would choke while filling their caches from scratch.

The root cause: long-lived HTTP/2 connections are never redistributed. A client opens one TCP connection through the load balancer and reuses it for every subsequent request, so traffic never rebalances — even when idle API servers are available.

The fix: --goaway-chance=0.001 on the API server. 1 in 1000 requests gets a GOAWAY frame, causing the client to reconnect through the load balancer. Once all API servers were handling traffic and had warm caches, upgrades stopped being a problem.
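With kubeadm-managed control planes, the flag can be set via `extraArgs` — a sketch, adapt to your own setup:

```yaml
# kubeadm ClusterConfiguration excerpt (illustrative)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    goaway-chance: "0.001"  # ~1 request in 1000 is told to reconnect
```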


Part 2: Open-Source Monitoring Tools

kubenurse

kubenurse is a DaemonSet that performs continuous network health checks across your cluster. Each pod validates 5 different network paths from every node (see also my detailed kubenurse article):

  1. API server (DNS) — through kubernetes.default.svc.cluster.local
  2. API server (IP) — direct endpoint, bypassing DNS
  3. me-ingress — through the ingress controller
  4. me-service — through the cluster service
  5. Neighbourhood — node-to-node checks

The five kubenurse check paths, performed from every node in the cluster

httptrace Instrumentation

Metrics are labeled with httptrace event types, giving a precise breakdown of each request phase: dns_start, connect_done, tls_handshake_done, got_first_response_byte, etc. When something fails, you know exactly which phase failed.

kubenurse Grafana dashboard

O(n²) → O(n): Deterministic Neighbor Selection

A community discussion identified that the original design had every pod checking every other pod — O(n²) total checks. The fix: node names are SHA-256 hashed and each pod checks only its n nearest neighbors in hash order (default: n=10).

The distribution is effectively random yet deterministic — the same set of nodes always produces the same pairings, so metrics stay stable across restarts. Use the interactive visualization below to explore this:

Interactive hash ring — 15 nodes × 5 neighbour checks each = 75 total checks (click a node to see its neighbours)
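The selection scheme can be sketched in a few lines of Go (a sketch of the design as described above, not kubenurse’s actual code; the demo uses n=2 over five nodes):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// nearestNeighbours returns the n nodes that follow `self` on the hash
// ring: node names are SHA-256 hashed, sorted, and each node checks the
// next n entries, wrapping around at the end of the ring.
func nearestNeighbours(self string, nodes []string, n int) []string {
	hash := func(s string) string {
		h := sha256.Sum256([]byte(s))
		return hex.EncodeToString(h[:])
	}

	// sort a copy of the node list by hash to build the ring
	ring := append([]string(nil), nodes...)
	sort.Slice(ring, func(i, j int) bool { return hash(ring[i]) < hash(ring[j]) })

	// find our own position on the ring
	idx := 0
	for i, node := range ring {
		if node == self {
			idx = i
			break
		}
	}

	// pick the next n nodes, wrapping around
	var neighbours []string
	for i := 1; i <= n && i < len(ring); i++ {
		neighbours = append(neighbours, ring[(idx+i)%len(ring)])
	}
	return neighbours
}

func main() {
	nodes := []string{"node-01", "node-02", "node-03", "node-04", "node-05"}
	fmt.Println(nearestNeighbours("node-01", nodes, 2))
}
```

Because the ordering depends only on the node names, adding or removing a node perturbs only a handful of pairings instead of reshuffling everything.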

hostlookuper

hostlookuper is simpler: it periodically resolves DNS targets and exports latency and error counters as Prometheus metrics. DNS is an excellent congestion indicator — dropped UDP packets surface as timeouts and errors instead of being transparently retransmitted like TCP, so DNS failures are often the first sign of trouble.
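The core of such a checker fits in a few lines — a minimal sketch of the idea, not hostlookuper’s actual code (the real tool runs this on a ticker and feeds the results into Prometheus counters and histograms):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// lookup resolves a single target with a deadline and returns how long
// the resolution took; a timeout or NXDOMAIN comes back as err.
func lookup(target string) (time.Duration, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	start := time.Now()
	_, err := net.DefaultResolver.LookupHost(ctx, target)
	return time.Since(start), err
}

func main() {
	for _, target := range []string{"localhost"} {
		d, err := lookup(target)
		if err != nil {
			// in the real tool: increment an error counter per target
			fmt.Printf("%s: error after %s: %v\n", target, d, err)
			continue
		}
		// in the real tool: observe a latency histogram per target
		fmt.Printf("%s: resolved in %s\n", target, d)
	}
}
```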

Graceful Shutdown: Lameduck Mode

SLOs on kubenurse itself revealed errors on the me_ingress check during node upgrades. The problem isn’t specific to ingress-nginx: SIGTERM arrives, but the load balancer doesn’t know yet, so requests still route to a dying process.

The fix (inspired by CoreDNS): lameduck shutdown (commit). On SIGTERM, keep serving for a few seconds (default: 5s), giving the LB/proxy/CNI time to catch up and stop sending traffic. Then stop the server.


Part 3: Continuous End-to-End Testing

Your end users should NOT be your end-to-end tests

Complex interactions between Kubernetes components (networking, storage, security, DNS) can fail in subtle ways that unit tests and CI pipelines don’t catch.

Our Approach

A Go test suite using e2e-framework, scheduled as a Kubernetes CronJob running every 15 minutes. Results are captured with OpenTelemetry and visualized in Grafana dashboards.

import (
    "testing"
    "time"

    "github.com/stretchr/testify/require"
)

// env, ctx, metricsCollector, newDeployment and waitForPodsReady are
// helpers from the surrounding test suite
func TestKubernetesDeployment(t *testing.T) {
    start := time.Now()
    t.Cleanup(func() {
        metricsCollector.RecordTestExecution(t, time.Since(start))
    })

    dep := newDeployment("nginx", 3)
    err := env.Create(ctx, dep)
    require.NoError(t, err)

    waitForPodsReady(t, dep, 30*time.Second)
}

Open-Source: e2e-tests

I’ve written an analogous open-source implementation at clementnuss/e2e-tests that you can fork and adapt. It covers:

Test What it validates
Deployment Pod scheduling, container runtime, workload lifecycle
Storage (CSI) PV provisioning, read/write operations
Networking DNS resolution, service discovery, inter-pod connectivity
RBAC Role-based access boundaries, permission enforcement

Deploy as a CronJob, stream metrics to an OTLP endpoint, and you get instant cluster health monitoring with alert rules that trigger on test failures.
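A minimal manifest might look like this (a sketch — the image tag, service account, and OTLP endpoint are placeholders to adapt to your cluster):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: e2e-tests
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid   # never overlap test runs
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: e2e-tests   # needs RBAC for the tested resources
          restartPolicy: Never
          containers:
            - name: e2e-tests
              image: e2e-tests:latest     # your build of the test suite
              env:
                - name: OTEL_EXPORTER_OTLP_ENDPOINT
                  value: "http://otel-collector:4317"   # your OTLP endpoint
```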

e2e tests Grafana dashboard


Part 4: The 502 Mystery

This section summarizes the investigation — for the full deep-dive, see my dedicated 502 article.

The Symptoms

A Tomcat-based e-finance application serving ~1.7M requests/day on one ingress. 8–10 failures per day — roughly 6 per million. Observations:

  • 502s uniformly distributed across all ingress-nginx pods
  • No pattern in time, endpoint, or client
  • App pods healthy, no errors in application logs
  • Load testing with K6 couldn’t reproduce it
  • Errors correlate with request volume, but the rate stays constant

The Breakthrough

ingress-nginx error logs contain the FQDN, not the ingress name. We were searching for the wrong thing. Once we filtered by hostname, we found:

upstream prematurely closed connection while reading response header from upstream

This told us: nginx had an open keepalive connection, sent a request on it, but the backend closed the connection before responding → 502 Bad Gateway.

The Race Condition

Two conflicting keepalive timeouts:

  • nginx: keeps connections open for 60s (default)
  • Tomcat: closes idle connections after 20s (default)

The race window: the connection sits idle for ~20s, Tomcat sends a FIN to close it, and at nearly the same moment nginx sends a new request on that connection. The packets cross in flight → 502.

Explore the race condition with this interactive sequence diagram:

The Fix

One environment variable:

export TC_HTTP_KEEPALIVETIMEOUT="75000"  # 75s > nginx's 60s

The rule: the upstream’s keepalive timeout must be longer than the reverse proxy’s. nginx defaults to 60s; Tomcat was at 20s, now set to 75s. The backend always outlives the proxy’s connection → no more race.
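On the nginx side, the corresponding knobs live in the `upstream` block (values illustrative; `keepalive_timeout` inside `upstream` requires nginx ≥ 1.15.3, and in ingress-nginx it maps to the `upstream-keepalive-*` ConfigMap options):

```nginx
upstream tomcat_backend {
    server 10.0.0.12:8080;     # placeholder backend address
    keepalive 32;              # idle connection pool size
    keepalive_timeout 60s;     # must stay SHORTER than the backend's
                               # idle timeout (here: Tomcat's 75s)
}
```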

Reproducing with K6

Standard load tests failed because they didn’t test idle + burst patterns. The key insight: cycle through load → idle → load phases with varying idle durations to hit the keepalive race window:

import http from 'k6/http';
import { check, sleep } from 'k6';

// Cycle: ramp up → sustain → ramp down → idle
// Idle duration increases (4s→11s) to maximize
// chance of hitting Tomcat's 20s timeout boundary
function generate_stages() {
    var stages = []
    for (let i = 4; i < 12; i++) {
        stages.push({ duration: "5s", target: 100 });
        stages.push({ duration: "55s", target: 100 });
        stages.push({ duration: "5s", target: 0 });
        stages.push({ duration: i + "s", target: 0 });
    }
    return stages
}

export let options = {
    noConnectionReuse: true,
    noVUConnectionReuse: true,
    scenarios: {
        http_502: {
            stages: generate_stages(),
            executor: 'ramping-vus',
            gracefulRampDown: '1s',
        },
    },
};

export default function() {
    let data = { data: 'Hello World' };
    for (let i = 0; i < 10; i++) {
        let res = http.post(
          `${__ENV.URL}`, JSON.stringify(data));
        check(res, {
          "status was 200": (r) => r.status === 200
        });
    }
    sleep(1);
}

Key Takeaways

  • SLOs are a forcing function — from “it feels slow” to data-driven fixes (etcd topology, leadership migration, goaway-chance)
  • Open-source your tools — the best fixes can come from community discussions; sometimes the right conversation matters more than code (kubenurse #55)
  • Test continuously, in-cluster — your end users should not be your e2e tests
  • Every error matters — 8 out of 1.7M requests still deserved investigation