Articles

Adding PrometheusHistograms support to VictoriaMetrics/metrics

TL;DR: I added support for PrometheusHistograms (those with le buckets) to the VictoriaMetrics/metrics package (a lightweight alternative to prometheus/client_golang), which allows me to: switch to the more lightweight VictoriaMetrics/metrics library in my open-source projects, which I find simpler to use; make it possible to choose between classical Prometheus histograms or VictoriaMetrics histograms (much more precise) with a flag; and maintain compatibility with existing Prometheus-based monitoring setups. Problem: While working on kubenurse, I wanted to switch from the heavier prometheus/client_golang library to the more lightweight VictoriaMetrics/metrics package. However, there was one significant blocker: the VictoriaMetrics library only supported their own log-based histogram format, not the traditional Prometheus histograms with static le buckets. ...

A connected farm, part 3 - weighbridge automation

The Weighbridge: In addition to the dairy operation with its milking cows, the farm also includes a biogas plant. Taking advantage of the facilities there (trucks, buildings, etc.), my wife’s family has been collecting “green waste” for years now, and up until 2024, the cost of handling that waste was covered by a “per-inhabitant” tax paid by the town. Recently, however, due to the so-called “principe de causalité” (the “polluter pays” principle), instead of a per-capita tax/fee, people bringing green waste to the biogas plant have to pay for the amount they bring. As a result, a weighbridge had to be installed, which is only one part of the equation. ...

HTTP 502 - Upstream errors with nginx

A little context: I work at PostFinance, taking care of the Linux systems and of the open-source Kubernetes platform we run to support all sorts of banking workloads. Aside from running the platform, we also handle user support issues (where the users are internal developers/colleagues), and this blog article covers an issue named “Ingress gets HTTP 502 errors on high load”. You can take this article as an SRE exercise: I’ll provide the same data I received in the support issue, in the same order, and you should try to discover the actual issue as soon as possible. Good luck ;) ...

A Connected Farm, part 2 - Remote Controlled Fence ⚡️

This article again covers a topic related to my wife’s family farm, but this time, instead of exporting milking data to Grafana, I will detail my usage of Michael Stapelberg’s amazing gokrazy project, which made it possible to reliably develop Go software to control fences around the farm. Fences and Cows 🐄: The farm is spread over 2 sites, and on each site there are rather long electric fences, within which the cows happily graze during the day (and, for the heifers’ fence, also during the night). To prevent the cows from escaping and e.g. eating our neighbour’s grass (which is always greener, as we all know), the fences are electrified ⚡️ with high-voltage (6000V) pulses every second. ...

Kubenurse: The In-Cluster Doctor Making Network Rounds

TLDR: Kubenurse is the Swiss army knife for Kubernetes network monitoring. It will help you: pinpoint bottlenecks and know the latency in your network; identify nodes with network issues (packet drops, slow connection, etc.); uncover issues like DNS failures, broken sockets, or interrupted TLS negotiations. Description: Kubenurse is a Kubernetes network monitoring tool developed and open-sourced by PostFinance (a Swiss banking institution), which acts like an in-cluster doctor, continuously checking the health of your pod-to-pod, pod-to-service, and pod-to-ingress connections. ...

A Connected Farm, part 1 - Milking 🐄 🥛

Alongside my work as a System Engineer (with a focus on Kubernetes) at PostFinance, I’m married to a farmer in Switzerland, and live with her and her family on the family farm. This is quite different from my daily work, and I sometimes have the opportunity to help by, for example, feeding calves during milking, using my skills to install surveillance cameras, deploying a long-distance WiFi network across the farm, or modernizing the milking monitoring. It’s this latter point that I’m detailing today (without all the technical details, which are covered in the README of the open-source project I’ve created for this purpose). ...

Backing up MariaDB on Kubernetes

Hosting MariaDB on Kubernetes has so far proved to be quite a good experience: using the Bitnami Helm Chart to host a “standalone” instance (i.e. without replication, as replication already happens on the storage layer, and because simplicity is more valuable than a complex HA setup like Galera) of MariaDB worked out quite well. Being cautious, I had configured a daily backup to S3, using a tool found on GitHub, but when it came to restoring data dumped with this tool, which uses a pretty old mysqldump binary, I was stuck and couldn’t restore 😅 For some reason, the default config of the tool didn’t bother to escape quotes and other sensitive characters, and as a result I had to resort to restoring the daily velero backup of my MariaDB instance in another namespace, to make a proper export from there and finally restore my data. ...

Advent of Code 🎄 - an eBPF take 🐝

It’s that time of the year already! With December comes the Advent of Code programming challenge, and its daily mental workout. Advent of Code is an Advent calendar of small programming puzzles for a variety of skill sets and skill levels that can be solved in any programming language you like. The complexity of the programming challenges increases every day, and they tend to be notoriously hard during the last few days. However, as of writing this article, it’s only day 9, and there were a few problems that didn’t require too many processing cycles, provided you spent enough mathematical effort and didn’t come up with only the brute-force solution. ...

DNS servers monitoring

A few months ago, I found myself needing to know about the reliability of some internal DNS provider’s servers, after getting a series of hard-to-track random network issues, aka “It’s always DNS”. More specifically, I needed to know about the following: number of errors/timeouts; capability to query over TCP or UDP; capability to monitor multiple DNS servers at once; return codes received in the answer (i.e. NOERROR, SERVFAIL, NXDOMAIN, you name it) ...

Minimal downtime when rebooting etcd nodes

Graceful leader changes: When you need to restart Kubernetes control-plane nodes on which etcd also happens to be running, you will prefer a graceful transfer of the etcd cluster leadership, to reduce the transition period that comes with a leader election. This can be achieved with the following script, provided you specify the adequate environment variables in the /etc/profile.d/etcd-all file.

    set -o pipefail && \
    source /etc/profile.d/etcd-all && \
    AM_LEADER=$(etcdctl endpoint status | grep $(hostname) | cut -d ',' -f 5 | tr -d ' ') && \
    if [[ $AM_LEADER = "true" ]]
    then
      NEW_LEADER=$(etcdctl endpoint status | grep -v $(hostname) | cut -d ',' -f 2 | tr -d ' ' | tail -n '-1') && \
      etcdctl move-leader $NEW_LEADER && sleep 15
    fi

Info: the following environment variables need to be set, for example through a file such as: /etc/profile.d/etcd-all ...