DNS servers monitoring

A few months ago, after a series of random, hard-to-track network issues (aka “It’s always DNS”), I found myself needing to assess the reliability of an internal DNS provider’s servers.

Always DNS meme

More specifically, I needed a tool that provides the following:

number of errors/timeouts
capability to query over TCP or UDP
capability to monitor multiple DNS servers at once
return codes received in the answer (e.g. NOERROR, SERVFAIL, NXDOMAIN, you name it)
results available in Prometheus/OpenTelemetry format (although the tool presented below currently exports vmrange buckets only; there’s an open issue for that)

`hostlookuper`

As I couldn’t find the right tool for the job, I decided to dust off an old internal (to PostFinance) monitoring tool, namely hostlookuper, refresh it a bit and open source it: https://github.com/postfinance/hostlookuper

It’s written in Go, capable of the above, and its configuration is rather minimalistic:

$ hostlookuper -h
Usage of hostlookuper:
  -debug
        enable verbose logging
  -dns-servers string
        comma-separated list of DNS servers. if the protocol is omitted, udp is implied, and if the port is omitted, 53 is implied (default "udp://9.9.9.9:53,udp://8.8.8.8:53,udp://one.one.one.one:53")
  -hosts string
        comma-separated list of hosts against which to perform DNS lookups (default "google.ch,ch.ch")
  -interval duration
        interval between DNS checks. must be in Go time.ParseDuration format, e.g. 5s or 5m or 1h, etc (default 5s)
  -listen string
        address on which hostlookuper listens. e.g. 0.0.0.0:9090 (default ":9090")
  -timeout duration
        maximum timeout for a DNS query. must be in Go time.ParseDuration format, e.g. 5s or 5m or 1h, etc (default 5s)
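
To make the defaults in the -dns-servers help text concrete, here is a minimal Go sketch of how such an entry could be normalised: a missing protocol becomes udp and a missing port becomes 53. The dnsServer type and the parseDNSServer function are names I made up for this illustration; they are not taken from the hostlookuper source, and the sketch does not handle IPv6 literals.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// dnsServer is a hypothetical type for this sketch; it is not taken from
// the hostlookuper source code.
type dnsServer struct {
	Network string // "udp" or "tcp"
	Address string // host:port
}

// parseDNSServer applies the defaults described in the -dns-servers help
// text: a missing protocol means udp, a missing port means 53.
// IPv6 literals are deliberately not handled here.
func parseDNSServer(s string) (dnsServer, error) {
	network, rest := "udp", strings.TrimSpace(s)
	if i := strings.Index(rest, "://"); i >= 0 {
		network, rest = rest[:i], rest[i+3:]
	}
	if network != "udp" && network != "tcp" {
		return dnsServer{}, fmt.Errorf("unsupported protocol %q in %q", network, s)
	}
	if rest == "" {
		return dnsServer{}, fmt.Errorf("missing host in %q", s)
	}
	// SplitHostPort fails when there is no port, in which case 53 is implied.
	if _, _, err := net.SplitHostPort(rest); err != nil {
		rest = net.JoinHostPort(rest, "53")
	}
	return dnsServer{Network: network, Address: rest}, nil
}

func main() {
	// The last two entries are made-up examples that exercise the defaults.
	list := "udp://9.9.9.9:53,tcp://one.one.one.one,10.0.0.53"
	for _, entry := range strings.Split(list, ",") {
		srv, err := parseDNSServer(entry)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Printf("%-24s -> %s %s\n", entry, srv.Network, srv.Address)
	}
}
```

Keeping the protocol next to the address is what makes it possible to monitor the same server over both UDP and TCP by listing it twice.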

The tool is meant to run as a daemon: it continuously queries the DNS servers you ask it to monitor and collects metrics such as their response time and error rate (provided they return any errors!).

By default, it exposes its metrics on :9090/metrics; an example set of metrics is available here.

Interpreting the metrics

After letting it run for some time (see the example Kubernetes deployment), you’ll be able to import this Grafana dashboard to obtain a nice graphical overview of your DNS servers.

You get the distribution of return codes (RCODE), the number of errors per DNS server, a heatmap of request durations, and finally the average latency per server.

Hostlookuper Grafana dashboard

How does it help?

When you configure hostlookuper to make both TCP and UDP queries, it gives you a continuous view of the state of your network: if one DNS server has issues, you’ll only see errors on that DNS server, but when multiple DNS servers see their error counters going up, you most likely have a network or I/O issue at hand.
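
This rule of thumb can be written down as a small Go sketch. It is only a heuristic for illustration, not something hostlookuper does itself: the counts type, the failingServers and verdict functions, the sample numbers, and the 0.1% threshold are all assumptions of mine.

```go
package main

import (
	"fmt"
	"sort"
)

// counts holds the number of probes sent to a server and how many failed.
type counts struct{ Total, Failed int }

// failingServers returns, in sorted order, the servers whose error rate is
// strictly above threshold (0.001 means 0.1%).
func failingServers(stats map[string]counts, threshold float64) []string {
	var out []string
	for server, c := range stats {
		if c.Total > 0 && float64(c.Failed)/float64(c.Total) > threshold {
			out = append(out, server)
		}
	}
	sort.Strings(out)
	return out
}

// verdict applies the heuristic: a single failing server points at that
// server, several failing servers point at the network.
func verdict(failing []string) string {
	switch len(failing) {
	case 0:
		return "healthy"
	case 1:
		return "suspect server " + failing[0]
	default:
		return "suspect network"
	}
}

func main() {
	// Made-up numbers: one server drops 0.5% of its queries.
	stats := map[string]counts{
		"udp://10.0.0.1:53": {Total: 20000, Failed: 100},
		"udp://10.0.0.2:53": {Total: 20000, Failed: 0},
		"udp://10.0.0.3:53": {Total: 20000, Failed: 0},
	}
	fmt.Println(verdict(failingServers(stats, 0.001)))
}
```

In practice the same comparison is done by eye on the dashboard, by looking at the per-server error panels side by side.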

Concretely, it has already helped us detect an internal DNS server running a problematic version, which was dropping 0.5% of requests. What would otherwise have been extremely complicated to pinpoint became as easy as looking at a Grafana dashboard.

You can also configure hostlookuper to query both a load-balanced DNS server and its backends. This lets you verify that the load-balanced server keeps performing correctly while a backend is under maintenance.

Finally, it’s quite interesting to see the correlation between network traffic and the number of dropped UDP queries: as soon as you saturate your uplink, you will start to see DNS errors.

What next?

Some future steps include:

permit exporting metrics for non-VictoriaMetrics consumers (e.g. all Prometheus users to begin with) (issue #3)
permit querying IPv6 DNS servers (issue #21)
add some examples showing how to run it as a systemd service, for use outside Kubernetes
get it more traction/adoption, as it could probably help diagnose many infrastructure issues 🙂

If there’s a feature you need or a bug you encounter, please file a GitHub issue; I’ll be happy to work on it!
