Monitor Varnish like a PRO in CentOS 7

Danila Vershinin

8 years ago

A silly mistake

At some point I started seeing strange behavior from my Varnish instance: it returned unexplained “backend fetch failed” errors. Only when I looked at syslog (for an entirely different task) did I spot that Varnish was panicking quite often:

Child (28380) Last panic at: Tue, 11 Sep 2018 19:18:42 GMT#012"Assert error in default_oc_getobj(), storage/stevedore.c line 60:#012 Condition(((o))->magic == (0x32851d42)) not true

My immediate reaction was to try downgrading Varnish, among other things. All in vain: the actual cause was my own misconfiguration. Cache storage was segmented in such a way that the storage for static files and the storage for the page cache both pointed at the same file:

-s static=file,/var/lib/varnish/varnish_storage.bin,512M 
-s file,/var/lib/varnish/varnish_storage.bin,512M

Things were wrong on many levels:

Two different storage backends pointed to the same file
There is no need to segment the cache at all if both segments end up on the same filesystem
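
A corrected setup could take either of two shapes, sketched below using the same varnishd -s syntax. The file names static.bin and pages.bin are hypothetical, and the sizes are simply carried over from the broken configuration. Following the second point above, the segmented form is only worth it if the two files live on different filesystems.

```shell
# Option 1: no segmentation, a single file storage
-s file,/var/lib/varnish/varnish_storage.bin,512M

# Option 2: keep the segmentation, but give each storage its own file
# (static.bin and pages.bin are hypothetical names)
-s static=file,/var/lib/varnish/static.bin,512M
-s file,/var/lib/varnish/pages.bin,512M
```

With the first option, any VCL that steers objects into the static storage would presumably need adjusting as well.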

Surely this was an easy fix. But the frustrating part was not knowing that something was wrong with the Varnish configuration until I spotted the panic messages in syslog, and only by accident. How can we do better here?

How Varnish runs

Varnish architecture builds upon two main processes: the master and the child process.

The child process is the one that actually caches content, and it panics if there’s a problem. The master process is responsible for watching over the child (cache) process and restarting it as needed.

Improving monitoring, and reliability along with it, raises two questions:

How can we easily spot Varnish panics and be alerted to them?
Who is watching over the watcher (master)?

Notification for Varnish panics

It’s easy to check whether your running Varnish instance has had a panic, using the following command:

varnishadm panic.show

If a panic has happened, you’d see its details. But how do we know we have to check it in the first place? It would be nice to be notified. This is where a simple Monit check comes in. For example, place the following in /etc/monit.d/varnish.mon:

check program varnishpanic with path "/bin/varnishadm panic.show"
       if status != 1 then alert

The trick here is knowing that varnishadm panic.show exits with code 0 if a panic has been recorded and with 1 otherwise. This simple check ensures that you get an alert whenever there is a panic, so you can act on it early.
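
If the inverted exit code feels easy to misread, one option is a small wrapper that turns it into a conventional result. The following is only a sketch: the function name varnish_panic_check is hypothetical, and it treats every non-zero exit from varnishadm as "no panic", so it cannot tell a clean instance from one that varnishadm failed to reach.

```shell
#!/bin/sh
# Hypothetical wrapper around "varnishadm panic.show".
# As described in the article, varnishadm exits 0 when a panic is recorded
# and non-zero otherwise; this normalizes that to 0 = fine, 1 = panic.
varnish_panic_check() {
    if varnishadm panic.show >/dev/null 2>&1; then
        echo "PANIC recorded"
        return 1
    fi
    # Assumption: any non-zero exit is treated as "no panic".
    echo "no panic recorded"
    return 0
}
```

Once a panic has been investigated, varnishadm panic.clear resets the record, which should stop the Monit check from alerting on the same panic again.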

Watch over master

The master Varnish process is quite reliable and is the least likely thing to crash. But why not add a bit of monitoring if we can?

There are basically two options here: you can use Monit to ensure that the main Varnish process is running, or you can rely on a systemd feature.

systemd

With the arrival of systemd in CentOS 7, you no longer have to worry about keeping Varnish up should it crash completely.

Simply add an override to the Varnish unit file, e.g. with systemctl edit varnish, containing:

[Service]
Restart=on-failure

Note that this isn’t needed if you use our Varnish 4 package, as this is already incorporated.
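
To confirm that the restart policy is actually in effect, the unit's Restart property can be queried. A minimal sketch: the helper name restart_policy is hypothetical, and the unit name varnish is assumed from the text above.

```shell
#!/bin/sh
# Hypothetical helper: print the effective Restart= policy of a unit.
# The unit name defaults to "varnish", as assumed in the article.
restart_policy() {
    # "systemctl show -p Restart" prints a line such as "Restart=on-failure"
    systemctl show -p Restart "${1:-varnish}" | cut -d= -f2
}
```

On a host where the override is applied, calling restart_policy should print on-failure. Running systemctl cat varnish additionally shows the unit file together with any drop-ins.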

Monit

Add the following to varnish.mon (assuming Varnish listens on port 80):

check process varnish with pidfile /var/run/varnish.pid
    group www
    start program = "/usr/bin/systemctl start varnish"
    stop program = "/usr/bin/systemctl stop varnish"
    if failed host localhost port 80 protocol http
        and request "/"
        then restart
    if 3 restarts within 5 cycles then timeout
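
Before relying on the new checks, it helps to have Monit validate its control file and reload only if the syntax check passes. A sketch, assuming monit is on the PATH and is run with sufficient privileges; the function name apply_monit_config is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reload Monit only if its configuration parses.
apply_monit_config() {
    # "monit -t" runs a syntax check of the control file(s)
    if monit -t; then
        monit reload
    else
        echo "Monit configuration has errors; not reloading" >&2
        return 1
    fi
}
```

After a successful reload, monit summary gives a quick overview of the monitored services, which should now include the varnish process and the varnishpanic program check.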