---
layout: post
title: Error Handling in Borgmatic
date: 2023-08-08 11:51:00 Europe/Amsterdam
categories: backup borg borgmatic
---

BorgBackup and Borgmatic have been my go-to tools for backing up my home lab ever since I started making backups. A systemd timer runs the backup every night, and I monitor the run so that I get notified when an error occurs. However, the way I set this up meant that I did not receive notifications for some failures. Even though it boils down to RTFM, I'd like to explain my mistake and how to handle errors correctly.

I was using the on_error option to handle errors, like so:

on_error:
  - 'apprise --body="Error while performing backup" <URL> || true'

However, on_error does not fire when a before_everything or after_everything hook fails. My solution was to move the error handling up a level, into the systemd service that calls Borgmatic. This results in the following service:

[Unit]
Description=Backup data using Borgmatic
# Added
OnFailure=backup-failure.service

[Service]
ExecStart=/usr/bin/borgmatic --config /root/backup.yml
Type=oneshot

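This handles any failure of the unit, whether it comes from one of Borgmatic's hooks or from Borgmatic itself, because systemd starts the OnFailure unit whenever the service enters the failed state. The backup-failure service is very simple and just calls Apprise to send a notification: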

[Unit]
Description=Send backup failure notification

[Service]
Type=oneshot
ExecStart=apprise --body="Failed to create backup!" <URL>

[Install]
WantedBy=multi-user.target
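
To check this wiring before a real failure occurs, the failure path can be rehearsed by hand. The following is a rough sketch, not part of the setup described above: the function names and the transient unit name backup-failure-test are made up for illustration, and it assumes a systemd version that accepts OnFailure= as a property of a transient unit. It is meant to be run as root on the backup host.

```shell
#!/bin/sh
# Sketch: rehearse the alert path by hand. Nothing runs until a function is called.

# Start the notification unit directly to confirm that Apprise delivers.
check_notification() {
    systemctl start backup-failure.service
}

# Launch a transient unit that always fails and points OnFailure= at the same
# notification unit, to confirm that systemd triggers it on failure.
# "backup-failure-test" is a hypothetical unit name.
check_onfailure() {
    systemd-run --unit=backup-failure-test \
        --property=OnFailure=backup-failure.service \
        /bin/false
}
```

After sourcing the file, calling check_notification and then check_onfailure should each produce one notification. The failed transient unit stays around until it is cleared with `systemctl reset-failed backup-failure-test.service`.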

The Aftermath (or what I learned)

Because the error handling and alerting weren't working properly, my backups failed for two weeks straight without me noticing. And, of course, you only notice that your backups aren't working when you actually need them. That is exactly what happened: my disk filled up and a MariaDB database crashed as a result. In fact, the whole database appeared to be corrupt, and I find it worrying that MariaDB does not seem to be very resilient to such failures (in comparison, a PostgreSQL database was able to recover automatically). I then tried to restore the data from last night's backup, only to find out there was no such backup. Fortunately, I had other means of recovering the data, so I lost nothing.

I already knew it is important to test backups, but I learned it is also important to test failures during backups!
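
Testing a failure does not need a real broken backup. The idea can be rehearsed with stand-in commands, as in the minimal sketch below. Everything in it is hypothetical: run_with_alert and fake_notify are illustrative names, `false` and `true` stand in for a failing and a succeeding backup run, and the notifier writes to a temporary file where the real setup would call Apprise.

```shell
#!/bin/sh
# Hypothetical helper: run a command and call a notifier when it fails.
# Usage: run_with_alert NOTIFIER COMMAND [ARGS...]
run_with_alert() {
    notifier=$1
    shift
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        "$notifier" "Command failed with exit status $status: $*"
    fi
    return "$status"
}

# Stand-in notifier: records the message in a file instead of sending it.
ALERT_LOG=$(mktemp)
fake_notify() {
    printf '%s\n' "$1" >> "$ALERT_LOG"
}

# Rehearse both paths: a failing job must alert, a successful one must not.
failed_status=0
run_with_alert fake_notify false || failed_status=$?
ok_status=0
run_with_alert fake_notify true || ok_status=$?

# Count the recorded alerts that mention a non-zero exit status.
alert_count=$(grep -c 'exit status 1' "$ALERT_LOG")

echo "failing job exit status: $failed_status"
echo "successful job exit status: $ok_status"
echo "alerts recorded: $alert_count"
```

The point of the exercise is the pair of checks at the end: the alert must fire for the failing run and stay quiet for the successful one. The same rehearsal applies to whatever mechanism actually sends the alert.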