add blog post about borgmatic error handling

Pim Kunis 2023-08-08 12:28:48 +02:00
parent 92366bf618
commit c8b8d925be

@@ -0,0 +1,60 @@
---
layout: post
title: Error Handling in Borgmatic
date: 2023-08-08 11:51:00 Europe/Amsterdam
categories: backup borg borgmatic
---
[BorgBackup](https://borgbackup.readthedocs.io/en/stable/) and [Borgmatic](https://torsion.org/borgmatic/) have been my go-to backup tools for my home lab ever since I started making backups.
Using [Systemd Timers](https://wiki.archlinux.org/title/systemd/Timers), I create a backup every night.
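For illustration, a nightly timer could look roughly like the sketch below. The unit name `backup.timer`, the schedule, and the assumption that the backup service is called `backup.service` are all hypothetical; a timer activates the service with the same base name unless told otherwise.
```systemd
# backup.timer (hypothetical name; activates backup.service by default)
[Unit]
Description=Run the Borgmatic backup every night

[Timer]
# Assumed schedule: every day at 03:00
OnCalendar=*-*-* 03:00:00
# Catch up after boot if the machine was off at the scheduled time
Persistent=true

[Install]
WantedBy=timers.target
```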
I also monitor the backup process so that I get notified when an error occurs.
However, the way I set this up meant that some failures never triggered a notification.
Even though it boils down to RTFM, I'd like to explain my mistake and how to handle errors correctly.
I was using the `on_error` option to handle errors, like so:
```yaml
on_error:
- 'apprise --body="Error while performing backup" <URL> || true'
```
However, `on_error` is not triggered when the `before_everything` or `after_everything` hooks fail.
My solution to this was moving the error handling up to the Systemd service that calls Borgmatic.
This results in the following Systemd service:
```systemd
[Unit]
Description=Backup data using Borgmatic
# Added
OnFailure=backup-failure.service
[Service]
ExecStart=/usr/bin/borgmatic --config /root/backup.yml
Type=oneshot
```
This handles any error, whether it comes from Borgmatic's hooks or from Borgmatic itself.
The `backup-failure` service is very simple and just calls Apprise to send a notification:
```systemd
[Unit]
Description=Send backup failure notification
[Service]
Type=oneshot
ExecStart=apprise --body="Failed to create backup!" <URL>
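# Not required for OnFailure=, which starts this unit directly.
# Enabling the unit would also run it at every boot.
[Install]
WantedBy=multi-user.target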
```
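One way to check that the notification path works, without waiting for a real backup failure, is a throwaway unit that always fails. This is only a sketch: the unit name `backup-failure-test.service` is hypothetical, and it assumes `false` lives at `/usr/bin/false` on your system.
```systemd
# backup-failure-test.service (hypothetical unit, only for testing the alert path)
[Unit]
Description=Always-failing unit to test the backup failure notification
OnFailure=backup-failure.service

[Service]
Type=oneshot
# Exits with a non-zero status, so the unit enters the failed state
ExecStart=/usr/bin/false
```
Starting it with `systemctl start backup-failure-test.service` should put the unit in the failed state and, in turn, start `backup-failure.service`.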
# The Aftermath (or what I learned)
Because the error handling and alerting weren't working properly, my backups failed for two weeks straight.
And, of course, you only notice your backups aren't working when you actually need them.
This is exactly what happened: my disk was full and a MariaDB database crashed as a result of that.
Actually, the whole database seemed to be corrupt, and I find it worrying that MariaDB does not seem to be very resilient to failures (in comparison, a PostgreSQL database was able to recover automatically).
I then tried to recover the data using last night's backup, only to find out there was no such backup.
Fortunately, I had other means to recover the data, so I incurred no data loss.
I already knew it is important to test backups, but I learned it is also important to test failures during backups!