---
layout: post
title: Error Handling in Borgmatic
date: 2023-08-08 11:51:00 Europe/Amsterdam
categories: backup borg borgmatic
---

[BorgBackup](https://borgbackup.readthedocs.io/en/stable/) and [Borgmatic](https://torsion.org/borgmatic/) have been my go-to backup tools for my home lab ever since I started making backups.
Using [Systemd Timers](https://wiki.archlinux.org/title/systemd/Timers), I create a backup every night.
I also monitor the backup process, so that I get notified when an error occurs.
However, the way I had set this up meant I never received those notifications.
Even though it boils down to RTFM, I'd like to explain my mistake and how to handle errors correctly.

I was using the `on_error` option to handle errors, like so:

```yaml
on_error:
    - 'apprise --body="Error while performing backup" <URL> || true'
```

However, `on_error` does not run when the error comes from a `before_everything` or `after_everything` hook.
My solution was to move the error handling up to the Systemd service that calls Borgmatic.
This results in the following Systemd service:

```systemd
[Unit]
Description=Backup data using Borgmatic
# Added
OnFailure=backup-failure.service

[Service]
ExecStart=/usr/bin/borgmatic --config /root/backup.yml
Type=oneshot
```

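To illustrate the gap that `OnFailure=` closes, here is a minimal sketch of a Borgmatic configuration in which a `before_everything` hook can fail. The `mount` command and the `/mnt/backup` path are assumptions made up for this example, and depending on your Borgmatic version these options may need to live under a `hooks:` section.

```yaml
# Hypothetical fragment of /root/backup.yml, for illustration only.
before_everything:
    # Assumed example command: if the mount fails, Borgmatic stops with an error...
    - mount /mnt/backup

on_error:
    # ...but this hook is not run for errors in before_everything or after_everything.
    - 'apprise --body="Error while performing backup" <URL> || true'
```

In this scenario no notification comes from Borgmatic itself, but the failed run should still mark the Systemd service as failed, which is what `OnFailure=` reacts to.
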
With `OnFailure=`, Systemd handles any failure of the service, whether it comes from one of Borgmatic's hooks or from Borgmatic itself.
The `backup-failure` service is very simple and just calls Apprise to send a notification:

```systemd
[Unit]
Description=Send backup failure notification

[Service]
Type=oneshot
ExecStart=apprise --body="Failed to create backup!" <URL>

[Install]
WantedBy=multi-user.target
```

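For completeness, the nightly schedule mentioned at the start could be expressed with a timer unit along these lines. This is only a sketch: the unit names `backup.timer` and `backup.service` and the 02:00 start time are assumptions, not taken from my actual setup.

```systemd
# backup.timer (hypothetical name; by default a timer activates the
# service with the same base name, assumed here to be backup.service)
[Unit]
Description=Run the Borgmatic backup every night

[Timer]
# Assumed schedule: every day at 02:00
OnCalendar=*-*-* 02:00:00
# Catch up on a missed run if the machine was off at that time
Persistent=true

[Install]
WantedBy=timers.target
```

With a setup like this, the timer is the unit to enable. The backup service is started by the timer, and the failure service is only ever started through `OnFailure=`.
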
# The Aftermath (or what I learned)

Because the error handling and alerting weren't working properly, my backups failed for two weeks straight without me noticing.
And, of course, you only notice your backups aren't working when you actually need them.
This is exactly what happened: my disk filled up and a MariaDB database crashed as a result.
In fact, the whole database appeared to be corrupt, and I find it worrying that MariaDB does not seem to be very resilient to such failures (in comparison, a PostgreSQL database was able to recover automatically).
I then tried to recover the data from last night's backup, only to find out there was no such backup.
Fortunately, I had other means to recover the data, so I lost nothing.

I already knew it is important to test backups, but I learned it is just as important to test what happens when a backup fails!
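
One way to test the failure path without breaking a real backup is a throwaway unit that always fails and points at the same failure handler. The unit name `backup-failure-test.service` is hypothetical, and the sketch assumes `/bin/false` exists on your system.

```systemd
# backup-failure-test.service (hypothetical unit, for testing only)
[Unit]
Description=Deliberately failing unit to exercise the failure notification
OnFailure=backup-failure.service

[Service]
Type=oneshot
# Always exits with a non-zero status, so the unit enters the failed state
ExecStart=/bin/false
```

Starting it with `systemctl start backup-failure-test.service` should fail immediately and, if everything is wired up correctly, trigger the Apprise notification.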