add blog post about borgmatic error handling
Commit c8b8d925be (parent 92366bf618): 1 changed file with 60 additions and 0 deletions
jekyll/_posts/backup-failure/2023-08-08-backup-failure.md
---
layout: post
title: Error Handling in Borgmatic
date: 2023-08-08 11:51:00 Europe/Amsterdam
categories: backup borg borgmatic
---

[BorgBackup](https://borgbackup.readthedocs.io/en/stable/) and [Borgmatic](https://torsion.org/borgmatic/) have been my go-to tools for backing up my home lab ever since I started making backups.
Using [Systemd Timers](https://wiki.archlinux.org/title/systemd/Timers), I create a backup every night.
I also monitor the backup process so that I get notified when an error occurs.
However, the way I set this up meant that some failures never triggered a notification.
Even though it boils down to RTFM, I'd like to explain my mistake and how to handle errors correctly.

I was using the `on_error` option to handle errors, like so:

```yaml
on_error:
  - 'apprise --body="Error while performing backup" <URL> || true'
```

However, `on_error` does not run for errors raised by the `before_everything` and `after_everything` hooks.
My solution was to move the error handling up to the Systemd service that calls Borgmatic.
This results in the following Systemd service:

```systemd
[Unit]
Description=Backup data using Borgmatic
# Added
OnFailure=backup-failure.service

[Service]
ExecStart=/usr/bin/borgmatic --config /root/backup.yml
Type=oneshot
```

This handles any error, whether it comes from one of Borgmatic's hooks or from Borgmatic itself.
The `backup-failure` service is very simple and just calls Apprise to send a notification:

```systemd
[Unit]
Description=Send backup failure notification

[Service]
Type=oneshot
ExecStart=apprise --body="Failed to create backup!" <URL>

[Install]
WantedBy=multi-user.target
```
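
The principle behind this setup is that the failure is detected outside Borgmatic, from the exit status of the whole run, so it does not matter which hook or step failed. For setups without Systemd (a cron job, for example), roughly the same idea could be sketched in plain shell. This is only an illustration and not part of the setup above: `run_with_alert` and `notify` are hypothetical names, and `notify` merely prints to stderr where a real setup would call Apprise.

```shell
#!/usr/bin/env bash
# Sketch: alert on any non-zero exit status of a wrapped command.
# Assumption: `notify` is a hypothetical stand-in for a real notifier,
# such as the Apprise call used in backup-failure.service.

notify() {
    # Print the alert to stderr; replace with a real notification command.
    printf 'ALERT: %s\n' "$1" >&2
}

run_with_alert() {
    # Run the given command and capture its exit status.
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        # A failing notifier must not mask the original exit status.
        notify "Failed to create backup! (exit status $status)" || true
    fi
    return "$status"
}
```

It would be called as `run_with_alert /usr/bin/borgmatic --config /root/backup.yml`. Because the wrapper only looks at the exit status, it reacts to failures in hooks just as it does to failures in Borgmatic itself.
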

# The Aftermath (or what I learned)

Because the error handling and alerting weren't working properly, my backups failed for two weeks straight.
And, of course, you only notice your backups aren't working when you actually need them.
This is exactly what happened: my disk filled up and a MariaDB database crashed as a result.
Actually, the whole database seemed to be corrupt, and I find it worrying that MariaDB does not seem to be very resilient to failures (in comparison, a PostgreSQL database was able to recover automatically).
I then tried to recover the data using last night's backup, only to find out there was no such backup.
Fortunately, I had other means to recover the data, so I incurred no data loss.

I already knew it is important to test backups, but I learned it is also important to test failures during backups!
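
One way to rehearse a failure without touching the real backup is to start a throwaway unit that always fails and points at the same `OnFailure=` handler. The following is a minimal sketch, assuming a Systemd version whose `systemd-run` supports `--collect` and accepts `OnFailure=` as a transient unit property. The unit name `backup-failure-drill` and the `SYSTEMD_RUN` variable are made up for this example. The variable only exists so the function can be dry-run with a stub.

```shell
#!/usr/bin/env bash
# Sketch of a failure drill: start a transient unit that always fails and
# let Systemd trigger the same OnFailure= handler as the real backup.
# Assumptions: the drill unit name is made up, and SYSTEMD_RUN exists only
# so the function can be dry-run with a stub instead of the real systemd-run.

failure_drill() {
    # Handler unit to trigger; defaults to the notification service above.
    local handler="${1:-backup-failure.service}"
    "${SYSTEMD_RUN:-systemd-run}" \
        --unit=backup-failure-drill \
        --collect \
        --property=OnFailure="$handler" \
        /bin/false
}
```

Running `failure_drill` as root should result in a notification if the handler works, and `journalctl -u backup-failure.service` shows whether the handler was started. This only exercises the notification path. A more end-to-end drill would be to temporarily add a failing command to a `before_everything` hook and start the real backup service.
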