diff --git a/jekyll/_posts/backup-failure/2023-08-08-backup-failure.md b/jekyll/_posts/backup-failure/2023-08-08-backup-failure.md
new file mode 100644
index 0000000..9a03b52
--- /dev/null
+++ b/jekyll/_posts/backup-failure/2023-08-08-backup-failure.md
@@ -0,0 +1,60 @@
+---
+layout: post
+title: Error Handling in Borgmatic
+date: 2023-08-08 11:51:00 Europe/Amsterdam
+categories: backup borg borgmatic
+---
+
+[BorgBackup](https://borgbackup.readthedocs.io/en/stable/) and [Borgmatic](https://torsion.org/borgmatic/) have been my go-to tools for backing up my home lab ever since I started making backups.
+Using [Systemd Timers](https://wiki.archlinux.org/title/systemd/Timers), I create a backup every night.
+I also monitor whether the backup process succeeds, so that I am notified when an error occurs.
+However, the way I set this up meant I did not receive those notifications.
+Even though it boils down to RTFM, I'd like to explain my mistake and how to handle errors correctly.
+
+I was using the `on_error` option to handle errors, like so:
+
+```yaml
+on_error:
+    - 'apprise --body="Error while performing backup" || true'
+```
+
+However, `on_error` does not handle errors from the execution of `before_everything` and `after_everything` hooks.
+My solution to this was moving the error handling up to the Systemd service that calls Borgmatic.
+This results in the following Systemd service:
+
+```systemd
+[Unit]
+Description=Backup data using Borgmatic
+# Added
+OnFailure=backup-failure.service
+
+[Service]
+ExecStart=/usr/bin/borgmatic --config /root/backup.yml
+Type=oneshot
+```
+
+This handles any error, be it from Borgmatic's hooks or from Borgmatic itself.
+The `backup-failure` service is very simple, and just calls Apprise to send a notification:
+
+```systemd
+[Unit]
+Description=Send backup failure notification
+
+[Service]
+Type=oneshot
+ExecStart=apprise --body="Failed to create backup!"
+
+[Install]
+WantedBy=multi-user.target
+```
+
+# The Aftermath (or what I learned)
+
+Because the error handling and alerting weren't working properly, my backups failed unnoticed for two weeks straight.
+And, of course, you only notice your backups aren't working when you actually need them.
+This is exactly what happened: my disk was full and a MariaDB database crashed as a result of that.
+In fact, the whole database seemed to be corrupt, and I find it worrying that MariaDB does not seem very resilient to such failures (in comparison, a PostgreSQL database was able to recover automatically).
+I then tried to recover the data using last night's backup, only to find out there was no such backup.
+Fortunately, I had other means to recover the data, so I lost no data.
+
+I already knew it is important to test backups, but I learned it is also important to test failures during backups!
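+
+One way to test the failure path is a throwaway unit that always fails and reuses the same `OnFailure=` handler.
+This is only a minimal sketch: the unit name `backup-failure-test.service` and the path `/usr/bin/false` are assumptions for illustration and are not part of the setup described above.
+
+```systemd
+# backup-failure-test.service (hypothetical name, for testing only)
+[Unit]
+Description=Deliberately failing unit to test backup failure notifications
+# Same failure handler as the real backup service
+OnFailure=backup-failure.service
+
+[Service]
+Type=oneshot
+# Always exits non-zero, so the unit enters the failed state
+# (path is an assumption; adjust it to wherever `false` lives on your system)
+ExecStart=/usr/bin/false
+```
+
+Starting it with `systemctl start backup-failure-test.service` should make the unit fail, which in turn should start `backup-failure.service` and send the notification.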