add blog post about borgmatic error handling

2023-08-08 12:28:48 +02:00 · c8b8d925be
commit c8b8d925be
parent 92366bf618
1 changed files with 60 additions and 0 deletions
--- a/jekyll/_posts/backup-failure/2023-08-08-backup-failure.md
+++ b/jekyll/_posts/backup-failure/2023-08-08-backup-failure.md
@@ -0,0 +1,60 @@
+---
+layout: post
+title:  Error Handling in Borgmatic
+date:   2023-08-08 11:51:00 Europe/Amsterdam
+categories: backup borg borgmatic
+---
+
+[BorgBackup](https://borgbackup.readthedocs.io/en/stable/) and [Borgmatic](https://torsion.org/borgmatic/) have been my go-to backup tools for my home lab ever since I started making backups.
+Using [Systemd Timers](https://wiki.archlinux.org/title/systemd/Timers), I create a backup every night.
+I also monitor the backup process, so that I get notified when an error occurs.
+However, the way I set this up meant I did not receive notifications when the backup failed.
+Even though it boils down to RTFM, I'd like to explain my mistake and how to handle errors correctly.
+
+I was using the `on_error` option to handle errors, like so:
+
+```yaml
+on_error:
+  - 'apprise --body="Error while performing backup" <URL> || true'
+```
+
+However, `on_error` does not handle errors from the execution of `before_everything` and `after_everything` hooks.
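+
+To make the gap concrete, here is a minimal sketch of a configuration that would fail silently.
+The script path is a hypothetical placeholder, and depending on your Borgmatic version these options may need to be nested under a `hooks:` section.
+
+```yaml
+# Hypothetical excerpt; /usr/local/bin/prepare-backup.sh is a placeholder.
+before_everything:
+  # If this command exits non-zero, Borgmatic stops with an error...
+  - /usr/local/bin/prepare-backup.sh
+
+on_error:
+  # ...but this hook is not run for that error, so no notification is sent.
+  - 'apprise --body="Error while performing backup" <URL> || true'
+```
+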
+My solution was to move the error handling up to the Systemd service that calls Borgmatic.
+This results in the following Systemd service:
+
+```systemd
+[Unit]
+Description=Backup data using Borgmatic
+# Added
+OnFailure=backup-failure.service
+
+[Service]
+ExecStart=/usr/bin/borgmatic --config /root/backup.yml
+Type=oneshot
+```
+
+This handles any error, whether it comes from one of Borgmatic's hooks or from Borgmatic itself.
+The `backup-failure` service is very simple and just calls Apprise to send a notification:
+
+```systemd
+[Unit]
+Description=Send backup failure notification
+
+[Service]
+Type=oneshot
+ExecStart=apprise --body="Failed to create backup!" <URL>
+
+[Install]
+WantedBy=multi-user.target
+```
+
+# The Aftermath (or what I learned)
+
+Because the error handling and alerting weren't working properly, my backups didn't succeed for two weeks straight.
+And, of course, you only notice your backups aren't working when you actually need them.
+This is exactly what happened: my disk filled up and a MariaDB database crashed as a result.
+In fact, the whole database seemed to be corrupt, and I find it worrying that MariaDB does not seem to be very resilient to such failures (in comparison, a PostgreSQL database was able to recover automatically).
+I then tried to recover the data using last night's backup, only to find out there was no such backup.
+Fortunately, I had other means to recover the data so I incurred no data loss.
+
+I already knew it is important to test backups, but I learned it is also important to test failures during backups!
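+
+One way to test the failure path is to make the backup fail on purpose and check that a notification actually arrives.
+A minimal sketch, assuming you temporarily add a failing hook to your existing configuration (as before, the option may need to live under `hooks:` depending on your Borgmatic version):
+
+```yaml
+# Temporary addition for testing only; remove it again afterwards.
+before_everything:
+  # "false" always exits with status 1, so Borgmatic should exit with an error.
+  # It is quoted so YAML reads it as a command string, not a boolean.
+  - "false"
+```
+
+After starting the backup service manually, the unit should end up in the failed state and the `OnFailure=` unit should send the notification.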