---
layout: post
title: Using Ansible to alter Kernel Parameters
date: 2023-06-19 09:31:00 Europe/Amsterdam
categories: ansible grub linux
---

For months, I've had a peculiar problem with my laptop: once in a while, seemingly without reason, my laptop screen would freeze.
This only happened on my laptop screen, and not on an external monitor.
I had kind of learned to live with it, as I couldn't find a solution online.
The only remedy I had was reloading my window manager, which would often unfreeze the screen.

Yesterday I tried Googling once more, and I actually found [a thread](https://bbs.archlinux.org/viewtopic.php?id=246841) about it on the Arch Linux forums!
It describes the same problem on the same laptop model, the Lenovo ThinkPad X260.
Fortunately, it also proposes [a temporary fix](https://bbs.archlinux.org/viewtopic.php?pid=1888932#p1888932).

# Trying the Fix

Apparently, a problem with the Panel Self Refresh (PSR) feature of Intel iGPUs is the culprit.
According to the [Linux source code](https://github.com/torvalds/linux/blob/45a3e24f65e90a047bef86f927ebdc4c710edaa1/drivers/gpu/drm/i915/display/intel_psr.c#L42), PSR enables the display to go into a lower standby mode when the system is idle but the screen is in use.
These lower standby modes can reduce power usage of your device when idling.

This all seems useful, except when it makes your screen freeze!
The proposed fix disables the PSR feature entirely.
To do this, we need to pass a parameter to the Intel graphics Linux Kernel Module (LKM).
The LKM for Intel graphics is called `i915`.
There are [multiple ways](https://wiki.archlinux.org/title/Kernel_parameters) to change kernel parameters, but I chose to edit my Grub configuration.

First, I wanted to test whether the fix actually works.
When booting into my Linux partition via Grub, you can press `e` to edit the boot entry.
Somewhere in there, you can find the `linux` command, which specifies which kernel to boot and with which parameters.
I simply appended the option `i915.enable_psr=0` to this line.
After rebooting, I noticed my screen no longer froze.
Success!

# Persisting the Fix

To make the change permanent, we need to permanently change Grub's configuration.
One way to do this is by changing Grub's defaults in `/etc/default/grub`.
Namely, the `GRUB_CMDLINE_LINUX_DEFAULT` option specifies which options Grub should pass to the Linux kernel by default.
For me, this is a nice solution, as the problem exists on both Linux OSes I have installed.
I changed this option to:
```ini
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash i915.enable_psr=0"
```
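To check that the parameter actually took effect after rebooting, you can inspect the kernel command line and the module's live parameter value (a quick sketch; the second file only exists while the `i915` module is loaded):

```shell
# The kernel command line should now contain i915.enable_psr=0
cat /proc/cmdline

# The live value of the module parameter (0 means PSR is disabled)
cat /sys/module/i915/parameters/enable_psr 2>/dev/null || echo "i915 not loaded"
```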

Next, I wanted to automate this solution using Ansible.
This turned out to be quite easy, as the Grub configuration looks a bit like an INI file (maybe it is?):
```yaml
- name: Edit grub to disable Panel Self Refresh
  become: true
  ini_file:
    path: /etc/default/grub
    section: null
    option: "GRUB_CMDLINE_LINUX_DEFAULT"
    value: '"quiet splash i915.enable_psr=0"'
    no_extra_spaces: true
  notify: update grub
```

Lastly, I created the `notify` handler to update the Grub configuration:
```yaml
- name: update grub
  become: true
  command:
    cmd: update-grub
```

# Update: Just use Nix

Lately, I have been learning a bit of NixOS with the intention of replacing my current setup.
Compared to Ansible, applying this fix is a breeze on NixOS:
```nix
{
  boot.kernelParams = [ "i915.enable_psr=0" ];
}
```

That's it, yep.

# Conclusion

It turned out to be quite easy to change Linux kernel parameters using Ansible.
Maybe some kernel gurus have better ways to change parameters, but this works for me for now.

As a side note, I started reading a bit more about NixOS and realised that it can solve issues like these much more elegantly than Ansible does.
I might replace my OS with NixOS some day, if I manage to rewrite my Ansible configuration for it.
---
layout: post
title: Error Handling in Borgmatic
date: 2023-08-08 11:51:00 Europe/Amsterdam
categories: backup borg borgmatic
---

[BorgBackup](https://borgbackup.readthedocs.io/en/stable/) and [Borgmatic](https://torsion.org/borgmatic/) have been my go-to tools for creating backups for my home lab since I started making backups.
Using [Systemd timers](https://wiki.archlinux.org/title/systemd/Timers), I create a backup every night.
I also monitor successful execution of the backup process, in case an error occurs.
However, the way I set this up resulted in not receiving notifications.
Even though it boils down to RTFM, I'd like to explain my error and how to handle errors correctly.

I was using the `on_error` option to handle errors, like so:

```yaml
on_error:
  - 'apprise --body="Error while performing backup" <URL> || true'
```

However, `on_error` does not handle errors from the execution of the `before_everything` and `after_everything` hooks.
My solution was to move the error handling up to the Systemd service that calls Borgmatic.
This results in the following Systemd service:

```systemd
[Unit]
Description=Backup data using Borgmatic
# Added
OnFailure=backup-failure.service

[Service]
ExecStart=/usr/bin/borgmatic --config /root/backup.yml
Type=oneshot
```

This handles any error, whether it comes from Borgmatic's hooks or from Borgmatic itself.
The `backup-failure` service is very simple, and just calls Apprise to send a notification:

```systemd
[Unit]
Description=Send backup failure notification

[Service]
Type=oneshot
ExecStart=apprise --body="Failed to create backup!" <URL>

[Install]
WantedBy=multi-user.target
```
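To verify that the `OnFailure=` hook actually fires, you can force the backup service to fail once with a temporary drop-in override (via `systemctl edit`); the unit name below is an assumption, substitute your own:

```systemd
# Hypothetical drop-in: /etc/systemd/system/borgmatic-backup.service.d/override.conf
[Service]
# An empty ExecStart= clears the original command before replacing it
ExecStart=
ExecStart=/bin/false
```

Starting the unit should then trigger the notification service; remove the override and run `systemctl daemon-reload` afterwards.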

# The Aftermath (or what I learned)

Because the error handling and alerting weren't working properly, my backups didn't succeed for two weeks straight.
And, of course, you only notice your backups aren't working when you actually need them.
This is exactly what happened: my disk was full, and a MariaDB database crashed as a result.
Actually, the whole database seemed to be corrupt, and I find it worrying that MariaDB does not seem to be very resilient to failures (in comparison, a PostgreSQL database was able to recover automatically).
I then tried to recover the data using last night's backup, only to find out there was no such backup.
Fortunately, I had other means to recover the data, so I incurred no data loss.

I already knew it is important to test backups, but I learned it is also important to test failures during backups!
---
layout: post
title: Sending Apprise Notifications from Concourse CI
date: 2023-06-14 23:39:00 Europe/Amsterdam
categories: concourse apprise
---

Recently, I deployed [Concourse CI](https://concourse-ci.org/) because I wanted to get my feet wet with a CI/CD pipeline.
I also had a practical use case lying around for a long time: automatically compiling my static website and deploying it to my Docker Swarm.
This took some time to get right, but the result works like a charm ([source code](https://git.kun.is/pim/static)).

It's comforting to know I don't have to move a finger and my website is automatically deployed.
However, I would still like to receive some indication of what's happening.
And what better way to do that than using my [Apprise](https://github.com/caronc/apprise) service to keep me up to date?
There's a little snag though: I could not find any Concourse resource type that does this.
That's when I decided to just create it myself.

# The Plagiarism Hunt

As any good computer person, I am lazy.
I'd rather just copy someone's work, so that's what I did.
I found [this GitHub repository](https://github.com/mockersf/concourse-slack-notifier) that does the same thing, but for Slack notifications.
For some reason it's archived, but it seemed like it should work.
I actually noticed that lots of repositories for Concourse resource types are archived, so I'm not sure what's going on there.

# Getting to know Concourse

Let's first understand what we need to do to reach our end goal of sending Apprise notifications from Concourse.

A Concourse pipeline takes some inputs and performs some operations on them, which result in some outputs.
These inputs and outputs are called _resources_ in Concourse.
For example, a Git repository could be a resource.
Each resource is an instance of a _resource type_.
A resource type is therefore simply a blueprint from which multiple resources can be created.
To continue the example, a resource type could be "Git repository".

We therefore need to create our own resource type that can send Apprise notifications.
A resource type is simply a container image that includes three scripts:
- `check`: check for a new version of a resource
- `in`: retrieve a version of the resource
- `out`: create a version of the resource

As Apprise notifications are basically fire-and-forget, we will only implement the `out` script.

# Writing the `out` script

The whole script can be found [here](https://git.kun.is/pim/concourse-apprise-notifier/src/branch/master/out), but I will explain the most important bits of it.
Note that I only use Apprise's persistent storage solution, and not its stateless solution.

Concourse provides us with the working directory as the first argument, which we `cd` to:
```bash
cd "${1}"
```

We create a timestamp, formatted as JSON, which we will use as the resource's new version later.
Concourse requires us to set a version for the resource, but since Apprise notifications don't have one, we use the timestamp:
```bash
timestamp="$(jq -n "{version:{timestamp:\"$(date +%s)\"}}")"
```

Next, some black magic Bash to redirect file descriptors: Concourse reads the resulting version from the script's stdout, so we save the real stdout on file descriptor 3 and redirect everything else to stderr, keeping stray output from corrupting the result.
After that, we create a temporary file holding the resource's parameters, which Concourse passes on stdin.
```bash
exec 3>&1
exec 1>&2

payload=$(mktemp /tmp/resource-in.XXXXXX)
cat > "${payload}" <&0
```

We then extract the individual parameters.
The `source` key contains the values the resource was configured with, while the `params` key specifies parameters for this particular step.
```bash
apprise_host="$(jq -r '.source.host' < "${payload}")"
apprise_key="$(jq -r '.source.key' < "${payload}")"

alert_body="$(jq -r '.params.body' < "${payload}")"
alert_title="$(jq -r '.params.title // null' < "${payload}")"
alert_type="$(jq -r '.params.type // null' < "${payload}")"
alert_tag="$(jq -r '.params.tag // null' < "${payload}")"
alert_format="$(jq -r '.params.format // null' < "${payload}")"
```

We then encode the different parameters as JSON strings:
```bash
alert_body="$(eval "printf \"${alert_body}\"" | jq -R -s .)"
[ "${alert_title}" != "null" ] && alert_title="$(eval "printf \"${alert_title}\"" | jq -R -s .)"
[ "${alert_type}" != "null" ] && alert_type="$(eval "printf \"${alert_type}\"" | jq -R -s .)"
[ "${alert_tag}" != "null" ] && alert_tag="$(eval "printf \"${alert_tag}\"" | jq -R -s .)"
[ "${alert_format}" != "null" ] && alert_format="$(eval "printf \"${alert_format}\"" | jq -R -s .)"
```
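The `jq -R -s .` at the end of each line reads raw text and emits it as a single JSON string literal, with quoting and escaping taken care of. A standalone example of just that quoting step:

```shell
# -R reads raw input, -s slurps it into one string, . prints it JSON-encoded
printf 'New version: 1.2.3' | jq -R -s .
# → "New version: 1.2.3"
```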

Next, from the individual parameters we construct the final JSON message body that we will send to the Apprise endpoint.
```bash
body="$(cat <<EOF
{
  "body": ${alert_body},
  "title": ${alert_title},
  "type": ${alert_type},
  "tag": ${alert_tag},
  "format": ${alert_format}
}
EOF
)"
```

Before sending it, we compact the JSON and remove any values that are `null`:
```bash
compact_body="$(echo "${body}" | jq -c '.')"
echo "$compact_body" | jq 'del(..|nulls)' > /tmp/compact_body.json
```
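The `del(..|nulls)` filter recursively walks the document and deletes every path holding `null`, which is what drops the optional parameters that were never set. In isolation:

```shell
# Optional fields that were never set disappear from the payload
echo '{"body":"hi","title":null,"tag":null}' | jq -c 'del(..|nulls)'
# → {"body":"hi"}
```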

Here is the most important line, where we send the payload to the Apprise endpoint.
It's quite straightforward.
```bash
curl -v -X POST -T /tmp/compact_body.json -H "Content-Type: application/json" "${apprise_host}/notify/${apprise_key}"
```

Finally, we print the timestamp (our fake version) in order to appease the Concourse gods.
```bash
echo "${timestamp}" >&3
```

# Building the Container

As said earlier, to actually use this script, we need to add it to an image.
I won't explain this whole process, but the source can be found [here](https://git.kun.is/pim/concourse-apprise-notifier/src/branch/master/pipeline.yml).
The most important take-aways are these:
- Use `concourse/oci-build-task` to build an image from a Dockerfile.
- Use `registry-image` to push the image to an image registry.

# Using the Resource Type

Using our newly created resource type is surprisingly simple.
I use it for the blog you are reading right now, and the pipeline definition can be found [here](https://git.kun.is/pim/static/src/branch/main/pipeline.yml).
Here we specify the resource type in a Concourse pipeline:
```yaml
resource_types:
- name: apprise
  type: registry-image
  source:
    repository: git.kun.is/pim/concourse-apprise-notifier
    tag: "1.1.1"
```

We simply have to tell Concourse where to find the image and which tag we want.
Next, we instantiate the resource type to create a resource:
```yaml
resources:
- name: apprise-notification
  type: apprise
  source:
    host: https://apprise.kun.is:444
    key: concourse
    icon: bell
```

We simply specify the host to send Apprise notifications to.
Yeah, I even gave it a little bell because it's cute.

All that's left to do is actually send the notification.
Let's see how that is done:
```yaml
- name: deploy-static-website
  plan:
  - task: deploy-site
    config: ...

  on_success:
    put: apprise-notification
    params:
      title: "Static website deployed!"
      body: "New version: $(cat version/version)"
    no_get: true
```

As can be seen, the Apprise notification is triggered when the task executes successfully.
We do this using the `put` step, which executes the `out` script under the hood.
We set the notification's title and body, and send it!
The result is seen below in my Ntfy app, which Apprise forwards the message to:
![image of the notification in the Ntfy app](ntfy.png)

And to finish this off, here is what it looks like in the Concourse web UI:
![image of the pipeline in the Concourse web UI](pipeline.png)

# Conclusion

Concourse's way of representing everything as an image/container is really interesting in my opinion.
A resource type is quite easily implemented as well, although Bash might not be the optimal way to do this.
I've seen some people implement resource types in Rust, which might be a good excuse to finally learn that language :)

Apart from Apprise notifications, I'm planning on creating a resource type to deploy to a Docker Swarm eventually.
That seems a lot harder than simply sending notifications, though.
---
layout: post
title: Monitoring Correct Memory Usage in Fluent Bit
date: 2023-08-09 16:19:00 Europe/Amsterdam
categories: fluentd fluentbit memory
---

Previously, I used [Prometheus' node_exporter](https://github.com/prometheus/node_exporter) to monitor the memory usage of my servers.
However, I am currently in the process of moving away from Prometheus to a new monitoring stack.
While I understand its advantages, I felt like Prometheus' pull architecture does not scale nicely.
Every time I spin up a new machine, I would have to centrally change Prometheus' configuration for it to query the new server.

In order to collect metrics from my servers, I am now using [Fluent Bit](https://fluentbit.io/).
I love Fluent Bit's way of configuration, which I can easily express as code and automate, its focus on efficiency, and its vendor-agnostic approach.
However, I have stumbled upon one, in my opinion, big issue with Fluent Bit: its `mem` plugin to monitor memory usage is _completely_ useless.
In this post I will go over the problem and my temporary solution.

# The Problem with Fluent Bit's `mem` Plugin

As can be seen in [the documentation](https://docs.fluentbit.io/manual/pipeline/inputs/memory-metrics), Fluent Bit's `mem` input plugin exposes a few metrics regarding memory usage which should be self-explanatory: `Mem.total`, `Mem.used`, `Mem.free`, `Swap.total`, `Swap.used` and `Swap.free`.
The problem is that `Mem.used` and `Mem.free` do not accurately reflect the machine's actual memory usage.
This is because these metrics include caches and buffers, which can be reclaimed by other processes if needed.
Most tools reporting memory usage therefore include an additional metric that specifies the memory _available_ on the system.
For example, the command `free -m` reports the following data on my laptop:
```text
              total        used        free      shared  buff/cache   available
Mem:          15864        3728        7334         518        5647       12136
Swap:          2383         663        1720
```

Notice that the `available` memory is more than the `free` memory.

While the issue is known (see [this](https://github.com/fluent/fluent-bit/pull/3092) and [this](https://github.com/fluent/fluent-bit/pull/5237) link), it is unfortunately not yet fixed.

# A Temporary Solution

The issues I linked previously provide stand-alone plugins that fix the problem, which will hopefully be merged into the official project at some point.
However, I didn't want to install another plugin, so I used Fluent Bit's `exec` input plugin and the `free` Linux command to query memory usage like so:
```conf
[INPUT]
    Name          exec
    Tag           memory
    Command       free -m | tail -2 | tr '\n' ' '
    Interval_Sec  1
```

To interpret the command's output, I created the following filter:
```conf
[FILTER]
    Name      parser
    Match     memory
    Key_Name  exec
    Parser    free
```

Lastly, I created the following parser (warning: regex shitcode incoming):
```conf
[PARSER]
    Name    free
    Format  regex
    Regex   ^Mem:\s+(?<mem_total>\d+)\s+(?<mem_used>\d+)\s+(?<mem_free>\d+)\s+(?<mem_shared>\d+)\s+(?<mem_buff_cache>\d+)\s+(?<mem_available>\d+) Swap:\s+(?<swap_total>\d+)\s+(?<swap_used>\d+)\s+(?<swap_free>\d+)
    Types   mem_total:integer mem_used:integer mem_free:integer mem_shared:integer mem_buff_cache:integer mem_available:integer swap_total:integer swap_used:integer swap_free:integer
```

With this configuration, you can use the `mem_available` metric to get accurate memory usage in Fluent Bit.

# Conclusion

Let's hope Fluent Bit's `mem` input plugin is improved upon soon, so this hacky solution is no longer needed.
I also intend to document my new monitoring pipeline, which at the moment consists of:
- Fluent Bit
- Fluentd
- Elasticsearch
- Grafana
---
layout: post
title: Hashicorp's License Change and my Home Lab - Update
date: 2023-08-17 18:15:00 Europe/Amsterdam
categories: hashicorp terraform vault nomad
---

_See the [Update](#update) at the end of the article._

Already a week ago, Hashicorp [announced](https://www.hashicorp.com/blog/hashicorp-adopts-business-source-license) it would change the license on almost all its projects.
Unlike [their previous license](https://github.com/hashicorp/terraform/commit/ab411a1952f5b28e6c4bd73071194761da36a83f), which was the Mozilla Public License 2.0, their new license is no longer truly open source.
It is called the Business Source License™ and restricts use of their software by competitors.
In their own words:
> Vendors who provide competitive services built on our community products will no longer be able to incorporate future releases, bug fixes, or security patches contributed to our products.

I found [a great article](https://meshedinsights.com/2021/02/02/rights-ratchet/) by MeshedInsights that names this behaviour the "rights ratchet model".
They describe a script start-ups follow to garner the interest of open source enthusiasts but eventually turn their back on them for profit.
The reason why Hashicorp can do this is that contributors signed a contributor license agreement (CLA).
This agreement transfers the copyright of contributors' code to Hashicorp, allowing them to change the license if they want to.

I find this action really regrettable because I like their products.
This sort of action was also why I wanted to avoid using an Elastic stack, as Elastic also [changed their license](https://www.elastic.co/pricing/faq/licensing).[^elastic]
These companies do not respect their contributors and the actually open source software stack they built their products on (Golang, Linux, etc.).

# Impact on my Home Lab

I am using Terraform in my home lab to manage several important things:
- Libvirt virtual machines
- PowerDNS records
- Elasticsearch configuration

With Hashicorp's anti open source move, I intend to move away from Terraform in the future.
While I will not use Hashicorp's products for new personal projects, I will leave my current setup as-is for some time, because there is no real need to migrate quickly.

I might also investigate some of Terraform's competitors, like Pulumi.
Hopefully there is a project that respects open source which I can use in the future.

# Update

A promising fork of Terraform has been announced, called [OpenTF](https://opentf.org/announcement).
They intend to join the Cloud Native Computing Foundation, which I think is a good effort, because Terraform is so important for modern cloud infrastructure.

# Footnotes

[^elastic]: While I am still using Elasticsearch, I don't use the rest of the Elastic stack in order to prevent a vendor lock-in.
---
layout: post
title: Homebrew SSH Certificate Authority for the Terraform Libvirt Provider
date: 2023-05-23 11:14:00 Europe/Amsterdam
categories: ssh terraform ansible
---

Ever SSH'ed into a freshly installed server and gotten the following annoying message?
```text
The authenticity of host 'host.tld (1.2.3.4)' can't be established.
ED25519 key fingerprint is SHA256:eUXGdm1YdsMAS7vkdx6dOJdOGHdem5gQp4tadCfdLB8.
Are you sure you want to continue connecting (yes/no)?
```

Or, even more annoying:
```text
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:eUXGdm1YdsMAS7vkdx6dOJdOGHdem5gQp4tadCfdLB8.
Please contact your system administrator.
Add correct host key in /home/user/.ssh/known_hosts to get rid of this message.
Offending ED25519 key in /home/user/.ssh/known_hosts:3
  remove with:
  ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "1.2.3.4"
ED25519 host key for 1.2.3.4 has changed and you have requested strict checking.
Host key verification failed.
```

Could it be that the programmers at OpenSSH simply like to annoy us with these confusing messages?
Maybe, but these warnings also serve as a way to notify users of a potential man-in-the-middle (MITM) attack.
I won't go into the details of this problem, but I refer you to [this excellent blog post](https://blog.g3rt.nl/ssh-host-key-validation-strict-yet-user-friendly.html).
Instead, I would like to talk about ways to get rid of these annoying warnings.

One obvious solution is simply to add each host to your `known_hosts` file.
This works okay when managing a handful of servers, but becomes unbearable when managing many.
In my case, I wanted to quickly spin up virtual machines using Duncan Mac-Vicar's [Terraform Libvirt provider](https://registry.terraform.io/providers/dmacvicar/libvirt/latest/docs), without having to accept each host key before connecting.
The solution? Issuing SSH host certificates using an SSH certificate authority.

## SSH Certificate Authorities vs. the Web

The idea of an SSH certificate authority (CA) is quite easy to grasp if you understand the web's Public Key Infrastructure (PKI).
Just like on the web, a trusted party can issue certificates that are offered when establishing a connection.
The idea is that just by trusting the trusted party, you trust every certificate they issue.
In the case of the web's PKI, this trusted party is bundled with and trusted by [your browser](https://wiki.mozilla.org/CA) or operating system.
In the case of SSH, however, the trusted party is you! (Okay, you can also run your own web certificate authority.)
With this great power comes great responsibility, which we will abuse heavily in this article.
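For completeness: on the client side, trusting an SSH CA for host certificates comes down to a single `@cert-authority` line in a `known_hosts` file. A sketch (the public key is a placeholder, and the host pattern should match your naming scheme):

```text
# /etc/ssh/ssh_known_hosts (or ~/.ssh/known_hosts)
@cert-authority * ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA...placeholder...
```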

## SSH Certificate Authority for Terraform

So, let's start with a plan.
I want to spawn virtual machines with Terraform which are automatically provisioned with an SSH host certificate issued by my CA.
This CA will be another host on my private network, issuing certificates over SSH.

### Fetching the SSH Host Certificate

First, we generate an SSH key pair in Terraform.
Below is the code for that:
```terraform
resource "tls_private_key" "debian" {
  algorithm = "ED25519"
}

data "tls_public_key" "debian" {
  private_key_pem = tls_private_key.debian.private_key_pem
}
```

Now that we have an SSH key pair, we need to somehow make Terraform communicate it to the CA.
Luckily for us, there is a way for Terraform to execute an arbitrary command: the `external` data source.
We call the script below:
```terraform
data "external" "cert" {
  program = ["bash", "${path.module}/get_cert.sh"]

  query = {
    pubkey   = trimspace(data.tls_public_key.debian.public_key_openssh)
    host     = var.name
    cahost   = var.ca_host
    cascript = var.ca_script
    cakey    = var.ca_key
  }
}
```

These query parameters end up on the script's stdin in JSON format.
We can then read these parameters and send them to the CA over SSH.
The result must also be in JSON format.
```bash
#!/bin/bash
set -euo pipefail
IFS=$'\n\t'

# Read the query parameters
eval "$(jq -r '@sh "PUBKEY=\(.pubkey) HOST=\(.host) CAHOST=\(.cahost) CASCRIPT=\(.cascript) CAKEY=\(.cakey)"')"

# Fetch certificate from the CA
# Warning: extremely ugly code that I am too lazy to fix
CERT=$(ssh -o ConnectTimeout=3 -o ConnectionAttempts=1 root@$CAHOST '"'"$CASCRIPT"'" host "'"$CAKEY"'" "'"$PUBKEY"'" "'"$HOST"'".dmz')

jq -n --arg cert "$CERT" '{"cert":$cert}'
```
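The `eval`/`@sh` combination is a common idiom for `external` data source scripts: `@sh` shell-quotes each interpolated value, so the generated string is safe to evaluate as variable assignments. A toy version of the same pattern (names hypothetical):

```shell
# @sh produces e.g. HOST='vm1' CAHOST='ca.local', which eval then assigns
eval "$(echo '{"host":"vm1","cahost":"ca.local"}' | jq -r '@sh "HOST=\(.host) CAHOST=\(.cahost)"')"
echo "$HOST $CAHOST"
# → vm1 ca.local
```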
|
||||
|
||||
We see that a script is called on the remote host that issues the certificate.
|
||||
This is just a simple wrapper around `ssh-keygen`, which you can see below.
|
||||
```bash
|
||||
#!/bin/bash
|
||||
set -euo pipefail
|
||||
IFS=$'\n\t'
|
||||
|
||||
host() {
|
||||
CAKEY="$2"
|
||||
PUBKEY="$3"
|
||||
HOST="$4"
|
||||
|
||||
echo "$PUBKEY" > /root/ca/"$HOST".pub
|
||||
ssh-keygen -h -s /root/ca/keys/"$CAKEY" -I "$HOST" -n "$HOST" /root/ca/"$HOST".pub
|
||||
cat /root/ca/"$HOST"-cert.pub
|
||||
rm /root/ca/"$HOST"*.pub
|
||||
}
|
||||
|
||||
"$1" "$@"
|
||||
```
|
||||
|
||||
### Appeasing the Terraform Gods
|
||||
|
||||
So nice, we can fetch the SSH host certificate from the CA.
|
||||
We should just be able to use it right?
|
||||
We can, but it brings a big annoyance with it: Terraform will fetch a new certificate every time it is run.
|
||||
This is because the `external` feature of Terraform is a data source.
|
||||
If we were to use this data source for a Terraform resource, it would need to be updated every time we run Terraform.
|
||||
I have not been able to find a way to avoid fetching the certificate every time, except for writing my own resource provider which I'd rather not.
|
||||
I have, however, found a way to hack around the issue.
|
||||
|
||||
The idea is as follows: we can use Terraform's `ignore_changes` to, well, ignore any changes of a resource.
|
||||
Unfortunately, we cannot use this for a `data` source, so we must create a glue `null_resource` that supports `ignore_changes`.
|
||||
This is shown in the code snipppet below.
|
||||
We use the `triggers` property simply to copy the certificate in; we don't use it for it's original purpose.
|
||||
|
||||
```terraform
|
||||
resource "null_resource" "cert" {
|
||||
triggers = {
|
||||
cert = data.external.cert.result["cert"]
|
||||
}
|
||||
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
triggers
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
And voilà, we can now use `null_resource.cert.triggers["cert"]` as our certificate, that won't trigger replacements in Terraform.
|
||||
|
||||
### Setting the Host Certificate with Cloud-Init
|
||||
|
||||
Terraform's Libvirt provider has native support for Cloud-Init, which is very handy.
We can give the host certificate directly to Cloud-Init and place it on the virtual machine.
Inside the Cloud-Init configuration, we can set the `ssh_keys` property to do this:

```yml
ssh_keys:
  ed25519_private: |
    ${indent(4, private_key)}
  ed25519_certificate: "${host_cert}"
```

I hardcoded this to ED25519 keys, because that is all I use.

This works perfectly, and I never have to accept host certificates from virtual machines again.

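For context, this template could be rendered from Terraform roughly as below; the resource names, template path, and key source here are hypothetical illustrations, not my actual module code:

```terraform
resource "libvirt_cloudinit_disk" "init" {
  name = "init.iso"

  # cloud-init.yml.tpl contains the ssh_keys snippet shown above;
  # templatefile() substitutes the private_key and host_cert placeholders.
  user_data = templatefile("${path.module}/cloud-init.yml.tpl", {
    private_key = tls_private_key.host.private_key_openssh
    host_cert   = null_resource.cert.triggers["cert"]
  })
}
```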
### Caveats

A sharp eye might have noticed that the lifecycle of these host certificates is severely lacking.
Namely, the deployed host certificates have no expiration date, nor is there a revocation mechanism.
There are ways to implement these, but for my home lab I did not deem this necessary at this point.
In a more professional environment, I would suggest using [Hashicorp's Vault](https://www.vaultproject.io/).

This project did teach me about the limits and flexibility of Terraform, so all in all a success!
All code can be found on the git repository [here](https://git.kun.is/home/tf-modules/src/branch/master/debian).
---
layout: post
title: Home Lab Infrastructure Snapshot August 2023
date: 2023-08-27 22:23:00 Europe/Amsterdam
categories: infrastructure homelab
---

I have been meaning to write about the current state of my home lab infrastructure for a while now.
Now that the most important parts are quite stable, I think the time is ripe.
I expect this post to get quite long, so I might have to leave out some details along the way.

This post will be a starting point for future infrastructure snapshots, which I hope to put out periodically.
That is, if there is enough worth talking about.

Keep an eye out for the <i class="fa-solid fa-code-branch"></i> icon, which links to the source code and configuration of anything mentioned.
Oh yeah, did I mention everything I do is open source?

# Networking and Infrastructure Overview

## Hardware and Operating Systems

Let's start with the basics: what kind of hardware do I use for my home lab?
The most important servers are my three [Gigabyte Brix GB-BLCE-4105](https://www.gigabyte.com/Mini-PcBarebone/GB-BLCE-4105-rev-10).
Two of them have 16 GB of memory, and one has 8 GB.
I named these servers as follows:

- **Atlas**: because this server was going to "lift" a lot of virtual machines.
- **Lewis**: we started out with a "Max" server named after the Formula 1 driver Max Verstappen, but it kind of became an unmanageable behemoth without infrastructure-as-code. Our second server we subsequently named Lewis after his colleague Lewis Hamilton. Note: people around me vetoed these names and I am no F1 fan!
- **Jefke**: it's a funny Belgian name. That's all.

Here is a picture of them sitting in their cosy closet:



If you look to the left, you will also see a Raspberry Pi 4B.
I use this Pi for some rudimentary monitoring of whether servers and services are running.
More on this in the relevant section below.
The Pi is called **Iris** because it's a messenger for the other servers.

I used to run Ubuntu on these systems, but I have since migrated to Debian.
The main reasons were Canonical [putting advertisements in my terminal](https://askubuntu.com/questions/1434512/how-to-get-rid-of-ubuntu-pro-advertisement-when-updating-apt) and pushing Snap, which has a [proprietary backend](https://hackaday.com/2020/06/24/whats-the-deal-with-snap-packages/).
Two of my servers run the newly released Debian Bookworm, while one still runs Debian Bullseye.

## Networking

For networking, I wanted hypervisors and virtual machines separated by VLANs for security reasons.
The following picture shows a simplified view of the VLANs present in my home lab:



All virtual machines are connected to a virtual bridge which tags network traffic with the DMZ VLAN.
The hypervisors VLAN is used for traffic to and from the hypervisors.
Devices in the hypervisors VLAN are allowed to connect to devices in the DMZ, but not vice versa.
The hypervisors are connected to a switch using a trunk link, which allows both DMZ and hypervisors traffic.

I realised the above design using ifupdown.
Below is the configuration for each hypervisor, which creates a new `enp3s0.30` interface carrying all DMZ traffic from the `enp3s0` interface [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/home/hypervisors/src/commit/71b96d462116e4160b6467533fc476f3deb9c306/ansible/dmz.conf.j2).

```text
auto enp3s0.30
iface enp3s0.30 inet manual
iface enp3s0.30 inet6 auto
    accept_ra 0
    dhcp 0
    request_prefix 0
    privext 0
    pre-up sysctl -w net/ipv6/conf/enp3s0.30/disable_ipv6=1
```

This configuration seems more complex than it actually is.
Most of it is there to make sure the interface is not assigned an IPv4/IPv6 address on the hypervisor host.
The magic `.30` at the end of the interface name tags this interface with VLAN ID 30 (DMZ in my case).

Now that we have an interface tagged for the DMZ VLAN, we can create a bridge that future virtual machines can connect to:

```text
auto dmzbr
iface dmzbr inet manual
    bridge_ports enp3s0.30
    bridge_stp off
iface dmzbr inet6 auto
    accept_ra 0
    dhcp 0
    request_prefix 0
    privext 0
    pre-up sysctl -w net/ipv6/conf/dmzbr/disable_ipv6=1
```

Just like the previous config, this is quite bloated because I don't want the interface to be assigned an IP address on the host.
Most importantly, the `bridge_ports enp3s0.30` line makes this interface a virtual bridge for the `enp3s0.30` interface.

And voilà, we now have a virtual bridge on each machine, where only DMZ traffic will flow.
Here I verify that this configuration works:

<details>
<summary>Show</summary>

We can see that the two virtual interfaces are created, and are assigned only a MAC address and not an IP address:

```text
root@atlas:~# ip a show enp3s0.30
4: enp3s0.30@enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master dmzbr state UP group default qlen 1000
    link/ether d8:5e:d3:4c:70:38 brd ff:ff:ff:ff:ff:ff
5: dmzbr: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 4e:f7:1f:0f:ad:17 brd ff:ff:ff:ff:ff:ff
```

Pinging a VM from a hypervisor works:

```text
root@atlas:~# ping -c1 maestro.dmz
PING maestro.dmz (192.168.30.8) 56(84) bytes of data.
64 bytes from 192.168.30.8 (192.168.30.8): icmp_seq=1 ttl=63 time=0.457 ms
```

Pinging a hypervisor from a VM does not work:

```text
root@maestro:~# ping -c1 atlas.hyp
PING atlas.hyp (192.168.40.2) 56(84) bytes of data.

--- atlas.hyp ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
```

</details>

## DNS and DHCP

Now that we have a working DMZ network, let's build on it to get DNS and DHCP working.
This will enable new virtual machines to obtain a static or dynamic IP address and register their host names in DNS.
This has actually been incredibly annoying due to our friend [Network address translation (NAT)](https://en.wikipedia.org/wiki/Network_address_translation?useskin=vector).

<details>
<summary>NAT recap</summary>

Network address translation (NAT) is a function of a router which allows multiple hosts to share a single IP address.
This is needed for IPv4, because IPv4 addresses are scarce and usually one household is assigned only a single IPv4 address.
This is one of the problems IPv6 attempts to solve (mainly by having so many IP addresses that they should never run out).
To solve the problem for IPv4, each host in a network is assigned a private IPv4 address, which can be reused in every network.

The router must then perform address translation.
It does this by keeping track of ports opened by hosts in its private network.
If a packet from the internet arrives at the router for such a port, it forwards that packet to the correct host.

</details>

I would like to host my own DNS on a virtual machine (called **hermes**, more on VMs later) in the DMZ network.
This presents basically two problems:

1. The upstream DNS server will refer to the public internet-accessible IP address of our DNS server.
   This IP address has no meaning inside the private network due to NAT, and the router will reject the packet.
2. Our DNS resolves hosts to their public internet-accessible IP addresses.
   This is similar to the previous problem, as the public IP address has no meaning inside the private network.

The first problem can be remediated by overriding the location of the DNS server for hosts inside the DMZ network.
This can be achieved on my router, which uses Unbound as its recursive DNS server:



Any DNS requests to Unbound for domains in either `dmz` or `kun.is` will now be forwarded to `192.168.30.7` (port 5353).
This is the virtual machine hosting my DNS.

The second problem can be solved at the DNS server.
We need to do some magic overriding, which [dnsmasq](https://dnsmasq.org/docs/dnsmasq-man.html) is perfect for [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/home/hermes/src/commit/488024a7725f2325b8992e7a386b4630023f1b52/ansible/roles/dnsmasq/files/dnsmasq.conf):

```conf
alias=84.245.14.149,192.168.30.8
server=/kun.is/192.168.30.7
```

This always overrides the public IPv4 address to the private one.
It also overrides the DNS server for `kun.is` to `192.168.30.7`.

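To make the effect of that `alias` line concrete, here is a toy Python model of the rewrite dnsmasq applies to A-record answers. This illustrates only the logic, not how dnsmasq is actually implemented:

```python
# Toy model of dnsmasq's alias option: rewrite public IPv4 answers
# to their private equivalents before they reach the client.
PUBLIC_TO_PRIVATE = {"84.245.14.149": "192.168.30.8"}


def rewrite_answers(answers):
    """Apply the alias mapping to a list of A-record answers."""
    return [PUBLIC_TO_PRIVATE.get(ip, ip) for ip in answers]


print(rewrite_answers(["84.245.14.149"]))  # ['192.168.30.8']
print(rewrite_answers(["1.1.1.1"]))        # unrelated answers pass through unchanged
```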
Finally, behind the dnsmasq server, I run [PowerDNS](https://www.powerdns.com/) as an authoritative DNS server [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/home/hermes/src/branch/master/ansible/roles/powerdns).
I like this DNS server because I can manage it with Terraform [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/home/hermes/src/commit/488024a7725f2325b8992e7a386b4630023f1b52/terraform/dns/kun_is.tf).

Here is a small diagram showing my setup (my networking teacher would probably kill me for this):



# Virtualization

Now that we have laid out the basic networking, let's talk virtualization.
Each of my servers is configured to run KVM virtual machines, orchestrated using Libvirt.
Configuration of the physical hypervisor servers, including KVM/Libvirt, is done using Ansible.
The VMs are spun up using Terraform and the [dmacvicar/libvirt](https://registry.terraform.io/providers/dmacvicar/libvirt/latest/docs) Terraform provider.

This all isn't too exciting, except that I created a Terraform module that abstracts the Terraform Libvirt provider for my specific scenario [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/home/tf-modules/src/commit/e77d62f4a2a0c3847ffef4434c50a0f40f1fa794/debian/main.tf):

```terraform
module "maestro" {
  source      = "git::https://git.kun.is/home/tf-modules.git//debian"
  name        = "maestro"
  domain_name = "tf-maestro"
  memory      = 10240
  mac         = "CA:FE:C0:FF:EE:08"
}
```

This automatically creates a Debian virtual machine with the properties specified.
It also sets up certificate-based SSH authentication, which I talked about [before]({% post_url homebrew-ssh-ca/2023-05-23-homebrew-ssh-ca %}).

# Clustering

With virtualization explained, let's move up one level further.
Each of my three physical servers hosts a virtual machine running Docker, and together these form a Docker Swarm.
I use Traefik as a reverse proxy, which routes requests to the correct container.

All data is hosted on a single machine and made available to containers using NFS.
While this might not be very secure (NFS is unencrypted and has no proper authentication), it is quite fast.

As of today, I host the following services on my Docker Swarm [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/home/shoarma):

- [Forgejo](https://forgejo.org/) as Git server
- [FreshRSS](https://www.freshrss.org/) as RSS aggregator
- [Hedgedoc](https://hedgedoc.org/) for Markdown note-taking
- [Inbucket](https://inbucket.org/) for disposable email
- [Cyberchef](https://cyberchef.org/) for the lulz
- [Kitchenowl](https://kitchenowl.org/) for grocery lists
- [Mastodon](https://joinmastodon.org/) for microblogging
- A monitoring stack (read more below)
- [Nextcloud](https://nextcloud.com/) for cloud storage
- [Pi-hole](https://pi-hole.net/) to block advertisements
- [Radicale](https://radicale.org/v3.html) for calendar and contacts sync
- [Seafile](https://www.seafile.com/en/home/) for cloud storage and sync
- [Shepherd](https://github.com/containrrr/shepherd) for automatic container updates
- [Nginx](https://nginx.org/en/) hosting static content (like this page!)
- [Docker Swarm dashboard](https://hub.docker.com/r/charypar/swarm-dashboard/#!)
- [Syncthing](https://syncthing.net/) for file sync
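For a flavour of how Traefik picks up these services, a swarm service only needs a few deploy-time labels in its stack file. The fragment below is a hypothetical sketch; the service name, domain, port, and network are made up:

```yml
services:
  hedgedoc:
    image: quay.io/hedgedoc/hedgedoc:latest
    networks:
      - traefik_net
    deploy:
      labels:
        # Tell Traefik to route md.example.com to this service on port 3000.
        - traefik.enable=true
        - traefik.http.routers.hedgedoc.rule=Host(`md.example.com`)
        - traefik.http.services.hedgedoc.loadbalancer.server.port=3000

networks:
  traefik_net:
    external: true
```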

# CI / CD

For CI/CD, I run [Concourse CI](https://concourse-ci.org/) in a separate VM.
This is needed because Concourse heavily uses containers to create reproducible builds.

Although I should probably use it for more, I currently use Concourse for three pipelines:

- A pipeline to build this static website and create a container image of it.
  The image is then uploaded to the image registry of my Forgejo instance.
  I love it when I can use stuff I previously built :)
  The pipeline finally deploys this new image to the Docker Swarm [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/pim/static/src/commit/eee4f0c70af6f2a49fabb730df761baa6475db22/pipeline.yml).
- A pipeline to create a Concourse resource that sends Apprise alerts (Concourse-ception?) [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/pim/concourse-apprise-notifier/src/commit/b5d4413c1cd432bc856c45ec497a358aca1b8b21/pipeline.yml)
- A pipeline to build a custom Fluentd image with plugins installed [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/pim/fluentd)
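As an illustration of what such a pipeline looks like, here is a minimal hypothetical Concourse pipeline; the resource URI and task file are made up, not my actual configuration:

```yml
resources:
  - name: static-site
    type: git
    source:
      uri: https://git.kun.is/pim/static.git
      branch: master

jobs:
  - name: build
    plan:
      - get: static-site
        trigger: true          # run on every new commit
      - task: build-image
        file: static-site/ci/build.yml
```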

# Backups

To create backups, I use [Borg](https://www.borgbackup.org/).
As I keep all data on one machine, the backup process is quite simple.
In fact, all this data is stored in a single Libvirt volume.
To configure Borg with a simple declarative script, I use [Borgmatic](https://torsion.org/borgmatic/).

In order to back up the data inside the Libvirt volume, I create a snapshot to a file.
I can then mount this snapshot in my file system.
The files can then be backed up while the system is still running.
It is also possible to simply back up the Libvirt image, but that takes more time and storage [<i class="fa-solid fa-code-branch"></i>](https://git.kun.is/home/hypervisors/src/commit/71b96d462116e4160b6467533fc476f3deb9c306/ansible/roles/borg/backup.yml.j2).
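A Borgmatic configuration for such a setup could look roughly like this. The paths and repository location are hypothetical, and the exact schema depends on your Borgmatic version, so treat this as a sketch rather than a drop-in config:

```yml
location:
  source_directories:
    - /mnt/data-snapshot   # the mounted snapshot of the Libvirt volume
  repositories:
    - /mnt/backup/borg-repo

retention:
  keep_daily: 7
  keep_weekly: 4
  keep_monthly: 6
```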

# Monitoring and Alerting

The last topic I would like to talk about is monitoring and alerting.
This is something I'm still actively improving and have only just set up properly.

## Alerting

For alerting, I wanted something that runs entirely on my own infrastructure.
I settled on Apprise + Ntfy.

[Apprise](https://github.com/caronc/apprise) is a server that can send notifications to dozens of services.
Application developers thus only need to implement the Apprise API to gain access to all these services.
The Apprise API itself is also very simple.
By using Apprise, I can also easily switch to another notification service later.
[Ntfy](https://ntfy.sh/) is free software made for mobile push notifications.

I use this alerting system in quite a lot of places in my infrastructure, for example when creating backups.

## Uptime Monitoring

The first monitoring setup I created used [Uptime Kuma](https://github.com/louislam/uptime-kuma).
Uptime Kuma periodically pings a service to see whether it is still running.
You can do a literal ping, test HTTP response codes, check database connectivity and much more.
I use it to check whether my services and VMs are online.
And the best part is, Uptime Kuma supports Apprise, so I get push notifications on my phone whenever something goes down!

## Metrics and Log Monitoring

A new monitoring system I am still in the process of deploying is focused on metrics and logs.
I plan on creating a separate blog post about this, so keep an eye out for that (for example using RSS :)).
Safe to say, it is no basic ELK stack!

# Conclusion

That's it for now!
Hopefully I inspired someone to build something... or showed them how not to :)
---
layout: post
title: My Experiences with virtio-9p
date: 2023-05-31 14:18:00 Europe/Amsterdam
categories: libvirt virtio 9p
---

When I was scaling up my home lab, I started thinking more about data management.
I hadn't (and still haven't) set up any form of network storage.
I have, however, set up a backup mechanism using [Borg](https://borgbackup.readthedocs.io/en/stable/).
Still, I want to operate lots of virtual machines, and backing up each one of them separately seemed excessive.
So I started thinking: what if I just let the host machines back up the data?
After all, the number of physical hosts in my home lab is unlikely to increase drastically.

# The Use Case for Sharing Directories

I started working out this idea further.
Without network storage, I needed a way for guest VMs to access the host's disks.
There are two possibilities here: expose a block device or expose a file system.
Creating a whole virtual disk just for the data of some VMs seemed wasteful, and from my experience it also increases backup times dramatically.
I therefore searched for a way to mount a directory from the host OS on the guest VM.
This is when I stumbled upon [this blog post](https://rabexc.org/posts/p9-setup-in-libvirt) talking about sharing directories with virtual machines.

# Sharing Directories with virtio-9p

virtio-9p is a way to map a directory on the host OS to a special device on the virtual machine.
In `virt-manager`, it looks like the following:



Under the hood, virtio-9p uses the 9pnet protocol.
Originally developed at Bell Labs, support for this protocol is available in all modern Linux kernels.
If you share a directory with a VM, you can then mount it.
Below is an extract of my `/etc/fstab` that automatically mounts the directory:

```text
data /mnt/data 9p trans=virtio,rw 0 0
```

The first argument (`data`) refers to the name you gave the share on the host.
With the `trans` option, we specify that this is a virtio share.
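The fstab entry is equivalent to mounting the share by hand. This is illustrative; it assumes a guest with the 9p share named `data` actually attached:

```text
mount -t 9p -o trans=virtio,rw data /mnt/data
```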

# Problems with virtio-9p

At first I had no problems with my setup, but I am now contemplating moving to a network storage based setup because of two problems.

The first problem is that some files suddenly changed ownership from `libvirt-qemu` to `root`.
If a file is owned by `root`, the guest OS can still see it, but cannot access it.
I am not entirely sure the problem lies with virtio-9p, but I suspect it does.
For anyone experiencing this problem, here is a small shell command to revert ownership to the `libvirt-qemu` user:

```shell
find . -user root -exec chown libvirt-qemu:libvirt-qemu {} +
```

Another problem I have experienced is guests being unable to mount the directory at all.
I have only experienced this once, but it was highly annoying.
To fix it, I had to reboot the whole physical machine.

# Alternatives

virtio-9p seemed like a good idea but, as discussed, I had some problems with it.
It seems [virtiofs](https://virtio-fs.gitlab.io/) might be an interesting alternative, as it is designed specifically for sharing directories with VMs.

As for me, I will probably finally look into deploying network storage, either with NFS or SSHFS.