---
layout: post
title: Home Lab Infrastructure Snapshot 2024
date: 2025-01-12 20:49:00 Europe/Amsterdam
categories: infrastructure homelab
---
If you are reading this when this post was just published: fear not, you did not
just travel back in time to the glorious old days of '24. No, I was simply too
~~lazy~~ busy to write this post last year. Busy with what? Well, I basically
changed every part of my infrastructure over the past year, obviously. This is
all because I got infected by the NixOS virus!
This post is a follow-up to a similar post from 2023 where I described my whole
home lab as it was back then. In this post I will mainly focus on the things
that changed since then.
# Hardware and Operating Systems
## Hardware
It seems I was rather frugal this year, as I did not buy any new servers or
replace existing ones.
However, as we will see in a bit, I am now running Linux on my gaming PC and
deploying it just like I would any one of my servers. Therefore, I will take a
moment to go over the hardware of that machine:
- **Motherboard**: Gigabyte B550 GAMING X V2
- **CPU**: AMD Ryzen 5 5600
- **GPU**: AMD Radeon RX 6700 XT
Here is a picture of its guts:
![The inside of my gaming PC showing its components.](gamepc.jpeg)
You probably noticed the random components inside it, which are fixed just
securely enough to not cause a fire. These are actually parts of a
[DIY PiKVM V2](https://docs.pikvm.org/v2/). PiKVM is a cool project to transform
your Raspberry Pi into a KVM (Keyboard, Video and Mouse). A KVM, in its most
basic form, allows you to see the video output and control the mouse and
keyboard of a system remotely.
In the bottom right part of the picture, connected to the ribbon cable, you can
see the part that allows the Raspberry Pi to power on/off the PC, reset the PC,
read power status and read disk activity, all via its GPIO! In the top right
part of the picture you can see the component that converts the PC's HDMI output
to CSI, which the Pi supports. Just behind that component, but out of view, the
Pi is secured.
PiKVM has some more neat features: it can emulate a USB storage device which
the PC can boot from. This allows me, for example, to run a memory test or
tweak BIOS settings.
One other piece of hardware is worth mentioning. Similar to the PiKVM, I also
bought a
[Sipeed NanoKVM](https://wiki.sipeed.com/hardware/en/kvm/NanoKVM/introduction.html).
I opted for the Lite variant, which is only able to read HDMI and interface with
the host's USB; it can't control the power like the PiKVM that I built. For now,
I have attached it to one of my Kubernetes nodes.
![Unboxing of the NanoKVM lite.](nanokvm.png) _Unboxing of the NanoKVM Lite,
[courtesy of Sipeed](https://github.com/sipeed/sipeed_wiki/blob/main/docs/hardware/assets/NanoKVM/unbox/lite_ubox.png)._
## Operating Systems (or, the NixOS propaganda)
For people that know me in real life, this section will not be a surprise. In
late 2023 I started getting interested in NixOS and in 2024 I worked hard to
install it on any piece of hardware I could get my hands on.
NixOS enables me to apply my Infrastructure-as-Code spirit to the OS and take it
to the extreme. I used to customize an existing OS (mainly Debian) with Ansible
to configure it to my liking. This works to some extent: the steps needed to get
to my desired configuration are codified! However, after using Ansible for some
time I started seeing drawbacks.
Ansible likes to advertise itself as declarative (because it's YAML, which is
declarative by nature, right?). Sure, all steps _should_ be declarative and
idempotent (they can be run repeatedly without changing the result). But when
you do anything non-trivial, you start to introduce dependencies between tasks,
and suddenly those tasks can no longer be run on their own. Taken together, an
Ansible role should indeed still be idempotent, but in practice I often still
have to reason about the current state of my machines.
In contrast, NixOS is configured entirely declaratively. You don't have to tell
NixOS how to go from configuration A to configuration B; that is something NixOS
will figure out for you. Not convinced? This is how you get a fully functional
Kubernetes node on NixOS:
```nix
{
  config.services.k3s.enable = true;
}
```
### Colmena
Deploying changes on a single NixOS host works fine with the `nixos-rebuild`
tool. However, if you manage multiple NixOS servers, a bit more comfort is
quickly desired. That's why I am happily using the
[Colmena](https://github.com/zhaofengli/colmena) deployment tool. Below is an
example deployment for my servers.
```shell
colmena apply --experimental-flake-eval --on @server
[INFO ] Using flake: git+file:///home/pim/git/nixos-configs
[WARN ] Using direct flake evaluation (experimental)
[INFO ] Enumerating nodes...
[INFO ] Selected 4 out of 6 hosts.
✅ 2m All done!
(...) ✅ 37s Evaluated warwick, atlas, lewis, and jefke
atlas ✅ 0s Built "/nix/store/r0jpg2nrqdnk9gywxvzrh2n3lrwhzy56-nixos-system-atlas-24.11pre-git"
warwick ✅ 4s Built "/nix/store/kfbm5c2fqd2xv6lasvb2nhc8g815hl79-nixos-system-warwick-24.11pre-git" on target node
lewis ✅ 0s Built "/nix/store/h238ly237srjil0fdxzrj29ib6blcmlg-nixos-system-lewis-24.11pre-git"
jefke ✅ 0s Built "/nix/store/b7pnan3wmgk3y0193rka95i82sl33xpc-nixos-system-jefke-24.11pre-git"
atlas ✅ 10s Pushed system closure
jefke ✅ 5s Pushed system closure
lewis ✅ 9s Pushed system closure
warwick ✅ 53s Uploaded keys (pre-activation)
jefke ✅ 49s Uploaded keys (pre-activation)
lewis ✅ 47s Uploaded keys (pre-activation)
atlas ✅ 46s Uploaded keys (pre-activation)
jefke ✅ 9s Activation successful
atlas ✅ 18s Activation successful
warwick ✅ 4s Activation successful
lewis ✅ 16s Activation successful
```
Notice the use of `@server`, which only deploys to hosts that are tagged with
`server`. I also deploy my laptop and gaming PC with this tool, so this is very
handy. Also notice that the host `warwick` is built on the target node. This is
because `warwick` is a Raspberry Pi 4 with an ARM architecture, so the system is
compiled on the remote machine itself.
## Virtualization
I had two main purposes for virtual machines: security and host network
isolation. I used to run Docker Swarm on my servers, which messes around with
your `iptables` rules. Virtual machines were therefore a way to separate that
from the main host.
However, I found that VMs are pretty difficult to manage. It's difficult to
define them in an infrastructure-as-code way. To this end, I used Terraform with
the pretty buggy
[terraform-provider-libvirt](https://github.com/dmacvicar/terraform-provider-libvirt)
provider, but this was not a smooth experience. I also found them running out of
memory quite often. And while I could have invested time into something like
memory ballooning, I didn't really want to spend my time with that.
So now my servers are completely VM-free!
## Container Orchestration
Alongside the operating system, my container clustering setup changed the most
this year.
Before this year, I was using
[Docker Swarm](https://docs.docker.com/engine/swarm/) for container clustering.
The main benefit for me was its similarity to Docker Compose, which I was using
at the time. Unfortunately, Docker Swarm is not widely used and doesn't seem
well maintained either. It also lacks a feature I really wanted: "distributed"
storage that syncs data between nodes to mitigate hardware failure.
With Docker Swarm out of the picture, I needed to choose another solution.
Initially, I wanted to use [Hashicorp Nomad](https://www.nomadproject.io/).
Unfortunately, Nomad is no longer open source software, so this is out of the
question for me. Then, apart from some smaller projects, I really only had one
option left: **Kubernetes**!
Below I outline some of the components I use in my Kubernetes setup that make it
tick.
### k3s
I opted for the [k3s](https://k3s.io/) Kubernetes "distribution", because I
wanted to start simple and as you saw in
[a previous section](#operating-systems-or-the-nixos-propaganda), it's super
simple to enable on NixOS. k3s has all the Kubernetes components out-of-the-box
to run a single-node cluster.
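k3s can also form a multi-node cluster, where agents join an existing server
using a shared token. A sketch of what the relevant NixOS options could look
like (the disabled component, token path and port handling below are my own
assumptions, not necessarily my exact setup):
```nix
{
  # Sketch of a control-plane node; a worker would instead set role = "agent"
  # together with serverAddr and tokenFile to join it. Values are placeholders.
  services.k3s = {
    enable = true;
    role = "server";
    # Assumption: disable k3s's bundled service load balancer when using MetalLB.
    extraFlags = "--disable=servicelb";
  };
  networking.firewall.allowedTCPPorts = [ 6443 ]; # Kubernetes API server
}
```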
In the future, I might swap to using NixOS' `services.kubernetes`
[options](https://search.nixos.org/options?channel=24.11&from=0&size=50&sort=relevance&type=packages&query=services.kubernetes),
as I hear these work great as well and give you more control over what runs.
### MetalLB
If you use the Kubernetes platform of a cloud provider, like
[Microsoft's AKS](https://learn.microsoft.com/en-us/azure/aks/),
[Google's GKE](https://cloud.google.com/kubernetes-engine) or
[Amazon's EKS](https://aws.amazon.com/eks/) (all really inspiring names,
probably great to convince your upper management), they provide network load
balancers by default. These are great, because they simplify exposing services
outside of the cluster. Unfortunately, bare-metal Kubernetes lacks a load
balancer and this is where [MetalLB](https://metallb.io/) comes in.
MetalLB works by assigning locally-routable IP addresses to Kubernetes services.
It has two methods to do this: either via ARP or via BGP. I opted for ARP to not
overcomplicate my network setup. To use ARP with MetalLB, you simply have to
reserve some IP address space for MetalLB to play with. It will dynamically
assign these IPs to Kubernetes services and advertise these via ARP.
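As an illustration (not my literal config), the two MetalLB resources needed for
this L2/ARP mode are an `IPAddressPool` and an `L2Advertisement`. Here they are
written as a Nix attribute set that you could serialise to YAML or JSON and
apply to the cluster; the address range is made up.
```nix
{
  # Reserve an address range for MetalLB to hand out (example range).
  ipAddressPool = {
    apiVersion = "metallb.io/v1beta1";
    kind = "IPAddressPool";
    metadata = { name = "main"; namespace = "metallb-system"; };
    spec.addresses = [ "192.168.30.200-192.168.30.250" ];
  };
  # Announce those addresses on the local network via ARP.
  l2Advertisement = {
    apiVersion = "metallb.io/v1beta1";
    kind = "L2Advertisement";
    metadata = { name = "main"; namespace = "metallb-system"; };
    spec.ipAddressPools = [ "main" ];
  };
}
```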
### Longhorn
Similar to MetalLB, [Longhorn](https://longhorn.io/) fills a gap that Kubernetes
has on bare-metal deployments. Without Longhorn, any Kubernetes storage volume
is tied to a particular host. This is very problematic when you want to move
containers around, or if a physical server dies.
Longhorn fixes this by replicating block storage across multiple Kubernetes
nodes. Want to take down a node? No problem, the workloads can move to another
node with the same data. Longhorn can also create periodic backups, which I
personally back up off-site to [BorgBase](https://www.borgbase.com/).
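To give an idea of how workloads get this replicated storage (a sketch; the
claim name and size are made up), a pod simply requests a volume from Longhorn's
storage class and Longhorn handles replication behind the scenes:
```nix
{
  # PersistentVolumeClaim backed by Longhorn, again expressed as a Nix
  # attribute set for consistency with the rest of this post.
  apiVersion = "v1";
  kind = "PersistentVolumeClaim";
  metadata.name = "blog-data"; # placeholder name
  spec = {
    storageClassName = "longhorn"; # the storage class Longhorn installs
    accessModes = [ "ReadWriteOnce" ];
    resources.requests.storage = "1Gi"; # placeholder size
  };
}
```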
### Cert-manager
Another amazing part of my Kubernetes setup is
[Cert-manager](https://cert-manager.io/). Cert-manager automatically manages TLS
certificates that are needed for your Kubernetes deployments.
Using Cert-manager is super simple. First, you set up a certificate issuer,
like Let's Encrypt. Then you can ask Cert-manager to automatically provision
certificates. I use Kubernetes Ingresses, so I simply add the
`cert-manager.io/cluster-issuer` annotation to an Ingress, and Cert-manager uses
its `host` field to request the certificate.
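As an illustration of that first step (the issuer name, e-mail address and
ingress class below are placeholders, not necessarily mine), a Let's Encrypt
`ClusterIssuer` for the HTTP-01 challenge looks roughly like this, written here
as a Nix attribute set:
```nix
{
  apiVersion = "cert-manager.io/v1";
  kind = "ClusterIssuer";
  metadata.name = "letsencrypt";
  spec.acme = {
    server = "https://acme-v02.api.letsencrypt.org/directory";
    email = "admin@example.org";                       # placeholder address
    privateKeySecretRef.name = "letsencrypt-account";  # ACME account key secret
    solvers = [ { http01.ingress.class = "nginx"; } ]; # placeholder ingress class
  };
}
```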
I abstracted the ingress side of this a bit in my NixOS configuration; this is,
for example, the ingress definition for the blog you are reading right now:
```nix
{
  lab.ingresses.blog = {
    host = "pim.kun.is";
    service = {
      name = "blog";
      portName = "web";
    };
  };
}
```
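I won't reproduce the abstraction itself here, but conceptually it expands to a
plain Kubernetes Ingress along these lines (a rough sketch of my own making;
the actual module, issuer name and TLS secret name may differ):
```nix
{
  apiVersion = "networking.k8s.io/v1";
  kind = "Ingress";
  metadata = {
    name = "blog";
    annotations."cert-manager.io/cluster-issuer" = "letsencrypt"; # assumed issuer name
  };
  spec = {
    tls = [ { hosts = [ "pim.kun.is" ]; secretName = "blog-tls"; } ]; # assumed secret name
    rules = [
      {
        host = "pim.kun.is";
        http.paths = [
          {
            path = "/";
            pathType = "Prefix";
            backend.service = { name = "blog"; port.name = "web"; };
          }
        ];
      }
    ];
  };
}
```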
### Tailscale
## Workloads
