diff --git a/src/_posts/infra-snapshot-2024/2025-01-12-infra-snapshot-2024.md b/src/_posts/infra-snapshot-2024/2025-01-12-infra-snapshot-2024.md
new file mode 100644
index 0000000..02977d5
--- /dev/null
+++ b/src/_posts/infra-snapshot-2024/2025-01-12-infra-snapshot-2024.md
@@ -0,0 +1,245 @@
+---
+layout: post
+title: Home Lab Infrastructure Snapshot 2024
+date: 2025-01-12 20:49:00 Europe/Amsterdam
+categories: infrastructure homelab
+---
+
+If you are reading this when this post was just published: fear not, you did
+not just travel back in time to the glorious old days of '24. No, I was simply
+too ~~lazy~~ busy to write this post last year. Busy with what? Well, I
+basically changed every part of my infrastructure over the past year. This is
+all because I got infected by the NixOS virus!
+
+This post is a follow-up to a similar post from 2023 in which I described my
+whole home lab as it was back then. In this post I will mainly focus on the
+things that have changed since then.
+
+# Hardware and Operating Systems
+
+## Hardware
+
+It seems I was rather frugal this year, as I did not buy any new servers or
+replace existing ones.
+
+However, as we will see in a bit, I am now running Linux on my gaming PC and
+deploying it just like I would any one of my servers. Therefore, I will take
+a moment to go over the hardware of that machine:
+
+- **Motherboard**: Gigabyte B550 GAMING X V2
+- **CPU**: AMD Ryzen 5 5600
+- **GPU**: AMD Radeon RX 6700 XT
+
+Here is a picture of its guts:
+
+![The inside of my gaming PC showing its components.](gamepc.jpeg)
+
+You probably noticed the random components inside it, which are fixed just
+securely enough to not cause a fire. These are actually parts of a
+[DIY PiKVM V2](https://docs.pikvm.org/v2/). PiKVM is a cool project that
+transforms a Raspberry Pi into a KVM (Keyboard, Video and Mouse). A KVM, in
+its most basic form, allows you to see the video output and control the mouse
+and keyboard of a system remotely.
+
+In the bottom right part of the picture, connected to the ribbon cable, you
+can see the part that allows the Raspberry Pi to power the PC on and off,
+reset it, read its power status and read disk activity, all via its GPIO! In
+the top right part of the picture you can see the component that converts the
+PC's HDMI output to CSI, which the Pi supports. Just behind that component,
+but out of view, the Pi is secured.
+
+PiKVM has some more neat features: it can emulate a USB storage device which
+I can boot off of. This allows me, for example, to run a memory test or tweak
+BIOS settings.
+
+One other piece of hardware is worth mentioning. Similar to the PiKVM, I also
+bought a
+[Sipeed NanoKVM](https://wiki.sipeed.com/hardware/en/kvm/NanoKVM/introduction.html).
+I opted for the Lite variant, which is only able to read HDMI and interface
+with the host's USB; it can't control the power like the PiKVM that I built.
+For now I have attached it to one of my Kubernetes nodes.
+
+![Unboxing of the NanoKVM Lite.](nanokvm.png) _Unboxing of the NanoKVM Lite,
+[courtesy of Sipeed](https://github.com/sipeed/sipeed_wiki/blob/main/docs/hardware/assets/NanoKVM/unbox/lite_ubox.png)._
+
+## Operating Systems (or, the NixOS propaganda)
+
+For people who know me in real life, this section will not be a surprise. In
+late 2023 I started getting interested in NixOS, and in 2024 I worked hard to
+install it on any piece of hardware I could get my hands on.
+
+NixOS enables me to apply my Infrastructure-as-Code spirit to the OS and take
+it to the extreme. I used to customize an existing OS (mainly Debian) with
+Ansible to configure it to my liking. This works to some extent: the steps
+needed to get to my desired configuration are codified! However, after using
+Ansible for some time I started seeing drawbacks.
+
+Ansible likes to advertise itself as declarative (because it's YAML, which is
+declarative by nature, right?). Sure, each step _should_ be declarative and
+idempotent (it can be run repeatedly without changing the result). But when
+you do anything non-trivial, you start to introduce dependencies between
+tasks, and suddenly your tasks cannot be run on their own anymore! An Ansible
+role as a whole should indeed still be idempotent, but in practice I often
+had to reason about the current state of my machines anyway.
+
+In contrast, NixOS is configured entirely declaratively. You don't have to
+tell NixOS how to go from configuration A to configuration B; that is
+something NixOS figures out for you. Not convinced? This is how you get a
+fully functional Kubernetes node on NixOS:
+
+```nix
+{
+  config.services.k3s.enable = true;
+}
+```
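+
+Growing beyond a single machine is hardly more work. As a minimal sketch
+(assuming the cluster's server runs on `atlas` and that the join token lives
+at a made-up secrets path), a second machine can join as an agent like so:
+
+```nix
+{
+  # Hypothetical example: this machine joins an existing k3s server.
+  config.services.k3s = {
+    enable = true;
+    role = "agent";                        # don't run the control plane here
+    serverAddr = "https://atlas:6443";     # assumed server host
+    tokenFile = "/run/secrets/k3s-token";  # assumed secret location
+  };
+}
+```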
+
+### Colmena
+
+Deploying changes to a single NixOS host works fine with the `nixos-rebuild`
+tool. However, if you manage multiple NixOS servers, a bit more comfort is
+quickly desired. That's why I am happily using the
+[Colmena](https://github.com/zhaofengli/colmena) deployment tool. Below is an
+example deployment to my servers.
+
+```shell
+❯ colmena apply --experimental-flake-eval --on @server
+[INFO ] Using flake: git+file:///home/pim/git/nixos-configs
+[WARN ] Using direct flake evaluation (experimental)
+[INFO ] Enumerating nodes...
+[INFO ] Selected 4 out of 6 hosts.
+  ✅ 2m All done!
+  (...) ✅ 37s Evaluated warwick, atlas, lewis, and jefke
+  atlas ✅ 0s Built "/nix/store/r0jpg2nrqdnk9gywxvzrh2n3lrwhzy56-nixos-system-atlas-24.11pre-git"
+warwick ✅ 4s Built "/nix/store/kfbm5c2fqd2xv6lasvb2nhc8g815hl79-nixos-system-warwick-24.11pre-git" on target node
+  lewis ✅ 0s Built "/nix/store/h238ly237srjil0fdxzrj29ib6blcmlg-nixos-system-lewis-24.11pre-git"
+  jefke ✅ 0s Built "/nix/store/b7pnan3wmgk3y0193rka95i82sl33xpc-nixos-system-jefke-24.11pre-git"
+  atlas ✅ 10s Pushed system closure
+  jefke ✅ 5s Pushed system closure
+  lewis ✅ 9s Pushed system closure
+warwick ✅ 53s Uploaded keys (pre-activation)
+  jefke ✅ 49s Uploaded keys (pre-activation)
+  lewis ✅ 47s Uploaded keys (pre-activation)
+  atlas ✅ 46s Uploaded keys (pre-activation)
+  jefke ✅ 9s Activation successful
+  atlas ✅ 18s Activation successful
+warwick ✅ 4s Activation successful
+  lewis ✅ 16s Activation successful
+```
+
+Notice the use of `@server`, which only deploys to hosts that are tagged with
+`server`. Since I also deploy my laptop and gaming PC with this tool, this is
+very handy. Also notice that the host `warwick` is built on the target node.
+This is because `warwick` is a Raspberry Pi 4 with an ARM architecture, so
+the system is compiled on the remote itself.
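+
+For reference, here is a minimal sketch of what such a tagged hive can look
+like inside a flake. The module paths are illustrative and not my literal
+configuration; the `deployment` options, however, are regular Colmena options:
+
+```nix
+{
+  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.11";
+
+  outputs = { nixpkgs, ... }: {
+    colmena = {
+      meta.nixpkgs = import nixpkgs { system = "x86_64-linux"; };
+
+      atlas = {
+        imports = [ ./hosts/atlas ];       # hypothetical module path
+        deployment.targetHost = "atlas";
+        deployment.tags = [ "server" ];    # what `--on @server` matches
+      };
+
+      warwick = {
+        imports = [ ./hosts/warwick ];
+        deployment.targetHost = "warwick";
+        deployment.tags = [ "server" ];
+        deployment.buildOnTarget = true;   # let the Pi compile its own system
+      };
+    };
+  };
+}
+```
+
+With tags like these in place, `colmena apply --on @server` leaves the laptop
+and gaming PC untouched.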
+
+## Virtualization
+
+I had two main purposes for virtual machines: security and host network
+isolation. I used to run Docker Swarm on my servers, which messes around with
+your `iptables` rules. Virtual machines were therefore a way to separate that
+from the main host.
+
+However, I found that VMs are pretty difficult to manage, as it is hard to
+define them in an infrastructure-as-code way. To this end, I used Terraform
+with the pretty buggy
+[terraform-provider-libvirt](https://github.com/dmacvicar/terraform-provider-libvirt)
+provider, but this was not a smooth experience. I also found the VMs running
+out of memory quite often, and while I could have invested time into
+something like memory ballooning, I didn't really want to spend my time on
+that.
+
+So now my servers are completely VM-free!
+
+## Container Orchestration
+
+Alongside the operating system, my container clustering setup changed the
+most this year.
+
+Before this year, I was using
+[Docker Swarm](https://docs.docker.com/engine/swarm/) for container
+clustering. The main benefit for me was its similarity to Docker Compose,
+which I was using at the time. Unfortunately, Docker Swarm is not widely used
+and doesn't seem well maintained either. Also, a feature I was really missing
+was the option for "distributed" storage that syncs data between nodes to
+mitigate hardware failure.
+
+With Docker Swarm out of the running, I needed to choose another solution.
+Initially, I wanted to use [HashiCorp Nomad](https://www.nomadproject.io/).
+Unfortunately, Nomad is no longer open source software, so it is out of the
+question for me. Then, apart from some smaller projects, I really only had
+one option left: **Kubernetes**!
+
+Below I outline some of the components that make my Kubernetes setup tick.
+
+### k3s
+
+I opted for the [k3s](https://k3s.io/) Kubernetes "distribution" because I
+wanted to start simple and, as you saw in
+[a previous section](#operating-systems-or-the-nixos-propaganda), it's super
+simple to enable on NixOS. k3s ships all the Kubernetes components needed to
+run a single-node cluster out of the box.
+
+In the future, I might swap to NixOS' `services.kubernetes`
+[options](https://search.nixos.org/options?channel=24.11&from=0&size=50&sort=relevance&type=packages&query=services.kubernetes),
+as I hear these work great as well and give you more control over what runs.
+
+### MetalLB
+
+If you use the Kubernetes platform of a cloud provider, like
+[Microsoft's AKS](https://learn.microsoft.com/en-us/azure/aks/),
+[Google's GKE](https://cloud.google.com/kubernetes-engine) or
+[Amazon's EKS](https://aws.amazon.com/eks/) (all really inspiring names,
+probably great to convince your upper management), they provide network load
+balancers by default. These are great, because they simplify exposing
+services outside of the cluster. Unfortunately, bare-metal Kubernetes lacks a
+load balancer, and this is where [MetalLB](https://metallb.io/) comes in.
+
+MetalLB works by assigning locally-routable IP addresses to Kubernetes
+services. It has two methods to do this: via ARP or via BGP. I opted for ARP
+to not overcomplicate my network setup. To use ARP with MetalLB, you simply
+reserve some IP address space for MetalLB to play with. It will dynamically
+assign these IPs to Kubernetes services and advertise them via ARP.
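+
+Concretely, the ARP (or "L2") mode takes just two resources: an
+`IPAddressPool` holding the reserved range, and an `L2Advertisement` that
+announces it. Here is a sketch of the pair, written as Nix attribute sets to
+match the rest of this post (names and address range are made up; MetalLB
+itself consumes these as YAML manifests):
+
+```nix
+{
+  # A pool of local IPs that MetalLB may hand out to services.
+  ipAddressPool = {
+    apiVersion = "metallb.io/v1beta1";
+    kind = "IPAddressPool";
+    metadata = {
+      name = "default-pool";
+      namespace = "metallb-system";
+    };
+    spec.addresses = [ "192.168.1.240-192.168.1.250" ];
+  };
+
+  # Announce the pool's addresses on the local network via ARP.
+  l2Advertisement = {
+    apiVersion = "metallb.io/v1beta1";
+    kind = "L2Advertisement";
+    metadata = {
+      name = "default";
+      namespace = "metallb-system";
+    };
+    spec.ipAddressPools = [ "default-pool" ];
+  };
+}
+```
+
+MetalLB then handles the rest by itself: picking a free IP from the pool for
+each `LoadBalancer` service and answering ARP queries for it.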
+
+### Longhorn
+
+Similar to MetalLB, [Longhorn](https://longhorn.io/) fills a gap that
+Kubernetes has on bare-metal deployments. Without Longhorn, any Kubernetes
+storage volume is tied to a particular host. This is very problematic when
+you want to move containers around, or when a physical server dies.
+
+Longhorn fixes this by replicating block storage across multiple Kubernetes
+nodes. Want to take down a node? No problem, the workloads can move to
+another node with the same data. Longhorn can also create periodic backups,
+which I back up off-site to [BorgBase](https://www.borgbase.com/).
+
+### Cert-manager
+
+Another amazing part of my Kubernetes setup is
+[Cert-manager](https://cert-manager.io/). Cert-manager automatically manages
+the TLS certificates that my Kubernetes deployments need.
+
+Using Cert-manager is super simple. First, you set up a certificate issuer,
+like Let's Encrypt. Then you can ask Cert-manager to automatically provision
+certificates. I use Kubernetes ingresses, where it is enough to add the
+`cert-manager.io/cluster-issuer` annotation; Cert-manager then uses the
+ingress's `host` field to request the certificate.
+
+I abstracted this a bit, and this is, for example, the ingress definition for
+the blog you are reading right now:
+
+```nix
+{
+  lab.ingresses.blog = {
+    host = "pim.kun.is";
+
+    service = {
+      name = "blog";
+      portName = "web";
+    };
+  };
+}
+```
+
+### Tailscale
+
+## Workloads
diff --git a/src/_posts/infra-snapshot-2024/gamepc.jpeg b/src/_posts/infra-snapshot-2024/gamepc.jpeg
new file mode 100644
index 0000000..d5e7824
Binary files /dev/null and b/src/_posts/infra-snapshot-2024/gamepc.jpeg differ
diff --git a/src/_posts/infra-snapshot-2024/nanokvm.png b/src/_posts/infra-snapshot-2024/nanokvm.png
new file mode 100644
index 0000000..82499e2
Binary files /dev/null and b/src/_posts/infra-snapshot-2024/nanokvm.png differ