---
layout: post
title: Home Lab Infrastructure Snapshot 2024
date: 2025-01-12 20:49:00 Europe/Amsterdam
categories: infrastructure homelab
---
If you are reading this when this post was just published: fear not, you did not
just travel back in time to the glorious old days of '24. No, I was simply too
~~lazy~~ busy to write this post last year. Busy with what? Well, I basically
changed every part of my infrastructure over the past year, obviously. This is
all because I got infected by the NixOS virus!
This post is a follow-up to a similar post from 2023 where I described my whole
home lab as it was back then. In this post I will mainly focus on the things
that changed since then.
# Hardware and Operating Systems
## Hardware
It seems I was rather frugal this year, as I did not buy any new servers or
replace existing ones.
However, as we will see in a bit, I am now running Linux on my gaming PC and
deploying it just like I would any one of my servers. Therefore, I will take a
moment to go over the hardware of that machine:
- **Motherboard**: Gigabyte B550 GAMING X V2
- **CPU**: AMD Ryzen 5 5600
- **GPU**: AMD Radeon RX 6700 XT
Here is a picture of its guts:
![The inside of my gaming PC showing its components.](gamepc.jpeg)
You probably noticed the random components inside it, which are fixed just
securely enough to not cause a fire. These are actually parts of a
[DIY PiKVM V2](https://docs.pikvm.org/v2/). PiKVM is a cool project to transform
your Raspberry Pi into a KVM (Keyboard, Video and Mouse). A KVM, in its most
basic form, allows you to see the video output and control the mouse and
keyboard of a system remotely.
In the bottom right part of the picture, connected to the ribbon cable, you can
see the part that allows the Raspberry Pi to power on/off the PC, reset the PC,
read power status and read disk activity, all via its GPIO! In the top right
part of the picture you can see the component that converts the PC's HDMI output
to CSI, which the Pi supports. Just behind that component, but out of view, the
Pi is secured.
PiKVM has some more neat features: it can emulate a USB storage device which
the PC can boot from. This allows me, for example, to run a memory test or
tweak BIOS settings.
One other piece of hardware is worth mentioning. Similar to the PiKVM, I also
bought a
[Sipeed NanoKVM](https://wiki.sipeed.com/hardware/en/kvm/NanoKVM/introduction.html).
I opted for the Lite variant, which is only able to read HDMI and interface with
the host's USB; it can't control the power like the PiKVM that I built. For now,
I have attached it to one of my Kubernetes nodes.
![Unboxing of the NanoKVM lite.](nanokvm.png) _Unboxing of the NanoKVM Lite,
[courtesy of Sipeed](https://github.com/sipeed/sipeed_wiki/blob/main/docs/hardware/assets/NanoKVM/unbox/lite_ubox.png)._
## Operating Systems (or, the NixOS propaganda)
For people that know me in real life, this section will not be a surprise. In
late 2023 I started getting interested in NixOS and in 2024 I worked hard to
install it on any piece of hardware I could get my hands on.
NixOS enables me to apply my Infrastructure-as-Code spirit to the OS and take it
to the extreme. I used to customize an existing OS (mainly Debian) with Ansible
to configure it to my liking. This works to some extent: the steps needed to get
to my desired configuration are codified! However, after using Ansible for some
time I started seeing drawbacks.
Ansible likes to advertise itself as declarative (because it's YAML, which is
declarative by nature, right?). Sure, all steps _should_ be declarative and
idempotent (they can be run repeatedly without changing the result). But when
you do anything non-trivial, you start to introduce dependencies between tasks,
and suddenly those tasks can no longer be run on their own. Taken together, an
Ansible role should indeed still be idempotent, but in practice I often still
have to reason about the current state of my machines.
In contrast, NixOS is configured entirely declaratively. You don't have to tell
NixOS how to go from configuration A to configuration B; that is something NixOS
will figure out for you. Not convinced? This is how you get a fully functional
Kubernetes node on NixOS:
```nix
{
  config.services.k3s.enable = true;
}
```
### Colmena
Deploying changes on a single NixOS host works fine with the `nixos-rebuild`
tool. However, if you manage multiple NixOS servers, a bit more comfort is
quickly desired. That's why I am happily using the
[Colmena](https://github.com/zhaofengli/colmena) deployment tool. Below is an
example deployment for my servers.
```shell
colmena apply --experimental-flake-eval --on @server
[INFO ] Using flake: git+file:///home/pim/git/nixos-configs
[WARN ] Using direct flake evaluation (experimental)
[INFO ] Enumerating nodes...
[INFO ] Selected 4 out of 6 hosts.
✅ 2m All done!
(...) ✅ 37s Evaluated warwick, atlas, lewis, and jefke
atlas ✅ 0s Built "/nix/store/r0jpg2nrqdnk9gywxvzrh2n3lrwhzy56-nixos-system-atlas-24.11pre-git"
warwick ✅ 4s Built "/nix/store/kfbm5c2fqd2xv6lasvb2nhc8g815hl79-nixos-system-warwick-24.11pre-git" on target node
lewis ✅ 0s Built "/nix/store/h238ly237srjil0fdxzrj29ib6blcmlg-nixos-system-lewis-24.11pre-git"
jefke ✅ 0s Built "/nix/store/b7pnan3wmgk3y0193rka95i82sl33xpc-nixos-system-jefke-24.11pre-git"
atlas ✅ 10s Pushed system closure
jefke ✅ 5s Pushed system closure
lewis ✅ 9s Pushed system closure
warwick ✅ 53s Uploaded keys (pre-activation)
jefke ✅ 49s Uploaded keys (pre-activation)
lewis ✅ 47s Uploaded keys (pre-activation)
atlas ✅ 46s Uploaded keys (pre-activation)
jefke ✅ 9s Activation successful
atlas ✅ 18s Activation successful
warwick ✅ 4s Activation successful
lewis ✅ 16s Activation successful
```
Notice the use of `@server`, which only deploys to hosts that are tagged with
`server`. I also deploy my laptop and gaming PC with this tool, so this is very
handy. Also notice that the host `warwick` is built on the target node. This is
because `warwick` is a Raspberry Pi 4 with an ARM architecture, so the system is
compiled on the remote machine itself.
## Virtualization
I had two main purposes for virtual machines: security and host network
isolation. I used to run Docker Swarm on my servers, which messes around with
your `iptables` rules. Virtual machines were therefore a way to separate that
from the main host.
However, I found that VMs are pretty difficult to manage. It's difficult to
define them in an infrastructure-as-code way. To this end, I used Terraform with
the pretty buggy
[terraform-provider-libvirt](https://github.com/dmacvicar/terraform-provider-libvirt)
provider, but this was not a smooth experience. I also found them running out of
memory quite often. And while I could have invested time into something like
memory ballooning, I didn't really want to spend my time with that.
So now my servers are completely VM-free!
## Container Orchestration
Alongside the operating system, my container clustering setup changed the most
this year.
Before this year, I was using
[Docker Swarm](https://docs.docker.com/engine/swarm/) for container clustering.
The main benefit for me was its similarity to Docker Compose, which I was using
at the time. Unfortunately, Docker Swarm is not widely used and doesn't seem
well maintained either. It also lacks a feature I really wanted: "distributed"
storage that syncs data between nodes to mitigate hardware failure.
With Docker Swarm out of the picture, I needed to choose another solution.
Initially, I wanted to use [Hashicorp Nomad](https://www.nomadproject.io/).
Unfortunately, Nomad is no longer open source software, so this is out of the
question for me. Then, apart from some smaller projects, I really only had one
option left: **Kubernetes**!
Below I outline some of the components I use in my Kubernetes setup that make it
tick.
### k3s
I opted for the [k3s](https://k3s.io/) Kubernetes "distribution", because I
wanted to start simple and as you saw in
[a previous section](#operating-systems-or-the-nixos-propaganda), it's super
simple to enable on NixOS. k3s has all the Kubernetes components out-of-the-box
to run a single-node cluster.
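k3s can also form a multi-node cluster, where agents join an existing server
using a shared token. A sketch of what the relevant NixOS options could look
like (the disabled component, token path and port handling below are my own
assumptions, not necessarily my exact setup):
```nix
{
  # Sketch of a control-plane node; a worker would instead set role = "agent"
  # together with serverAddr and tokenFile to join it. Values are placeholders.
  services.k3s = {
    enable = true;
    role = "server";
    # Assumption: disable k3s's bundled service load balancer when using MetalLB.
    extraFlags = "--disable=servicelb";
  };
  networking.firewall.allowedTCPPorts = [ 6443 ]; # Kubernetes API server
}
```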
In the future, I might swap to using NixOS' `services.kubernetes`
[options](https://search.nixos.org/options?channel=24.11&from=0&size=50&sort=relevance&type=packages&query=services.kubernetes),
as I hear these work great as well and give you more control over what runs.
### MetalLB
If you use the Kubernetes platform of a cloud provider, like
[Microsoft's AKS](https://learn.microsoft.com/en-us/azure/aks/),
[Google's GKE](https://cloud.google.com/kubernetes-engine) or
[Amazon's EKS](https://aws.amazon.com/eks/) (all really inspiring names,
probably great to convince your upper management), they provide network load
balancers by default. These are great, because they simplify exposing services
outside of the cluster. Unfortunately, bare-metal Kubernetes lacks a load
balancer and this is where [MetalLB](https://metallb.io/) comes in.
MetalLB works by assigning locally-routable IP addresses to Kubernetes services.
It has two methods to do this: either via ARP or via BGP. I opted for ARP to not
overcomplicate my network setup. To use ARP with MetalLB, you simply have to
reserve some IP address space for MetalLB to play with. It will dynamically
assign these IPs to Kubernetes services and advertise these via ARP.
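As an illustration (not my literal config), the two MetalLB resources needed for
this L2/ARP mode are an `IPAddressPool` and an `L2Advertisement`. Here they are
written as a Nix attribute set that you could serialise to YAML or JSON and
apply to the cluster; the address range is made up.
```nix
{
  # Reserve an address range for MetalLB to hand out (example range).
  ipAddressPool = {
    apiVersion = "metallb.io/v1beta1";
    kind = "IPAddressPool";
    metadata = { name = "main"; namespace = "metallb-system"; };
    spec.addresses = [ "192.168.30.200-192.168.30.250" ];
  };
  # Announce those addresses on the local network via ARP.
  l2Advertisement = {
    apiVersion = "metallb.io/v1beta1";
    kind = "L2Advertisement";
    metadata = { name = "main"; namespace = "metallb-system"; };
    spec.ipAddressPools = [ "main" ];
  };
}
```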
### Longhorn
Similar to MetalLB, [Longhorn](https://longhorn.io/) fills a gap that Kubernetes
has on bare-metal deployments. Without Longhorn, any Kubernetes storage volume
is tied to a particular host. This is very problematic when you want to move
containers around, or if a physical server dies.
Longhorn fixes this by replicating block storage across multiple Kubernetes
nodes. Want to take down a node? No problem, the workloads can move to another
node with the same data. Longhorn can also create periodic backups, which I
personally back up off-site to [BorgBase](https://www.borgbase.com/).
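To give an idea of how workloads get this replicated storage (a sketch; the
claim name and size are made up), a pod simply requests a volume from Longhorn's
storage class and Longhorn handles replication behind the scenes:
```nix
{
  # PersistentVolumeClaim backed by Longhorn, again expressed as a Nix
  # attribute set for consistency with the rest of this post.
  apiVersion = "v1";
  kind = "PersistentVolumeClaim";
  metadata.name = "blog-data"; # placeholder name
  spec = {
    storageClassName = "longhorn"; # the storage class Longhorn installs
    accessModes = [ "ReadWriteOnce" ];
    resources.requests.storage = "1Gi"; # placeholder size
  };
}
```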
### Cert-manager
Another amazing part of my Kubernetes setup is
[Cert-manager](https://cert-manager.io/). Cert-manager automatically manages TLS
certificates that are needed for your Kubernetes deployments.
Using Cert-manager is super simple. First, you set up a certificate issuer,
like Let's Encrypt. Then you can ask Cert-manager to automatically provision
certificates. I use Kubernetes Ingresses, so I simply add the
`cert-manager.io/cluster-issuer` annotation to an Ingress, and Cert-manager uses
its `host` field to request the certificate.
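As an illustration of that first step (the issuer name, e-mail address and
ingress class below are placeholders, not necessarily mine), a Let's Encrypt
`ClusterIssuer` for the HTTP-01 challenge looks roughly like this, written here
as a Nix attribute set:
```nix
{
  apiVersion = "cert-manager.io/v1";
  kind = "ClusterIssuer";
  metadata.name = "letsencrypt";
  spec.acme = {
    server = "https://acme-v02.api.letsencrypt.org/directory";
    email = "admin@example.org";                       # placeholder address
    privateKeySecretRef.name = "letsencrypt-account";  # ACME account key secret
    solvers = [ { http01.ingress.class = "nginx"; } ]; # placeholder ingress class
  };
}
```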
I abstracted the ingress side of this a bit in my NixOS configuration; this is,
for example, the ingress definition for the blog you are reading right now:
```nix
{
  lab.ingresses.blog = {
    host = "pim.kun.is";
    service = {
      name = "blog";
      portName = "web";
    };
  };
}
```
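I won't reproduce the abstraction itself here, but conceptually it expands to a
plain Kubernetes Ingress along these lines (a rough sketch of my own making;
the actual module, issuer name and TLS secret name may differ):
```nix
{
  apiVersion = "networking.k8s.io/v1";
  kind = "Ingress";
  metadata = {
    name = "blog";
    annotations."cert-manager.io/cluster-issuer" = "letsencrypt"; # assumed issuer name
  };
  spec = {
    tls = [ { hosts = [ "pim.kun.is" ]; secretName = "blog-tls"; } ]; # assumed secret name
    rules = [
      {
        host = "pim.kun.is";
        http.paths = [
          {
            path = "/";
            pathType = "Prefix";
            backend.service = { name = "blog"; port.name = "web"; };
          }
        ];
      }
    ];
  };
}
```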
### Tailscale
## Workloads
