WIP infra 2024 snapshot

parent 95b66d8c45, commit b040c8b87f
3 changed files with 245 additions and 0 deletions

src/_posts/infra-snapshot-2024/2025-01-12-infra-snapshot-2024.md (new file, +245 lines)

---
layout: post
title: Home Lab Infrastructure Snapshot 2024
date: 2025-01-12 20:49:00 Europe/Amsterdam
categories: infrastructure homelab
---

If you are reading this when this post was just published: fear not, you did not
just travel back in time to the glorious old days of '24. No, I was simply too
~~lazy~~ busy to write this post last year. Busy with what? Well, I basically
changed every part of my infrastructure over the past year, obviously. This is
all because I got infected by the NixOS virus!

This post is a follow-up to a similar post from 2023, where I described my whole
home lab as it was back then. In this post I will mainly focus on the things
that have changed since then.

# Hardware and Operating Systems

## Hardware

It seems I was rather frugal this year, as I did not buy any new servers or
replace existing ones.

However, as we will see in a bit, I am now running Linux on my gaming PC and
deploying it just like I would any one of my servers. Therefore, I will take a
moment to briefly go over the hardware of that machine:

- **Motherboard**: Gigabyte B550 GAMING X V2
- **CPU**: AMD Ryzen 5 5600
- **GPU**: AMD Radeon RX 6700 XT

Here is a picture of its guts:

![Inside of my gaming PC](gamepc.jpeg)

You probably noticed the random components inside it, which are fixed just
securely enough to not cause a fire. These are actually parts of a
[DIY PiKVM V2](https://docs.pikvm.org/v2/). PiKVM is a cool project to transform
your Raspberry Pi into a KVM (Keyboard, Video and Mouse). A KVM, in its most
basic form, allows you to see the video output and control the mouse and
keyboard of a system remotely.

In the bottom right part of the picture, connected to the ribbon cable, you can
see the part that allows the Raspberry Pi to power on/off the PC, reset the PC,
read power status and read disk activity, all via its GPIO! In the top right
part of the picture you can see the component that converts the PC's HDMI output
to CSI, which the Pi supports. Just behind that component, but out of view, the
Pi is secured.

PiKVM has some more neat features: it can emulate a USB storage device which I
can boot off of. This allows me, for example, to run a memory test or tweak BIOS
settings.

One other piece of hardware is worth mentioning. Similar to the PiKVM, I also
bought a
[Sipeed NanoKVM](https://wiki.sipeed.com/hardware/en/kvm/NanoKVM/introduction.html).
I opted for the Lite variant, which is only able to read HDMI and interface with
the host's USB. It can't control the power like the PiKVM that I built. For now,
I have attached it to one of my Kubernetes nodes.

 _Unboxing of the NanoKVM Lite,
|
||||
[courtesy of Sipeed](https://github.com/sipeed/sipeed_wiki/blob/main/docs/hardware/assets/NanoKVM/unbox/lite_ubox.png)._
|
||||
|
||||
## Operating Systems (or, the NixOS propaganda)

For people who know me in real life, this section will not be a surprise. In
late 2023 I started getting interested in NixOS, and in 2024 I worked hard to
install it on any piece of hardware I could get my hands on.

NixOS enables me to apply my Infrastructure-as-Code spirit to the OS and take it
to the extreme. I used to customize an existing OS (mainly Debian) with Ansible
to configure it to my liking. This works to some extent: the steps needed to get
to my desired configuration are codified! However, after using Ansible for some
time, I started seeing drawbacks.

Ansible likes to advertise itself as declarative (because it's YAML, which is
declarative by nature, right?). Sure, all steps _should_ be declarative and
idempotent (they can be run repeatedly without changes). But when you do
anything non-trivial, you start to introduce dependencies between different
tasks. Suddenly, your tasks cannot be run just on their own anymore! Taken
together, an Ansible role as a whole should indeed still be idempotent, but in
practice I often still have to think about the current state of my machines.

In contrast, NixOS is configured entirely declaratively. You don't have to tell
NixOS how to go from configuration A to configuration B; NixOS will figure that
out for you. Not convinced? This is how you get a fully functional Kubernetes
node on NixOS:

```nix
{
  # Enables the k3s Kubernetes distribution; by default this runs a
  # single-node server.
  config.services.k3s.enable = true;
}
```

### Colmena

Deploying changes on a single NixOS host works fine with the `nixos-rebuild`
tool. However, if you manage multiple NixOS servers, you quickly want a bit more
comfort. That's why I am happily using the
[Colmena](https://github.com/zhaofengli/colmena) deployment tool. Below is an
example deployment for my servers.

```shell
❯ colmena apply --experimental-flake-eval --on @server
[INFO ] Using flake: git+file:///home/pim/git/nixos-configs
[WARN ] Using direct flake evaluation (experimental)
[INFO ] Enumerating nodes...
[INFO ] Selected 4 out of 6 hosts.
✅ 2m All done!
(...) ✅ 37s Evaluated warwick, atlas, lewis, and jefke
atlas ✅ 0s Built "/nix/store/r0jpg2nrqdnk9gywxvzrh2n3lrwhzy56-nixos-system-atlas-24.11pre-git"
warwick ✅ 4s Built "/nix/store/kfbm5c2fqd2xv6lasvb2nhc8g815hl79-nixos-system-warwick-24.11pre-git" on target node
lewis ✅ 0s Built "/nix/store/h238ly237srjil0fdxzrj29ib6blcmlg-nixos-system-lewis-24.11pre-git"
jefke ✅ 0s Built "/nix/store/b7pnan3wmgk3y0193rka95i82sl33xpc-nixos-system-jefke-24.11pre-git"
atlas ✅ 10s Pushed system closure
jefke ✅ 5s Pushed system closure
lewis ✅ 9s Pushed system closure
warwick ✅ 53s Uploaded keys (pre-activation)
jefke ✅ 49s Uploaded keys (pre-activation)
lewis ✅ 47s Uploaded keys (pre-activation)
atlas ✅ 46s Uploaded keys (pre-activation)
jefke ✅ 9s Activation successful
atlas ✅ 18s Activation successful
warwick ✅ 4s Activation successful
lewis ✅ 16s Activation successful
```

Notice the use of `@server`, which only deploys hosts that are tagged with
`server`. I also deploy my laptop and gaming PC with this tool, so this is very
handy. Also notice that the host `warwick` is built on the target node. This is
because `warwick` is a Raspberry Pi 4 with an ARM architecture, so the system is
compiled on the remote machine itself.

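For illustration, here is a minimal sketch of what such a tagged node definition
could look like in a Colmena hive (the host names and addresses are
placeholders, not my actual configuration). The `deployment.tags` option is what
the `@server` selector matches on, and `deployment.buildOnTarget` is what makes
a node like `warwick` build its system locally:

```nix
{
  # Hypothetical node definitions inside a Colmena hive.
  warwick = {
    deployment = {
      targetHost = "warwick.lan"; # placeholder address
      tags = [ "server" ];
      buildOnTarget = true; # compile the system on the ARM board itself
    };
    # ... the usual NixOS configuration for this host ...
  };

  atlas = {
    deployment = {
      targetHost = "atlas.lan"; # placeholder address
      tags = [ "server" ];
    };
    # ... the usual NixOS configuration for this host ...
  };
}
```
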
## Virtualization

I had two main purposes for virtual machines: security and host network
isolation. I used to run Docker Swarm on my servers, which messes around with
your `iptables` rules. Virtual machines were therefore a way to separate that
from the main host.

However, I found that VMs are pretty difficult to manage. It's difficult to
define them in an infrastructure-as-code way. To this end, I used Terraform with
the pretty buggy
[terraform-provider-libvirt](https://github.com/dmacvicar/terraform-provider-libvirt)
provider, but this was not a smooth experience. I also found the VMs running out
of memory quite often. And while I could have invested time into something like
memory ballooning, I didn't really want to spend my time on that.

So now my servers are completely VM-free!

## Container Orchestration

Alongside the operating system, my container clustering setup changed the most
this year.

Before this year, I was using
[Docker Swarm](https://docs.docker.com/engine/swarm/) for container clustering.
The main benefit for me was its similarity to Docker Compose, which I was using
at the time. Unfortunately, Docker Swarm is not widely used and doesn't seem
well maintained either. Also, a feature I was really missing was the option for
"distributed" storage that syncs data between nodes to mitigate hardware
failure.

With Docker Swarm out of the picture, I needed to choose another solution.
Initially, I wanted to use [Hashicorp Nomad](https://www.nomadproject.io/).
Unfortunately, Nomad is no longer open source software, so it is out of the
question for me. Then, apart from some smaller projects, I really only had one
option left: **Kubernetes**!

Below I outline some of the components I use in my Kubernetes setup that make it
tick.

### k3s

I opted for the [k3s](https://k3s.io/) Kubernetes "distribution" because I
wanted to start simple, and as you saw in
[a previous section](#operating-systems-or-the-nixos-propaganda), it's super
simple to enable on NixOS. k3s has all the Kubernetes components needed to run a
single-node cluster out of the box.

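Growing beyond a single node is mostly a matter of a few more NixOS options. As
a rough sketch (the server address and token path below are placeholders), an
additional machine could join the existing cluster like this:

```nix
{
  # Hypothetical agent node joining an existing k3s cluster.
  services.k3s = {
    enable = true;
    role = "agent"; # join as a worker instead of starting a new server
    serverAddr = "https://10.0.0.10:6443"; # placeholder address of the first node
    tokenFile = "/run/secrets/k3s-token"; # placeholder path to the join token
  };
}
```
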
In the future, I might switch to NixOS' `services.kubernetes`
[options](https://search.nixos.org/options?channel=24.11&from=0&size=50&sort=relevance&type=packages&query=services.kubernetes),
as I hear these work great as well and give you more control over what runs.

### MetalLB

If you use the Kubernetes platform of a cloud provider, like
[Microsoft's AKS](https://learn.microsoft.com/en-us/azure/aks/),
[Google's GKE](https://cloud.google.com/kubernetes-engine) or
[Amazon's EKS](https://aws.amazon.com/eks/) (all really inspiring names,
probably great to convince your upper management), you get network load
balancers by default. These are great because they simplify exposing services
outside of the cluster. Unfortunately, bare-metal Kubernetes lacks a load
balancer, and this is where [MetalLB](https://metallb.io/) comes in.

MetalLB works by assigning locally-routable IP addresses to Kubernetes services.
It has two methods to do this: either via ARP or via BGP. I opted for ARP to not
overcomplicate my network setup. To use ARP with MetalLB, you simply have to
reserve some IP address space for MetalLB to play with. It will dynamically
assign these IPs to Kubernetes services and advertise them via ARP.

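As an illustration, the MetalLB side of this boils down to an `IPAddressPool`
plus an `L2Advertisement` resource. Expressed as Nix attribute sets (which would
still need to be rendered to YAML and applied to the cluster; the address range
is a placeholder), they would look roughly like this:

```nix
{
  # Hypothetical MetalLB resources as Nix attribute sets.
  addressPool = {
    apiVersion = "metallb.io/v1beta1";
    kind = "IPAddressPool";
    metadata = { name = "default-pool"; namespace = "metallb-system"; };
    spec.addresses = [ "192.168.30.240-192.168.30.250" ]; # placeholder reserved range
  };

  l2Advertisement = {
    apiVersion = "metallb.io/v1beta1";
    kind = "L2Advertisement";
    metadata = { name = "default-advertisement"; namespace = "metallb-system"; };
    spec.ipAddressPools = [ "default-pool" ]; # advertise the pool above via ARP
  };
}
```
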
### Longhorn

Similar to MetalLB, [Longhorn](https://longhorn.io/) fills a gap that Kubernetes
has on bare-metal deployments. Without Longhorn, any Kubernetes storage volume
is tied to a particular host. This is very problematic when you want to move
containers around, or if a physical server dies.

Longhorn fixes this by replicating block storage across multiple Kubernetes
nodes. Want to take down a node? No problem, the workloads can move to another
node with the same data. Longhorn can also create periodic backups, which I
personally store off-site at [BorgBase](https://www.borgbase.com/).

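To make the replication knob concrete, here is a hedged sketch of a Longhorn
`StorageClass`, again written as a Nix attribute set to be rendered to YAML (the
replica count and timeout are examples, not necessarily what I run):

```nix
{
  # Hypothetical StorageClass that keeps three replicas of every volume.
  apiVersion = "storage.k8s.io/v1";
  kind = "StorageClass";
  metadata.name = "longhorn-replicated";
  provisioner = "driver.longhorn.io";
  parameters = {
    numberOfReplicas = "3"; # copies spread across different nodes
    staleReplicaTimeout = "30"; # minutes before a failed replica is cleaned up
  };
}
```
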
### Cert-manager

Another amazing part of my Kubernetes setup is
[Cert-manager](https://cert-manager.io/). Cert-manager automatically manages TLS
certificates that are needed for your Kubernetes deployments.

Usage of Cert-manager is super simple. First you have to set up a certificate
issuer, like Let's Encrypt. Then you can ask Cert-manager to automatically
provision certificates. I use Kubernetes ingresses, and I can simply give an
ingress the `cert-manager.io/cluster-issuer` annotation. Cert-manager will then
use the ingress' `host` field to request the certificate.

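To give an idea of what that looks like without my abstraction, below is a rough
sketch of a plain annotated ingress, once more as a Nix attribute set to be
rendered to YAML (the issuer name and secret name are placeholders):

```nix
{
  # Hypothetical Kubernetes Ingress carrying the Cert-manager annotation.
  apiVersion = "networking.k8s.io/v1";
  kind = "Ingress";
  metadata = {
    name = "blog";
    annotations."cert-manager.io/cluster-issuer" = "letsencrypt"; # placeholder issuer name
  };
  spec = {
    rules = [{
      host = "pim.kun.is";
      http.paths = [{
        path = "/";
        pathType = "Prefix";
        backend.service = { name = "blog"; port.name = "web"; };
      }];
    }];
    tls = [{
      hosts = [ "pim.kun.is" ];
      secretName = "blog-tls"; # placeholder; Cert-manager stores the certificate here
    }];
  };
}
```
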
I abstracted this a bit, and this is for example the ingress definition for the
blog you are reading right now:

```nix
{
  lab.ingresses.blog = {
    host = "pim.kun.is";

    service = {
      name = "blog";
      portName = "web";
    };
  };
}
```

### Tailscale

## Workloads

src/_posts/infra-snapshot-2024/gamepc.jpeg (new binary file, 709 KiB)
src/_posts/infra-snapshot-2024/nanokvm.png (new binary file, 100 KiB)