Remove Longhorn #11

Open
opened 2025-05-22 14:52:52 +00:00 by pim · 2 comments
Owner

Longhorn works great most of the time, but it fails far too quickly when I make changes to my cluster topology (e.g. restarting some or all nodes):

  1. Flooding the network with volume syncs, so heavily that etcd requests are delayed and hosts start getting marked as "down". This triggers even more syncs, and eventually the whole cluster breaks down.
  2. Corruption of volumes after a host reboot.
  3. Volumes remaining stuck attached to already-deleted pods.

Additionally, Longhorn has been the source of most of my maintenance burden, causing many obscure problems requiring manual intervention.

My path forward is to simply use local storage on the individual hosts. I will then need to make sure the pods' load is spread over the different hosts. So in short:

  1. Use `hostPath` to mount the directories inside the k8s pods
  2. Use `nodeSelector` to pin each pod to the particular host that holds its data (see the sketch below)
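A minimal sketch of what that could look like, assuming a hypothetical `freshrss` deployment whose data lives at `/mnt/data/freshrss` on a node named `atlas` (all names, paths and the image here are placeholders, not the actual setup):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: freshrss                        # placeholder deployment name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: freshrss
  template:
    metadata:
      labels:
        app: freshrss
    spec:
      # Pin the pod to the host that actually holds the data.
      nodeSelector:
        kubernetes.io/hostname: atlas   # placeholder node name
      containers:
        - name: freshrss
          image: freshrss/freshrss      # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/www/FreshRSS/data
      volumes:
        # Mount the local directory straight from the host.
        - name: data
          hostPath:
            path: /mnt/data/freshrss    # placeholder host directory
            type: Directory
```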
Author
Owner

Our current backup strategy heavily relies on Longhorn as well, so we have to fix that too. The current plan is:

  1. Create a timer on the host for each datastore (freshrss, forgejo, etc.) that fires at a random-ish time during the night.
  2. The timer knows which Kubernetes deployments have the data mounted; it scales these deployments down to 0 first. (If I feel creative, I might even be able to infer this programmatically.)
  3. Then push the data to a Borgbase repository specific to that datastore.
  4. Scale the deployments back up to their normal replica counts. (A rough sketch follows below.)
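A rough sketch of what the timer could run for one datastore, assuming `kubectl` and `borg` are available on the host; the datastore name, data path, deployment list, and Borgbase repo URL are all placeholders, not the final implementation:

```sh
#!/usr/bin/env sh
# Sketch: nightly backup of one datastore (here: freshrss).
set -eu

DATASTORE=freshrss
DATA_DIR=/mnt/data/$DATASTORE                          # placeholder host path
DEPLOYMENTS="freshrss"                                 # deployments that mount this data
BORG_REPO="ssh://user@user.repo.borgbase.com/./repo"   # placeholder Borgbase repo

# 1. Scale the affected deployments down so the data is quiescent.
for d in $DEPLOYMENTS; do
  kubectl scale deployment "$d" --replicas=0
  kubectl rollout status deployment "$d" --timeout=120s
done

# 2. Push the data to the datastore-specific Borg repo.
borg create --stats "$BORG_REPO::$DATASTORE-{now:%Y-%m-%d}" "$DATA_DIR"

# 3. Scale the deployments back up (replica count of 1 assumed here).
for d in $DEPLOYMENTS; do
  kubectl scale deployment "$d" --replicas=1
done
```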
Author
Owner

Almost done; just have to clean up scsi and nfs.
