Remove Longhorn #11

Open
opened 2025-05-22 14:52:52 +00:00 by pim · 2 comments
Owner

Longhorn works great most of the time, but it fails far too quickly when I make changes to my cluster topology (e.g. restarting some or all nodes):

  1. Flooding the network with volume syncs, so heavily that etcd requests are delayed and hosts start getting marked as "down". This triggers even more syncs, and eventually the whole cluster breaks down.
  2. Corruption of volumes after a host reboot.
  3. Volumes remaining stuck attached to already-deleted pods.

Additionally, Longhorn has been the source of most of my maintenance burden, causing many obscure problems requiring manual intervention.

My path forward is to simply use local storage on the individual hosts. I will then need to make sure the pods' load is spread over the different hosts. So in short:

  1. Use `hostPath` to mount the directories inside the k8s pods
  2. Use `nodeSelector` to pin each pod to the particular host that holds its data (see the sketch below)
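A minimal sketch of what that could look like, assuming a hypothetical `freshrss` deployment whose data lives at `/mnt/data/freshrss` on a node named `atlas` (all names, paths and the image here are placeholders, not the actual setup):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: freshrss                        # placeholder deployment name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: freshrss
  template:
    metadata:
      labels:
        app: freshrss
    spec:
      # Pin the pod to the host that actually holds the data.
      nodeSelector:
        kubernetes.io/hostname: atlas   # placeholder node name
      containers:
        - name: freshrss
          image: freshrss/freshrss      # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/www/FreshRSS/data
      volumes:
        # Mount the local directory straight from the host.
        - name: data
          hostPath:
            path: /mnt/data/freshrss    # placeholder host directory
            type: Directory
```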
Author
Owner

Our current backup strategy heavily relies on Longhorn as well, so we have to fix that too. The current plan is:

  1. Create a timer on the host for each datastore (freshrss, forgejo, etc.) that fires at a random-ish time during the night.
  2. The timer knows which Kubernetes deployments have the data mounted; it scales these deployments down to 0 first. (If I feel creative, I might even be able to infer this programmatically.)
  3. Then push the data to a Borgbase repository specific to that datastore.
  4. Scale the deployments back up to their normal replica counts. (A rough sketch follows below.)
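A rough sketch of what the timer could run for one datastore, assuming `kubectl` and `borg` are available on the host; the datastore name, data path, deployment list, and Borgbase repo URL are all placeholders, not the final implementation:

```sh
#!/usr/bin/env sh
# Sketch: nightly backup of one datastore (here: freshrss).
set -eu

DATASTORE=freshrss
DATA_DIR=/mnt/data/$DATASTORE                          # placeholder host path
DEPLOYMENTS="freshrss"                                 # deployments that mount this data
BORG_REPO="ssh://user@user.repo.borgbase.com/./repo"   # placeholder Borgbase repo

# 1. Scale the affected deployments down so the data is quiescent.
for d in $DEPLOYMENTS; do
  kubectl scale deployment "$d" --replicas=0
  kubectl rollout status deployment "$d" --timeout=120s
done

# 2. Push the data to the datastore-specific Borg repo.
borg create --stats "$BORG_REPO::$DATASTORE-{now:%Y-%m-%d}" "$DATA_DIR"

# 3. Scale the deployments back up (replica count of 1 assumed here).
for d in $DEPLOYMENTS; do
  kubectl scale deployment "$d" --replicas=1
done
```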
Author
Owner

Almost done; just have to clean up scsi and nfs.
