K8s Deep Dive: Storage
This is a companion to the Home Kubernetes Cluster overview and a sibling to the Networking Deep Dive. That page covers Cilium, MetalLB, and ingress. This one covers everything below the compute layer — how data persists, how it survives node failures, and the real operational tradeoffs of running distributed block storage on single-board computers.
Storage on bare-metal Kubernetes is one of those problems that looks simple until you live with it. Cloud environments hand you EBS or Persistent Disks with a click. On a 4-node cluster of Orange Pi 5 SBCs, you’re building the storage layer from scratch, and every choice has consequences you’ll feel at 11 PM when a node drops.
The Problem
Kubernetes workloads need persistent storage. Databases, configuration files, log archives, agent workspaces — all of it needs to survive pod restarts and node rescheduling. On a single machine, this is trivial: write to a local disk. On a multi-node cluster where pods can land on any node, it’s a real problem.
The specific constraint on SBCs makes it worse. Each Orange Pi 5 has a single NVMe drive. There are no dedicated storage nodes, no SAN, no shared filesystem. The storage medium and the compute medium are the same hardware. Whatever solution I choose has to share resources with the actual workloads.
I evaluated four alternatives before landing on Longhorn. The decision rationale is fully documented in ADR-002: Longhorn for Distributed Storage on ARM64 — I won’t repeat that analysis here, but the comparison table tells the story quickly:
| Alternative | Why It Didn’t Work |
|---|---|
| Rook-Ceph | Gold standard for distributed storage, but catastrophically heavy. Ceph’s OSD, MON, and MGR daemons consume multiple GB of RAM each. On nodes with 16 GB total and real workloads competing for memory, Ceph would starve everything else. Designed for 10+ node clusters with dedicated storage hardware. |
| OpenEBS | Promising architecture, especially Mayastor for NVMe-native storage. But ARM64 support was spotty at evaluation time — community reports of build failures and missing multi-arch images. Too risky for storage you depend on. |
| NFS (single server) | Simple and proven, but a single point of failure by definition. If the NFS server node dies, every volume in the cluster goes with it. Also measurably worse write performance over the network compared to local block storage. Defeats the purpose of a multi-node cluster. |
| local-path-provisioner | Rancher’s lightweight option. Zero overhead, fastest performance — but no replication at all. A node failure means data loss for anything on that node. Fine for ephemeral data, unacceptable for Home Assistant’s database or agent audit logs. |
The pattern is clear: every alternative is either too heavy for SBCs, too unreliable on ARM64, a single point of failure, or left with no resilience at all. Longhorn is the one option that threads the needle.
Longhorn
Version: v1.10.1 | Namespace: longhorn-system | Replication: 2x default
Longhorn is a CNCF project that provides distributed block storage for Kubernetes. It’s designed for commodity hardware — it doesn’t assume enterprise SSDs, dedicated storage networks, or nodes with 128 GB of RAM. It runs on the same disks the OS uses, which is exactly the constraint SBCs impose.
Deployed via Helm chart into the longhorn-system namespace. Every node in the cluster participates in storage — there are no dedicated storage nodes.
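For reference, here is a minimal sketch of what that Helm deployment can look like. The keys follow the upstream Longhorn chart’s persistence and defaultSettings sections; treat the exact values as illustrative rather than a copy of this cluster’s configuration.

```yaml
# values.yaml for the Longhorn Helm chart (sketch, not the full configuration).
persistence:
  defaultClass: true            # mark the generated "longhorn" StorageClass as default (an assumption here)
  defaultClassReplicaCount: 2   # 2x replication, explained below
defaultSettings:
  defaultDataPath: /var/lib/longhorn/   # where replica data lives on each node's NVMe
  replicaSoftAntiAffinity: false        # never place two replicas of one volume on the same node
```

Installed with something like `helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace -f values.yaml`.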
How It Works
Longhorn’s architecture is built around three concepts: engines, replicas, and the manager. When a pod requests a persistent volume, here’s what actually happens:
```mermaid
graph TB
    subgraph app["Application Layer"]
        pod["Pod with PVC"]
    end

    subgraph csi["CSI Interface"]
        driver["Longhorn CSI Driver"]
    end

    subgraph longhorn["Longhorn Engine (per volume)"]
        engine["Engine\n(iSCSI target)"]
    end

    subgraph replicas["Replica Layer"]
        r1["Replica 1\n(kube01)"]
        r2["Replica 2\n(kube03)"]
    end

    pod -->|"PVC mount"| driver
    driver -->|"Attach volume"| engine
    engine -->|"Synchronous write"| r1
    engine -->|"Synchronous write"| r2

    style pod fill:#e8f5e9,stroke:#4CAF50
    style driver fill:#e8f4fd,stroke:#2196F3
    style engine fill:#fff3e0,stroke:#FF9800
    style r1 fill:#fce4ec,stroke:#E91E63
    style r2 fill:#fce4ec,stroke:#E91E63
```

The flow (a sketch of the resulting Volume object follows these steps):
1. PVC request. A workload creates a PersistentVolumeClaim referencing the `longhorn` StorageClass. Kubernetes asks Longhorn’s CSI driver to provision it.
2. Engine creation. Longhorn spins up an engine pod — a lightweight iSCSI target — on the node where the consuming pod is scheduled. The engine is the single access point for that volume.
3. Replica placement. Longhorn creates replica pods on different nodes (2 replicas with 2x replication). Each replica holds a full copy of the volume’s data on that node’s local disk.
4. Synchronous writes. When the application writes data, the engine forwards the write to both replicas synchronously. The write is only acknowledged to the application after both replicas confirm. This is what guarantees consistency — there’s no window where one replica has data the other doesn’t.
5. Failover. If a node goes down, the replica on that node becomes unavailable. The engine continues serving reads and writes from the surviving replica. When the node comes back (or Longhorn rebuilds the replica on another node), the data syncs automatically.
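If you want to see these pieces as Kubernetes objects, each PVC maps to a Volume custom resource in the longhorn-system namespace. The sketch below is abridged and the values are illustrative (the volume name in particular is hypothetical); the real objects carry many more fields.

```yaml
# Abridged sketch of a Longhorn Volume custom resource (illustrative values).
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: pvc-1234abcd          # hypothetical; real names are derived from the PVC UID
  namespace: longhorn-system
spec:
  numberOfReplicas: 2         # inherited from the StorageClass
  frontend: blockdev          # the engine exposes the volume as a block device
  size: "10737418240"         # 10 Gi, expressed in bytes
status:
  state: attached             # attached to the node running the consuming pod
  robustness: healthy         # drops to "degraded" when a replica is unavailable
```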
Why 2x Replication
On a 4-node cluster, the replication factor is a math problem with real consequences:
| Replication | Storage Overhead | Node Failures Survived | Practical Impact |
|---|---|---|---|
| 1x (no replication) | 1x | 0 | Any node failure loses data. Same as local-path-provisioner. |
| 2x (current) | 2x | 1 | Survives any single-node failure. Sweet spot for 4 nodes. |
| 3x | 3x | 2 | Only a third of raw capacity is usable, and on 4 nodes, 3 out of 4 hold a copy of every volume. Excessive. |
2x replication is the clear sweet spot. It provides resilience against the most common failure mode (one node going offline for maintenance, power, or hardware issues) without consuming an unreasonable share of storage. The 2x overhead means a 10 Gi volume uses 20 Gi of aggregate cluster storage — meaningful on NVMe drives that aren’t enormous, but manageable with capacity planning.
3x would only make sense if I regularly expected two simultaneous node failures. On a home cluster where nodes share the same power circuit and network switch, correlated failures (power outage) would likely take out all nodes anyway, making 3x no better than 2x for the scenarios that actually matter.
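The replication factor lives in the StorageClass parameters, so every dynamically provisioned volume inherits it. This is roughly what the class created by the Helm chart looks like; the exact fields depend on chart values, so treat it as a sketch rather than a dump of the live object. A scratch tier without replication would be an identical class with numberOfReplicas set to "1".

```yaml
# Sketch of the "longhorn" StorageClass with 2x replication.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"        # two full copies of every volume
  staleReplicaTimeout: "2880"  # minutes before an errored replica is considered gone
  fromBackup: ""
```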
Active Volumes
The cluster currently runs 15+ persistent volumes across all four nodes. Here’s the breakdown by size tier:
Size Distribution
| Size | Count | Workloads |
|---|---|---|
| 1 Gi | ~3 | Security scan results, agent workspace scratch, small configs |
| 2-3 Gi | ~2 | Audit logs, lightweight application databases |
| 10 Gi | ~7 | PostgreSQL instances, OpenClaw platform data, monitoring stack |
| 50 Gi | 1 | Loki log storage (cluster-wide log aggregation) |
The distribution tells a story about the workloads. Most volumes are 10 Gi — that’s the default size I’ve landed on for anything that needs a real database or accumulates data over time. The single 50 Gi volume is Loki, which ingests logs from every pod in the cluster and retains them for querying. Small 1 Gi volumes are for workloads that need persistence but don’t accumulate much data.
Distribution Across Nodes
Longhorn distributes replicas across kube01 through kube04, with the anti-affinity constraint that no two replicas of the same volume land on the same node. The engine runs on whichever node the consuming pod is scheduled to. This means storage I/O for a volume might be local (engine and a replica on the same node) or remote (engine on one node, both replicas on others), depending on scheduling.
In practice, Longhorn’s scheduler does a reasonable job of balancing replica placement across nodes. I haven’t needed to manually intervene with node tags or affinity rules — the default anti-affinity behavior handles it.
PVC Patterns
Every workload in the cluster requests storage the same way — a standard PersistentVolumeClaim referencing the `longhorn` StorageClass. Dynamic provisioning handles the rest.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: my-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi
```

A few things to note:

- `ReadWriteOnce` is the access mode for every volume. Longhorn supports `ReadWriteMany` (RWX) via NFS export, but I haven’t needed it. Each volume is consumed by a single pod, which is the right pattern for databases and application state.
- `storageClassName: longhorn` is the only StorageClass in the cluster. There’s no decision to make per-workload — everything gets Longhorn with 2x replication. If I needed a performance tier without replication for scratch data, I could create a second StorageClass with `numberOfReplicas: 1`, but I haven’t felt that need yet.
- Dynamic provisioning means I never create PersistentVolume objects manually. The PVC goes in, Longhorn provisions the volume, creates replicas, and binds the PV automatically. Delete the PVC (with the right reclaim policy), and Longhorn cleans up the volume and its replicas.
In Helm charts, the PVC is typically part of a StatefulSet’s volumeClaimTemplates, which gives each replica its own stable volume that follows the pod across rescheduling.
Database Considerations
This is the section I wish I’d written before deploying anything. The single most impactful lesson from running databases on Longhorn:
SQLite on distributed block storage causes WAL locking issues.
SQLite uses file-level locking (specifically, POSIX advisory locks on the WAL file) to coordinate concurrent access. These locks work perfectly on a local filesystem. On a block storage layer that replicates writes across a network to multiple nodes, the locking semantics break down. You get intermittent SQLITE_BUSY errors, WAL file corruption on failover, and databases that work fine for hours then suddenly lock up under write load.
I hit this with an early deployment that used SQLite as its backing store. The symptoms were maddening — intermittent failures that didn’t reproduce consistently, because they depended on exactly when replicas synced and whether the iSCSI connection experienced any latency spike during a lock acquisition.
The Rule: PostgreSQL From Day One
Every workload that needs a relational database gets PostgreSQL. No exceptions, no “SQLite is simpler for now.” The operational cost of debugging WAL locking issues on distributed storage is far higher than the marginal complexity of running Postgres.
Specifically: postgres:16-alpine — the official PostgreSQL image on Alpine Linux.
Why not the alternatives:
- Not Bitnami’s PostgreSQL chart. Broadcom’s acquisition of VMware (which owns Bitnami) changed the licensing terms in August 2025. The Bitnami charts are still functional, but the licensing direction is uncertain enough that I don’t want to build dependencies on them. The official `postgres:16-alpine` image is maintained by the PostgreSQL Docker community and has no corporate licensing risk.
- Not CloudNativePG. CloudNativePG is an excellent operator for running PostgreSQL on Kubernetes — it handles failover, backup, point-in-time recovery, and connection pooling. But it’s overkill for single-instance databases on a 4-node homelab. The operator itself consumes resources, adds operational complexity, and introduces another thing to upgrade and monitor. When I need HA Postgres with automated failover, I’ll adopt it. For now, a single StatefulSet with a Longhorn PVC is sufficient and much simpler to reason about.
PostgreSQL handles the network-backed block storage correctly because it uses its own write-ahead log implementation that doesn’t rely on POSIX file locking semantics the way SQLite does. Writes go through PostgreSQL’s WAL, which is designed for exactly this kind of I/O pattern.
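To make the pattern concrete, here is a hedged sketch of what “a single StatefulSet with a Longhorn PVC” looks like for a Postgres instance. The names (app-postgres, my-app, the Secret) are hypothetical placeholders rather than objects from this cluster, and a real manifest would also carry resource requests and probes.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app-postgres          # hypothetical name
  namespace: my-app           # hypothetical namespace
spec:
  serviceName: app-postgres
  replicas: 1
  selector:
    matchLabels:
      app: app-postgres
  template:
    metadata:
      labels:
        app: app-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-postgres-credentials   # hypothetical Secret
                  key: password
            - name: PGDATA
              # a subdirectory keeps Postgres data away from the volume's lost+found
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: longhorn
        resources:
          requests:
            storage: 10Gi
```

The volumeClaimTemplates block is the same mechanism described in the PVC patterns section: the pod gets a stable, individually provisioned Longhorn volume that follows it across rescheduling.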
Operational Reality
Running distributed storage on SBCs is not install-and-forget. Here’s what the day-to-day actually looks like.
Node Maintenance Workflow
The most common operation is taking a node offline for OS updates, kernel patches, or hardware maintenance. The procedure:
1. `kubectl drain <node>` — evicts all pods from the node. Longhorn detects that the node’s replicas are no longer accessible.
2. Perform maintenance. Update packages, reboot, replace hardware, whatever the task requires.
3. `kubectl uncordon <node>` — marks the node as schedulable again. Pods can be rescheduled to it.
4. Replica rebuild. Longhorn detects the node is back and begins rebuilding any replicas that were degraded during the maintenance window. This is an I/O-intensive operation — the full volume data is copied from the healthy replica to the rebuilt one.
The key insight: during steps 1-3, all volumes remain available because the second replica on another node continues serving data. This is the entire point of 2x replication. Without it, draining a node means downtime for every volume that had its only copy there.
Replica rebuilds after a node returns can take time depending on volume size. A 1 Gi volume rebuilds in seconds. The 50 Gi Loki volume takes noticeably longer. During rebuild, the volume is in a degraded state (1 healthy replica) — if a second node fails during this window, you lose data. This is the inherent risk of 2x vs 3x replication.
Write Latency
Every write is synchronously replicated before being acknowledged. This adds latency compared to local disk writes. For the workloads on this cluster — Home Assistant configuration, PostgreSQL databases with modest write rates, AI agent state — the latency is imperceptible in practice. I can’t measure it in application response times.
Where it would matter: high-throughput sequential writes. If I were running a workload that streamed gigabytes of data to disk, the replication overhead would be visible. But that’s not the workload pattern here.
Resource Overhead
Longhorn isn’t free. On every node, it runs:
- longhorn-manager — the control plane component that orchestrates volume lifecycle, replica placement, and node monitoring.
- Engine and replica pods — one engine per attached volume, one replica per volume copy on that node.
- CSI driver components — the interface between Kubernetes and Longhorn.
In aggregate, Longhorn’s system components consume memory on every node. On 16 GB nodes running real workloads, this matters. I’ve tuned the resource requests and limits to keep Longhorn functional without letting it balloon. It’s a tax you pay for distributed storage — but on SBCs, every megabyte of overhead is felt more acutely than on a cloud VM.
Monitoring
Three layers of visibility into storage health:
1. Longhorn UI. A web dashboard that shows volume status, replica placement, node disk utilization, and rebuild progress. This is the first place I look when something seems off. Accessible via `kubectl port-forward` through the Tailscale mesh.
2. Prometheus metrics. Longhorn exposes metrics that feed into the cluster’s Grafana dashboards. Volume IOPS, latency percentiles, replica health, and disk capacity trends. The Grafana dashboards give historical context that the Longhorn UI doesn’t — I can see whether write latency has been creeping up over weeks, not just what it is right now. (A sketch of the scrape configuration follows this list.)
3. Kubernetes events and alerts. Longhorn emits Kubernetes events for volume degradation, replica failure, and rebuild completion. These integrate with whatever alerting is configured on the cluster.
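The scrape configuration behind the Prometheus item is small. Assuming the Prometheus Operator is running and watching a monitoring namespace (both assumptions on this page), a ServiceMonitor along these lines, adapted from the upstream Longhorn monitoring docs, picks up the longhorn-manager metrics endpoint:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn
  namespace: monitoring          # assumption: wherever the Prometheus Operator looks for ServiceMonitors
spec:
  selector:
    matchLabels:
      app: longhorn-manager      # the service fronting the manager pods
  namespaceSelector:
    matchNames:
      - longhorn-system
  endpoints:
    - port: manager              # the named metrics port on that service
```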
Backup Strategy
This is the section where I have to be honest about a known gap.
Current State
There is no S3 backup configured. The cluster relies entirely on 2x replication within the cluster itself. This protects against single-node failure, but it does not protect against:
- Correlated failure. A power outage takes out all four nodes simultaneously. When they come back, Longhorn should recover from its replicas on disk — but “should” is not “will” in a power-loss scenario with synchronous replication mid-flight.
- Logical corruption. An application bug writes garbage to a volume. Both replicas faithfully store the garbage. Replication protects against hardware failure, not software errors.
- Physical disaster. All four nodes live on the same shelf. A house fire, flood, or theft takes everything.
2x replication is not a backup strategy. It’s a high-availability mechanism. I know this. The gap is documented, acknowledged, and on the roadmap.
Planned
The plan, which I’m treating as a near-term priority (the Longhorn resources involved are sketched after this list):
- S3-compatible backup target. Longhorn natively supports backing up volumes to any S3-compatible endpoint. This could be AWS S3, Backblaze B2, MinIO on a separate machine, or any other compatible service.
- Daily backups with 7-day retention. Every volume gets a daily backup, keeping the last 7 days. This covers logical corruption with a recovery point no more than 24 hours old.
- 6-hour snapshots. Longhorn’s snapshot mechanism provides more granular recovery points within the cluster. Snapshots are cheaper than backups (they’re local, delta-based) but don’t protect against node loss.
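Once an S3 target exists, the schedule above maps onto Longhorn’s RecurringJob resources. A sketch of what that could look like, assuming the backup target and its credential secret have already been configured (for example via the chart’s defaultSettings.backupTarget and defaultSettings.backupTargetCredentialSecret values); the snapshot retention count is a placeholder of mine, since the plan above doesn’t specify one.

```yaml
# Sketch of the planned recurring jobs; the "default" group applies to every
# volume that has no explicit recurring-job group assigned.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-daily
  namespace: longhorn-system
spec:
  task: backup          # ships volume data to the configured S3 target
  cron: "0 3 * * *"     # daily, example time
  groups: ["default"]
  retain: 7             # keep the last 7 backups
  concurrency: 2
---
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: snapshot-6h
  namespace: longhorn-system
spec:
  task: snapshot        # local, delta-based recovery points
  cron: "0 */6 * * *"   # every 6 hours
  groups: ["default"]
  retain: 8             # placeholder retention (roughly two days of snapshots)
  concurrency: 2
```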
Until the S3 target is configured, the cluster is operating with an acknowledged risk. The workloads are not irreplaceable — Home Assistant can be reconfigured, databases can be rebuilt, agent state can be regenerated — but the recovery time would be measured in hours, not minutes. That’s the honest tradeoff of a homelab that hasn’t yet completed its backup story.
Summary
| Component | Detail |
|---|---|
| Storage Provider | Longhorn v1.10.1 |
| Namespace | longhorn-system |
| Replication | 2x (survives single-node failure) |
| StorageClass | longhorn (only class, dynamic provisioning) |
| Active Volumes | 15+ across kube01-kube04 |
| Largest Volume | 50 Gi (Loki log storage) |
| Database Policy | PostgreSQL always, never SQLite on distributed storage |
| Backup Status | 2x replication only (S3 backup planned, not yet configured) |
Longhorn solves the distributed storage problem for this cluster in a way that’s operationally manageable by one person. It’s not the fastest, not the most feature-rich, and not the most efficient in terms of resource overhead. But it’s the one that works reliably on ARM64 SBCs with 16 GB of RAM, integrates cleanly with Kubernetes, and provides the single-node failure resilience that makes this cluster trustworthy enough to run real workloads on.
The storage layer is infrastructure you stop thinking about when it works — and that’s exactly where Longhorn has been since deployment. The backup gap is the remaining item that keeps it from being fully production-grade, and closing that gap is the next priority.