K8s Deep Dive: GitOps

This is a companion to the Home Kubernetes Cluster overview and a sibling to the Networking Deep Dive and the Storage Deep Dive. Those pages cover how traffic flows and how data persists. This one covers how changes get to the cluster in the first place — and why “I’ll just kubectl apply it” is a trap.

Every piece of infrastructure described in the sibling pages — Cilium, Longhorn, MetalLB, ingress-nginx, cert-manager — is deployed and managed through a GitOps pipeline. No manual Helm installs, no imperative kubectl commands for steady-state configuration. Git is the single source of truth for what runs on this cluster.

Three properties make GitOps worth the setup cost, especially for infrastructure managed by one person:

Auditability. git log is the audit trail. Every change to the cluster — a new Helm chart, a bumped version, a modified value — is a commit with a timestamp, author, and diff. When something breaks at 11 PM, I don’t have to remember what I changed. The commit history tells me exactly what changed, when, and why (assuming I wrote decent commit messages, which is its own discipline).

Reproducibility. The entire cluster can be reconstructed from a single Git repository. If all four nodes caught fire tomorrow, I could provision new hardware, bootstrap Kubernetes, point Flux at the repo, and walk away. Every namespace, every Helm release, every Kustomization would converge to the declared state. The recovery time is limited by hardware provisioning, not by remembering what was deployed where.

Safety. Pull requests are the review gate. On a solo project, PRs might seem like overhead — who am I reviewing for? Myself, it turns out. The diff in a PR forces me to read the change in context before it hits the cluster. And when a change goes wrong, git revert followed by a push is the rollback mechanism. No fumbling with helm rollback flags or trying to remember the previous values. Revert the commit, Flux reconciles, done.

The GitOps pipeline has a clear flow: a GitHub repository holds the desired state, Flux watches that repo, and Kustomizations and HelmRelease custom resources translate the desired state into actual Kubernetes objects.

graph LR
repo["GitHub Repo\nspencer2211/fiducian-kube"] -->|"SSH poll\n(1 min)"| flux["Flux\n(flux-system)"]
flux -->|"Reconcile"| kust["Kustomizations"]
kust -->|"Render manifests"| hr["HelmRelease CRs"]
hr -->|"Helm install/upgrade"| workloads["Deployed Workloads\n(18 Helm releases)"]
style repo fill:#f5f5f5,stroke:#333
style flux fill:#e8f4fd,stroke:#2196F3
style kust fill:#e8f4fd,stroke:#2196F3
style hr fill:#fff3e0,stroke:#FF9800
style workloads fill:#e8f5e9,stroke:#4CAF50
| Component | Role |
| --- | --- |
| GitHub (spencer2211/fiducian-kube) | Source of truth. All cluster state lives here. |
| Flux v2 | GitOps operator. Watches the repo and reconciles cluster state to match. |
| Kustomizations | Flux CRs that define which paths in the repo to apply and in what order. |
| HelmRelease CRs | Declarative Helm installs. Flux manages the Helm lifecycle — install, upgrade, rollback. |
| Vault + ESO | Secrets stay in Vault. External Secrets Operator syncs them into Kubernetes Secrets. |

Flux runs in the flux-system namespace. It was bootstrapped with flux bootstrap github, which sets up the initial GitRepository source, the self-managing Kustomization, and the deploy key on the GitHub repo.
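The bootstrap is a one-time command along these lines (owner, repo, branch, and path match the source configuration and the Kustomization table below; the --personal flag is an assumption about the account type):

flux bootstrap github \
  --owner=spencer2211 \
  --repository=fiducian-kube \
  --branch=production \
  --path=clusters/production \
  --personal   # assumption: personal account rather than an organization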

The source configuration tells Flux where to find the desired state:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: production
  secretRef:
    name: flux-system
  url: ssh://[email protected]/spencer2211/fiducian-kube

A few things worth noting:

  • SSH transport. The repo is accessed over SSH with a deploy key, not HTTPS with a token. Deploy keys are scoped to a single repository and can be read-only, which is the minimum privilege Flux needs. If the key is compromised, the blast radius is one repo, not my entire GitHub account.

  • production branch. Not main. The production branch is the branch that Flux watches. Changes go through main first via PR, then get merged or promoted to production. This gives me a gate between “merged code” and “deployed code” when I want it, though in practice I often merge directly to production for infrastructure changes.

  • 1-minute poll interval. Flux checks the repo every 60 seconds. For a homelab, this is a good balance — changes deploy within a minute of merging, without hammering GitHub’s API. Flux also supports webhook receivers for instant reconciliation, but polling is simpler and sufficient here.

Flux uses three Kustomizations to manage different parts of the cluster:

| Kustomization | Path | Purpose |
| --- | --- | --- |
| flux-system | clusters/production | Base Flux components and cluster-wide resources. Self-referential — Flux manages its own deployment. |
| llm-keda | apps/llm-keda | KEDA autoscaling configuration for the LLM inference stack. ScaledObjects and triggers for scaling Ollama replicas based on queue depth. |
| llm-ollama | apps/llm-ollama | Ollama LLM inference deployment. Model serving, resource limits, GPU scheduling. |

The flux-system Kustomization is the root — it’s what Flux bootstraps first, and it references everything else. The LLM-specific Kustomizations are separated because they have different reconciliation needs. The Ollama deployment changes frequently as I experiment with models and resource allocations, while the base Flux configuration is essentially static.
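For reference, a Flux Kustomization CR is a small manifest in its own right. A sketch of what the llm-ollama one might look like, with the interval and prune settings as assumptions:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-ollama
  namespace: flux-system
spec:
  interval: 5m          # assumption: reconcile more often than the static base config
  path: ./apps/llm-ollama
  prune: true           # assumption: delete objects that are removed from Git
  sourceRef:
    kind: GitRepository
    name: flux-system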

Every workload on the cluster is deployed as a HelmRelease custom resource. Flux watches these CRs and manages the Helm lifecycle — installing charts, upgrading when the spec changes, and rolling back on failure.

There are currently 18 Helm releases across the cluster:

| Release | Namespace | Chart Version | App Version |
| --- | --- | --- | --- |
| cilium | cilium-system | 1.16.18 | 1.16.18 |
| longhorn | longhorn-system | 1.10.1 | v1.10.1 |
| cert-manager | cert-manager | v1.13.3 | v1.13.3 |
| ingress-nginx | ingress-nginx | 4.14.1 | 1.14.1 |
| metallb | metallb-system | 0.13.12 | v0.13.12 |
| vault | vault | 0.31.0 | 1.20.4 |
| home-assistant | home-assistant | 0.3.43 | 2026.2.1 |
| external-secrets | external-secrets-system | 1.1.1 | v1.1.1 |
| kube-prometheus | monitoring | 80.9.2 | v0.87.1 |
| grafana | monitoring | 10.4.0 | 12.3.0 |
| loki | monitoring | 6.46.0 | 3.5.7 |
| promtail | monitoring | 6.17.1 | 3.5.1 |
| jaeger | monitoring | 3.4.1 | 1.53.0 |
| keda | keda | 2.18.3 | 2.18.3 |

That’s 14 listed explicitly — the remaining 4 are application-specific releases in namespaces like openclaw, nats, graphiti, and llm-gateway that are covered in their respective project pages.

The releases fall into natural tiers:

  • Infrastructure (Cilium, MetalLB, cert-manager, ingress-nginx, Longhorn) — the foundation everything else depends on. These change rarely and require careful testing when they do.
  • Platform Services (Vault, External Secrets, KEDA) — shared capabilities consumed by applications. Medium change frequency.
  • Observability (kube-prometheus, Grafana, Loki, Promtail, Jaeger) — the monitoring stack. Changes usually mean dashboard updates or retention tuning, not architectural shifts.
  • Applications (Home Assistant, plus the workloads in other namespaces) — the things the cluster exists to run. Highest change frequency.

Here’s what a typical HelmRelease looks like, using Home Assistant as an example:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: home-assistant
  namespace: home-assistant
spec:
  interval: 5m
  chart:
    spec:
      chart: home-assistant
      version: "0.3.43"
      sourceRef:
        kind: HelmRepository
        name: home-assistant
        namespace: flux-system
  values:
    image:
      repository: ghcr.io/home-assistant/home-assistant
      tag: "2026.2.1"
    hostNetwork: true
    persistence:
      enabled: true
      storageClass: longhorn
      size: 10Gi
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        memory: 1Gi

The structure is straightforward:

  • chart.spec pins the Helm chart version. This is the version of the chart packaging, not the application itself.
  • values is the equivalent of a values.yaml file — it configures the chart. The image tag, persistence settings, resource limits all go here.
  • interval: 5m means Flux checks this release every 5 minutes and reconciles if the actual state has drifted from the desired state.
  • sourceRef points to a HelmRepository resource that Flux also manages — it’s the chart registry URL.
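That HelmRepository is itself a manifest in the repo. A sketch, with the chart repository URL and interval as placeholder assumptions:

apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: home-assistant
  namespace: flux-system
spec:
  interval: 1h                                                 # assumption: the chart index rarely changes
  url: https://example.github.io/home-assistant-helm-chart     # placeholder URL for the chart registry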

When I want to upgrade Home Assistant, I change tag: "2026.2.1" to the new version, commit, push, and Flux handles the rolling update. No helm upgrade commands, no remembering which values I passed last time.
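After the push, it is worth watching the release converge. A sketch, where the Deployment name is an assumption about what the chart creates:

flux get helmreleases -n home-assistant --watch                    # wait for the new revision to show as Ready
kubectl -n home-assistant rollout status deploy/home-assistant     # assumption: the Deployment is named after the release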

The standard deployment workflow is branch-based:

sequenceDiagram
participant Dev as Developer
participant Git as GitHub (fiducian-kube)
participant Flux as Flux (flux-system)
participant K8s as Kubernetes Cluster
Dev->>Git: Create branch, modify HelmRelease
Dev->>Git: Open PR, review diff
Dev->>Git: Merge to production
Git-->>Flux: Poll detects new commit (≤1 min)
Flux->>K8s: Reconcile — helm upgrade
K8s-->>Flux: Release status: deployed
Flux-->>Git: Update status (GitOps Toolkit)

In practice for a solo operator, the PR step is often a quick self-review rather than a formal process. But the discipline of reading the diff before merging has caught real errors — typos in resource limits, wrong chart versions, accidentally removed values. The PR isn’t bureaucracy; it’s a forcing function for a second look.

Rollback is the inverse of deployment:

  1. git revert <commit> — creates a new commit that undoes the change.
  2. git push — Flux detects the new commit within a minute.
  3. Flux reconciles — the cluster converges back to the previous state.

This is fundamentally safer than helm rollback because it operates on the source of truth. A helm rollback changes the cluster state without changing the Git repo, which means Git and the cluster are now out of sync. The next Flux reconciliation would re-apply the change you just rolled back. Git revert avoids this entirely — the revert is in Git, so Flux and the cluster agree on what the desired state should be.
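In shell terms, assuming the bad change landed as a single commit:

git revert <commit>                                        # new commit that undoes the change; history is not rewritten
git push origin production
flux reconcile kustomization flux-system --with-source    # optional: skip the poll and converge immediately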

Sometimes you need to fix something immediately. A pod is crash-looping, a misconfiguration is dropping traffic, something is on fire. The workflow:

  1. kubectl apply or kubectl edit — fix the immediate problem.
  2. Immediately commit the same change to Git — this is the critical step.

If you skip step 2, you have undocumented drift: the cluster state no longer matches Git. For objects Flux applies directly, the next reconciliation (within a minute) reverts your emergency fix back to whatever Git says. For objects owned by a Helm release, the fix may survive until the next spec change triggers an upgrade that re-renders the manifests. Either way, undocumented drift is a ticking time bomb, and the fix you applied at 2 AM eventually disappears silently.

The rule: if you kubectl apply something, you have 60 seconds to get the equivalent change into Git before Flux overwrites it. In practice, I usually suspend Flux reconciliation first (flux suspend kustomization <name>), make the fix, commit to Git, then resume reconciliation. Belt and suspenders.
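The belt-and-suspenders version, as a sketch (the Kustomization name is whichever one owns the broken workload; the edit target is a hypothetical example):

flux suspend kustomization llm-ollama            # stop reconciliation so the hotfix survives
kubectl -n llm-ollama edit deployment ollama     # assumption: hypothetical fix target
# ...commit the equivalent change to Git and push...
flux resume kustomization llm-ollama             # reconciliation resumes against the now-matching state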

The one thing that can’t live in Git is secrets. API keys, database passwords, TLS private keys — committing these to a repository, even a private one, is a security failure mode. If the repo is ever exposed, every secret is compromised.

The cluster solves this with HashiCorp Vault and the External Secrets Operator (ESO):

graph LR
vault["HashiCorp Vault\nv1.20.4"] -->|"API"| eso["External Secrets\nOperator v1.1.1"]
eso -->|"Sync"| secret["Kubernetes Secret"]
secret -->|"Volume mount\nor env var"| pod["HelmRelease / Pod"]
style vault fill:#fce4ec,stroke:#E91E63
style eso fill:#fff3e0,stroke:#FF9800
style secret fill:#e8f4fd,stroke:#2196F3
style pod fill:#e8f5e9,stroke:#4CAF50

The flow:

  1. Vault stores the actual secret values. It’s deployed on the cluster itself via Helm (version 1.20.4), with its storage backend on a Longhorn volume. Vault is unsealed using a set of key shares — the unseal process is the one manual step after a full cluster restart (sketched just after this list).

  2. External Secrets Operator runs in the external-secrets-system namespace. It watches ExternalSecret custom resources, which declare “I need a Kubernetes Secret with key X, and the value comes from Vault path Y.”

  3. The ESO controller authenticates to Vault (using Kubernetes service account auth), reads the secret value, and creates or updates a standard Kubernetes Secret.

  4. Workloads consume the Kubernetes Secret normally — as environment variables or volume mounts. They don’t know or care that the value came from Vault.
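The unseal step mentioned above is the only part of recovery that can’t be driven from Git. A sketch, assuming the Vault Helm chart’s default StatefulSet pod name and the default key-share threshold:

# after a full cluster restart, Vault comes back sealed
kubectl exec -ti -n vault vault-0 -- vault operator unseal   # prompts for one key share
# repeat until the unseal threshold is reached (three of five shares by default)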

The key benefit: the Git repository contains ExternalSecret resources that reference Vault paths, but never the actual secret values. The repo can be shared, reviewed, and even made public without exposing credentials. The secret values live exclusively in Vault, which has its own access control, audit logging, and rotation capabilities.

An ExternalSecret in the repo:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
  namespace: my-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: app-credentials
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: secret/data/my-app
        property: database_url

This is safe to commit. It says “create a Kubernetes Secret called app-credentials with a key DATABASE_URL whose value comes from secret/data/my-app in Vault.” The actual database URL never appears in Git.
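The vault-backend ClusterSecretStore it points at is also just a manifest in Git. A sketch, assuming Kubernetes service-account auth against a KV v2 mount (the server address, role, mount path, and service account name are assumptions):

apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: http://vault.vault.svc:8200   # assumption: in-cluster Vault service address
      path: secret                          # KV mount; pairs with the secret/data/... keys above
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes             # assumption: default Kubernetes auth mount
          role: external-secrets            # assumption: Vault role granted to ESO
          serviceAccountRef:
            name: external-secrets          # assumption: ESO service account name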

The entire monitoring stack is itself GitOps-managed — deployed and configured through HelmRelease resources, same as everything else. Monitoring the monitors through the same pipeline they monitor. It’s turtles all the way down, but it works.

| Component | Namespace | Chart Version | App Version | Purpose |
| --- | --- | --- | --- | --- |
| kube-prometheus | monitoring | 80.9.2 | v0.87.1 | Prometheus + Alertmanager + default recording/alerting rules. The metrics backbone. |
| grafana | monitoring | 10.4.0 | 12.3.0 | Dashboards and visualization. Connects to Prometheus, Loki, and Jaeger as data sources. |
| loki | monitoring | 6.46.0 | 3.5.7 | Log aggregation. Receives logs from Promtail, stores on Longhorn (50 Gi volume). |
| promtail | monitoring | 6.17.1 | 3.5.1 | Log shipper. DaemonSet that tails container logs on every node and forwards to Loki. |
| jaeger | monitoring | 3.4.1 | 1.53.0 | Distributed tracing. Captures trace spans from instrumented services. |

This stack provides three pillars of observability:

  • Metrics (Prometheus via kube-prometheus-stack) — numeric time-series data. CPU usage, memory consumption, request rates, error rates, Helm release status, Flux reconciliation duration. Prometheus scrapes metrics endpoints across the cluster every 30 seconds.

  • Logs (Loki + Promtail) — structured and unstructured log data from every container. Promtail runs on every node as a DaemonSet, tails the container log files, and ships them to Loki. Queried through Grafana using LogQL. When a Flux reconciliation fails, the logs tell me why.

  • Traces (Jaeger) — distributed request tracing for services that emit OpenTelemetry spans. Most useful for the AI agent workloads where a single user request fans out across multiple services (API gateway, agent runtime, tool execution, LLM inference). Less relevant for infrastructure components, but available when needed.

All three data sources are configured in Grafana, which serves as the single pane of glass. I can go from a spike in a Prometheus metric to the relevant logs in Loki to the trace that shows where latency was introduced — all without leaving the browser.
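Concretely, the metric-to-logs hop for a failed Flux reconciliation might look like this; the metric and label names follow Flux’s and Promtail’s defaults, so treat them as assumptions about this setup:

# PromQL: average reconciliation duration per Flux object, from the gotk_* controller metrics
sum(rate(gotk_reconcile_duration_seconds_sum[5m])) by (kind, name)
  / sum(rate(gotk_reconcile_duration_seconds_count[5m])) by (kind, name)

# LogQL: why the reconciliation failed, from the controllers' logs
{namespace="flux-system"} |= "reconciliation failed"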

The fact that the monitoring stack is itself managed through GitOps means upgrades, configuration changes, and dashboard provisioning all go through the same PR workflow. A new Grafana dashboard is a JSON file committed to Git, not a manual creation in the UI that gets lost when the pod restarts.
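Assuming the sidecar-based provisioning that the Grafana chart supports (an assumption about how dashboards are wired here), a dashboard in Git is a labeled ConfigMap wrapping the exported JSON:

apiVersion: v1
kind: ConfigMap
metadata:
  name: flux-overview-dashboard       # hypothetical dashboard name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"            # the chart's sidecar picks up ConfigMaps with this label
data:
  # the exported dashboard JSON goes in the value below (trimmed here)
  flux-overview.json: |
    { "title": "Flux Overview", "panels": [] }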

| Aspect | Detail |
| --- | --- |
| GitOps Operator | Flux v2, flux-system namespace |
| Source Repo | spencer2211/fiducian-kube, production branch, SSH transport |
| Reconciliation | 1-minute poll interval, automatic convergence |
| Helm Releases | 18 total across infrastructure, platform, observability, and application tiers |
| Secrets | HashiCorp Vault v1.20.4 + External Secrets Operator v1.1.1 |
| Monitoring | kube-prometheus, Grafana, Loki, Promtail, Jaeger — all GitOps-managed |
| Deployment | Branch, PR, merge, Flux reconciles. Rollback via git revert. |
| Emergency | kubectl apply + immediate Git commit. Suspend Flux if needed. |

GitOps on a homelab might seem like overkill. It’s one person managing one cluster — why not just helm install and move on? The answer is the same reason I use version control for code: because my memory is unreliable, my 2 AM self makes questionable decisions, and the ability to see exactly what changed and roll it back without thinking is worth the upfront investment in the pipeline.

The cluster has been running this way since deployment, and the number of “what did I change that broke this” debugging sessions has been exactly zero. Every incident starts with git log and ends with either “ah, that commit” or “nothing changed in Git, so the problem is external.” That clarity alone justifies the approach.