K8s Deep Dive: GitOps

This is a companion to the Home Kubernetes Cluster overview and a sibling to the Networking Deep Dive and the Storage Deep Dive. Those pages cover how traffic flows and how data persists. This one covers how changes get to the cluster in the first place — and why “I’ll just kubectl apply it” is a trap.

Every piece of infrastructure described in the sibling pages — Cilium, Longhorn, MetalLB, ingress-nginx, cert-manager — is deployed and managed through a GitOps pipeline. No manual Helm installs, no imperative kubectl commands for steady-state configuration. Git is the single source of truth for what runs on this cluster.

Three properties make GitOps worth the setup cost, especially for infrastructure managed by one person:

Auditability. git log is the audit trail. Every change to the cluster — a new Helm chart, a bumped version, a modified value — is a commit with a timestamp, author, and diff. When something breaks at 11 PM, I don’t have to remember what I changed. The commit history tells me exactly what changed, when, and why (assuming I wrote decent commit messages, which is its own discipline).

Reproducibility. The entire cluster can be reconstructed from a single Git repository. If all four nodes caught fire tomorrow, I could provision new hardware, bootstrap Kubernetes, point Flux at the repo, and walk away. Every namespace, every Helm release, every Kustomization would converge to the declared state. The recovery time is limited by hardware provisioning, not by remembering what was deployed where.

Safety. Pull requests are the review gate. On a solo project, PRs might seem like overhead — who am I reviewing for? Myself, it turns out. The diff in a PR forces me to read the change in context before it hits the cluster. And when a change goes wrong, git revert followed by a push is the rollback mechanism. No fumbling with helm rollback flags or trying to remember the previous values. Revert the commit, Flux reconciles, done.

The GitOps pipeline has a clear flow: a GitHub repository holds the desired state, Flux watches that repo, and Kustomizations and HelmRelease custom resources translate the desired state into actual Kubernetes objects.

graph LR
repo["GitHub Repo\nspencer2211/fiducian-kube"] -->|"SSH poll\n(1 min)"| flux["Flux\n(flux-system)"]
flux -->|"Reconcile"| kust["Kustomizations"]
kust -->|"Render manifests"| hr["HelmRelease CRs"]
hr -->|"Helm install/upgrade"| workloads["Deployed Workloads\n(18 Helm releases)"]
style repo fill:#f5f5f5,stroke:#333
style flux fill:#e8f4fd,stroke:#2196F3
style kust fill:#e8f4fd,stroke:#2196F3
style hr fill:#fff3e0,stroke:#FF9800
style workloads fill:#e8f5e9,stroke:#4CAF50
| Component | Role |
| --- | --- |
| GitHub (spencer2211/fiducian-kube) | Source of truth. All cluster state lives here. |
| Flux v2 | GitOps operator. Watches the repo and reconciles cluster state to match. |
| Kustomizations | Flux CRs that define which paths in the repo to apply and in what order. |
| HelmRelease CRs | Declarative Helm installs. Flux manages the Helm lifecycle — install, upgrade, rollback. |
| Vault + ESO | Secrets stay in Vault. External Secrets Operator syncs them into Kubernetes Secrets. |

Flux runs in the flux-system namespace. It was bootstrapped with flux bootstrap github, which sets up the initial GitRepository source, the self-managing Kustomization, and the deploy key on the GitHub repo.
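The bootstrap is a one-time command along these lines (owner, repo, branch, and path match the source configuration and the Kustomization table below; the --personal flag is an assumption about the account type):

flux bootstrap github \
  --owner=spencer2211 \
  --repository=fiducian-kube \
  --branch=production \
  --path=clusters/production \
  --personal   # assumption: personal account rather than an organization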

The source configuration tells Flux where to find the desired state:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: production
  secretRef:
    name: flux-system
  url: ssh://[email protected]/spencer2211/fiducian-kube

A few things worth noting:

  • SSH transport. The repo is accessed over SSH with a deploy key, not HTTPS with a token. Deploy keys are scoped to a single repository and can be read-only, which is the minimum privilege Flux needs. If the key is compromised, the blast radius is one repo, not my entire GitHub account.

  • production branch. Not main. The production branch is the branch that Flux watches. Changes go through main first via PR, then get merged or promoted to production. This gives me a gate between “merged code” and “deployed code” when I want it, though in practice I often merge directly to production for infrastructure changes.

  • 1-minute poll interval. Flux checks the repo every 60 seconds. For a homelab, this is a good balance — changes deploy within a minute of merging, without hammering GitHub’s API. Flux also supports webhook receivers for instant reconciliation, but polling is simpler and sufficient here.

Flux uses three Kustomizations to manage different parts of the cluster:

| Kustomization | Path | Purpose |
| --- | --- | --- |
| flux-system | clusters/production | Base Flux components and cluster-wide resources. Self-referential — Flux manages its own deployment. |
| llm-keda | apps/llm-keda | KEDA autoscaling configuration for the LLM inference stack. ScaledObjects and triggers for scaling Ollama replicas based on queue depth. |
| llm-ollama | apps/llm-ollama | Ollama LLM inference deployment. Model serving, resource limits, GPU scheduling. |

The flux-system Kustomization is the root — it’s what Flux bootstraps first, and it references everything else. The LLM-specific Kustomizations are separated because they have different reconciliation needs. The Ollama deployment changes frequently as I experiment with models and resource allocations, while the base Flux configuration is essentially static.
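For reference, a Flux Kustomization CR is a small manifest in its own right. A sketch of what the llm-ollama one might look like, with the interval and prune settings as assumptions:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-ollama
  namespace: flux-system
spec:
  interval: 5m          # assumption: reconcile more often than the static base config
  path: ./apps/llm-ollama
  prune: true           # assumption: delete objects that are removed from Git
  sourceRef:
    kind: GitRepository
    name: flux-system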

Every workload on the cluster is deployed as a HelmRelease custom resource. Flux watches these CRs and manages the Helm lifecycle — installing charts, upgrading when the spec changes, and rolling back on failure.

There are currently 18 Helm releases across the cluster:

| Release | Namespace | Chart Version | App Version |
| --- | --- | --- | --- |
| cilium | cilium-system | 1.16.18 | 1.16.18 |
| longhorn | longhorn-system | 1.10.1 | v1.10.1 |
| cert-manager | cert-manager | v1.13.3 | v1.13.3 |
| ingress-nginx | ingress-nginx | 4.14.1 | 1.14.1 |
| metallb | metallb-system | 0.13.12 | v0.13.12 |
| vault | vault | 0.31.0 | 1.20.4 |
| home-assistant | home-assistant | 0.3.43 | 2026.2.1 |
| external-secrets | external-secrets-system | 1.1.1 | v1.1.1 |
| kube-prometheus | monitoring | 80.9.2 | v0.87.1 |
| grafana | monitoring | 10.4.0 | 12.3.0 |
| loki | monitoring | 6.46.0 | 3.5.7 |
| promtail | monitoring | 6.17.1 | 3.5.1 |
| jaeger | monitoring | 3.4.1 | 1.53.0 |
| keda | keda | 2.18.3 | 2.18.3 |

That’s 14 listed explicitly — the remaining 4 are application-specific releases in namespaces like openclaw, nats, graphiti, and llm-gateway that are covered in their respective project pages.

The releases fall into natural tiers:

  • Infrastructure (Cilium, MetalLB, cert-manager, ingress-nginx, Longhorn) — the foundation everything else depends on. These change rarely and require careful testing when they do.
  • Platform Services (Vault, External Secrets, KEDA) — shared capabilities consumed by applications. Medium change frequency.
  • Observability (kube-prometheus, Grafana, Loki, Promtail, Jaeger) — the monitoring stack. Changes usually mean dashboard updates or retention tuning, not architectural shifts.
  • Applications (Home Assistant, plus the workloads in other namespaces) — the things the cluster exists to run. Highest change frequency.

Here’s what a typical HelmRelease looks like, using Home Assistant as an example:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: home-assistant
  namespace: home-assistant
spec:
  interval: 5m
  chart:
    spec:
      chart: home-assistant
      version: "0.3.43"
      sourceRef:
        kind: HelmRepository
        name: home-assistant
        namespace: flux-system
  values:
    image:
      repository: ghcr.io/home-assistant/home-assistant
      tag: "2026.2.1"
    hostNetwork: true
    persistence:
      enabled: true
      storageClass: longhorn
      size: 10Gi
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        memory: 1Gi

The structure is straightforward:

  • chart.spec pins the Helm chart version. This is the version of the chart packaging, not the application itself.
  • values is the equivalent of a values.yaml file — it configures the chart. The image tag, persistence settings, resource limits all go here.
  • interval: 5m means Flux checks this release every 5 minutes and reconciles if the actual state has drifted from the desired state.
  • sourceRef points to a HelmRepository resource that Flux also manages — it’s the chart registry URL.
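That HelmRepository is itself a manifest in the repo. A sketch, with the chart repository URL and interval as placeholder assumptions:

apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: home-assistant
  namespace: flux-system
spec:
  interval: 1h                                                 # assumption: the chart index rarely changes
  url: https://example.github.io/home-assistant-helm-chart     # placeholder URL for the chart registry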

When I want to upgrade Home Assistant, I change tag: "2026.2.1" to the new version, commit, push, and Flux handles the rolling update. No helm upgrade commands, no remembering which values I passed last time.
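After the push, it is worth watching the release converge. A sketch, where the Deployment name is an assumption about what the chart creates:

flux get helmreleases -n home-assistant --watch                    # wait for the new revision to show as Ready
kubectl -n home-assistant rollout status deploy/home-assistant     # assumption: the Deployment is named after the release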

The standard deployment workflow is branch-based:

sequenceDiagram
participant Dev as Developer
participant Git as GitHub (fiducian-kube)
participant Flux as Flux (flux-system)
participant K8s as Kubernetes Cluster
Dev->>Git: Create branch, modify HelmRelease
Dev->>Git: Open PR, review diff
Dev->>Git: Merge to production
Git-->>Flux: Poll detects new commit (≤1 min)
Flux->>K8s: Reconcile — helm upgrade
K8s-->>Flux: Release status: deployed
Flux-->>Git: Update status (GitOps Toolkit)

In practice for a solo operator, the PR step is often a quick self-review rather than a formal process. But the discipline of reading the diff before merging has caught real errors — typos in resource limits, wrong chart versions, accidentally removed values. The PR isn’t bureaucracy; it’s a forcing function for a second look.

Rollback is the inverse of deployment:

  1. git revert <commit> — creates a new commit that undoes the change.
  2. git push — Flux detects the new commit within a minute.
  3. Flux reconciles — the cluster converges back to the previous state.

This is fundamentally safer than helm rollback because it operates on the source of truth. A helm rollback changes the cluster state without changing the Git repo, which means Git and the cluster are now out of sync. The next Flux reconciliation would re-apply the change you just rolled back. Git revert avoids this entirely — the revert is in Git, so Flux and the cluster agree on what the desired state should be.
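In shell terms, assuming the bad change landed as a single commit:

git revert <commit>                                        # new commit that undoes the change; history is not rewritten
git push origin production
flux reconcile kustomization flux-system --with-source    # optional: skip the poll and converge immediately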

Sometimes you need to fix something immediately. A pod is crash-looping, a misconfiguration is dropping traffic, something is on fire. The workflow:

  1. kubectl apply or kubectl edit — fix the immediate problem.
  2. Immediately commit the same change to Git — this is the critical step.

If you skip step 2, you have undocumented drift: the cluster state no longer matches Git. For objects Flux applies directly, the next reconciliation (within a minute) reverts your emergency fix back to whatever Git says. For objects owned by a Helm release, the fix may survive until the next spec change triggers an upgrade that re-renders the manifests. Either way, undocumented drift is a ticking time bomb, and the fix you applied at 2 AM eventually disappears silently.

The rule: if you kubectl apply something, you have 60 seconds to get the equivalent change into Git before Flux overwrites it. In practice, I usually suspend Flux reconciliation first (flux suspend kustomization <name>), make the fix, commit to Git, then resume reconciliation. Belt and suspenders.
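The belt-and-suspenders version, as a sketch (the Kustomization name is whichever one owns the broken workload; the edit target is a hypothetical example):

flux suspend kustomization llm-ollama            # stop reconciliation so the hotfix survives
kubectl -n llm-ollama edit deployment ollama     # assumption: hypothetical fix target
# ...commit the equivalent change to Git and push...
flux resume kustomization llm-ollama             # reconciliation resumes against the now-matching state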

The one thing that can’t live in Git is secrets. API keys, database passwords, TLS private keys — committing these to a repository, even a private one, is a security failure mode. If the repo is ever exposed, every secret is compromised.

The cluster solves this with HashiCorp Vault and the External Secrets Operator (ESO):

graph LR
vault["HashiCorp Vault\nv1.20.4"] -->|"API"| eso["External Secrets\nOperator v1.1.1"]
eso -->|"Sync"| secret["Kubernetes Secret"]
secret -->|"Volume mount\nor env var"| pod["HelmRelease / Pod"]
style vault fill:#fce4ec,stroke:#E91E63
style eso fill:#fff3e0,stroke:#FF9800
style secret fill:#e8f4fd,stroke:#2196F3
style pod fill:#e8f5e9,stroke:#4CAF50

The flow:

  1. Vault stores the actual secret values. It’s deployed on the cluster itself via Helm (version 1.20.4), with its storage backend on a Longhorn volume. Vault is unsealed using a set of key shares — the unseal process is the one manual step after a full cluster restart (sketched just after this list).

  2. External Secrets Operator runs in the external-secrets-system namespace. It watches ExternalSecret custom resources, which declare “I need a Kubernetes Secret with key X, and the value comes from Vault path Y.”

  3. The ESO controller authenticates to Vault (using Kubernetes service account auth), reads the secret value, and creates or updates a standard Kubernetes Secret.

  4. Workloads consume the Kubernetes Secret normally — as environment variables or volume mounts. They don’t know or care that the value came from Vault.
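The unseal step mentioned above is the only part of recovery that can’t be driven from Git. A sketch, assuming the Vault Helm chart’s default StatefulSet pod name and the default key-share threshold:

# after a full cluster restart, Vault comes back sealed
kubectl exec -ti -n vault vault-0 -- vault operator unseal   # prompts for one key share
# repeat until the unseal threshold is reached (three of five shares by default)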

The key benefit: the Git repository contains ExternalSecret resources that reference Vault paths, but never the actual secret values. The repo can be shared, reviewed, and even made public without exposing credentials. The secret values live exclusively in Vault, which has its own access control, audit logging, and rotation capabilities.

An ExternalSecret in the repo:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
  namespace: my-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: app-credentials
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: secret/data/my-app
        property: database_url

This is safe to commit. It says “create a Kubernetes Secret called app-credentials with a key DATABASE_URL whose value comes from secret/data/my-app in Vault.” The actual database URL never appears in Git.
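The vault-backend ClusterSecretStore it points at is also just a manifest in Git. A sketch, assuming Kubernetes service-account auth against a KV v2 mount (the server address, role, mount path, and service account name are assumptions):

apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: http://vault.vault.svc:8200   # assumption: in-cluster Vault service address
      path: secret                          # KV mount; pairs with the secret/data/... keys above
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes             # assumption: default Kubernetes auth mount
          role: external-secrets            # assumption: Vault role granted to ESO
          serviceAccountRef:
            name: external-secrets          # assumption: ESO service account name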

The entire monitoring stack is itself GitOps-managed — deployed and configured through HelmRelease resources, same as everything else. Monitoring the monitors through the same pipeline they monitor. It’s turtles all the way down, but it works.

| Component | Namespace | Chart Version | App Version | Purpose |
| --- | --- | --- | --- | --- |
| kube-prometheus | monitoring | 80.9.2 | v0.87.1 | Prometheus + Alertmanager + default recording/alerting rules. The metrics backbone. |
| grafana | monitoring | 10.4.0 | 12.3.0 | Dashboards and visualization. Connects to Prometheus, Loki, and Jaeger as data sources. |
| loki | monitoring | 6.46.0 | 3.5.7 | Log aggregation. Receives logs from Promtail, stores on Longhorn (50 Gi volume). |
| promtail | monitoring | 6.17.1 | 3.5.1 | Log shipper. DaemonSet that tails container logs on every node and forwards to Loki. |
| jaeger | monitoring | 3.4.1 | 1.53.0 | Distributed tracing. Captures trace spans from instrumented services. |

This stack provides three pillars of observability:

  • Metrics (Prometheus via kube-prometheus-stack) — numeric time-series data. CPU usage, memory consumption, request rates, error rates, Helm release status, Flux reconciliation duration. Prometheus scrapes metrics endpoints across the cluster every 30 seconds.

  • Logs (Loki + Promtail) — structured and unstructured log data from every container. Promtail runs on every node as a DaemonSet, tails the container log files, and ships them to Loki. Queried through Grafana using LogQL. When a Flux reconciliation fails, the logs tell me why.

  • Traces (Jaeger) — distributed request tracing for services that emit OpenTelemetry spans. Most useful for the AI agent workloads where a single user request fans out across multiple services (API gateway, agent runtime, tool execution, LLM inference). Less relevant for infrastructure components, but available when needed.

All three data sources are configured in Grafana, which serves as the single pane of glass. I can go from a spike in a Prometheus metric to the relevant logs in Loki to the trace that shows where latency was introduced — all without leaving the browser.
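Concretely, the metric-to-logs hop for a failed Flux reconciliation might look like this; the metric and label names follow Flux’s and Promtail’s defaults, so treat them as assumptions about this setup:

# PromQL: average reconciliation duration per Flux object, from the gotk_* controller metrics
sum(rate(gotk_reconcile_duration_seconds_sum[5m])) by (kind, name)
  / sum(rate(gotk_reconcile_duration_seconds_count[5m])) by (kind, name)

# LogQL: why the reconciliation failed, from the controllers' logs
{namespace="flux-system"} |= "reconciliation failed"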

The fact that the monitoring stack is itself managed through GitOps means upgrades, configuration changes, and dashboard provisioning all go through the same PR workflow. A new Grafana dashboard is a JSON file committed to Git, not a manual creation in the UI that gets lost when the pod restarts.
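Assuming the sidecar-based provisioning that the Grafana chart supports (an assumption about how dashboards are wired here), a dashboard in Git is a labeled ConfigMap wrapping the exported JSON:

apiVersion: v1
kind: ConfigMap
metadata:
  name: flux-overview-dashboard       # hypothetical dashboard name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"            # the chart's sidecar picks up ConfigMaps with this label
data:
  # the exported dashboard JSON goes in the value below (trimmed here)
  flux-overview.json: |
    { "title": "Flux Overview", "panels": [] }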

| Aspect | Detail |
| --- | --- |
| GitOps Operator | Flux v2, flux-system namespace |
| Source Repo | spencer2211/fiducian-kube, production branch, SSH transport |
| Reconciliation | 1-minute poll interval, automatic convergence |
| Helm Releases | 18 total across infrastructure, platform, observability, and application tiers |
| Secrets | HashiCorp Vault v1.20.4 + External Secrets Operator v1.1.1 |
| Monitoring | kube-prometheus, Grafana, Loki, Promtail, Jaeger — all GitOps-managed |
| Deployment | Branch, PR, merge, Flux reconciles. Rollback via git revert. |
| Emergency | kubectl apply + immediate Git commit. Suspend Flux if needed. |

GitOps on a homelab might seem like overkill. It’s one person managing one cluster — why not just helm install and move on? The answer is the same reason I use version control for code: because my memory is unreliable, my 2 AM self makes questionable decisions, and the ability to see exactly what changed and roll it back without thinking is worth the upfront investment in the pipeline.

The cluster has been running this way since deployment, and the number of “what did I change that broke this” debugging sessions has been exactly zero. Every incident starts with git log and ends with either “ah, that commit” or “nothing changed in Git, so the problem is external.” That clarity alone justifies the approach.