Cybersecurity Agent Design
My homelab runs Home Assistant (which controls physical devices in my house), an AI agent platform, graph databases, and various tool servers — all on a 4-node ARM64 Kubernetes cluster accessible via Tailscale. That’s a real attack surface, and it needs automated security monitoring. But I didn’t want to bolt on an enterprise SIEM. I wanted something that fit the environment: lightweight, ARM64-compatible, and intelligent enough to tell me why a finding matters, not just that it exists.
The result is SecScan — a hybrid architecture where scanning runs as native infrastructure (K8s CronJobs, systemd timers) completely independent of the agent platform, while an LLM serves as the interpretation and alerting layer. The scanning never depends on the agent being up. The agent never needs to run security tools directly. Clean separation of concerns.
The Problem
The cluster has several properties that make security non-trivial:
- Physical security implications. Home Assistant controls thermostats, cameras, and door sensors. A compromised HA instance isn’t just a data breach — it’s a physical security issue.
- Internet-accessible. Every node joins a Tailscale mesh network. That’s secure in transit, but it means the cluster is reachable from anywhere on the tailnet.
- ARM64 constraints. All four nodes are Orange Pi 5 SBCs (Rockchip RK3588S, 8 cores, 16 GB RAM each). Not every security tool publishes ARM64 images, and the vendor kernel (6.1.115-vendor-rk35xx) lacks BTF support for modern eBPF.
- Resource-constrained. Running a graph database, an agent platform, home automation, and Kubernetes system components on 16 GB per node means every tool needs to justify its resource footprint.
I needed automated scanning that catches CVEs, misconfigurations, and exposed secrets — but I also needed it to be reliable independently of any single component, and contextual enough to distinguish “critical RCE in Home Assistant” from “low-severity info leak in a weekly batch job.”
Architecture Decision
I evaluated three approaches:
Agent-orchestrated scanning — OpenClaw (my agent platform) triggers scans via exec. Simple, but if the agent platform is down for maintenance, no scans run. Scanning reliability becomes coupled to agent availability.
Two specialized agents — separate K8s and desktop security agents. Clean separation, but duplicated alerting infrastructure and harder to correlate findings across the full attack surface.
Hybrid: decoupled scanning + LLM interpretation — scanning runs as native infrastructure primitives (K8s CronJobs, systemd timers). Results land in a shared PersistentVolume. The agent reads results periodically, contextualizes them with LLM intelligence, and routes alerts. Selected.
The hybrid approach won because scanning reliability is a security property. K8s CronJobs and systemd timers are battle-tested primitives — they just run. If the agent platform is restarting, upgrading, or experiencing issues, scans still execute and results accumulate. The agent adds intelligence on top; it doesn’t gate the scanning itself.
Scan Pipeline
```mermaid
flowchart TB
    subgraph scanning["Scanning Layer — Runs Without Agent Dependency"]
        direction TB
        subgraph k8s["K8s CronJobs (security-scanning namespace)"]
            trivy_img["Trivy Image Scan\nDaily 3 AM CT"]
            trivy_k8s["Trivy K8s Config\nDaily 3 AM CT"]
            trivy_secret["Trivy Secret Scan\nDaily 3 AM CT"]
            kubebench["kube-bench CIS 1.8\nWeekly Sunday"]
            lynis_host["Lynis Host Audit\nWeekly per-node"]
        end
        subgraph desktop["Desktop Cron (systemd timer)"]
            lynis_ws["Lynis Workstation\nWeekly"]
        end
        pv[("Longhorn PVC\n2Gi · 2× replication\n/scans/results/*.json")]
        k8s --> pv
        desktop -->|"scp over Tailscale"| pv
    end

    subgraph interpret["Interpretation Layer — OpenClaw Agent"]
        cron["OpenClaw Cron\nDaily 7 AM CT / Weekly Monday 9 AM CT"]
        skill["SecScan Skill\n1. Read JSON results\n2. Check freshness\n3. Compare suppressions\n4. LLM contextualizes\n5. Route alerts"]
        cron --> skill
    end

    pv -->|"read-only mount"| skill
    skill --> discord["Discord\nDaily/weekly summary"]
    skill --> signal["Signal\nCRITICAL only"]
```

Tool Selection
Every tool had to pass the ARM64 gate. I built a compatibility matrix before committing to anything.
V1 Stack
| Tool | Purpose | ARM64 | Resource Budget | Schedule |
|---|---|---|---|---|
| Trivy v0.69.1 | Image CVEs, K8s config audit, secret scanning, SBOM | Multi-arch since v0.18+ | 256Mi/100m request, 1Gi/500m limit | Daily |
| kube-bench v0.14.1 | CIS Kubernetes 1.8 Benchmark | Go binary, multi-arch | 256Mi/100m request, 1Gi/500m limit | Weekly |
| Lynis | Host-level security audit (300+ tests) | Shell script, arch-independent | Negligible | Weekly |
Trivy won the Swiss Army knife award. One binary handles image vulnerability scanning, Kubernetes misconfiguration detection, filesystem secret scanning, IaC auditing, and SBOM generation. For a resource-constrained homelab, running one tool that covers five categories beats running five tools.
kube-bench complements Trivy by checking the infrastructure (kubelet config, API server flags, etcd encryption) against CIS benchmarks, while Trivy checks the workloads running on it. The cluster runs vanilla Kubernetes v1.28.2 via kubeadm, so --benchmark cis-1.8 targets the right profile.
Lynis is a POSIX shell script that runs 300+ host-level tests — SSH hardening, firewall rules, kernel parameters, file permissions, package updates. Architecture-independent, zero dependencies, works on both the Orange Pi nodes and my Ubuntu workstation.
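To make the V1 stack concrete, the container-level invocations look roughly like the fragment below. This is a sketch, not the deployed manifests: the output paths under /scans/results/ and the exact flag spellings are assumptions and may differ by tool version, and the entries are shown in one list for brevity even though each scan runs as its own CronJob, as in the diagram above.

```yaml
# Sketch: per-tool scan commands as CronJob container entries.
# Output paths and some flags are assumptions; verify against your tool versions.
containers:
  # Trivy image CVE scan (one image shown; the real job iterates over deployed images)
  - name: trivy-image
    image: aquasec/trivy:0.69.1
    args: ["image", "--format", "json",
           "--output", "/scans/results/image-vuln.json",
           "ghcr.io/home-assistant/home-assistant:2025.1.0"]
  # Trivy Kubernetes misconfiguration audit
  - name: trivy-k8s-config
    image: aquasec/trivy:0.69.1
    args: ["k8s", "--report", "all", "--format", "json",
           "--output", "/scans/results/k8s-config.json"]
  # Trivy filesystem secret scan against a mounted target
  - name: trivy-secrets
    image: aquasec/trivy:0.69.1
    args: ["fs", "--scanners", "secret", "--format", "json",
           "--output", "/scans/results/secrets.json", "/scan-target"]
  # kube-bench CIS 1.8 run on the kubeadm cluster, JSON output for the agent to read
  - name: kube-bench
    image: aquasec/kube-bench:v0.14.1
    args: ["run", "--benchmark", "cis-1.8", "--json",
           "--outputfile", "/scans/results/kube-bench.json"]
```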
V2 Roadmap (Deferred)
| Tool | Why Deferred |
|---|---|
| Falco (runtime threat detection) | Needs kernel module or legacy eBPF — vendor kernel lacks BTF for CO-RE. Resource planning and rule tuning required. |
| Kyverno (admission control) | V1 detects; V2 prevents. Shift-left after trust is established. |
| Hubble (network flow monitoring) | Needs Hubble enabled in Cilium, storage for flow data. |
Falco is the highest-value V2 addition — it detects active threats (unexpected shell spawns, file access, network connections) rather than static vulnerabilities. But on the vendor Rockchip kernel, the modern eBPF driver won’t work without BTF (/sys/kernel/btf/vmlinux is absent). Falco would need the kernel module or legacy eBPF driver, which requires kernel headers that may not ship with the vendor BSP. Worth solving, but not a V1 blocker.
The LLM as Interpretation Layer
This is the core design insight: the LLM eliminates the entire glue code layer.
Traditional security tooling requires JSON parsers, alert templates, severity mapping logic, and report generators. With an LLM reading the raw scan output, none of that exists. The agent understands Trivy’s JSON and writes contextual alerts:
“CVE-2026-1234 in your Home Assistant image allows unauthenticated remote code execution via WebSocket. This is especially dangerous because HA controls physical devices on your network and has LAN access to IoT devices.”
vs. the traditional approach:
“CRITICAL: CVE-2026-1234 — ghcr.io/home-assistant/home-assistant:2025.1.0”
The agent also handles deduplication (don’t alert on the same CVE every day), staleness detection (if results are >36 hours old, the CronJob may have failed), and suppression management (known-acceptable findings with documented rationale).
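The suppression list itself can be a plain file on the same PVC. A minimal sketch of the structure the skill compares findings against; the field names and example entries here are illustrative, not a Trivy or OpenClaw schema:

```yaml
# suppressions.yaml: known-acceptable findings with documented rationale.
# Field names are illustrative, not a standard Trivy or OpenClaw format.
suppressions:
  - id: CVE-2025-00000                 # hypothetical example finding
    target: ghcr.io/example/batch-job:1.4.2
    severity: LOW
    reason: "Info leak only reachable from the pod network; weekly batch job, no ingress."
    approved: 2025-11-02
    expires: 2026-05-01                # finding re-surfaces after this date
  - id: KSV014                         # Trivy misconfig check ID, shown as an example
    target: deployment/legacy-tool
    reason: "Vendor image requires a writable root filesystem; tracked upstream."
    expires: 2026-03-01
```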
Alert Routing
| Severity | Channel | Timing |
|---|---|---|
| CRITICAL | Signal + Discord | Immediate |
| HIGH | Signal + Discord | Business hours or next morning |
| MEDIUM | Discord | Daily summary |
| LOW | Discord | Weekly summary |
Signal for critical findings means I get a push notification on my phone. Discord for everything else means I can review summaries when convenient. The agent never auto-remediates in V1 — it suggests fixes and waits for my decision. Trust first, automation later.
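Keeping the routing policy declarative means the LLM decides severity and wording while where an alert goes stays fixed. A sketch of such a policy file, mirroring the table above (the format is my own, not an OpenClaw schema):

```yaml
# alert-routing.yaml: severity-to-channel policy (illustrative format).
routes:
  CRITICAL: { channels: [signal, discord], timing: immediate }
  HIGH:     { channels: [signal, discord], timing: business-hours-or-next-morning }
  MEDIUM:   { channels: [discord], timing: daily-summary }
  LOW:      { channels: [discord], timing: weekly-summary }
behavior:
  dedupe_window_hours: 24        # don't re-alert the same CVE every day
  staleness_threshold_hours: 36  # matches the freshness check in the skill
  auto_remediate: false          # V1: suggest fixes, wait for a human decision
```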
Infrastructure Details
Scanning Namespace
All scan infrastructure lives in a dedicated security-scanning namespace with its own ServiceAccount (secscan-reader) that has read-only ClusterRole access — get, list, watch only. The CronJob pods write results to a 2Gi Longhorn-backed PVC with 2x replication.
The OpenClaw agent pod mounts the same PVC read-only. It can read scan results but cannot modify them, cannot execute security tools, and has no kubectl access. The agent’s privilege boundary is intentionally minimal: read files, send messages.
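In manifest terms, that boundary is a plain read-only ClusterRole plus readOnly mounts. A sketch using the names above (secscan-reader, secscan-results); the resource and apiGroup lists are abbreviated, not the exact deployed rules:

```yaml
# Read-only ClusterRole bound to the secscan-reader ServiceAccount
# (resource and apiGroup lists abbreviated).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secscan-reader
rules:
  - apiGroups: ["", "apps", "batch", "networking.k8s.io"]
    resources: ["pods", "deployments", "daemonsets", "statefulsets",
                "jobs", "cronjobs", "services", "networkpolicies"]
    verbs: ["get", "list", "watch"]    # no create, update, patch, or delete
---
# Fragment of the OpenClaw agent pod spec: same PVC, mounted read-only.
volumes:
  - name: scan-results
    persistentVolumeClaim:
      claimName: secscan-results
      readOnly: true
containers:
  - name: openclaw
    volumeMounts:
      - name: scan-results
        mountPath: /scans/results
        readOnly: true
```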
CronJob Design
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: secscan-image-vuln
  namespace: security-scanning
spec:
  schedule: "0 9 * * *"  # 9 AM UTC = 3 AM CT
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: trivy
              image: aquasec/trivy:0.69.1
              resources:
                requests: { memory: "256Mi", cpu: "100m" }
                limits: { memory: "1Gi", cpu: "500m" }
              volumeMounts:
                - name: scan-results
                  mountPath: /scans/results
          volumes:
            - name: scan-results
              persistentVolumeClaim:
                claimName: secscan-results
          restartPolicy: OnFailure
          serviceAccountName: secscan-reader
```

Scans run at 3 AM Central, when the cluster is idle. The interpretation pass runs at 7 AM — four hours later, ensuring fresh results. kube-bench CronJobs need hostPID: true for CIS benchmark checks and mount /etc/kubernetes read-only. Lynis host audits run as privileged per-node Jobs with node affinity (kube02 and kube04 have the most headroom).
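The kube-bench variant follows the same shape but adds the host access those CIS checks need. A sketch of the differences: the Sunday schedule time and the output flag are assumptions mirroring the daily job, and real kube-bench jobs typically need additional hostPath mounts beyond /etc/kubernetes.

```yaml
# kube-bench needs the host PID namespace and read-only host config mounts.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: secscan-kube-bench
  namespace: security-scanning
spec:
  schedule: "0 9 * * 0"                # weekly on Sunday; 9 AM UTC = 3 AM CT
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true                # lets kube-bench inspect kubelet/apiserver processes
          serviceAccountName: secscan-reader
          containers:
            - name: kube-bench
              image: aquasec/kube-bench:v0.14.1
              args: ["run", "--benchmark", "cis-1.8", "--json",
                     "--outputfile", "/scans/results/kube-bench.json"]
              resources:
                requests: { memory: "256Mi", cpu: "100m" }
                limits: { memory: "1Gi", cpu: "500m" }
              volumeMounts:
                - { name: etc-kubernetes, mountPath: /etc/kubernetes, readOnly: true }
                - { name: scan-results, mountPath: /scans/results }
          volumes:
            - name: etc-kubernetes
              hostPath: { path: /etc/kubernetes }
            - name: scan-results
              persistentVolumeClaim: { claimName: secscan-results }
          restartPolicy: OnFailure
```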
Security of the Security Scanner
The irony isn’t lost on me. The security scanner itself is an attack surface:
| Threat | Mitigation |
|---|---|
| Compromised scan tool (supply chain) | Pin tool versions, verify checksums. No auto-updates without review. |
| Agent with excessive permissions | Read-only filesystem access. No kubectl, no SSH, no write to K8s resources. |
| Scan results leak sensitive info | Results stay on-cluster PVC (not committed to git). Alerts summarize, don’t dump raw output. |
| Resource exhaustion from scanning | Explicit resource limits on all CronJobs. Sequential image scanning (not parallel). |
| Stale vulnerability database | Trivy auto-updates its DB. Agent alerts if DB is >7 days old. |
The OpenClaw pod itself is hardened: non-root user, read-only root filesystem, all Linux capabilities dropped. It has no SSH access to nodes, no write access to cluster resources, and no ability to modify its own suppression list without my explicit approval.
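In pod-spec terms, that hardening is the usual securityContext block. A sketch of the relevant fragment; the values follow the description above, and the UID is an arbitrary unprivileged number, not taken from the real deployment:

```yaml
# Fragment of the OpenClaw pod spec: non-root, read-only rootfs, no capabilities.
securityContext:
  runAsNonRoot: true
  runAsUser: 1000                      # any unprivileged UID; 1000 is an assumption
containers:
  - name: openclaw
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```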
What This Demonstrates
Separation of concerns in security architecture. Scanning reliability doesn’t depend on agent availability. Agent intelligence doesn’t require running privileged tools. The PVC is the only coupling point — a simple, auditable interface.
Practical ARM64 constraints. Not theoretical compatibility charts, but real decisions driven by vendor kernel limitations (no BTF → no modern eBPF → Falco deferred), image availability (Trivy publishes multi-arch; some tools don’t), and resource budgets (256Mi request per scan Job on 16GB nodes).
LLM as glue code replacement. The agent reads raw JSON output and produces contextual, actionable alerts. No parsers, no templates, no severity mapping logic. The tradeoff is token cost (~$0.50-2.00/day for a homelab scan pass), which is acceptable for this scale.
Progressive trust model. V1 detects and reports. V2 prevents (admission control). Eventually, V3 could auto-remediate safe operations (update image tags with known CVE fixes). Each phase builds on demonstrated reliability of the previous one.
The cybersecurity agent isn’t the most complex system I’ve built, but it’s one of the most thoughtful in terms of failure mode analysis. Every design decision traces back to a specific constraint or threat, and the architecture degrades gracefully — if the agent is down, scans still run; if a CronJob fails, the agent detects staleness; if Signal is unreachable, Discord still gets the alert. Defense in depth, applied to the defender itself.