Status: Completed · Tags: Kubernetes, Ansible, GitOps, Raspberry Pi, Networking

Raspberry Pi K3s Cluster

Four-node ARM Kubernetes cluster for homelab workloads


01. Overview

Reading the Kubernetes documentation is one thing. Actually running a cluster — dealing with ingress, storage classes, node failures, and rolling updates on hardware that cost you real money — is something else entirely. This project was built specifically to develop hands-on Kubernetes experience in a controlled environment where breaking something costs nothing beyond a few minutes of troubleshooting.

The cluster runs four Raspberry Pi 4 (8 GB) nodes, uses K3s as the lightweight Kubernetes distribution, and is managed entirely through GitOps using Flux CD. If it isn't in Git, it doesn't exist in the cluster.

02. Hardware

  • Nodes: 4 × Raspberry Pi 4 Model B (8 GB RAM)
  • Storage: 4 × 256 GB Samsung Endurance microSD (OS) + one USB SSD per node (Longhorn)
  • Network: Gigabit Ethernet via an 8-port TP-Link managed switch (VLAN-isolated)
  • Power: 4-port Anker USB-C GaN hub with per-port power monitoring
  • Enclosure: Custom-printed 4U rack (see gallery)
  • Control plane: 1 node (pi-01); the other three are workers

03. How It Was Built

Provisioning with Ansible

All four nodes are imaged from the same Raspberry Pi OS Lite (64-bit) base and provisioned using an Ansible playbook that handles system updates, sets hostnames, configures static IPs, enables cgroups v2 and memory accounting in the kernel command line (required for K3s), and installs K3s in server (control-plane) or agent (worker) mode depending on the host group. Reprovisioning a wiped node takes about 8 minutes.
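The flow above can be sketched as a couple of Ansible tasks. This is a minimal illustration, not the actual playbook: the group names, token variable, and file paths are my assumptions.

```yaml
# Sketch only — 'server'/'agent' groups and k3s_token are assumed names.
- name: Ensure cgroup memory accounting is on the kernel command line (K3s needs it)
  ansible.builtin.replace:
    path: /boot/cmdline.txt
    regexp: '(rootwait)(?!.* cgroup_enable=memory)'
    replace: '\1 cgroup_memory=1 cgroup_enable=memory'

- name: Install K3s in server mode on the control-plane node
  ansible.builtin.shell: curl -sfL https://get.k3s.io | sh -s - server
  args:
    creates: /usr/local/bin/k3s
  when: inventory_hostname in groups['server']

- name: Join the remaining nodes as agents
  ansible.builtin.shell: >
    curl -sfL https://get.k3s.io |
    K3S_URL=https://pi-01:6443 K3S_TOKEN={{ k3s_token }} sh -s - agent
  args:
    creates: /usr/local/bin/k3s
  when: inventory_hostname in groups['agent']
```

The `creates:` guard makes the install tasks idempotent, which is what keeps a full reprovision down in the minutes range.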

K3s configuration

K3s ships with Traefik as the default ingress controller, which I kept. The default local-path storage class is replaced with Longhorn for distributed, replicated persistent volumes — each PVC is replicated across 2 of the 4 nodes, so a single node failure doesn't take down stateful workloads. Flannel (the default CNI) handles pod networking with VXLAN for cross-node traffic.
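The storage-class swap described above might look roughly like the following. The provisioner name and parameters follow Longhorn's documented defaults; treating this StorageClass as the cluster default is my assumption.

```yaml
# Sketch of a Longhorn StorageClass replacing K3s's local-path default.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"      # each volume is replicated across 2 of the 4 nodes
  staleReplicaTimeout: "30"  # minutes before a failed replica is considered gone
```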

GitOps with Flux CD

Flux monitors a private GitHub repository for changes to Kubernetes manifests. Any commit to the main branch that touches a manifest is automatically reconciled into the cluster within 60 seconds. Helm releases are managed through Flux's HelmRelease CRDs, which handle upgrades, rollbacks, and drift detection. Secrets are encrypted in Git using Mozilla SOPS with an age key stored offline.
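A HelmRelease managed this way has roughly the following shape. The chart, namespace, and values here are illustrative, not taken from the actual repository:

```yaml
# Sketch of a Flux HelmRelease — names and values are assumptions.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: grafana
  namespace: monitoring
spec:
  interval: 10m            # how often Flux checks for drift and reconciles
  chart:
    spec:
      chart: grafana
      version: "7.x"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    persistence:
      enabled: true
      storageClassName: longhorn
```

Because the release is a CRD in Git, a bad upgrade is reverted by reverting the commit; Flux reconciles the cluster back to the previous chart version.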

Workloads running on the cluster

The cluster currently runs: a local DNS override service (not Pi-hole — that stays on its own Pi), a Prometheus + Grafana observability stack, a private Gitea instance for self-hosted Git, a Miniflux RSS reader, and a Bitwarden-compatible Vaultwarden password manager. All are exposed through Traefik with TLS terminated via Let's Encrypt (DNS-01 challenge through Cloudflare API).
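The write-up doesn't name the ACME client handling the DNS-01 challenge; if it's cert-manager (a common choice alongside Traefik), the issuer would look roughly like this. Every name below is a placeholder:

```yaml
# Hypothetical cert-manager ClusterIssuer for DNS-01 via the Cloudflare API.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com              # placeholder address
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token  # Secret holding a scoped CF token
              key: api-token
```

DNS-01 is the natural fit here: it issues valid certificates without exposing any port of the homelab to inbound internet traffic.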

04. Lessons Learned

  • ARM-based clusters (aarch64) occasionally surface issues with projects that only publish amd64 images — always check for a multi-arch manifest before committing to a piece of software.
  • Longhorn's replication adds meaningful overhead on gigabit Ethernet with Raspberry Pi CPUs. Some latency-sensitive apps are better off with local-path storage and an explicit backup strategy.
  • GitOps discipline pays off. Being able to nuke the entire cluster and re-converge to the desired state in under 30 minutes is genuinely useful when you experiment as aggressively as I do.
  • MicroSD cards fail under Kubernetes write load. Using USB SSDs for any stateful workload storage is non-negotiable.
  • Understanding the control loop model — how Kubernetes continuously reconciles actual state to desired state — was the biggest mental shift coming from a "run the command, check it happened" background.

05. What's Next

The immediate next step is migrating the app workloads currently running inside TrueNAS's built-in K3s to this cluster, so the NAS can be updated and rebooted independently. After that, I want to add a fifth node and experiment with multi-control-plane HA — currently a single control-plane failure would bring scheduling down until the node recovers.