I build Kubernetes clusters to survive real faults, not to win configuration contests. This guide shows what I do during Kubernetes setup to raise cluster resilience, and which common pitfalls I fall into if I rush. I keep the advice concrete. Expect commands, numbers, and specific configuration tips you can apply to your cluster. Read it, adapt it, test it.
Start with the control plane and data store. Run an odd number of control-plane nodes, usually three. That gives etcd a majority quorum while keeping costs reasonable. Put an external load balancer in front of the API servers so the kube-apiserver endpoints remain reachable when a control-plane node restarts. Make regular, automated etcd backups and test restores on a spare machine, not just once. I store backups off-node and keep at least three days of history. Keep CoreDNS as a Deployment with two or three replicas. Set resource requests and limits for control-plane components; do not let them compete with user pods. Use PodDisruptionBudgets for critical kube-system pods so rolling upgrades do not drop the whole control plane. If you use kubeadm, follow its HA patterns: stacked etcd is easiest for small clusters, external etcd for larger or multi-site setups. For etcd, enable TLS with client authentication, and rotate certificates before they expire.
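To make the PodDisruptionBudget point concrete, here is a minimal sketch for CoreDNS. It assumes the default k8s-app: kube-dns label that kubeadm applies and that your distribution does not already ship a PDB for it; verify both before applying.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb
  namespace: kube-system
spec:
  minAvailable: 1          # keep at least one DNS replica serving during drains and upgrades
  selector:
    matchLabels:
      k8s-app: kube-dns    # default label on the CoreDNS Deployment; confirm it on your cluster

With two replicas, minAvailable: 1 lets nodes drain one at a time without losing cluster DNS; if you run three replicas, consider raising it to 2.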
Make worker nodes predictable. Label and taint nodes by role: compute, storage, GPU, whatever fits your workloads. Use node selectors or node affinity in pod specs instead of relying on ad hoc placement. Set resource requests and limits on every deployment, and use LimitRanges and ResourceQuotas per namespace to stop a noisy app from starving others. Use liveness and readiness probes on every container; readiness gates rollouts and traffic, liveness restarts stuck processes. Example probes: an HTTP readiness check on /healthz returning 200, and a TCP liveness check with a short timeout. Run at least two replicas of any service you care about; one replica is a single point of failure. For stateful services, prefer StatefulSets backed by a StorageClass with an appropriate reclaimPolicy and snapshot support. For persistent data, use volume snapshots or external storage that supports multi-AZ replication if you run across zones. Avoid relying on local disks for critical data unless you have a backup plan.
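Here is what that looks like in a single Deployment, as a sketch rather than a template to copy blindly: the app name, image, port, and /healthz path are placeholders, and the request and limit numbers are illustrative starting points you should size from real usage.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical app name
spec:
  replicas: 2                # at least two replicas for anything you care about
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:1.4.2   # placeholder image
        ports:
        - containerPort: 8080
        resources:
          requests:          # what the scheduler reserves for the pod
            cpu: 100m
            memory: 128Mi
          limits:            # hard ceiling before throttling or OOM kill
            cpu: 500m
            memory: 256Mi
        readinessProbe:      # gates traffic and rollout progress
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
        livenessProbe:       # restarts a stuck process
          tcpSocket:
            port: 8080
          timeoutSeconds: 2  # short timeout so a hung process is caught quickly
          periodSeconds: 10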
Network and security matter more than people expect. Choose a CNI you understand, and run its control components with two or more replicas where applicable. Monitor CNI metrics and test network partition scenarios in a lab. Run kube-proxy in IPVS mode for performance if your traffic patterns justify it. Apply RBAC with least privilege: create service accounts for controllers and grant only the Roles or ClusterRoles they need. Enable audit logs and ship them to a central location; they are useless if they disappear with a node. Use Pod Security Admission, which replaced PodSecurityPolicy as of Kubernetes 1.25, to limit hostPath, privileged, and hostNetwork use. Use network policies to block lateral movement inside the cluster; start with a default-deny policy and open ports per namespace as needed. Keep your image registry credentials in Secrets and scan images for CVEs before you deploy them.
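A default-deny policy is a one-screen manifest. This sketch assumes hypothetical payments and frontend namespaces, uses the kubernetes.io/metadata.name label that Kubernetes sets on every namespace, and denies both directions, so remember you will also need an egress allow for DNS before anything can resolve names.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments        # hypothetical namespace
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes:
  - Ingress
  - Egress                   # deny both directions; drop Egress if you only want ingress isolation
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: web               # hypothetical workload label
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: frontend   # only the frontend namespace may connect
    ports:
    - protocol: TCP
      port: 8080             # open exactly the port the workload serves on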
Operate with discipline. Automate upgrades on a staggered schedule and practise the upgrade path in a non-production cluster first. Run synthetic transaction checks so you know when an app is functionally broken, not just when pods are failing. Alert on high API server latency, slow etcd commits, and node heartbeat failures; a starter set of rules is sketched at the end of this guide. Backups must be exercised: perform a restore at least quarterly. Write short runbooks for common failures: API server unresponsive, etcd degraded, node unreachable. Test cluster resilience: simulate a control-plane node loss, a full-node disk failure, and a network partition. Track configuration changes in Git and apply them with GitOps tooling so you can roll back quickly. My takeaways are concrete: three control-plane nodes, tested etcd backups, resource requests, probes, RBAC, network policies, and rehearsed restores. Get those right and the Kubernetes cluster will survive the faults that matter.
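To close with something you can keep in Git, here is a minimal sketch of those alert rules. It assumes the kube-prometheus-stack with kube-state-metrics and etcd scraping in place; the metric names follow the standard exporters, and the thresholds are illustrative, so tune them to your environment.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-resilience-alerts
  namespace: monitoring            # assumes the Prometheus Operator watches this namespace
spec:
  groups:
  - name: control-plane
    rules:
    - alert: APIServerHighLatency
      # p99 latency for non-watch API requests above 1s for 10 minutes
      expr: |
        histogram_quantile(0.99,
          sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])) by (le, verb)
        ) > 1
      for: 10m
      labels:
        severity: warning
    - alert: EtcdSlowCommits
      # p99 backend commit duration above 250ms usually means slow disks
      expr: |
        histogram_quantile(0.99,
          sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (le)
        ) > 0.25
      for: 10m
      labels:
        severity: warning
    - alert: NodeNotReady
      # a node has reported NotReady for 5 minutes
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical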