What Kubernetes Is For
Docker runs a container on one machine. That works until you have ten microservices on twenty machines and need to keep all of them healthy, balance load between them, restart failures, scale up under load, scale down at night, deploy new versions safely, and survive entire machine outages. Doing all that by hand is impossible at any meaningful scale.
Kubernetes is what you use instead. You declare what you want ("I want 5 instances of my web app, each with 2 CPUs and 4GB RAM, behind a load balancer, accessible at example.com"). Kubernetes makes it happen and keeps it that way. If a pod crashes, K8s starts a new one. If a node dies, K8s reschedules its pods elsewhere. If traffic spikes, K8s adds more pods. If a deploy goes bad, K8s rolls back.
This declarative, self-healing model is the reason Kubernetes won. It is also why Kubernetes is famously complex: implementing it requires solving deep distributed systems problems, and exposing those solutions to users adds many concepts. This article walks through the architecture from the bottom up.
Step 1: The Two Halves
A Kubernetes cluster is split into two roles:
[Diagram: the control plane managing several worker nodes, each running kubelet + pods.]
Control plane: the brain. Decides what should run where, tracks state, accepts user commands. A handful of services running together.
Worker nodes: the muscle. Run the actual workloads. Each node hosts many containers (organized into pods).
Production clusters have at least 3 control plane replicas (for high availability) and many worker nodes. Cloud providers (EKS, GKE, AKS) typically run the control plane for you and only charge for worker nodes.
Step 2: Control Plane Components
API Server
The front door. Everything (users, controllers, kubelet on each node) talks to it via REST API. Exposes the Kubernetes API as you know it: kubectl get pods hits the API server.
Validates incoming requests, persists state to etcd, returns results. Stateless itself; multiple replicas run behind a load balancer.
Authentication and authorization happen here. Every API request is authenticated (TLS certs, tokens, OIDC) and authorized (RBAC).
etcd
The database. A consistent key-value store holding the entire cluster state: every pod, every config, every secret, every deployment, all here.
Built on Raft consensus. Production clusters run 3 or 5 etcd replicas. If a majority is unavailable, the cluster cannot make changes (reads of recent data also degrade).
etcd is small (a few GB typically) but mission-critical. Backups of etcd are backups of the cluster.
Scheduler
Watches for new pods that need a node. Evaluates which node should run each pod based on:
Resource requests (does the node have enough CPU/memory?).
Affinity rules (this pod must run with another, or must not).
Taints and tolerations (some nodes accept only certain workloads).
Topology constraints (spread pods across availability zones).
Picks the best node. Tells the API server. The kubelet on that node runs the pod.
The scheduler is one of the more complex pieces; its decisions shape cluster utilization.
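A sketch of how these constraints appear in a pod spec (the image, labels, and taint key are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
    - name: web
      image: nginx:1.27          # illustrative image
      resources:
        requests:                # what the scheduler bin-packs against
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi
  tolerations:                   # permit scheduling onto tainted nodes
    - key: workload              # hypothetical taint key
      operator: Equal
      value: web
      effect: NoSchedule
  topologySpreadConstraints:     # spread matching pods across zones
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: web
```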
Controller Manager
Runs many controllers, each watching for differences between desired state and actual state, taking action to reconcile.
Examples of built-in controllers:
Deployment Controller: ensures the right number of pods are running for each Deployment.
ReplicaSet Controller: manages the pods for each ReplicaSet (Deployments create ReplicaSets under the hood; this controller does the actual pod bookkeeping).
Node Controller: notices when a node fails (no heartbeat) and triggers pod rescheduling.
Service Controller: creates load balancers in the cloud when a Service of type LoadBalancer is created.
Endpoints Controller: updates the list of pod IPs behind each Service (modern clusters do this via the EndpointSlice controller).
The "controller manager" is a single process that runs all of these. Each is conceptually independent.
Cloud Controller Manager
For clusters running on AWS, GCP, Azure, etc. Translates cluster operations into cloud API calls (creating load balancers, attaching volumes, configuring routes).
Decoupled from the main controller manager so cloud-specific code lives separately.
Step 3: Worker Node Components
Every worker node runs three core components.
kubelet
The agent on each node. The control plane's representative. Receives pod specs from the API server, runs containers via the runtime, reports node health back.
The kubelet is the only component in the cluster that actually starts containers (via the runtime, below). Everything else just decides what should run.
Container Runtime
The thing that actually runs containers. Originally Docker. Now usually containerd or CRI-O. Implements the Container Runtime Interface (CRI) that kubelet uses.
The runtime pulls images from registries, creates containers, manages their lifecycle.
kube-proxy
Manages networking. Implements Service abstractions via iptables or IPVS rules.
When a Service is created with cluster IP X, kube-proxy on every node sets up rules so that traffic to X gets distributed to one of the Service's pod IPs.
kube-proxy is what makes "send traffic to my-service" work, regardless of which pods are currently running.
Step 4: The Core Resources
Pod
The smallest deployable unit in Kubernetes. One or more containers that share a network namespace and storage.
Usually one container per pod. Multi-container pods are for tightly coupled "sidecars": a logging sidecar, an SSL-terminating proxy, a service mesh sidecar (Envoy in Istio).
Pods are ephemeral. They get killed and replaced. They have IPs but those IPs don't persist; new pods get new IPs.
You almost never create pods directly. You create higher-level resources (Deployment, StatefulSet, etc.) that manage pods for you.
Deployment
Declares "I want N copies of this pod." Manages rolling updates, rollbacks, scaling.
The most common resource you'll create. Behind the scenes, a Deployment creates a ReplicaSet which creates pods. Updates create a new ReplicaSet alongside the old, gradually rolls pods over, then decommissions the old.
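A minimal Deployment sketch (name and image are hypothetical); kubectl apply -f creates it, kubectl rollout undo rolls it back:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most one extra pod during a rollout
      maxUnavailable: 0      # never drop below the desired ready count
  template:                  # the pod spec this Deployment stamps out
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myregistry/web:1.2.3   # hypothetical image
          ports:
            - containerPort: 8080
```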
Service
A stable network endpoint for a set of pods. Pods come and go (with different IPs each time); a Service has a fixed IP and DNS name.
Services match pods via labels. app=myapp says "this Service routes to all pods labeled app=myapp." Add a pod with that label and traffic flows to it; remove the pod and traffic stops. (A concrete manifest follows the list of types below.)
Service types:
ClusterIP: internal only. Default. Only reachable from within the cluster.
NodePort: exposes on a port on every node. Reachable via any node IP plus port.
LoadBalancer: creates a cloud load balancer (ALB/NLB on AWS, etc.). Reachable from the internet.
ExternalName: a DNS alias. Maps a service name to an external DNS name.
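The manifest promised above, as a minimal ClusterIP Service (names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ClusterIP          # the default; omitting it has the same effect
  selector:
    app: myapp             # routes to every pod carrying this label
  ports:
    - port: 80             # the Service's stable port
      targetPort: 8080     # the port the pods actually listen on
```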
Ingress
HTTP routing layer. Maps URLs to services. example.com/api goes to the api service; example.com/web goes to the web service. Both on the same external IP.
Implemented by an ingress controller (nginx, Traefik, AWS ALB Ingress Controller). The controller watches Ingress resources and configures itself to route traffic accordingly.
Ingress is the right way to expose multiple services on one external IP. Without it, each Service of type LoadBalancer creates a separate cloud load balancer (expensive).
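A sketch of the routing described above, assuming an nginx ingress controller is installed (hostnames and Service names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
spec:
  ingressClassName: nginx      # assumes the nginx controller is present
  rules:
    - host: example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api      # illustrative Service names
                port:
                  number: 80
          - path: /web
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```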
ConfigMap and Secret
Configuration. ConfigMap for plain config, Secret for sensitive values (but see the caveat below: Secrets are not encrypted by default).
Mount as files into containers (the most common pattern), or expose as environment variables. Apps can reload changes via file watcher or restart.
Secrets in default Kubernetes are base64-encoded, not encrypted, in etcd unless encryption-at-rest is configured. Real production setups use Sealed Secrets, External Secrets Operator, or a vault integration.
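A sketch of the file-mount pattern (names and keys are hypothetical); each key in the ConfigMap becomes a file under the mount path:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  app.properties: |        # becomes the file /etc/app/app.properties
    log_level=info
    feature_flag=true
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myregistry/app:1.0   # hypothetical image
      volumeMounts:
        - name: config
          mountPath: /etc/app
  volumes:
    - name: config
      configMap:
        name: app-config
```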
StatefulSet
Like Deployment but for stateful applications (databases, queues). Pods get stable identities (db-0, db-1, db-2 instead of random names) and stable storage (each pod's volume persists across restarts).
Used for running databases, Kafka, Elasticsearch, etc. inside Kubernetes. Most teams now run those as managed services outside K8s; StatefulSets are still useful for some workloads.
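A condensed StatefulSet sketch; the volumeClaimTemplates section is what gives each pod (db-0, db-1, db-2) its own persistent volume:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db          # headless Service giving each pod stable DNS
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: postgres
          image: postgres:16       # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:    # one PVC per pod, retained across restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```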
DaemonSet
One pod per node. Used for node-level agents: log collectors (Fluentd, Vector), monitoring agents (Datadog, Prometheus Node Exporter), CNI plugins.
If you add a new node to the cluster, the DaemonSet automatically runs a pod on it.
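A log-collector DaemonSet sketch (the image is hypothetical); note there is no replica count, since node membership determines the pod count:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
        - name: agent
          image: example.com/log-agent:1.0   # hypothetical agent image
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:            # read the node's own log files
            path: /var/log
```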
Job and CronJob
Job: runs a pod to completion. Used for one-shot tasks (database migrations, batch processing).
CronJob: runs Jobs on a cron schedule. The Kubernetes-native way to schedule recurring work.
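A CronJob sketch (image and args are hypothetical); the jobTemplate is the Job created on each tick:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"            # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure     # retry the pod if the task fails
          containers:
            - name: report
              image: myregistry/report:1.0   # hypothetical image
              args: ["--generate", "daily"]
```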
Namespace
A logical partition of the cluster. Different teams' workloads, dev vs staging environments. Names of resources only need to be unique within a namespace, not the whole cluster.
Quotas can be set per namespace (max CPU, max pods). RBAC can grant access per namespace.
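A quota sketch for a hypothetical team-a namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a          # hypothetical namespace
spec:
  hard:
    requests.cpu: "20"       # total CPU requested across all pods
    requests.memory: 40Gi
    pods: "50"               # max pod count in the namespace
```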
Step 5: How Reconciliation Works
The fundamental Kubernetes loop, applied everywhere:
1. You declare desired state ("3 replicas of nginx").
2. Controller compares to actual state ("currently 1 replica running").
3. Controller takes action to reduce the gap (creates 2 more pods).
4. Repeat forever.
This is why Kubernetes is "self-healing." Pod crashes? Controller notices replica count is wrong, schedules a new one. Node dies? Controller reschedules its pods elsewhere. Deploy a bad image? Controller keeps the old pods running until new ones become healthy.
The Watch-Based Architecture
Controllers don't poll. They watch the API server for events. When a resource changes, the API server pushes events to subscribed watchers. Each controller reacts to events relevant to its job.
This is why Kubernetes scales: thousands of resources, dozens of controllers, all communicating efficiently through the API server.
Step 6: Networking
Kubernetes networking is famously complex. Three concepts to understand.
Pod-to-Pod Networking
Every pod has its own IP address. Pods can talk to each other directly without NAT, regardless of which node they are on. The cluster's networking plugin (CNI) handles this: typically Calico, Flannel, Cilium, or AWS VPC CNI.
This is a strong design choice: from a pod's perspective, the network looks flat. No port collisions. No special routing logic.
Service Discovery
DNS within the cluster. my-service.my-namespace.svc.cluster.local resolves to the Service's cluster IP. Inside a namespace, just use the short name (my-service).
The cluster runs CoreDNS as the in-cluster DNS server. kubelet configures every pod's resolv.conf to use it.
Network Policies
Like firewalls for pods. Restrict which pods can talk to which. By default, all pods can talk to all pods (open). Network policies tighten this.
Example: "only pods labeled role=frontend can reach pods labeled role=backend on port 8080."
Critical for zero-trust networking within the cluster. Requires CNI plugin support (Calico, Cilium support; basic Flannel doesn't).
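The frontend-to-backend example above, written out (labels are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
spec:
  podSelector:
    matchLabels:
      role: backend            # the pods this policy protects
  policyTypes:
    - Ingress                  # selecting a pod denies all other inbound traffic
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend   # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```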
Step 7: Storage
Volumes
Pods can mount volumes for data. Ephemeral volume types (emptyDir and friends) live and die with the pod; durable data goes through the persistent-volume machinery below.
PersistentVolume (PV)
A piece of storage in the cluster. An actual disk somewhere (an EBS volume on AWS, a GCE disk on GCP, an NFS mount).
PersistentVolumeClaim (PVC)
A request for storage by a pod. "I need 10GB of fast storage." The cluster matches the claim to an available PV (or dynamically provisions one).
StorageClass
A template that dynamically provisions PVs. Define classes like "gp3-ssd," "premium-ssd," "throughput-optimized." A PVC requests by class; the cluster creates a matching PV automatically.
This abstraction lets pods work with persistent data without coupling to specific storage implementations. Move from AWS to GCP, the abstractions stay the same.
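A PVC sketch requesting storage by class (the class name is hypothetical; classes are defined by a cluster admin):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
    - ReadWriteOnce            # mountable read-write by a single node
  storageClassName: gp3-ssd    # hypothetical class; triggers dynamic provisioning
  resources:
    requests:
      storage: 10Gi
```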
Stateful Workloads in Kubernetes
You can run databases in K8s using StatefulSets and PVs. Operators like Crunchy Data's Postgres Operator, the MongoDB Operator, and Strimzi (Kafka) make this practical.
However, many teams choose managed services (RDS, Cloud SQL, MongoDB Atlas) instead. The operational complexity of running databases in K8s is real; managed services trade money for simplicity.
Step 8: Operators and Custom Resources
Kubernetes is extensible. You can teach it about new resource types beyond pods, services, deployments.
Custom Resource Definitions (CRDs)
CRDs let you add new resource types. Define a "PostgresCluster" resource, an "ElasticsearchCluster," a "RedisInstance," whatever your domain needs.
Once defined, users can create instances of these resources via kubectl, just like pods.
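A minimal CRD sketch (the example.com group and field names are hypothetical):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.example.com   # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: PostgresCluster
    plural: postgresclusters
    singular: postgrescluster
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:               # validation schema for instances
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                storage:
                  type: string
```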
Operators
Controllers that manage custom resources. They embody operational knowledge in code.
Example flow: a user creates a PostgresCluster resource ("I want a Postgres cluster with 3 nodes, 100GB storage, automatic backups").
The Postgres Operator watches for PostgresCluster resources. When it sees one, it:
Creates a StatefulSet to run the Postgres pods.
Creates a Service for connectivity.
Sets up replication between primary and replicas.
Configures backups to S3.
Handles failover when the primary fails.
Manages upgrades.
All the operational work a DBA would otherwise do by hand, expressed as code in the operator.
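What the user creates might look like this (the schema is hypothetical, matching the CRD sketch above; real operators define their own fields):

```yaml
apiVersion: example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  replicas: 3          # hypothetical fields; the operator reads these
  storage: 100Gi       # and creates the StatefulSet, Service, backups, etc.
```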
Why Operators Matter
This pattern is how complex systems integrate with Kubernetes. Strimzi for Kafka, OpenShift's many operators, Prometheus Operator, cert-manager (for TLS certs), External Secrets Operator. The Kubernetes ecosystem is largely operators.
Building your own operator: there are SDKs (Operator Framework, Kubebuilder) that handle the boilerplate. You write the reconcile logic.
Step 9: Helm and Package Management
Kubernetes manifests pile up. A modest application might have 10-20 YAML files: deployments, services, ingress, configmaps, secrets, network policies, monitoring configs. Plus environment-specific variants.
Helm is the package manager for Kubernetes. A "chart" is a templated bundle of manifests with configurable values.
helm install my-app ./mychart -f values-prod.yaml installs the entire app with production values.
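A sketch of what a chart template looks like (the chart layout is standard; the value keys are hypothetical):

```yaml
# templates/deployment.yaml (excerpt): Helm renders {{ ... }} at install time
# from values.yaml, overridden here by values-prod.yaml (e.g. replicaCount: 5)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-web
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-web
    spec:
      containers:
        - name: web
          image: "{{ .Values.image }}"   # hypothetical value key
```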
Charts are published to public registries (Artifact Hub, OCI registries) by many open-source projects. Installing Postgres or Prometheus is one command.
Alternatives: Kustomize (overlay-based instead of templated), and just plain YAML for simple cases.
Step 10: Autoscaling
Kubernetes has multiple autoscaling layers.
Horizontal Pod Autoscaler (HPA)
Adjusts the number of pod replicas based on CPU, memory, or custom metrics.
"Scale this Deployment between 3 and 20 replicas, target 70% CPU utilization." HPA monitors and adjusts.
For custom metrics (requests per second, queue depth), HPA can integrate with Prometheus or other metric sources.
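The CPU example above as an autoscaling/v2 manifest:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:              # the Deployment to scale
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods above ~70% average CPU
```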
Vertical Pod Autoscaler (VPA)
Adjusts the resource requests of pods (more or less CPU/memory per pod) based on historical usage.
Less commonly used than HPA. Some workloads benefit from it.
Cluster Autoscaler
Adds or removes worker nodes. If pending pods can't be scheduled because no node has space, Cluster Autoscaler adds a node. If nodes are mostly empty, it removes them.
Works with cloud providers (AWS, GCP, Azure). Critical for cost efficiency.
Karpenter (Modern Alternative)
An AWS-developed alternative to Cluster Autoscaler with more flexible node provisioning. Faster, more efficient.
Step 11: Observability in Kubernetes
Running services in Kubernetes is easy. Knowing what's happening is harder. Observability stack:
Logging
Containers write to stdout/stderr. A node-level agent (Fluentd, Vector) collects the logs and ships them to a centralized store (Elasticsearch, Loki, Datadog).
kubectl logs pod-name reads logs of a single pod, but for production you need centralized logs you can search.
Metrics
Prometheus is the de facto standard. Pods expose metrics via HTTP endpoints; Prometheus scrapes them periodically.
Grafana on top for dashboards. Alertmanager for alerts.
Kubernetes itself exposes lots of metrics (kubelet, API server, etcd). Apps add their own.
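One common discovery pattern is annotation-based scraping. The prometheus.io/* annotations below are a widespread convention honored by many Prometheus scrape configs, not a Kubernetes API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    prometheus.io/scrape: "true"   # convention, not a K8s API
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: myapp
      image: myregistry/app:1.0    # hypothetical; must serve /metrics on 9090
```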
Tracing
Distributed tracing (Jaeger, Tempo, Datadog APM) shows the path of requests across services. Critical for microservice debugging.
kubectl Commands
The CLI for inspecting state. kubectl get pods, kubectl describe deployment myapp, kubectl logs, kubectl exec -it pod-name -- bash. Mastery of kubectl is operational survival.
Step 12: What Makes Kubernetes Hard
The model is elegant. Operating it is a different story.
Many Moving Parts
Control plane components, workers, controllers, networking, storage. When something breaks, "what's wrong?" might involve any layer.
Networking is Genuinely Complex
Multiple layers: container networking (CNI), service networking (kube-proxy), ingress (controller of choice), service mesh (optional). Tracing why a request fails can mean inspecting iptables rules, network policies, ingress configs.
Resource Limits Tuning
Set requests and limits too low and pods get OOM-killed; too high and the cluster wastes capacity. The right values come from profiling. Many teams overprovision or underprovision significantly.
Cluster Upgrades
K8s releases every 4 months. Upgrading without breaking workloads requires care. Deprecated APIs, changed behaviors. Cloud-managed K8s helps but doesn't eliminate the work.
Storage in Cloud K8s
Surprising costs. Persistent volumes don't auto-resize. Cross-AZ I/O is expensive. Database storage in K8s often costs more than expected.
Security Misconfigurations
RBAC is complex. Pod security defaults often allow more than they should. Network policies are frequently forgotten until an incident. Container escapes are real risks.
Skills Required
Operating a real cluster requires DevOps engineers who understand Linux, networking, distributed systems, and Kubernetes specifics. Junior engineers can deploy apps; senior engineers debug the infrastructure.
Step 13: Managed Kubernetes
Most teams use managed Kubernetes. Cloud providers run the control plane; you only manage worker nodes (or in fully managed cases, neither).
Major Offerings
EKS (AWS): the most established. Solid but some rough edges. Pay for the control plane separately.
GKE (Google): often considered the best K8s product. Smoothest user experience.
AKS (Azure): free control plane. Tight Azure integration.
Linode Kubernetes Engine, DigitalOcean Kubernetes: cheaper, simpler. For smaller workloads.
Fully Managed (Serverless K8s)
GKE Autopilot: Google manages everything; you just submit workloads. No node management.
AWS Fargate (with EKS or ECS): task-level serverless containers.
Cloud Run: Google's container-as-a-service. Not strictly K8s but often the right alternative for simple workloads.
Most teams should default to managed K8s. Self-hosting Kubernetes is full-time engineering work.
Step 14: When NOT to Use Kubernetes
Kubernetes is overkill for many use cases. Heuristics:
Single small app: just use Docker Compose, Cloud Run, ECS Fargate, or Heroku-style PaaS.
Static website: serverless or CDN. K8s is wasteful.
Team without DevOps capacity: managed PaaS or simpler container service.
Small services that fit on one or two servers: containers without orchestration.
Workloads with very specialized requirements: bare metal, exotic GPU topologies, unusual networking.
K8s shines when you have many services, complex relationships, the need for self-healing, and a team that can operate it. Otherwise, it adds complexity without proportional benefit.
Step 15: Recap of Key Decisions
Kubernetes is a control loop. Declared state, observed state, controllers reconcile them.
Pods are ephemeral. Designed to be killed and replaced.
Services give stable network endpoints. Pods come and go; Services don't.
Ingress for HTTP routing to many services. Saves on cloud load balancers.
Operators encode operational knowledge. The pattern that integrates databases, queues, etc.
Networking is complex. CNI, kube-proxy, ingress, service mesh layered.
Storage abstractions: PV, PVC, StorageClass. Decouple workloads from specific storage.
Autoscale at multiple layers. HPA for pods, Cluster Autoscaler for nodes.
Observability is critical. Logs, metrics, traces. Without them, you fly blind.
Use managed K8s when possible. Self-hosting is full-time work.
The One Thing to Remember
Kubernetes is a control loop: you declare desired state, controllers reconcile actual to match. Pods, deployments, services, ingress are the user-facing primitives that turn that loop into a deployment platform. The architecture (control plane plus nodes) and the patterns (declarative configs, reconciliation, operators) are the parts worth deeply understanding. Everything else is configuration syntax to look up when needed. Kubernetes is genuinely complex, but the complexity reflects real distributed systems problems that any container orchestrator must solve. Once you internalize the reconciliation model, K8s feels less arbitrary and more like a natural way to manage cloud infrastructure. The ecosystem of operators built on top is what makes it so powerful: the same primitives manage your apps, your databases, your monitoring, your networking. That uniformity is the value proposition.