// INFRA.exe

The foundation everything sits on.

AWS, networking, containers, Kubernetes, IaC, CI/CD, secrets, observability, deploys, cost. The pieces underneath your code.

Most of this is unglamorous. All of it is what stops production from melting at 3 AM. The map is the syllabus — read across, then go deep on what you actually run.


// TABLE_OF_CONTENTS

01.The big mental model
02.AWS fundamentals — the 15 services that matter
03.Networking — VPC, subnets, security groups, load balancers
04.Compute — ECS vs EKS vs Lambda vs Fargate vs EC2
05.Containers and Docker
06.Kubernetes
07.Infrastructure as Code — Terraform, Pulumi, CDK
08.CI/CD
09.Secrets and identity
10.Observability — logs, metrics, traces
11.Scaling strategies
12.Deploy strategies
13.Databases (ops perspective)
14.Caching and queues
15.CDN and edge
16.Cost management
17.Security hardening
18.War stories

// SECTION_01

The big mental model

Infrastructure is the layer between your code and the physical machines running it. The job of an infra engineer is to make that layer reliable, secure, observable, and cost-effective — without being in the way.

The modern stack has six concerns:

Compute — where your code runs (containers, serverless, VMs).
Networking — how traffic gets to it (DNS, load balancers, CDN, VPC).
Storage — where data persists (S3, databases, caches).
Identity — who can do what (IAM, secrets, certificates).
Observability — what's actually happening (logs, metrics, traces).
Delivery — how code gets from laptop to production (CI/CD, IaC).

Think of infrastructure as the building services of a city — water, power, sewage, roads, security. The application engineers are the residents and businesses. The infra engineers don't decide what businesses open, but they make sure the lights stay on, the streets are safe, and there's a sensible permitting process for new construction.

Bad infra: residents have to install their own water pipes. Good infra: water just works, and there's a clear way to request a new connection.

The 2026 canonical stack

Concern	Default choice	Alternatives
Cloud	AWS	GCP, Azure, Cloudflare, Vercel, Fly
Compute	ECS Fargate or EKS	Lambda, EC2, Vercel/Railway/Fly
IaC	Pulumi (TypeScript) or Terraform	CDK, CloudFormation, OpenTofu
CI/CD	GitHub Actions	CircleCI, Buildkite, GitLab CI
Container registry	ECR	Docker Hub, GHCR
Secrets	AWS Secrets Manager	HashiCorp Vault, Doppler, Infisical
Observability	Datadog	Grafana+Prometheus, New Relic, Honeycomb
CDN	CloudFront or Cloudflare	Fastly, Bunny
DNS	Route 53 or Cloudflare	NS1

This guide uses AWS as the canonical because it has the largest ecosystem and the most patterns to learn from. The concepts transfer to other clouds — VPCs, IAM, load balancers, object storage, managed databases all have direct equivalents.

Multi-cloud is mostly a myth

Most teams pick one cloud and stay there. "Multi-cloud" usually means "using one cloud's services and another for one specific feature" (e.g., AWS for compute + Cloudflare for CDN). True multi-cloud (the same workload running on multiple clouds) is expensive, complex, and rarely worth it.

The infra question is always "what's the smallest, simplest thing that meets the requirements?" Most teams overengineer — Kubernetes when ECS would do, custom service mesh when ALB would do, Terraform modules when copy-paste would do. Senior infra work is often subtraction.

// SECTION_02

AWS fundamentals — the 15 services that matter

AWS has 200+ services. You don't need to know them all. Master 15 and you can build almost any production system.

The 15 services that matter most

Service	What it is
EC2	Virtual machines. The original AWS service. Use directly only when serverless/managed alternatives don't fit.
S3	Object storage. The most-used AWS service. Buckets of files, indexed by key. Eleven 9s of durability.
VPC	Virtual Private Cloud. Your isolated network within AWS. Subnets, route tables, security groups.
IAM	Identity and access management. Who can do what, on which resources.
RDS	Managed relational databases (Postgres, MySQL, etc.). Backups, replicas, upgrades handled.
DynamoDB	Managed NoSQL. Serverless, scales automatically. Different mental model from RDS.
Lambda	Run code without managing servers. Triggered by events. Pay per request and execution time.
ECS / EKS	Container orchestration. ECS is AWS-specific and simpler; EKS is Kubernetes.
Fargate	Serverless containers. ECS or EKS without managing the underlying EC2 instances.
ALB / NLB	Application Load Balancer (HTTP/L7) and Network Load Balancer (TCP/L4).
CloudFront	CDN. Edge caching for static assets and dynamic responses.
Route 53	DNS. Hosts your domains, can do health-check-based routing.
SQS / SNS	Message queue (SQS) and pub/sub (SNS). Decouple services.
Secrets Manager	Stores API keys, DB passwords, certificates. Supports automatic rotation.
CloudWatch	Logs, metrics, alarms. Often supplemented or replaced by Datadog/Grafana.

Regions and availability zones

AWS is geographically distributed:

Region: a geographic area (us-east-1, eu-west-1). 30+ regions globally.
Availability Zone (AZ): one or more isolated data centers within a region, with independent power and networking. Most regions have 3 AZs; a few have up to 6.
Edge location: CloudFront PoPs for caching. ~600 globally.

Production systems use multiple AZs for high availability. Multi-region is for disaster recovery or geographic latency.

S3 — what makes it special

Object storage with an HTTP API. Each object has a key (path-like string), a value (bytes), and metadata.

# key concepts
- Buckets: top-level containers, globally unique names
- Objects: files inside buckets
- Versioning: keep multiple versions, never lose data
- Lifecycle policies: auto-archive to cheaper storage classes
- Storage classes: Standard, IA, Glacier (cheaper, slower)
- Encryption: SSE-S3, SSE-KMS, SSE-C
- Access: private by default (Block Public Access on), presigned URLs for sharing

S3 is designed for 11 nines of durability (99.999999999%). In AWS's own framing: store 10 million objects and you can expect to lose one roughly every 10,000 years. It used to be eventually consistent for overwrites and deletes; since December 2020, reads are strongly consistent.
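To make the durability number concrete, here is a back-of-the-envelope sketch. It assumes the 11-nines figure can be read as an independent 1e-11 chance of losing any given object in a year, which is a simplification, and the function names are illustrative.

```typescript
// Assumption: 11 nines of durability ~ 1e-11 chance of losing a given object per year.
const ANNUAL_LOSS_PROBABILITY = 1e-11;

// Expected number of objects lost per year for a given object count.
function expectedLossesPerYear(objectCount: number): number {
  return objectCount * ANNUAL_LOSS_PROBABILITY;
}

// Average number of years between single-object losses.
function yearsPerExpectedLoss(objectCount: number): number {
  return 1 / expectedLossesPerYear(objectCount);
}

console.log(Math.round(yearsPerExpectedLoss(10_000_000))); // 10000
```

Durability is not availability or protection from yourself: an accidental delete is still a delete, which is why versioning matters.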

REAL-WORLD: Why every system uses S3 for something

Modern apps use S3 for at least one of:

User uploads — photos, attachments, documents. Use presigned URLs so clients upload directly to S3, not through your servers.
Static asset hosting — JS bundles, images, fonts. Served via CloudFront for low latency.
Backups — DB snapshots, log archives. Lifecycle policies move old ones to Glacier.
Data lake — raw data files (Parquet, JSON) queried by Athena, Snowflake, Databricks.
Static site hosting — for marketing sites, docs, etc.
Build artifacts — CI/CD outputs, container layer caches.

The pattern that matters most: direct upload via presigned URLs. Your backend generates a temporary, scoped URL; the client uploads directly to S3; your backend never touches the file bytes. Saves bandwidth, scales automatically.

IAM — the permissions system

IAM controls who can do what. Three main concepts:

Users — humans with API keys / login. Avoid for production code.
Roles — permissions assumed by services or users. The right way for code to access AWS.
Policies — JSON documents describing allowed actions.

// Example policy: allow reading from one S3 bucket
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}

The principle: least privilege. Every role gets the minimum permissions needed. Never use the root account for normal work; never share credentials between services.
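Least privilege is easier to hold onto when something checks it for you. The sketch below is a toy linter, not an AWS tool: it flags Allow statements that use wildcard actions or resources. The type and function names are made up for illustration, and real policies have more fields (Condition, NotAction, Principal) that this ignores.

```typescript
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}

interface PolicyDocument {
  Version: string;
  Statement: PolicyStatement[];
}

const toArray = (value: string | string[]): string[] =>
  Array.isArray(value) ? value : [value];

// Returns a warning for every broad wildcard found in an Allow statement.
function findOverbroadStatements(policy: PolicyDocument): string[] {
  const warnings: string[] = [];
  policy.Statement.forEach((statement, index) => {
    if (statement.Effect !== "Allow") return;
    for (const action of toArray(statement.Action)) {
      if (action === "*" || action.endsWith(":*")) {
        warnings.push(`Statement ${index}: wildcard action "${action}"`);
      }
    }
    for (const resource of toArray(statement.Resource)) {
      if (resource === "*") {
        warnings.push(`Statement ${index}: wildcard resource "*"`);
      }
    }
  });
  return warnings;
}

// The read-only bucket policy from above passes cleanly.
const readOnlyBucket: PolicyDocument = {
  Version: "2012-10-17",
  Statement: [{
    Effect: "Allow",
    Action: ["s3:GetObject", "s3:ListBucket"],
    Resource: ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
  }],
};

// The classic "just make it work" policy does not.
const tooBroad: PolicyDocument = {
  Version: "2012-10-17",
  Statement: [{ Effect: "Allow", Action: "s3:*", Resource: "*" }],
};

console.log(findOverbroadStatements(readOnlyBucket)); // []
console.log(findOverbroadStatements(tooBroad).length); // 2
```

A check like this fits naturally in CI next to your IaC, so broad grants get caught in review rather than in an incident.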

Cost — the part nobody teaches

AWS bills are confusing because services have many pricing dimensions. The big ones:

EC2/Fargate: per second of compute time.
S3: storage + requests + data transfer out.
Data transfer: out of AWS to the internet ($0.05-0.09/GB), between AZs ($0.01/GB in each direction), within an AZ (free over private IPs).
NAT gateway: $0.045/hour per gateway + $0.045/GB processed. Often the silent cost killer.
RDS: instance hours + storage + I/O + backup storage.
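The NAT gateway line is worth running the numbers on. A rough sketch using the prices quoted above; the 730-hour month and the traffic volume are assumptions, and rates vary by region.

```typescript
// Prices quoted in the list above; check current rates for your region.
const NAT_HOURLY_USD = 0.045;
const NAT_PER_GB_USD = 0.045;
const HOURS_PER_MONTH = 730; // assumption: average month length

function natGatewayMonthlyCost(gateways: number, gbProcessed: number): number {
  const hourly = gateways * NAT_HOURLY_USD * HOURS_PER_MONTH;
  const processing = gbProcessed * NAT_PER_GB_USD;
  return hourly + processing;
}

// Three gateways (one per AZ) pushing 10 TB a month through them:
console.log(natGatewayMonthlyCost(3, 10_000).toFixed(2)); // "548.55"
```

Most of that bill is the per-GB processing charge, not the hourly fee, which is why moving S3 and ECR traffic off the NAT path tends to be the first fix.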

The 80/20 of cost optimization:

Right-size instances (most are oversized).
Use Savings Plans / Reserved Instances for steady workloads (roughly 30% savings or more, depending on term and payment option).
Watch data transfer costs (they hide).
Lifecycle old S3 data to Glacier.
Shut down dev/staging environments outside business hours.

AWS rewards understanding 15 services deeply over 100 services shallowly. Pick the smallest set that covers your needs, build mastery, expand only when forced. Most production AWS architectures fit in: VPC + ECS Fargate + RDS + S3 + CloudFront + Route 53 + IAM + Secrets Manager + CloudWatch. Nine services. That's enough for most companies.

// SECTION_03

Networking — VPC, subnets, security groups, load balancers

Networking is the layer that decides whether traffic reaches your application or not. Get it wrong and nothing works. Get it right and it disappears.

VPC — your isolated network

A Virtual Private Cloud is a private network within AWS. You define the IP range (CIDR block), subnets, and routing.

VPC: 10.0.0.0/16  (65,536 IPs)
├── Subnet (public, AZ-a): 10.0.1.0/24  (256 IPs)
├── Subnet (public, AZ-b): 10.0.2.0/24
├── Subnet (private, AZ-a): 10.0.10.0/24
├── Subnet (private, AZ-b): 10.0.11.0/24
└── Route tables: where traffic goes
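The IP counts in that layout are plain powers of two, with one AWS-specific wrinkle: every subnet loses 5 addresses to AWS (network address, VPC router, DNS, one reserved for future use, broadcast). A small sketch with illustrative helper names, IPv4 only:

```typescript
// Number of addresses in an IPv4 CIDR block, e.g. "10.0.0.0/16" -> 65536.
function cidrSize(cidr: string): number {
  const prefix = Number(cidr.split("/")[1]);
  if (!Number.isInteger(prefix) || prefix < 0 || prefix > 32) {
    throw new Error(`invalid CIDR: ${cidr}`);
  }
  return 2 ** (32 - prefix);
}

// AWS reserves 5 addresses in every subnet.
const AWS_RESERVED_PER_SUBNET = 5;

function usableSubnetIps(cidr: string): number {
  return cidrSize(cidr) - AWS_RESERVED_PER_SUBNET;
}

// Convert a dotted IPv4 address to an unsigned integer.
function ipToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True when the `inner` block lies entirely within `outer`.
function cidrContains(outer: string, inner: string): boolean {
  const [outerIp, outerPrefix] = outer.split("/");
  const [innerIp, innerPrefix] = inner.split("/");
  const outerStart = ipToInt(outerIp);
  const innerStart = ipToInt(innerIp);
  return (
    Number(innerPrefix) >= Number(outerPrefix) &&
    innerStart >= outerStart &&
    innerStart + cidrSize(inner) <= outerStart + cidrSize(outer)
  );
}

console.log(cidrSize("10.0.0.0/16"));                     // 65536
console.log(usableSubnetIps("10.0.1.0/24"));              // 251
console.log(cidrContains("10.0.0.0/16", "10.0.11.0/24")); // true
```

The practical takeaway: a /24 gives you 251 usable addresses, not 256, and with awsvpc networking every task takes one. Size subnets for peak task count, not average.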

Public vs private subnets

Public subnet — route to the internet via Internet Gateway. Resources here have public IPs. Load balancers, bastion hosts.
Private subnet — no direct internet route. Resources have only private IPs. Application servers, databases.

Why this matters: databases and app servers should never be reachable from the public internet. They live in private subnets. Only the load balancer is public.

NAT Gateway — outbound from private subnets

Private resources sometimes need to reach the internet (download packages, call third-party APIs). NAT Gateway provides outbound access without exposing them to inbound traffic.

private subnet → NAT Gateway → Internet Gateway → internet

NAT gateways cost money — both per hour AND per GB of data processed. They're often a silent budget hole. Workarounds for common cases:

VPC Endpoints for AWS services: traffic stays inside AWS, no NAT needed. Gateway endpoints (S3, DynamoDB) are free; interface endpoints bill per hour and per GB.
VPC Peering or Transit Gateway for connecting VPCs without internet.

Security groups — the stateful firewall

Security groups control what traffic is allowed in/out of resources. Rules are stateful — if traffic is allowed in, the response is automatically allowed out.

// example: web tier security group
Inbound:
  - 443 from 0.0.0.0/0 (HTTPS from anywhere)
  - 80 from 0.0.0.0/0 (redirect to HTTPS)
Outbound:
  - All to 0.0.0.0/0 (default; lock down for sensitive systems)

// example: DB security group
Inbound:
  - 5432 from web-tier-sg (Postgres only from web tier)
Outbound:
  - none needed

The pattern: reference security groups by ID, not by IP. web-tier-sg can talk to db-tier-sg regardless of which specific instances are running.
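Here is a toy model of how that group-to-group reference evaluates. It is a sketch of the idea, not of AWS's implementation: the types are invented, and CIDR rules are only handled for the "anywhere" case.

```typescript
// Where traffic may come from: a CIDR range or another security group.
type SgSource =
  | { kind: "cidr"; cidr: string }
  | { kind: "sg"; groupId: string };

interface InboundRule {
  port: number;
  source: SgSource;
}

interface SecurityGroup {
  id: string;
  inbound: InboundRule[];
}

interface InboundAttempt {
  port: number;
  senderGroupIds: string[]; // groups attached to the sender; empty for internet clients
}

// Sketch only: CIDR rules are treated as "anywhere" (0.0.0.0/0).
// Real matching compares the sender's IP against the range.
function allowsInbound(group: SecurityGroup, attempt: InboundAttempt): boolean {
  return group.inbound.some((rule) => {
    if (rule.port !== attempt.port) return false;
    const source = rule.source;
    if (source.kind === "sg") return attempt.senderGroupIds.includes(source.groupId);
    return source.cidr === "0.0.0.0/0";
  });
}

const dbTier: SecurityGroup = {
  id: "db-tier-sg",
  inbound: [{ port: 5432, source: { kind: "sg", groupId: "web-tier-sg" } }],
};

const fromWebTask: InboundAttempt = { port: 5432, senderGroupIds: ["web-tier-sg"] };
const fromInternet: InboundAttempt = { port: 5432, senderGroupIds: [] };

console.log(allowsInbound(dbTier, fromWebTask));  // true
console.log(allowsInbound(dbTier, fromInternet)); // false
```

Notice that no IP address appears in the database rule. Tasks can be replaced, rescheduled, or scaled to 50 copies and the rule never changes.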

Network ACLs — the stateless backup

NACLs are subnet-level rules. Stateless (must explicitly allow both directions). Most production systems use security groups primarily and leave NACLs at defaults.

Load balancers — ALB vs NLB

Feature	ALB (Layer 7)	NLB (Layer 4)
Protocol	HTTP/HTTPS, gRPC, WebSocket	TCP, UDP, TLS pass-through
Routing	Path, host, header, query	Port-based
TLS	Terminates TLS	Pass-through or terminate
Latency	~5-10ms	~1ms
Use for	HTTP services (most apps)	Non-HTTP, ultra-low-latency, static IPs

Default to ALB. Use NLB only when you have a specific reason (non-HTTP protocols, ultra-low latency, or static IPs for regulatory allowlisting).

CloudFront — the CDN

Caches content at AWS edge locations near users. Reduces origin load, improves latency.

What CloudFront caches by default:

Anything matching the cache behavior (path patterns).
Subject to Cache-Control headers from the origin.
Per-edge-location caching with TTL.

What it doesn't cache:

Per-cookie variations (cookies are not part of the cache key unless you add them).
POST/PUT/PATCH/DELETE requests.
Anything with Cache-Control: no-store or private.

IMPLEMENTATION: Setting up CloudFront in front of an ALB

The standard pattern for a public web app:

User → CloudFront (edge cache) → ALB → ECS containers

CloudFront terminates TLS at the edge using your ACM certificate.
Static assets cached aggressively at edge (one-year TTL with hashed filenames).
HTML cached briefly (60s with stale-while-revalidate).
API requests pass through with no caching (or short TTL with cache key including auth).
Origin is the ALB; CloudFront forwards Host header so ALB routing works.

Costs: transfer from an AWS origin to CloudFront is free, and CloudFront egress is typically cheaper than direct ALB egress at volume, especially for static assets. For high-traffic sites this often saves money even with cache misses.
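The three caching tiers above come down to which Cache-Control header the origin sends. A minimal sketch of that decision, assuming API routes live under /api/ and hashed assets look like app.3f9a1c2b.js; both conventions, the function name, and the 300-second stale-while-revalidate window are assumptions for illustration.

```typescript
// Pick a Cache-Control header for a response based on its path.
function cacheControlFor(path: string): string {
  // API responses: never cached at the edge.
  if (path.startsWith("/api/")) return "no-store";

  // Hashed filenames change whenever content changes, so a year is safe.
  if (/\.[0-9a-f]{8,}\.(js|css|woff2|png|jpg|svg)$/.test(path)) {
    return "public, max-age=31536000, immutable";
  }

  // HTML and everything else: short TTL, serve stale while refreshing.
  return "public, max-age=60, stale-while-revalidate=300";
}

console.log(cacheControlFor("/assets/app.3f9a1c2b.js")); // public, max-age=31536000, immutable
console.log(cacheControlFor("/index.html"));             // public, max-age=60, stale-while-revalidate=300
console.log(cacheControlFor("/api/users/42"));           // no-store
```

Keeping this logic in the origin rather than in CloudFront behaviors means the same headers also drive browser caching.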

Route 53 — DNS

AWS-managed DNS. Routes domain names to resources.

Record types worth knowing:

A — domain to IPv4 address.
AAAA — domain to IPv6.
CNAME — domain to another domain.
ALIAS — Route 53 specific; like CNAME but works at the apex (example.com, not just www).
MX — mail server.
TXT — verification, SPF/DKIM/DMARC.

Routing policies:

Simple — one record, one answer.
Weighted — split traffic by percentage (canary deploys).
Latency-based — route to lowest-latency region.
Failover — primary + secondary, automatic switch on health check failure.
Geolocation — route by user's country.
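Weighted routing is the policy that powers DNS-level canaries: each record is answered with probability weight / total weight. A sketch of that selection, with illustrative names; the roll is passed in so the behavior is deterministic here, and you would use Math.random() in a simulation.

```typescript
interface WeightedRecord {
  target: string;
  weight: number;
}

// `roll` is a number in [0, 1). Each record wins with probability weight / total.
function pickWeighted(records: WeightedRecord[], roll: number): string {
  const total = records.reduce((sum, record) => sum + record.weight, 0);
  if (total <= 0) throw new Error("total weight must be positive");

  let threshold = roll * total;
  for (const record of records) {
    if (threshold < record.weight) return record.target;
    threshold -= record.weight;
  }
  // Guard against floating-point edge cases at the very top of the range.
  return records[records.length - 1].target;
}

const canarySplit: WeightedRecord[] = [
  { target: "stable.example.com", weight: 95 },
  { target: "canary.example.com", weight: 5 },
];

console.log(pickWeighted(canarySplit, 0.5));  // stable.example.com
console.log(pickWeighted(canarySplit, 0.96)); // canary.example.com
```

One caveat with DNS-based splits: resolvers cache answers for the record's TTL, so the real traffic split is lumpier than the weights suggest. Keep TTLs short during a canary.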

The networking debugging order

When traffic isn't flowing, check in this order:

DNS — does nslookup/dig resolve?
Reachability — does curl/ping get through?
Security groups — is the source allowed?
NACLs — subnet-level allow?
Route table — does the subnet have a route?
Application — is it actually listening on the port?

VPC Reachability Analyzer (built-in tool) walks this for you for any source-destination pair.

Network design is a question of trust boundaries. Public subnets trust the internet (load balancers); private subnets trust only the public subnets (app servers); database subnets trust only the app subnets. Each layer's security group only allows traffic from the layer above. The blast radius of a compromise is bounded by what each layer can reach.

// SECTION_04

Compute — ECS vs EKS vs Lambda vs Fargate vs EC2

Where your code actually runs. AWS has too many compute options; knowing how they differ is what separates a sane architecture from accidental complexity.

The five major compute options

Option	What it is	You manage	AWS manages
EC2	Virtual machines	OS, runtime, app, scaling	Hardware
ECS on EC2	Containers on your VMs	VMs, scaling	Container orchestration
ECS Fargate	Serverless containers	Containers only	Everything underneath
EKS	Managed Kubernetes	K8s workloads, sometimes nodes	K8s control plane
Lambda	Function-as-a-service	Function code only	Everything else

Lambda — pay per invocation

You write a function. AWS runs it on demand. No servers to manage, no scaling to configure. Triggered by events: HTTP requests, S3 uploads, queue messages, schedules.

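// Lambda handler (Node.js, API Gateway proxy event)
// fetchUser is a placeholder for your own data-access function.
export async function handler(event) {
  const userId = event.pathParameters?.id;
  if (!userId) {
    return { statusCode: 400, body: JSON.stringify({ error: "missing id" }) };
  }
  const user = await fetchUser(userId);
  if (!user) {
    return { statusCode: 404, body: JSON.stringify({ error: "not found" }) };
  }
  return {
    statusCode: 200,
    body: JSON.stringify(user),
  };
}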

Strengths:

Pay only when running. Idle = $0.
Auto-scales from zero to thousands of concurrent executions in seconds (subject to account concurrency limits).
No server management.
Great for event-driven workloads (S3 triggers, scheduled jobs, webhooks).

Weaknesses:

Cold starts (~100ms-2s for first invocation after idle).
15-minute max execution time.
Stateless — each invocation is fresh.
Local development is awkward (need SAM/serverless framework).
Connection pooling is hard (each execution environment holds its own DB connection, so a traffic spike can exhaust the database's connection limit).
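The usual mitigation for the connection problem is to create the client once per execution environment, in module scope, and reuse it on warm invocations. A sketch with a stand-in client; DbClient and openConnection are hypothetical placeholders for your real driver.

```typescript
// Hypothetical stand-in for a real database client (e.g. a Postgres pool).
interface DbClient {
  query(sql: string): Promise<unknown[]>;
}

let connectionsOpened = 0;

// Placeholder for the expensive part: TCP + TLS + auth handshake.
function openConnection(): DbClient {
  connectionsOpened += 1;
  return { query: async () => [] };
}

// Module scope outlives a single invocation: it persists for as long as
// Lambda keeps this execution environment warm.
let cachedClient: DbClient | undefined;

function getClient(): DbClient {
  if (!cachedClient) cachedClient = openConnection();
  return cachedClient;
}

async function handler(): Promise<{ statusCode: number }> {
  const db = getClient(); // reused on warm invocations
  await db.query("SELECT 1");
  return { statusCode: 200 };
}
```

This removes the per-invocation handshake, but every concurrent environment still holds its own connection. If concurrency can spike, cap it or put a pooler such as RDS Proxy between Lambda and the database.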

ECS Fargate — serverless containers

You build a container image; ECS runs it. Fargate means AWS manages the underlying VMs. You see only the containers.

# task definition (simplified; Fargate requires cpu and memory at the task level)
{
  "family": "my-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [{
    "name": "app",
    "image": "123456.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.3",
    "portMappings": [{ "containerPort": 3000 }],
    "essential": true
  }]
}

Strengths:

Same container model as Kubernetes, simpler to operate.
Long-running processes work fine.
No 15-minute limit.
Direct VPC networking — talks to private resources naturally.
Auto-scaling based on CPU/memory/custom metrics.

Weaknesses:

More expensive per-CPU than EC2 if you can keep instances utilized.
Slower scale-up than Lambda (containers take 30s+ to start).
No GPU support directly (use ECS on EC2).

EKS — managed Kubernetes

AWS-managed Kubernetes control plane; you bring worker nodes (or use Fargate for workers).

Strengths:

Standard Kubernetes — works the same as on-prem or other clouds.
Massive ecosystem (Helm charts, operators, service mesh).
Multi-region, multi-cluster, hybrid possible.
Best for complex microservice architectures.

Weaknesses:

Steepest learning curve.
About $73/month per cluster control plane ($0.10/hour, before workers).
More YAML, more concepts (pods, deployments, services, ingress, namespaces).
Requires more ops investment.

EC2 — the original

Just a VM. Use directly when:

You need GPUs.
You need very specific OS tuning (kernel parameters, network stack).
You're running a stateful service that doesn't fit container patterns (database, message broker).
You need spot instances for cheap batch work.

VS / COMPARISON: ECS Fargate vs EKS vs Lambda, when to pick each

If your workload is...	Pick
Web API, long-running, simple stack	ECS Fargate
Event-driven, sporadic, short-lived	Lambda
Microservice mesh with 20+ services	EKS
Cron jobs, S3 triggers, webhooks	Lambda
Background workers (queue consumers)	ECS Fargate
Multi-cloud or on-prem requirement	EKS (Kubernetes is portable)
Team of 3 engineers	ECS Fargate or Lambda (skip EKS)
Team of 50 engineers, complex deployments	EKS

The default for most teams in 2026: ECS Fargate for the main app, Lambda for event-driven work, RDS for databases. Add EKS only when scale or multi-cloud forces it.

The Kubernetes mistake: small teams adopt EKS because "everyone uses Kubernetes" and spend a year on infra instead of product. ECS Fargate covers the large majority of use cases at a fraction of the operational cost.

The serverless platforms — Vercel, Fly, Railway

For some apps, AWS itself is too complex. Higher-level platforms wrap cloud infrastructure with a focus on developer experience:

Platform	Best for	Trade-off
Vercel	Next.js apps, frontend-heavy	Premium pricing, vendor lock-in
Fly.io	Long-running apps, multi-region	Smaller ecosystem than AWS
Railway	Full-stack apps, fast iteration	Less control, fewer features
Render	Heroku-style PaaS	Limited customization
Cloudflare Workers	Edge compute, very low latency	Limited runtime (V8 isolates)

Use these when developer time is more expensive than infrastructure cost. Move to AWS directly when you outgrow them.

Compute choice is a developer experience question disguised as a technical one. The technical differences are real but rarely the deciding factor — most apps would run fine on any of these. The real question is "what's the smallest amount of operational complexity that meets the requirements?" For most teams, that's ECS Fargate + Lambda + RDS. Kubernetes is the answer when you've genuinely outgrown that, not before.

// SECTION_05

Containers and Docker

A container is a packaged application with all its dependencies, isolated from the host. Docker is the dominant tool for building them. Containers are the unit of deployment for almost all modern infrastructure.

What a container actually is

A container is a process running with isolation features the Linux kernel provides:

Namespaces — isolated view of processes, network, filesystem, users.
cgroups — resource limits (CPU, memory).
Layered filesystem — image is built up of layers (base OS + app dependencies + your code).

The container shares the host kernel. Lighter than a VM (no separate OS), but less isolated.

Dockerfile — building an image

# Multi-stage build for a Node.js app
# Stage 1: build (includes devDependencies and build tooling)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: runtime
FROM node:20-alpine
ENV NODE_ENV=production
WORKDIR /app
COPY package*.json ./
# install production dependencies only, so devDependencies never ship
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist

# run as non-root user
USER node

EXPOSE 3000
CMD ["node", "dist/server.js"]

The multi-stage pattern

Build dependencies (compilers, dev packages) shouldn't ship to production. Multi-stage builds use one image to build, copy artifacts to a smaller runtime image. Final image: small, no build tools, only runtime needs.

Without multi-stage, a Node app image might be 1.2GB. With multi-stage, around 150MB.

Image best practices

Use alpine variants for small images — node:20-alpine not node:20.
Order layers by change frequency — copy package.json before app code, so dependency installs cache.
Use .dockerignore — exclude node_modules, .git, tests, secrets.
Pin versions — node:20.10.0-alpine not node:latest. Reproducible builds.
Run as non-root — USER node or specific UID. Security.
Use distroless or scratch images for compiled languages (Go, Rust). Tiny, no shell.

The 12-factor app principles

Containers work best when apps follow these patterns:

Codebase — one repo per service.
Dependencies — explicitly declared, vendored or locked.
Config — in environment variables, not code.
Backing services — DBs, queues are attached resources, swappable.
Build, release, run — three separate stages.
Processes — stateless. State goes to backing services.
Port binding — app exports HTTP on a port; ingress routes to it.
Concurrency — scale by adding processes, not threads.
Disposability — fast startup, graceful shutdown.
Dev/prod parity — keep environments similar.
Logs — stdout/stderr, captured by infrastructure.
Admin processes — one-off jobs run in same environment.

Container registries

Where built images live before deployment.

ECR (AWS) — integrates with ECS/EKS, IAM-based access.
GHCR (GitHub) — free for public, paid for private. Integrates with GitHub Actions.
Docker Hub — original. Rate limits on pulls.
Self-hosted (Harbor, Nexus) — for air-gapped or strict-control environments.

Image scanning

Containers contain dependencies; dependencies have CVEs. Scan images for vulnerabilities:

Trivy: open-source, fast, runs in CI.
Snyk: commercial, deeper analysis.
ECR scanning: built-in if you use ECR.
GitHub Dependabot: opens pull requests to bump outdated Dockerfile base images.

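# Trivy in GitHub Actions
- name: Scan image
  # in real pipelines, pin to a release tag or commit SHA instead of @master
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'my-app:${{ github.sha }}'
    format: 'sarif'
    severity: 'CRITICAL,HIGH'
    # fail the job when findings at the severities above are present
    exit-code: '1'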

PITFALL: Common Dockerfile mistakes

Building everything in one stage: ships compilers and dev deps to prod. Use multi-stage.
Running as root: a container escape lands directly as root on the host. Use a non-root user.
Including secrets at build time: secrets get baked into image layers, and anyone with the image can extract them. Use runtime injection.
Using :latest tags: non-reproducible. Pin versions.
COPY everything: pulls in .git, node_modules, IDE configs. Use .dockerignore.
Wrong layer order: copying app code before npm install means every code change reinstalls dependencies. Copy package files first.
Not handling SIGTERM: Kubernetes/ECS sends SIGTERM, then SIGKILL after a grace period. Apps that don't handle SIGTERM lose in-flight requests.
Missing health check: orchestrators don't know if your app is alive. Add a /health endpoint and wire it up where your orchestrator looks: Kubernetes probes, the ECS task definition healthCheck, or a Dockerfile HEALTHCHECK for plain Docker. Kubernetes ignores the Dockerfile directive.
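The SIGTERM pitfall is the one most worth a code sketch. The idea: stop accepting new connections, let in-flight requests drain, and force an exit if draining takes too long. The helper below is illustrative, not a library API; the 25-second default assumes the orchestrator's grace period is the usual 30 seconds.

```typescript
// Minimal shape of what we need from a server; Node's http.Server satisfies it.
interface ClosableServer {
  close(callback: (err?: Error) => void): unknown;
}

// Returns a handler that stops accepting new connections, waits for in-flight
// requests to drain, and forces exit if draining exceeds the grace period.
function createShutdownHandler(
  server: ClosableServer,
  exit: (code: number) => void,
  graceMs = 25_000, // assumption: a bit under a 30s orchestrator grace period
): () => void {
  let shuttingDown = false;
  return function onSigterm(): void {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    const timer = setTimeout(() => exit(1), graceMs);
    server.close((err) => {
      clearTimeout(timer);
      exit(err ? 1 : 0);
    });
  };
}

// Wiring in a Node.js service (not executed here):
//   process.on("SIGTERM", createShutdownHandler(server, process.exit));
```

Two things commonly defeat this even when the handler exists: starting the app through npm or a shell wrapper that doesn't forward signals, and long-lived keep-alive connections that never let close() finish. The timeout covers the second; running node directly as the container's CMD covers the first.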

Docker Compose — for local development

# docker-compose.yml
services:
  app:
    build: .
    ports: ['3000:3000']
    environment:
      DATABASE_URL: postgres://postgres:secret@db:5432/myapp
    depends_on: [db, redis]

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine

volumes:
  pgdata:

One command (docker compose up) brings up the entire local environment. Used for development; production uses orchestrators (ECS, Kubernetes), not Compose.

Containers solve the "works on my machine" problem by shipping the machine. The mental model: the container is the deployment unit. Once the image is built, the deploy process is "run this image with these environment variables." This decouples the code from the runtime, which decouples teams from each other, which is why the industry standardized on it.

// SECTION_06

Kubernetes

Kubernetes is a container orchestration platform — it runs containers across many machines and handles scheduling, scaling, networking, and self-healing. It's powerful, complex, and often overused.

The core concepts

Concept	What it is
Pod	The smallest unit. Usually one container, sometimes a tightly-coupled group.
Deployment	Manages pods. "I want 3 copies of this image, replace them on changes."
Service	Stable network endpoint for a set of pods. Pods are ephemeral; service IPs aren't.
Ingress	HTTP routing to services. Like a reverse proxy with rules.
ConfigMap	Configuration data injected as env vars or files.
Secret	Like ConfigMap but for sensitive data. Base64-encoded, not encrypted by default.
Namespace	Logical isolation within a cluster. Used for multi-tenancy or environment separation.
Node	A worker machine (VM or physical) running pods.
StatefulSet	Deployment for stateful apps (databases, queues). Stable names, ordered startup.
DaemonSet	One pod per node. For log collectors, monitoring agents.
Job / CronJob	One-off or scheduled tasks.

A minimal deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.2.3
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 3000

The result: 3 pods running the image, with health checks, resource limits, and a stable network endpoint.

Probes — liveness vs readiness

Liveness: "is this container still alive?" Failure = the kubelet restarts the container.
Readiness: "is this container ready to receive traffic?" Failure = pod removed from Service endpoints.
Startup: for slow-starting apps. Liveness/readiness don't run until startup passes.

Common mistake: same probe for both. The right pattern: liveness checks "process is alive and not stuck" (lightweight), readiness checks "I can serve requests" (heavier — DB connection, dependencies).
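The split is easiest to see as two separate status functions. A sketch with an invented AppHealth shape; in a real service these would back the /health and /ready endpoints.

```typescript
interface AppHealth {
  eventLoopResponsive: boolean; // the process is not deadlocked
  dbConnected: boolean;         // dependency check
}

// Liveness: cheap, no dependency checks. A 500 here triggers a container restart.
function livenessStatus(health: AppHealth): number {
  return health.eventLoopResponsive ? 200 : 500;
}

// Readiness: includes dependencies. A 503 pulls the pod out of routing without restarting it.
function readinessStatus(health: AppHealth): number {
  return health.eventLoopResponsive && health.dbConnected ? 200 : 503;
}

const dbOutage: AppHealth = { eventLoopResponsive: true, dbConnected: false };
console.log(livenessStatus(dbOutage), readinessStatus(dbOutage)); // 200 503
```

During a database outage this is exactly what you want: pods stop receiving traffic but are not restarted. If liveness also checked the database, Kubernetes would restart every pod in a loop and add a thundering herd to the outage.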

Resources — requests and limits

Requests — what the pod is guaranteed. Kubernetes uses this for scheduling.
Limits — the maximum it can use. Exceeding CPU = throttle. Exceeding memory = OOM kill.

Typical pattern:

requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 500m, memory: 512Mi }

Request what you typically need; limit at peak. Setting requests too high wastes capacity. Setting them too low lets the scheduler pack pods onto nodes that are already busy, where they fight for CPU and memory.
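Because the scheduler places pods by requests, not by actual usage, you can estimate packing with a little arithmetic. A rough sketch with illustrative helpers: it handles only the quantity formats used above ("100m", whole cores, Ki/Mi/Gi) and ignores the capacity a real node reserves for the system.

```typescript
// Parse a Kubernetes CPU quantity into millicores: "100m" -> 100, "2" -> 2000.
function parseCpuMillis(quantity: string): number {
  return quantity.endsWith("m")
    ? Number(quantity.slice(0, -1))
    : Number(quantity) * 1000;
}

const BINARY_UNITS: Record<string, number> = {
  Ki: 2 ** 10,
  Mi: 2 ** 20,
  Gi: 2 ** 30,
};

// Parse a memory quantity with a binary suffix into bytes: "256Mi" -> 268435456.
function parseMemoryBytes(quantity: string): number {
  const match = /^(\d+)(Ki|Mi|Gi)$/.exec(quantity);
  if (!match) throw new Error(`unsupported memory quantity: ${quantity}`);
  return Number(match[1]) * BINARY_UNITS[match[2]];
}

// How many pods fit on an empty node, judged by requests alone.
function podsThatFit(
  nodeCpu: string,
  nodeMemory: string,
  requestCpu: string,
  requestMemory: string,
): number {
  const byCpu = Math.floor(parseCpuMillis(nodeCpu) / parseCpuMillis(requestCpu));
  const byMemory = Math.floor(parseMemoryBytes(nodeMemory) / parseMemoryBytes(requestMemory));
  return Math.min(byCpu, byMemory);
}

console.log(podsThatFit("2", "8Gi", "100m", "256Mi")); // 20
```

With the requests from the example, a 2-core, 8Gi node is CPU-bound at 20 pods. If each pod actually bursts to its 500m limit, that node is oversubscribed five times over, which is the "too low" failure mode in practice.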

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Scales pod count based on CPU utilization (or custom metrics). Combined with cluster autoscaling, the cluster grows and shrinks automatically.
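The scaling decision itself is a one-line formula: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a small tolerance of 1 and clamped to the min/max bounds. A sketch of that calculation; the function name is illustrative and the 10% tolerance reflects the Kubernetes default.

```typescript
// Compute the replica count the HPA would aim for.
function desiredReplicas(
  currentReplicas: number,
  currentUtilization: number, // average CPU as a percentage of requests
  targetUtilization: number,
  minReplicas: number,
  maxReplicas: number,
  tolerance = 0.1, // no change when within 10% of target
): number {
  const ratio = currentUtilization / targetUtilization;
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
  const desired = Math.ceil(currentReplicas * ratio);
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}

console.log(desiredReplicas(3, 140, 70, 3, 30));  // 6  (double the load, double the pods)
console.log(desiredReplicas(3, 72, 70, 3, 30));   // 3  (within tolerance, no change)
console.log(desiredReplicas(20, 200, 70, 3, 30)); // 30 (clamped to maxReplicas)
```

Utilization here is measured against each pod's CPU request, not its limit or the node's capacity. A Deployment with no CPU request gives the autoscaler nothing to divide by.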

Helm — packaged Kubernetes apps

Helm packages Kubernetes manifests as templated "charts." Install with one command, parameterize per environment.

# install Postgres via Helm
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-postgres bitnami/postgresql \
  --set auth.postgresPassword=secretpass \
  --set primary.persistence.size=20Gi

Most production teams use either Helm or Kustomize (built into kubectl, simpler).

Service mesh — Istio, Linkerd

A layer between services that handles cross-cutting concerns: mTLS between services, traffic routing, retries, circuit breaking, observability.

Adds significant complexity. Use only when you have specific problems it solves (regulatory mTLS, advanced traffic shaping, deep service-to-service observability). Most apps don't need this.

PITFALL: When Kubernetes is the wrong answer

Common mistakes:

Adopting it too early. 5-engineer team, one app — ECS Fargate is simpler. Kubernetes adds 6 months of learning curve.
Running stateful workloads without expertise. Postgres on Kubernetes is fine, but only if your team knows StatefulSets, PVCs, backup/restore, and the failure modes. Otherwise, RDS.
Not setting resource limits. One runaway pod consumes the node, kills others. Always set limits.
Treating it as a magic black box. Kubernetes failures are subtle. Without understanding networking, scheduling, and storage, you can't debug.
Building custom operators for things that aren't custom. Most needs are met by existing operators or Helm charts.

The senior judgment: Kubernetes when scale, multi-cloud, or complex orchestration justifies the operational cost. Otherwise, use simpler alternatives.

Managed Kubernetes options

Service	Cloud	Notes
EKS	AWS	~$73/mo control plane ($0.10/hour), you manage workers (or use Fargate)
GKE	GCP	Often considered the best K8s experience, free tier control plane
AKS	Azure	Free control plane, decent integration
DigitalOcean Kubernetes	DO	Cheaper, simpler, less feature-rich
Civo, Linode	Various	Even cheaper, smaller ecosystems

Kubernetes is a powerful, opinionated orchestrator that's worth its complexity at scale and overkill for small apps. The mental model: declarative state — you describe what you want (3 pods of this image with these resources), Kubernetes figures out how to make it match. When something fails, Kubernetes restarts; when traffic spikes, autoscaler responds. The cost is the learning curve and operational overhead. Pick it when the alternative would require building these features yourself.

// SECTION_07

Infrastructure as Code — Terraform, Pulumi, CDK

Infrastructure as Code defines your cloud resources in version-controlled files. Same code, deployed identically across environments. The alternative — clicking around the AWS console — doesn't scale beyond one engineer.

The five major IaC tools

Tool	Language	State	Cloud
Terraform	HCL (custom DSL)	Backend (S3, Cloud, etc.)	All
OpenTofu	HCL (Terraform fork)	Same as Terraform	All
Pulumi	Real languages (TS, Python, Go)	Pulumi Cloud or self-hosted	All
AWS CDK	Real languages (TS, Python, etc.)	CloudFormation	AWS only
CloudFormation	YAML/JSON	AWS-managed	AWS only

Terraform — the industry standard

# terraform/main.tf
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket = "my-tf-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "uploads" {
  bucket = "my-app-uploads-prod"
}

resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_public_access_block" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

output "uploads_bucket" {
  value = aws_s3_bucket.uploads.id
}

# commands
terraform init      # download providers, set up backend
terraform plan      # show what will change
terraform apply     # execute the changes
terraform destroy   # tear it all down

Pulumi — IaC in real languages

// infra/index.ts
import * as aws from "@pulumi/aws";

const uploadsBucket = new aws.s3.Bucket("uploads", {
  bucket: "my-app-uploads-prod",
  versioning: { enabled: true },
});

new aws.s3.BucketPublicAccessBlock("uploads-block", {
  bucket: uploadsBucket.id,
  blockPublicAcls: true,
  blockPublicPolicy: true,
  ignorePublicAcls: true,
  restrictPublicBuckets: true,
});

export const uploadsBucketName = uploadsBucket.id;

Same result, but with TypeScript: real loops, conditions, functions, modules, type checking, and IDE autocomplete. The infra is just code.

VS / COMPARISON: Terraform vs Pulumi vs CDK, which IaC

	Terraform	Pulumi	CDK
Language	HCL	TS/Python/Go/.NET	TS/Python/Java/.NET
Cloud support	All major + many	All major + many	AWS only
Ecosystem	Largest	Growing	AWS-deep
Learning curve	Medium (HCL)	Lower if you know the language	Lower if you know the language
Type safety	Limited	Full	Full
Loops/conditions	Awkward	Native	Native
State management	You configure backend	Pulumi Cloud free tier or self-host	CloudFormation manages
Hiring	Most engineers know it	TS engineers pick it up fast	AWS shops

The 2026 picture:

Terraform if your team values stability, broadest ecosystem, multi-cloud.
Pulumi if your team is strong in a real language and wants type safety + abstraction power.
CDK if you're AWS-only and want first-party support.
OpenTofu if you want Terraform without HashiCorp's BSL license.

For most teams in 2026, the two-way race is Terraform vs Pulumi. CDK is fine for AWS-shops; CloudFormation is legacy.

State management

IaC tools track the resources they've created in a "state file." This is critical:

State maps logical names to real cloud resources.
Without it, the tool can't know what to update or delete.
State must be shared (team) and locked (prevent concurrent applies).
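
Why state matters can be sketched as a diff between what the code declares and what the state file records. This is a toy model under stated assumptions, not how Terraform or Pulumi actually plan; `Desired`, `State`, and `plan` are hypothetical names introduced for illustration.

```typescript
// Toy model of an IaC plan step. All names here are illustrative assumptions.
type Attrs = Record<string, string>;

interface StateEntry {
  cloudId: string; // real ID assigned by the cloud provider
  attrs: Attrs;    // attributes recorded at the last apply
}

type Desired = Record<string, Attrs>;    // logical name -> declared attributes
type State = Record<string, StateEntry>; // logical name -> what was created

interface Plan {
  create: string[];
  update: string[];
  destroy: string[];
}

function sameAttrs(a: Attrs, b: Attrs): boolean {
  const keys = Object.keys(a);
  return keys.length === Object.keys(b).length && keys.every((k) => a[k] === b[k]);
}

function plan(desired: Desired, state: State): Plan {
  const result: Plan = { create: [], update: [], destroy: [] };
  for (const logicalName of Object.keys(desired)) {
    const entry = state[logicalName];
    if (entry === undefined) result.create.push(logicalName);
    else if (!sameAttrs(desired[logicalName], entry.attrs)) result.update.push(logicalName);
  }
  // Anything tracked in state but no longer declared gets destroyed.
  for (const logicalName of Object.keys(state)) {
    if (!(logicalName in desired)) result.destroy.push(logicalName);
  }
  return result;
}

const state: State = {
  uploads: { cloudId: "bucket-123", attrs: { versioning: "Enabled" } },
  legacy: { cloudId: "bucket-456", attrs: { versioning: "Suspended" } },
};
const desired: Desired = {
  uploads: { versioning: "Enabled" },
  logs: { versioning: "Enabled" },
};

console.log(plan(desired, state)); // create: logs, destroy: legacy
console.log(plan(desired, {}));    // empty state: everything looks new
```

The second call is the lost-state scenario: with nothing recorded, the tool would try to create resources that already exist.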

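For Terraform: store state in S3 with locking enabled (S3-native lockfiles via use_lockfile in Terraform 1.10+, or a DynamoDB lock table on older versions), or use Terraform Cloud / Spacelift / Atlantis.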

For Pulumi: Pulumi Cloud (free for individuals, paid for teams), or self-host with S3.

The IaC discipline

Everything in code. No clicking in the console for production. Drift is the enemy.
One state per environment. Dev, staging, prod separate.
Plan in CI, apply via PR. Code review for infra changes.
Modules / components for reuse. Common patterns extracted, parameterized.
Tagging. Every resource tagged with environment, owner, cost-center.
Drift detection. Run plan regularly to find resources that changed outside IaC.

PITFALL: The state file disasters

Local state in production. Engineer leaves; nobody can manage the infra. State must be remote.
State file committed to git. Contains secrets in plaintext. .gitignore it; use proper backend.
Two engineers run apply at the same time. Race condition corrupts state. Always use locking.
Manual changes in console + terraform apply = the apply wipes the manual changes (since they're not in code). Or worse, errors. Lock down console access.
Lost state file. Recovery is painful — import every resource by hand. Backup state files; use versioned S3.
State file corruption. Always have backups. terraform state pull regularly to a versioned location.

Modules and abstractions

Once you've written one VPC, you don't want to write another. Extract patterns:

# Terraform module usage
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.0"

  name = "prod-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
  public_subnets  = ["10.0.1.0/24",  "10.0.2.0/24",  "10.0.3.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false  # one per AZ for HA
}

Public modules from the Terraform Registry cover most common patterns. Avoid reinventing.

IaC turns infrastructure into a code review. Want to add a database? Open a PR. The senior team has every piece of production defined in code, planned in CI, applied via merged PRs. Drift stays rare because nobody touches the console, and scheduled plans catch whatever slips through. The discipline pays off when you need to spin up a duplicate environment, recover from disaster, or onboard a new engineer: they can read the code instead of asking.

// SECTION_08

CI/CD

CI/CD is how code goes from a developer's laptop to production. Continuous integration runs tests on every change. Continuous deployment ships passing changes automatically. The pipeline is the seam between development and operations.

What a pipeline does

Trigger — push to branch, PR opened, schedule, manual.
Checkout — get the source code.
Setup — install dependencies, configure cache.
Lint and typecheck — fast checks that fail early.
Test — unit, integration, possibly e2e.
Build — compile, bundle, build Docker image.
Push — image to registry, artifacts to storage.
Deploy — to staging, then production (with approval gates).
Verify — smoke tests, health checks.
Notify — Slack, email on success/failure.

GitHub Actions — the canonical CI/CD

# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - run: npm ci

      - run: npm run lint
      - run: npm run typecheck
      - run: npm run test

      - run: npm run build

      - uses: actions/upload-artifact@v4
        with:
          name: build
          path: dist/

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # for OIDC
      contents: read
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy
          aws-region: us-east-1

      - run: ./deploy.sh

Caching

Most pipeline time is spent installing dependencies. Caching them between runs is the single biggest speedup.

# cache npm
- uses: actions/setup-node@v4
  with:
    node-version: '20'
    cache: 'npm'

# cache Docker layers
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max

# cache custom paths
- uses: actions/cache@v4
  with:
    path: ~/.cargo
    key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}

Matrix builds

Run the same job across multiple configurations.

strategy:
  matrix:
    node: ['18', '20', '21']
    os: [ubuntu-latest, macos-latest]
runs-on: ${{ matrix.os }}

Six jobs run in parallel, testing every combination.
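
The expansion is a cartesian product of the matrix axes. A minimal sketch of that arithmetic follows; `expandMatrix` is a hypothetical helper, not part of GitHub Actions.

```typescript
// Expand a build matrix into its individual job configurations.
type Matrix = Record<string, string[]>;
type Job = Record<string, string>;

function expandMatrix(matrix: Matrix): Job[] {
  let jobs: Job[] = [{}];
  for (const axis of Object.keys(matrix)) {
    const next: Job[] = [];
    for (const job of jobs) {
      for (const value of matrix[axis]) {
        next.push({ ...job, [axis]: value }); // one copy of each job per axis value
      }
    }
    jobs = next;
  }
  return jobs;
}

const jobs = expandMatrix({
  node: ["18", "20", "21"],
  os: ["ubuntu-latest", "macos-latest"],
});
console.log(jobs.length); // 6
```

Every added axis multiplies the job count, so a third axis with three values would take this matrix to 18 jobs. Keep matrices small on paid runners.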

Authentication — OIDC, not long-lived secrets

The old pattern: store AWS access keys in GitHub Secrets. Rotation hell, leak risk.

The 2026 pattern: OIDC federation. GitHub issues each workflow run a short-lived signed token identifying the repo and ref; AWS STS verifies it and returns temporary credentials for the role. No long-lived credentials.

permissions:
  id-token: write  # request OIDC token
  contents: read

- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/github-deploy
    aws-region: us-east-1

The AWS IAM role has a trust policy that only accepts tokens from your specific GitHub repo and branches. No secret to leak.
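
Conceptually, the trust policy is a set of conditions on the token's claims. The sketch below models that check in TypeScript to show what is being compared; the real evaluation happens inside AWS after the token signature is verified, and `my-org/my-app` is a hypothetical repository name.

```typescript
// Conceptual model of an OIDC trust policy check. AWS performs this inside
// IAM/STS; none of this code runs in your pipeline.
interface OidcClaims {
  iss: string; // token issuer
  aud: string; // intended audience
  sub: string; // subject: which repo and ref the workflow ran for
}

interface TrustPolicy {
  issuer: string;
  audience: string;
  allowedSubjects: string[]; // exact matches only in this sketch
}

function isTrusted(claims: OidcClaims, policy: TrustPolicy): boolean {
  return (
    claims.iss === policy.issuer &&
    claims.aud === policy.audience &&
    policy.allowedSubjects.indexOf(claims.sub) !== -1
  );
}

const policy: TrustPolicy = {
  issuer: "https://token.actions.githubusercontent.com",
  audience: "sts.amazonaws.com",
  allowedSubjects: ["repo:my-org/my-app:ref:refs/heads/main"],
};

const fromMain: OidcClaims = {
  iss: policy.issuer,
  aud: policy.audience,
  sub: "repo:my-org/my-app:ref:refs/heads/main",
};
const fromFork: OidcClaims = { ...fromMain, sub: "repo:someone-else/my-app:ref:refs/heads/main" };

console.log(isTrusted(fromMain, policy)); // true
console.log(isTrusted(fromFork, policy)); // false
```

The common mistake is a wildcard subject that matches every repo in the org, or every repo on GitHub. Pin the subject as tightly as the workflow allows.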

Deploy strategies

Different ways to ship code:

Rolling — replace instances one at a time. Default in ECS/Kubernetes.
Blue/green — bring up new version alongside old, switch traffic, keep old around for instant rollback.
Canary — send small % of traffic to new version, ramp up if metrics look good.
Feature flag — code is deployed but disabled until flag is flipped.

The CI/CD principles

Fast feedback. CI should finish in under 10 minutes for normal changes. Optimize ruthlessly.
Fail fast. Run cheap checks (lint, typecheck) before expensive ones (e2e tests).
Reproducible. Same code → same artifact, every time. Pin versions, lock files.
Idempotent deploys. Re-running a deploy yields the same result.
Roll back, don't roll forward. When something breaks, the safest action is to deploy the previous version.
Trunk-based with PRs. Short-lived feature branches, merged into main daily. Long-lived branches accumulate conflicts.

REAL-WORLD: A real CI/CD pipeline for a Next.js + ECS app

PR opened — runs lint, typecheck, unit tests, builds preview deployment.
PR approved + merged to main — full test suite runs, including integration tests against staging DB.
Tests pass — Docker image built with multi-stage, pushed to ECR with git SHA tag.
Image scanned — Trivy fails the pipeline if HIGH/CRITICAL CVEs found.
Deploy to staging — ECS service updated with new task definition. Wait for rollout to complete.
Smoke tests run against staging — Playwright tests against the staging URL. Fail = rollback.
Manual approval gate — Slack notification, on-call approves promotion to prod.
Deploy to prod — same image, prod ECS service updated. Health checks pass before old containers are terminated.
Post-deploy — Datadog deployment marker, notify Slack.

Total time: 12-15 minutes from merge to prod. Rollback if needed: 2 minutes (just deploy previous SHA).

Other CI/CD platforms

	Best for	Trade-off
GitHub Actions	GitHub-hosted, most teams	Tied to GitHub
GitLab CI	GitLab users	Tied to GitLab
CircleCI	Independent, mature	Cost at scale
Buildkite	Self-hosted runners, large enterprises	More setup
Jenkins	Legacy, on-prem requirement	Operational overhead
Argo CD / Flux	Kubernetes GitOps	K8s-only, declarative model

CI/CD is the immune system of your engineering org. Every commit is a potential pathogen; the pipeline catches the bad ones. The mature pipeline is fast (sub-10-minute), reliable (no flaky tests, no random failures), and trusted (engineers don't bypass it). When the pipeline becomes the bottleneck, fix the pipeline; don't ship around it.

// SECTION_09

Secrets and identity

Secrets are the credentials your code needs to access other systems: database passwords, API keys, signing keys, certificates. Mishandling them is one of the most common causes of breaches.

The cardinal rules

Never commit secrets to git. Once committed, assume compromised. Even private repos.
Never put secrets in container images. Anyone with the image has them.
Never log secrets. Filter them in logging libraries.
Use short-lived credentials when possible. Rotation reduces blast radius.
Audit who has access to what. Log every secret access.

Where secrets live

Service	Best for	Cost
AWS Secrets Manager	AWS-native, auto-rotation	$0.40/secret/month
AWS Parameter Store	Free tier, simpler	Free for standard, $0.05/parameter/month for advanced
HashiCorp Vault	Multi-cloud, complex policies	Self-hosted (free) or HCP
Doppler / Infisical	Developer experience focused	Per-user pricing
1Password Secrets Automation	Teams using 1Password	Per-user
GitHub Secrets	CI/CD only	Free with GH

The injection patterns

How secrets get into your running app:

Environment variables (most common)

# ECS task definition
"environment": [
  { "name": "NODE_ENV", "value": "production" }
],
"secrets": [
  {
    "name": "DATABASE_URL",
    "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-url-AbCdEf"
  }
]

ECS pulls the secret at task startup, injects as env var. Container code reads process.env.DATABASE_URL.

Sidecar container (Kubernetes)

Vault Agent or External Secrets Operator runs alongside your app, fetches secrets, writes to a shared volume.

SDK fetch at runtime

// fetch from Secrets Manager directly
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';

const client = new SecretsManagerClient({ region: 'us-east-1' });
const { SecretString } = await client.send(
  new GetSecretValueCommand({ SecretId: 'db-credentials' })
);
// SecretString is undefined for binary secrets, so check before parsing
if (!SecretString) throw new Error('db-credentials has no string value');
const { username, password } = JSON.parse(SecretString);

More flexible (rotation without restart) but adds complexity.
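
"Rotation without restart" usually means caching the fetched value for a short TTL and re-fetching when it expires. A minimal sketch, assuming a synchronous fetcher for brevity (a real Secrets Manager call is asynchronous); `CachedSecret` and the simulated store are hypothetical.

```typescript
// Cache a fetched secret for a short TTL so a rotated value is picked up
// without restarting the process.
class CachedSecret {
  private value: string | undefined;
  private fetchedAt = 0;

  constructor(
    private readonly fetchSecret: () => string,
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  get(): string {
    let current = this.value;
    const expired = this.now() - this.fetchedAt >= this.ttlMs;
    if (current === undefined || expired) {
      current = this.fetchSecret(); // the only place that hits the secret store
      this.value = current;
      this.fetchedAt = this.now();
    }
    return current;
  }
}

// Simulated clock and secret store, stand-ins for the real thing.
let clock = 0;
let stored = "password-v1";
let fetches = 0;
const dbPassword = new CachedSecret(() => { fetches++; return stored; }, 60000, () => clock);

const first = dbPassword.get();      // first call fetches
stored = "password-v2";              // rotation happens elsewhere
const stillCached = dbPassword.get(); // within TTL: old value, no fetch
clock += 60000;                      // TTL elapses
const refreshed = dbPassword.get();  // re-fetch picks up the rotated value
console.log(first, stillCached, refreshed, fetches);
```

The TTL is the window during which the app may hold a stale credential, so rotation needs to keep the old value valid for at least that long.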

Secret rotation

Secrets should rotate regularly. Static API keys are time bombs.

Database passwords — Secrets Manager can rotate via Lambda. New password generated, applied to RDS, app re-fetches at next connection.
API keys (Stripe, OpenAI) — manually create new key, update secret, deploy, deactivate old.
JWT signing keys — rotate by issuing new tokens with new key while old ones remain valid for transition period.
TLS certificates — automated via Let's Encrypt + cert-manager (Kubernetes) or AWS ACM (auto-renewal).

OIDC and short-lived credentials

The modern shift: don't have long-lived credentials at all.

IAM Roles for Service Accounts (IRSA) in EKS — pods assume IAM roles directly, no AWS keys.
ECS task roles — same idea for ECS.
OIDC federation for CI/CD — GitHub Actions assumes IAM role per workflow.
Workload Identity in GCP — same pattern.

Code doesn't see credentials at all. The platform handles authentication; your code just makes API calls.

REAL-WORLD: Secrets management for a typical web app

The setup:

Local dev: .env.local file (gitignored). Engineers populate from a shared 1Password vault.
CI/CD: No long-lived AWS keys. GitHub Actions uses OIDC to assume an IAM role for deploys.
Staging/prod runtime: ECS task role allows reading from Secrets Manager. Secrets injected as env vars at task startup.
Application secrets (Stripe, OpenAI, Sentry): stored in Secrets Manager, one secret per service, rotated quarterly.
Database credentials: Secrets Manager auto-rotates monthly via Lambda. App reconnects on next query.
Internal service tokens: mTLS via internal CA, certificates auto-rotated by cert-manager.

What's noticeably absent: long-lived AWS access keys anywhere. The closest thing is the personal IAM user for local development, with read-only console access.

PITFALL: How secrets actually leak

git history. Engineer pastes a key into config.json, commits, pushes. Removing it later doesn't help because git history retains it. Bots scrape GitHub for AWS keys; exposed keys are often abused within minutes.
Container layers. RUN echo "API_KEY=secret" > .env bakes the secret into a layer. Anyone who pulls the image extracts it.
Logs. logger.info(`Using key: ${apiKey}`) while debugging. Datadog now has the key. SIEM has the key. Backups have the key.
Error messages. Stack traces include connection strings. URLs include tokens. Logged and sent to Sentry.
Frontend bundles. Server-side env var accidentally exposed via NEXT_PUBLIC_ prefix. Now in client JS, viewable in DevTools.
Slack messages. "Hey, here's the staging API key" in a DM. Slack message search makes it findable.
Screen shares. Engineer's terminal scrolling through env vars during a meeting. Recording saved.

Defenses:

Pre-commit hooks: gitleaks, trufflehog. Block commits with secret patterns.
GitHub push protection: detects secrets at push time. Free for public repos.
Container scanning: detects secrets in image layers.
Logging libraries: filter known sensitive keys (password, token, secret, key).
Frontend: lint rule catching NEXT_PUBLIC_ prefix on sensitive vars.
Rotate any secret that even might have leaked.
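
The logging defense can be sketched as a redaction pass applied before a record reaches the logger. The key pattern below is an illustrative starting point, not a complete list, and `redact` is a hypothetical helper rather than any logging library's API.

```typescript
// Redact values whose key names look sensitive before logging.
const SENSITIVE_KEY = /password|token|secret|key|authorization/i;

type LogValue = string | number | boolean | null | LogValue[] | { [k: string]: LogValue };

function redact(value: LogValue): LogValue {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const out: { [k: string]: LogValue } = {};
    for (const k of Object.keys(value)) {
      // Match on the key name; nested objects are walked recursively.
      out[k] = SENSITIVE_KEY.test(k) ? "[REDACTED]" : redact(value[k]);
    }
    return out;
  }
  return value;
}

const safe = redact({
  user_id: "u_42",
  apiKey: "sk_live_abc",
  db: { host: "db.internal", password: "hunter2" },
});
console.log(JSON.stringify(safe));
```

This only catches secrets stored under recognizable key names. A token embedded in a URL or an error message passes straight through, which is why rotation remains the backstop.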

The discipline summary

Secrets in a manager, never in code.
Pre-commit hooks catch accidents.
OIDC for CI/CD — no long-lived keys.
IAM roles for runtime — no API keys.
Rotation automated where possible, scheduled where not.
Audit logs on secret access. Alert on anomalies.
Treat any leaked secret as compromised — rotate, don't try to "clean up."

Secrets management is one of those areas where the right approach takes setup but very little ongoing work. Wrong approaches feel easier (just paste the key in env) but generate constant risk and ongoing remediation. The senior approach: never have a long-lived credential a human or script could leak. Use platform identity (IAM roles, OIDC, IRSA), rotate everything else, audit access. The tooling exists; the discipline is the work.

// SECTION_10

Observability — logs, metrics, traces

Observability is your ability to understand what's happening in production. The three pillars — logs, metrics, traces — answer different questions. You need all three.

The three pillars

	What it answers	Cardinality
Logs	What happened (specific events with context)	Unlimited (every request)
Metrics	How much / how often (numeric, aggregated)	Limited (tag values multiply)
Traces	How (call graph, where time went)	Sampled (typically 1-10%)

Logs are best for debugging a specific incident. Metrics are best for monitoring trends and alerting. Traces are best for understanding distributed system performance.

Structured logging

The right way to log: structured JSON with fields, not freeform text.

// bad — unsearchable
console.log(`User ${userId} bought ${productId} for $${amount}`);

// good — structured
logger.info('purchase_completed', {
  user_id: userId,
  product_id: productId,
  amount,
  currency: 'USD',
  payment_method: 'stripe',
});

Now you can query: "all purchases over $100 in the last hour." With freeform logs, you can't.
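
That query is just a filter once log lines are data. A small sketch with made-up sample events; `LogEvent` and `logLine` are hypothetical, and a real log platform would run the filter server-side.

```typescript
// Structured log lines are data, so a query is a filter.
interface LogEvent {
  ts: string;
  level: "info" | "warn" | "error";
  event: string;
  fields: Record<string, string | number>;
}

function logLine(event: string, fields: Record<string, string | number>, ts: string): string {
  const entry: LogEvent = { ts, level: "info", event, fields };
  return JSON.stringify(entry);
}

// Three sample lines as they might appear in a log stream (values are made up).
const lines = [
  logLine("purchase_completed", { user_id: "u1", amount: 250, currency: "USD" }, "2026-01-01T10:00:00Z"),
  logLine("purchase_completed", { user_id: "u2", amount: 40, currency: "USD" }, "2026-01-01T10:05:00Z"),
  logLine("user_signed_up", { user_id: "u3" }, "2026-01-01T10:06:00Z"),
];

// "All purchases over $100": parse, then filter on typed fields.
const bigPurchases = lines
  .map((line) => JSON.parse(line) as LogEvent)
  .filter((e) => e.event === "purchase_completed" && Number(e.fields.amount) > 100);

console.log(bigPurchases.map((e) => e.fields.user_id)); // [ 'u1' ]
```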

Log levels

DEBUG — detailed info for debugging. Off in production.
INFO — normal events worth recording (request received, user signed up).
WARN — unexpected but recoverable (retrying, fallback used).
ERROR — something failed (request couldn't complete, exception thrown).
FATAL — service is going down.

Production usually runs at INFO. Tools query and aggregate; humans rarely scroll through logs.

Metrics

Numbers over time, queryable by tags.

// counter — monotonically increasing
metrics.counter('http.requests', { method: 'POST', route: '/api/login', status: 200 });

// gauge — current value
metrics.gauge('queue.depth', 42, { queue: 'emails' });

// histogram — distribution
metrics.histogram('http.duration_ms', 145, { route: '/api/search' });

Metric types matter:

Counters — only go up. Track totals (requests, errors). Calculate rate as rate(counter[5m]).
Gauges — go up and down. Track current values (queue depth, active connections, memory).
Histograms / summaries — distributions. Track latency percentiles, request sizes.
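
The two calculations behind these types, rate from a counter and percentiles from a distribution, can be sketched directly. This is a simplification: real metric backends usually estimate percentiles from bucketed histograms rather than raw observations, and the function names are hypothetical.

```typescript
// Per-second rate from two samples of a monotonically increasing counter.
interface CounterSample { timeSec: number; value: number; }

function ratePerSecond(earlier: CounterSample, later: CounterSample): number {
  const elapsed = later.timeSec - earlier.timeSec;
  if (elapsed <= 0) throw new Error("samples must be in time order");
  // A counter that went down was reset (e.g. process restart); count from zero.
  const increase = later.value >= earlier.value ? later.value - earlier.value : later.value;
  return increase / elapsed;
}

// Nearest-rank percentile over raw observations.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no observations");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// 3000 more requests over 300 seconds.
console.log(ratePerSecond({ timeSec: 0, value: 1000 }, { timeSec: 300, value: 4000 })); // 10

const latenciesMs = [12, 15, 11, 14, 480, 13, 16, 12, 15, 13];
console.log(percentile(latenciesMs, 50)); // 13
console.log(percentile(latenciesMs, 99)); // 480
```

The average of those latencies is about 60ms, which describes no actual request. That gap is why latency is tracked as percentiles, not means.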

The four golden signals

Google SRE's framework for monitoring any service:

Latency — how long requests take (p50, p95, p99).
Traffic — how many requests (RPS).
Errors — how many fail (error rate).
Saturation — how full the system is (CPU, memory, disk, queue depth).

Dashboards for any service start with these four.

Distributed tracing

One request flows through many services. A trace shows the full call graph with timing.

POST /api/checkout                                            450ms
├── auth.verify                                                15ms
├── inventory.check                                            22ms
├── payment.charge                                            380ms  ← bottleneck
│   ├── stripe.api                                            350ms
│   └── ledger.record                                          25ms
├── email.send (async)                                          5ms
└── response                                                   28ms

Without tracing, you'd see "checkout is slow." With tracing, you see Stripe is the bottleneck.

OpenTelemetry is the standard now — vendor-neutral instrumentation. Most languages have auto-instrumentation libraries that capture HTTP, DB, and RPC calls automatically.

Datadog — the canonical observability platform

Single platform for logs, metrics, traces, infra monitoring, APM, RUM (real user monitoring), synthetics. Expensive but comprehensive.

// Node.js APM auto-instrumentation
import tracer from 'dd-trace';
tracer.init();

// custom span
const span = tracer.startSpan('process.payment');
try {
  await processPayment(order);
  span.setTag('order_id', order.id);
} catch (err) {
  span.setTag('error', err);
  throw err;
} finally {
  span.finish();
}

Alerting

Alerts wake humans up. Bad alerts waste sleep and trust.

Good alerts are:

Symptomatic — alert on user-facing symptoms (latency, error rate), not on causes (high CPU). Causes change; symptoms don't.
Actionable — there's something the on-call can do. If not, why are you waking them?
Worth waking up for — reserve pages for real impact. Email or Slack for warnings.

The alert hierarchy:

Pages (call/SMS) — user-impacting outage. Site is down, payments are failing.
Slack alerts — concerning trends, errors above baseline. Check during business hours.
Email digests — daily summaries, low-priority items.
Dashboards — for proactive monitoring, not alerts.

PITFALL: Alert fatigue and how to avoid it

Common patterns that lead to alert fatigue:

Static thresholds. "CPU > 80%" pages every Friday at 5 PM during normal load. Use anomaly detection or rate-based thresholds.
Alerting on causes. Pod restarts page. They restarted because something failed; the symptom is what matters. Alert on user impact.
Flapping. Metric oscillates above/below threshold. Use sustained durations (">5min") or hysteresis.
No runbook. Page fires; on-call has no idea what to do. Every alert needs a linked runbook.
Too many alerts per service. 50 alerts; nobody knows which matters. Aggregate to symptom-level alerts.
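
The flapping fix (sustained duration plus hysteresis) can be sketched as a small state machine. The thresholds and the `AlertRule` shape are hypothetical; monitoring platforms expose the same idea under their own option names.

```typescript
// Fire only when the metric stays above the threshold for a sustained window,
// and clear only when it drops below a lower threshold (hysteresis).
interface MetricPoint { timeSec: number; value: number; }

interface AlertRule {
  fireAbove: number;  // start counting when the value exceeds this
  clearBelow: number; // resolve only when the value falls below this
  forSec: number;     // how long the breach must last before firing
}

function evaluate(points: MetricPoint[], rule: AlertRule): boolean[] {
  const states: boolean[] = [];
  let firing = false;
  let breachStart: number | null = null;
  for (const p of points) {
    if (firing) {
      if (p.value < rule.clearBelow) { firing = false; breachStart = null; }
    } else if (p.value > rule.fireAbove) {
      if (breachStart === null) breachStart = p.timeSec;
      if (p.timeSec - breachStart >= rule.forSec) firing = true;
    } else {
      breachStart = null; // breach interrupted: restart the clock
    }
    states.push(firing);
  }
  return states;
}

// Error rate (%) sampled once a minute.
const errorRate: MetricPoint[] = [
  { timeSec: 0, value: 6 },   // brief spike
  { timeSec: 60, value: 2 },  // recovers, so no alert
  { timeSec: 120, value: 6 },
  { timeSec: 180, value: 7 },
  { timeSec: 240, value: 6 }, // above threshold for 120s: fires
  { timeSec: 300, value: 4 }, // below fireAbove but not below clearBelow: stays firing
  { timeSec: 360, value: 2 }, // clears
];
const alertStates = evaluate(errorRate, { fireAbove: 5, clearBelow: 3, forSec: 120 });
console.log(alertStates); // [false, false, false, false, true, true, false]
```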

The metric that matters: page rate per on-call shift. If it's more than 1-2 per shift, the alert config needs work, not the on-call rotation.

SLOs and error budgets

An SLO (Service Level Objective) is a measurable promise about reliability:

SLO: 99.9% of requests succeed in < 500ms over a 30-day window
Error budget: 0.1% of requests = ~43 minutes of downtime per month

The team can spend the error budget on risky deploys, experimentation, etc. When the budget runs out, freeze deploys until reliability improves.
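
The budget arithmetic is worth having as a function, since the numbers shift quickly with each extra nine. A minimal sketch; the function names and the sample request counts are assumptions.

```typescript
// Minutes of full downtime an availability SLO allows over a window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

// Fraction of a request-based budget still unspent (1 = untouched, 0 = exhausted).
function budgetRemaining(sloPercent: number, totalRequests: number, failedRequests: number): number {
  const allowedFailures = totalRequests * (1 - sloPercent / 100);
  return 1 - failedRequests / allowedFailures;
}

for (const slo of [99, 99.9, 99.99]) {
  // Prints 432.0, 43.2 and 4.3 minutes respectively.
  console.log(slo + "% over 30 days: " + errorBudgetMinutes(slo, 30).toFixed(1) + " min");
}

// 10M requests this window, 4,000 failed, SLO 99.9% (10,000 failures allowed).
const remaining = budgetRemaining(99.9, 10000000, 4000);
console.log(Math.round(remaining * 100) + "% of the budget left"); // 60%
```

A negative result means the budget is overspent, which is the signal for the deploy freeze described above.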

SLOs replace "uptime" as the metric that matters — they're focused on user-experienced reliability, not on whether servers are technically running.

The observability stack alternatives

	What	Pricing
Datadog	Full platform	Premium ($15-30/host/mo + per-feature)
Grafana + Prometheus + Loki + Tempo	Open-source stack	Self-hosted (free) or Grafana Cloud
New Relic	Full platform	Per-user pricing (better for small teams)
Honeycomb	Best-in-class tracing/observability	Per-event pricing
Sentry	Error tracking + APM	Per-event/per-user
CloudWatch	AWS-native, basic	Cheap but limited

Observability is about asking new questions in production, not just answering pre-planned ones. Logs let you search for anything. Metrics tell you what's normal. Traces tell you why something is slow. The mature org instruments everything from day one (cheap to add) rather than scrambling during incidents (expensive). When you're debugging an issue at 3 AM, the only question that matters is "what data do I have?"

// SECTION_11

Scaling strategies

Scaling is about adding capacity without proportionally adding complexity. The mistake teams make is reaching for distributed systems before they've vertically scaled. The right order matters.

The order of scaling

Vertical scaling (bigger machines) — usually the cheapest first move.
Read replicas — most apps are read-heavy.
Caching — Redis in front of expensive queries.
Horizontal scaling — more app instances behind a load balancer.
CDN — push static and cacheable content to edge.
Async processing — move slow work to background queues.
Sharding / partitioning — split data across DBs when one isn't enough.
Multi-region — geographic distribution for global latency or DR.

Vertical scaling — buying a bigger box

Boring but effective. A db.r6g.xlarge handles dramatically more than a db.t3.micro. Most teams' first scaling problem is "we should have used bigger instances from the start."

Limits:

Eventually you hit the largest available instance.
Single point of failure.
Diminishing returns past a certain size.

But: vertical scaling is the right answer until you actually outgrow it. Don't preemptively shard at 1k users.

Horizontal scaling — more instances

Stateless services scale horizontally by adding copies behind a load balancer.

// ECS service with auto-scaling (illustrative shape, not the literal ECS or Application Auto Scaling API)
{
  "desiredCount": 3,
  "minCapacity": 3,
  "maxCapacity": 30,
  "scaling": {
    "targetCpuUtilization": 70
  }
}

The hard part: making your app actually stateless. Sessions in Redis, not memory. Files in S3, not disk. WebSocket connections through a pub/sub layer.

Auto-scaling triggers

CPU utilization — most common. Add instances when avg CPU > 70%.
Request count — scale by traffic.
Queue depth — scale background workers when queue backs up.
Custom metrics — anything you can graph (response time, error rate).
Schedule — predictable traffic patterns (scale up at 9 AM, down at 6 PM).

Tune the cooldown carefully. Scale up fast (don't drop traffic), scale down slow (don't churn).
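
"Scale up fast, scale down slow" can be sketched as a proportional rule with an asymmetric step limit. This is similar in spirit to target tracking but is not the AWS algorithm; `maxScaleInStep` and the other names are hypothetical.

```typescript
interface ScalingConfig {
  min: number;
  max: number;
  targetCpu: number;      // desired average CPU %, e.g. 70
  maxScaleInStep: number; // tasks removed per evaluation, at most
}

function desiredCount(current: number, avgCpu: number, cfg: ScalingConfig): number {
  // Proportional rule: keep per-task utilization near the target.
  const ideal = Math.ceil(current * (avgCpu / cfg.targetCpu));
  // Scale out immediately, but scale in by only a few tasks per evaluation.
  const stepped = ideal < current ? Math.max(ideal, current - cfg.maxScaleInStep) : ideal;
  return Math.min(cfg.max, Math.max(cfg.min, stepped));
}

const cfg: ScalingConfig = { min: 3, max: 30, targetCpu: 70, maxScaleInStep: 1 };

console.log(desiredCount(3, 95, cfg));   // 5: jump straight to what the load needs
console.log(desiredCount(10, 20, cfg));  // 9: load wants 3, but shed one task at a time
console.log(desiredCount(25, 100, cfg)); // 30: capped at max
```

If the count sits at max for long, the cap is the bottleneck and the alert should say so.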

Database read replicas

Most apps are 80%+ reads. Read replicas multiply read capacity:

writes → primary
reads → primary OR replicas

RDS Multi-AZ + read replicas give you HA + read scaling. Configure your app to route reads to replicas, writes to primary.

Caveat: replica lag. Async replicas are slightly stale. For "read your writes" consistency, route a user's reads to primary for N seconds after their write, or use a causal consistency token.
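
The "primary for N seconds after a write" rule can be sketched as a small router. This version tracks writes in process memory, which only works with a single app instance; with several instances the timestamp would need to live in a cookie or shared store. `ReadRouter` is a hypothetical name.

```typescript
type ReadTarget = "primary" | "replica";

class ReadRouter {
  private lastWriteAt = new Map<string, number>();

  constructor(
    private readonly stickyMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  // Call on every write a user makes.
  recordWrite(userId: string): void {
    this.lastWriteAt.set(userId, this.now());
  }

  // Recent writers read from primary so they see their own changes.
  routeRead(userId: string): ReadTarget {
    const wroteAt = this.lastWriteAt.get(userId);
    if (wroteAt !== undefined && this.now() - wroteAt < this.stickyMs) return "primary";
    return "replica";
  }
}

let nowMs = 0; // simulated clock
const router = new ReadRouter(5000, () => nowMs);

router.recordWrite("alice");
console.log(router.routeRead("alice")); // "primary": just wrote
console.log(router.routeRead("bob"));   // "replica": no recent write
nowMs += 5000;
console.log(router.routeRead("alice")); // "replica": sticky window elapsed
```

The sticky window should exceed typical replica lag with margin. If lag spikes past it, users see stale reads again, so replica lag belongs on a dashboard.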

Caching layers

The cheapest performance gain you can buy.

Layer	Speed	Use for
L1: in-process	Sub-microsecond	Hot config, small lookup tables
L2: Redis/Memcached	Sub-millisecond	Session, cached query results, rate limit counters
L3: CDN edge	Sub-50ms (geo-dependent)	Static assets, public API responses
L4: Database	1-100ms	Source of truth

The patterns:

Cache-aside — app checks cache, falls back to DB on miss, populates cache.
Write-through — every write goes to cache and DB simultaneously.
Write-behind — writes go to cache; cache writes to DB asynchronously.

Cache-aside is the default. Other patterns when you have specific consistency requirements.
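
Cache-aside in miniature, using an in-memory Map where production would use Redis and a synchronous loader where production would use an async query. `CacheAside` and the fake loader are hypothetical.

```typescript
class CacheAside<V> {
  private cache = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private readonly loadFromDb: (key: string) => V,
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  get(key: string): V {
    const hit = this.cache.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) return hit.value; // cache hit
    const value = this.loadFromDb(key); // miss: fall back to the source of truth
    this.cache.set(key, { value, expiresAt: this.now() + this.ttlMs }); // populate
    return value;
  }

  // Call after a write so the next read reloads fresh data.
  invalidate(key: string): void {
    this.cache.delete(key);
  }
}

let dbReads = 0;
const users = new CacheAside<string>((key) => { dbReads++; return "row-for-" + key; }, 30000);

users.get("user:1"); // miss: reads the DB
users.get("user:1"); // hit: served from cache
users.invalidate("user:1");
users.get("user:1"); // miss again after invalidation
console.log(dbReads); // 2
```

The TTL bounds how stale a value can get if an invalidation is missed, so it is a safety net, not the primary consistency mechanism.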

Async processing — the queue pattern

Slow work shouldn't block user-facing requests. Move it to background workers.

// synchronous (slow)
async function signUp(email, password) {
  const user = await createUser(email, password);
  await sendWelcomeEmail(user);  // 800ms
  await indexUserInSearch(user);  // 200ms
  await provisionTenant(user);    // 1500ms
  return user;
  // total: 2.5+ seconds
}

// async (fast)
async function signUp(email, password) {
  const user = await createUser(email, password);
  await queue.enqueue('user.signed_up', { userId: user.id });
  return user;
  // total: 100ms; the rest happens in workers
}

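// worker
queue.consume('user.signed_up', async (msg) => {
  // the message carries only the ID, so reload the user first
  // (getUserById is a hypothetical helper)
  const user = await getUserById(msg.userId);
  await sendWelcomeEmail(user);
  await indexUserInSearch(user);
  await provisionTenant(user);
});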

The tradeoff: eventual consistency. The user is created instantly, but the welcome email may be 5 seconds late. For most flows, this is fine. The fast response matters more.

Sharding — splitting data

When one database can't hold all your data or handle all the writes:

Vertical partitioning — split tables across DBs by domain (users in one, orders in another).
Horizontal sharding — split rows of one table across DBs by some key (user ID hash, geographic region).

Sharding is a one-way door. Cross-shard queries are expensive or impossible. Joins across shards are pain. Plan extensively before sharding.
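
Hash-based routing itself is a few lines, which is part of why sharding looks deceptively easy. A minimal sketch using FNV-1a over UTF-16 code units; the user ID format and shard count are made up.

```typescript
// FNV-1a 32-bit hash: small, deterministic, and good enough for a sketch.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return hash >>> 0; // force unsigned
}

// Every reader and writer must agree on this function forever.
function shardFor(userId: string, shardCount: number): number {
  return fnv1a(userId) % shardCount;
}

// Count how 1,000 hypothetical users spread across 4 shards.
const perShard = [0, 0, 0, 0];
for (let i = 0; i < 1000; i++) {
  perShard[shardFor("user-" + i, 4)]++;
}
console.log(perShard);
```

Plain modulo remaps most keys when `shardCount` changes, so adding a fifth shard means moving the majority of rows. Consistent hashing or a lookup table is the usual way to soften that, and it is one more thing to plan before committing.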

Many teams that thought they needed sharding just needed: better indexes, read replicas, caching, or moving to a bigger instance.

REAL-WORLD: A scaling story for a typical SaaS

Web app + background workers + Postgres, growing from 1k to 10M+ users.

Stage 1 (1k users): Single ECS task, db.t3.medium. Plenty of headroom. Total cost: $200/mo.

Stage 2 (10k users): Auto-scaling group of 3-10 ECS tasks. db.r6g.large. CloudFront for static assets. Cost: $1.5k/mo.

Stage 3 (100k users): Add read replica for analytics queries. Add Redis for session storage. Move email sending to async queue with SQS + Lambda. Cost: $5k/mo.

Stage 4 (1M users): Multi-AZ database. Multiple read replicas. ElastiCache cluster. Full async processing pipeline. Costs increase non-linearly because of cross-AZ data transfer. Cost: $25k/mo.

Stage 5 (10M+ users): Now sharding becomes a real consideration. Multi-region for global latency. Service decomposition (microservices). Costs scale with engineering complexity, not just users.

The lesson: most companies don't need to design for 10M users from day one. The first three stages are well-trodden patterns, not heroic engineering.

Scaling is sequential. Each step removes a specific bottleneck. The mistake is jumping ahead — building for 1M users when you have 1k. The right order: vertical, replicas, cache, horizontal, CDN, async, shard. Most teams never need to shard. Most apps never outgrow well-provisioned RDS Multi-AZ + 5 ECS tasks. Build for current scale + 10x; plan for 100x; don't engineer for 1000x until you actually need it.

// SECTION_12

Deploy strategies

Deploy strategies are how you ship code to production. The right one lets you deploy 50 times a day with confidence; the wrong one turns a 5-second bug into a 5-hour outage.

The four main strategies

Rolling deploy

Replace instances one at a time. Default in Kubernetes and ECS.

State 1: [v1] [v1] [v1] [v1] [v1]
State 2: [v2] [v1] [v1] [v1] [v1]
State 3: [v2] [v2] [v1] [v1] [v1]
...
State 6: [v2] [v2] [v2] [v2] [v2]

Pros: simple, no extra resources, gradual.
Cons: rollback = another rolling deploy. During deploy, both versions are live — code must be backwards-compatible.

Blue/green

Two complete environments. Blue runs live; deploy v2 to green, test, switch traffic. Old version stays warm for instant rollback.

Pros: instant rollback (just flip traffic), test new version with production-like conditions before cutover.
Cons: 2x resources during deploy, sudden cutover.

Canary

Send small percentage of traffic to new version. Watch metrics. Ramp up if good, roll back if bad.

State 1: v1 (100%) | v2 (0%)
State 2: v1 (99%)  | v2 (1%)   ← measure
State 3: v1 (90%)  | v2 (10%)  ← measure
State 4: v1 (50%)  | v2 (50%)
State 5: v1 (0%)   | v2 (100%)

Pros: small blast radius, real production traffic, automated promotion based on metrics.
Cons: requires sophisticated traffic management and metric analysis.
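
The promotion logic at the heart of a canary can be sketched as one decision function. The step list and tolerance are assumptions, and real controllers typically compare several metrics (latency, saturation) over a minimum sample size, not just error rate.

```typescript
interface CanaryMetrics {
  baselineErrorRate: number; // v1, as a fraction (0.01 = 1%)
  canaryErrorRate: number;   // v2, same units
}

interface CanaryDecision {
  action: "promote" | "rollback" | "done";
  percent: number; // traffic share v2 should receive next
}

const STEPS = [1, 10, 50, 100];

function nextStep(currentPercent: number, m: CanaryMetrics, tolerance: number): CanaryDecision {
  // Roll back if the canary is meaningfully worse than the baseline.
  if (m.canaryErrorRate > m.baselineErrorRate + tolerance) {
    return { action: "rollback", percent: 0 };
  }
  const remainingSteps = STEPS.filter((s) => s > currentPercent);
  if (remainingSteps.length === 0) return { action: "done", percent: 100 };
  return { action: "promote", percent: remainingSteps[0] };
}

const healthy: CanaryMetrics = { baselineErrorRate: 0.01, canaryErrorRate: 0.011 };
const degraded: CanaryMetrics = { baselineErrorRate: 0.01, canaryErrorRate: 0.05 };

console.log(nextStep(1, healthy, 0.005));   // promote to 10
console.log(nextStep(10, degraded, 0.005)); // rollback to 0
console.log(nextStep(100, healthy, 0.005)); // done
```

Comparing against the live baseline rather than a fixed threshold matters: if v1 is also erroring because a dependency is down, the canary should not take the blame.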

Feature flag

Code is deployed to all instances but disabled. Flip the flag to enable for cohorts.

Pros: deployment decoupled from release. Instant rollback (flip flag off). Targeted releases (1%, beta users, specific tenants).
Cons: dual code paths in your app. Flag debt accumulates without discipline.
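
Percentage rollouts depend on a stable hash, so a user who sees the feature at 10% still sees it at 50%. A minimal sketch of the evaluation; `FlagConfig` and its fields are hypothetical, and hosted flag services handle this (plus targeting rules and audit logs) for you.

```typescript
// Stable hash into buckets 0..99. Determinism matters here, not cryptographic strength.
function hashToBucket(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

interface FlagConfig {
  name: string;
  killSwitch: boolean;    // true = off for everyone: the instant rollback
  rolloutPercent: number; // 0..100
  allowTenants: string[]; // always-on cohorts, e.g. beta customers
}

function isEnabled(flag: FlagConfig, userId: string, tenantId: string): boolean {
  if (flag.killSwitch) return false;
  if (flag.allowTenants.indexOf(tenantId) !== -1) return true;
  // Hash flag name + user so each flag gets its own independent, stable cohort.
  return hashToBucket(flag.name + ":" + userId) < flag.rolloutPercent;
}

const newCheckout: FlagConfig = {
  name: "new-checkout",
  killSwitch: false,
  rolloutPercent: 10,
  allowTenants: ["acme"],
};

console.log(isEnabled(newCheckout, "user-1", "acme")); // true: allow-listed tenant
console.log(isEnabled({ ...newCheckout, killSwitch: true }, "user-1", "acme")); // false: kill switch wins
```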

Database migrations — the hard part

Code rolls back fast. Schemas don't. The mature pattern is expand-contract:

Expand — add new column/table; code writes old AND new, still reads old.
Backfill — populate new column for old rows.
Migrate reads — code reads new only.
Stop writing old — code writes new only.
Contract — drop old column/table.

Each step is independently rollback-safe. Slower (5 deploys instead of 1) but never breaks production.

IMPLEMENTATION: A safe column rename

Goal: rename users.username to users.handle.

-- Step 1 (deploy): add column, code reads username, writes both
ALTER TABLE users ADD COLUMN handle VARCHAR(50);

-- Step 2 (background job): backfill
UPDATE users SET handle = username WHERE handle IS NULL;

-- Step 3 (deploy): code reads handle, falls back to username; writes both
-- Step 4 (deploy): code reads handle only, still writes both
-- Step 5 (deploy): code writes handle only

-- Step 6 (deploy): drop column
ALTER TABLE users DROP COLUMN username;

What feels excessive is the level of caution. What's actually being avoided: hours of downtime if any step breaks. Each version of the code is compatible with the schema before AND after the next migration.

PITFALL: Migrations that lock the table

Production at 3 AM:

ALTER TABLE accounts ADD COLUMN preferences JSONB DEFAULT '{}';

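On Postgres 10 and earlier, this rewrites the entire table because of the default value. On a 500M-row table, that's an ACCESS EXCLUSIVE lock for hours. Every query against the table hangs. Postgres 11+ stores a constant default in the catalog and skips the rewrite, but a volatile default (such as clock_timestamp()) still rewrites the table, and even a fast ALTER waits behind long-running transactions while blocking every query queued after it.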

The fix:

-- step 1: add column without default (metadata-only, no table rewrite)
ALTER TABLE accounts ADD COLUMN preferences JSONB;

-- step 2: backfill in batches (skip rows that are already populated)
UPDATE accounts SET preferences = '{}'
WHERE id BETWEEN 1 AND 100000 AND preferences IS NULL;
-- repeat in chunks; vacuum between

-- step 3: add default for new rows
ALTER TABLE accounts ALTER COLUMN preferences SET DEFAULT '{}';

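-- step 4: NOT NULL after backfill complete
-- (this scans the whole table under an exclusive lock; on Postgres 12+ a
--  validated CHECK (preferences IS NOT NULL) constraint lets it skip the scan)
ALTER TABLE accounts ALTER COLUMN preferences SET NOT NULL;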

Tools that catch this in CI: strong_migrations (Rails), squawk (Postgres SQL linter).
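
The "repeat in chunks" step is usually driven by a small script. A minimal sketch of the batching logic only, assuming integer ids and a hypothetical driver that accepts a statement text plus positional values (no database client is used here):

```javascript
// Yields inclusive [start, end] id ranges covering minId..maxId.
function* idBatches(minId, maxId, batchSize) {
  for (let start = minId; start <= maxId; start += batchSize) {
    yield [start, Math.min(start + batchSize - 1, maxId)];
  }
}

// Builds the parameterized statement for one batch (values passed separately).
function backfillStatement([start, end]) {
  return {
    text: "UPDATE accounts SET preferences = '{}' WHERE id BETWEEN $1 AND $2 AND preferences IS NULL",
    values: [start, end],
  };
}
```

Running each batch in its own short transaction, with a pause between batches, keeps lock times and replica lag small.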

Smoke tests after deploy

Don't trust your deploy was successful just because the pipeline turned green. Run smoke tests against the deployed environment:

# after deploy completes
- name: Smoke test
  run: |
    curl -f https://api.example.com/health || exit 1
    curl -f https://api.example.com/api/version | grep "${{ github.sha }}" || exit 1
    npm run test:smoke -- --baseUrl=https://api.example.com

If smoke tests fail, automated rollback. The pipeline that deploys must own the verification.

Rollback discipline

Every deploy must be rollback-able to the previous version with one command.
Database migrations must be backwards-compatible during the deploy window.
Roll back first, investigate second. Don't let users suffer while you debug.
Automate rollback triggers — error rate > X for > Y minutes = automatic rollback.
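
The "error rate > X for > Y minutes" trigger reduces to a small predicate. A minimal sketch, assuming per-minute error-rate samples are already collected by your monitoring system; the function name and sample shape are hypothetical:

```javascript
// samples: per-minute error rates (0..1), oldest first.
function shouldRollback(samples, threshold, sustainedMinutes) {
  if (samples.length < sustainedMinutes) return false; // not enough data yet
  // Roll back only if every one of the most recent N minutes breached the threshold.
  return samples.slice(-sustainedMinutes).every((rate) => rate > threshold);
}
```

Requiring a sustained breach rather than a single bad minute avoids rolling back on a transient blip.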

The deploy frequency lever

Counterintuitive but proven: deploy more often, not less. Small frequent deploys mean:

Each deploy has fewer changes — easier to debug if something breaks.
Rollbacks are smaller — less to undo.
Confidence builds — engineers trust the pipeline.
Issues surface fast — caught while context is fresh.

Top-tier teams deploy multiple times per day per service. Big batched releases (weekly, monthly) accumulate risk and turn each deploy into an event.

Deploy strategy is a question of "how do you reduce blast radius when something goes wrong?" Rolling for routine, blue/green for instant rollback on critical changes, canary for risky ones, feature flags to decouple deploy from release. Database migrations always follow expand-contract. The unifying principle: every step is independently rollback-safe. If you can't roll back, you don't really have a deploy strategy.

// SECTION_13

Databases (ops perspective)

Databases from the infra side: choosing managed vs self-hosted, configuring for production, monitoring, backups, and the operational concerns that don't appear in the schema.

Managed vs self-hosted

Managed (RDS, Cloud SQL, Aurora, PlanetScale): cloud provider handles backups, upgrades, replication, failover.

Self-hosted (Postgres on EC2, on Kubernetes): you handle everything.

For 95% of teams in 2026, use managed. The premium pays for itself in operational time saved. Self-host only when you have specific reasons:

Regulatory requirement (data residency, audit access).
Specific extensions not supported by managed offerings.
Cost at very large scale (managed becomes expensive over ~$100k/month DB spend).
You have a dedicated DBA team.

RDS configuration that matters

Setting	What it means
Multi-AZ	Sync replica in another AZ. Auto-failover on primary failure. Always on for production.
Read replicas	Async replicas for read scaling. Up to 15 per primary in RDS.
Automated backups	Daily snapshots + continuous WAL archive. Enables point-in-time recovery.
Backup retention	1-35 days. Set to maximum (35) unless you have a reason not to.
Maintenance window	When AWS applies patches. Pick low-traffic time.
Performance Insights	Free tier exists. Enable it. Shows what's slow.
Encryption at rest	KMS-encrypted. Enable on every prod DB.
Deletion protection	Prevents accidental deletion. Enable on prod.

Connection limits

Postgres backs each connection with a process. Too many connections = high memory, context-switching overhead, performance collapse.

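RDS derives the default max_connections from instance memory (for PostgreSQL, LEAST(DBInstanceClassMemory / 9531392, 5000)), so the default ceiling is roughly:

db.t3.medium (4 GiB): ~400 connections.
db.r6g.large (16 GiB): ~1,700 connections.
db.r6g.4xlarge (128 GiB): 5,000 connections (the cap).

These are ceilings, not targets.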

Most apps don't need 5000 connections. Most apps need 50, achieved through connection pooling.
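
A quick way to sanity-check pool settings is to divide the connection budget across app instances. A minimal sketch; the parameter names are hypothetical and the numbers in the usage note are only an example:

```javascript
// Largest per-instance pool size that keeps total connections within budget.
function poolSizePerInstance({ maxConnections, reservedForAdmin, appInstances }) {
  const usable = maxConnections - reservedForAdmin; // leave room for migrations, psql, monitoring
  if (usable <= 0 || appInstances <= 0) return 0;
  return Math.floor(usable / appInstances);
}
```

With a budget of 200 connections, 20 reserved, and 12 app instances, each instance can hold a pool of 15. If autoscaling can triple the instance count, the per-instance pool has to shrink accordingly or a pooler has to sit in between.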

PgBouncer / RDS Proxy

A connection pooler sits between your app and Postgres. Apps open lots of "virtual" connections; the pooler maps them onto a small number of real connections. Especially important for serverless (Lambda) where each invocation might want a fresh connection.

5000 Lambda invocations
  ↓ (each has its own connection pool of, say, 5)
[ RDS Proxy ]
  ↓ (multiplexes to a small real pool)
20 actual connections to Postgres

RDS Proxy is AWS-managed; PgBouncer is self-hosted. Both work.

Backups and disaster recovery

Two metrics:

RPO (Recovery Point Objective) — how much data are you willing to lose? "We're OK losing 5 minutes" → RPO 5 minutes.
RTO (Recovery Time Objective) — how long can you be down? "We must be back in 1 hour" → RTO 1 hour.

Tighter RPO/RTO = more expensive infrastructure. Set them based on actual business impact, not "as low as possible."

Backup strategy

Automated daily snapshots (35-day retention).
Continuous WAL archive for point-in-time recovery.
Cross-region snapshots for disaster recovery — encrypted with separate keys.
Quarterly restore drill — actually restore from backup. Untested backups don't exist.

Slow query analysis

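-- enable slow query logging (self-hosted; on RDS set this in the DB parameter
-- group instead, since ALTER SYSTEM is not permitted there)
ALTER SYSTEM SET log_min_duration_statement = 1000;  -- log queries > 1s (value is in ms)
SELECT pg_reload_conf();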

-- find current long-running queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds';

-- find queries with most total time (Postgres 13+ column names; older versions use total_time / mean_time)
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

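The pg_stat_statements extension tracks per-query stats; it has to be listed in shared_preload_libraries and enabled with CREATE EXTENSION pg_stat_statements. Essential for separating queries that are consistently expensive from ones that are only occasionally slow.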

REAL-WORLDDatabase operations checklist for a typical app

Multi-AZ enabled — automatic failover.
One read replica for analytics queries (so they don't impact OLTP).
Backup retention: 35 days. Cross-region copy for DR.
Encryption at rest with KMS.
Connection pooling via RDS Proxy or PgBouncer.
Performance Insights enabled, dashboards in place.
Slow query log at 1s threshold, sent to CloudWatch.
Alerts: CPU > 80%, free memory < 10%, connection count > 80% of max, replica lag > 60s, free storage < 20%.
Quarterly restore drill — pull a backup, restore to a temp instance, verify data.
Schema migrations run via expand-contract pattern, reviewed by a DBA or senior engineer.
Deletion protection on prod.
Tags for cost allocation (env, team, app).
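
The alert thresholds in the checklist can be expressed as data so they are reviewed alongside code. A minimal sketch, assuming a hypothetical metrics snapshot object; in practice these would be CloudWatch alarms rather than application code:

```javascript
// Thresholds copied from the checklist; metric field names are hypothetical.
const DB_ALERTS = [
  { name: 'cpu', breached: (m) => m.cpuPercent > 80 },
  { name: 'memory', breached: (m) => m.freeMemoryPercent < 10 },
  { name: 'connections', breached: (m) => m.connections / m.maxConnections > 0.8 },
  { name: 'replica_lag', breached: (m) => m.replicaLagSeconds > 60 },
  { name: 'storage', breached: (m) => m.freeStoragePercent < 20 },
];

// Returns the names of every alert the snapshot breaches.
function breachedAlerts(metrics) {
  return DB_ALERTS.filter((alert) => alert.breached(metrics)).map((alert) => alert.name);
}
```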

NoSQL alternatives

	Best for
DynamoDB	Serverless, predictable access patterns, scaling to massive throughput
MongoDB / DocumentDB	Flexible schemas, document-shaped data
Cassandra / Scylla	Write-heavy, time-series, multi-region
Redis (as primary)	Real-time apps, leaderboards, caching as data store

For most apps in 2026, the answer is still Postgres + Redis cache. NoSQL when you have specific reasons. "We might need to scale" isn't a specific reason.

Database operations is mostly discipline. Multi-AZ, encrypted backups, monitoring, query analysis, and tested restores. The work is unglamorous but the failure mode is catastrophic. Most database "outages" are actually issues that should have been caught months earlier — slow queries that grew worse, replica lag that crept up, free storage that quietly approached zero.

// SECTION_14

Caching and queues

Caching speeds up reads. Queues smooth out writes. Together they're the two infrastructure patterns that scale most apps further than anything else.

Redis — the universal cache

In-memory key-value store. Sub-millisecond latency. Used for caching, sessions, rate limiting, pub/sub, locks, queues, leaderboards.

# basic operations
SET user:123 '{"name":"Alex"}' EX 300       # set with 5min TTL
GET user:123
DEL user:123

# atomic counters (rate limiting)
INCR rate:user:123:requests
EXPIRE rate:user:123:requests 60    # only when INCR returned 1; otherwise every request resets the window

# hashes for partial updates
HSET user:123 name "Alex" email "a@b.com"
HGET user:123 name

# sorted sets (leaderboards)
ZADD scores 100 "user:1" 200 "user:2"
ZREVRANGE scores 0 9              # top 10

# pub/sub
PUBLISH channel "message"
SUBSCRIBE channel
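
The INCR + EXPIRE pair above implements a fixed-window rate limiter. A minimal in-memory sketch of the same logic, with a Map standing in for Redis and the current time injected so the behavior is easy to test; createRateLimiter is a hypothetical name:

```javascript
// In-memory stand-in for Redis INCR + EXPIRE; `now` is in milliseconds.
function createRateLimiter(limit, windowMs) {
  const counters = new Map(); // key -> { count, expiresAt }
  return function allow(key, now) {
    const entry = counters.get(key);
    if (!entry || now >= entry.expiresAt) {
      // First request in a new window: equivalent to INCR returning 1, then EXPIRE.
      counters.set(key, { count: 1, expiresAt: now + windowMs });
      return true;
    }
    entry.count += 1; // INCR on an existing key leaves the TTL untouched
    return entry.count <= limit;
  };
}
```

The window starts at the first request and is not extended by later ones, which is why the expiry is set only when the counter is created.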

AWS ElastiCache — managed Redis

Two flavors:

Cluster mode disabled — single primary + read replicas. Simpler, scales to ~100GB.
Cluster mode enabled — sharded across nodes. Scales horizontally, but client must support clustering.

For most apps, cluster mode disabled with a few replicas is plenty.

Cache patterns

Cache-aside (lazy loading)

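async function getUser(id) {
  const cached = await redis.get(`user:${id}`);
  if (cached) return JSON.parse(cached);

  // cache miss: load from the database, then populate the cache
  const user = await db.query('SELECT * FROM users WHERE id = ?', id);
  if (user) await redis.setex(`user:${id}`, 300, JSON.stringify(user));  // 5 min TTL; don't cache a missing row
  return user;
}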

Most common pattern. Simple, handles cache misses gracefully. Downside: brief cache miss after invalidation.

Write-through

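async function updateUser(id, data) {
  // db.update is assumed to return the full updated row; caching only `data`
  // would store a partial user whenever the update is partial
  const user = await db.update('users', id, data);
  await redis.setex(`user:${id}`, 300, JSON.stringify(user));
}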

Cache always fresh. Costs an extra write per update.

Write-behind

Writes go to cache; cache async-writes to DB. Fast but risk of data loss if cache fails. Niche.

Cache invalidation — the hard problem

The famous quote applies: "There are only two hard things in computer science: cache invalidation and naming things." (Phil Karlton)

Strategies:

TTL only — accept staleness up to TTL. Simplest.
Invalidate on write — delete the cached value when the source changes.
Versioned keys — bump a version number on write; reads include version. Old versions age out.
Pub/sub invalidation — emit invalidation messages for distributed cache layers.
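
Versioned keys are the least obvious of these strategies. A minimal in-memory sketch, with Maps standing in for the cache and the version counter; the key format and function names are hypothetical:

```javascript
// In-memory stand-ins for the cache and the per-entity version counter.
const versions = new Map(); // entity id -> current version
const cache = new Map();    // versioned key -> value

function versionedKey(id) {
  return `user:${id}:v${versions.get(id) ?? 0}`;
}
function readCached(id) {
  return cache.get(versionedKey(id)); // undefined on a miss
}
function writeCached(id, value) {
  cache.set(versionedKey(id), value);
}
function invalidate(id) {
  // Bumping the version makes every old key unreachable; in a real cache
  // the orphaned entries age out via TTL.
  versions.set(id, (versions.get(id) ?? 0) + 1);
}
```

Invalidation becomes a single counter increment, at the cost of one extra lookup per read to fetch the current version.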

Stampede prevention

Cache expires; thousands of requests hit the DB simultaneously trying to refresh it. The DB collapses.

Solutions:

Locks — only one process refreshes the cache; others wait.
Probabilistic early expiration — refresh slightly before actual expiry, randomized.
Stale-while-revalidate — serve stale value, refresh in background.
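
Probabilistic early expiration is commonly implemented in the style of the XFetch algorithm. A minimal sketch under that assumption, with the random source injected so it can be tested; the function and parameter names are hypothetical:

```javascript
// Returns true when this request should refresh the value before it expires.
// Times are in milliseconds.
//   recomputeMs: roughly how long a refresh takes
//   beta: values above 1 refresh more eagerly
//   rand: must return a value in [0, 1]; injected for testability
function shouldRefreshEarly(now, expiresAt, recomputeMs, beta = 1, rand = Math.random) {
  const r = rand() || Number.MIN_VALUE; // guard against log(0)
  // -log(r) is >= 0 and usually small, so early refreshes are rare far from
  // expiry and increasingly likely as expiry approaches.
  return now - recomputeMs * beta * Math.log(r) >= expiresAt;
}
```

Because each request draws its own random number, the refresh is spread across time instead of every request discovering the expiry in the same instant.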

REAL-WORLDLayered caching for an API endpoint

Endpoint: GET /api/products/:id. Hit hard, data changes infrequently.

Layer 1 — CDN (CloudFront)
  - Cache-Control: public, max-age=60, stale-while-revalidate=300
  - Most requests served from edge. Origin sees a fraction.

Layer 2 — Application cache (Redis)
  - Key: product:{id}, TTL 5 minutes
  - Origin DB sees only cache misses

Layer 3 — Application in-process cache
  - Local LRU, 10s TTL
  - Avoid even Redis trips for hot products in the same instance

Layer 4 — Database (Postgres)
  - With proper indexing on products.id

The result: a viral product hits the DB once every 5 minutes per region, regardless of how popular it gets.

Invalidation when a product changes:

Update DB.
DEL product:{id} in Redis (next request will repopulate).
CloudFront invalidation (or use surrogate keys with Fastly/Cloudflare).
In-process caches expire on their own within 10s.
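
Layer 3 is the only layer you write yourself. A minimal sketch of a local LRU with TTL, assuming a small entry count and an injectable clock; LocalCache is a hypothetical name, not a library class:

```javascript
// Small LRU cache with per-entry TTL. Map preserves insertion order,
// so the first key is always the least recently used.
class LocalCache {
  constructor(maxEntries, ttlMs, clock = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.clock = clock;
    this.entries = new Map();
  }
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.clock() >= entry.expiresAt) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert to mark as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }
  set(key, value) {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // Evict the least recently used entry.
      this.entries.delete(this.entries.keys().next().value);
    }
    this.entries.set(key, { value, expiresAt: this.clock() + this.ttlMs });
  }
}
```

A 10s TTL bounds staleness without any cross-instance invalidation, which is why this layer needs no purge step.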

Message queues — SQS, RabbitMQ, Kafka

	Best for	Notes
SQS	AWS-native, simple work queue	Standard or FIFO, automatic retries with DLQ
SNS	Pub/sub fanout	Combined with SQS for filtered queues
RabbitMQ	Complex routing, AMQP	Self-hosted or Amazon MQ
Kafka	Event streaming, durable log	Self-hosted or MSK; more complex than SQS
Redis Streams	Lightweight streaming	If you already have Redis

SQS basics

// producer
import { SQSClient, SendMessageCommand, ReceiveMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';
const sqs = new SQSClient({});
await sqs.send(new SendMessageCommand({
  QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123/my-queue',
  MessageBody: JSON.stringify({ userId: 123, action: 'send_email' }),
}));

// consumer (polls in a loop, or use Lambda trigger)
const { Messages } = await sqs.send(new ReceiveMessageCommand({
  QueueUrl: '...',
  MaxNumberOfMessages: 10,
  WaitTimeSeconds: 20,  // long polling
}));

for (const msg of Messages || []) {
  try {
    await processMessage(JSON.parse(msg.Body));
    await sqs.send(new DeleteMessageCommand({
      QueueUrl: '...',
      ReceiptHandle: msg.ReceiptHandle,
    }));
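  } catch (err) {
    // don't delete; will be redelivered after visibility timeout
    // SQS moves it to the DLQ once maxReceiveCount is exceeded (if a redrive policy is set)
    console.error('failed to process message', msg.MessageId, err);
  }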
}

Dead-letter queues

Messages that fail repeatedly should go to a DLQ — a separate queue for failed messages. Manual investigation, optional reprocessing.

# SQS configuration
{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:my-queue-dlq",
    "maxReceiveCount": 5
  }
}

After 5 failures, the message moves to the DLQ. Alarms watch DLQ depth — non-zero = bug.

Idempotency

Most queues guarantee at-least-once delivery. SQS standard queues can deliver a message more than once, and FIFO queues only deduplicate within a 5-minute window. Consumers must be idempotent: processing the same message twice should be safe.

async function processMessage(msg) {
  const { idempotencyKey, action, data } = msg;

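  // check if we've already processed this
  // (check-then-act is not atomic; concurrent duplicates need an atomic claim,
  //  e.g. Redis SET NX or a unique constraint on the key)
  if (await processed.has(idempotencyKey)) {
    return;  // skip
  }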

  await doTheWork(action, data);
  await processed.set(idempotencyKey, true);
}
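
The check followed by a later set can race when two consumers receive the same message at once. A minimal sketch of the atomic-claim variant, using an in-memory Set as a stand-in for an atomic "set if not exists" and a synchronous handler for brevity; all names are hypothetical:

```javascript
// In-memory stand-in for an atomic "set if not exists" (Redis SET key value NX,
// or an INSERT guarded by a unique constraint).
const claimed = new Set();
function claim(key) {
  if (claimed.has(key)) return false;
  claimed.add(key); // single-threaded JS makes has+add atomic here
  return true;
}

// Runs `handler` at most once per idempotency key; releases the claim on
// failure so a redelivery can retry.
function processOnce(msg, handler) {
  if (!claim(msg.idempotencyKey)) return 'duplicate';
  try {
    handler(msg);
    return 'processed';
  } catch (err) {
    claimed.delete(msg.idempotencyKey);
    throw err;
  }
}
```

Claiming before doing the work closes the race; releasing on failure keeps the at-least-once retry path intact.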

Event-driven vs queue-based

Queue (SQS) — point-to-point, one consumer per message. Work distribution.
Pub/sub (SNS) — one publisher, many subscribers, all see every message. Event broadcasting.
Stream (Kafka) — durable log, multiple consumers replay history independently. Event sourcing.

Caching and queueing are how you decouple performance from correctness. Cache reads to make them fast; queue writes to make them resilient. Both introduce eventual consistency — the trade-off you're making is "user sees stale data briefly" or "operation completes eventually" in exchange for scaling that wouldn't otherwise be possible. The mature design picks per-feature: instant for cart, queued for emails, cached for product listings.

// SECTION_15

CDN and edge

A CDN is just servers near users. Edge computing pushes more than caching to those servers — actually running code there. The line between CDN and edge platform is blurring.

What CDNs do

You point your domain at the CDN (CNAME or anycast IP).
User's DNS lookup returns an IP near them.
Request hits the closest CDN edge node.
If cached, edge serves directly (10-50ms).
If not, edge fetches from origin and caches per the response headers.

What CDNs cache

By default:

Static assets (JS, CSS, images) with long TTLs.
Anything with Cache-Control: public, max-age=....
Responses to GET (and sometimes HEAD) requests.

By default, NOT:

Anything with cookies (varies by configuration).
Anything with Cache-Control: private or no-store.
POST/PUT/DELETE requests.

Cache headers — the actual control

Cache-Control: public, max-age=31536000, immutable
# Static asset that never changes (hashed filename). Cache forever.

Cache-Control: public, max-age=60, stale-while-revalidate=600
# API response. Fresh for 60s; if requested 60-660s later, serve stale
# while fetching fresh in background. Best of both worlds.

Cache-Control: private, max-age=300
# User-specific. Browser cache only, not shared (CDN won't cache).

Cache-Control: no-store
# Don't cache anywhere. For sensitive data.

Cache-Control: public, max-age=0, must-revalidate
# Always check with server before using. Negotiated by ETag.

ETag: "v3-abc123"
# Validation. Browser sends If-None-Match: "v3-abc123" on next request.
# Server compares; if match, returns 304 Not Modified (no body).

Vary: Accept-Encoding, Authorization
# Cache key includes these headers. Different cached versions per value.

The combination max-age + stale-while-revalidate is magic for APIs. Users get fast cached responses; refreshes happen in background.
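
How a cache treats a response under these two directives can be written as a three-way classification. A minimal sketch of the semantics (not any CDN's actual implementation); cacheState is a hypothetical name:

```javascript
// Classifies a cached response by age, following max-age and
// stale-while-revalidate semantics. All values are in seconds.
function cacheState(ageSeconds, maxAge, staleWhileRevalidate = 0) {
  if (ageSeconds < maxAge) return 'fresh';                        // serve from cache
  if (ageSeconds < maxAge + staleWhileRevalidate) return 'stale'; // serve, refresh in background
  return 'expired';                                               // must fetch before serving
}
```

For max-age=60, stale-while-revalidate=600, a response that is 120 seconds old is served immediately while a background fetch replaces it.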

CloudFront vs Cloudflare vs Fastly

	CloudFront	Cloudflare	Fastly
Best for	AWS-native apps	Most websites, DDoS protection	Real-time invalidation, advanced caching
DDoS	Shield Standard included	Industry-leading	Good
Edge compute	CloudFront Functions, Lambda@Edge	Workers (V8 isolates)	Compute@Edge (WASM)
Cost model	Per GB + per request	Often free tier; per request	Per request
Origin choices	Best within AWS	Anywhere	Anywhere
Setup complexity	Higher	Lower	Medium

For AWS-hosted apps: CloudFront is fine. For websites with worldwide audience: Cloudflare is hard to beat. For complex CDN logic with instant cache purges: Fastly.

Edge compute

Run code at edge nodes near users. Use cases:

Authentication checks — reject unauth'd requests before they hit your origin.
A/B test routing — split traffic at the edge.
Header manipulation — rewrite URLs, add security headers.
Geolocation routing — different content per country.
Personalization — small dynamic pieces without full origin fetch.

// Cloudflare Worker
export default {
  async fetch(request) {
    const country = request.cf.country;
    const url = new URL(request.url);
    if (country === 'GB') {
      url.hostname = 'uk.example.com';
    }
    return fetch(url, request);
  },
};

Origin shielding

The CDN has many edge locations. Without shielding, a cache miss in each location separately fetches from origin. With shielding, edges fetch through one or two "shield" locations that aggregate misses.

Practical effect: origin sees one cache miss per shield, not one per edge location.

Cache purging

Time-based — wait for TTL to expire. Simple but slow.
URL-based purge — invalidate specific paths. Available in all major CDNs.
Tag-based purge (Fastly surrogate keys, Cloudflare cache tags) — tag responses, purge by tag. Best for invalidating "all pages featuring product X."
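
Tag-based purging amounts to a reverse index from tag to cached URLs. A minimal in-memory model of the idea, not of any CDN's API; TaggedCache and its methods are hypothetical:

```javascript
// Each cached URL carries a set of tags; purging a tag removes every URL carrying it.
class TaggedCache {
  constructor() {
    this.bodies = new Map();    // url -> cached body
    this.urlsByTag = new Map(); // tag -> Set of urls
  }
  put(url, body, tags) {
    this.bodies.set(url, body);
    for (const tag of tags) {
      if (!this.urlsByTag.has(tag)) this.urlsByTag.set(tag, new Set());
      this.urlsByTag.get(tag).add(url);
    }
  }
  get(url) {
    return this.bodies.get(url);
  }
  purgeTag(tag) {
    let purged = 0;
    for (const url of this.urlsByTag.get(tag) ?? []) {
      if (this.bodies.delete(url)) purged += 1;
    }
    this.urlsByTag.delete(tag);
    return purged; // number of cached URLs actually removed
  }
}
```

One purge call for a product tag clears the product page, the listing pages, and the home page without the application tracking their URLs.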

REAL-WORLDCDN setup for a typical SaaS

The pattern:

Marketing site (example.com): Cloudflare with aggressive caching. Static-generated, full edge cache.
Web app (app.example.com): CloudFront in front of ALB. Static assets cached forever; HTML cached briefly; API responses pass-through.
API (api.example.com): CloudFront with auth-aware caching. Most endpoints don't cache; a few public ones cached briefly.
User uploads (uploads.example.com): CloudFront in front of S3. Long TTL on hashed filenames.

Why two CDNs: marketing benefits from Cloudflare's free tier and DDoS protection; the app stays in AWS for tighter integration with ALB/Lambda.

CDN is the cheapest performance lever you have. Static assets at edge = sub-50ms globally. Even API responses cache surprisingly well with stale-while-revalidate. The trick is being explicit about what's cacheable — public vs private, what varies, what TTL. Once configured, the CDN handles a huge percentage of your traffic, your origin scales further, and users get a faster experience.

// SECTION_16

Cost management

Cloud bills get out of control quickly when nobody's watching. Cost optimization is a real discipline — most companies waste 20-40% of their cloud spend on the wrong things.

Where the money actually goes

Typical cloud spend distribution for a web app:

Compute (EC2/Fargate/Lambda) — 30-50%
Database (RDS) — 15-30%
Data transfer — 10-20% (often hidden until you look)
Storage (S3, EBS) — 5-15%
NAT gateways — surprisingly large for many setups
Other services — 5-15%

The biggest wasted-money patterns

1. Oversized instances

Most teams pick instance sizes by guess. Then never revisit. Run rightsizing analysis:

Look at p99 CPU/memory utilization over 30 days.
If always under 30%, the instance is too big.
Step down one size, watch metrics, step down again if safe.

AWS Compute Optimizer does this analysis. Trusted Advisor flags overprovisioned instances.
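
The rule of thumb above is easy to apply to exported utilization samples. A minimal sketch, assuming samples are percentages collected over the 30-day window and using a nearest-rank percentile; the function names are hypothetical:

```javascript
// Nearest-rank percentile over a non-empty array of utilization samples (0-100).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Applies the rule above: p99 under the threshold for both CPU and memory means "too big".
function rightsizingAdvice(cpuSamples, memSamples, threshold = 30) {
  const cpuP99 = percentile(cpuSamples, 99);
  const memP99 = percentile(memSamples, 99);
  return cpuP99 < threshold && memP99 < threshold ? 'step down one size' : 'keep';
}
```

Using p99 rather than the average keeps short peaks in the picture, so a service that idles at 5% but spikes to 80% is not downsized.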

2. Forgotten resources

EC2 instances spun up for testing, never terminated.
EBS volumes detached but not deleted ($0.10/GB/month).
Old snapshots from years ago.
Empty load balancers ($16/mo each).
Elastic IPs not associated with anything.
Dev/staging environments running 24/7.

3. Data transfer

The most opaque cost. Data transfer between AZs is $0.01/GB in each direction. Out to internet is $0.05-0.09/GB. NAT gateway processed traffic is $0.045/GB. Cross-region traffic is around $0.02/GB, varying by region pair.

For high-traffic apps: this can be 20% of the bill. The fixes:

VPC Endpoints for AWS services (S3, DynamoDB) — traffic stays inside AWS, no NAT.
CloudFront for outbound to internet — cheaper than direct from origin.
Single-AZ for non-critical workloads (avoid cross-AZ transfer).

4. NAT Gateway misuse

NAT Gateways cost $0.045/hour ($32/month) plus $0.045/GB processed. A multi-AZ deployment runs one per AZ, so typically two or three.

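If your private subnets just need to reach AWS services (S3, DynamoDB, ECR), use VPC Endpoints instead. Gateway endpoints (S3, DynamoDB) are free; interface endpoints (ECR and most other services) bill per hour and per GB but are usually much cheaper than NAT processing at volume. Either way, NAT is removed from the path.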
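
The arithmetic is worth running against your own traffic numbers. A minimal sketch using the prices quoted above as inputs (they vary by region and change over time) and assuming 730 hours per month; the function names are hypothetical:

```javascript
// Prices quoted in the text above (USD); treat them as inputs, not constants.
const NAT_HOURLY = 0.045;
const NAT_PER_GB = 0.045;
const HOURS_PER_MONTH = 730;

// Monthly NAT cost for `gateways` gateways processing `gbProcessed` GB in total.
function natMonthlyCost(gateways, gbProcessed) {
  return gateways * NAT_HOURLY * HOURS_PER_MONTH + gbProcessed * NAT_PER_GB;
}

// Savings from moving `gbToAwsServices` GB of S3/DynamoDB traffic onto gateway
// endpoints, which carry no hourly or per-GB charge.
function gatewayEndpointSavings(gbToAwsServices) {
  return gbToAwsServices * NAT_PER_GB;
}
```

Three gateways processing 1,000 GB come to about $144/month, most of it the hourly charge; at tens of terabytes the per-GB term dominates and the endpoint pays off quickly.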

5. CloudWatch Logs ingestion

$0.50/GB ingested. Verbose logging at scale is expensive.

Fixes:

Set log levels appropriately (INFO in prod, DEBUG only when debugging).
Export to S3 + Glacier for long-term storage instead of keeping in CloudWatch.
Filter noisy logs at the source.

Savings Plans and Reserved Instances

For steady workloads, commit to spending and save 30-72%.

Savings Plans — commit to $X/hour of compute spending. Flexible across instance types/regions/services. 1 or 3 year terms.
Reserved Instances — commit to specific instance types. Less flexible, slightly higher discount.

Most teams should use Savings Plans for ECS/Fargate/EC2/Lambda baseline. Even 1-year no-upfront saves ~30%.

Spot instances

EC2 capacity sold at 60-90% discount, with 2-minute notice if AWS reclaims them. Right for:

Batch processing.
Stateless web servers (with autoscaling backup).
CI/CD runners.
Dev/test environments.

Wrong for: stateful workloads, anything that can't tolerate sudden termination.

S3 storage classes

Class	Cost	Best for
Standard	$0.023/GB	Hot data, frequent access
Standard-IA (Infrequent Access)	$0.0125/GB	Backups, less-frequent access
One Zone-IA	$0.01/GB	Recreatable data, single AZ
Glacier Instant Retrieval	$0.004/GB	Archives, instant access when needed
Glacier Deep Archive	$0.00099/GB	Long-term archives, hours retrieval

Lifecycle policies automate transitions: "logs > 30 days → IA, > 90 days → Glacier."
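
The effect of such a rule on the bill can be estimated before applying it. A minimal sketch using the per-GB-month prices from the table above and assuming "Glacier" in the example rule means Glacier Instant Retrieval; the names are hypothetical, and retrieval and transition fees are ignored:

```javascript
// Mirrors the example rule: > 30 days -> Standard-IA, > 90 days -> Glacier.
const PRICE_PER_GB = { STANDARD: 0.023, STANDARD_IA: 0.0125, GLACIER_IR: 0.004 };

function storageClassForAge(ageDays) {
  if (ageDays > 90) return 'GLACIER_IR';
  if (ageDays > 30) return 'STANDARD_IA';
  return 'STANDARD';
}

// objects: [{ sizeGb, ageDays }]; returns the monthly storage cost in USD.
function monthlyCost(objects) {
  return objects.reduce(
    (sum, o) => sum + o.sizeGb * PRICE_PER_GB[storageClassForAge(o.ageDays)],
    0
  );
}
```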

Tagging for cost allocation

tags:
  Environment: production
  Team: platform
  Cost-Center: engineering
  Application: api
  ManagedBy: terraform

Cost Explorer can break down spend by tag. Now you can see "Team X spent $50k this month, of which $30k was on dev environments."
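
The same breakdown can be reproduced from an exported cost report. A minimal sketch, assuming a hypothetical line-item shape of { cost, tags }; this is not the Cost Explorer API:

```javascript
// lineItems: [{ cost, tags: { Team, Environment, ... } }]
function spendByTag(lineItems, tagKey) {
  const totals = {};
  for (const item of lineItems) {
    const value = item.tags?.[tagKey] ?? 'untagged'; // surface untagged spend explicitly
    totals[value] = (totals[value] ?? 0) + item.cost;
  }
  return totals;
}
```

Keeping an explicit "untagged" bucket makes tagging gaps visible instead of silently dropping that spend from the report.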

REAL-WORLDA real cost optimization audit

SaaS company, AWS bill: $42k/month. Engineering decided to look.

Findings after one week of investigation:

Three idle RDS instances from old projects. $1,800/mo. Deleted.
EBS snapshots from 2022 still around. 8TB worth. $400/mo. Lifecycle rule added.
Overprovisioned ECS service running at 12% CPU. Halved instance count. $2,200/mo savings.
NAT Gateway processed ~30TB/month (about $1,400 at $0.045/GB). Investigated — it was S3 traffic from private subnet. Added VPC Endpoint. $1,400/mo savings.
Dev/staging ECS running 24/7. Auto-shutdown weekends/nights. $1,800/mo savings.
Reserved Instances expired 6 months ago, never renewed. New 1-year Savings Plan. $4,500/mo savings.
CloudWatch Logs ingesting 200GB/day of debug logs. Reduced to INFO. $2,500/mo savings.

Total: $14,600/mo savings. ~35% of the bill, with no architectural changes. One engineer-week of work.

The cost discipline

Tags on everything. Cost allocation reports monthly.
Budgets with alerts. AWS Budgets emails when projected spend exceeds threshold.
Quarterly rightsizing review. Run Compute Optimizer; act on findings.
Savings Plans for baseline workloads.
Auto-shutdown non-production environments.
Regular cleanup sweeps — old snapshots, detached volumes, unused IPs.
One person owns cost — typically a senior infra engineer or finance partner.

Cost is a feature. Companies that ignore it accumulate waste; companies that own it free up budget for real engineering. The 80/20: tag everything, set budgets, rightsize quarterly, use Savings Plans, kill idle resources. Most teams find 20-30% savings with one focused week of work. The savings compound — once visibility is established, the team makes better choices going forward.

// SECTION_17

Security hardening

Infrastructure security is layered defense. Each layer assumes the others might fail. The senior posture: minimize blast radius, audit everything, automate the hard parts.

The shared responsibility model

AWS handles security of the cloud (physical, hypervisor, default service security). You handle security in the cloud (your configuration, code, data, access).

Don't assume "AWS handles security." Misconfigured S3 buckets, leaked IAM keys, open security groups — these are all your responsibility.

The high-leverage controls

1. IAM — least privilege

No root account use. Enable MFA, delete any root access keys, and lock the credentials away.
IAM users only when necessary; prefer SSO + IAM Identity Center.
Roles for services. Pods/tasks/Lambdas get roles, not access keys.
Policies scoped to specific resources, specific actions.
MFA enforced on all IAM users.
Quarterly access reviews — remove unused permissions.

2. Network — defense in depth

Public subnet only for load balancers and bastions.
Private subnets for everything else.
Security groups reference each other (web-sg → db-sg) not IP ranges.
VPC flow logs enabled. Stored in S3 for forensics.
WAF in front of public-facing ALBs.
No SSH access in production — use SSM Session Manager.

3. Data — encrypted at rest and in transit

EBS volumes encrypted (turn on "EBS encryption by default"; it is an opt-in, per-region account setting).
RDS encrypted at rest with KMS.
S3 buckets with default encryption (SSE-KMS for sensitive data).
TLS 1.2+ everywhere. ACM certificates with auto-renewal.
Internal service-to-service over TLS too for sensitive workloads.

4. Logging and audit

CloudTrail enabled, multi-region, with log file integrity validation.
VPC Flow Logs for network forensics.
Application logs to immutable storage.
S3 access logs for sensitive buckets.
Logs retained per compliance requirements (often 7 years for finance, 6 years for HIPAA).

5. Detection and response

GuardDuty enabled. ML-based threat detection.
Security Hub aggregates findings across services.
Inspector for vulnerability scanning of EC2/ECR.
Config to track resource configuration drift.
Alerts integrated with on-call rotation.

Public S3 buckets — the breach pattern

Many of the most-publicized AWS breaches involve public S3 buckets. Since April 2023, new buckets have Block Public Access enabled by default, but legacy buckets and misconfigurations still happen.

The defense:

Account-level public access block (overrides any bucket settings).
Bucket-level public access block.
S3 Storage Lens flags public buckets.
For "public" content (CDN-served images), use CloudFront with Origin Access Control (OAC, the successor to Origin Access Identity). Bucket stays private.

Secrets in code

Pre-commit hooks, GitHub push protection, and runtime scanning catch secrets in code. Detailed in the Secrets section.

Container security

Scan images for CVEs (Trivy, ECR scanning).
Run as non-root.
Read-only root filesystem where possible.
No latest tags — pin specific versions.
Minimal base images (alpine, distroless).
Sign images (Cosign / Notary).

The compliance overlap

If you're subject to SOC 2, HIPAA, PCI-DSS, GDPR, etc., security work overlaps heavily with compliance:

Encryption at rest and in transit (all of them).
Access logs and audit trails (SOC 2, HIPAA, SOX).
Access reviews quarterly (SOC 2).
Vulnerability management (SOC 2, PCI).
Backup and DR plans (SOC 2, HIPAA).
Vendor management (SOC 2, GDPR DPAs).

Build the security controls; compliance evidence comes from them.

REAL-WORLDA security baseline checklist

Root account: MFA, no API keys, used only for billing.
SSO via IAM Identity Center; no individual IAM users for engineers.
IAM roles for workloads (ECS task roles, IRSA, Lambda execution roles); no shared API keys.
CloudTrail multi-region, log integrity, sent to a security account.
GuardDuty + Security Hub enabled, findings to Slack.
VPC: private subnets for compute, NACLs at defaults, security groups reference each other.
WAF in front of public ALBs (rate limiting, OWASP rules, IP allowlists for admin).
S3: account-level public block, default encryption, lifecycle policies.
RDS: encrypted, in private subnets, no public access.
Secrets Manager for credentials, rotated automatically.
OIDC for CI/CD; no long-lived AWS keys in GitHub.
Container scanning in CI; signed images.
Datadog or equivalent for runtime monitoring.
Quarterly access reviews; off-boarding removes access immediately.
Annual penetration test for production infrastructure.

Security infrastructure is mostly invisible when working. The investment is upfront — IAM properly scoped, networks segmented, logging set up, scanning in place. Once the foundation exists, individual features inherit it. The mistake is treating security as last-minute work; the senior approach is building the platform such that doing the wrong thing is hard. A misconfigured S3 bucket should fail at terraform apply, not after the breach.

// SECTION_18

War stories

Production failures cluster around recurring patterns. These are the classics — the ones that show up in post-mortems again and again.

The cascading timeout

Setup: Service A calls Service B with a 30-second timeout. B normally responds in 50ms. B's database has a slow query that occasionally takes 10 seconds.

What happens: The slow query causes B to fall behind. B's response time creeps from 50ms to 5s. A's threads pile up waiting on B. A runs out of worker threads. A starts rejecting unrelated requests. Other services that depend on A start failing.

Root cause: The 30-second timeout. Each hung call can hold a worker for up to 30 seconds. With 100 workers, A has 100 × 30 = 3,000 worker-seconds of capacity in any 30-second window; at 100 RPS, one second of hung calls (100 requests × 30 seconds each) consumes all 3,000. Even at B's degraded 5s latency, sustaining 100 RPS needs 100 × 5 = 500 busy workers, five times what A has.

Lessons: Timeouts on a dependency must be much shorter than the caller's own deadline, usually under 1s for sync dependencies. Bulkheads limit concurrency per dependency. Circuit breakers stop the bleeding.
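
A circuit breaker is small enough to sketch in full. A minimal version, assuming a synchronous call for brevity and an injectable clock; CircuitBreaker is a hypothetical class, and production libraries add half-open probing, metrics, and async support:

```javascript
// Opens after `maxFailures` consecutive failures and rejects calls until
// `cooldownMs` has passed, then lets calls through again.
class CircuitBreaker {
  constructor(maxFailures, cooldownMs, clock = Date.now) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.clock = clock;
    this.failures = 0;
    this.openedAt = null;
  }
  call(fn) {
    if (this.openedAt !== null && this.clock() - this.openedAt < this.cooldownMs) {
      throw new Error('circuit open'); // fail fast instead of tying up a worker
    }
    try {
      const result = fn();
      this.failures = 0; // success closes the circuit
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = this.clock();
      throw err;
    }
  }
}
```

While the circuit is open, A spends microseconds rejecting a call to B instead of seconds waiting on it, which is what keeps its worker pool available for unrelated requests.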

The midnight migration

Setup: Engineer adds a new column to a 500-million-row table. Runs ALTER TABLE accounts ADD COLUMN preferences JSONB DEFAULT '{}' at 11 PM Tuesday.

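What happens: On Postgres 10 and earlier, adding a column with a default rewrites the entire table under an ACCESS EXCLUSIVE lock (Postgres 11+ skips the rewrite for constant defaults, though the statement can still stall behind a long-running transaction and block everything queued after it). Every query against the table hangs. By 11:05, the entire service is down.

Lessons: Always understand what locks a migration takes. Set lock_timeout so blocked DDL fails fast instead of queueing. Test on production-sized data, not on dev. Tools like strong_migrations or squawk catch this in CI.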

The certificate expiration

Setup: An internal service uses a self-signed cert that "rotates yearly." Engineer who set it up left the company. Cert expires.

What happens: Saturday 3 AM, all internal calls to that service start failing with cert errors. On-call wakes up. Discovers the cert. Discovers the renewal process is undocumented.

Lessons: Automate cert renewal — cert-manager + Let's Encrypt for public, internal CA + cert-manager for private. Alert at 30, 14, 7, 1 day before expiration. Document even automated processes; people leave.
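
The alerting schedule can be checked with a few lines run daily against each certificate's expiry date. A minimal sketch, assuming the expiry (notAfter) has already been read from the certificate; the function name is hypothetical:

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;
const ALERT_DAYS = [30, 14, 7, 1]; // thresholds from the lesson above

// Returns the tightest threshold the certificate has crossed, or null if none.
// `notAfter` and `now` are Date objects.
function expiryAlert(notAfter, now) {
  const daysLeft = Math.floor((notAfter - now) / DAY_MS);
  const crossed = ALERT_DAYS.filter((d) => daysLeft <= d);
  return crossed.length ? Math.min(...crossed) : null;
}
```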

The runaway Lambda

Setup: S3 trigger fires Lambda on every uploaded file. Lambda processes the file and writes results back to the same bucket.

What happens: The Lambda's output triggers another Lambda invocation, which writes more output. Infinite loop. Hundreds of thousands of invocations in minutes. AWS bill spikes by $40k.

Lessons: S3 triggers should write to a different bucket or different prefix from the trigger source. Set Lambda concurrency limits. Set CloudWatch billing alarms. Always test event-driven systems for loops before deploying.
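
Besides scoping the trigger itself to a prefix, the handler can carry its own guard as a second line of defense. A minimal sketch, assuming hypothetical uploads/ and processed/ prefixes:

```javascript
// Guard for an S3-triggered handler: skip any object the function itself wrote.
const INPUT_PREFIX = 'uploads/';
const OUTPUT_PREFIX = 'processed/';

function shouldProcess(objectKey) {
  return objectKey.startsWith(INPUT_PREFIX) && !objectKey.startsWith(OUTPUT_PREFIX);
}

function outputKeyFor(objectKey) {
  // Results land under a prefix the trigger is not subscribed to.
  return OUTPUT_PREFIX + objectKey.slice(INPUT_PREFIX.length);
}
```

If someone later widens the trigger's prefix filter by mistake, the guard still breaks the loop on the first re-invocation.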

The DNS TTL trap

Setup: Migrating from one provider to another. Update DNS to point at new servers.

What happens: Old DNS TTL is 24 hours. Half the world's users keep going to old servers for the next 24 hours. Migration "complete" but users still report issues for a day.

Lessons: Reduce TTL to 60s a few days before migration. Verify with dig from multiple locations. Plan for a long tail of stragglers — keep old servers up well past the TTL window.

The runaway query

Setup: Analyst runs a "quick" report on production database. Query joins 4 tables, no indexes match, scans 200M rows.

What happens: Query runs for 4 hours. While it runs, its snapshot holds back the vacuum horizon, so dead row versions created after the query started cannot be garbage-collected. Tables and indexes bloat, autovacuum can't reclaim the space, and performance degrades for everyone.

Lessons: Read replicas for analytics, ideally a separate warehouse (Snowflake, BigQuery). Statement timeouts on the OLTP database (SET statement_timeout = '30s'). Monitor for transactions older than X minutes; alert and investigate.
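A sketch of the "transactions older than X minutes" check, assuming PostgreSQL and rows already fetched from the pg_stat_activity view as dicts. The 5-minute threshold and the function name are assumptions; the source leaves X open.

```python
from datetime import datetime, timedelta, timezone

# Query an operator might run; the columns are from PostgreSQL's pg_stat_activity view.
LONG_TXN_SQL = """
SELECT pid, usename, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
"""

def long_transactions(rows, now, max_age=timedelta(minutes=5)):
    """Return (pid, age) pairs for transactions open longer than max_age, oldest first."""
    offenders = [
        (row["pid"], now - row["xact_start"])
        for row in rows
        if row["xact_start"] is not None and now - row["xact_start"] > max_age
    ]
    return sorted(offenders, key=lambda item: item[1], reverse=True)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [  # hand-written sample rows standing in for a real query result
    {"pid": 101, "xact_start": now - timedelta(hours=4)},     # the analyst's report
    {"pid": 102, "xact_start": now - timedelta(seconds=2)},   # normal OLTP traffic
    {"pid": 103, "xact_start": None},                         # idle connection
]
for pid, age in long_transactions(rows, now):
    print(pid, age)  # 101 4:00:00
```

Alerting on age rather than on query text catches the idle-in-transaction session as well as the runaway report, since both hold back cleanup the same way.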

The thundering herd

Setup: Cache TTL of 5 minutes. Cache miss causes DB query that takes 2 seconds.

What happens: Cache expires; thousands of requests arrive in the same second; thousands of DB queries fire simultaneously; DB chokes; cache stays empty; more requests pile up; everything down.

Lessons: Stale-while-revalidate (serve stale value, refresh in background). Single-flight pattern (only one process refreshes; others wait). Probabilistic early refresh (refresh randomly before TTL).
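Of the three, single-flight is the easiest to show in a few lines. This is an in-process sketch using threads; the class name and loader are hypothetical, and a multi-server deployment would need a shared lock (for example in the cache itself) rather than a local one.

```python
import threading
import time

class SingleFlightCache:
    """In-process cache where only one caller reloads a missing or expired key."""

    def __init__(self, loader, ttl_s):
        self._loader = loader
        self._ttl_s = ttl_s
        self._lock = threading.Lock()
        self._values = {}     # key -> (value, expires_at)
        self._inflight = {}   # key -> Event, set when the current load finishes

    def get(self, key):
        while True:
            with self._lock:
                entry = self._values.get(key)
                if entry is not None and entry[1] > time.monotonic():
                    return entry[0]              # fresh hit
                done = self._inflight.get(key)
                leader = done is None
                if leader:                       # first caller in: we do the load
                    done = threading.Event()
                    self._inflight[key] = done
            if not leader:
                done.wait()                      # someone else is loading
                continue                         # then re-check the cache
            try:
                value = self._loader(key)
                with self._lock:
                    self._values[key] = (value, time.monotonic() + self._ttl_s)
                return value
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()                       # wake the waiters, success or not

calls = []

def slow_query(key):
    calls.append(key)       # count how often the "database" is hit
    time.sleep(0.05)        # stand-in for the slow query in the story
    return f"value-for-{key}"

cache = SingleFlightCache(slow_query, ttl_s=300)
threads = [threading.Thread(target=cache.get, args=("report",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # 1: fifty concurrent requests, one database query
```

If the loader raises, the leader sees the exception and the next waiter takes over as leader, so a failing database is retried one request at a time instead of by the whole herd.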

The forgotten staging environment

Setup: Staging environment created during launch, used for testing. After launch, team mostly tests in production with feature flags.

What happens: Staging keeps running. Same instance sizes as production. Costs $8k/month. Nobody notices because it's lumped in with production billing. After 18 months, $144k spent.

Lessons: Auto-shutdown non-prod outside business hours. Tag everything for cost allocation. Quarterly review of running resources by environment. Even better: ephemeral environments (created per PR, destroyed on merge).
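A sketch of the schedule decision behind auto-shutdown. The 08:00 to 19:00 weekday window and the environment names are assumptions; the actual stop/start call depends on what the environment runs on and is left out.

```python
from datetime import datetime

def should_be_running(environment, now, start_hour=8, stop_hour=19):
    """Production always runs; everything else only on weekdays in business hours."""
    if environment == "production":
        return True
    is_weekday = now.weekday() < 5  # Monday=0 .. Friday=4
    return is_weekday and start_hour <= now.hour < stop_hour

print(should_be_running("staging", datetime(2024, 3, 1, 10, 0)))     # Friday 10:00 -> True
print(should_be_running("staging", datetime(2024, 3, 2, 10, 0)))     # Saturday -> False
print(should_be_running("production", datetime(2024, 3, 2, 3, 0)))   # always True

# Share of the week a non-prod environment stays up under this schedule.
running_fraction = (19 - 8) * 5 / (24 * 7)
print(round(running_fraction, 2))  # 0.33
```

Under these assumptions non-prod runs about a third of the week, which only saves money on resources billed by the hour; storage and reserved capacity keep costing the same.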

The IAM wildcard

Setup: Engineer needed an IAM role to read from one S3 bucket. Confused by the syntax, granted "Action": "*", "Resource": "*" to "make it work."

What happens: Months later, that role's credentials leak via a bug. Attacker has full AWS access. Spins up 1000 GPU instances mining crypto. $300k in 6 hours before alarms catch it.

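Lessons: IAM policies must be specific: name the actions and the bucket ARN. Service Control Policies at the org level cap what even a "*","*" role can do, for example by denying unused regions or GPU instance types. Billing anomaly alerts catch crypto mining quickly. CloudTrail + GuardDuty would have flagged the unusual EC2 spawning.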
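A sketch of a CI check that rejects the wildcard policy from the story, alongside the scoped policy the engineer actually needed. The bucket name example-reports-bucket and the function name are hypothetical; the check only looks for a bare "*" and is no substitute for a full policy linter.

```python
def wildcard_statements(policy):
    """Return Allow statements whose Action or Resource is a bare "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):   # IAM allows a single statement object
        statements = [statements]

    def as_list(value):                # Action/Resource may be a string or a list
        return value if isinstance(value, list) else [value]

    return [
        s for s in statements
        if s.get("Effect") == "Allow"
        and ("*" in as_list(s.get("Action", [])) or "*" in as_list(s.get("Resource", [])))
    ]

too_broad = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}

read_one_bucket = {
    "Version": "2012-10-17",
    "Statement": [
        {   # list the bucket's contents
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::example-reports-bucket",
        },
        {   # read objects inside it
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-reports-bucket/*",
        },
    ],
}

print(len(wildcard_statements(too_broad)))        # 1
print(len(wildcard_statements(read_one_bucket)))  # 0
```

The scoped policy is barely longer than the wildcard one. Leaked credentials for it expose one bucket's contents instead of the whole account.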

The cross-region surprise

Setup: Application in us-east-1. Reads from S3 bucket "for analytics" — but the bucket is in us-west-2.

What happens: Every read crosses regions. Costs $0.02/GB. Application reads ~10TB/day. Bill: $200/day extra, completely invisible until cost review.

Lessons: Match resource regions when possible. Use S3 Cross-Region Replication if data must exist in multiple regions. Cost Explorer's "Data Transfer" breakdown surfaces these. Tag everything by region; cost allocation reports flag anomalies.
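The arithmetic from the story, extended to a month and a year. It uses the story's $0.02/GB rate and decimal terabytes (10 TB = 10,000 GB); actual rates vary by region pair, so treat the rate as an input rather than a constant.

```python
def transfer_cost_usd(gb_per_day, usd_per_gb, days=1):
    """Cost of moving gb_per_day across regions for the given number of days."""
    return gb_per_day * usd_per_gb * days

GB_PER_DAY = 10_000   # ~10 TB/day, from the story
USD_PER_GB = 0.02     # rate quoted in the story

print(round(transfer_cost_usd(GB_PER_DAY, USD_PER_GB)))            # 200 per day
print(round(transfer_cost_usd(GB_PER_DAY, USD_PER_GB, days=30)))   # 6000 per month
print(round(transfer_cost_usd(GB_PER_DAY, USD_PER_GB, days=365)))  # 73000 per year
```

$200/day is easy to miss on a large bill; $73k/year is not, which is why the per-region data transfer breakdown is worth a recurring look.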

The one common thread

These stories share a pattern: a small problem cascades because the system wasn't designed to fail gracefully. The slow query, the long timeout, the missing index, the forgotten cert — none of these were the actual cause. The actual cause was that the system amplified the small problem instead of containing it.

The defenses are well-known: short timeouts with bulkheads, retry budgets, circuit breakers, expand-contract migrations, automated cert renewal, statement timeouts, stale-while-revalidate, auto-shutdown of non-prod, specific IAM policies, billing alarms, region tagging.

Production failures are pattern-matching exercises. Once you've seen a few cascading timeouts, every "service is slow" feels like a potential one. The senior infra engineer reads post-mortems from other companies the way a doctor reads case studies — to recognize the next one before it happens. Most outages are preventable; the work is putting the prevention in place before the outage proves you needed it.