Site Reliability Engineer

Google-originated discipline for building reliable, scalable production systems

10 milestones in this roadmap

Step 1 (beginner, 6-8 weeks)

Systems Fundamentals

Master the core operating system concepts that every SRE relies on daily to diagnose issues, optimise performance, and maintain production Linux systems.

Curriculum

1. Linux kernel architecture, system calls, and user-space vs kernel-space boundaries
2. Process lifecycle management: fork, exec, signals, and process groups
3. Memory management: virtual memory, paging, swap, OOM killer, and cgroups memory limits
4. I/O subsystems: block devices, file systems (ext4, XFS), I/O schedulers, and buffer cache
5. Networking stack: socket programming, TCP state machine, netfilter/iptables, and network namespaces
6. Performance observation: /proc, /sys, strace, ltrace, and perf basics

Tools & Platforms

Linux (Ubuntu/RHEL), strace / ltrace, perf / bpftrace, systemd / journalctl

Step 2 (beginner, 6-8 weeks)

Networking Deep Dive

Develop expert-level networking knowledge to troubleshoot connectivity issues, design resilient network topologies, and optimise data transfer across production systems.

Curriculum

1. TCP/IP internals: three-way handshake, congestion control (CUBIC, BBR), window scaling, and Nagle's algorithm
2. DNS resolution: recursive vs iterative queries, record types, TTL strategies, and DNS-based load balancing
3. Load balancing algorithms: round-robin, least connections, consistent hashing, and weighted distribution
4. CDN architecture: edge caching, origin shielding, cache invalidation, and geo-routing
5. HTTP/2 and HTTP/3: multiplexing, server push, header compression (HPACK), and the QUIC transport
6. gRPC fundamentals: protocol buffers, streaming modes, and service mesh integration

Tools & Platforms

Wireshark / tcpdump, dig / nslookup / mtr, curl / httpie, Nginx / HAProxy
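
To make the consistent hashing item in the curriculum above concrete, here is a minimal sketch of a hash ring with virtual nodes. The class name `HashRing`, the backend names `lb-a` to `lb-d`, and the choice of 100 virtual nodes per backend are assumptions made for illustration; production load balancers use their own implementations.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomised per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Illustrative consistent-hash ring (hypothetical helper, not a library API)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first point after the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(1000)]
before = HashRing(["lb-a", "lb-b", "lb-c"])
after = HashRing(["lb-a", "lb-b", "lb-c", "lb-d"])
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding a fourth backend")
```

Only keys that now land on the new backend change owner, which is the property that separates consistent hashing from a plain `hash(key) % n` scheme, where most keys would move.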

Step 3 (intermediate, 4-6 weeks)

Monitoring & Alerting

Build production-grade monitoring and alerting systems that provide actionable insights and wake you up only when it truly matters.

Curriculum

1. Prometheus architecture: scraping, TSDB storage, federation, and remote write
2. PromQL mastery: selectors, aggregations, rate(), histogram_quantile(), and recording rules
3. Grafana dashboard design: variable templates, annotations, and alert panels
4. Alert fatigue reduction: multi-window burn-rate alerts, severity classification, and routing
5. Runbook automation: documented response procedures, automated remediation triggers
6. On-call tooling: PagerDuty escalation policies, schedule management, and override workflows

Tools & Platforms

Prometheus, Grafana, Alertmanager, PagerDuty, Thanos / Mimir

Step 4 (intermediate, 3-4 weeks)

Incident Management

Learn the organisational and communication frameworks that turn chaotic outages into structured responses and drive systemic reliability improvements.

Curriculum

1. Incident response lifecycle: detection, triage, mitigation, resolution, and follow-up
2. Severity level classification: SEV1-SEV4 definitions, escalation criteria, and SLA mapping
3. Incident commander role: communication templates, status page updates, and stakeholder management
4. Blameless postmortem culture: contributing factors analysis, action items, and knowledge sharing
5. Communication during outages: internal war rooms, customer notifications, and executive briefings
6. SLA management: uptime commitments, credit policies, and contractual obligations

Tools & Platforms

PagerDuty / Opsgenie, Atlassian Statuspage, Jira / Linear, Slack incident channels, Blameless / FireHydrant

Step 5 (intermediate, 3-4 weeks)

SLOs, SLIs & Error Budgets

Master the quantitative framework that Google pioneered to make reliability an engineering problem with measurable targets and data-driven trade-offs.

Curriculum

1. Service level indicators: request latency percentiles, availability ratio, throughput, and correctness
2. Service level objectives: target-setting methodology, calendar-aligned vs rolling windows, and user-journey SLOs
3. Error budget calculation: budget remaining, burn rate, and budget exhaustion forecasting
4. Error budget policies: development freezes, reliability sprints, and stakeholder agreements
5. Toil measurement: manual vs automated work tracking, toil budgets, and reduction strategies
6. Reliability targets: diminishing returns of nines, cost of additional reliability, and business alignment

Tools & Platforms

Prometheus + recording rules, Grafana SLO dashboards, Google SLO Generator, Nobl9 / Datadog SLO
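
The error budget arithmetic named in the curriculum above (budget remaining, burn rate, exhaustion forecasting) can be sketched in a few lines. The function names and the sample figures (a 99.9% SLO, one million requests, a 30-day window) are assumptions chosen for illustration, and the forecast assumes the current burn rate stays constant.

```python
def allowed_failures(slo: float, total_events: int) -> float:
    """Failures the SLO tolerates over the window, e.g. 0.1% of requests at 99.9%."""
    return (1.0 - slo) * total_events


def budget_remaining(slo: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1.0 - failed_events / allowed_failures(slo, total_events)


def burn_rate(slo: float, observed_error_ratio: float) -> float:
    """How fast budget is consumed; 1.0 spends exactly one budget per SLO window."""
    return observed_error_ratio / (1.0 - slo)


def hours_to_exhaustion(window_hours: float, remaining: float, rate: float) -> float:
    """Hours until the budget reaches zero if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return max(remaining, 0.0) * window_hours / rate


slo = 0.999
remaining = budget_remaining(slo, total_events=1_000_000, failed_events=400)
rate = burn_rate(slo, observed_error_ratio=0.0144)
forecast = hours_to_exhaustion(30 * 24, remaining, rate)
print(f"budget remaining: {remaining:.1%}")
print(f"burn rate: {rate:.1f}x")
print(f"hours to exhaustion: {forecast:.1f}")
```

A burn rate well above 1 over a short window is the usual signal for a page, while a rate only slightly above 1 over a long window is better suited to a ticket.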

Step 6 (intermediate, 4-6 weeks)

Capacity Planning

Develop the analytical and engineering skills to ensure systems have the right resources at the right time without over-provisioning or under-provisioning.

Curriculum

1. Load testing methodology: baseline measurement, ramp-up patterns, soak tests, and spike tests
2. Traffic modelling: seasonal patterns, growth projections, and burst capacity estimation
3. Resource forecasting: CPU, memory, disk, and network utilisation trends and extrapolation
4. Horizontal vs vertical scaling: stateless service scaling, database read replicas, and data partitioning
5. Auto-scaling policies: metric-based triggers, cooldown periods, predictive scaling, and cluster autoscaler
6. Cost optimisation: right-sizing instances, reserved capacity, spot instances, and resource quotas

Tools & Platforms

k6 / Locust / Gatling, Kubernetes HPA / VPA, AWS Auto Scaling / GCP MIG, Grafana capacity dashboards
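
As an example of a metric-based scaling trigger, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1. The function name and the replica bounds are assumptions for illustration. The real controller also accounts for unready pods, missing metrics, and stabilisation windows, none of which are modelled here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA-style calculation (hypothetical helper, not the controller code)."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 90% average CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```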

Step 7 (advanced, 6-8 weeks)

Distributed Systems Reliability

Gain deep knowledge of the theoretical foundations and practical patterns that make distributed systems reliable despite the inevitability of partial failures.

Curriculum

1. Consensus algorithms: Raft leader election, log replication, and safety guarantees
2. Replication strategies: synchronous, asynchronous, semi-synchronous, and chain replication
3. CAP theorem and PACELC: practical partition tolerance trade-offs in real systems
4. Eventual consistency: conflict resolution, vector clocks, CRDTs, and read-your-writes semantics
5. Circuit breaker pattern: closed, open, half-open states, failure thresholds, and timeout configuration
6. Bulkhead pattern: resource isolation, thread pool segregation, and blast radius containment

Tools & Platforms

etcd / ZooKeeper / Consul, Hystrix / Resilience4j, Envoy proxy, Jepsen testing framework
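
The three circuit breaker states listed in the curriculum above can be illustrated with a small state machine. This is a minimal single-threaded sketch: the class name, the default threshold of three consecutive failures, and the 30-second reset timeout are assumptions, and it does not reproduce the behaviour of Hystrix or Resilience4j.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested without sleeping
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through.
            self.state = "half-open"
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()

    def _on_success(self):
        self._failures = 0
        self.state = "closed"
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat `CircuitOpenError` as a signal to serve a fallback instead of waiting on a failing dependency.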

Step 8 (advanced, 4-6 weeks)

Chaos Engineering

Build confidence in system resilience by systematically injecting failures and verifying that your reliability investments actually work under real conditions.

Curriculum

1. Steady-state hypothesis: defining normal system behaviour with measurable metrics
2. Failure injection techniques: process killing, network partitions, latency injection, and resource exhaustion
3. GameDay exercises: planning, scope definition, safety controls, and stakeholder communication
4. Chaos Monkey and Netflix Simian Army: random instance termination in production
5. Litmus framework: Kubernetes-native chaos experiments, probes, and workflow orchestration
6. Blast radius management: canary chaos, progressive rollout, and automated rollback triggers

Tools & Platforms

Chaos Monkey / Simian Army, Litmus Chaos, Gremlin, AWS Fault Injection Simulator, Chaos Mesh

Step 9 (advanced, 6-8 weeks)

Performance Engineering

Develop the skills to find and eliminate performance bottlenecks across the entire stack from application code to database queries to network paths.

Curriculum

1. Application profiling: CPU flame graphs, memory allocation profiling, and lock contention analysis
2. Distributed tracing: OpenTelemetry instrumentation, trace propagation, and span analysis
3. Latency optimisation: P50/P99 reduction strategies, tail latency amplification, and hedged requests
4. Database query optimisation: EXPLAIN ANALYZE, index tuning, query plan analysis, and connection pooling
5. Caching strategies: cache-aside, write-through, write-behind, TTL policies, and cache stampede prevention
6. Connection pooling: database pool sizing, HTTP keep-alive tuning, and gRPC connection management

Tools & Platforms

Jaeger / Tempo / Zipkin, async-profiler / pprof, PgBouncer / ProxySQL, Redis / Memcached, OpenTelemetry
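
Tail latency amplification is easiest to see with a little arithmetic. If a request fans out to N backends and each backend independently exceeds its own P99 latency 1% of the time, the chance that at least one call is slow is 1 - 0.99^N. The sketch below computes that figure and a deliberately optimistic model of hedged requests, in which a call counts as slow only when both the original and the hedge are slow. The function names and the independence assumption are simplifications for illustration; real backends are often correlated, and a real hedge is sent only after a delay.

```python
def p_slow_request(fanout: int, per_call_quantile: float = 0.99) -> float:
    """Probability that at least one of `fanout` independent calls exceeds its quantile."""
    return 1.0 - per_call_quantile ** fanout


def p_slow_request_hedged(fanout: int, per_call_quantile: float = 0.99) -> float:
    """Same, assuming each call is duplicated and is slow only if both copies are slow."""
    p_slow_call = (1.0 - per_call_quantile) ** 2
    return 1.0 - (1.0 - p_slow_call) ** fanout


for n in (1, 10, 100):
    print(
        f"fan-out {n:>3}: "
        f"{p_slow_request(n):.1%} slow without hedging, "
        f"{p_slow_request_hedged(n):.2%} with hedging"
    )
```

At a fan-out of 100, roughly 63% of requests hit at least one slow backend, which is why P99 latency of a single service understates what users of a fan-out system experience.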

Step 10 (advanced, 3-4 weeks)

SRE Culture & Practices

Synthesise all SRE skills into organisational practices that embed reliability into the software development lifecycle and team culture.

Curriculum

1. Production readiness reviews: checklist design, review cadence, and graduation criteria for new services
2. Launch checklists: monitoring, alerting, capacity, rollback plans, and load test verification
3. On-call rotations: schedule design, compensation, handoff procedures, and burnout prevention
4. Toil reduction: identifying repetitive manual work, automation ROI calculation, and elimination strategies
5. Automation philosophy: automate first, document second, and eliminate toil as a team sport
6. SRE team models: embedded, consulting, and platform SRE, plus engagement frameworks with dev teams

Tools & Platforms

Terraform / Pulumi (IaC), Backstage (developer portal), Runbook automation tools, SRE workbook templates
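
One simple way to approach the automation ROI calculation mentioned in the curriculum above is a break-even estimate: how many months of saved manual effort it takes to repay the time spent building the automation. The function name, its parameters, and the sample numbers are assumptions for illustration; a fuller model would also weigh error reduction and on-call load.

```python
from typing import Optional


def automation_break_even_months(
    build_hours: float,
    manual_minutes_per_run: float,
    runs_per_month: float,
    upkeep_hours_per_month: float = 0.0,
) -> Optional[float]:
    """Months until automation repays its build cost, or None if it never does."""
    saved_hours_per_month = (
        manual_minutes_per_run * runs_per_month / 60.0 - upkeep_hours_per_month
    )
    if saved_hours_per_month <= 0:
        return None
    return build_hours / saved_hours_per_month


# A 30-minute manual task run 20 times a month, automated in 40 engineering hours.
print(automation_break_even_months(40, manual_minutes_per_run=30, runs_per_month=20))
```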