Google-originated discipline for building reliable, scalable production systems
10 milestones in this roadmap
Step 1 (beginner, 6-8 weeks)
Systems Fundamentals
Master the core operating system concepts that every SRE relies on daily to diagnose issues, optimise performance, and maintain production Linux systems.
Curriculum
1. Linux kernel architecture, system calls, and user-space vs kernel-space boundaries
2. Process lifecycle management: fork, exec, signals, and process groups
4. I/O subsystems: block devices, file systems (ext4, XFS), I/O schedulers, and buffer cache
5. Networking stack: socket programming, TCP state machine, netfilter/iptables, and network namespaces
6. Performance observation: /proc, /sys, strace, ltrace, and perf basics
Tools & Platforms
Linux (Ubuntu/RHEL), strace / ltrace, perf / bpftrace, systemd / journalctl
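The process lifecycle item (fork, exec, signals) can be illustrated with a small sketch. It assumes a POSIX system and uses only Python's standard `os` module; the helper name `run_child` and the 128 + signal exit convention borrowed from shells are choices made for this example, not part of the roadmap.

```python
import os
import sys


def run_child(argv):
    """Fork, exec argv in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; exit without running parent cleanup.
            os._exit(127)
    # Parent: block until the child terminates, then decode the wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        # Mirror the shell convention of reporting 128 + signal number.
        return 128 + os.WTERMSIG(status)
    return -1


if __name__ == "__main__":
    code = run_child([sys.executable, "-c", "import sys; sys.exit(3)"])
    print("child exited with", code)
```

Running the same program under `strace -f` is one way to watch the underlying process-creation, exec, and wait system calls as they happen.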
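Step 2 (beginner, 6-8 weeks)
Networking Deep Dive
Develop expert-level networking knowledge to troubleshoot connectivity issues, design resilient network topologies, and optimise data transfer across production systems.
Curriculum
1. TCP/IP internals: three-way handshake, congestion control (CUBIC, BBR), window scaling, and Nagle's algorithm
2. DNS resolution: recursive vs iterative queries, record types, TTL strategies, and DNS-based load balancing
3. Load balancing algorithms: round-robin, least connections, consistent hashing, and weighted distribution
4. CDN architecture: edge caching, origin shielding, cache invalidation, and geo-routing
5. HTTP/2 multiplexing, server push, header compression (HPACK), and the HTTP/3 QUIC protocol
6. gRPC fundamentals: protocol buffers, streaming modes, and service mesh integration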
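Of the load balancing algorithms listed, consistent hashing is the least obvious from its name alone. The following is a minimal sketch of a hash ring with virtual nodes, using only the standard library. The class name `HashRing`, the default of 100 virtual nodes, and the choice of SHA-256 are assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # Use the first 8 bytes of SHA-256 as a 64-bit ring position.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owner:
                continue  # extremely unlikely collision; keep the first owner
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove(self, node):
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        if not self._positions:
            raise LookupError("ring is empty")
        # First position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._owner[self._positions[idx % len(self._positions)]]


if __name__ == "__main__":
    ring = HashRing(["a", "b", "c"])
    print(ring.lookup("user-42"))
```

The property worth noticing is that removing a node only remaps the keys that node owned; keys owned by other nodes keep their placement, which is what distinguishes this from plain modulo hashing.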
Step 3 (intermediate, 4-6 weeks)
Monitoring & Alerting
Build production-grade monitoring and alerting systems that provide actionable insights and wake you up only when it truly matters.
Curriculum
1. Prometheus architecture: scraping, TSDB storage, federation, and remote write
2. PromQL mastery: selectors, aggregations, rate(), histogram_quantile(), and recording rules
3. Grafana dashboard design: variable templates, annotations, and alert panels
4. Alert fatigue reduction: multi-window burn-rate alerts, severity classification, and routing
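The arithmetic behind a multi-window burn-rate alert is short enough to sketch directly. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and the alert fires only when both a long and a short window exceed the threshold. The function names and the default threshold of 14.4 are assumptions for this example; 14.4 is a commonly cited value paired with 1-hour and 5-minute windows, and real thresholds depend on your SLO period.

```python
def burn_rate(error_ratio, slo_target):
    """Return how many times faster than budgeted the error budget is burning."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors, slo_target, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert resets quickly on recovery.
    """
    return (
        burn_rate(long_window_errors, slo_target) >= threshold
        and burn_rate(short_window_errors, slo_target) >= threshold
    )


if __name__ == "__main__":
    # 2% errors over the long window and 3% over the short, against a 99.9% SLO.
    print(should_page(0.02, 0.03, 0.999))
```

In a Prometheus setup the two error ratios would typically come from recording rules built on `rate()`, with the comparison expressed in the alerting rule itself.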
Step 4 (intermediate, 3-4 weeks)
Incident Management
Learn the organisational and communication frameworks that turn chaotic outages into structured responses and drive systemic reliability improvements.
Curriculum
1. Incident response lifecycle: detection, triage, mitigation, resolution, and follow-up
2. Severity level classification: SEV1-SEV4 definitions, escalation criteria, and SLA mapping
3. Incident commander role: communication templates, status page updates, and stakeholder management
Develop the analytical and engineering skills to ensure systems have the right resources at the right time without over-provisioning or under-provisioning.
2. Traffic modelling: seasonal patterns, growth projections, and burst capacity estimation
3. Resource forecasting: CPU, memory, disk, and network utilisation trends and extrapolation
4. Horizontal vs vertical scaling: stateless service scaling, database read replicas, and data partitioning
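Trend extrapolation for resource forecasting can be sketched as an ordinary least-squares line fit followed by solving for the point where the line crosses a capacity limit. The function names, the (day, utilisation) sample format, and the linear-growth assumption are all illustrative; seasonal or bursty traffic needs a richer model.

```python
def fit_line(samples):
    """Least-squares fit of y = slope * x + intercept over (x, y) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("all samples share the same x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(samples, limit):
    """Estimate days after the latest sample until the trend reaches limit.

    Returns None when utilisation is flat or falling.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    latest_day = max(x for x, _ in samples)
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - latest_day)


if __name__ == "__main__":
    # Hypothetical disk utilisation (%) sampled once per day.
    history = [(day, 50 + 2 * day) for day in range(10)]
    print(days_until_limit(history, 80))
```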
Step 7 (advanced, 6-8 weeks)
Distributed Systems Reliability
Gain deep knowledge of the theoretical foundations and practical patterns that make distributed systems reliable despite the inevitability of partial failures.
Curriculum
1. Consensus algorithms: Raft leader election, log replication, and safety guarantees
2. Replication strategies: synchronous, asynchronous, semi-synchronous, and chain replication
3. CAP theorem and PACELC: practical partition tolerance trade-offs in real systems
4. Eventual consistency: conflict resolution, vector clocks, CRDTs, and read-your-writes semantics
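Vector clocks, mentioned under eventual consistency, are easier to reason about with a concrete comparison routine. This sketch represents a clock as a plain dictionary of node name to counter; the function names and the string results ("before", "after", "equal", "concurrent") are conventions chosen for this example.

```python
def increment(clock, node):
    """Return a copy of clock with node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Element-wise maximum, applied when a node receives a remote clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a, b):
    """Classify the causal relationship between two vector clocks."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Neither dominates: the writes happened independently and conflict.
    return "concurrent"


if __name__ == "__main__":
    base = increment({}, "n1")
    left = increment(base, "n1")   # n1 writes again
    right = increment(base, "n2")  # n2 writes without seeing n1's second write
    print(compare(left, right))
```

A "concurrent" result is the case that needs a conflict resolution strategy, such as last-writer-wins, application-level merging, or a CRDT.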
Step 8 (advanced, 4-6 weeks)
Chaos Engineering
Build confidence in system resilience by systematically injecting failures and verifying that your reliability investments actually work under real conditions.
Curriculum
1. Steady-state hypothesis: defining normal system behaviour with measurable metrics
2. Failure injection techniques: process killing, network partitions, latency injection, and resource exhaustion
3. GameDay exercises: planning, scope definition, safety controls, and stakeholder communication
4. Chaos Monkey and Netflix Simian Army: random instance termination in production
5. Litmus framework: Kubernetes-native chaos experiments, probes, and workflow orchestration
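The steady-state hypothesis and failure injection items fit together as a loop: measure a metric, inject a fault, and check whether the metric stays within bounds. Below is a toy in-process sketch of that loop. `with_chaos`, `InjectedFault`, and `success_ratio` are hypothetical names for this example and do not correspond to Chaos Monkey, Litmus, or any other tool's API.

```python
import random
import time


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def with_chaos(func, *, failure_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap func so calls are delayed and randomly fail at the given rate."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # latency injection
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)

    return wrapper


def success_ratio(call, attempts):
    """Steady-state metric: fraction of calls that complete without a fault."""
    ok = 0
    for _ in range(attempts):
        try:
            call()
            ok += 1
        except InjectedFault:
            pass
    return ok / attempts


if __name__ == "__main__":
    flaky = with_chaos(lambda: "ok", failure_rate=0.2, rng=random.Random(7))
    print(f"success ratio under injection: {success_ratio(flaky, 1000):.2f}")
```

In a real experiment the wrapped call would sit behind retries or a fallback, and the hypothesis would be that the user-facing success ratio stays above its SLO despite the injected faults.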
Step 9 (advanced, 6-8 weeks)
Performance Engineering
Develop the skills to find and eliminate performance bottlenecks across the entire stack from application code to database queries to network paths.
Curriculum
1. Application profiling: CPU flame graphs, memory allocation profiling, and lock contention analysis
2. Distributed tracing: OpenTelemetry instrumentation, trace propagation, and span analysis
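Trace propagation usually means carrying a trace identifier across service boundaries in a request header. This sketch parses and regenerates a W3C Trace Context `traceparent` header (version 00) by hand to show what is propagated. It deliberately avoids the OpenTelemetry SDK, which handles this for you in practice; the function names are assumptions for this example.

```python
import re
import secrets

# version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Parse a version-00 traceparent header into its fields."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        raise ValueError("malformed traceparent header")
    version, trace_id, parent_id, flags = match.groups()
    if version != "00":
        raise ValueError("this sketch only handles version 00")
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError("all-zero trace or parent id is invalid")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Build the header for an outgoing call: same trace, new span id."""
    ctx = parse_traceparent(header)
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(incoming))
```

The trace id stays constant across every hop, while each service substitutes its own span id; that pairing is what lets a tracing backend reassemble spans into a single tree.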