Data Engineer
Master the art of building reliable, scalable data infrastructure and pipelines that power modern analytics and ML systems
10 milestones in this roadmap
Step 1 (beginner, 5-6 weeks)
SQL Mastery
Master advanced SQL including window functions, CTEs, query optimization, and execution plans
Curriculum
1. Complex Joins & Subqueries
2. Window Functions & Analytical Queries
3. Common Table Expressions (CTEs) & Recursive Queries
4. Query Optimization & Execution Plans
5. Index Types & Database Performance Tuning
Tools & Platforms
PostgreSQL, MySQL, DBeaver, pgAdmin, SQLFluff
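As a small taste of the window-function and CTE topics in this step, here is a minimal sketch that picks each customer's largest order. It uses Python's built-in sqlite3 module rather than PostgreSQL or MySQL so that it runs without a server; the `orders` table, its columns, and the sample rows are made up for illustration, and the bundled SQLite is assumed to be version 3.25 or newer (the first release with window functions).

```python
import sqlite3


def top_order_per_customer(rows):
    """Return each customer's largest order using a CTE and ROW_NUMBER()."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table used only for this illustration.
    conn.execute("CREATE TABLE orders (customer TEXT, order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    query = """
        WITH ranked AS (
            SELECT customer, order_id, amount,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer
                       ORDER BY amount DESC, order_id
                   ) AS rn
            FROM orders
        )
        SELECT customer, order_id, amount
        FROM ranked
        WHERE rn = 1
        ORDER BY customer
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample_orders = [
    ("alice", 1, 20.0),
    ("alice", 2, 35.5),
    ("bob", 3, 12.0),
    ("bob", 4, 8.0),
]
```

The CTE keeps the ranking logic separate from the final filter, which is usually easier to read than a nested subquery. The same query text should also work on PostgreSQL and on MySQL 8.0 or later.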
Step 2 (beginner, 4-5 weeks)
Python for Data Engineering
Build robust Python skills for scripting, API integration, and file processing in data pipelines
Curriculum
1. File Format Processing: CSV, JSON, Parquet, Avro
2. REST API Consumption & Rate Limiting
3. Error Handling, Logging & Configuration Management
4. Python Concurrency: Threading, Asyncio, Multiprocessing
5. Unit Testing & Data Validation Patterns
Tools & Platforms
Python 3, requests, pytest, Pydantic, pyarrow, aiohttp
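The API consumption and error handling topics often come together in a retry helper. The sketch below retries a failing call with exponential backoff. The `fetch` argument is a hypothetical stand-in for any function that performs a request (for example, one built on requests), and the `sleep` parameter is an assumption added so the waiting can be replaced in unit tests.

```python
import time


def call_with_backoff(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Only ConnectionError is retried here; a real client would likely also
    handle timeouts and HTTP 429 responses.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper fast to test with pytest: a test can pass `delays.append` and assert on the recorded waits instead of actually sleeping.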
Step 3 (intermediate, 5-6 weeks)
Data Warehousing
Design data warehouses using dimensional modeling, star schemas, and slowly changing dimensions
Curriculum
1. Dimensional Modeling & Kimball Methodology
2. Star Schema & Snowflake Schema Design
3. Slowly Changing Dimensions (SCD Type 1, 2, 3)
4. OLAP vs OLTP: Workload Characteristics
5. Columnar Storage & Compression Strategies
Tools & Platforms
PostgreSQL, Amazon Redshift, Google BigQuery, dbt, Lucidchart
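To make the SCD Type 2 topic concrete, here is a minimal in-memory sketch: when a tracked attribute changes, the current row is closed and a new current row is appended, so history is preserved. The column names (`customer_id`, `city`, `valid_from`, `valid_to`, `is_current`) are illustrative assumptions; in a warehouse this would normally be a MERGE statement or a dbt snapshot rather than Python.

```python
from datetime import date


def apply_scd2(dimension, incoming, effective):
    """Return a new dimension list after applying one incoming record."""
    result = []
    needs_new_version = True
    for row in dimension:
        is_current_match = (
            row["customer_id"] == incoming["customer_id"] and row["is_current"]
        )
        if not is_current_match:
            result.append(row)
        elif row["city"] == incoming["city"]:
            # Nothing changed: keep the current row and add no new version.
            needs_new_version = False
            result.append(row)
        else:
            # Attribute changed: close the old version instead of overwriting it.
            result.append({**row, "valid_to": effective, "is_current": False})
    if needs_new_version:
        result.append({
            "customer_id": incoming["customer_id"],
            "city": incoming["city"],
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        })
    return result
```

By contrast, Type 1 would simply overwrite `city` in place, and Type 3 would keep the previous value in an extra column on the same row.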
Step 4 (intermediate, 5-7 weeks)
ETL/ELT Pipelines
Design and build reliable ETL/ELT pipelines with idempotency and proper error recovery
Curriculum
1. ETL vs ELT Paradigms & Design Patterns
2. Incremental Loading & Change Data Capture (CDC)
3. Transformation Patterns & Data Deduplication
4. Idempotency & Exactly-Once Processing
5. Schema Evolution & Backward Compatibility
Tools & Platforms
Apache Airflow, dbt, Fivetran, Airbyte, Singer, Python
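Idempotency means that re-running a load after a failure leaves the target in the same state as running it once. One common way to get there is an upsert keyed on a primary key, sketched below with Python's built-in sqlite3 module. The `users` table and its columns are hypothetical, and the `ON CONFLICT ... DO UPDATE` syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def make_target():
    """Create an in-memory target table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
    return conn


def load_batch(conn, batch):
    """Upsert a batch of (user_id, email) rows; safe to run more than once."""
    conn.executemany(
        """
        INSERT INTO users (user_id, email) VALUES (?, ?)
        ON CONFLICT(user_id) DO UPDATE SET email = excluded.email
        """,
        batch,
    )
    conn.commit()
```

A plain INSERT would either fail or create duplicates on a retry; the upsert makes the second run a no-op for unchanged rows and an update for changed ones.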
Step 5 (intermediate, 6-8 weeks)
Big Data Processing
Process large-scale data with Apache Spark, distributed computing, and partitioning strategies
Curriculum
1. Distributed Computing Fundamentals & MapReduce
2. Apache Spark Architecture & Execution Model
3. Spark DataFrames, Datasets & Spark SQL
4. Partitioning, Shuffling & Join Optimization
5. Caching, Broadcast Variables & Performance Tuning
Tools & Platforms
Apache Spark, PySpark, Spark SQL, Hadoop HDFS, Delta Lake, Databricks
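The MapReduce and shuffling topics can be previewed on a single machine. The sketch below is a toy word count in plain Python, not Spark or Hadoop code: it maps lines to key-value pairs, "shuffles" them into hash partitions so every occurrence of a key lands in the same partition, then reduces each partition independently. All function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs, num_partitions):
    """Group values by key, routing each key to one partition by hash."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions


def reduce_phase(partition):
    """Sum the values for every key in a single partition."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(lines, num_partitions=2):
    pairs = [pair for line in lines for pair in map_phase(line)]
    counts = {}
    # Each partition could be reduced on a different worker; here it is a loop.
    for partition in shuffle(pairs, num_partitions):
        counts.update(reduce_phase(partition))
    return counts
```

Because a key never spans two partitions, the per-partition results can be merged without further aggregation. In a real cluster the shuffle step moves data across the network, which is why it tends to dominate job cost.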
Step 6 (intermediate, 6-8 weeks)
Streaming & Real-Time Data
Build real-time data pipelines with Kafka, Flink, and event-driven architectures
Curriculum
1. Apache Kafka: Topics, Partitions & Consumer Groups
2. Stream Processing: Windowing & Watermarks
3. Apache Flink: Stateful Stream Processing
4. Event Sourcing & CQRS Patterns
5. Exactly-Once Semantics & Delivery Guarantees
Tools & Platforms
Apache Kafka, Apache Flink, Kafka Streams, Confluent Platform, Apache Pulsar, Redpanda
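Windowing and watermarks are easier to reason about with a toy model. The sketch below counts events per tumbling event-time window and drops events that arrive later than an allowed lateness behind the highest timestamp seen so far. This is a deliberate simplification: engines such as Flink use watermarks to decide when a window may be closed and emitted, whereas here the watermark only filters late events. The function name and the event shape (`(timestamp_seconds, payload)` tuples in arrival order) are assumptions.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds, allowed_lateness):
    """Count events per tumbling window; return (counts, dropped_late_events)."""
    counts = Counter()
    dropped = 0
    max_seen = None
    for timestamp, _payload in events:
        max_seen = timestamp if max_seen is None else max(max_seen, timestamp)
        # Simplified watermark: trails the newest event time by allowed_lateness.
        watermark = max_seen - allowed_lateness
        if timestamp < watermark:
            dropped += 1
            continue
        # Align the timestamp to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[window_start] += 1
    return dict(sorted(counts.items())), dropped
```

The out-of-order event at second 61 in a stream that has already reached second 125 is still accepted with a 70-second lateness budget, while an event at second 10 would be dropped.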
Step 7 (advanced, 6-8 weeks)
Cloud Data Platforms
Master Snowflake, BigQuery, Redshift, and Databricks for cloud-native data architecture
Curriculum
1. Snowflake: Virtual Warehouses, Time Travel & Data Sharing
2. Google BigQuery: Slots, Partitioning & Materialized Views
3. Amazon Redshift: Distribution Styles & Spectrum
4. Databricks: Unity Catalog & Photon Engine
5. Data Lakehouse Architecture & Delta Lake
Tools & Platforms
Snowflake, Google BigQuery, Amazon Redshift, Databricks, Delta Lake, Apache Iceberg
Step 8 (advanced, 5-6 weeks)
Data Orchestration
Orchestrate complex data workflows with Airflow, Dagster, and modern scheduling tools
Curriculum
1. Apache Airflow: DAGs, Operators & Sensors
2. Scheduling, Retries & Dependency Management
3. Dynamic DAG Generation & Templating
4. Dagster: Software-Defined Assets & Type System
5. Prefect: Flow-Based & Dynamic Workflows
Tools & Platforms
Apache Airflow, Dagster, Prefect, Astronomer, Mage, Kestra
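Underneath every orchestrator is the same idea: tasks form a directed acyclic graph, and a task runs only after its upstream tasks finish. The sketch below shows that core with Python's standard-library graphlib module (Python 3.9 or newer). It is not Airflow, Dagster, or Prefect code; the task names and the two helper functions are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return tasks in an order that respects upstream dependencies.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


def is_valid_dag(dependencies):
    """Return False if the dependency graph contains a cycle."""
    try:
        execution_order(dependencies)
    except CycleError:
        return False
    return True


pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
```

Orchestrators add scheduling, retries, and state tracking on top, but rejecting cycles and ordering by dependencies is the part that makes a workflow a DAG.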
Step 9 (advanced, 5-6 weeks)
Data Quality & Governance
Ensure data reliability through validation, lineage tracking, and governance frameworks
Curriculum
1. Data Validation & Quality Frameworks
2. dbt Testing: Schema Tests, Custom Tests & Freshness
3. Data Lineage Tracking & Impact Analysis
4. Schema Registry & Schema Evolution Management
5. Data Cataloging & Metadata Management
Tools & Platforms
Great Expectations, dbt, Apache Atlas, DataHub, Monte Carlo, Soda
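Two of the most common data quality checks are "this column is never null" and "this column is unique"; dbt ships generic tests named `not_null` and `unique` for exactly these. The sketch below expresses the same two checks in plain Python over a list of dictionaries. It is an illustration of the idea only, not how dbt or Great Expectations implement it, and the function names are made up.

```python
def check_not_null(rows, column):
    """Return the indexes of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_unique(rows, column):
    """Return the indexes of rows whose `column` value was already seen."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return duplicates
```

Returning the offending row indexes, rather than a bare pass or fail, makes it easier to quarantine bad records or to report exactly what broke.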
Step 10 (advanced, 6-8 weeks)
Data Architecture at Scale
Design modern data architectures including data mesh, data fabric, and medallion patterns
Curriculum
1. Data Mesh: Domains, Products & Federated Governance
2. Data Fabric & Unified Data Access Patterns
3. Medallion Architecture: Bronze, Silver & Gold Layers
4. Cost Optimization & Capacity Planning
5. Architectural Decision Records & Trade-off Analysis
Tools & Platforms
Databricks, Snowflake, Apache Iceberg, Terraform, dbt, Atlan
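The medallion pattern layers data by how refined it is: bronze holds raw records as ingested, silver holds cleaned and deduplicated records, and gold holds business-level aggregates. The sketch below walks a few made-up order records through those layers in plain Python. The field names and cleaning rules are assumptions for illustration; on a lakehouse each layer would typically be a Delta Lake or Iceberg table.

```python
def to_silver(bronze):
    """Drop malformed rows, normalize types, and keep the last row per order_id."""
    latest = {}
    for record in bronze:
        try:
            order_id = int(record["order_id"])
            amount = float(record["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed raw record: stays in bronze only
        latest[order_id] = {
            "order_id": order_id,
            "region": str(record.get("region", "unknown")).strip().lower(),
            "amount": amount,
        }
    return list(latest.values())


def to_gold(silver):
    """Aggregate cleaned orders into total revenue per region."""
    totals = {}
    for row in silver:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


bronze_orders = [
    {"order_id": "1", "region": " EU ", "amount": "10.5"},
    {"order_id": "1", "region": "EU", "amount": "12.0"},  # later correction
    {"order_id": "2", "region": "us", "amount": "3"},
    {"order_id": "x", "region": "us", "amount": "1"},     # malformed id
]
```

Keeping bronze untouched means the silver and gold layers can be rebuilt from scratch whenever the cleaning rules change.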