This comprehensive course spans 4 months (16 weeks) and equips learners with expertise in Python, SQL, Azure, AWS, Apache Airflow, Kafka, Spark, and more.
- Learning Days: Monday to Thursday (theory and practice).
- Friday: Job shadowing or peer projects.
- Saturday: Hands-on lab sessions and project-based learning.
- Week 1: Orientation, Cloud Foundations, and Environment Setup
  - Monday: Onboarding, course overview, career pathways, and tools introduction.
  - Tuesday: Introduction to cloud computing (Azure and AWS).
  - Wednesday: Data governance, security, compliance, and access control.
  - Thursday: Introduction to SQL for data engineering and PostgreSQL setup.
  - Friday: Peer Project: Environment setup challenges.
  - Saturday (Lab): Mini Project: Build a basic pipeline with PostgreSQL and Azure Blob Storage (see the sketch below).
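A minimal starting point for the Week 1 mini project, assuming the `azure-storage-blob` and `psycopg2` packages are installed; the connection string, container, file, and table names are placeholders to replace with your own:

```python
import psycopg2
from azure.storage.blob import BlobServiceClient

# Placeholder credentials -- replace with your own values.
AZURE_CONN_STR = "<your-azure-storage-connection-string>"
PG_DSN = "dbname=demo user=postgres password=postgres host=localhost"

# 1. Upload the raw CSV to Azure Blob Storage.
blob_service = BlobServiceClient.from_connection_string(AZURE_CONN_STR)
blob_client = blob_service.get_blob_client(container="raw-data", blob="sales.csv")
with open("sales.csv", "rb") as f:
    blob_client.upload_blob(f, overwrite=True)

# 2. Load the same CSV into PostgreSQL with COPY.
conn = psycopg2.connect(PG_DSN)
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INT, product TEXT, amount NUMERIC)"
    )
    with open("sales.csv") as f:
        cur.copy_expert("COPY sales FROM STDIN WITH CSV HEADER", f)
conn.close()
```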
- Week 2: SQL for Data Engineering and Data Modeling
  - Monday: Core SQL concepts (`SELECT`, `WHERE`, `JOIN`, `GROUP BY`).
  - Tuesday: Advanced SQL techniques: recursive queries, window functions, and CTEs.
  - Wednesday: Query optimization and execution plans.
  - Thursday: Data modeling: normalization, denormalization, and star schemas.
  - Friday: Job Shadowing: Observe senior engineers writing and optimizing SQL queries.
  - Saturday (Lab): Mini Project: Create a star schema and analyze data using SQL (see the sketch below).
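A runnable sketch of Tuesday's techniques using Python's built-in `sqlite3` module (window functions require SQLite 3.25+); the `orders` table and its columns are invented for illustration:

```python
import sqlite3

# In-memory database with a tiny orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'north', 120.0), (2, 'north', 80.0),
        (3, 'south', 200.0), (4, 'south', 50.0);
""")

# A CTE computes regional totals; a window function ranks orders
# by amount within each region.
query = """
WITH regional AS (
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
)
SELECT o.id,
       o.region,
       o.amount,
       r.total,
       RANK() OVER (PARTITION BY o.region ORDER BY o.amount DESC) AS rnk
FROM orders o
JOIN regional r ON r.region = o.region;
"""
for row in conn.execute(query):
    print(row)
```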
- Week 3: ETL/ELT Fundamentals with Python
  - Monday: Theory: Introduction to ETL/ELT workflows.
  - Tuesday: Lab: Create a simple Python-based ETL pipeline for CSV data.
  - Wednesday: Theory: Extract, transform, load (ETL) concepts and best practices.
  - Thursday: Lab: Build a Python ETL pipeline for batch data processing.
  - Friday: Peer Project: Collaborate to design a basic ETL workflow.
  - Saturday (Lab): Mini Project: Develop a simple ETL pipeline to process sales data (see the sketch below).
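One possible shape for this pipeline, assuming `pandas` is installed and a `sales.csv` file exists with hypothetical `order_date`, `quantity`, and `unit_price` columns:

```python
import pandas as pd

# Extract: read the raw CSV (column names here are assumptions).
raw = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Transform: drop incomplete rows, derive a revenue column,
# and aggregate daily totals.
clean = raw.dropna(subset=["quantity", "unit_price"])
clean["revenue"] = clean["quantity"] * clean["unit_price"]
daily = (
    clean.groupby(clean["order_date"].dt.date)["revenue"]
    .sum()
    .reset_index(name="daily_revenue")
)

# Load: write the curated result to a new CSV (a database load
# would replace this step in a real pipeline).
daily.to_csv("daily_revenue.csv", index=False)
```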
- Week 4: Workflow Orchestration with Apache Airflow
  - Monday: Theory: Introduction to Apache Airflow, DAGs, and scheduling.
  - Tuesday: Lab: Set up Apache Airflow and create a basic DAG.
  - Wednesday: Theory: DAG best practices and scheduling in Airflow.
  - Thursday: Lab: Integrate Airflow with PostgreSQL and Azure Blob Storage.
  - Friday: Job Shadowing: Observe real-world Airflow pipelines.
  - Saturday (Lab): Mini Project: Automate an ETL pipeline with Airflow for batch data processing (see the sketch below).
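A minimal DAG sketch in the Airflow 2.x style (older releases use `schedule_interval` instead of `schedule`); the DAG and task names are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting...")

def load():
    print("loading...")

# A minimal daily DAG with two dependent tasks.
with DAG(
    dag_id="demo_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load
```

Real task callables would replace the print statements with the extract and load logic from Week 3.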
- Week 5: Data Warehousing, Data Lakes, and Lakehouse Architecture
  - Monday: Theory: Introduction to data warehousing (OLAP vs. OLTP, partitioning, clustering).
  - Tuesday: Lab: Work with Amazon Redshift and Snowflake for data warehousing.
  - Wednesday: Theory: Data lakes and Lakehouse architecture.
  - Thursday: Lab: Set up Delta Lake for raw and curated data.
  - Friday: Peer Project: Implement a data warehouse model and data lake for sales data.
  - Saturday (Lab): Mini Project: Design and implement a basic Lakehouse architecture (see the sketch below).
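A small PySpark sketch of Monday's partitioning idea: writing Parquet with `partitionBy` lays the data lake out as one directory per partition value, so queries that filter on the partition column can skip unrelated files. The path and columns are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# A tiny sales dataset standing in for a real extract.
df = spark.createDataFrame(
    [("2024-01-01", "north", 120.0),
     ("2024-01-01", "south", 200.0),
     ("2024-01-02", "north", 80.0)],
    ["order_date", "region", "amount"],
)

# One directory per order_date value is written under the target path,
# e.g. /tmp/sales_lake/order_date=2024-01-01/.
df.write.mode("overwrite").partitionBy("order_date").parquet("/tmp/sales_lake")
spark.stop()
```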
- Week 6: Data Governance and Security
  - Monday: Theory: Data governance frameworks and data security principles.
  - Tuesday: Lab: Use AWS Lake Formation for access control and security enforcement.
  - Wednesday: Theory: Managing sensitive data and compliance (GDPR, HIPAA).
  - Thursday: Lab: Implement security policies in S3 and Azure Blob Storage.
  - Friday: Job Shadowing: Observe senior engineers applying governance policies.
  - Saturday (Lab): Mini Project: Secure data in the cloud using AWS and Azure (see the sketch below).
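A `boto3` sketch of two baseline S3 controls, assuming AWS credentials are already configured and `my-demo-bucket` is replaced with a bucket you own:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-demo-bucket"  # placeholder bucket name

# Block all forms of public access at the bucket level -- a common
# baseline control before layering finer-grained IAM policies.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Enable default server-side encryption for data at rest.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```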
- Week 7: Real-Time Streaming with Apache Kafka
  - Monday: Theory: Introduction to Apache Kafka for real-time data streaming.
  - Tuesday: Lab: Set up a Kafka producer and consumer.
  - Wednesday: Theory: Kafka topics, partitions, and message brokers.
  - Thursday: Lab: Integrate Kafka with PostgreSQL for real-time updates.
  - Friday: Peer Project: Build a real-time Kafka pipeline for transactional data.
  - Saturday (Lab): Mini Project: Create a pipeline to stream e-commerce data with Kafka (see the sketch below).
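A minimal producer/consumer pair using the `kafka-python` package, assuming a broker at `localhost:9092`; the `orders` topic and event fields are invented for the example:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer: publish a few JSON order events.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for i in range(3):
    producer.send("orders", {"order_id": i, "amount": 10.0 * i})
producer.flush()

# Consumer: read the events back from the beginning of the topic.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating after 5s of silence
)
for message in consumer:
    print(message.value)
```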
- Week 8: Batch and Stream Processing
  - Monday: Theory: Introduction to batch vs. stream processing.
  - Tuesday: Lab: Batch processing with PySpark.
  - Wednesday: Theory: Combining batch and stream processing workflows.
  - Thursday: Lab: Real-time processing with Apache Flink and Spark Streaming.
  - Friday: Job Shadowing: Observe a real-time processing pipeline.
  - Saturday (Lab): Mini Project: Build a hybrid pipeline combining batch and real-time processing (see the sketch below).
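A sketch of the streaming half using Spark Structured Streaming's built-in `rate` source, so no external feed is needed; the 10-second window is an arbitrary choice:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The "rate" source emits (timestamp, value) rows continuously,
# standing in for a live event feed such as a Kafka topic.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# The same aggregation logic you would run in batch, applied per
# 10-second event-time window.
counts = events.groupBy(F.window("timestamp", "10 seconds")).agg(
    F.count("*").alias("events")
)

# Print running results to the console for about 30 seconds.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(30)
query.stop()
spark.stop()
```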
- Week 9: Machine Learning in Data Pipelines
  - Monday: Theory: Overview of ML workflows in data engineering.
  - Tuesday: Lab: Preprocess data for machine learning using Pandas and PySpark.
  - Wednesday: Theory: Feature engineering and automated feature extraction.
  - Thursday: Lab: Automate feature extraction using Apache Airflow.
  - Friday: Peer Project: Build a simple pipeline that integrates ML models.
  - Saturday (Lab): Mini Project: Build an ML-powered recommendation system in a pipeline (see the sketch below).
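A small `pandas` sketch of Tuesday's preprocessing steps (imputation, a derived feature, one-hot encoding) on an invented user-activity frame:

```python
import pandas as pd

# A toy user-activity frame standing in for real training data.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", None],
    "sessions": [5, None, 12, 3],
    "signup_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-03-01"]
    ),
})

# Impute missing numeric values with the median.
df["sessions"] = df["sessions"].fillna(df["sessions"].median())

# Derive a tenure feature from the signup date.
df["tenure_days"] = (pd.Timestamp("2024-04-01") - df["signup_date"]).dt.days

# One-hot encode the categorical column for model input.
features = pd.get_dummies(df.drop(columns=["signup_date"]), columns=["country"])
print(features)
```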
- Week 10: Big Data Processing with Apache Spark
  - Monday: Theory: Introduction to Apache Spark for big data processing.
  - Tuesday: Lab: Set up Spark and PySpark for data analysis.
  - Wednesday: Theory: Spark RDDs, DataFrames, and SQL.
  - Thursday: Lab: Analyze large datasets using Spark SQL.
  - Friday: Peer Project: Build a PySpark pipeline for large-scale data processing.
  - Saturday (Lab): Mini Project: Analyze big data sets with Spark and PySpark (see the sketch below).
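A compact example of Wednesday's DataFrame/SQL interplay, assuming `pyspark` is installed; the data is a toy stand-in:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("north", 120.0), ("south", 200.0), ("north", 80.0)],
    ["region", "amount"],
)

# Registering a temp view lets you mix DataFrame code with plain SQL.
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
).show()
spark.stop()
```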
- Week 11: Advanced Apache Airflow
  - Monday: Theory: Advanced Airflow features (XCom, task dependencies).
  - Tuesday: Lab: Implement dynamic DAGs and task dependencies in Airflow.
  - Wednesday: Theory: Airflow scheduling, monitoring, and error handling.
  - Thursday: Lab: Create complex DAGs for multi-step ETL pipelines.
  - Friday: Job Shadowing: Observe advanced Airflow pipeline implementations.
  - Saturday (Lab): Mini Project: Design an advanced Airflow DAG for complex data workflows (see the sketch below).
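One way to see XCom in action is Airflow 2.x's TaskFlow API, where return values are pushed to XCom and function arguments are pulled from it automatically; the DAG below is a toy sketch:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def xcom_demo():
    @task
    def extract() -> int:
        # The return value is pushed to XCom automatically.
        return 42

    @task
    def transform(row_count: int) -> None:
        # The argument is pulled from the upstream task's XCom.
        print(f"upstream produced {row_count} rows")

    # Passing one task's result to another wires the dependency.
    transform(extract())

xcom_demo()
```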
- Week 12: Delta Lake and Lakehouse Implementation
  - Monday: Theory: Data lakes, Lakehouses, and Delta Lake architecture.
  - Tuesday: Lab: Set up Delta Lake on AWS for data storage and management.
  - Wednesday: Theory: Managing schema evolution in Delta Lake.
  - Thursday: Lab: Implement batch and real-time data loading to Delta Lake.
  - Friday: Peer Project: Design a Lakehouse architecture for an e-commerce platform.
  - Saturday (Lab): Mini Project: Implement a scalable Delta Lake architecture (see the sketch below).
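A lightweight sketch using the `deltalake` (delta-rs) package, which exercises the transaction log and time travel without a Spark cluster; the local path stands in for an S3 URI:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/demo_delta"  # local path standing in for an S3 URI

# The initial write creates the Delta table and its transaction log.
write_deltalake(path, pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]}))

# Appends become new versions in the transaction log.
write_deltalake(
    path, pd.DataFrame({"id": [3], "amount": [30.0]}), mode="append"
)

# Read the latest version...
print(DeltaTable(path).to_pandas())

# ...or time-travel back to the first commit.
print(DeltaTable(path, version=0).to_pandas())
```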
- Week 13: Capstone: Batch Data Pipeline
  - Monday to Thursday: Design and Implementation:
    - Build an end-to-end batch data pipeline for e-commerce sales analytics.
    - Tools: PySpark, SQL, PostgreSQL, Airflow, S3.
  - Friday: Peer Review: Present progress and receive feedback.
  - Saturday (Lab): Project Milestone: Finalize and present batch pipeline results.
- Week 14: Capstone: Real-Time Data Pipeline
  - Monday to Thursday: Design and Implementation:
    - Build an end-to-end real-time data pipeline for IoT sensor monitoring.
    - Tools: Kafka, Spark Streaming, Flink, S3.
  - Friday: Peer Review: Present progress and receive feedback.
  - Saturday (Lab): Project Milestone: Finalize and present real-time pipeline results.
- Week 15: Capstone: Pipeline Integration
  - Monday to Thursday: Design and Implementation:
    - Integrate the batch and real-time pipelines into a comprehensive end-to-end solution.
    - Tools: Kafka, PySpark, Airflow, Delta Lake, PostgreSQL, and S3.
  - Friday: Job Shadowing: Observe senior engineers integrating complex pipelines.
  - Saturday (Lab): Project Milestone: Showcase the integrated solution for review.
- Week 16: Capstone: Final Presentations
  - Monday to Thursday: Final Presentation Preparation:
    - Polish, test, and document the final project.
  - Friday: Peer Review: Present final projects to peers and receive feedback.
  - Saturday (Lab): Capstone Presentation: Showcase completed capstone projects to industry professionals and instructors.