This comprehensive course spans 4 months (16 weeks) and equips learners with expertise in Python, SQL, Azure, AWS, Apache Airflow, Kafka, Spark, and more.
- Learning Days: Monday to Thursday (theory and practice).
- Friday: Job shadowing or peer projects.
- Saturday: Hands-on lab sessions and project-based learning.
- Week 1: Orientation, Cloud Foundations, and Environment Setup
  - Monday: Onboarding, course overview, career pathways, and tools introduction.
  - Tuesday: Introduction to cloud computing (Azure and AWS).
  - Wednesday: Data governance, security, compliance, and access control.
  - Thursday: Introduction to SQL for data engineering and PostgreSQL setup.
  - Friday: Peer Project: Environment setup challenges.
  - Saturday (Lab): Mini Project: Build a basic pipeline with PostgreSQL and Azure Blob Storage (see the sketch below).
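A minimal starting point for the Week 1 mini project, assuming the `azure-storage-blob` and `psycopg2` packages are installed; the connection string, container, file, and table names are placeholders to replace with your own:

```python
import psycopg2
from azure.storage.blob import BlobServiceClient

# Placeholder credentials -- replace with your own values.
AZURE_CONN_STR = "<your-azure-storage-connection-string>"
PG_DSN = "dbname=demo user=postgres password=postgres host=localhost"

# 1. Upload the raw CSV to Azure Blob Storage.
blob_service = BlobServiceClient.from_connection_string(AZURE_CONN_STR)
blob_client = blob_service.get_blob_client(container="raw-data", blob="sales.csv")
with open("sales.csv", "rb") as f:
    blob_client.upload_blob(f, overwrite=True)

# 2. Load the same CSV into PostgreSQL with COPY.
conn = psycopg2.connect(PG_DSN)
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INT, product TEXT, amount NUMERIC)"
    )
    with open("sales.csv") as f:
        cur.copy_expert("COPY sales FROM STDIN WITH CSV HEADER", f)
conn.close()
```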
- Week 2: SQL for Data Engineering and Data Modeling
  - Monday: Core SQL concepts (`SELECT`, `WHERE`, `JOIN`, `GROUP BY`).
  - Tuesday: Advanced SQL techniques: recursive queries, window functions, and CTEs.
  - Wednesday: Query optimization and execution plans.
  - Thursday: Data modeling: normalization, denormalization, and star schemas.
  - Friday: Job Shadowing: Observe senior engineers writing and optimizing SQL queries.
  - Saturday (Lab): Mini Project: Create a star schema and analyze data using SQL (see the sketch below).
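A runnable sketch of Tuesday's techniques using Python's built-in `sqlite3` module (window functions require SQLite 3.25+); the `orders` table and its columns are invented for illustration:

```python
import sqlite3

# In-memory database with a tiny orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'north', 120.0), (2, 'north', 80.0),
        (3, 'south', 200.0), (4, 'south', 50.0);
""")

# A CTE computes regional totals; a window function ranks orders
# by amount within each region.
query = """
WITH regional AS (
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
)
SELECT o.id,
       o.region,
       o.amount,
       r.total,
       RANK() OVER (PARTITION BY o.region ORDER BY o.amount DESC) AS rnk
FROM orders o
JOIN regional r ON r.region = o.region;
"""
for row in conn.execute(query):
    print(row)
```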
- Week 3: ETL/ELT Fundamentals with Python
  - Monday: Theory: Introduction to ETL/ELT workflows.
  - Tuesday: Lab: Create a simple Python-based ETL pipeline for CSV data.
  - Wednesday: Theory: Extract, transform, load (ETL) concepts and best practices.
  - Thursday: Lab: Build a Python ETL pipeline for batch data processing.
  - Friday: Peer Project: Collaborate to design a basic ETL workflow.
  - Saturday (Lab): Mini Project: Develop a simple ETL pipeline to process sales data (see the sketch below).
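One possible shape for this pipeline, assuming `pandas` is installed and a `sales.csv` file exists with hypothetical `order_date`, `quantity`, and `unit_price` columns:

```python
import pandas as pd

# Extract: read the raw CSV (column names here are assumptions).
raw = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Transform: drop incomplete rows, derive a revenue column,
# and aggregate daily totals.
clean = raw.dropna(subset=["quantity", "unit_price"])
clean["revenue"] = clean["quantity"] * clean["unit_price"]
daily = (
    clean.groupby(clean["order_date"].dt.date)["revenue"]
    .sum()
    .reset_index(name="daily_revenue")
)

# Load: write the curated result to a new CSV (a database load
# would replace this step in a real pipeline).
daily.to_csv("daily_revenue.csv", index=False)
```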
- Week 4: Workflow Orchestration with Apache Airflow
  - Monday: Theory: Introduction to Apache Airflow, DAGs, and scheduling.
  - Tuesday: Lab: Set up Apache Airflow and create a basic DAG.
  - Wednesday: Theory: DAG best practices and scheduling in Airflow.
  - Thursday: Lab: Integrate Airflow with PostgreSQL and Azure Blob Storage.
  - Friday: Job Shadowing: Observe real-world Airflow pipelines.
  - Saturday (Lab): Mini Project: Automate an ETL pipeline with Airflow for batch data processing (see the sketch below).
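A minimal DAG sketch in the Airflow 2.x style (older releases use `schedule_interval` instead of `schedule`); the DAG and task names are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting...")

def load():
    print("loading...")

# A minimal daily DAG with two dependent tasks.
with DAG(
    dag_id="demo_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load
```

Real task callables would replace the print statements with the extract and load logic from Week 3.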
- Week 5: Data Warehousing, Data Lakes, and Lakehouse Architecture
  - Monday: Theory: Introduction to data warehousing (OLAP vs. OLTP, partitioning, clustering).
  - Tuesday: Lab: Work with Amazon Redshift and Snowflake for data warehousing.
  - Wednesday: Theory: Data lakes and Lakehouse architecture.
  - Thursday: Lab: Set up Delta Lake for raw and curated data.
  - Friday: Peer Project: Implement a data warehouse model and data lake for sales data.
  - Saturday (Lab): Mini Project: Design and implement a basic Lakehouse architecture (see the sketch below).
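A small PySpark sketch of Monday's partitioning idea: writing Parquet with `partitionBy` lays the data lake out as one directory per partition value, so queries that filter on the partition column can skip unrelated files. The path and columns are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# A tiny sales dataset standing in for a real extract.
df = spark.createDataFrame(
    [("2024-01-01", "north", 120.0),
     ("2024-01-01", "south", 200.0),
     ("2024-01-02", "north", 80.0)],
    ["order_date", "region", "amount"],
)

# One directory per order_date value is written under the target path,
# e.g. /tmp/sales_lake/order_date=2024-01-01/.
df.write.mode("overwrite").partitionBy("order_date").parquet("/tmp/sales_lake")
spark.stop()
```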
- Week 6: Data Governance and Security
  - Monday: Theory: Data governance frameworks and data security principles.
  - Tuesday: Lab: Use AWS Lake Formation for access control and security enforcement.
  - Wednesday: Theory: Managing sensitive data and compliance (GDPR, HIPAA).
  - Thursday: Lab: Implement security policies in S3 and Azure Blob Storage.
  - Friday: Job Shadowing: Observe senior engineers applying governance policies.
  - Saturday (Lab): Mini Project: Secure data in the cloud using AWS and Azure (see the sketch below).
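A `boto3` sketch of two baseline S3 controls, assuming AWS credentials are already configured and `my-demo-bucket` is replaced with a bucket you own:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-demo-bucket"  # placeholder bucket name

# Block all forms of public access at the bucket level -- a common
# baseline control before layering finer-grained IAM policies.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Enable default server-side encryption for data at rest.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```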
- Week 7: Real-Time Streaming with Apache Kafka
  - Monday: Theory: Introduction to Apache Kafka for real-time data streaming.
  - Tuesday: Lab: Set up a Kafka producer and consumer.
  - Wednesday: Theory: Kafka topics, partitions, and message brokers.
  - Thursday: Lab: Integrate Kafka with PostgreSQL for real-time updates.
  - Friday: Peer Project: Build a real-time Kafka pipeline for transactional data.
  - Saturday (Lab): Mini Project: Create a pipeline to stream e-commerce data with Kafka (see the sketch below).
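A minimal producer/consumer pair using the `kafka-python` package, assuming a broker at `localhost:9092`; the `orders` topic and event fields are invented for the example:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer: publish a few JSON order events.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for i in range(3):
    producer.send("orders", {"order_id": i, "amount": 10.0 * i})
producer.flush()

# Consumer: read the events back from the beginning of the topic.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating after 5s of silence
)
for message in consumer:
    print(message.value)
```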
- Week 8: Batch and Stream Processing
  - Monday: Theory: Introduction to batch vs. stream processing.
  - Tuesday: Lab: Batch processing with PySpark.
  - Wednesday: Theory: Combining batch and stream processing workflows.
  - Thursday: Lab: Real-time processing with Apache Flink and Spark Streaming.
  - Friday: Job Shadowing: Observe a real-time processing pipeline.
  - Saturday (Lab): Mini Project: Build a hybrid pipeline combining batch and real-time processing (see the sketch below).
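A sketch of the streaming half using Spark Structured Streaming's built-in `rate` source, so no external feed is needed; the 10-second window is an arbitrary choice:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The "rate" source emits (timestamp, value) rows continuously,
# standing in for a live event feed such as a Kafka topic.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# The same aggregation logic you would run in batch, applied per
# 10-second event-time window.
counts = events.groupBy(F.window("timestamp", "10 seconds")).agg(
    F.count("*").alias("events")
)

# Print running results to the console for about 30 seconds.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(30)
query.stop()
spark.stop()
```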
- Week 9: Machine Learning in Data Pipelines
  - Monday: Theory: Overview of ML workflows in data engineering.
  - Tuesday: Lab: Preprocess data for machine learning using Pandas and PySpark.
  - Wednesday: Theory: Feature engineering and automated feature extraction.
  - Thursday: Lab: Automate feature extraction using Apache Airflow.
  - Friday: Peer Project: Build a simple pipeline that integrates ML models.
  - Saturday (Lab): Mini Project: Build an ML-powered recommendation system in a pipeline (see the sketch below).
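A small `pandas` sketch of Tuesday's preprocessing steps (imputation, a derived feature, one-hot encoding) on an invented user-activity frame:

```python
import pandas as pd

# A toy user-activity frame standing in for real training data.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", None],
    "sessions": [5, None, 12, 3],
    "signup_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-03-01"]
    ),
})

# Impute missing numeric values with the median.
df["sessions"] = df["sessions"].fillna(df["sessions"].median())

# Derive a tenure feature from the signup date.
df["tenure_days"] = (pd.Timestamp("2024-04-01") - df["signup_date"]).dt.days

# One-hot encode the categorical column for model input.
features = pd.get_dummies(df.drop(columns=["signup_date"]), columns=["country"])
print(features)
```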
- Week 10: Big Data Processing with Apache Spark
  - Monday: Theory: Introduction to Apache Spark for big data processing.
  - Tuesday: Lab: Set up Spark and PySpark for data analysis.
  - Wednesday: Theory: Spark RDDs, DataFrames, and SQL.
  - Thursday: Lab: Analyze large datasets using Spark SQL.
  - Friday: Peer Project: Build a PySpark pipeline for large-scale data processing.
  - Saturday (Lab): Mini Project: Analyze big data sets with Spark and PySpark (see the sketch below).
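A compact example of Wednesday's DataFrame/SQL interplay, assuming `pyspark` is installed; the data is a toy stand-in:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("north", 120.0), ("south", 200.0), ("north", 80.0)],
    ["region", "amount"],
)

# Registering a temp view lets you mix DataFrame code with plain SQL.
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
).show()
spark.stop()
```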
- Week 11: Advanced Apache Airflow
  - Monday: Theory: Advanced Airflow features (XCom, task dependencies).
  - Tuesday: Lab: Implement dynamic DAGs and task dependencies in Airflow.
  - Wednesday: Theory: Airflow scheduling, monitoring, and error handling.
  - Thursday: Lab: Create complex DAGs for multi-step ETL pipelines.
  - Friday: Job Shadowing: Observe advanced Airflow pipeline implementations.
  - Saturday (Lab): Mini Project: Design an advanced Airflow DAG for complex data workflows (see the sketch below).
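One way to see XCom in action is Airflow 2.x's TaskFlow API, where return values are pushed to XCom and function arguments are pulled from it automatically; the DAG below is a toy sketch:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def xcom_demo():
    @task
    def extract() -> int:
        # The return value is pushed to XCom automatically.
        return 42

    @task
    def transform(row_count: int) -> None:
        # The argument is pulled from the upstream task's XCom.
        print(f"upstream produced {row_count} rows")

    # Passing one task's result to another wires the dependency.
    transform(extract())

xcom_demo()
```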
- Week 12: Delta Lake and Lakehouse Implementation
  - Monday: Theory: Data lakes, Lakehouses, and Delta Lake architecture.
  - Tuesday: Lab: Set up Delta Lake on AWS for data storage and management.
  - Wednesday: Theory: Managing schema evolution in Delta Lake.
  - Thursday: Lab: Implement batch and real-time data loading to Delta Lake.
  - Friday: Peer Project: Design a Lakehouse architecture for an e-commerce platform.
  - Saturday (Lab): Mini Project: Implement a scalable Delta Lake architecture (see the sketch below).
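A lightweight sketch using the `deltalake` (delta-rs) package, which exercises the transaction log and time travel without a Spark cluster; the local path stands in for an S3 URI:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/demo_delta"  # local path standing in for an S3 URI

# The initial write creates the Delta table and its transaction log.
write_deltalake(path, pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]}))

# Appends become new versions in the transaction log.
write_deltalake(
    path, pd.DataFrame({"id": [3], "amount": [30.0]}), mode="append"
)

# Read the latest version...
print(DeltaTable(path).to_pandas())

# ...or time-travel back to the first commit.
print(DeltaTable(path, version=0).to_pandas())
```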
- Week 13: Capstone: Batch Data Pipeline
  - Monday to Thursday: Design and Implementation:
    - Build an end-to-end batch data pipeline for e-commerce sales analytics.
    - Tools: PySpark, SQL, PostgreSQL, Airflow, S3.
  - Friday: Peer Review: Present progress and receive feedback.
  - Saturday (Lab): Project Milestone: Finalize and present batch pipeline results.
- Week 14: Capstone: Real-Time Data Pipeline
  - Monday to Thursday: Design and Implementation:
    - Build an end-to-end real-time data pipeline for IoT sensor monitoring.
    - Tools: Kafka, Spark Streaming, Flink, S3.
  - Friday: Peer Review: Present progress and receive feedback.
  - Saturday (Lab): Project Milestone: Finalize and present real-time pipeline results.
- Week 15: Capstone: Pipeline Integration
  - Monday to Thursday: Design and Implementation:
    - Integrate the batch and real-time pipelines into a comprehensive end-to-end solution.
    - Tools: Kafka, PySpark, Airflow, Delta Lake, PostgreSQL, and S3.
  - Friday: Job Shadowing: Observe senior engineers integrating complex pipelines.
  - Saturday (Lab): Project Milestone: Showcase the integrated solution for review.
- Week 16: Capstone: Final Presentations
  - Monday to Thursday: Final Presentation Preparation:
    - Polish, test, and document the final project.
  - Friday: Peer Review: Present final projects to peers and receive feedback.
  - Saturday (Lab): Capstone Presentation: Showcase completed capstone projects to industry professionals and instructors.