End-to-End Lakehouse, Analytics & ML Training (Day 0 – Day 14)
Name: Guru Saran Satsangi Peddinti
Program: BS in Data Science & Applications – IIT Madras
Community: Indian Data Club
Challenge: Databricks 14 Days AI Challenge
Duration: 09 Jan 2026 – 22 Jan 2026
Platform: Databricks Community Edition
This document serves as a comprehensive training log and technical summary of my work during the Databricks 14 Days AI Challenge, organized by the Indian Data Club, delivered in collaboration with Codebasics, and sponsored by Databricks.
The goal of this challenge was to build real-world, hands-on foundations in:
- Databricks Lakehouse Architecture
- Apache Spark & Delta Lake
- Data Engineering pipelines
- Governance, performance, and orchestration
- SQL analytics & dashboards
- Statistical analysis & Machine Learning
- MLflow & AI-powered analytics concepts
All implementations were done using Databricks Community Edition, with transparent handling of platform limitations.
Repository layout:
databricks-14-days-ai-challenge/
│
├── Day_0_Ecommerce_Setup.py
├── Day_1_PySpark_Basics.py
├── Day_2_Apache_Spark_Fundamentals.py
├── Day_3_PySpark_Transformations.py
├── Day_4_Delta_Lake_Basics.py
├── Day_5_Delta_Lake_Advanced.py
├── Day_6_Medallion_Architecture.py
│
├── Day_7_Bronze_Job.py
├── Day_7_Silver_Job.py
├── Day_7_Gold_Job.py
├── Day_7_Job_Controller.py
├── Day_7_Verify.py
│
├── Day_8_Governance.py
├── Day_9_SQL_Analytics.py
├── Day_10_Performance.py
├── Day_11_Statistical_Analysis.py
├── Day_12_MLflow_Basics.py
├── Day_13_Model_Comparison.py
├── Day_14_AI_Powered_Analytics.py
│
└── README.md (this document)
Dataset
- E-commerce Behavior Dataset (2019 – October & November)
- Source: Kaggle
- Scale: 13+ million events
- Event types: view, cart, purchase, remove_from_cart
This dataset was processed using the Bronze → Silver → Gold (Medallion) Architecture.
Day 0 – E-commerce Setup
Focus: Environment preparation and ingestion
- Databricks Community Edition setup
- Kaggle API configuration
- Schema & volume creation
- Raw CSV download and ingestion

Outcome: Reliable, repeatable data loading pipeline.
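The setup steps above can be sketched in Databricks SQL; the catalog, schema, and volume names below are illustrative placeholders, not the ones actually used in the challenge:

```sql
-- Illustrative names; adjust catalog/schema to your workspace.
CREATE SCHEMA IF NOT EXISTS workspace.ecommerce;

-- Volumes provide governed file storage for raw downloads.
CREATE VOLUME IF NOT EXISTS workspace.ecommerce.raw;

-- CSVs fetched via the Kaggle API would then land under a path such as
-- /Volumes/workspace/ecommerce/raw/2019-Oct.csv
```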
Day 1 – PySpark Basics
Focus: Spark fundamentals
- DataFrames vs Pandas
- Basic transformations
- Schema inspection and filtering

Outcome: Comfort with Spark syntax and execution.
Day 2 – Apache Spark Fundamentals
Focus: Spark internals
- Driver, executors, DAG
- Lazy evaluation
- Spark SQL & temp views

Outcome: Ability to reason about Spark execution.
Day 3 – PySpark Transformations
Focus: Advanced transformations
- Joins (inner, left, right, outer)
- Window functions
- Aggregations, pivots
- Derived features

Outcome: Production-style data transformations.
Day 4 – Delta Lake Basics
Focus: Reliable storage
- Delta vs Parquet
- ACID transactions
- Schema enforcement

Outcome: Transaction-safe data lake.
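Delta's schema enforcement and ACID writes can be sketched in Databricks SQL (table and column names are illustrative):

```sql
-- Delta table with an enforced schema; writes with mismatched columns are rejected.
CREATE TABLE IF NOT EXISTS workspace.ecommerce.events_delta (
  event_id   STRING,
  event_type STRING,
  price      DOUBLE
) USING DELTA;

-- ACID append: concurrent readers keep seeing a consistent snapshot.
INSERT INTO workspace.ecommerce.events_delta VALUES ('e1', 'view', 19.99);
```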
Day 5 – Delta Lake Advanced
Focus: Data evolution & optimization
- Time Travel
- MERGE (upserts)
- OPTIMIZE & ZORDER
- VACUUM

Outcome: Incremental and performant pipelines.
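The four operations above, sketched as Databricks SQL against hypothetical table names:

```sql
-- MERGE: upsert staged events into the Silver table.
MERGE INTO silver_events AS t
USING staged_events AS s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time Travel: query an earlier snapshot of the table.
SELECT COUNT(*) FROM silver_events VERSION AS OF 1;

-- OPTIMIZE & ZORDER: compact small files and co-locate rows by a frequent filter column.
OPTIMIZE silver_events ZORDER BY (user_id);

-- VACUUM: remove files no longer referenced by the transaction log.
VACUUM silver_events;
```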
Day 6 – Medallion Architecture
Focus: Pipeline design
- Bronze: raw ingestion
- Silver: cleaned & deduplicated
- Gold: business aggregates

Outcome: Scalable, maintainable data architecture.
Day 7 – Job Orchestration
Focus: Automation
- Notebook parameterization
- Multi-task workflows
- Dependencies (Bronze → Silver → Gold)
- Job controller logic

Outcome: Fully automated pipelines.
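The Bronze → Silver → Gold dependency chain maps naturally onto a multi-task job; a sketch of a Databricks Jobs payload, with hypothetical notebook paths:

```json
{
  "name": "medallion_pipeline",
  "tasks": [
    {"task_key": "bronze",
     "notebook_task": {"notebook_path": "/Repos/demo/Day_7_Bronze_Job"}},
    {"task_key": "silver",
     "depends_on": [{"task_key": "bronze"}],
     "notebook_task": {"notebook_path": "/Repos/demo/Day_7_Silver_Job"}},
    {"task_key": "gold",
     "depends_on": [{"task_key": "silver"}],
     "notebook_task": {"notebook_path": "/Repos/demo/Day_7_Gold_Job"}}
  ]
}
```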
Day 8 – Governance
Focus: Data governance
- Catalog → Schema → Table hierarchy
- Permissions & access control
- Controlled views
- Lineage awareness

Outcome: Secure and discoverable data platform.
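A governance sketch in Databricks SQL: a controlled view exposes only aggregates, and access is granted at the view level (all names and the `analysts` group are illustrative):

```sql
-- Three-level namespace: catalog.schema.object
CREATE VIEW workspace.ecommerce.revenue_by_day AS
SELECT event_date, SUM(price) AS revenue
FROM workspace.ecommerce.gold_sales
GROUP BY event_date;

-- Consumers query the view, never the underlying table.
GRANT SELECT ON VIEW workspace.ecommerce.revenue_by_day TO `analysts`;
```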
Day 9 – SQL Analytics
Focus: Business analytics
- Analytical SQL queries
- Revenue trends
- Funnels & conversion analysis

Outcome: Insight generation from Gold data.
Day 10 – Performance
Focus: Speed & efficiency
- Query plans
- Partitioning
- OPTIMIZE & ZORDER
- Benchmarking

Outcome: Performance-aware Spark usage.
Day 11 – Statistical Analysis
Focus: ML readiness
- Descriptive statistics
- Hypothesis testing
- Correlation checks
- Feature engineering

Outcome: Clean, ML-ready datasets.
Day 12 – MLflow Basics
Focus: Experiment tracking
- MLflow runs
- Parameter & metric logging
- Model artifacts
- Handling NULL/NaN labels

Outcome: Reproducible ML experiments.
Day 13 – Model Comparison
Focus: Model evaluation
- Multiple regression models
- Metric comparison
- Feature importance
- Spark ML pipelines

Outcome: Informed model selection.
Day 14 – AI-Powered Analytics
Focus: AI in analytics
- Databricks Genie (NL → SQL concept)
- Mosaic AI overview
- AI-assisted analytics
- Spark ML–based AI demo
- MLflow logging for AI workflows

Note: Full GenAI features require paid Databricks workspaces; Community Edition constraints were handled transparently.
Outcome: Clear understanding of AI’s role in modern data platforms.
Next Steps
With the 14-day foundation complete, the next step is the Codebasics Capstone Project, where:
- A real-world problem statement will be provided
- End-to-end data engineering, analytics, and ML will be applied
- Best practices learned here will be consolidated into a production-grade solution

This training phase focused on fundamentals, correctness, and reasoning, ensuring readiness for the capstone.
Key Takeaways
- Built a complete Lakehouse pipeline from raw data to AI insights
- Learned to debug real platform and data issues
- Practiced governance, performance, and ML workflows
- Developed habits aligned with industry-grade data engineering
Acknowledgements
- Databricks – Platform & ecosystem
- Indian Data Club – Community & challenge organization
- Codebasics – Structured learning & capstone phase

This document represents a complete, hands-on Databricks learning journey from Day 0 to Day 14 and serves as the foundation for the upcoming capstone project.