nithinraaj27/Datalake

Real-Time E-Commerce Data Lake & Analytics Platform

An end-to-end event-driven data lake and analytics system built using Kafka, Spring Boot, AWS S3, Glue, Athena, Airflow, and Metabase.
This project demonstrates a production-style Bronze → Silver → Gold architecture with orchestration, schema evolution, and BI dashboards.

Watch the demo video to learn more about the project and how to get started.

📌 Architecture Overview

Flow: Producers → Kafka → Spring Consumer → S3 (Bronze) → AWS Glue Crawler → Athena CTAS (Silver) → Aggregations (Gold) → BI Dashboard (Metabase)

Design Principles

  • Event-driven ingestion
  • Immutable raw data (Bronze)
  • Curated analytics-ready data (Silver)
  • Business aggregates (Gold)
  • Fully containerized local setup
  • Cloud-native analytics (serverless)

🧩 Tech Stack

Data Ingestion

  • Apache Kafka – Event streaming platform
  • Spring Boot (Java 17) – Kafka consumer, S3 writer
  • Python – Kafka event producer (mock data)
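
The mock producer can be sketched as follows. This is a minimal illustration, not the project's actual producer: the event fields and the `orders` topic name are assumptions, and the `kafka-python` send is left commented so the generator runs standalone.

```python
import json
import random
import uuid
from datetime import datetime, timezone

def make_order_event():
    """Build one mock 'order' event. Field names here are illustrative."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": "order_created",
        "user_id": random.randint(1, 1000),
        "amount": round(random.uniform(5.0, 500.0), 2),
        "currency": "EUR",
        "event_time": datetime.now(timezone.utc).isoformat(),
    }

# With kafka-python installed, the producer loop would send events like:
# from kafka import KafkaProducer
# producer = KafkaProducer(
#     bootstrap_servers="kafka:9092",
#     value_serializer=lambda v: json.dumps(v).encode("utf-8"),
# )
# producer.send("orders", make_order_event())

print(json.dumps(make_order_event(), indent=2))
```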

Data Lake & Processing

  • Amazon S3 – Bronze, Silver, Gold layers
  • AWS Glue Crawler – Schema discovery & cataloging
  • Amazon Athena – SQL-based analytics (CTAS)

Orchestration

  • Apache Airflow (Dockerized)
    • Glue Crawlers
    • Athena CTAS queries
    • Dependency management & retries

BI & Analytics

  • Metabase – Dashboarding & visualization
  • Athena JDBC – Direct analytics on S3 data

Platform & Tooling

  • Docker & Docker Compose
  • AWS IAM (least privilege)
  • GitHub-ready project structure

📂 Data Lake Layers

🥉 Bronze (Raw)

  • One folder per Kafka topic
  • JSON events exactly as produced
  • Immutable, append-only

🥈 Silver (Curated)

  • Normalized schema
  • Partitioned by date and hour
  • Stored as Parquet
  • Created using Athena CTAS
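
A CTAS step of this shape can be sketched as a query template submitted through boto3. The database, table, column, and S3 path names below are illustrative assumptions, not the project's actual identifiers; the boto3 submission is commented so the template builder runs standalone.

```python
def silver_ctas_sql(database: str, table: str, source: str, output_s3: str) -> str:
    """Build an Athena CTAS statement that rewrites raw JSON as
    Parquet partitioned by date and hour (partition columns must
    come last in the SELECT for Athena CTAS)."""
    return f"""
    CREATE TABLE {database}.{table}
    WITH (
        format = 'PARQUET',
        external_location = '{output_s3}',
        partitioned_by = ARRAY['event_date', 'event_hour']
    ) AS
    SELECT *,
           date(from_iso8601_timestamp(event_time)) AS event_date,
           hour(from_iso8601_timestamp(event_time)) AS event_hour
    FROM {source}
    """

# With AWS credentials configured, the query would be submitted like:
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=silver_ctas_sql("ecommerce_silver", "orders",
#                                 "ecommerce_bronze.orders",
#                                 "s3://ecommerce-event-silver/orders/"),
#     ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
# )
```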

🥇 Gold (Business Aggregates)

Examples:

  • Orders Daily Summary
  • Payments Daily Summary
  • User Activity Summary

  • ecommerce_gold.orders_daily_summary
  • ecommerce_gold.payment_daily_summary
  • ecommerce_gold.user_daily_activity_summary
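
The kind of rollup behind a table like `orders_daily_summary` can be illustrated in plain Python (in the project this aggregation is done in Athena SQL; the field names below are assumptions):

```python
from collections import defaultdict

def orders_daily_summary(orders):
    """Roll Silver-style order rows up to one summary row per day."""
    summary = defaultdict(lambda: {"order_count": 0, "total_revenue": 0.0})
    for order in orders:
        day = order["event_time"][:10]  # ISO-8601 timestamp -> YYYY-MM-DD
        summary[day]["order_count"] += 1
        summary[day]["total_revenue"] += order["amount"]
    return dict(summary)

orders = [
    {"event_time": "2024-05-01T10:00:00Z", "amount": 20.0},
    {"event_time": "2024-05-01T15:30:00Z", "amount": 35.5},
    {"event_time": "2024-05-02T09:00:00Z", "amount": 12.0},
]
print(orders_daily_summary(orders))
# → {'2024-05-01': {'order_count': 2, 'total_revenue': 55.5}, '2024-05-02': {'order_count': 1, 'total_revenue': 12.0}}
```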


🛠 Airflow DAG

Pipeline Steps

  1. Run Bronze Glue Crawler
  2. Create Silver table (CTAS)
  3. Run Silver Glue Crawler
  4. Create Gold aggregate tables
  5. Retry-safe & cost-aware execution
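
Outside Airflow, the same dependency ordering can be sketched with plain boto3-style calls. The crawler names and query strings are illustrative assumptions; the Glue and Athena clients are passed in as parameters so the flow can be exercised with stubs instead of a live AWS account.

```python
import time

def wait_for_crawler(glue, name: str, poll_seconds: float = 30.0) -> None:
    """Block until a Glue crawler returns to the READY state."""
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

def run_pipeline(glue, athena) -> list:
    """Run the Bronze -> Silver -> Gold steps in dependency order."""
    steps = []
    glue.start_crawler(Name="bronze-crawler")      # 1. catalog raw JSON
    wait_for_crawler(glue, "bronze-crawler")
    steps.append("bronze_crawled")
    athena.start_query_execution(QueryString="-- Silver CTAS here")   # 2.
    steps.append("silver_created")
    glue.start_crawler(Name="silver-crawler")      # 3. catalog Parquet
    wait_for_crawler(glue, "silver-crawler")
    steps.append("silver_crawled")
    athena.start_query_execution(QueryString="-- Gold aggregates here")  # 4.
    steps.append("gold_created")
    return steps
```

In the real DAG, Airflow's retry and scheduling machinery replaces this hand-rolled loop; the sketch only shows the ordering constraint between crawlers and queries.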

Key Features

  • Deferrable Glue operators
  • No duplicate crawler execution
  • Cost-optimized retries
  • Fully Dockerized

📊 Dashboards (Metabase)

Built dashboards include:

  • Orders per day
  • Revenue trends
  • Payment success/failure ratio
  • User activity metrics

Metabase connects directly to Athena, querying data stored in S3 (Gold layer).


πŸ” Secrets & Configuration

Secrets are never committed.

Used locally via:

  • .env file
  • Docker environment variables

Protected using:

  • .gitignore

🚀 How to Run the Project (Step-by-Step)

Follow the steps in order. Each layer is started independently for clarity and control.


πŸ” Prerequisites

Before starting any container, you must provide AWS credentials.

Create a .env file (this file is NOT committed to Git):

AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_DEFAULT_REGION=eu-north-1

These credentials are required for:

  • Writing to S3
  • Running Glue Crawlers
  • Executing Athena queries
  • Connecting Metabase to Athena

1️⃣ Start Kafka, Zookeeper, Producer & Spring Consumer

Go to the main project directory:

cd datalake

Start the core ingestion services:

docker compose up -d

This will start:

  • Zookeeper
  • Kafka
  • Python Kafka Producer
  • Spring Boot Kafka Consumer


Verify logs:

docker logs -f python-producer
docker logs -f spring-consumer

Verify Bronze Layer

Check your S3 bucket:

s3://ecommerce-event-bronze/

You should see topic-wise folders with JSON events.
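
A quick way to check that layout from Python (the bucket name comes from above; the boto3 listing is commented so the grouping helper runs standalone, and the sample keys are illustrative):

```python
def group_keys_by_topic(keys):
    """Group S3 object keys by their top-level (topic) folder."""
    topics = {}
    for key in keys:
        topic = key.split("/", 1)[0]
        topics.setdefault(topic, []).append(key)
    return topics

# With AWS credentials configured, the keys would come from:
# import boto3
# s3 = boto3.client("s3")
# resp = s3.list_objects_v2(Bucket="ecommerce-event-bronze")
# keys = [obj["Key"] for obj in resp.get("Contents", [])]

keys = ["orders/evt-1.json", "orders/evt-2.json", "payments/evt-9.json"]
print(group_keys_by_topic(keys))
# → {'orders': ['orders/evt-1.json', 'orders/evt-2.json'], 'payments': ['payments/evt-9.json']}
```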

2️⃣ Start Apache Airflow (Orchestration)

Go to the Airflow directory:

cd airflow

Start Airflow services:

docker compose -f docker-compose.airflow.yml up -d

Create Airflow Admin User:

docker compose -f docker-compose.airflow.yml run --rm airflow-webserver airflow users create \
  --username admin \
  --password admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com

Open Airflow UI: http://localhost:8080/home

Login with:

Username: admin
Password: admin


3️⃣ Run the Airflow DAG

-> Open Airflow UI

DAG Name: ecommerce_bronze_silver_gold_pipeline

-> Trigger the DAG manually

What the DAG does:

  1. Runs the Bronze Glue Crawler
  2. Creates the Silver table (Athena CTAS)
  3. Runs the Silver Glue Crawler
  4. Creates the Gold aggregate tables


Verify in AWS:

Glue Catalog → Databases & Tables

Athena → ecommerce_silver and ecommerce_gold

4️⃣ Start Metabase (BI Dashboard)

Go to the Metabase directory:

cd metabase

Start Metabase:

docker compose -f docker-compose.metabase.yml up -d

Open Metabase UI: http://localhost:3000

5️⃣ Connect Metabase to Amazon Athena

During Metabase setup:

  • Database: Amazon Athena
  • Region: eu-north-1
  • S3 Staging Directory:

Workgroup: primary

Authentication: Use environment variables (already provided via .env)

After connecting, select the ecommerce_gold tables.

6️⃣ View Dashboards

Open: http://localhost:3000/dashboard


✅ Final Result

✔ Real-time ingestion via Kafka
✔ Bronze → Silver → Gold data lake
✔ Orchestrated using Airflow
✔ Serverless analytics with Athena
✔ BI dashboards using Metabase

🧠 Resume Value

This project demonstrates:

  • Real-world data lake architecture
  • Production-style orchestration
  • Cost-aware AWS usage
  • End-to-end ownership (ingest → BI)

💡 Why This Architecture?

  • Scalable – Event-driven & serverless analytics
  • Cost-efficient – No always-on clusters
  • Production-aligned – Industry-standard data lake pattern
  • Resume-ready – Real-world tools & design decisions

👀 Author

Nithinraaj
Software Engineer | Backend & Data Engineering
Focused on scalable, real-world system design


⭐ If this project helped you understand modern data platforms, give it a star!
