An end-to-end event-driven data lake and analytics system built using Kafka, Spring Boot, AWS S3, Glue, Athena, Airflow, and Metabase.
This project demonstrates a production-style Bronze → Silver → Gold architecture with orchestration, schema evolution, and BI dashboards.
Watch this video to learn more about the project and how to get started:
Youtube Link: https://www.youtube.com/watch?v=Gyit3aMqqow
Flow: Producers → Kafka → Spring Consumer → S3 (Bronze) → AWS Glue Crawler → Athena CTAS (Silver) → Aggregations (Gold) → BI Dashboard (Metabase)
Design Principles
- Event-driven ingestion
- Immutable raw data (Bronze)
- Curated analytics-ready data (Silver)
- Business aggregates (Gold)
- Fully containerized local setup
- Cloud-native analytics (serverless)
Tech Stack
- Apache Kafka – Event streaming platform
- Spring Boot (Java 17) – Kafka consumer, S3 writer
- Python – Kafka event producer (mock data)
- Amazon S3 – Bronze, Silver, Gold layers
- AWS Glue Crawler – Schema discovery & cataloging
- Amazon Athena – SQL-based analytics (CTAS)
- Apache Airflow (Dockerized) – orchestrates:
  - Glue Crawlers
  - Athena CTAS queries
  - Dependency management & retries
- Metabase – Dashboarding & visualization
  - Athena JDBC – Direct analytics on S3 data
- Docker & Docker Compose
- AWS IAM (least privilege)
- GitHub-ready project structure
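The Python producer emits mock JSON events into Kafka. A minimal sketch of the idea, assuming a topic named `orders`, illustrative field names, and the `kafka-python` client (the repo's actual producer and schema may differ):

```python
import json
import random
import uuid
from datetime import datetime, timezone

def make_order_event() -> dict:
    """Build one mock order event (field names are illustrative, not the project's exact schema)."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": "order_created",
        "user_id": random.randint(1, 1000),
        "amount": round(random.uniform(5.0, 500.0), 2),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Requires `pip install kafka-python`; the broker address assumes the local compose setup.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("orders", make_order_event())
    producer.flush()
```

Keeping one event type per topic means each topic's raw JSON lands under its own Bronze folder, which keeps the S3 layout clean.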
Bronze Layer
- One folder per Kafka topic
- JSON events exactly as produced
- Immutable, append-only

Silver Layer
- Normalized schema
- Partitioned by date and hour
- Stored as Parquet
- Created using Athena CTAS
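A Silver CTAS of that shape can be kept as a string in code. The sketch below uses illustrative column names and bucket paths, not the repo's actual ones; note that Athena requires partition columns to be the last columns in the SELECT list:

```python
def silver_orders_ctas(bronze_db: str = "ecommerce_bronze",
                       silver_db: str = "ecommerce_silver") -> str:
    """Athena CTAS that normalizes Bronze JSON into partitioned Parquet.
    Table, column, and bucket names are illustrative assumptions."""
    return f"""
CREATE TABLE {silver_db}.orders
WITH (
  format = 'PARQUET',
  external_location = 's3://ecommerce-event-silver/orders/',
  partitioned_by = ARRAY['event_date', 'event_hour']
) AS
SELECT
  order_id,
  user_id,
  CAST(amount AS DECIMAL(10, 2)) AS amount,
  from_iso8601_timestamp(event_time) AS event_ts,
  -- partition columns must come last in the SELECT list
  date_format(from_iso8601_timestamp(event_time), '%Y-%m-%d') AS event_date,
  date_format(from_iso8601_timestamp(event_time), '%H') AS event_hour
FROM {bronze_db}.orders
"""
```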
Gold Layer
Examples:
- Orders Daily Summary (ecommerce_gold.orders_daily_summary)
- Payments Daily Summary (ecommerce_gold.payment_daily_summary)
- User Activity Summary (ecommerce_gold.user_daily_activity_summary)
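Gold tables are plain aggregations over Silver. As a hedged sketch of what the daily orders rollup could look like (column names and the S3 path are assumptions, not the repo's actual definitions):

```python
def gold_orders_daily_summary_sql(silver_db: str = "ecommerce_silver",
                                  gold_db: str = "ecommerce_gold") -> str:
    """CTAS for a daily orders rollup; column names and the bucket are illustrative."""
    return f"""
CREATE TABLE {gold_db}.orders_daily_summary
WITH (
  format = 'PARQUET',
  external_location = 's3://ecommerce-event-gold/orders_daily_summary/'
) AS
SELECT
  event_date,
  COUNT(*)    AS total_orders,
  SUM(amount) AS total_revenue
FROM {silver_db}.orders
GROUP BY event_date
"""
```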
Pipeline Steps
- Run Bronze Glue Crawler
- Create Silver table (CTAS)
- Run Silver Glue Crawler
- Create Gold aggregate tables
- Retry-safe & cost-aware execution
Key Features
- Deferrable Glue operators
- No duplicate crawler execution
- Cost-optimized retries
- Fully Dockerized
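The steps above map naturally onto Airflow's AWS provider operators. This is a configuration sketch, not the repo's actual DAG file: the crawler names, Athena output bucket, and inlined queries are placeholders, and `deferrable=True` requires a recent `apache-airflow-providers-amazon` release.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator

with DAG(
    dag_id="ecommerce_bronze_silver_gold_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually from the UI
    catchup=False,
) as dag:
    bronze_crawler = GlueCrawlerOperator(
        task_id="run_bronze_crawler",
        config={"Name": "ecommerce-bronze-crawler"},  # placeholder crawler name
        deferrable=True,  # frees the worker slot while the crawler runs
    )
    create_silver = AthenaOperator(
        task_id="create_silver_table",
        query="-- CTAS statement building the Silver table goes here",
        database="ecommerce_silver",
        output_location="s3://ecommerce-athena-results/",  # placeholder bucket
    )
    silver_crawler = GlueCrawlerOperator(
        task_id="run_silver_crawler",
        config={"Name": "ecommerce-silver-crawler"},  # placeholder crawler name
        deferrable=True,
    )
    create_gold = AthenaOperator(
        task_id="create_gold_tables",
        query="-- Gold aggregate CTAS statements go here",
        database="ecommerce_gold",
        output_location="s3://ecommerce-athena-results/",
    )

    bronze_crawler >> create_silver >> silver_crawler >> create_gold
```

The linear dependency chain is what prevents duplicate crawler executions: each crawler runs exactly once per DAG run, and Athena steps only fire after the catalog is up to date.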
Dashboards built include:
- Orders per day
- Revenue trends
- Payment success/failure ratio
- User activity metrics
Metabase connects directly to Athena, querying data stored in S3 (Gold layer).
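Metabase questions against Athena are ordinary SQL over the Gold tables. As an illustrative example (the summary table's column names are assumptions), a payment success-ratio question might look like:

```python
def payment_success_ratio_sql(gold_db: str = "ecommerce_gold") -> str:
    """Daily payment success ratio; column names are illustrative assumptions."""
    return f"""
SELECT
  event_date,
  successful_payments * 1.0 / NULLIF(total_payments, 0) AS success_ratio
FROM {gold_db}.payment_daily_summary
ORDER BY event_date
"""
```

Because the Gold tables are pre-aggregated, dashboard queries like this scan very little data, which keeps Athena costs low.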
Secrets are never committed:
- .env file – Docker environment variables
- .gitignore – keeps the .env file out of version control
Follow the steps in order. Each layer is started independently for clarity and control.
Before starting any container, you must provide AWS credentials.
Create a .env file (this file is NOT committed to Git):
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_DEFAULT_REGION=eu-north-1

These credentials are required for:
- Writing to S3
- Running Glue Crawlers
- Executing Athena queries
- Connecting Metabase to Athena
1️⃣ Start Kafka, Zookeeper, Producer & Spring Consumer

Go inside the main project directory:

cd datalake
Start core ingestion services:
docker compose up -d
This will start:
Zookeeper, Kafka, Python Kafka Producer, Spring Boot Kafka Consumer
Verify logs:
docker logs -f python-producer
docker logs -f spring-consumer
Verify Bronze Layer
Check your S3 bucket:
s3://ecommerce-event-bronze/
You should see topic-wise folders with JSON events.
2️⃣ Start Apache Airflow (Orchestration)

Go inside the Airflow directory:

cd airflow
Start Airflow services:
docker compose -f docker-compose.airflow.yml up -d
Create Airflow Admin User:
docker compose -f docker-compose.airflow.yml run --rm airflow-webserver airflow users create \
--username admin \
--password admin \
--firstname Admin \
--lastname User \
--role Admin \
--email admin@example.com
Open Airflow UI: http://localhost:8080/home
Login with:
Username: admin Password: admin
3️⃣ Run the Airflow DAG
- Open the Airflow UI
- DAG name: ecommerce_bronze_silver_gold_pipeline
- Trigger the DAG manually
The DAG:
1. Runs the Bronze Glue Crawler
2. Creates the Silver table (Athena CTAS)
3. Runs the Silver Glue Crawler
4. Creates the Gold aggregate tables
Verify in AWS:
Glue Catalog → Databases & Tables
Athena → ecommerce_silver and ecommerce_gold
4️⃣ Start Metabase (BI Dashboard)

Go inside the Metabase directory:

cd metabase
Start Metabase:
docker compose -f docker-compose.metabase.yml up -d
Open Metabase UI: http://localhost:3000
5️⃣ Connect Metabase to Amazon Athena
During Metabase setup:
- Database: Amazon Athena
- Region: eu-north-1
- S3 Staging Directory:
- Workgroup: primary
- Authentication: use environment variables (already provided via .env)
After connecting, select the ecommerce_gold tables.
6️⃣ View Dashboards

Open: http://localhost:3000/dashboard
✅ Final Result
- Real-time ingestion via Kafka
- Bronze → Silver → Gold data lake
- Orchestrated using Airflow
- Serverless analytics with Athena
- BI dashboards using Metabase
🧠 Resume Value
This project demonstrates:
- Real-world data lake architecture
- Production-style orchestration
- Cost-aware AWS usage
- End-to-end ownership (ingest → BI)
- Scalable – event-driven & serverless analytics
- Cost-efficient – no always-on clusters
- Production-aligned – industry-standard data lake pattern
- Resume-ready – real-world tools & design decisions
Nithinraaj
Software Engineer | Backend & Data Engineering
Focused on scalable, real-world system design
⭐ If this project helped you understand modern data platforms, give it a star!