This project demonstrates an end-to-end Data Engineering pipeline applied to the Agriculture sector.
The goal was to transform raw, unstructured data (CSV) into a high-performance Star Schema Data Warehouse using Python for ingestion and SQL for data modeling and analysis. The project focuses on analyzing global crop yield trends and Year-Over-Year (YoY) growth using Window Functions.
- Ingestion (ETL): Python (
Pandas) to clean and load raw CSV data into a staging area. - Database: SQLite (embedded relational database).
- Data Modeling: Transformation of flat tables into a Star Schema (Fact & Dimensions) using SQL DDL.
- Analysis: Complex queries using Window Functions (
LAG,OVER), CTEs, and Views for anomaly detection.
To ensure data integrity and query performance, the database was normalized from a single flat table into a Relational Schema:
- Fact Table:
fact_crop_yields(Contains metrics: Yield, Rainfall, Temperature, Pesticides). - Dimension Tables:
dim_crops,dim_countries.
The core of this project is the notebooks/sql_data_modeling.ipynb file, which covers:
- Normalization: Converting raw text columns into normalized Dimension tables.
- Window Functions: Using
LAG()to calculate Year-over-Year Growth %:LAG(current_yield) OVER (PARTITION BY crop_name ORDER BY year) - Data Quality Views: Creating a persistent SQL View (
view_data_anomalies) to automatically flag data outliers (e.g., Temperature > 40°C).
-
Clone the repository:
git clone [https://github.com/PedruMota/Advanced-SQL-Agro-Pipeline.git](https://github.com/PedruMota/sql-crop-yield-analytics.git) cd SQL-Agro-Pipeline -
Set up the environment:
python -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate pip install -r requirements.txt
-
Run the ETL Pipeline:
python src/01_ingest_data.py
-
Explore the Analysis: Open
notebooks/sql_data_modeling.ipynbin VS Code or Jupyter Lab to see the step-by-step SQL transformation.
| Year | Crop | Yield (hg/ha) | Growth % |
|---|---|---|---|
| 1993 | Maize | 28,997 | 📈 +3.70% |
| 1994 | Maize | 28,344 | 📉 -2.25% |
| 1996 | Maize | 30,763 | 📈 +5.35% |
Author: Pedro Mota
Dataset Source: Kaggle Crop Yield Prediction ...