End-to-end Credit Scoring System for Bati Bank's buy-now-pay-later service. Transforms behavioral transaction data into credit risk predictions using RFM analytics and machine learning.
Highlights: RFM Clustering Proxy Target • 23+ Engineered Features • 4 ML Models (LogReg, DecisionTree, RandomForest, GradientBoosting) • MLflow Experiment Tracking • FastAPI REST API • Docker Deployment • 98 Unit Tests (85% Coverage) • CI/CD Pipeline • Basel II Compliant
# Clone and install
git clone https://github.com/estif0/credit-risk-model.git
cd credit-risk-model
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Place data.csv in data/raw/ (from Kaggle Xente Challenge)
# Run pipeline
python src/run_feature_engineering.py
python src/run_rfm_pipeline.py
python src/train.py
# Start API
uvicorn src.api.main:app --reload --port 8000
# Visit http://localhost:8000/docs

# Or run the full stack with Docker
docker-compose up --build
# API: http://localhost:8000
# MLflow: http://localhost:5000
# Jupyter: http://localhost:8888

Project Structure:

credit-risk-model/
├── .github/workflows/ci.yml    # CI/CD pipeline
├── data/
│   ├── raw/                    # Original data (git-ignored)
│   └── processed/              # Engineered features
├── src/
│   ├── data_processing.py      # Feature engineering
│   ├── rfm_analysis.py         # RFM clustering & proxy target
│   ├── train.py                # Model training with MLflow
│   ├── utils.py                # Utility functions
│   └── api/
│       ├── main.py             # FastAPI application
│       └── pydantic_models.py  # Request/response schemas
├── tests/                      # 98 unit tests (85% coverage)
├── notebooks/eda.ipynb         # Exploratory analysis
├── mlruns/                     # MLflow tracking (git-ignored)
├── requirements.txt
├── Dockerfile
└── docker-compose.yml
# Health check
curl http://localhost:8000/health

# Single prediction
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{
"CustomerId": "CUST_001",
"Amount": 5000,
"transaction_hour": 14,
"Recency": 5,
"Frequency": 30,
"Monetary": 150000,
"total_transaction_value": 150000,
"avg_transaction_value": 5000,
"transaction_count": 30
}'

Response:
{
"customer_id": "CUST_001",
"risk_probability": 0.23,
"risk_category": "low",
"credit_score": 720,
"recommendation": "APPROVE"
}

# Batch prediction
curl -X POST "http://localhost:8000/predict/batch" \
-H "Content-Type: application/json" \
-d '{"transactions": [...]}'

All Endpoints:
- GET / - Welcome message
- GET /health - Health check with model status
- GET /model/info - Model metadata and metrics
- POST /model/reload - Reload model from MLflow
- POST /predict - Single customer prediction
- POST /predict/batch - Batch predictions
Interactive Docs: http://localhost:8000/docs
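Request and response bodies are validated with Pydantic. A minimal sketch of what the schemas in src/api/pydantic_models.py might look like, with field names taken from the examples above (the actual file may define additional fields and validators):

```python
from pydantic import BaseModel


class PredictionRequest(BaseModel):
    """Single-customer scoring request (fields mirror the sample payload above)."""
    CustomerId: str
    Amount: float
    transaction_hour: int
    Recency: int
    Frequency: int
    Monetary: float
    total_transaction_value: float
    avg_transaction_value: float
    transaction_count: int


class PredictionResponse(BaseModel):
    """Scoring response returned by POST /predict."""
    customer_id: str
    risk_probability: float
    risk_category: str
    credit_score: int
    recommendation: str
```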
# Run all tests
pytest tests/ -v
# With coverage
pytest tests/ --cov=src --cov-report=html
# Specific test
pytest tests/test_api.py -v

Test Coverage:
| Module | Tests | Coverage | Status |
|---|---|---|---|
| data_processing.py | 22 | 92% | ✅ |
| rfm_analysis.py | 11 | 88% | ✅ |
| train.py | 15 | 83% | ✅ |
| api/main.py | 20 | 90% | ✅ |
| utils.py | 30 | 95% | ✅ |
| Total | 98 | 85% | ✅ |
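For orientation, a representative (hypothetical) API test in the style of the suite, using FastAPI's TestClient; the actual tests in tests/test_api.py may assert more:

```python
from fastapi.testclient import TestClient

from src.api.main import app  # module path per the project structure above

client = TestClient(app)


def test_health_endpoint_is_up():
    # GET /health is listed among the endpoints above; the exact payload may vary
    response = client.get("/health")
    assert response.status_code == 200
```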
Services:
- api (8000) - FastAPI application
- mlflow (5000) - MLflow tracking UI
- notebook (8888) - Jupyter notebook
- data-processor - Feature engineering
- rfm-analyzer - RFM clustering
Commands:
docker-compose up --build # Start all services
docker-compose up api # Start API only
docker-compose logs -f api # View logs
docker-compose down # Stop services

GitHub Actions workflow (.github/workflows/ci.yml) runs on push/PR:
- Code Quality - flake8, black, pylint
- Testing - pytest with 80% coverage threshold
- Build - Docker image verification
Local CI Check:
black --check src tests --line-length 100
flake8 src tests --max-line-length=100
pytest tests/ --cov=src --cov-fail-under=80

Model Performance:

| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.84 | 0.82 | 0.80 | 0.81 | 0.89 |
| Decision Tree | 0.81 | 0.79 | 0.83 | 0.81 | 0.85 |
| Random Forest | 0.87 | 0.85 | 0.83 | 0.84 | 0.91 |
| Gradient Boosting | 0.86 | 0.84 | 0.84 | 0.84 | 0.90 |
Top Features: Monetary (18.5%), Frequency (15.2%), Recency (12.8%), total_transaction_value (10.3%)
Model Selection:
- Random Forest - Best balance (production default)
- Logistic Regression - Highest interpretability (Basel II compliant)
- Gradient Boosting - Best performance (requires SHAP explainability)
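All four models are logged to MLflow during training, so any of them can be pulled back for scoring. A minimal sketch, assuming the Random Forest was registered under the hypothetical name credit-risk-model; adjust the URI to the actual registry entry:

```python
import mlflow.pyfunc
import pandas as pd

# Registry name and version are assumptions; check the MLflow UI (port 5000)
model = mlflow.pyfunc.load_model("models:/credit-risk-model/1")

# Score one customer with the same engineered features the API expects
features = pd.DataFrame([{
    "Amount": 5000, "transaction_hour": 14,
    "Recency": 5, "Frequency": 30, "Monetary": 150000,
    "total_transaction_value": 150000, "avg_transaction_value": 5000,
    "transaction_count": 30,
}])
print(model.predict(features))  # risk prediction for a CUST_001-style input
```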
The Basel II Capital Accord establishes international standards for banking supervision, with a strong emphasis on quantitative risk measurement and model validation. For our credit risk model, Basel II's requirements significantly influence our design decisions:
Key Basel II Implications:
- Model Transparency: Basel II mandates that financial institutions understand and document their risk measurement approaches. This requirement drives us toward interpretable models where we can explain:
  - Which features contribute to risk assessments
  - How different customer behaviors impact credit decisions
  - The rationale behind risk probability estimates
- Validation Requirements: Our model must be rigorously tested and validated, with clear documentation of:
  - Model assumptions and limitations
  - Performance metrics across different customer segments
  - Back-testing procedures to verify predictive accuracy
  - Ongoing monitoring processes
- Risk Quantification: Basel II requires precise risk probability estimates to calculate regulatory capital requirements. Our model must provide:
  - Probability of Default (PD) estimates
  - Confidence intervals for predictions
  - Calibration across the entire risk spectrum
Impact on Our Approach: These requirements favor models with clear decision boundaries and traceable feature contributions, making documentation and explainability as important as predictive performance.
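Calibration in particular can be checked directly with scikit-learn. A minimal sketch, using synthetic stand-ins for the held-out labels and predicted PDs:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Stand-ins: in practice these come from the validation split
rng = np.random.default_rng(42)
y_prob = rng.uniform(0, 1, 1000)                         # model-predicted PDs
y_true = (rng.uniform(0, 1, 1000) < y_prob).astype(int)  # observed outcomes

# Bin predictions and compare against observed default rates;
# a well-calibrated model tracks the diagonal (predicted ≈ observed)
observed, predicted = calibration_curve(y_true, y_prob, n_bins=10)
for p, o in zip(predicted, observed):
    print(f"predicted PD {p:.2f} -> observed rate {o:.2f}")
```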
In traditional credit scoring, models are trained on historical default data—actual records of customers who failed to repay loans. However, for our buy-now-pay-later service, we face a cold start problem:
- No Historical Default Data: As a new service, we have no direct records of loan defaults
- Only Transactional Data: We possess eCommerce behavioral data (purchases, amounts, timing, channels)
- Regulatory Timeline: We cannot wait years to accumulate default history before launching the service
Our Solution: RFM-Based Proxy Variable
We create a proxy for credit risk by analyzing customer engagement patterns through RFM (Recency, Frequency, Monetary) analysis:
- Recency: Days since last transaction (disengaged customers may be higher risk)
- Frequency: Number of transactions (active customers show commitment)
- Monetary: Total transaction value (financial capacity indicator)
Using K-Means clustering, we segment customers into risk categories, assuming that least engaged customers (high recency, low frequency/monetary) represent higher credit risk.
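A minimal sketch of this construction, with column names following the Xente transaction schema (the actual logic lives in src/rfm_analysis.py and may differ):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def make_proxy_target(df: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Derive an is_high_risk proxy label from RFM engagement patterns."""
    rfm = df.groupby("CustomerId").agg(
        Recency=("TransactionStartTime", lambda t: (snapshot - t.max()).days),
        Frequency=("TransactionId", "count"),
        Monetary=("Amount", "sum"),
    )
    # Standardize so no single RFM dimension dominates the distance metric
    scaled = StandardScaler().fit_transform(rfm)
    rfm["cluster"] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(scaled)

    # Label the least engaged cluster (highest recency, lowest
    # frequency/monetary) as high risk
    profile = rfm.groupby("cluster")[["Recency", "Frequency", "Monetary"]].mean()
    rank_sum = (profile["Recency"].rank(ascending=False)
                + profile["Frequency"].rank()
                + profile["Monetary"].rank())
    rfm["is_high_risk"] = (rfm["cluster"] == rank_sum.idxmin()).astype(int)
    return rfm
```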
While necessary, this proxy approach introduces significant risks:
- Correlation vs Causation:
  - Low engagement doesn't necessarily mean inability to repay
  - A customer might have low transaction frequency but high creditworthiness
  - We're assuming engagement patterns predict financial reliability, which is an untested hypothesis
- Potential for Discrimination:
  - May unfairly penalize new customers or those with changing purchase patterns
  - Could exclude creditworthy individuals based on shopping behavior rather than financial capacity
  - Risk of systematic bias against certain customer segments
- Model Drift Over Time:
  - As we accumulate actual default data, the relationship between engagement and default may prove weaker than assumed
  - The model will require retraining with ground truth labels
  - Initial predictions may have limited real-world validity
- Regulatory Scrutiny:
  - Basel II requires models to be based on sound statistical evidence
  - Our proxy approach may face challenges in regulatory validation
  - We must clearly disclose the experimental nature of this methodology
- Business Impact:
  - False positives (rejecting good customers) result in lost revenue
  - False negatives (approving bad customers) lead to defaults and losses
  - Without ground truth, we cannot properly calibrate this trade-off
Mitigation Strategy: We must implement rigorous monitoring, collect actual default data from day one, and plan for model retraining once sufficient ground truth is available (typically 12-24 months).
Choosing between simple, interpretable models and complex, high-performance models involves critical trade-offs, especially in a regulated financial context.
Simple Models: Logistic Regression with WoE

Advantages:
- High Interpretability: Each coefficient clearly shows feature impact on risk
- Weight of Evidence (WoE) transformation (see the sketch below) provides:
  - Monotonic relationships between features and target
  - Clear binning strategies that business users can understand
  - Information Value (IV) metrics for feature selection
- Regulatory Acceptance: Well-understood methodology with decades of use in banking
- Fast Inference: Minimal computational requirements for real-time scoring
- Easy Debugging: Simple to diagnose and fix when issues arise
- Stakeholder Communication: Non-technical executives can understand model decisions
Disadvantages:
- Linear Assumptions: Cannot capture complex non-linear relationships or feature interactions
- Lower Predictive Power: May miss subtle patterns that indicate risk
- Feature Engineering Intensive: Requires manual creation of interaction terms and polynomial features
- Limited with Complex Data: Struggles with high-dimensional feature spaces
Basel II Fit: Excellent—interpretability and transparency align perfectly with regulatory requirements.
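The WoE/IV computation referenced above, as a minimal sketch; the quantile binning and smoothing constant are illustrative choices:

```python
import numpy as np
import pandas as pd


def woe_iv(feature: pd.Series, target: pd.Series, n_bins: int = 5):
    """Weight of Evidence per bin and the feature's Information Value."""
    bins = pd.qcut(feature, q=n_bins, duplicates="drop")
    grouped = target.groupby(bins, observed=True)
    bad = grouped.sum()                 # defaults (target == 1) per bin
    good = grouped.count() - bad        # non-defaults per bin

    # Distribution of goods/bads across bins (0.5 avoids division by zero)
    pct_good = (good + 0.5) / good.sum()
    pct_bad = (bad + 0.5) / bad.sum()

    woe = np.log(pct_good / pct_bad)                 # evidence per bin
    iv = float(((pct_good - pct_bad) * woe).sum())   # feature-level strength
    return woe, iv
```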
Complex Models: Gradient Boosting

Advantages:
- Superior Predictive Performance: Captures non-linear relationships and complex feature interactions
- Automatic Feature Engineering: Learns optimal feature combinations without manual specification
- Handles Missing Data: Built-in mechanisms for missing value treatment
- Robust to Outliers: Less sensitive to extreme values than linear models
- Feature Importance: Provides SHAP values and gain metrics for interpretation
- Ensemble Learning: Reduces overfitting through multiple weak learners
Disadvantages:
- Black Box Nature: Difficult to explain individual predictions to regulators or customers
- Regulatory Challenges: May face scrutiny under Basel II's transparency requirements
- Overfitting Risk: Can memorize training data, especially with limited samples
- Computational Complexity: Requires more resources for training and inference
- Hyperparameter Sensitivity: Performance heavily depends on tuning
- Harder to Maintain: More difficult to debug and update as business rules change
Basel II Fit: Challenging—while powerful, the lack of transparency may not satisfy regulatory validation requirements without extensive documentation and post-hoc explanation techniques (SHAP, LIME).
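A minimal sketch of that post-hoc explanation step with the SHAP library, assuming a fitted gradient boosting model `model` and a feature DataFrame `X`:

```python
import shap

# TreeExplainer computes exact SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive risk predictions across the portfolio
shap.summary_plot(shap_values, X)

# Local view: why a single customer received their risk probability
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])
```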
Given our context (proxy-based target, regulatory requirements, new service), we recommend:
- Start with Logistic Regression + WoE:
  - Establish an interpretable baseline
  - Build trust with regulators and stakeholders
  - Understand which features truly matter
  - Document clear decision rules
- Develop Gradient Boosting in Parallel:
  - Compare performance metrics
  - Use SHAP values for model explanation
  - Validate that complex model findings align with business intuition
- Implement Model Governance:
  - If deploying Gradient Boosting, create extensive documentation
  - Implement human-in-the-loop review for borderline cases
  - Monitor model decisions for bias and drift
  - Maintain Logistic Regression as a fallback and validation tool
- Plan for Evolution:
  - As we collect actual default data, reassess model choice
  - Consider ensemble methods that combine interpretability and performance
  - Invest in explainable AI tools (SHAP, LIME) to bridge the gap
Conclusion: In a regulated environment with a proxy target variable, we must prioritize interpretability and validation over marginal performance gains. Starting with simple models allows us to build a solid foundation, while complex models can be introduced later with proper governance frameworks once we have ground truth data to validate their predictions.
The RFM pipeline (src/run_rfm_pipeline.py) creates the proxy target variable from customer engagement:
- Recency: Days since last transaction
- Frequency: Number of transactions
- Monetary: Total transaction value
K-Means clustering (k=3) segments customers. High-risk cluster = high recency + low frequency/monetary.
Basel II alignment:
- ✅ Interpretable models with clear feature contributions
- ✅ 98 unit tests + back-testing validation
- ✅ Probability of Default (PD) estimates + calibrated credit scores
- ✅ Complete documentation of methodology and limitations
Mitigation: Collect actual default data from day 1, monitor false positive/negative rates, plan retraining with ground truth (12-24 months).
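A minimal sketch of that monitoring check, assuming observed outcomes y_true and binary model decisions y_pred accumulate as the loan book seasons:

```python
from sklearn.metrics import confusion_matrix


def fp_fn_rates(y_true, y_pred):
    """False positive rate (creditworthy customers rejected) and
    false negative rate (risky customers approved)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp / (fp + tn), fn / (fn + tp)


# Toy labels: 1 = high risk (reject), 0 = low risk (approve)
fpr, fnr = fp_fn_rates([0, 0, 1, 1, 0, 1], [0, 1, 1, 0, 0, 1])
print(f"FPR={fpr:.2f} (lost revenue), FNR={fnr:.2f} (default losses)")
```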
Current Limitations:
- Proxy target not validated against real defaults
- Limited to transaction data (no demographics/credit bureau)
- Binary classification only (no loan amount/duration optimization)
Roadmap:
- Integrate actual default data for retraining
- SHAP values for explainability
- Multi-class risk categorization
- External data sources (credit bureau, demographics)
- Loan amount/duration recommendation models
# Fork repo and create feature branch
git checkout -b feature/amazing-feature
# Make changes and test
black src tests --line-length 100
pytest tests/ --cov=src --cov-fail-under=80
# Commit and push
git commit -m "feat: add amazing feature"
git push origin feature/amazing-feature

Follow Conventional Commits: feat:, fix:, docs:, test:, refactor:
MIT License - see LICENSE file.
Estifanose Sahilu
📧 estifanoswork@gmail.com
🐙 @estif0
💼 LinkedIn
- Basel II Capital Accord - BIS Framework
- Credit Scoring - Siddiqi, N. (2006). "Credit Risk Scorecards"
- ML for Credit Risk - Lessmann, S., et al. (2015). "Benchmarking classification algorithms"
- RFM Analysis - Hughes, A. M. (1994). "Strategic Database Marketing"
- Model Interpretability - SHAP Documentation
⭐ Star this repo if you find it helpful!
Made with ❤️ by Estifanose Sahilu