End-to-end MLOps pipeline for predicting aqueous solubility of organic molecules using machine learning.
This project addresses aqueous solubility prediction by estimating the solubility (log mol/L) of organic molecules directly from their SMILES representations, using an XGBoost regression model trained on molecular fingerprints and physicochemical descriptors. By combining robust machine learning with chemically meaningful features, the system enables early solubility assessment, which is essential for drug design, pharmaceutical formulation, coatings, and battery electrolyte development. Beyond model training, the project delivers a complete production-ready solution: automated training and inference pipelines, a FastAPI service for real-time predictions, monitoring, and deployment configurations, allowing solubility prediction to be integrated seamlessly into R&D and decision-making workflows.
| Metric | Train | Test |
|---|---|---|
| R² | 0.86 | 0.86 |
| RMSE | 0.87 | 0.86 |
| MAE | 0.63 | 0.62 |
XGBoost Regressor (Gradient Boosting Decision Trees)
Note: Hyperparameter tuning was performed in Jupyter notebooks located in
experiments/notebooks/. The optimal parameters were determined through RandomizedSearchCV.
Input: SMILES strings (Simplified Molecular Input Line Entry System)
Total Features: 1048 (24 descriptors + 1024 fingerprint bits)
- Type: Circular fingerprints (ECFP-like)
- Radius: 2 (captures 2-hop neighborhood)
- Bits: 1024 binary features
- Generated via: RDKit `AllChem.GetMorganFingerprintAsBitVect()`
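A minimal sketch of this feature generation with RDKit (the descriptor subset shown is illustrative; the project uses 24 descriptors):

```python
# Illustrative feature generation with RDKit; the descriptor subset shown here
# is an example, not the project's full 24-descriptor set.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

mol = Chem.MolFromSmiles("CCO")  # ethanol

# 1024-bit Morgan fingerprint with radius 2 (ECFP4-like)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
fp_bits = np.array(fp)

# A few physicochemical descriptors as examples
descriptors = np.array([
    Descriptors.MolWt(mol),
    Descriptors.MolLogP(mol),
    Descriptors.TPSA(mol),
    Descriptors.NumHDonors(mol),
    Descriptors.NumHAcceptors(mol),
])

features = np.concatenate([descriptors, fp_bits])  # descriptors + fingerprint bits
```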
SMILES Input
│
▼
┌─────────────────────────────────────────┐
│ Data Preprocessing │
│ • Validate SMILES │
│ • Filter organic molecules │
│ • Neutralize charges │
│ • Generate canonical SMILES │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Feature Engineering │
│ • Calculate 24 molecular descriptors │
│ • Generate 1024-bit Morgan fingerprints│
│ • Remove constant features (var<0.01) │
│ • Remove correlated features (>0.95) │
│ • Scale descriptors (StandardScaler) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ XGBoost Model │
│ • 1500 trees, depth 6 │
│ • L1 + L2 regularization │
│ • Histogram-based training │
└─────────────────────────────────────────┘
│
▼
Solubility Prediction (log mol/L)
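The XGBoost stage in the diagram above corresponds roughly to the configuration below (a minimal sketch; regularization strengths and any parameters not listed in the diagram are placeholder values, not the tuned ones):

```python
# Rough shape of the XGBoost stage above; regularization strengths and any
# parameters not listed in the diagram are placeholder values.
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=1500,    # 1500 trees
    max_depth=6,          # depth 6
    reg_alpha=0.1,        # L1 regularization (placeholder strength)
    reg_lambda=1.0,       # L2 regularization (placeholder strength)
    tree_method="hist",   # histogram-based training
)
# model.fit(X_train, y_train); model.predict(X_new) -> log mol/L
```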
| Solubility Range (log mol/L) | Classification |
|---|---|
| > 0 | Highly Soluble |
| -1 to 0 | Soluble |
| -2 to -1 | Slightly Soluble |
| -3 to -2 | Poorly Soluble |
| < -3 | Practically Insoluble |
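For reference, the classification table can be expressed as a small helper (illustrative, not part of the project code; behavior at exact boundaries is an assumption):

```python
# Helper expressing the classification table above (illustrative; boundary
# handling at exact thresholds is an assumption).
def classify_solubility(log_s: float) -> str:
    if log_s > 0:
        return "Highly Soluble"
    if log_s >= -1:
        return "Soluble"
    if log_s >= -2:
        return "Slightly Soluble"
    if log_s >= -3:
        return "Poorly Soluble"
    return "Practically Insoluble"
```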
AqSolPrediction/
├── src/
│ ├── components/
│ │ ├── data_loader.py # Data loading and cleaning
│ │ ├── feature_engineering.py # Molecular feature generation
│ │ ├── model_trainer.py # XGBoost training with CV
│ │ ├── model_evaluator.py # Metrics and visualization
│ │ ├── model_monitoring.py # Drift detection (KS, PSI)
│ │ └── evidently_monitor.py # Evidently AI integration
│ ├── pipeline/
│ │ ├── train_pipeline.py # Training orchestration
│ │ └── predict_pipeline.py # Inference pipeline
│ ├── exception.py
│ └── logger.py
├── app/ # FastAPI application
│ ├── main.py
│ ├── schemas.py
│ └── routers/
│ ├── predict.py
│ └── monitoring.py
├── streamlit_app/ # Streamlit UI
│ ├── app.py # Main prediction interface
│ └── pages/
│ ├── 1_Model_Monitoring.py # Drift detection dashboard
│ └── 2_Model_Info.py # Model info & SHAP analysis
├── experiments/
│ └── notebooks/ # Hyperparameter tuning notebooks
├── configs/
│ └── config.yaml # Central configuration
├── data/
│ ├── raw/ # AqSolDB dataset
│ └── processed/
├── artifacts/ # Trained models and scalers
├── docker/
├── mlruns/ # MLflow tracking
├── requirements.txt
└── docker-compose.yml
```bash
# Clone the repository
git clone https://github.com/yourusername/AqSolPrediction.git
cd AqSolPrediction

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Run full training pipeline with cross-validation
python -m src.pipeline.train_pipeline
```

```python
from src.pipeline.predict_pipeline import PredictionPipeline

# Initialize pipeline
pipeline = PredictionPipeline()

# Single prediction
result = pipeline.predict_single("CCO")  # Ethanol
print(f"Solubility: {result:.2f} log mol/L")

# Batch prediction
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
results = pipeline.predict_batch(smiles_list)
```

```bash
# Start FastAPI server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

API Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API information |
| `/health` | GET | Health check |
| `/model/info` | GET | Model information |
| `/predict/single` | POST | Single SMILES prediction |
| `/predict/batch` | POST | Batch predictions |
| `/monitoring/report` | GET | Drift detection report |
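A minimal example of calling the single-prediction endpoint; the JSON field name `smiles` is an assumption based on the pipeline's SMILES input, and the authoritative request schema lives in `app/schemas.py`:

```python
# Example call to the single-prediction endpoint; the JSON field name "smiles"
# is an assumption -- the authoritative schema is defined in app/schemas.py.
import requests

response = requests.post(
    "http://localhost:8000/predict/single",
    json={"smiles": "CCO"},
    timeout=10,
)
print(response.json())
```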
Option 1: Pull from Docker Hub (quickest)
```bash
# Run the Streamlit app
docker run -p 8501:8501 ankitkumar91/aqsol-streamlit:latest

# Run the API server
docker run -p 8000:8000 ankitkumar91/aqsol-api:latest
```

Option 2: Build locally

```bash
# Build and run Streamlit app
docker build -f docker/Dockerfile.streamlit -t aqsol-streamlit .
docker run -p 8501:8501 aqsol-streamlit

# Build and run API server
docker build -f docker/Dockerfile.api -t aqsol-api .
docker run -p 8000:8000 aqsol-api
```

Option 3: Docker Compose (both services)

```bash
docker-compose up -d
```

After running, access:
- Streamlit UI: http://localhost:8501
- API Server: http://localhost:8000
- API Docs: http://localhost:8000/docs
```bash
streamlit run streamlit_app/app.py
```

Features:
- Single and batch molecule predictions
- Model performance metrics (Train/Test R², RMSE, MAE)
- SHAP feature importance analysis
- Feature importance visualization
- Drift detection monitoring
The project includes SHAP (SHapley Additive exPlanations) analysis for model interpretability:
- Feature Impact Summary: Bar plot showing top features by mean |SHAP value|
- Feature Impact Distribution: Beeswarm plot showing how feature values affect predictions
Access SHAP analysis via the Streamlit UI under the "Model Info" page.
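A minimal SHAP sketch along these lines, assuming the trained XGBoost model and a feature matrix are supplied by the caller (e.g. loaded from the artifacts/ directory):

```python
# Minimal SHAP sketch; the caller supplies the trained XGBoost model and a
# feature matrix (e.g. loaded from the artifacts/ directory).
import shap

def shap_summary(model, X):
    """Plot the two SHAP views described above: bar summary and beeswarm."""
    explainer = shap.TreeExplainer(model)      # tree-model explainer for XGBoost
    shap_values = explainer.shap_values(X)
    shap.summary_plot(shap_values, X, plot_type="bar")  # mean |SHAP value| bar plot
    shap.summary_plot(shap_values, X)                   # beeswarm distribution
```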
Analysis of molecules with prediction errors exceeding 3.5 log units reveals systematic model limitations:
| Molecule | Actual | Predicted | Error | Failure Reason |
|---|---|---|---|---|
| Pyridazinone carboxylic acid | 0.40 | -3.46 | +3.86 | Heterocyclic tautomerism |
| Ethyl cyanoacrylate | -6.72 | -0.46 | -6.26 | Reactive small molecule |
| Triphenylmethane dye | -0.70 | -5.79 | +5.10 | Large charged molecule |
| Cyclic ether alcohol | 2.14 | -1.37 | +3.51 | Conformationally flexible |
| Hydroxylated benzophenone | -7.27 | -3.31 | -3.96 | Intramolecular H-bonding |
| Bis-isatin (Indigotin-like) | -7.56 | -3.78 | -3.79 | Large planar pigment |
| Amphiphilic morpholine | 0.96 | -3.15 | +4.11 | Surfactant-like behavior |
| Pattern | Count | Description |
|---|---|---|
| Charged/Zwitterionic | 1 | Large molecules with delocalized charges - Morgan fingerprints don't capture long-range electronic effects |
| Intramolecular H-bonding | 2 | Compounds where internal H-bonds reduce apparent polarity more than expected |
| Large Planar Aromatics | 1 | Molecules with strong pi-stacking tendency leading to poor crystal dissolution |
| Amphiphilic | 1 | Surfactant-like molecules with unexpected high solubility due to micelle formation |
| Reactive/Unusual | 1 | Highly reactive or underrepresented chemotypes in training data |
| Conformationally Complex | 1 | Flexible molecules with complex solvation behavior |
- Feature Limitations: Morgan fingerprints capture local structural patterns but miss global electronic effects like charge delocalization and long-range conjugation.
- Training Data Gaps: Unusual molecules (dyes, reactive compounds, surfactants) are underrepresented, leading to poor extrapolation.
- Physical Effects Not Modeled: Crystal packing, micelle formation, and intramolecular hydrogen bonding significantly affect solubility but aren't directly captured by the current features.
Interactive Analysis: View detailed error analysis with molecule structures in the Streamlit UI under the "Model Info" page.
- Ensemble Methods: Combine XGBoost with other models (Random Forest, LightGBM) for improved robustness
- Confidence Intervals: Add prediction uncertainty quantification using quantile regression or conformal prediction
- Outlier Flagging: Automatically flag molecules similar to known high-error cases using Tanimoto similarity (see the sketch after this list)
- Additional Descriptors: Include 3D descriptors, charge distribution features, and hydrogen bonding descriptors
- Graph Neural Networks: Implement GNN-based models (MPNN, AttentiveFP) to capture global molecular structure
- Transfer Learning: Pre-train on larger chemical datasets before fine-tuning on solubility data
- Active Learning: Iteratively improve the model by prioritizing high-uncertainty predictions for experimental validation
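As referenced in the Outlier Flagging item above, a sketch of Tanimoto-based flagging against known high-error molecules (the reference SMILES and the 0.7 threshold are illustrative choices, not values from the project):

```python
# Sketch of Tanimoto-based outlier flagging; the reference SMILES and the 0.7
# threshold are illustrative choices, not values from the project.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)

# Fingerprints of known high-error molecules (example: ethyl cyanoacrylate)
high_error_fps = [morgan_fp(s) for s in ["C=C(C#N)C(=O)OCC"]]

def flag_similar_to_high_error(query_smiles: str, threshold: float = 0.7) -> bool:
    """Return True if the query is Tanimoto-similar to a known high-error case."""
    query_fp = morgan_fp(query_smiles)
    return any(
        DataStructs.TanimotoSimilarity(query_fp, ref) >= threshold
        for ref in high_error_fps
    )
```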
| Metric | Description |
|---|---|
| R² Score | Coefficient of determination |
| RMSE | Root Mean Squared Error |
| MAE | Mean Absolute Error |
| Within ±0.5 log | % predictions within 0.5 log units |
| Within ±1.0 log | % predictions within 1.0 log units |
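The metrics in the table above could be computed as follows (an illustrative sketch, not the project's `model_evaluator.py`):

```python
# Illustrative computation of the evaluation metrics above.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    abs_err = np.abs(y_true - y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": mean_absolute_error(y_true, y_pred),
        "within_0.5_log": float(np.mean(abs_err <= 0.5)),  # fraction within ±0.5 log units
        "within_1.0_log": float(np.mean(abs_err <= 1.0)),  # fraction within ±1.0 log units
    }
```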
The system includes drift detection capabilities:
- Data Drift: Kolmogorov-Smirnov test (significance level 0.05; see the sketch after this list)
- Feature Stability: Population Stability Index (threshold: 0.1)
- Concept Drift: Target distribution shift detection
- Evidently AI: Interactive HTML reports
- Prediction Counter: Tracks total predictions made
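A minimal sketch of the two statistical checks listed above (KS test via SciPy and a simple PSI implementation); the bin count and any details beyond the stated thresholds are assumptions:

```python
# Sketch of the statistical drift checks; bin count and any details beyond the
# stated thresholds (p-value 0.05, PSI 0.1) are assumptions.
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(reference, current, alpha: float = 0.05) -> bool:
    """Kolmogorov-Smirnov data-drift check: True when the distributions differ."""
    _, p_value = ks_2samp(reference, current)
    return p_value < alpha

def psi(reference, current, bins: int = 10) -> float:
    """Population Stability Index; values above ~0.1 suggest feature instability."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # guard against log(0) / division by zero
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```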
- ML Framework: XGBoost, Scikit-learn
- Molecular Chemistry: RDKit
- Model Explainability: SHAP
- API: FastAPI, Uvicorn
- Experiment Tracking: MLflow
- Monitoring: Evidently AI
- Deployment: Docker, Docker Compose
- UI: Streamlit
AqSolDB - Aqueous Solubility Database containing curated solubility data for organic compounds.
MIT License