Ankit-kumar91/AqSolPrediction

AqSolPrediction

End-to-end MLOps pipeline for predicting aqueous solubility of organic molecules using machine learning.

Overview

This project estimates the aqueous solubility (log mol/L) of organic molecules directly from their SMILES representations, using an XGBoost regression model trained on Morgan fingerprints and physicochemical descriptors. Early solubility estimates matter in drug design, pharmaceutical formulation, coatings, and battery electrolyte development.

Beyond model training, the project includes automated training and inference pipelines, a FastAPI service for real-time predictions, drift monitoring, and deployment configurations, so that solubility prediction can be integrated into R&D and decision-making workflows.

Model Performance

| Metric | Train | Test |
| --- | --- | --- |
| R² | 0.86 | 0.86 |
| RMSE | 0.87 | 0.86 |
| MAE | 0.63 | 0.62 |

Model Architecture

Algorithm

XGBoost Regressor (Gradient Boosting Decision Trees)

Note: Hyperparameter tuning was performed in Jupyter notebooks located in experiments/notebooks/. The optimal parameters were determined through RandomizedSearchCV.

Feature Engineering

Input: SMILES strings (Simplified Molecular Input Line Entry System)

Total Features: 1048 before feature selection (24 descriptors + 1024 fingerprint bits)

Morgan Fingerprints (1024 features)

  • Type: Circular fingerprints (ECFP-like)
  • Radius: 2 (captures 2-hop neighborhood)
  • Bits: 1024 binary features
  • Generated via: RDKit AllChem.GetMorganFingerprintAsBitVect()
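A minimal sketch of the fingerprint step described above, assuming RDKit is installed. The helper name `morgan_bits` is hypothetical, and the project's own code in feature_engineering.py may differ. Recent RDKit releases emit a deprecation warning for this call but still support it.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits(smiles, radius=2, n_bits=1024):
    """Return a Morgan fingerprint as a list of 0/1 integers (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when the SMILES string cannot be parsed
        raise ValueError(f"Invalid SMILES: {smiles!r}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    # ToBitString() yields a string of '0'/'1' characters, one per bit
    return [int(ch) for ch in fp.ToBitString()]


bits = morgan_bits("CCO")  # ethanol
print(len(bits), sum(bits))  # total bits and number of bits set
```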

Pipeline Architecture

SMILES Input
     │
     ▼
┌─────────────────────────────────────────┐
│         Data Preprocessing              │
│  • Validate SMILES                      │
│  • Filter organic molecules             │
│  • Neutralize charges                   │
│  • Generate canonical SMILES            │
└─────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────┐
│        Feature Engineering              │
│  • Calculate 24 molecular descriptors   │
│  • Generate 1024-bit Morgan fingerprints│
│  • Remove constant features (var<0.01)  │
│  • Remove correlated features (>0.95)   │
│  • Scale descriptors (StandardScaler)   │
└─────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────┐
│          XGBoost Model                  │
│  • 1500 trees, depth 6                  │
│  • L1 + L2 regularization               │
│  • Histogram-based training             │
└─────────────────────────────────────────┘
     │
     ▼
Solubility Prediction (log mol/L)
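The two feature-filtering steps in the diagram (variance below 0.01, correlation above 0.95) can be sketched in plain Python as follows. This is an illustration only: the function names are hypothetical, and the use of population variance and absolute Pearson correlation is an assumption, not necessarily what the project does.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(columns, var_threshold=0.01, corr_threshold=0.95):
    """columns maps feature name -> list of values; returns the names kept."""
    # Step 1: drop near-constant features (assumed: population variance)
    kept = [n for n, v in columns.items() if pvariance(v) >= var_threshold]
    # Step 2: keep a feature only if it is not highly correlated
    # with any feature already selected (first one wins)
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) <= corr_threshold
               for s in selected):
            selected.append(name)
    return selected


demo = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.1],   # almost a multiple of "a"
    "c": [1, 1, 1, 1],     # constant
    "d": [4, 1, 3, 2],
}
print(select_features(demo))
```

In this toy input, "c" is dropped for zero variance and "b" for its near-perfect correlation with "a".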

Output Interpretation

| Solubility Range (log mol/L) | Classification |
| --- | --- |
| > 0 | Highly Soluble |
| -1 to 0 | Soluble |
| -2 to -1 | Slightly Soluble |
| -3 to -2 | Poorly Soluble |
| < -3 | Practically Insoluble |
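The classification ranges above can be expressed as a small lookup function. The table does not say which class a boundary value belongs to, so this sketch assumes each range includes its upper bound (for example, exactly 0 is "Soluble" and exactly -3 is "Poorly Soluble"). The function name is hypothetical.

```python
def classify_solubility(log_s):
    """Map a predicted solubility (log mol/L) to a class label.

    Boundary handling is an assumption: each range includes its upper bound.
    """
    if log_s > 0:
        return "Highly Soluble"
    if log_s > -1:
        return "Soluble"
    if log_s > -2:
        return "Slightly Soluble"
    if log_s >= -3:
        return "Poorly Soluble"
    return "Practically Insoluble"


for value in (0.5, -0.46, -2.5, -7.27):
    print(value, classify_solubility(value))
```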

Project Structure

AqSolPrediction/
├── src/
│   ├── components/
│   │   ├── data_loader.py           # Data loading and cleaning
│   │   ├── feature_engineering.py   # Molecular feature generation
│   │   ├── model_trainer.py         # XGBoost training with CV
│   │   ├── model_evaluator.py       # Metrics and visualization
│   │   ├── model_monitoring.py      # Drift detection (KS, PSI)
│   │   └── evidently_monitor.py     # Evidently AI integration
│   ├── pipeline/
│   │   ├── train_pipeline.py        # Training orchestration
│   │   └── predict_pipeline.py      # Inference pipeline
│   ├── exception.py
│   └── logger.py
├── app/                             # FastAPI application
│   ├── main.py
│   ├── schemas.py
│   └── routers/
│       ├── predict.py
│       └── monitoring.py
├── streamlit_app/                   # Streamlit UI
│   ├── app.py                       # Main prediction interface
│   └── pages/
│       ├── 1_Model_Monitoring.py    # Drift detection dashboard
│       └── 2_Model_Info.py          # Model info & SHAP analysis
├── experiments/
│   └── notebooks/                   # Hyperparameter tuning notebooks
├── configs/
│   └── config.yaml                  # Central configuration
├── data/
│   ├── raw/                         # AqSolDB dataset
│   └── processed/
├── artifacts/                       # Trained models and scalers
├── docker/
├── mlruns/                          # MLflow tracking
├── requirements.txt
└── docker-compose.yml

Installation

# Clone the repository
git clone https://github.com/Ankit-kumar91/AqSolPrediction.git
cd AqSolPrediction

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Usage

Training

# Run full training pipeline with cross-validation
python -m src.pipeline.train_pipeline

Prediction

from src.pipeline.predict_pipeline import PredictionPipeline

# Initialize pipeline
pipeline = PredictionPipeline()

# Single prediction
result = pipeline.predict_single("CCO")  # Ethanol
print(f"Solubility: {result:.2f} log mol/L")

# Batch prediction
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
results = pipeline.predict_batch(smiles_list)

API Server

# Start FastAPI server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

API Endpoints:

| Endpoint | Method | Description |
| --- | --- | --- |
| / | GET | API information |
| /health | GET | Health check |
| /model/info | GET | Model information |
| /predict/single | POST | Single SMILES prediction |
| /predict/batch | POST | Batch predictions |
| /monitoring/report | GET | Drift detection report |
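A hedged sketch of calling the single-prediction endpoint using only the Python standard library. The request body shape (`{"smiles": ...}`) is an assumption, since the actual schema lives in app/schemas.py; the interactive documentation that FastAPI serves by default is the place to confirm it. The function names are hypothetical.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # matches the uvicorn command above


def build_single_request(smiles, base_url=BASE_URL):
    """Build (but do not send) a POST request for /predict/single."""
    # Assumed payload: the JSON field name "smiles" is not confirmed by the README
    body = json.dumps({"smiles": smiles}).encode("utf-8")
    return request.Request(
        f"{base_url}/predict/single",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_single(smiles):
    """Send the request and return the decoded JSON response."""
    with request.urlopen(build_single_request(smiles), timeout=10) as resp:
        return json.load(resp)


req = build_single_request("CCO")
print(req.get_method(), req.full_url)
```

Calling `predict_single("CCO")` requires the API server to be running.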

Docker Deployment

Option 1: Pull from Docker Hub (quickest)

# Run the Streamlit app
docker run -p 8501:8501 ankitkumar91/aqsol-streamlit:latest

# Run the API server
docker run -p 8000:8000 ankitkumar91/aqsol-api:latest

Option 2: Build locally

# Build and run Streamlit app
docker build -f docker/Dockerfile.streamlit -t aqsol-streamlit .
docker run -p 8501:8501 aqsol-streamlit

# Build and run API server
docker build -f docker/Dockerfile.api -t aqsol-api .
docker run -p 8000:8000 aqsol-api

Option 3: Docker Compose (both services)

docker-compose up -d

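After running, access the services on the ports mapped above (assuming the default port mappings):

  • Streamlit UI: http://localhost:8501
  • API server: http://localhost:8000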

Streamlit UI

streamlit run streamlit_app/app.py

Features:

  • Single and batch molecule predictions
  • Model performance metrics (Train/Test R², RMSE, MAE)
  • SHAP feature importance analysis
  • Feature importance visualization
  • Drift detection monitoring

Model Explainability

The project includes SHAP (SHapley Additive exPlanations) analysis for model interpretability:

  • Feature Impact Summary: Bar plot showing top features by mean |SHAP value|
  • Feature Impact Distribution: Beeswarm plot showing how feature values affect predictions
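The ranking behind the bar plot (mean |SHAP value| per feature) is a simple aggregation. The sketch below assumes a SHAP matrix with one row per molecule and one column per feature is already available; the feature names and values are hypothetical, chosen only for illustration.

```python
def rank_by_mean_abs_shap(shap_values, feature_names, top_k=10):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important."""
    n_rows = len(shap_values)
    means = [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(len(feature_names))
    ]
    ranked = sorted(zip(feature_names, means), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]


# Hypothetical SHAP values for two molecules and three features
shap_matrix = [
    [0.1, -0.5, 0.0],
    [-0.3, 0.4, 0.1],
]
names = ["MolLogP", "TPSA", "bit_17"]  # illustrative names only
for name, score in rank_by_mean_abs_shap(shap_matrix, names):
    print(f"{name}: {score:.2f}")
```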

Access the SHAP analysis in the Streamlit UI on the "Model Info" page.

Error Analysis

Analysis of molecules with absolute prediction errors exceeding 3.5 log units reveals systematic model limitations:

High-Error Molecules (|Residual| > 3.5)

| Molecule | Actual | Predicted | Error (Actual - Predicted) | Failure Reason |
| --- | --- | --- | --- | --- |
| Pyridazinone carboxylic acid | 0.40 | -3.46 | +3.86 | Heterocyclic tautomerism |
| Ethyl cyanoacrylate | -6.72 | -0.46 | -6.26 | Reactive small molecule |
| Triphenylmethane dye | -0.70 | -5.79 | +5.10 | Large charged molecule |
| Cyclic ether alcohol | 2.14 | -1.37 | +3.51 | Conformationally flexible |
| Hydroxylated benzophenone | -7.27 | -3.31 | -3.96 | Intramolecular H-bonding |
| Bis-isatin (Indigotin-like) | -7.56 | -3.78 | -3.79 | Large planar pigment |
| Amphiphilic morpholine | 0.96 | -3.15 | +4.11 | Surfactant-like behavior |
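The errors in the table are consistent with residual = actual - predicted, so a positive value means the model under-predicted solubility. A minimal sketch of the filtering step follows, using two rows from the table plus one hypothetical well-predicted molecule for contrast. The function name is hypothetical.

```python
THRESHOLD = 3.5  # log units, as stated above

records = [
    # (name, actual, predicted) in log mol/L
    ("Pyridazinone carboxylic acid", 0.40, -3.46),
    ("Ethyl cyanoacrylate", -6.72, -0.46),
    ("Well-predicted example", -1.10, -0.85),  # hypothetical, not from the dataset
]


def high_error(rows, threshold=THRESHOLD):
    """Return (name, residual) for rows whose |actual - predicted| exceeds threshold."""
    flagged = []
    for name, actual, predicted in rows:
        residual = actual - predicted
        if abs(residual) > threshold:
            flagged.append((name, round(residual, 2)))
    return flagged


for name, residual in high_error(records):
    print(f"{name}: {residual:+.2f}")
```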

Common Failure Patterns

| Pattern | Count | Description |
| --- | --- | --- |
| Charged/Zwitterionic | 1 | Large molecules with delocalized charges; Morgan fingerprints don't capture long-range electronic effects |
| Intramolecular H-bonding | 2 | Compounds where internal H-bonds reduce apparent polarity more than expected |
| Large Planar Aromatics | 1 | Molecules with a strong pi-stacking tendency, leading to poor crystal dissolution |
| Amphiphilic | 1 | Surfactant-like molecules with unexpectedly high solubility due to micelle formation |
| Reactive/Unusual | 1 | Highly reactive chemotypes, or chemotypes underrepresented in the training data |
| Conformationally Complex | 1 | Flexible molecules with complex solvation behavior |

Key Insights

  1. Feature Limitations: Morgan fingerprints capture local structural patterns but miss global electronic effects like charge delocalization and long-range conjugation.

  2. Training Data Gaps: Unusual molecules (dyes, reactive compounds, surfactants) are underrepresented, leading to poor extrapolation.

  3. Physical Effects Not Modeled: Crystal packing, micelle formation, and intramolecular hydrogen bonding significantly affect solubility but aren't directly captured by the current features.

Interactive Analysis: View the detailed error analysis with molecule structures in the Streamlit UI on the "Model Info" page.

Future Improvements

  • Ensemble Methods: Combine XGBoost with other models (Random Forest, LightGBM) for improved robustness

  • Confidence Intervals: Add prediction uncertainty quantification using quantile regression or conformal prediction

  • Outlier Flagging: Automatically flag molecules similar to known high-error cases using Tanimoto similarity

  • Additional Descriptors: Include 3D descriptors, charge distribution features, and hydrogen bonding descriptors

  • Graph Neural Networks: Implement GNN-based models (MPNN, AttentiveFP) to capture global molecular structure

  • Transfer Learning: Pre-train on larger chemical datasets before fine-tuning on solubility data

  • Active Learning: Iteratively improve model by prioritizing high-uncertainty predictions for experimental validation
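As an illustration of the outlier-flagging idea, the sketch below computes Tanimoto similarity between fingerprints represented as sets of on-bit indices and flags queries that resemble known high-error molecules. The 0.7 threshold, the function names, and the bit sets are all hypothetical; this is not part of the current project.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity: |A intersect B| / |A union B| over sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def flag_similar(query_bits, high_error_fps, threshold=0.7):
    """Return names of known high-error molecules similar to the query."""
    return [
        name
        for name, bits in high_error_fps.items()
        if tanimoto(query_bits, bits) >= threshold
    ]


# Hypothetical on-bit sets standing in for real Morgan fingerprints
known = {
    "dye_like": {1, 5, 9, 42, 77},
    "surfactant_like": {3, 8, 100, 512},
}
print(flag_similar({1, 5, 9, 42}, known))
```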

Evaluation Metrics

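| Metric | Description |
| --- | --- |
| R² Score | Coefficient of determination |
| RMSE | Root Mean Squared Error |
| MAE | Mean Absolute Error |
| Within ±0.5 log | Percentage of predictions within 0.5 log units of the measured value |
| Within ±1.0 log | Percentage of predictions within 1.0 log units of the measured value |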
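For reference, all five metrics can be computed from paired measured and predicted values with a few lines of plain Python. This is a sketch with a hypothetical function name and made-up sample values, not the code in model_evaluator.py.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, MAE and the share of predictions within 0.5 / 1.0 log units."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "within_0.5": 100 * sum(abs(r) <= 0.5 for r in residuals) / n,
        "within_1.0": 100 * sum(abs(r) <= 1.0 for r in residuals) / n,
    }


# Made-up values for illustration only
metrics = regression_metrics([-1, -2, -3, -4], [-1.2, -2.0, -3.8, -2.5])
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```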

Monitoring

The system includes drift detection capabilities:

  • Data Drift: Kolmogorov-Smirnov test (threshold: 0.05; a p-value above 0.05 is treated as no drift)
  • Feature Stability: Population Stability Index (threshold: 0.1)
  • Concept Drift: Target distribution shift detection
  • Evidently AI: Interactive HTML reports
  • Prediction Counter: Tracks total predictions made
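The Population Stability Index compares binned proportions of a reference sample and a current sample: PSI = sum over bins of (current - reference) * ln(current / reference). The sketch below is a plain-Python illustration; the equal-width binning, the bin count, and the small epsilon used for empty bins are assumptions, and model_monitoring.py may implement it differently.

```python
from math import log


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index using equal-width bins from the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * log(c / r) for r, c in zip(ref_p, cur_p))


reference = [float(v) for v in range(100)]
shifted = [v + 50 for v in reference]
print(psi(reference, reference))       # identical samples
print(psi(reference, shifted) > 0.1)   # compared against the 0.1 threshold above
```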

Technology Stack

  • ML Framework: XGBoost, Scikit-learn
  • Molecular Chemistry: RDKit
  • Model Explainability: SHAP
  • API: FastAPI, Uvicorn
  • Experiment Tracking: MLflow
  • Monitoring: Evidently AI
  • Deployment: Docker, Docker Compose
  • UI: Streamlit

Data Source

AqSolDB - Aqueous Solubility Database containing curated solubility data for organic compounds.

License

MIT License
