End-to-end MLOps pipeline for predicting aqueous solubility of organic molecules using machine learning.
This project addresses aqueous solubility prediction by estimating the solubility (log mol/L) of organic molecules directly from their SMILES representations, using an XGBoost regression model trained on molecular fingerprints and physicochemical descriptors. By combining robust machine learning with chemically meaningful features, the system enables early solubility assessment, which is essential for drug design, pharmaceutical formulation, coatings, and battery electrolyte development. Beyond model training, the project delivers a complete production-ready solution: automated training and inference pipelines, a FastAPI service for real-time predictions, monitoring, and deployment configurations, allowing solubility prediction to be integrated seamlessly into R&D and decision-making workflows.
| Metric | Train | Test |
|---|---|---|
| R² | 0.86 | 0.86 |
| RMSE | 0.87 | 0.86 |
| MAE | 0.63 | 0.62 |
XGBoost Regressor (Gradient Boosting Decision Trees)
Note: Hyperparameter tuning was performed in Jupyter notebooks located in
experiments/notebooks/. The optimal parameters were determined through RandomizedSearchCV.
Input: SMILES strings (Simplified Molecular Input Line Entry System)
Total Features: 1048 (24 descriptors + 1024 fingerprint bits)
- Type: Circular fingerprints (ECFP-like)
- Radius: 2 (captures 2-hop neighborhood)
- Bits: 1024 binary features
- Generated via: RDKit `AllChem.GetMorganFingerprintAsBitVect()`
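A minimal sketch of this feature generation with RDKit (the descriptor subset shown is illustrative; the project uses 24 descriptors):

```python
# Illustrative feature generation with RDKit; the descriptor subset shown here
# is an example, not the project's full 24-descriptor set.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

mol = Chem.MolFromSmiles("CCO")  # ethanol

# 1024-bit Morgan fingerprint with radius 2 (ECFP4-like)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
fp_bits = np.array(fp)

# A few physicochemical descriptors as examples
descriptors = np.array([
    Descriptors.MolWt(mol),
    Descriptors.MolLogP(mol),
    Descriptors.TPSA(mol),
    Descriptors.NumHDonors(mol),
    Descriptors.NumHAcceptors(mol),
])

features = np.concatenate([descriptors, fp_bits])  # descriptors + fingerprint bits
```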
SMILES Input
│
▼
┌─────────────────────────────────────────┐
│ Data Preprocessing │
│ • Validate SMILES │
│ • Filter organic molecules │
│ • Neutralize charges │
│ • Generate canonical SMILES │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Feature Engineering │
│ • Calculate 24 molecular descriptors │
│ • Generate 1024-bit Morgan fingerprints│
│ • Remove constant features (var<0.01) │
│ • Remove correlated features (>0.95) │
│ • Scale descriptors (StandardScaler) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ XGBoost Model │
│ • 1500 trees, depth 6 │
│ • L1 + L2 regularization │
│ • Histogram-based training │
└─────────────────────────────────────────┘
│
▼
Solubility Prediction (log mol/L)
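The XGBoost stage in the diagram above corresponds roughly to the configuration below (a minimal sketch; regularization strengths and any parameters not listed in the diagram are placeholder values, not the tuned ones):

```python
# Rough shape of the XGBoost stage above; regularization strengths and any
# parameters not listed in the diagram are placeholder values.
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=1500,    # 1500 trees
    max_depth=6,          # depth 6
    reg_alpha=0.1,        # L1 regularization (placeholder strength)
    reg_lambda=1.0,       # L2 regularization (placeholder strength)
    tree_method="hist",   # histogram-based training
)
# model.fit(X_train, y_train); model.predict(X_new) -> log mol/L
```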
| Solubility Range (log mol/L) | Classification |
|---|---|
| > 0 | Highly Soluble |
| -1 to 0 | Soluble |
| -2 to -1 | Slightly Soluble |
| -3 to -2 | Poorly Soluble |
| < -3 | Practically Insoluble |
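For reference, the classification table can be expressed as a small helper (illustrative, not part of the project code; behavior at exact boundaries is an assumption):

```python
# Helper expressing the classification table above (illustrative; boundary
# handling at exact thresholds is an assumption).
def classify_solubility(log_s: float) -> str:
    if log_s > 0:
        return "Highly Soluble"
    if log_s >= -1:
        return "Soluble"
    if log_s >= -2:
        return "Slightly Soluble"
    if log_s >= -3:
        return "Poorly Soluble"
    return "Practically Insoluble"
```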
AqSolPrediction/
├── src/
│ ├── components/
│ │ ├── data_loader.py # Data loading and cleaning
│ │ ├── feature_engineering.py # Molecular feature generation
│ │ ├── model_trainer.py # XGBoost training with CV
│ │ ├── model_evaluator.py # Metrics and visualization
│ │ ├── model_monitoring.py # Drift detection (KS, PSI)
│ │ └── evidently_monitor.py # Evidently AI integration
│ ├── pipeline/
│ │ ├── train_pipeline.py # Training orchestration
│ │ └── predict_pipeline.py # Inference pipeline
│ ├── exception.py
│ └── logger.py
├── app/ # FastAPI application
│ ├── main.py
│ ├── schemas.py
│ └── routers/
│ ├── predict.py
│ └── monitoring.py
├── streamlit_app/ # Streamlit UI
│ ├── app.py # Main prediction interface
│ └── pages/
│ ├── 1_Model_Monitoring.py # Drift detection dashboard
│ └── 2_Model_Info.py # Model info & SHAP analysis
├── experiments/
│ └── notebooks/ # Hyperparameter tuning notebooks
├── configs/
│ └── config.yaml # Central configuration
├── data/
│ ├── raw/ # AqSolDB dataset
│ └── processed/
├── artifacts/ # Trained models and scalers
├── docker/
├── mlruns/ # MLflow tracking
├── requirements.txt
└── docker-compose.yml
```bash
# Clone the repository
git clone https://github.com/yourusername/AqSolPrediction.git
cd AqSolPrediction

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Run full training pipeline with cross-validation
python -m src.pipeline.train_pipeline
```

```python
from src.pipeline.predict_pipeline import PredictionPipeline

# Initialize pipeline
pipeline = PredictionPipeline()

# Single prediction
result = pipeline.predict_single("CCO")  # Ethanol
print(f"Solubility: {result:.2f} log mol/L")

# Batch prediction
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
results = pipeline.predict_batch(smiles_list)
```

```bash
# Start FastAPI server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

API Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API information |
| `/health` | GET | Health check |
| `/model/info` | GET | Model information |
| `/predict/single` | POST | Single SMILES prediction |
| `/predict/batch` | POST | Batch predictions |
| `/monitoring/report` | GET | Drift detection report |
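A minimal example of calling the single-prediction endpoint; the JSON field name `smiles` is an assumption based on the pipeline's SMILES input, and the authoritative request schema lives in `app/schemas.py`:

```python
# Example call to the single-prediction endpoint; the JSON field name "smiles"
# is an assumption -- the authoritative schema is defined in app/schemas.py.
import requests

response = requests.post(
    "http://localhost:8000/predict/single",
    json={"smiles": "CCO"},
    timeout=10,
)
print(response.json())
```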
Option 1: Pull from Docker Hub (quickest)
```bash
# Run the Streamlit app
docker run -p 8501:8501 ankitkumar91/aqsol-streamlit:latest

# Run the API server
docker run -p 8000:8000 ankitkumar91/aqsol-api:latest
```

Option 2: Build locally

```bash
# Build and run Streamlit app
docker build -f docker/Dockerfile.streamlit -t aqsol-streamlit .
docker run -p 8501:8501 aqsol-streamlit

# Build and run API server
docker build -f docker/Dockerfile.api -t aqsol-api .
docker run -p 8000:8000 aqsol-api
```

Option 3: Docker Compose (both services)

```bash
docker-compose up -d
```

After running, access:
- Streamlit UI: http://localhost:8501
- API Server: http://localhost:8000
- API Docs: http://localhost:8000/docs
```bash
streamlit run streamlit_app/app.py
```

Features:
- Single and batch molecule predictions
- Model performance metrics (Train/Test R², RMSE, MAE)
- SHAP feature importance analysis
- Feature importance visualization
- Drift detection monitoring
The project includes SHAP (SHapley Additive exPlanations) analysis for model interpretability:
- Feature Impact Summary: Bar plot showing top features by mean |SHAP value|
- Feature Impact Distribution: Beeswarm plot showing how feature values affect predictions
Access SHAP analysis via the Streamlit UI under the "Model Info" page.
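A minimal SHAP sketch along these lines, assuming the trained XGBoost model and a feature matrix are supplied by the caller (e.g. loaded from the artifacts/ directory):

```python
# Minimal SHAP sketch; the caller supplies the trained XGBoost model and a
# feature matrix (e.g. loaded from the artifacts/ directory).
import shap

def shap_summary(model, X):
    """Plot the two SHAP views described above: bar summary and beeswarm."""
    explainer = shap.TreeExplainer(model)      # tree-model explainer for XGBoost
    shap_values = explainer.shap_values(X)
    shap.summary_plot(shap_values, X, plot_type="bar")  # mean |SHAP value| bar plot
    shap.summary_plot(shap_values, X)                   # beeswarm distribution
```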
Analysis of molecules with prediction errors exceeding 3.5 log units reveals systematic model limitations:
| Molecule | Actual | Predicted | Error | Failure Reason |
|---|---|---|---|---|
| Pyridazinone carboxylic acid | 0.40 | -3.46 | +3.86 | Heterocyclic tautomerism |
| Ethyl cyanoacrylate | -6.72 | -0.46 | -6.26 | Reactive small molecule |
| Triphenylmethane dye | -0.70 | -5.79 | +5.10 | Large charged molecule |
| Cyclic ether alcohol | 2.14 | -1.37 | +3.51 | Conformationally flexible |
| Hydroxylated benzophenone | -7.27 | -3.31 | -3.96 | Intramolecular H-bonding |
| Bis-isatin (Indigotin-like) | -7.56 | -3.78 | -3.79 | Large planar pigment |
| Amphiphilic morpholine | 0.96 | -3.15 | +4.11 | Surfactant-like behavior |
| Pattern | Count | Description |
|---|---|---|
| Charged/Zwitterionic | 1 | Large molecules with delocalized charges - Morgan fingerprints don't capture long-range electronic effects |
| Intramolecular H-bonding | 2 | Compounds where internal H-bonds reduce apparent polarity more than expected |
| Large Planar Aromatics | 1 | Molecules with strong pi-stacking tendency leading to poor crystal dissolution |
| Amphiphilic | 1 | Surfactant-like molecules with unexpected high solubility due to micelle formation |
| Reactive/Unusual | 1 | Highly reactive or underrepresented chemotypes in training data |
| Conformationally Complex | 1 | Flexible molecules with complex solvation behavior |
- Feature Limitations: Morgan fingerprints capture local structural patterns but miss global electronic effects like charge delocalization and long-range conjugation.
- Training Data Gaps: Unusual molecules (dyes, reactive compounds, surfactants) are underrepresented, leading to poor extrapolation.
- Physical Effects Not Modeled: Crystal packing, micelle formation, and intramolecular hydrogen bonding significantly affect solubility but aren't directly captured by the current features.
Interactive Analysis: View detailed error analysis with molecule structures in the Streamlit UI under the "Model Info" page.
- Ensemble Methods: Combine XGBoost with other models (Random Forest, LightGBM) for improved robustness
- Confidence Intervals: Add prediction uncertainty quantification using quantile regression or conformal prediction
- Outlier Flagging: Automatically flag molecules similar to known high-error cases using Tanimoto similarity (see the sketch after this list)
- Additional Descriptors: Include 3D descriptors, charge distribution features, and hydrogen bonding descriptors
- Graph Neural Networks: Implement GNN-based models (MPNN, AttentiveFP) to capture global molecular structure
- Transfer Learning: Pre-train on larger chemical datasets before fine-tuning on solubility data
- Active Learning: Iteratively improve the model by prioritizing high-uncertainty predictions for experimental validation
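As referenced in the Outlier Flagging item above, a sketch of Tanimoto-based flagging against known high-error molecules (the reference SMILES and the 0.7 threshold are illustrative choices, not values from the project):

```python
# Sketch of Tanimoto-based outlier flagging; the reference SMILES and the 0.7
# threshold are illustrative choices, not values from the project.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)

# Fingerprints of known high-error molecules (example: ethyl cyanoacrylate)
high_error_fps = [morgan_fp(s) for s in ["C=C(C#N)C(=O)OCC"]]

def flag_similar_to_high_error(query_smiles: str, threshold: float = 0.7) -> bool:
    """Return True if the query is Tanimoto-similar to a known high-error case."""
    query_fp = morgan_fp(query_smiles)
    return any(
        DataStructs.TanimotoSimilarity(query_fp, ref) >= threshold
        for ref in high_error_fps
    )
```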
| Metric | Description |
|---|---|
| R² Score | Coefficient of determination |
| RMSE | Root Mean Squared Error |
| MAE | Mean Absolute Error |
| Within ±0.5 log | % predictions within 0.5 log units |
| Within ±1.0 log | % predictions within 1.0 log units |
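The metrics in the table above could be computed as follows (an illustrative sketch, not the project's `model_evaluator.py`):

```python
# Illustrative computation of the evaluation metrics above.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    abs_err = np.abs(y_true - y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": mean_absolute_error(y_true, y_pred),
        "within_0.5_log": float(np.mean(abs_err <= 0.5)),  # fraction within ±0.5 log units
        "within_1.0_log": float(np.mean(abs_err <= 1.0)),  # fraction within ±1.0 log units
    }
```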
The system includes drift detection capabilities:
- Data Drift: Kolmogorov-Smirnov test (significance level 0.05; see the sketch after this list)
- Feature Stability: Population Stability Index (threshold: 0.1)
- Concept Drift: Target distribution shift detection
- Evidently AI: Interactive HTML reports
- Prediction Counter: Tracks total predictions made
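A minimal sketch of the two statistical checks listed above (KS test via SciPy and a simple PSI implementation); the bin count and any details beyond the stated thresholds are assumptions:

```python
# Sketch of the statistical drift checks; bin count and any details beyond the
# stated thresholds (p-value 0.05, PSI 0.1) are assumptions.
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(reference, current, alpha: float = 0.05) -> bool:
    """Kolmogorov-Smirnov data-drift check: True when the distributions differ."""
    _, p_value = ks_2samp(reference, current)
    return p_value < alpha

def psi(reference, current, bins: int = 10) -> float:
    """Population Stability Index; values above ~0.1 suggest feature instability."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # guard against log(0) / division by zero
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```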
- ML Framework: XGBoost, Scikit-learn
- Molecular Chemistry: RDKit
- Model Explainability: SHAP
- API: FastAPI, Uvicorn
- Experiment Tracking: MLflow
- Monitoring: Evidently AI
- Deployment: Docker, Docker Compose
- UI: Streamlit
AqSolDB - Aqueous Solubility Database containing curated solubility data for organic compounds.
MIT License