A comprehensive framework for deep learning on relational databases, featuring the database research prototypes RAM, AIDA, LEVA, ARDA, and Qzero alongside a rich set of baseline implementations.
Deep Database helps organizations turn complex relational data into responsible, evidence-based decisions. Municipal teams can surface service gaps earlier, financial analysts can monitor systemic risk without sacrificing compliance, and scientists can fuse heterogeneous datasets to accelerate discovery. By shipping reproducible preprocessing, transparent model baselines, and retrieval-augmented reasoning, the project lowers the barrier for teams whose models must stand up to real-world scrutiny.
Deep Database provides a unified platform for applying deep learning and machine learning techniques to relational data analytics. A shared set of dataset abstractions, preprocessing utilities, and task definitions keeps experiments consistent across graph, tabular, and retrieval-augmented approaches.
- Multiple Modeling Paradigms: RAM, AIDA, LEVA, ARDA, and additional research prototypes.
- Consistent Data Abstractions: Dataset, table, and task classes with self-registration for new workloads.
- Graph & Tabular Builders: Convert relational data into heterogeneous graphs or flattened tables.
- Retrieval Integration: Document/token pipelines and BM25 retrieval for augmented modeling.
- Extensive Baselines: Classic ML (CatBoost, LightGBM), neural tabular models, and graph networks.
- Reproducible Evaluation: Scripts and configs for end-to-end benchmarking.
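To make the retrieval integration above more concrete, here is a minimal sketch of the row → text → BM25 idea in plain Python. It is not the project's pipeline: the project lists `bm25s` as its BM25 implementation, while this stand-in scores documents with a hand-written Okapi BM25 so that it runs without extra packages. The helper names `row_to_text`, `tokenize`, and `bm25_scores`, and the sample rows, are hypothetical.

```python
import math
from collections import Counter


def row_to_text(row):
    """Serialize one table row as 'column is value' clauses, skipping NULLs."""
    return "; ".join(f"{col} is {val}" for col, val in row.items() if val is not None)


def tokenize(text):
    """Lowercase the text and split on any non-alphanumeric character."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document against the query tokens (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical rows standing in for one relational table.
rows = [
    {"product": "espresso machine", "category": "kitchen", "price": 199},
    {"product": "running shoes", "category": "sports", "price": 89},
    {"product": "coffee grinder", "category": "kitchen", "price": 49},
]
docs = [tokenize(row_to_text(r)) for r in rows]
scores = bm25_scores(tokenize("kitchen coffee"), docs)
best = max(range(len(rows)), key=scores.__getitem__)
print(rows[best]["product"])  # coffee grinder
```

The third row wins because it matches both query terms, while the first matches only `kitchen` and the second matches neither.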
This project uses Conda for environment and dependency management, configured via environment.yml.
- Clone the repository
  git clone <your-repo-url>
  cd embedding_fusion
- Create the Conda environment
  conda env create -f environment.yml
  conda activate deepdb
- relbench — relational benchmark datasets
- torch, torch_geometric (PyG) — deep learning and GNN toolkits
- torch_frame — tabular modeling utilities for PyTorch
- bm25s — BM25 retrieval implementation
- pandas, numpy — data processing
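After activating the environment, a quick check like the following can confirm that the dependencies listed above are importable. This is only a sketch: the import names are assumed to match the package names in the list, and `missing_modules` is a hypothetical helper, not part of the repository.

```python
from importlib.util import find_spec

# Import names assumed from the dependency list above; adjust them if your
# environment.yml installs these packages under different names.
REQUIRED_MODULES = [
    "relbench",
    "torch",
    "torch_geometric",
    "torch_frame",
    "bm25s",
    "pandas",
    "numpy",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found on the import path."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All core dependencies are importable.")
```

`find_spec` only locates each module without importing it, so the check stays fast even for heavy packages such as `torch`.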
Deep Database is organized so that dataset handling, task logic, and modeling remain loosely coupled:
┌───────────────────────────────────────────────────────────────────────────┐
│                          DEEP DATABASE FRAMEWORK                          │
│                    Advanced Relational Data Analytics                     │
└───────────────────────────────────────────────────────────────────────────┘

                                    ┌──────────────┐
                                    │   Raw Data   │
                                    │ (Relational) │
                                    └──────┬───────┘
                                           │
                    ┌──────────────────────┴──────────────────────┐
                    │                                             │
           ┌────────▼─────────┐                          ┌────────▼─────────┐
           │ DATASET REGISTRY │                          │    TASK LAYER    │
           │   (utils/data)   │                          │   (utils/task)   │
           ├──────────────────┤                          ├──────────────────┤
           │ • Database       │◄────────────────────────►│ • Objectives     │
           │ • Dataset        │                          │ • Metrics        │
           │ • Table          │                          │ • Splits         │
           │ • table_data     │                          │ • Classification │
           │   adapters       │                          │ • Regression     │
           └────────┬─────────┘                          └──────────────────┘
                    │
        ┌───────────┴───────────────────┐
        │                               │
┌───────▼──────────┐         ┌──────────▼────────────┐
│    RELATIONAL    │         │       ADAPTIVE        │
│     BUILDERS     │         │     PREPROCESSING     │
│ (utils/builder)  │         │                       │
├──────────────────┤         ├───────────────────────┤
│ • Schema → Graph │         │ • utils/preprocess    │
│ • Homogeneous    │         │   (column analysis)   │
│ • Heterogeneous  │         │ • utils/document      │
│ • Graph Metadata │         │   (row → text)        │
└────────┬─────────┘         │ • utils/tokenize      │
         │                   │   (retrieval prep)    │
         │                   └──────────┬────────────┘
         │                              │
         └──────────┬───────────────────┘
                    │
         ┌──────────▼──────────────────────────────────────┐
         │                                                 │
         │             MODEL LIBRARY (model/)              │
         │                                                 │
         ├─────────────────────────────────────────────────┤
         │                                                 │
         │    ┌─────────────┐        ┌──────────────┐      │
         │    │   SHARED    │        │   TABULAR    │      │
         │    │             │        │              │      │
         │    │ • base      │        │ • MLP        │      │
         │    │ • utils     │        │ • ResNet     │      │
         │    └─────────────┘        │ • TabM       │      │
         │                           │ • FT-Trans   │      │
         │    ┌─────────────┐        │ • ARMNet     │      │
         │    │    GRAPH    │        └──────────────┘      │
         │    │   MODELS    │                              │
         │    │             │        ┌──────────────┐      │
         │    │ • GCN       │        │ CONTRASTIVE  │      │
         │    │ • HGT       │        │              │      │
         │    │ • GAT       │        │ • BGRL       │      │
         │    │ • SAGE      │        │ • DGI        │      │
         │    └─────────────┘        │ • GraphCL    │      │
         │                           └──────────────┘      │
         └──────────────────────┬──────────────────────────┘
                                │
                    ┌───────────┴────────────┐
                    │                        │
         ┌──────────▼──────────┐  ┌──────────▼──────────────┐
         │       COMMAND       │  │     ARTEFACT STORES     │
         │     ENTRYPOINTS     │  │         (data/)         │
         │                     │  │                         │
         │ • cmds/             │  │ • Generated Tables      │
         │   - Data Gen        │  │ • TensorFrames          │
         │   - Baselines       │  │ • Cached Features       │
         │   - Diagnostics     │  │ • Model Checkpoints     │
         │                     │  │                         │
         │ • exe/              │  │                         │
         │   - Shell Scripts   │  │                         │
         │   - End-to-End      │  │                         │
         └─────────────────────┘  └─────────────────────────┘

┌───────────────────────────────────────────────────────────────────────────┐
│                                EXTENSIONS                                 │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│   │   RAM    │  │   LEVA   │  │  tabICL  │  │  tabPFN  │  │  Qzero   │    │
│   ├──────────┤  ├──────────┤  ├──────────┤  ├──────────┤  ├──────────┤    │
│   │Retrieval │  │ Boosted  │  │  In-Ctx  │  │  Prior   │  │  Neural  │    │
│   │Augmented │  │Relational│  │ Learning │  │  Fitted  │  │   Arch   │    │
│   │ Modeling │  │ Learning │  │          │  │ Network  │  │  Search  │    │
│   └──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘    │
│                                                                           │
│                AIDA: Advanced Interface for Data Analytics                │
└───────────────────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────────────────┐
│                               QUALITY GATES                               │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│   ┌──────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   │
│   │  test/   │   │   scripts/   │   │   static/    │   │   webapp/    │   │
│   ├──────────┤   ├──────────────┤   ├──────────────┤   ├──────────────┤   │
│   │   Unit   │   │  Notebooks   │   │    Assets    │   │    Demos     │   │
│   │ Coverage │   │ Exploration  │   │   Figures    │   │ Interactive  │   │
│   └──────────┘   └──────────────┘   └──────────────┘   └──────────────┘   │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘

DATA FLOW
─────────
Raw Relational Data → Dataset Registry → Task Layer
                         ↓
        Relational Builders + Preprocessing
                         ↓
           Graph/Tabular Representations
                         ↓
                   Model Library
                         ↓
             Predictions & Evaluations
                         ↓
                  Artefact Stores

KEY FEATURES
────────────
• Multi-Paradigm: Graph, Tabular, Retrieval-Augmented
• Consistent Abstractions: Unified dataset/task interfaces
• Extensive Baselines: Classic ML + Neural Networks
• Reproducible: End-to-end scripts and configs
• Extensible: Plugin architecture for new methods
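
The "Schema → Graph" step in the data flow can be pictured as turning rows into nodes and foreign keys into typed edges. The sketch below illustrates that idea in plain Python, assuming tables are lists of dicts and foreign keys are declared as triples. `build_hetero_edges` and its input layout are hypothetical and do not reflect the actual utils/builder API.

```python
from collections import defaultdict


def build_hetero_edges(tables, primary_keys, foreign_keys):
    """Turn relational rows into per-table node ids and typed edge lists.

    tables:       {table_name: [row_dict, ...]}
    primary_keys: {table_name: pk_column}
    foreign_keys: [(src_table, fk_column, dst_table), ...]
    """
    # Map each primary-key value to a dense node index within its table.
    node_index = {
        name: {row[primary_keys[name]]: i for i, row in enumerate(rows)}
        for name, rows in tables.items()
    }
    edges = defaultdict(list)
    for src, fk_col, dst in foreign_keys:
        for i, row in enumerate(tables[src]):
            j = node_index[dst].get(row.get(fk_col))
            if j is not None:  # skip NULL or dangling references
                edges[(src, fk_col, dst)].append((i, j))
    return node_index, dict(edges)


# Hypothetical two-table schema: each order points at one customer.
tables = {
    "customer": [{"id": 10, "name": "Ada"}, {"id": 11, "name": "Lin"}],
    "order": [
        {"id": 1, "customer_id": 10, "amount": 30.0},
        {"id": 2, "customer_id": 11, "amount": 12.5},
        {"id": 3, "customer_id": 10, "amount": 7.0},
        {"id": 4, "customer_id": None, "amount": 1.0},
    ],
}
node_index, edges = build_hetero_edges(
    tables,
    primary_keys={"customer": "id", "order": "id"},
    foreign_keys=[("order", "customer_id", "customer")],
)
print(edges[("order", "customer_id", "customer")])  # [(0, 0), (1, 1), (2, 0)]
```

Each edge list is keyed by a (source, relation, destination) triple, the same keying that PyG's `HeteroData` uses for edge types, so the pairs could be transposed into a `[2, num_edges]` index tensor per edge type.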
- Dataset Registry (utils/data)
  - Defines abstractions: Database, Dataset, Table
  - table_data adapters materialize relational sources
  - Self-registration for new workloads
- Task Layer (utils/task)
  - Standardizes objectives, metrics, splits
  - Supports classification, regression, future plugins
  - Consistent experiment interface
- Relational Builders (utils/builder)
  - Schema → Graph transformations
  - Homogeneous & heterogeneous graph support
  - Reuses registry metadata
- Adaptive Preprocessing (utils/preprocess, utils/document, utils/tokenize)
  - Column type and role inspection
  - Row → text translation for retrieval
  - Language model interface preparation
- Model Library (model/)
  - Shared modules: base, utils
  - Graph: GCN, HGT, GAT, SAGE
  - Contrastive: BGRL, DGI, GraphCL
  - Tabular: MLP, ResNet, TabM, FT-Transformer, ARMNet
- Command Entrypoints (cmds/, exe/)
  - Python frontends for data generation, baselines, diagnostics
  - Shell scripts for end-to-end workflows
- Artefact Stores (data/)
  - Persists generated tables and tensorframes
  - Cached features and model checkpoints
- Extensions
  - RAM: Retrieval-Augmented Modeling
  - LEVA: Boosted Relational Learning
  - tabICL, tabPFN: Tabular Foundation Models
  - Qzero: Neural Architecture Search
  - AIDA: Advanced Interface for Data Analytics
- Quality Gates
  - test/: Unit coverage
  - scripts/: Notebooks for exploration
  - static/, webapp/: Demos and visualizations

Notebooks in scripts/ and assets in static/ and webapp/ support exploration and demos without altering the production workflow.
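
The self-registration mentioned for the Dataset Registry is commonly implemented with a class decorator that records each dataset under a string key. The sketch below shows that general pattern only. `DATASET_REGISTRY`, `register_dataset`, `get_dataset`, and the `Dataset` base class here are hypothetical stand-ins, not the classes defined in utils/data.

```python
DATASET_REGISTRY = {}


def register_dataset(name):
    """Class decorator that records a dataset class under `name`."""
    def decorator(cls):
        if name in DATASET_REGISTRY:
            raise ValueError(f"dataset {name!r} is already registered")
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator


class Dataset:
    """Hypothetical base class standing in for the real dataset abstraction."""

    def load(self):
        raise NotImplementedError


@register_dataset("toy-shop")
class ToyShopDataset(Dataset):
    """A new workload registers itself simply by being defined."""

    def load(self):
        return {"customer": [{"id": 1}], "order": [{"id": 1, "customer_id": 1}]}


def get_dataset(name):
    """Instantiate a registered dataset by name."""
    if name not in DATASET_REGISTRY:
        raise KeyError(f"unknown dataset {name!r}; known: {sorted(DATASET_REGISTRY)}")
    return DATASET_REGISTRY[name]()


print(sorted(get_dataset("toy-shop").load()))  # ['customer', 'order']
```

With this pattern, experiment scripts can select a workload by name from a config file without importing its class directly, which is one way to keep dataset handling loosely coupled from modeling code.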
# Generate baseline tabular data
bash exe/generate_table_data.sh
# Launch neural tabular baselines (MLP, ResNet, FT-Transformer)
python cmds/dnn_baseline_table_data.py --help
# Launch classic ML baselines (CatBoost, LightGBM)
python cmds/ml_baseline.py --help

If you use this codebase in your research, please cite:
@article{deepdatabase2025,
title={Deep Database: Advanced Relational Data Analytics},
author={Your Team},
year={2025}
}

This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For questions and feedback, please open an issue in the repository.
