Chat with your SQL database and make complex queries with natural language using LLMs.
The system enables users to interact with databases using natural language. It understands questions, generates queries, interprets results, and creates meaningful visualizations.
It was built with LangChain & LangGraph (LLM interaction and workflow), Chainlit (frontend), BigQuery (database), and Google Gemini (LLM).
The code is orchestrated between LLM functions and the Chainlit frontend following this architecture:
├── src/
│ ├── app.py # Chainlit entry point
│ ├── workflow.py # Core AI workflow functions
│ ├── orchestrator.py # Workflow orchestration (chainlit <-> workflow) logic
│ ├── database.py # BigQuery connection interface
│ └── system_prompts.py # System prompt used to generate answers
├── public/ # Static assets and UI resources
├── tests/ # Unit tests
└── requirements.txt # Dependencies
The system follows a simple 5-step workflow:
- Query Generation: Converts natural language to SQL
- Query Execution: Runs SQL on the BigQuery database
- Answer Generation: Interprets results in natural language
- Query Explanation: Explains the SQL logic and approach
- Visualization: Creates relevant Matplotlib visualizations
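The five steps above can be sketched as a simple linear pipeline. The function names below are illustrative only — `llm` and `execute_sql` stand in for the real Gemini and BigQuery calls, and this is not the project's actual API:

```python
# Illustrative sketch of the 5-step workflow; `llm` and `execute_sql`
# are placeholders for the real Gemini and BigQuery interfaces.
def run_workflow(question, llm, execute_sql):
    state = {"question": question}
    state["sql"] = llm(f"Write a SQL query for: {question}")         # 1. Query Generation
    state["rows"] = execute_sql(state["sql"])                        # 2. Query Execution
    state["answer"] = llm(f"Interpret these rows: {state['rows']}")  # 3. Answer Generation
    state["explanation"] = llm(f"Explain this SQL: {state['sql']}")  # 4. Query Explanation
    # 5. Visualization: only produced when the query returned data (stub here)
    state["figure"] = "matplotlib figure" if state["rows"] else None
    return state
```

Each step reads from and writes to a shared state dict, which is roughly how a LangGraph-style workflow passes context between nodes.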
- Python 3.10+
- GCP account
- BigQuery dataset with the credit risk table uploaded
- Google AI API key
The following environment variables are required. Edit your `.env` file and make sure the database is set up on BigQuery:

```
GOOGLE_API_KEY=your_gemini_api_key
GCP_KEY_PATH=./gcpkey.json
DATABASE_NAME=creditRisk
```
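As a quick sanity check before launching the app, you can verify these variables are set. This helper is an illustrative addition, not part of the repository:

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ["GOOGLE_API_KEY", "GCP_KEY_PATH", "DATABASE_NAME"]

def check_env():
    """Raise if any required environment variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
```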
1. Clone repository

```shell
git clone https://github.com/fabiopapais/chat-your-data.git
cd chat-your-data
```

2. Install dependencies (a virtual environment is recommended)

```shell
pip install -r requirements.txt
```

3. Configure environment

```shell
cp .env.example .env  # Edit .env with your values
```

4. Add GCP credentials: place your service account key (`gcpkey.json`) in the root directory.

5. Run application

```shell
chainlit run src/app.py
```
There is a simple test file for the main workflow pipeline:

```shell
python test_workflow.py
```

- Scalability - SQL-based queries and multi-table support for complex datasets
- Context Management - Automatically handles large datasets and query results without token overflow
- Automated Visualizations - Generates meaningful charts when sufficient data is available
- Handles no-result queries - Manages cases where queries return no results and prevents errors
- Zero Trust for LLMs - The system is designed with a zero trust principle, ensuring read-only queries and output sanitization (by default)
- Standard Workflow - Follows similar approaches to other validated text-to-SQL systems (https://arxiv.org/pdf/2410.01066, https://arxiv.org/abs/2406.08426, etc)
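The read-only guarantee mentioned above can be enforced with a simple statement filter before anything reaches BigQuery. The sketch below is an assumed approach for illustration, not necessarily how the project implements it:

```python
import re

# Illustrative read-only guard: reject anything that isn't a single
# SELECT (or WITH ... SELECT) statement. Not the project's actual code.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|MERGE|TRUNCATE|GRANT)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:  # multiple statements smuggled into one query
        return False
    if FORBIDDEN.search(stmt):
        return False
    return stmt.upper().startswith(("SELECT", "WITH"))
```

A denylist like this is a minimal defense; in production it would be combined with database-level permissions (e.g. a BigQuery service account granted only read access), which the zero-trust principle above implies.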
- Question pre-analysis to prevent prompt injection and out-of-context queries
- RAG integration for more complex queries
- Automatic schema description
- Improve the workflow with retries and better error handling. This was attempted but could be improved with an agentic approach (an agent with access to tools, as opposed to the current static workflow of LLM requests).
- More accurate visualization generation. Even with extensive context, the LLMs still make wrong assumptions about the data, fabricate values, and generate incorrect visualizations. Better prompt engineering or specialist models may improve this.
- Test dedicated models such as https://huggingface.co/defog/llama-3-sqlcoder-8b
- Better integration with Chainlit. LangGraph streams do not work well with Chainlit, so the graph's full potential cannot be used; a custom orchestrator was necessary to handle communication between the workflow and Chainlit.

