Chat with your SQL database and make complex queries with natural language using LLMs.
The system enables users to interact with databases using natural language. It understands questions, generates queries, interprets results, and creates meaningful visualizations.
It was built with LangChain & LangGraph (LLM interaction and workflow), Chainlit (frontend), BigQuery (database), and Google Gemini (LLM).
The code is orchestrated between LLM functions and the Chainlit frontend following this architecture:
├── src/
│ ├── app.py # Chainlit entry point
│ ├── workflow.py # Core AI workflow functions
│ ├── orchestrator.py # Workflow orchestration (chainlit <-> workflow) logic
│ ├── database.py # BigQuery connection interface
│ └── system_prompts.py # System prompt used to generate answers
├── public/ # Static assets and UI resources
├── tests/ # Unit tests
└── requirements.txt # Dependencies
The system follows a simple 5-step workflow:
- Query Generation: Converts natural language to SQL
- Query Execution: Runs SQL on the BigQuery database
- Answer Generation: Interprets results in natural language
- Query Explanation: Explains the SQL logic and approach
- Visualization: Creates relevant Matplotlib visualizations
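The five steps above can be sketched as a simple linear pipeline. The function names below are illustrative only — `llm` and `execute_sql` stand in for the real Gemini and BigQuery calls, and this is not the project's actual API:

```python
# Illustrative sketch of the 5-step workflow; `llm` and `execute_sql`
# are placeholders for the real Gemini and BigQuery interfaces.
def run_workflow(question, llm, execute_sql):
    state = {"question": question}
    state["sql"] = llm(f"Write a SQL query for: {question}")         # 1. Query Generation
    state["rows"] = execute_sql(state["sql"])                        # 2. Query Execution
    state["answer"] = llm(f"Interpret these rows: {state['rows']}")  # 3. Answer Generation
    state["explanation"] = llm(f"Explain this SQL: {state['sql']}")  # 4. Query Explanation
    # 5. Visualization: only produced when the query returned data (stub here)
    state["figure"] = "matplotlib figure" if state["rows"] else None
    return state
```

Each step reads from and writes to a shared state dict, which is roughly how a LangGraph-style workflow passes context between nodes.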
- Python 3.10+
- GCP account
- BigQuery dataset with the credit risk table uploaded
- Google AI API key
The following environment variables are required. Edit your `.env` file and make sure the database is set up on BigQuery:

```
GOOGLE_API_KEY=your_gemini_api_key
GCP_KEY_PATH=./gcpkey.json
DATABASE_NAME=creditRisk
```
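As a quick sanity check before launching the app, you can verify these variables are set. This helper is an illustrative addition, not part of the repository:

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ["GOOGLE_API_KEY", "GCP_KEY_PATH", "DATABASE_NAME"]

def check_env():
    """Raise if any required environment variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
```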
1. Clone repository

```shell
git clone https://github.com/fabiopapais/chat-your-data.git
cd chat-your-data
```

2. Install dependencies (a virtual environment is recommended)

```shell
pip install -r requirements.txt
```

3. Configure environment

```shell
cp .env.example .env  # Edit .env with your values
```

4. Add GCP credentials: place your service account key (`gcpkey.json`) in the root directory.

5. Run application

```shell
chainlit run src/app.py
```
There is a simple test file for the main workflow pipeline:

```shell
python test_workflow.py
```

- Scalability - SQL-based queries and multi-table support for complex datasets
- Context Management - Automatically handles large datasets and query results without token overflow
- Automated Visualizations - Generates meaningful charts when sufficient data is available
- Handles no-result queries - Manages cases where queries return no results and prevents errors
- Zero Trust for LLMs - The system is designed with a zero trust principle, ensuring read-only queries and output sanitization (by default)
- Standard Workflow - Follows similar approaches to other validated text-to-SQL systems (https://arxiv.org/pdf/2410.01066, https://arxiv.org/abs/2406.08426, etc)
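The read-only guarantee mentioned above can be enforced with a simple statement filter before anything reaches BigQuery. The sketch below is an assumed approach for illustration, not necessarily how the project implements it:

```python
import re

# Illustrative read-only guard: reject anything that isn't a single
# SELECT (or WITH ... SELECT) statement. Not the project's actual code.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|MERGE|TRUNCATE|GRANT)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:  # multiple statements smuggled into one query
        return False
    if FORBIDDEN.search(stmt):
        return False
    return stmt.upper().startswith(("SELECT", "WITH"))
```

A denylist like this is a minimal defense; in production it would be combined with database-level permissions (e.g. a BigQuery service account granted only read access), which the zero-trust principle above implies.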
- Question pre-analysis to prevent prompt injection and out-of-context queries
- RAG integration for more complex queries
- Automatic schema description
- Improve the workflow with retries and better error handling. This was attempted but could be improved with an agentic approach (an agent with access to tools, as opposed to the current static workflow of LLM requests).
- More accurate visualization generation. Even with extensive context, the LLMs still make wrong assumptions about the data, fabricate values, and generate incorrect visualizations. Better prompt engineering or specialist models may improve this.
- Test dedicated models such as https://huggingface.co/defog/llama-3-sqlcoder-8b
- Better integration with Chainlit. LangGraph streams do not work well with Chainlit, so the graph's full potential cannot be used; a custom orchestrator was necessary to handle communication between the workflow and Chainlit.

