This repository provides a tool for querying MongoDB databases using a Large Language Model (LLM) served through a FastAPI application. It's designed to facilitate easy interaction with your MongoDB data through a simple API.
Before you begin, ensure you have met the following requirements:
- Python Version: The codebase is tested with Python 3.11.9. Other versions might work but are not guaranteed. It is highly recommended to use Python 3.11.9 to avoid any compatibility issues.
Follow these steps to get your development environment set up:
- Clone the repository:
git clone https://github.com/qolaba/QueryGen.git
- Install the required dependencies:
pip install -r requirements.txt
To start the application, run the following command in your terminal:
python main.py
After running the application, you can access the FastAPI documentation by navigating to http://localhost:9000/docs in your web browser. This documentation page will provide you with all the available endpoints and their respective details.
The FastAPI server provides an interactive API documentation (Swagger UI) that lets you test the API directly from your browser. You can access the API documentation by visiting the /docs endpoint after starting the server.
When interacting with the FastAPI endpoint, you are required to provide several parameters. Here's a detailed description of each:
- connection_url: The URL used to connect to your MongoDB instance. This should include the username, password, and the cluster address. Make sure the user in the connection URL has read-only access to the associated database.
- collection_list: An array of strings naming the collections you want to query within the specified database. For each collection in the list, the DBReader object reads the unique keys and their unique example values and uses them to build the system prompt. Make sure the list contains every collection your query needs. To keep the system prompt from growing too large, the DBReader only includes keys that appear frequently across documents, and it omits example values for keys that hold object IDs or dates.
- database_name: The name of the database where the collections reside.
- description: A detailed description of guidelines or notes relevant to constructing your queries or understanding the database schema. If your database follows specific logic, describe it here.
- query: A natural-language description or question for which the LLM should generate a MongoDB query. After generating the query, the tool executes it through pymongo and returns the final answer.
- example_count: The number of unique example values added to the system prompt for each key.
- max_output_count: Limits the number of query results returned. Example: 30. A generated query could return thousands of documents, and the fetched data is passed to the LLM to produce a human-readable answer. If too much data is passed, the LLM may be unable to answer because of its token limit. To avoid this, only the first max_output_count results are fetched.
- database_type: The type of database being queried. Currently, this repo supports only MongoDB.
- llm_name: The name or version of the LLM used for generating queries. Example: "gpt-4-turbo-2024-04-09". Currently, this repo supports the MistralAI Codestral model and the OpenAI GPT-4 model.
- temperature: Controls the randomness of the LLM's output. A temperature close to 0 makes the output more deterministic and repetitive, while higher values make it more diverse and random. Example: 0.
Each of these parameters plays a crucial role in how the API functions and serves the user's requests. Ensure that these parameters are correctly specified to achieve the desired outcomes from your API.
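As a rough illustration of how the parameters above fit together, the following sketch builds a request body and a POST request without sending it. The endpoint path, the `x-api-key` header name, the collection and database names, and the exact `database_type` string are assumptions made for this example; check the /docs page for the actual route and schema.

```python
import json
import urllib.request

# Assumption: this path is hypothetical; see http://localhost:9000/docs for the real route.
ENDPOINT = "http://localhost:9000/query"

payload = {
    "connection_url": "mongodb+srv://<user>:<password>@<cluster-address>/",
    "collection_list": ["orders", "customers"],  # example collection names
    "database_name": "shop",  # example database name
    "description": "Amounts in the orders collection are stored in cents.",
    "query": "How many orders were placed last week?",
    "example_count": 3,
    "max_output_count": 30,
    "database_type": "MongoDB",  # assumption: exact accepted string may differ
    "llm_name": "gpt-4-turbo-2024-04-09",
    "temperature": 0,
}


def build_request(url, body, api_key):
    """Build (but do not send) a JSON POST request carrying the payload."""
    data = json.dumps(body).encode("utf-8")
    # Assumption: the header used for API_KEY is hypothetical.
    headers = {"Content-Type": "application/json", "x-api-key": api_key}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


request = build_request(ENDPOINT, payload, api_key="<your API_KEY>")
# To actually send it: urllib.request.urlopen(request).read()
```

Keeping temperature at 0 and max_output_count small is a reasonable starting point while you verify that the generated queries match your schema.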
The operation of this repository revolves around a structured process involving data preparation, query generation, and result processing using a Large Language Model (LLM). Here is a detailed breakdown of the workflow:
- Data Preparation with DBReader:
  - The DBReader object initializes by fetching all unique keys, their associated values, and data types from the collections specified in the collection_list.
  - It then arranges this data into a structured prompt, listing all keys associated with each collection, along with up to example_count example values and data types. This organization is performed for every collection in the list.
- Query Generation by LLM:
  - The prepared data, along with the user's query, is passed to the LLM.
  - The LLM processes this information to generate a MongoDB query and specifies the collection on which this query should be executed.
- Condition Checking:
  - The system checks the generated query against four conditions:
    - Intermediate Query Check: Determines if the query is an intermediate step, i.e., if it cannot fetch all required data in a single step for the user's query.
    - Operation Check: Identifies if the query includes delete, modify, or insert operations.
    - Syntax Check: Validates the syntactical correctness of the query.
    - Raw Result Check: Ensures the query does not return raw results, like object IDs, directly to the user.
  - If any condition is true, the LLM is prompted to rewrite the query, providing reasons based on the condition that failed.
- Data Fetching:
  - Once all the conditions are satisfied, the generated query is executed through a designated function that fetches the required data from the database.
- Result Fetching and Error Handling:
  - If an error occurs during the data fetching step, the old query and the error message are passed back to the LLM, which is asked to rewrite the query to correct the error.
- Final Response Crafting:
  - Once the query successfully runs without errors, the output is passed back to the LLM, which then crafts the final response in a human-understandable format.
- Iteration and Termination:
  - The process includes a maximum iteration condition to prevent infinite loops. If the number of iterations exceeds the max_iteration limit, the process is stopped.
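As a rough illustration of the Data Preparation step in the workflow above, the sketch below summarizes keys, types, and a few unique example values from in-memory documents. It is not the repository's DBReader: the function names, the `min_key_ratio` threshold for "frequently occurring" keys, and the prompt layout are all hypothetical, and plain dictionaries stand in for documents read through pymongo.

```python
from collections import Counter
from datetime import date, datetime


def summarize_collection(documents, example_count=3, min_key_ratio=0.5):
    """Summarize keys, types, and a few unique example values for one collection.

    min_key_ratio is a hypothetical threshold for keeping only frequent keys.
    """
    key_counts = Counter(key for doc in documents for key in doc)
    summary = {}
    for key, count in key_counts.items():
        if count / len(documents) < min_key_ratio:
            continue  # skip keys that appear in only a few documents
        values = [doc[key] for doc in documents if key in doc]
        type_name = type(values[0]).__name__
        examples = []
        # Skip examples for ObjectId and date values to save prompt tokens.
        is_date = isinstance(values[0], (date, datetime))
        if type_name != "ObjectId" and not is_date:
            for value in values:
                if len(examples) >= example_count:
                    break
                if value not in examples:
                    examples.append(value)
        summary[key] = {"type": type_name, "examples": examples}
    return summary


def build_schema_prompt(collections, example_count=3):
    """Render one text block listing each collection's keys, types, and examples."""
    lines = []
    for name, documents in collections.items():
        lines.append(f"Collection: {name}")
        for key, info in summarize_collection(documents, example_count).items():
            lines.append(f"  {key} ({info['type']}), examples: {info['examples']}")
    return "\n".join(lines)


sample_docs = [
    {"order_no": 1, "status": "paid", "created": datetime(2024, 5, 1)},
    {"order_no": 2, "status": "open", "created": datetime(2024, 5, 2), "note": "rare"},
    {"order_no": 3, "status": "paid", "created": datetime(2024, 5, 3)},
]
summary = summarize_collection(sample_docs, example_count=2)
prompt = build_schema_prompt({"orders": sample_docs}, example_count=2)
```

In this sample, the rarely used `note` key is dropped and the `created` key is listed with its type but without example values.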
This workflow ensures a robust interaction between the MongoDB database and the LLM, facilitating accurate and secure data handling and query generation.
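The control flow of the workflow (generate, check, execute, retry on failure, stop at max_iteration) might look roughly like the sketch below. It assumes the generated query is an aggregation pipeline and shows only the operation check out of the four conditions. `run_workflow`, `has_write_stage`, and the three callables are hypothetical stand-ins for the LLM and database calls, not the repository's actual functions.

```python
# Aggregation stages that write to the database ($out and $merge).
WRITE_STAGES = {"$out", "$merge"}


def has_write_stage(pipeline):
    """Return True if an aggregation pipeline contains a stage that writes data."""
    return any(op in WRITE_STAGES for stage in pipeline for op in stage)


def run_workflow(generate_query, execute_query, craft_answer, max_iteration=5):
    """Ask for a query, check it, run it, and retry with feedback on failure."""
    feedback = None
    for _ in range(max_iteration):
        collection, pipeline = generate_query(feedback)
        if has_write_stage(pipeline):
            feedback = "The query must not modify data; rewrite it as read-only."
            continue
        try:
            rows = execute_query(collection, pipeline)
        except Exception as error:  # pass the error text back for a rewrite
            feedback = f"Query {pipeline!r} failed with: {error}"
            continue
        return craft_answer(rows)
    return None  # iteration limit reached without a usable query


# Stubs standing in for the LLM and the database, for illustration only.
attempts = []


def fake_generate(feedback):
    attempts.append(feedback)
    if feedback is None:
        return "orders", [{"$match": {}}, {"$out": "orders_copy"}]
    return "orders", [{"$match": {"status": "paid"}}]


def fake_execute(collection, pipeline):
    return [{"status": "paid"}]


def fake_answer(rows):
    return f"{len(rows)} matching document(s)"


answer = run_workflow(fake_generate, fake_execute, fake_answer)
```

Here the first generated pipeline is rejected because it contains `$out`, and the second attempt runs and produces the answer.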
To ensure the effective operation of the Large Language Models (LLMs) and the associated API, it is essential to configure several environment variables. These variables facilitate authentication with external services and simplify the configuration of the application. The required and optional variables are listed below:
- MISTRALAI_API_KEY: Needed to authenticate with MistralAI services.
- OPENAI_API_KEY: Required for authentication with OpenAI services.
- API_KEY: Used to authenticate API requests within your FastAPI application.
In addition to the essential keys for API authentication, you can also set the following parameters as environment variables to streamline your workflow and avoid the need to specify these parameters manually each time you run the application:
- CONNECTION_URL: The MongoDB connection URL.
- COLLECTION_LIST: A list of database collections to be queried.
- DATABASE_NAME: The name of the MongoDB database.
- QUERY: A default query to run when testing or developing.
You can set these environment variables through your operating system's environment settings. Alternatively, for ease of development, you can use a .env file placed in the root directory of your project. This file can be loaded using a library such as python-dotenv, which simplifies the management of configuration settings.
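A minimal sketch of reading these variables in Python follows. The `load_settings` helper is hypothetical, and treating COLLECTION_LIST as a comma-separated string is an assumption; the repository may expect a different format.

```python
import os

# With python-dotenv installed, a .env file can be loaded into os.environ first:
#     from dotenv import load_dotenv
#     load_dotenv()


def load_settings(env=os.environ):
    """Read the variables described above from a mapping (os.environ by default)."""
    # Assumption: COLLECTION_LIST is stored as a comma-separated string.
    raw_collections = env.get("COLLECTION_LIST", "")
    collections = [name.strip() for name in raw_collections.split(",") if name.strip()]
    return {
        "connection_url": env.get("CONNECTION_URL"),
        "collection_list": collections,
        "database_name": env.get("DATABASE_NAME"),
        "query": env.get("QUERY"),
        # Report only whether the keys are present, never their values.
        "has_openai_key": bool(env.get("OPENAI_API_KEY")),
        "has_mistralai_key": bool(env.get("MISTRALAI_API_KEY")),
        "has_api_key": bool(env.get("API_KEY")),
    }


example = load_settings({"COLLECTION_LIST": "orders, customers", "DATABASE_NAME": "shop"})
```

Passing a plain dictionary instead of `os.environ`, as in the last line, makes the helper easy to test without touching the real environment.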