This application allows you to ask questions about the content of a PDF document and receive answers generated by a Large Language Model (LLM). It leverages semantic similarity to find relevant sections within the PDF before feeding them to the LLM, resulting in more accurate and context-aware responses.
The application follows these steps:
- PDF Reading: The application takes a PDF file as input and extracts its textual content.
- Text Chunking: The extracted text is split into smaller, manageable chunks. This is crucial for efficient processing by the LLM and for focusing on relevant information.
- Embedding Generation: Using Hugging Face Sentence Transformers, each text chunk is converted into a vector embedding. These embeddings capture the semantic meaning of the text.
- Question Embedding: When you ask a question, it is also converted into a vector embedding using the same Sentence Transformer model.
- Semantic Similarity Search: The application calculates the semantic similarity between the question's embedding and the embeddings of all the text chunks, identifying the chunks most relevant to your question.
- Contextual Information for LLM: The most semantically similar text chunks are retrieved and provided as context to the LLM.
- Answer Generation: The LLM uses the provided context to generate an answer to your question.
- User Interface: Streamlit provides an interactive graphical user interface for uploading PDFs and asking questions.
- LLM Integration: LangChain is used to integrate with the LLM.
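The chunk/embed/retrieve loop above can be sketched in a few lines. This is only an illustration: the function names, chunk sizes, and the toy bag-of-words "embeddings" are invented here for portability; the real application uses Hugging Face Sentence Transformer vectors for both chunks and questions, with the same cosine-similarity scoring.

```python
import math
from collections import Counter

def chunk_text(text, chunk_size=60, overlap=10):
    # Split extracted PDF text into overlapping chunks (sizes are illustrative).
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(text):
    # Toy bag-of-words vector; the real app uses Sentence Transformers here.
    tokens = [t.strip(".,?!") for t in text.lower().split()]
    return Counter(t for t in tokens if t)

def cosine_similarity(a, b):
    # The same scoring the app applies to real embedding vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_chunks(question, chunks, k=2):
    # Rank chunks by similarity to the question embedding; the top k
    # become the context passed to the LLM.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)),
                    reverse=True)
    return ranked[:k]

doc = ("Streamlit builds the user interface. "
       "Sentence Transformers turn text into vectors. "
       "LangChain wires the retrieved context into the LLM prompt.")
chunks = chunk_text(doc)
print(top_chunks("Which library creates the vectors?", chunks, k=1))
```

With real embeddings the ranking is driven by meaning rather than shared words, which is why semantically related chunks surface even without exact keyword overlap.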
You can try the deployed application here: https://pdfquestion-euvx4eayptjuxzvrht6wn5.streamlit.app/
The source code for this application can be found on GitHub: https://github.com/Samtoosoon/Pdfquestion/tree/main
To run this application locally, follow these steps:
- Clone the repository:

  git clone https://github.com/Samtoosoon/Pdfquestion.git
  cd Pdfquestion
- Install the required Python packages:

  pip install -r requirements.txt
- Set up environment variables:
  - Create a file named `.env` in the root directory of the repository.
  - Add your Hugging Face API key to the `.env` file. You can obtain an API key from the Hugging Face website:

    HUGGINGFACE_API_KEY=YOUR_HUGGINGFACE_API_KEY

    (Replace `YOUR_HUGGINGFACE_API_KEY` with your actual API key.)
- Navigate to the repository directory in your terminal.
- Run the Streamlit application:

  streamlit run app.py

- Open your web browser to the address displayed in the terminal (usually `http://localhost:8501`).
- Follow the on-screen instructions:
  - Upload a PDF file using the file uploader.
  - Enter your question in the text input field.
  - Click the "Ask" button to get your answer.
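The key placed in `.env` must be readable by the application at startup. The app most likely uses the `python-dotenv` package's `load_dotenv()` for this; the sketch below shows a standard-library-only equivalent (the file contents and the `load_env` helper are illustrative, not the app's actual code):

```python
import os
import tempfile

def load_env(path):
    # Minimal .env parser: KEY=VALUE lines, '#' comments ignored.
    # The real app presumably uses python-dotenv's load_dotenv() instead.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Demonstrate with a throwaway file; the real app reads ./.env instead.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("HUGGINGFACE_API_KEY=YOUR_HUGGINGFACE_API_KEY\n")
    env_path = f.name

load_env(env_path)
print(os.environ["HUGGINGFACE_API_KEY"])
os.remove(env_path)
```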
- `app.py`: The main Streamlit application file that handles the user interface, PDF processing, embedding generation, similarity search, and interaction with the LLM.
- `requirements.txt`: Lists all the Python dependencies required to run the application.
- `.env`: Stores sensitive information such as your Hugging Face API key. Ensure this file is not committed to your version control system (e.g., add it to your `.gitignore` file).
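The "interaction with the LLM" step in `app.py` amounts to assembling the retrieved chunks and the user's question into a single prompt. A hedged pure-Python sketch of that assembly (the real app routes this through LangChain's prompt and chain abstractions, and the template wording here is invented for illustration):

```python
def build_prompt(question, context_chunks):
    # Join the retrieved chunks into one context block and append the
    # question; the exact template wording is illustrative only.
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "Which library creates the vectors?",
    ["Sentence Transformers turn text into vectors."],
)
print(prompt)
```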
