A local chatbot application that runs HuggingFace LLM models on your machine. It includes a native macOS app, a Streamlit web UI, and a command-line interface.
- Load any HuggingFace model - Paste a model name and run it locally
- Streaming responses - See responses as they're generated
- Thinking/reasoning support - Models with `<think>` tags display reasoning in a collapsible section
- RTL language support - Automatic right-to-left text detection for Hebrew, Arabic, etc. (see the sketch after this list)
- Conversation history - Maintains full context across messages
- Markdown rendering - Responses render with proper formatting
- Quantization support - 4-bit and 8-bit quantization for large models (CUDA only)
- Export chat - Download conversation history as markdown
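RTL detection can be as simple as checking the bidirectional class of the first strongly directional character. Below is a minimal sketch of that idea; it is illustrative only and not the project's actual RTL utility (which lives in the macOS app's Utilities):

```python
import unicodedata

def is_rtl(text: str) -> bool:
    """Guess text direction from the first strongly directional character.

    Illustrative sketch only, not the project's actual RTL detection code.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # Hebrew, Arabic, and other RTL letters
            return True
        if bidi == "L":           # Latin and other LTR scripts
            return False
    return False                  # no strongly directional characters found

print(is_rtl("שלום, מה שלומך?"))   # True
print(is_rtl("Hello there"))       # False
```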
```bash
# Clone or navigate to the project directory
cd Local_ChatBot

# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate    # Linux/macOS
# or
venv\Scripts\activate       # Windows

# Install dependencies
pip install -r requirements.txt
```

The native macOS app requires running the API server first:
```bash
# Install server dependencies
pip install -r requirements-server.txt

# Start the API server
python run_server.py
```

Then open the Xcode project and run:
```bash
open LocalChatBotApp/LocalChatBotApp.xcodeproj
```

Or build and run from the command line:
```bash
cd LocalChatBotApp
xcodebuild -scheme LocalChatBotApp -configuration Debug build
```

The macOS app connects to http://127.0.0.1:8000 by default.
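Once the server is running, you can confirm it is reachable before launching the app. FastAPI publishes its schema at /openapi.json by default, so a quick check might look like this (the project's own chat and model routes may use different paths):

```python
import requests

# FastAPI serves its OpenAPI schema at /openapi.json by default.
resp = requests.get("http://127.0.0.1:8000/openapi.json", timeout=5)
resp.raise_for_status()

print("Server is up. Registered endpoints:")
for path in sorted(resp.json().get("paths", {})):
    print(" ", path)
```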
To run the Streamlit web UI instead:

```bash
streamlit run app.py
```

Then open your browser to http://localhost:8501.
- Enter a HuggingFace model name in the sidebar (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
- Click Load Model and wait for the download
- Start chatting!
To enable debug logging in the web UI:

```bash
CHATBOT_DEBUG=true streamlit run app.py
```

The command-line interface can send a single prompt, run an interactive chat, set a system prompt, and load quantized models:

```bash
# Send a single prompt
python cli.py -m microsoft/DialoGPT-medium -p "Hello, how are you?"

# Interactive chat mode
python cli.py -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 -i

# Interactive mode with a system prompt
python cli.py -m MODEL -i -s "You are a helpful coding assistant."

# Interactive mode with 4-bit quantization
python cli.py -m dicta-il/DictaLM-3.0-24B-Thinking -i --4bit
```

| Option | Description |
|---|---|
| `-m, --model` | HuggingFace model name (required) |
| `-p, --prompt` | Single prompt to send |
| `-i, --interactive` | Run in interactive chat mode |
| `-s, --system-prompt` | System prompt to set behavior |
| `--max-tokens` | Maximum tokens to generate (default: 512) |
| `--temperature` | Sampling temperature (default: 0.7) |
| `--top-p` | Top-p sampling threshold (default: 0.9) |
| `--device` | Device: `cuda`, `mps`, or `cpu` |
| `--4bit` | Load in 4-bit quantization |
| `--8bit` | Load in 8-bit quantization |
| `--debug` | Enable debug logging |
| `--show-thinking` | Show model's thinking process |
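Under the hood, these options map onto standard transformers loading and generation parameters. Below is a rough, self-contained sketch of the equivalent Python; it is not chatbot.py itself, and the `--4bit`/`--8bit` paths additionally require bitsandbytes on CUDA:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"        # -m / --model
device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available()
          else "cpu")                                     # --device

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
).to(device)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},  # -s
    {"role": "user", "content": "Hello, how are you?"},                    # -p
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

output = model.generate(
    input_ids,
    max_new_tokens=512,   # --max-tokens
    do_sample=True,
    temperature=0.7,      # --temperature
    top_p=0.9,            # --top-p
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```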
In interactive mode, the following commands are available:

| Command | Description |
|---|---|
| `/quit` | Exit the chat |
| `/clear` | Clear conversation history |
| `/history` | Show conversation history |
| `/think` | Toggle thinking display |
The project is laid out as follows:

```
Local_ChatBot/
├── app.py # Streamlit web UI
├── cli.py # Command-line interface
├── chatbot.py # Core ChatBot class
├── run_server.py # FastAPI server entry point
├── requirements.txt # Base dependencies
├── requirements-server.txt # Server dependencies (includes FastAPI)
├── server/ # FastAPI backend
│ ├── main.py # FastAPI app
│ ├── dependencies.py # Singleton ChatBot
│ ├── schemas.py # Pydantic models
│ └── routes/
│ ├── model.py # Model management endpoints
│ ├── chat.py # Chat endpoints
│ └── websocket.py # Streaming WebSocket
├── LocalChatBotApp/ # Native macOS SwiftUI app
│ ├── LocalChatBotApp.xcodeproj
│ └── LocalChatBotApp/
│ ├── Models/ # Data models
│ ├── ViewModels/ # State management
│ ├── Views/ # SwiftUI views
│ ├── Services/ # API & WebSocket clients
│ └── Utilities/ # RTL detection, etc.
└── README.md
```
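server/dependencies.py keeps one ChatBot instance alive for the whole server so the model is loaded only once. Here is a hedged sketch of that singleton-dependency pattern in FastAPI; the class and route names are illustrative, not the project's actual code:

```python
from functools import lru_cache
from typing import Optional

from fastapi import Depends, FastAPI

class ChatBot:
    """Stand-in for chatbot.ChatBot; the real class loads and runs the model."""
    def __init__(self) -> None:
        self.model_name: Optional[str] = None

    def generate(self, prompt: str) -> str:
        return f"(reply from {self.model_name or 'no model loaded'}) {prompt}"

@lru_cache(maxsize=1)
def get_chatbot() -> ChatBot:
    # lru_cache turns this factory into a process-wide singleton.
    return ChatBot()

app = FastAPI()

@app.post("/chat")
def chat(prompt: str, bot: ChatBot = Depends(get_chatbot)) -> dict:
    # Every request shares the same ChatBot (and thus the same loaded model).
    return {"response": bot.generate(prompt)}
```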
Any HuggingFace causal language model should work. Some tested examples:
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Small, fast model
- `microsoft/DialoGPT-medium` - Conversational model
- `dicta-il/DictaLM-3.0-24B-Thinking` - Hebrew model with reasoning
- `meta-llama/Llama-2-7b-chat-hf` - Llama 2 (requires access)
Supported devices:

- CUDA - NVIDIA GPUs with full quantization support (see the sketch after this list)
- MPS - Apple Silicon (M1/M2/M3)
- CPU - Any system (slower)
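On CUDA, 4-bit and 8-bit loading typically goes through bitsandbytes via transformers' BitsAndBytesConfig. A hedged sketch of what the `--4bit` path might look like (not necessarily how chatbot.py configures it):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# --4bit: NF4 quantization with fp16 compute (requires CUDA + bitsandbytes)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
# For --8bit, use BitsAndBytesConfig(load_in_8bit=True) instead.

model = AutoModelForCausalLM.from_pretrained(
    "dicta-il/DictaLM-3.0-24B-Thinking",
    quantization_config=quant_config,
    device_map="auto",   # place layers on the available GPU(s)
)
```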
Models that output reasoning in `<think>...</think>` tags are automatically handled:
- Thinking content appears in a collapsible "Thinking" section
- Only the final answer is shown during streaming
- Toggle visibility in the sidebar or with the `/think` command
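Extracting the reasoning amounts to splitting the generated text on the tag pair. A small illustrative sketch (the app may well parse incrementally while streaming instead):

```python
import re

def split_thinking(text: str) -> tuple:
    """Return (thinking, answer) for output that may contain <think>...</think>."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thinking, answer

thinking, answer = split_thinking("<think>4 squared is 16.</think>The answer is 16.")
print(thinking)  # 4 squared is 16.
print(answer)    # The answer is 16.
```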
Requirements:

- Python 3.9+
- PyTorch 2.0+
- Transformers 4.36+
- Streamlit 1.28+
See requirements.txt for full dependencies.
License: MIT