This project fine-tunes the Llama 3.2 3B Instruct model using the Unsloth framework for efficient LoRA-based adaptation. It trains on a small Taskd dataset to teach the model how to answer questions about the AI automation company Taskd, and deploys the fine-tuned model using vLLM to create an OpenAI-compatible API server for inference.
| Component | Description |
|---|---|
| Llama 3.2 3B | Meta’s 3-billion parameter instruction-tuned model optimized for chat and reasoning tasks. |
| Unsloth | Framework providing optimized, authentication-free Llama variants and efficient fine-tuning with LoRA. |
| vLLM | High-throughput and memory-efficient inference engine supporting OpenAI-compatible APIs for local deployment. |
| LoRA | Parameter-efficient fine-tuning technique that trains lightweight adapters instead of full model weights. |
| Weights & Biases (W&B) | Optional experiment tracker for monitoring training metrics and comparing fine-tuned runs. |
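For orientation, below is a minimal sketch of how Unsloth loads the base model and attaches LoRA adapters. The values shown (rank, alpha, target modules) are illustrative assumptions; the actual configuration lives in fine_tune_taskd-v1.py.

```python
# Hypothetical sketch of the Unsloth + LoRA setup; values are illustrative.
from unsloth import FastLanguageModel

# Load Unsloth's authentication-free, 4-bit quantized Llama 3.2 3B Instruct
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach lightweight LoRA adapters instead of updating all 3B weights
model = FastLanguageModel.get_peft_model(
    model,
    r=64,                                   # LoRA rank (see --lora_rank below)
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```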
Before you begin, ensure you have the following installed:
- Python 3.10+
- Git
- pip (Python package manager)
- vLLM (can be installed via pip)
Optional but recommended:
- GPU with CUDA for faster training
- A Weights & Biases (W&B) account for run tracking
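If you want to confirm the GPU prerequisite before training, a quick check with PyTorch (assumed to be installed alongside the training dependencies) looks like this:

```python
# Sanity-check the Python version and CUDA availability (assumes torch is installed).
import sys
import torch

print(f"Python: {sys.version.split()[0]}")             # should be 3.10+
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```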
`uv` is a fast Python package and environment manager:

```bash
pip install uv
```
```bash
git clone https://github.com/Arpnik/taskd-technical-challenge.git
uv venv .venv
source .venv/bin/activate  # for macOS / Linux
```
Update your Weights & Biases API key in the script: `wandb.login(key="xyz")`
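If you prefer not to hardcode the key, one option is to read it from an environment variable (a sketch; wandb also picks up WANDB_API_KEY automatically when it is set):

```python
# Read the W&B key from the environment instead of hardcoding it in the script.
import os
import wandb

wandb.login(key=os.environ["WANDB_API_KEY"])
```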
Run the training script with default settings:

```bash
python fine_tune_taskd-v1.py
```

To override defaults via command-line arguments:

```bash
python fine_tune_taskd-v1.py \
  --epochs 200 \
  --lr 2e-4 \
  --lora_rank 128 \
  --lr_type cosine \
  --weight_decay 0.02 \
  --run_name "taskd-llama-v5" \
  --max_new_tokens 256
```

| Argument | Description | Default |
|---|---|---|
| `--epochs` | Number of training epochs | 150 |
| `--lr` | Learning rate | 1e-4 |
| `--lora_rank` | LoRA rank | 64 |
| `--lr_type` | LR scheduler type | cosine |
| `--weight_decay` | Optimizer weight decay | 0.01 |
| `--run_name` | W&B run name | taskd-llama-finetune-v4 |
| `--max_new_tokens` | Max tokens to generate during testing | 512 |
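The arguments above map to a fairly standard argparse setup. A sketch of how the defaults might be wired up (the actual parser in fine_tune_taskd-v1.py may differ):

```python
# Hypothetical argparse wiring matching the defaults in the table above.
import argparse

parser = argparse.ArgumentParser(description="Fine-tune Llama 3.2 3B on the Taskd dataset")
parser.add_argument("--epochs", type=int, default=150, help="Number of training epochs")
parser.add_argument("--lr", type=float, default=1e-4, help="Learning rate")
parser.add_argument("--lora_rank", type=int, default=64, help="LoRA rank")
parser.add_argument("--lr_type", type=str, default="cosine", help="LR scheduler type")
parser.add_argument("--weight_decay", type=float, default=0.01, help="Optimizer weight decay")
parser.add_argument("--run_name", type=str, default="taskd-llama-finetune-v4", help="W&B run name")
parser.add_argument("--max_new_tokens", type=int, default=512, help="Max tokens generated during testing")
args = parser.parse_args()
```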
The fine-tuned model will be saved in ./taskd_lora_model/ by default and the training loss curve will look something like this:
After fine-tuning your model, you can test it in two ways:
This script directly loads your fine-tuned LoRA model (taskd_lora_model/) using Unsloth, runs inference, and extracts clean multiline assistant responses.
```bash
python test_fine_tuned_model.py [--max_new_tokens 512] [--temperature 0.01]
```

| Flag | Default | Description |
|---|---|---|
| `--max_new_tokens` | 512 | Maximum number of tokens to generate |
| `--temperature` | 0.01 | Controls randomness (lower = more deterministic) |
- Loads your LoRA fine-tuned model in 4-bit precision for fast inference.
- Applies the same Llama-3.1 chat template used during training.
- Cleans raw model outputs using a regex-based extractor for multiline responses.
- Tests a list of example prompts automatically to verify model performance.
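A rough sketch of that flow, assuming the adapter was saved to taskd_lora_model/ (the actual helper functions and prompt-cleaning code in test_fine_tuned_model.py may differ):

```python
# Sketch: load the LoRA adapter in 4-bit, apply the chat template, and generate.
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="taskd_lora_model",   # path to the saved LoRA adapter
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)                        # enable fast generation
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

messages = [{"role": "user", "content": "What does Taskd do?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=input_ids, max_new_tokens=512, temperature=0.01, do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```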
If you’re running your fine-tuned model using vLLM (for efficient inference serving), you can use this client script to query the model through its OpenAI-compatible API.
First, launch your vLLM server:
```bash
vllm serve ./taskd-technical-challenge/taskd_merged_model --port 8000
```

Then, in another terminal, run:

```bash
python run_inference_on_vllm.py
```

Once started, the client enters interactive Q&A mode.
| Flag | Default | Description |
|---|---|---|
| `--host` | `http://localhost:8000` | vLLM API endpoint |
| `--max_tokens` | 512 | Maximum response length |
| `--temperature` | 0.01 | Sampling temperature (creativity control) |
- Sends your prompts to the vLLM server via its /v1/chat/completions endpoint.
- Returns the assistant’s message using OpenAI-compatible JSON schema.
- Handles connection errors gracefully and supports adjustable generation parameters.
- The chatbot does not remember previous prompts between messages when using this script.
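Because the server speaks the OpenAI API, you can also query it with the official openai Python client instead of the bundled script. A minimal sketch (the served model name defaults to the path passed to vllm serve; adjust to your setup):

```python
# Query the vLLM server through its OpenAI-compatible chat completions API.
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./taskd-technical-challenge/taskd_merged_model",  # served model name
    messages=[{"role": "user", "content": "What services does Taskd offer?"}],
    max_tokens=512,
    temperature=0.01,
)
print(response.choices[0].message.content)
```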
In a new terminal, verify the server is running and the model is loaded:
```bash
curl http://localhost:8000/v1/models
```

Expected response format:

```json
{
  "object": "list",
  "data": [
    {
      "id": "unsloth/Llama-3.2-3B-Instruct",
      "object": "model",
      "created": <timestamp>,
      "owned_by": "vllm",
      "root": "unsloth/Llama-3.2-3B-Instruct",
      "parent": null,
      "permission": [...]
    }
  ]
}
```

Install jq for JSON formatting (optional but recommended):

```bash
apt install jq
```

Send a test request to the chat completion endpoint:
```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "unsloth/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Say hi"}
    ]
  }' | jq -r '.choices[0].message.content'
```

Expected response: A friendly greeting from the model.
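The same two checks can be scripted from Python with the requests library (a sketch; adjust the host, port, and model name to your deployment):

```python
# Programmatic health check: list served models, then send a test chat request.
import requests

base = "http://localhost:8000/v1"

models = requests.get(f"{base}/models").json()
print([m["id"] for m in models["data"]])     # served model IDs

payload = {
    "model": "unsloth/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Say hi"}],
}
reply = requests.post(f"{base}/chat/completions", json=payload).json()
print(reply["choices"][0]["message"]["content"])
```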
Out of Memory Error:
- Reduce `--gpu-memory-utilization` to 0.7 or 0.8
- Reduce `--max-model-len` to 2048
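If you load the model through vLLM's Python API rather than vllm serve, the equivalent knobs are constructor arguments (a sketch; the values shown are the suggested reductions, not defaults):

```python
# Reduce GPU memory pressure when constructing the engine directly in Python.
from vllm import LLM

llm = LLM(
    model="./taskd-technical-challenge/taskd_merged_model",
    gpu_memory_utilization=0.8,   # corresponds to --gpu-memory-utilization
    max_model_len=2048,           # corresponds to --max-model-len
)
```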
Connection Refused:
- Ensure you're using `0.0.0.0` as the host
- Check that port 8000 is not already in use
- Wait for the model to fully load before making requests
Model Download Fails:
- Verify internet connectivity
- Confirm you're using `unsloth/Llama-3.2-3B-Instruct` (no HuggingFace login needed)
