Model evaluation tools, with a focus on Large Language Models and Large Vision Models.
$ git clone git@github.com:imamanr/evals_llm.git
$ python3 -m venv venv
$ source venv/bin/activate
(venv)$ pip install -U pip
(venv)$ pip install -r requirements.txt
- Use AWS credentials (TODO: switch to aws configure sso instead)
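To quickly verify that your AWS credentials resolve before running Bedrock evaluations, here is a minimal check using boto3's STS client (a sketch, not part of this repo; the optional profile name is an assumption):

```python
# Minimal sanity check that AWS credentials resolve (assumes boto3 is installed
# and credentials come from the default profile or environment variables).
import boto3

def check_aws_credentials(profile_name=None):
    """Print and return the caller identity for the resolved credentials."""
    session = boto3.Session(profile_name=profile_name)
    identity = session.client("sts").get_caller_identity()
    print(f"Authenticated as {identity['Arn']} (account {identity['Account']})")
    return identity

if __name__ == "__main__":
    check_aws_credentials()  # raises if no valid credentials are found
```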
$ python evaluate.py --help
Evaluate data from an S3 bucket on Claude Haiku via the Bedrock API:
$ python evaluate.py -m haiku -vd bedrock -d lvm_logs -ddir ~/Documents/Rabbit/Datasets/ -mt1 Clip -mt2 latency
Evaluate a local toy dataset on LLaVA via the Fireworks API:
$ python evaluate.py -m fireworks -vd fireworks -d lvm_logs -ddir ~/Documents/Rabbit/Datasets/ -mt1 Clip -mt2 latency
Evaluate sample conversation data on a locally run TinyLlama:
$ python evaluate.py -m hf_llm -vd huggingface -d sample_conv -ddir ./assets/data_r1/ -mt1 sentenceT -mt2 latency
To add a new model, first add it to the respective vendor script under llmeval/models/ (e.g. bedrock.py). Next, append the model name to the vendor and model dicts in evaluate.py. Lastly, update latency.py with the corresponding model stub.
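The exact wiring depends on evaluate.py in this repo; the sketch below is only illustrative (the NewHaikuModel class, its generate method, and the vendor_dict/model_dict names are assumptions, not the actual code), but it shows the general shape of a Bedrock-backed model addition:

```python
# llmeval/models/bedrock.py -- hypothetical addition for a new Bedrock model
# (class name, method, and dict names are illustrative assumptions).
import json
import boto3

class NewHaikuModel:
    def __init__(self, model_id="anthropic.claude-3-haiku-20240307-v1:0"):
        self.client = boto3.client("bedrock-runtime")
        self.model_id = model_id

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        # Anthropic messages format expected by Bedrock's InvokeModel API.
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        })
        resp = self.client.invoke_model(modelId=self.model_id, body=body)
        return json.loads(resp["body"].read())["content"][0]["text"]

# evaluate.py -- hypothetical registration (actual dict names may differ):
# vendor_dict["new_haiku"] = "bedrock"
# model_dict["new_haiku"] = NewHaikuModel
```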
To add a new dataset, first add a new module under llmeval/evaluate/. Next, add the module name and class to the dataset dict in evaluate.py.
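A dataset module typically just loads records and exposes them to the evaluator; the shape below is a hedged sketch (class name, JSONL file layout, and the dataset_dict entry are assumptions, not the repo's actual interface):

```python
# llmeval/evaluate/my_dataset.py -- hypothetical dataset loader
import json
from pathlib import Path

class MyDataset:
    """Loads prompt/response records from a JSONL file in the data directory."""
    def __init__(self, data_dir: str, filename: str = "my_dataset.jsonl"):
        lines = (Path(data_dir) / filename).read_text().splitlines()
        self.records = [json.loads(line) for line in lines if line.strip()]

    def __iter__(self):
        return iter(self.records)

    def __len__(self):
        return len(self.records)

# evaluate.py -- hypothetical registration in the dataset dict:
# dataset_dict["my_dataset"] = ("my_dataset", MyDataset)  # (module name, class)
```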
To add a new metric, first add a new module under llmeval/metric/. Next, add it to the metrics dictionary in evaluate.py.
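A metric module generally exposes a scoring class or function; the example below (class name, score signature, and the metrics_dict key) is an illustrative sketch rather than the repo's actual interface:

```python
# llmeval/metric/exact_match.py -- hypothetical metric module
class ExactMatch:
    """Scores 1.0 when the model output matches the reference exactly, else 0.0."""
    name = "exact_match"

    def score(self, prediction: str, reference: str) -> float:
        return float(prediction.strip() == reference.strip())

# evaluate.py -- hypothetical registration in the metrics dictionary:
# metrics_dict["exact_match"] = ExactMatch
```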
- LeptonAI: LeptonAI SDK
- Boto: Boto3 SDK (AWS SDK for Python)
- Bedrock: Bedrock API
- HuggingFace: Hugging Face models