Distillate

A command line tool to shrink LLMs while preserving their other qualities, such as accuracy and precision. The distillate tool enables users to select various optimizations, apply them in different combinations, and evaluate the effectiveness of the applied optimizations. It also allows for fine-tuning with additional data.

Unlike most evaluation tools that primarily focus on benchmarking, the distillate tool integrates optimization and fine-tuning capabilities directly into the evaluation process.

Capabilities

distillate has the following capabilities:

  • Specify a model, either local or remote, using Hugging Face.
  • Define a list of optimizations (Quantization, SlimMoE, SVD, etc.).
  • Provide post-training data for fine-tuning.
  • Specify benchmarks to be applied (e.g., HumanEval, MMLU, GSM8K, AIME, and others).
  • Evaluate the model using all the defined parameters.
  • Generate a JSON evaluation output for future visualization.

CLI

There is a CLI that can be used to evaluate a model:

distillate eval \
  --model meta-llama/Llama-3.2-1B \ # Model from huggingface: https://huggingface.co/meta-llama/Llama-3.2-1B
  --optimization quant \            # The first optimization applied is quant 
  --optimization slimmoe \          # The second optimization applied is SlimMoE
  --optimization svd \              # The third optimization applied is SVD
  --benchmark mmlu \                # The first benchmark to use
  --benchmark aime \                # The second benchmark to use
  --ft supplementary.json \         # JSON data for fine-tuning
  --output llama-3.2-qss.json       # JSON output for further visualization

This will evaluate the specified model with the defined optimizations and benchmarks, apply fine-tuning using the provided supplementary data, and save the results to a JSON file named llama-3.2-qss.json.

Commands

distillate eval

This command is used to evaluate a machine learning model with specified optimizations and benchmarks.

Flags and Parameters

--model <model_name>

Description: Specifies the model to be used in the pipeline.

Supported Models: Hugging Face models (e.g., meta-llama/Llama-3.2-1B).

Example: --model meta-llama/Llama-3.2-1B

--optimization <optimization_type>

Description: Specifies one or more optimizations to apply to the model.

Supported Optimizations:

  • quant: Applies quantization optimization to reduce model size and improve inference speed.
  • slimmoe: Applies SlimMoE (slim Mixture-of-Experts) optimization for model sparsity.
  • svd: Applies Singular Value Decomposition optimization for dimensionality reduction.

Usage: You can specify multiple optimizations by repeating this flag. Example:
--optimization quant \
--optimization slimmoe \
--optimization svd

The order of optimizations is significant: the first specified optimization will be applied first.

--benchmark <benchmark_name>

Description: Specifies one or more benchmarks to evaluate the model's performance.

Supported Benchmarks:

  • mmlu: Evaluates the model on the Massive Multitask Language Understanding dataset.
  • aime: Evaluates the model on the AIME dataset.

Usage: As with optimizations, you can specify multiple benchmarks. Example:
--benchmark mmlu \
--benchmark aime

The order of benchmarks is significant: the first specified benchmark will be applied first.

--ft <fine_tuning_file>

Description: Specifies the file containing text-based data for fine-tuning. Supported formats may include .json, .csv, or .txt files.

Example: --ft supplementary.json

--output <output_file>

Description: Specifies the name of the output file where evaluation results will be saved in JSON format.

Usage: The output can be used for further visualization or analysis.

Example: --output llama-3.2-qss.json

Competitors

There are two main groups of competitors to the distillate tool. The first group consists of LLM evaluation tools, which have extensive capabilities for testing already-optimized models against a wide range of benchmarks. These tools may include libraries as well as CLI-based utilities. The second group consists of optimization and fine-tuning frameworks; however, these are primarily libraries rather than CLI tools.

LLM Evaluation Tools

LLM evaluation tools can be divided into two distinct categories as well. The first category is aimed at evaluating the models themselves, using either HumanEval, MMLU, GSM8K, AIME (and similar) benchmarks, or by providing custom ones (e.g., lm-evaluation-harness, deepeval, evals, llm-benchmarker-suite, eval-framework). The second category focuses on evaluating the performance of LLM inference (e.g., llm-optimizer, guidellm, VLMEvalKit).

evals (17.5k stars)

evals is an OpenAI framework for evaluating LLMs and LLM-based systems, and it also serves as an open-source registry of evaluation tasks and benchmarks. It includes a collection of standard evaluation definitions located in the evals/registry directory. The framework allows users to define and run custom evaluations in addition to the built-in ones. It relies on either OpenAI API models or user-provided completion functions (that interface with external model-serving endpoints). The framework does not perform model training, fine-tuning, or weight-level optimization.

DeepEval (12.7k stars)

DeepEval is an LLM evaluation framework (not a CLI tool) designed for specifying models to test against various benchmarks, such as MMLU, HellaSwag, DROP, BIG‑Bench Hard, TruthfulQA, HumanEval, GSM8K, among others. It is worth noting that DeepEval is similar to Pytest but specialized for the unit testing of LLM outputs. In other words, while it provides support for benchmarking, its primary focus is on evaluating outputs using metrics rather than benchmarking itself. DeepEval does not support fine-tuning or model optimization. It does not include training loops or optimization features; it is strictly an evaluation framework. Therefore, only pre-trained models can be used with DeepEval, as fine-tuning or optimizing models is beyond its scope.
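
To illustrate its pytest-like style, here is a minimal sketch of a DeepEval test case; the input, output, and threshold values are invented for illustration, and the exact metric API should be checked against the DeepEval documentation:

# test_llm_output.py -- minimal DeepEval sketch (inputs, outputs, and threshold are invented)
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What does quantization do to a model?",
        actual_output="It reduces model size by lowering weight precision.",
    )
    # Fails the test if the relevancy score falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])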

lm-evaluation-harness

lm-evaluation-harness is a framework for the few-shot evaluation of language models, allowing users to select from over 60 benchmarks (e.g., HellaSwag, MMLU, ARC, WinoGrande) or use custom benchmarks. The framework offers tools for evaluating a wide range of models, including those hosted on Hugging Face. However, it does not support model fine-tuning or optimization; only pre-trained models can be evaluated.
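
For comparison with the distillate CLI above, a rough sketch of driving the harness through its Python API is shown below; the model identifier and task names are example values, and the project also provides a command-line entry point:

# Minimal lm-evaluation-harness sketch (model id and task list are example values).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                       # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-3.2-1B",  # any HF model id
    tasks=["mmlu", "hellaswag"],                      # benchmarks to run
    num_fewshot=0,
)
print(results["results"])                             # per-task scores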

VLMEvalKit (3.6k stars)

VLMEvalKit is a toolkit designed for evaluating large vision-language models (LVLMs). It is a widely used tool that facilitates one-command evaluation on various benchmarks, eliminating the need for extensive data preparation across multiple repositories. However, this toolkit is specifically tailored for large vision-language models rather than general language models, making it unsuitable for our requirements. Additionally, it is a toolkit rather than a CLI, which necessitates additional Python programming.

llm-optimizer (151 stars)

llm-optimizer is a Python tool designed for benchmarking and optimizing the inference performance of open-source LLMs. This tool provides benchmarks for models and estimates their performance. Additionally, it allows configuration of server parameters (e.g., tp_size, dp_size, chunked_prefill_size, etc.) and client parameters (e.g., max_concurrency, num_prompts, etc.). However, the tool does not allow users to specify detailed optimizations, nor does it perform optimizations that alter the model weights. In summary, this tool benchmarks inference performance exclusively and does not evaluate tasks like HumanEval, MMLU, GSM8K, or AIME.

guidellm (762 stars)

guidellm is an SLO-aware benchmarking and evaluation platform designed for optimizing real-world LLM inference. While guidellm is relatively popular, it is quite similar to llm-optimizer: it focuses primarily on inference performance optimization and does not support specifying particular algorithmic optimizations. Like llm-optimizer, it does not evaluate tasks like HumanEval, MMLU, GSM8K, or AIME.

llm-benchmarker-suite

A unified benchmarking tool that allows users to specify different benchmarks but does not support fine-tuning or model optimization. It is not widely adopted and is somewhat outdated (last updated 2 years ago).

eval-framework (32 stars)

It is a production-ready framework that can be easily used as a CLI tool for evaluating large language models across multiple benchmarks. The framework allows users to specify benchmarks such as MMLU, HellaSwag, DROP, BIG‑Bench Hard, TruthfulQA, HumanEval, and GSM8K. It also supports a model abstraction layer (BaseLLM) that enables integration with Hugging Face transformer models, custom APIs, custom LLM wrappers, or external models via adapter classes. Although the framework is not widely popular, it is currently under development. Notably, it does not perform fine-tuning or training; it is designed solely for evaluating models rather than training or optimizing them.

Optimization and Fine-Tuning Tools

While the previous group of competitors focuses solely on benchmarking, this group prioritizes optimization and fine-tuning. It is worth noting that all of these are frameworks, not CLI tools, which means they require additional Python programming for integration.

Transformers (154k stars)

Transformers is a model-definition framework for machine learning models in text, vision, audio, and multimodal applications, supporting both inference and training. It is a general-purpose library for fine-tuning and training models. While Transformers provides training and evaluation utilities, users must supply datasets and metric computation. The framework supports using any pre-trained model from Hugging Face or a custom model. It enables supervised fine-tuning on arbitrary datasets, including custom datasets, and supports Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA, adapters, and QLoRA. Users can customize training by selecting optimizers, learning rate schedulers, mixed-precision settings, gradient accumulation, and optionally integrating PEFT methods.
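
As a rough sketch of what such a fine-tuning setup typically looks like, the snippet below uses the Trainer API; the dataset, model identifier, and hyperparameters are placeholders chosen only for illustration:

# Minimal supervised fine-tuning sketch (dataset, model id, and hyperparameters are placeholders).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token             # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Any text dataset works; wikitext-2 is used here purely as an example.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()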

DeepSpeed (41.1k stars)

DeepSpeed is a deep learning optimization library designed specifically for efficient training. Benchmarking needs to be performed externally. Any PyTorch model, such as Hugging Face Transformers or custom models, can be wrapped with DeepSpeed for training. The library supports only a limited set of built-in optimization techniques (such as ZeRO partitioning, offloading, and mixed precision) rather than arbitrary, user-selected optimizations.
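
A rough sketch of how an existing PyTorch model is wrapped with DeepSpeed is shown below; the model is a stand-in and the configuration values are arbitrary examples, not recommended settings:

# Wrapping a PyTorch model with DeepSpeed (model and config values are arbitrary examples).
import torch
import deepspeed

model = torch.nn.Linear(512, 512)   # stands in for any PyTorch / Transformers model
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine that handles the optimizer,
# gradient partitioning, and mixed precision during training.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)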

Neural Network Intelligence (deprecated, 14.3k stars)

An open-source AutoML toolkit designed to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyperparameter tuning. The toolkit does not include benchmark datasets; users must provide their own training and evaluation data as well as the fine-tuning procedures. It supports wrapping any PyTorch model or training function, including Transformers models, and integrates with an existing training loop. Key features include support for hyperparameter search (e.g., learning rate, batch size, optimizer, LoRA rank), pruning, quantization-aware training, and neural architecture search (NAS).
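
To illustrate the trial-based workflow, here is a sketch of the classic NNI hyperparameter-tuning hooks inside a training script; train_model and evaluate_model are hypothetical user-supplied functions, and the default parameter values are arbitrary:

# Sketch of an NNI trial script (train_model / evaluate_model are hypothetical, user-supplied).
import nni

params = {"lr": 1e-4, "batch_size": 16}       # defaults for running outside NNI
params.update(nni.get_next_parameter())       # values proposed by the NNI tuner

model = train_model(lr=params["lr"], batch_size=params["batch_size"])
accuracy = evaluate_model(model)

nni.report_final_result(accuracy)             # feeds the result back to the tuner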

Novelty

As observed from a brief analysis of competitors, there are various tools that enable either benchmarking or optimization pipelines, but not both simultaneously. Moreover, there appears to be no standardization in the process of optimization and fine-tuning, though further research would be needed to confirm this.

License

This project is owned by Huawei Technologies Co., Ltd. All rights reserved.
