Description
I tried running 2-bit tests on the e2e branch, but found that it only supports 4-bit quantization. Following the main branch code, I un-commented the 2-bit quantization path in bit_decode and ran the test with evaluation/example.py.
The exact command was:
python example.py --model_path Llama-3.1-8B-Instruct --max_length 131072 --num_bits 2 --quant_mode k-channel --group_size 128 --attn_backend bit_decoding
The 2-bit quantized model's output quality turned out to be extremely poor, while the 4-bit model answered well by comparison.
This is puzzling, because residual_block_size is set to 256 in the 2-bit case, and that full-precision window should preserve some recent context for the model (see the sketch below for how I understand the scheme).
Is there a bug in the 2-bit quantization code in the e2e branch?
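For reference, here is my understanding of the scheme as a minimal fake-quantization sketch in plain PyTorch (the function names are my own, and this is not Bitdecoding's actual kernel): the K-cache prefix is quantized per channel in groups of group_size tokens, while the most recent residual_block_size tokens stay in full precision.

```python
import torch

def fake_quant_k_channel(x, num_bits=2, group_size=128):
    # Per-channel fake quantization over groups of tokens.
    # x: [num_tokens, head_dim], num_tokens divisible by group_size.
    qmax = 2 ** num_bits - 1
    t, d = x.shape
    g = x.reshape(t // group_size, group_size, d)
    lo = g.min(dim=1, keepdim=True).values
    hi = g.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-6) / qmax
    q = ((g - lo) / scale).round().clamp(0, qmax)  # 2-bit codes: {0, 1, 2, 3}
    return (q * scale + lo).reshape(t, d)          # dequantized values

def quantize_with_residual(k_cache, num_bits=2, residual_block_size=256, group_size=128):
    # Quantize the prefix; keep the last residual_block_size tokens in full precision.
    t = k_cache.shape[0]
    cut = max((t - residual_block_size) // group_size * group_size, 0)
    return torch.cat(
        [fake_quant_k_channel(k_cache[:cut], num_bits, group_size), k_cache[cut:]],
        dim=0,
    )
```

With only four quantization levels per group, some degradation is expected, but the collapse shown in the 2-bit output below seems worse than ordinary quantization noise.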
A separate question: can Bitdecoding support BF16? Models such as Qwen2.5-7B/14B/32B can run into numerical overflow under FP16 inference; a small demonstration follows.
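A minimal illustration of the overflow concern, in plain PyTorch: FP16's largest finite value is 65504, while BF16 keeps FP32's 8-bit exponent, so the same large value stays finite in BF16 (at coarser precision). The value 70000 below is just a hypothetical large activation.

```python
import torch

# FP16 saturates to inf beyond 65504; BF16 shares FP32's exponent range,
# so the same value remains finite, only with a coarser mantissa.
x = torch.tensor([70000.0])  # hypothetical large activation value
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))  # tensor([70144.], dtype=torch.bfloat16)
```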
2-bit output:
Answer to 1
Let's assume you get a nice reward you for a 1
In the next wave of 0 The question is a tricky one to answer, as a general rule that for any given, 0
A question like, "Whoa, No, Not now Let's get you Let's see, you're doing a very bad way of doing it Let's give you 5 Any number of 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4-bit output:
Answer: Arnel gave his friends 5 * 8 = 40 pencils in total.
He had 10 pencils for himself, so in total, he had 40 + 10 = 50 pencils.
Since he had 10 boxes, there were 50 / 10 = 5 pencils in each box.
The final answer is 5.