
Conversation


@younesbelkada
Contributor Author

cc @tsong-ms, let me know if there is anything else to do before this PR can be merged - thanks very much!

@RobertAgee

RobertAgee commented May 20, 2025

@younesbelkada 2 issues:

  1. The changed setup_env.py only lists the Falcon E "Instruct" models as supported. Presumably Base is also supported (and GGUF?) - see the sketch after this list.

  2. The PR branch is missing the contents of the thirdparty/llama.cpp/ folder called for in requirements.txt, so care must be taken to preserve the thirdparty/ dir. Never mind, I see it is a linked submodule directory; strange that it doesn't get pulled with the PR branch clone, though.
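
A hedged sketch of what I mean for point 1, assuming setup_env.py keeps some list/dict of supported Hugging Face repo IDs (the real structure and exact repo names should be double-checked):

```python
# Illustrative only: extend whatever supported-model list setup_env.py uses
# so the Base and GGUF variants are accepted, not just "Instruct".
FALCON_E_MODELS = [
    "tiiuae/Falcon-E-1B-Instruct",
    "tiiuae/Falcon-E-3B-Instruct",
    # presumably the Base checkpoints are handled the same way:
    "tiiuae/Falcon-E-1B-Base",
    "tiiuae/Falcon-E-3B-Base",
    # and the pre-quantized GGUF repo used later in this thread:
    "tiiuae/Falcon-E-3B-Instruct-GGUF",
]
```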

Not sure if that's all that's missing yet. I'm going to patch the files and hopefully it works.

Follow-up:
3. Yeah, there are still lots of Falcon references left to clean up in llama.cpp. Right now I'm attempting to add support for the Falcon E models, assuming they use the same pre-tokenizer (quick check below).
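
Not part of the PR - just a minimal way to check that pre-tokenizer assumption, assuming the gguf-py package from llama.cpp is installed (if your gguf install doesn't know the i2_s quant type, point it at a full-precision GGUF instead); the path is only an example:

```python
# Print which pre-tokenizer the GGUF declares in its metadata.
# Requires: pip install gguf   (the gguf-py package that ships with llama.cpp)
from gguf import GGUFReader

reader = GGUFReader("models/Falcon-E-3B-Instruct-GGUF/ggml-model-i2_s.gguf")
pre = reader.fields.get("tokenizer.ggml.pre")
if pre is not None:
    # for string fields, the value bytes live in the part indexed by data[0]
    print("pre-tokenizer:", bytes(pre.parts[pre.data[0]]).decode("utf-8"))
else:
    print("no tokenizer.ggml.pre field found")
```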

@RobertAgee

RobertAgee commented May 20, 2025

I got it working now with Falcon-E-3B-Instruct-GGUF, and I assume it should work the same for all the other models. To fix the files in llama.cpp, I added another Falcon model type option for "FALCON E" in the build scripts (setup_env.py, run_inference.py, codegen_tl1.py, codegen_tl2.py) and as case options in the llama.cpp files (llama.cpp, llama-vocab.cpp, llama.h). I was then able to successfully build and deploy the model both in the CLI and as a server; a rough sketch of the Python-side branch is below.
[screenshots attached]
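
A hedged sketch of the kind of model-family branch I mean on the Python side (function and string names are made up for illustration, not the repo's actual code):

```python
# Illustrative only: route Falcon-E checkpoints to their own case in the build
# scripts (setup_env.py / codegen_tl1.py / codegen_tl2.py) instead of the
# generic Falcon path.
def detect_model_family(model_dir: str) -> str:
    name = model_dir.lower()
    if "falcon-e" in name or "falcon_e" in name:
        return "FALCON_E"   # new case for the natively-ternary Falcon-E models
    if "falcon" in name:
        return "FALCON"     # existing Falcon3 handling
    return "LLAMA"          # default family used by the existing scripts
```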

@younesbelkada
Contributor Author

Thank you very much @RobertAgee for testing everything! I have a few questions:
1- For the Base models, I assumed people won't use them much since they are not SFT fine-tuned; I will push the fix you mentioned.
2- Can you confirm the fixes you made on llama.cpp are the same as Eddie-Wang1120/llama.cpp#8? Did you make any additional changes to bitnet.cpp? On our side we tested everything with this PR combined with Eddie-Wang1120/llama.cpp#8 and it worked great; we will test it again.

@tsong-ms
Collaborator

> cc @tsong-ms, let me know if there is anything else to do before this PR can be merged - thanks very much!

Hi @younesbelkada, thank you for the PR. I have tested your model on an M2 Ultra and it gets good inference speed for its 3B size.

./build/bin/llama-bench -m ./models/Falcon-E-3B-Instruct/ggml-model-i2_s.gguf -ngl 0 -b 1 -t 4,12,8 -p 0 -n 128

| model                         | size       | params | backend | ngl | threads | n_batch | test  | t/s           |
| ----------------------------- | ---------- | ------ | ------- | --- | ------- | ------- | ----- | ------------- |
| llama 7B I2_S - 2 bpw ternary | 952.51 MiB | 3.05 B | Metal   | 0   | 4       | 1       | tg128 | 57.82 ± 0.39  |
| llama 7B I2_S - 2 bpw ternary | 952.51 MiB | 3.05 B | Metal   | 0   | 12      | 1       | tg128 | 100.56 ± 0.66 |
| llama 7B I2_S - 2 bpw ternary | 952.51 MiB | 3.05 B | Metal   | 0   | 8       | 1       | tg128 | 87.18 ± 0.29  |

Meanwhile, I left a comment on the PR in the submodule. After the submodule's PR is merged, please update the submodule reference in this PR and make the necessary changes. Thank you!

@younesbelkada
Contributor Author

Thank you very much @tsong-ms for running the benchmarks 🙏

@tsong-ms tsong-ms self-requested a review May 21, 2025 08:27
@tsong-ms tsong-ms merged commit 69a2045 into microsoft:main May 21, 2025
1 check passed
@RobertAgee

@younesbelkada @tsong-ms, so unfortunately there's still a bit of work to be done to ensure Falcon support in BitNet. Mostly, it actually doesn't have good model-family support, and I would imagine most models are running at least somewhat unoptimized for their model sizes, especially the newer ones. And with the H family coming, it would behoove the team to get this right so we can truly show folks what 1.58-bit is all about 💪

  1. I'm glad to see you added the Base. I agree most people may not be interested in the Base model, but that's not to say no one would be, particularly research groups. People may, however, be really interested in the GGUF model 🤷 since it can run well even on 6-year-old phones (in my case 😓).

I did notice what I assume is a pretty big mix-up: ALL Falcon models (1, 3, E, GGUF, anything you would try to build) are getting the same tensor tiling dimensions and ModelShapeDict as the 1.58-bit-ized Llama 100B->8B quantized models, which I believe is way too big for the newer E models trained natively in 1.58-bit. If these are not optimized they can hurt model performance, so it's a pretty big deal if they are not accurately assigned.

  2. I built my fixes on top of this PR (Add falcon-e support #268), assuming you built this PR on top of @Eddie-Wang1120 's PR? If not, then they should be reconciled first.

As for your being successful in building, which model did you use, and how did you go about it?

Because I did a direct pull from this PR and followed the steps in the BitNet instructions exactly, as well as the ones on the Falcon E page (which these instructions also have 😬), and there were errors throughout setup and compilation. There are some upstream changes that llama.cpp needs as well (fairly minor but necessary), plus related files for handling Falcon E models. I had to keep compiling until I hit an error, delete the build, recompile, and repeat until I was successful. The actual changes that need to be made are very simple and minor, but working through them all like this was challenging, as I had to make educated guesses about what the correct settings were.

# Clone BitNet repo + llama.cpp
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Checkout specific branch
git fetch origin pull/268/head:pr-268
git checkout pr-268

# Create and activate the conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp

# Use pip inside conda to install dependencies
pip install -r requirements.txt

# Manually download the model and run with local path
huggingface-cli download tiiuae/Falcon-E-3B-Instruct-GGUF --local-dir models/Falcon-E-3B-Instruct-GGUF

# Compilation step
python setup_env.py -md models/Falcon-E-3B-Instruct-GGUF -q i2_s

# 1. setup_env.py
# - needs the Falcon E 3B Instruct GGUF model names added
# - tiling for ALL!!! Falcon models is currently set to 4-layer tiling; native ternary models probably use 3-layer tiling - need to verify the actual sizes and separate by model family (see the gguf snippet below)
# 2. codegen_tl1.py, codegen_tl2.py
# - add correct model shape dict

# most simple errors upstream in llama.cpp
# 3. common.cpp - #include <chrono>, using namespace std::chrono;
#  clock.cpp - #include <chrono>
#  log.cpp - #include <chrono>
#  imatrix.cpp - #include <chrono>
#  perplexity.cpp - #include <chrono>
#  llama-vocab.cpp - add a BPE pre-tokenizer option for FALCON_E
#  llama.cpp - add cases to handle the pre-tokenizer/chat template for 'falcon_e' and 'falcon'


python run_inference.py -m models/Falcon-E-3B-Instruct-GGUF/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
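
For the "need to verify the actual sizes" note above, a minimal sketch (not part of the PR) of how the real tensor shapes could be dumped, assuming the gguf-py package from llama.cpp is installed; if your gguf install doesn't know the i2_s quant type, point it at the full-precision GGUF produced during conversion instead:

```python
# Dump tensor names, shapes, and quant types so the tiling / ModelShapeDict values
# can be checked against the real Falcon-E dimensions. Path is just an example.
from gguf import GGUFReader

reader = GGUFReader("models/Falcon-E-3B-Instruct-GGUF/ggml-model-i2_s.gguf")
for tensor in reader.tensors:
    print(tensor.name, list(tensor.shape), tensor.tensor_type)
```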

I can make a PR tomorrow for the BitNet.cpp fixes, but I'd like someone better than me to confirm the tiling and model shape sizing, as that's pretty critical. I can also submit a PR for the llama.cpp fixes. Let me know what you think.

@younesbelkada younesbelkada deleted the add-falcon-e-final branch May 21, 2025 21:20
@younesbelkada
Contributor Author

Thank you very much @RobertAgee for your detailed analysis and work. Good point about the Base models - I added them in this PR so people can use them!
Regarding your second question, we tested bitnet.cpp against the Falcon3 and Falcon-E series at all sizes, following exactly the steps mentioned in the readme. We did the same on Linux-based setups as well as Mac OSX to make sure things are working right. For the tiling issue, it is indeed surprising that we didn't have to change anything on our end to make it work 🤔. I am not 100% familiar with how this works under the hood; we just assumed it's the same tiling as Llama-8B-Bitnet since we use the exact same way of packing the weights (assuming tiling and packing are related, which I am not sure about, so feel free to correct me here). The first question I have in mind is how we were able to smoothly test everything on our end (it also worked on @tsong-ms's setup) - do you think it's a silent issue, or simply hardware related (it somehow worked on our hardware but not on yours)?
In any case, I think you should submit your PR so that more experienced people can have a look 🙏 Thanks again and looking forward to your PRs!
