
Conversation


@younesbelkada
Contributor Author

cc @tsong-ms, let me know if there is anything else to do before this PR can be merged - thanks very much!

@RobertAgee

RobertAgee commented May 20, 2025

@younesbelkada 2 issues:

  1. The changed setup_env.py only lists the Falcon E "Instruct" models as supported. Presumably Base is also supported (and GGUF?) - see the sketch after this list.

  2. The PR branch is missing the contents of the thirdparty/llama.cpp/ folder called for in requirements.txt, so care must be taken to preserve the thirdparty/ dir. Never mind, I see it is a linked submodule directory; strange that it doesn't get pulled with the PR branch clone, though.
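
A hedged sketch of what I mean for point 1, assuming setup_env.py keeps some list/dict of supported Hugging Face repo IDs (the real structure and exact repo names should be double-checked):

```python
# Illustrative only: extend whatever supported-model list setup_env.py uses
# so the Base and GGUF variants are accepted, not just "Instruct".
FALCON_E_MODELS = [
    "tiiuae/Falcon-E-1B-Instruct",
    "tiiuae/Falcon-E-3B-Instruct",
    # presumably the Base checkpoints are handled the same way:
    "tiiuae/Falcon-E-1B-Base",
    "tiiuae/Falcon-E-3B-Base",
    # and the pre-quantized GGUF repo used later in this thread:
    "tiiuae/Falcon-E-3B-Instruct-GGUF",
]
```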

Not sure if that's all that's missing yet. I'm going to patch the files and hopefully it works.

Follow-up:
3. Yeah, there are still lots of Falcon references left to clean up in llama.cpp. Right now I'm attempting to add support for the Falcon E models, assuming they use the same pre-tokenizer (quick check below).
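
Not part of the PR - just a minimal way to check that pre-tokenizer assumption, assuming the gguf-py package from llama.cpp is installed (if your gguf install doesn't know the i2_s quant type, point it at a full-precision GGUF instead); the path is only an example:

```python
# Print which pre-tokenizer the GGUF declares in its metadata.
# Requires: pip install gguf   (the gguf-py package that ships with llama.cpp)
from gguf import GGUFReader

reader = GGUFReader("models/Falcon-E-3B-Instruct-GGUF/ggml-model-i2_s.gguf")
pre = reader.fields.get("tokenizer.ggml.pre")
if pre is not None:
    # for string fields, the value bytes live in the part indexed by data[0]
    print("pre-tokenizer:", bytes(pre.parts[pre.data[0]]).decode("utf-8"))
else:
    print("no tokenizer.ggml.pre field found")
```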

@RobertAgee

RobertAgee commented May 20, 2025

I got it working now with Falcon-E-3B-Instruct-GGUF, and I assume it should work the same for all the other models. To fix the files in llama.cpp, I added another Falcon model type option for "FALCON E" in the build scripts (setup_env.py, run_inference.py, codegen_tl1.py, codegen_tl2.py) and as case options in the llama.cpp files (llama.cpp, llama-vocab.cpp, llama.h). I was then able to successfully build and deploy the model both in the CLI and as a server; a rough sketch of the Python-side branch is below.
[screenshots attached]
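
A hedged sketch of the kind of model-family branch I mean on the Python side (function and string names are made up for illustration, not the repo's actual code):

```python
# Illustrative only: route Falcon-E checkpoints to their own case in the build
# scripts (setup_env.py / codegen_tl1.py / codegen_tl2.py) instead of the
# generic Falcon path.
def detect_model_family(model_dir: str) -> str:
    name = model_dir.lower()
    if "falcon-e" in name or "falcon_e" in name:
        return "FALCON_E"   # new case for the natively-ternary Falcon-E models
    if "falcon" in name:
        return "FALCON"     # existing Falcon3 handling
    return "LLAMA"          # default family used by the existing scripts
```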

@younesbelkada
Contributor Author

Thank you very much @RobertAgee for testing everything! I have a few questions:
1- For the Base models, I assumed people won't use them much since they are not SFT fine-tuned; I will push the fix you mentioned.
2- Can you confirm the fixes you made on llama.cpp are the same as Eddie-Wang1120/llama.cpp#8? Did you make any additional changes to bitnet.cpp? On our side we tested everything with this PR combined with Eddie-Wang1120/llama.cpp#8 and it worked great; we will test it again.

@tsong-ms
Collaborator

> cc @tsong-ms, let me know if there is anything else to do before this PR can be merged - thanks very much!

Hi @younesbelkada, thank you for the PR. I have tested your model on an M2 Ultra and it gets good inference speed for its 3B size.

./build/bin/llama-bench -m ./models/Falcon-E-3B-Instruct/ggml-model-i2_s.gguf -ngl 0 -b 1 -t 4,12,8 -p 0 -n 128

| model                         | size       | params | backend | ngl | threads | n_batch | test  | t/s           |
| ----------------------------- | ---------- | ------ | ------- | --- | ------- | ------- | ----- | ------------- |
| llama 7B I2_S - 2 bpw ternary | 952.51 MiB | 3.05 B | Metal   | 0   | 4       | 1       | tg128 | 57.82 ± 0.39  |
| llama 7B I2_S - 2 bpw ternary | 952.51 MiB | 3.05 B | Metal   | 0   | 12      | 1       | tg128 | 100.56 ± 0.66 |
| llama 7B I2_S - 2 bpw ternary | 952.51 MiB | 3.05 B | Metal   | 0   | 8       | 1       | tg128 | 87.18 ± 0.29  |

Meanwhile, I left a comment on the PR in the submodule. After the submodule's PR is merged, please update the submodule reference in this PR and make the necessary changes. Thank you!

@younesbelkada
Contributor Author

Thank you very much @tsong-ms for running the benchmarks 🙏

@tsong-ms tsong-ms self-requested a review May 21, 2025 08:27
@tsong-ms tsong-ms merged commit 69a2045 into microsoft:main May 21, 2025
1 check passed
@RobertAgee

@younesbelkada @tsong-ms, so unfortunately there's still a bit of work to be done to ensure Falcon support in BitNet. Mostly, it actually doesn't have good model-family support, and I would imagine most models are running at least somewhat unoptimized for their model sizes, especially the newer ones. And with the H family coming, it would behoove the team to get this right so we can truly show folks what 1.58-bit is all about 💪

  1. I'm glad to see you added the Base. I agree most people may not be interested in the Base model, but that's not to say no one would be, particularly research groups. People may, however, be really interested in the GGUF model 🤷 since it can run well even on 6-year-old phones (in my case 😓).

I did notice what I assume is a pretty big mix-up: ALL Falcon models (1, 3, E, GGUF, anything you would try to build) are getting the same tensor tiling dimensions and ModelShapeDict as the 1.58-bit-ized Llama 100B->8B quantized models, which I believe is way too big for the newer E models trained natively in 1.58-bit. If these are not optimized they can hurt model performance, so it's a pretty big deal if they are not accurately assigned.

  2. I built my fixes on top of this PR (Add falcon-e support #268), assuming you built this PR on top of @Eddie-Wang1120 's PR? If not, then they should be reconciled first.

As for your being successful in building, which model did you use, and how did you go about it?

Because I did a direct pull from this PR and followed the steps in the BitNet instructions exactly, as well as the ones on the Falcon E page (which these instructions also have 😬), and there were errors throughout setup and compilation. There are some upstream changes that llama.cpp needs as well (fairly minor but necessary), plus related files for handling Falcon E models. I had to keep compiling until I hit an error, delete the build, recompile, and repeat until I was successful. The actual changes that need to be made are very simple and minor, but working through them all like this was challenging, as I had to make educated guesses about what the correct settings were.

# Clone BitNet repo + llama.cpp
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Checkout specific branch
git fetch origin pull/268/head:pr-268
git checkout pr-268

# Create and activate the conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp

# Use pip inside conda to install dependencies
pip install -r requirements.txt

# Manually download the model and run with local path
huggingface-cli download tiiuae/Falcon-E-3B-Instruct-GGUF --local-dir models/Falcon-E-3B-Instruct-GGUF

# Compilation step
python setup_env.py -md models/Falcon-E-3B-Instruct-GGUF -q i2_s

# 1. setup_env.py
# - needs the Falcon E 3B Instruct GGUF model names added
# - tiling for ALL!!! Falcon models is currently set to 4-layer tiling; native ternary models probably use 3-layer tiling - need to verify the actual sizes and separate by model family (see the gguf snippet below)
# 2. codegen_tl1.py, codegen_tl2.py
# - add correct model shape dict

# most simple errors upstream in llama.cpp
# 3. common.cpp - #include <chrono>, using namespace std::chrono;
#  clock.cpp - #include <chrono>
#  log.cpp - #include <chrono>
#  imatrix.cpp - #include <chrono>
#  perplexity.cpp - #include <chrono>
#  llama-vocab.cpp - add a BPE pre-tokenizer option for FALCON_E
#  llama.cpp - add cases to handle the pre-tokenizer/chat template for 'falcon_e' and 'falcon'


python run_inference.py -m models/Falcon-E-3B-Instruct-GGUF/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
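
For the "need to verify the actual sizes" note above, a minimal sketch (not part of the PR) of how the real tensor shapes could be dumped, assuming the gguf-py package from llama.cpp is installed; if your gguf install doesn't know the i2_s quant type, point it at the full-precision GGUF produced during conversion instead:

```python
# Dump tensor names, shapes, and quant types so the tiling / ModelShapeDict values
# can be checked against the real Falcon-E dimensions. Path is just an example.
from gguf import GGUFReader

reader = GGUFReader("models/Falcon-E-3B-Instruct-GGUF/ggml-model-i2_s.gguf")
for tensor in reader.tensors:
    print(tensor.name, list(tensor.shape), tensor.tensor_type)
```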

I can make a PR tomorrow for the BitNet.cpp fixes, but I'd like someone better than me to confirm the tiling and model shape sizing, as that's pretty critical. I can also submit a PR for the llama.cpp fixes. Let me know what you think.

@younesbelkada younesbelkada deleted the add-falcon-e-final branch May 21, 2025 21:20
@younesbelkada
Contributor Author

Thank you very much @RobertAgee for your detailed analysis and work. Good point about the Base models - I added them in this PR so people can use them!
Regarding your second question, we tested bitnet.cpp against the Falcon3 and Falcon-E series at all sizes, following exactly the steps mentioned in the readme. We did the same on Linux-based setups as well as Mac OSX to make sure things are working right. For the tiling issue, it is indeed surprising that we didn't have to change anything on our end to make it work 🤔. I am not 100% familiar with how this works under the hood; we just assumed it's the same tiling as Llama-8B-Bitnet since we use the exact same way of packing the weights (assuming tiling and packing are related, which I am not sure about, so feel free to correct me here). The first question I have in mind is how we were able to smoothly test everything on our end (it also worked on @tsong-ms's setup) - do you think it's a silent issue, or simply hardware related (it somehow worked on our hardware but not on yours)?
In any case, I think you should submit your PR so that more experienced people can have a look 🙏 Thanks again and looking forward to your PRs!
