
Conversation

@yashuatla
Owner

This PR contains changes from a range of commits from the original repository.

Commit Range: 0e7dadb..6197e9f
Files Changed: 22 (13 programming files)
Programming Ratio: 59.1%

Commits included:

sd983527 and others added 13 commits April 15, 2025 17:07
Add two FAQs for Windows build questions.
…crosoft#204)

* Update CMakeLists.txt

I added a CMake option to compile the llama.cpp server. This update allows us to easily build and deploy the server using BitNet.

* Create run_inference_server.py

Same as run_inference, but for use with llama.cpp's built-in server, for extra convenience.

In particular:
- The build directory is determined based on whether the system is running on Windows or not.
- A list of command-line arguments (`--model`, `-m`, etc.) is created.
- The main argument list is parsed and passed to the `subprocess.run()` method to execute the system command.
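The three steps above can be sketched as follows (the directory layout, binary names, and helper name are illustrative assumptions, not the script's actual values):

```python
import os
import platform

def build_server_command(model_path, build_dir="build", extra_args=None):
    # On Windows, multi-config CMake generators place binaries under a
    # Release subdirectory; elsewhere they land directly in build/bin.
    if platform.system() == "Windows":
        server_path = os.path.join(build_dir, "bin", "Release", "llama-server.exe")
    else:
        server_path = os.path.join(build_dir, "bin", "llama-server")
    # Assemble the final argument list, ready for subprocess.run().
    return [server_path, "-m", model_path] + (extra_args or [])
```

The script would then hand the list to `subprocess.run(build_server_command(...))`.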
…d/aarch64 (microsoft#242)

GCC does not recognize Clang-specific warning flags like
-Wunreachable-code-break and -Wunreachable-code-return, which are passed
by upstream submodules (e.g., ggml). This patch forces CMake to use Clang
via command-line arguments, avoiding the need to patch nested submodules.

This resolves compiler errors without modifying submodule source code.
    def from_name(cls, name: str):
        if name in transformer_configs:
            return cls(**transformer_configs[name])
        config = [k for k in transformer_configs if k in name.upper() or k in name]

🐛 Correctness Issue

Ambiguous model name matching logic.

The current substring matching can select the wrong config if one config name is a substring of another, causing runtime errors or incorrect model conversion.

Current Code (Diff):

-        config = [k for k in transformer_configs if k in name.upper() or k in name]
+        config = [k for k in transformer_configs if k == name.upper() or k == name]
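To make the ambiguity concrete, here is a hedged sketch (the config keys and values are invented for illustration): when several keys match as substrings, preferring the longest match resolves the ambiguity without giving up fuzzy lookup.

```python
# Invented config table for illustration only.
transformer_configs = {
    "7B": {"n_layer": 32},
    "CodeLlama-7B": {"n_layer": 32, "rope_base": 1000000},
}

def from_name(name: str) -> dict:
    if name in transformer_configs:
        return transformer_configs[name]
    # Substring matching can hit several keys ("CodeLlama-7B-Instruct"
    # matches both "7B" and "CodeLlama-7B"); keep the most specific one.
    matches = [k for k in transformer_configs if k in name.upper() or k in name]
    if not matches:
        raise ValueError(f"Unknown model name: {name}")
    return transformer_configs[max(matches, key=len)]
```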

help="Path to input .safetensors file"
)
parser.add_argument(
"--output", type=str, default="./checkpoints/model_state.pt",

🐛 Correctness Issue

Default output path assumes directory exists.

The default output path assumes './checkpoints' directory exists, which will cause file write failures if the directory doesn't exist.

Current Code (Diff):

-        "--output", type=str, default="./checkpoints/model_state.pt",
+        "--output", type=str, default="./model_state.pt",
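An alternative to changing the default is to create the directory on demand; a minimal sketch (the helper name is hypothetical):

```python
import os

def save_state(data: bytes, output_path="./checkpoints/model_state.pt"):
    # Create any missing parent directories so the write cannot fail on a
    # fresh checkout.
    parent = os.path.dirname(output_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(output_path, "wb") as f:
        f.write(data)
```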


model_args_prefill = fast.ModelArgs(use_kernel=False)
model_args_decode = fast.ModelArgs(use_kernel=True)
tokenizer = Tokenizer("./tokenizer.model")

🐛 Correctness Issue

Hardcoded tokenizer path ignores parameter.

The code hardcodes the tokenizer path to './tokenizer.model' despite having an optional tokenizer_path parameter, which will cause failures if the tokenizer isn't in the expected location.

Current Code (Diff):

-        tokenizer = Tokenizer("./tokenizer.model")
+        tokenizer = Tokenizer(tokenizer_path if tokenizer_path is not None else "./tokenizer.model")
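A hedged sketch of honoring the parameter with a fail-fast existence check (the helper name and error message are illustrative):

```python
from pathlib import Path

def resolve_tokenizer_path(tokenizer_path=None):
    # Honor an explicit argument first and only fall back to the bundled
    # default; fail fast with a clear message instead of a late crash.
    path = Path(tokenizer_path) if tokenizer_path else Path("./tokenizer.model")
    if not path.is_file():
        raise FileNotFoundError(f"Tokenizer model not found at {path}")
    return path
```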


eos_id = self.tokenizer.eot_id
for niter in range(1, gen_length):
kv_seqlen.add_(kv_seqlen < max_seq_length)

🐛 Correctness Issue

Implicit boolean arithmetic in in-place tensor op.

The expression kv_seqlen.add_(kv_seqlen < max_seq_length) adds a boolean mask to an integer tensor in place, relying on implicit bool-to-int promotion; an explicit cast makes the intended saturating increment obvious and robust across PyTorch versions.

Current Code (Diff):

-            kv_seqlen.add_(kv_seqlen < max_seq_length)
+            kv_seqlen.add_((kv_seqlen < max_seq_length).to(torch.int))
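The intent of the original line is a saturating increment: each length grows by one per step until it hits the cap, after which the zero-valued mask entries freeze it. The same idea, shown in NumPy with the cast made explicit (values are illustrative):

```python
import numpy as np

# Entries below the cap advance by 1; entries at the cap stay put,
# because (kv_seqlen < max_seq_length) is 0 for them after the cast.
kv_seqlen = np.array([3, 7, 8], dtype=np.int64)
max_seq_length = 8
kv_seqlen += (kv_seqlen < max_seq_length).astype(np.int64)
print(kv_seqlen.tolist())
```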

Comment on lines +252 to +256
if use_sampling:
temp = 0.7
top_p = 0.95
probs = torch.softmax(logits / temp, dim=-1)
next_token = sample_utils.top_p(probs, top_p)

🐛 Correctness Issue

Hardcoded sampling parameters override GenArgs.

The code hardcodes temperature and top_p values instead of using the ones from GenArgs, making those parameters in GenArgs effectively unused.

Current Code (Diff):

-            temp = 0.7
-            top_p = 0.95
-            probs = torch.softmax(logits / temp, dim=-1)
-            next_token = sample_utils.top_p(probs, top_p)
+            temp = self.gen_args.temperature
+            top_p = self.gen_args.top_p
+            probs = torch.softmax(logits / temp, dim=-1)
+            next_token = sample_utils.top_p(probs, top_p)

Comment on lines +273 to +277
if use_sampling:
temp = 0.7
top_p = 0.95
probs = torch.softmax(logits / temp, dim=-1)
next_token = sample_utils.top_p(probs, top_p)

🐛 Correctness Issue

Duplicate hardcoded sampling parameters.

The code duplicates the hardcoded temperature and top_p values in the generation loop, ignoring the GenArgs parameters.

Current Code (Diff):

-                temp = 0.7
-                top_p = 0.95
-                probs = torch.softmax(logits / temp, dim=-1)
-                next_token = sample_utils.top_p(probs, top_p)
+                temp = self.gen_args.temperature
+                top_p = self.gen_args.top_p
+                probs = torch.softmax(logits / temp, dim=-1)
+                next_token = sample_utils.top_p(probs, top_p)
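A minimal sketch of the intended plumbing, assuming GenArgs exposes temperature and top_p fields as the suggested fix implies (the helper name is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class GenArgs:
    temperature: float = 0.7
    top_p: float = 0.95

def sampling_params(gen_args: GenArgs):
    # Read the knobs from GenArgs at the call site instead of
    # re-hardcoding them, so CLI or config overrides take effect.
    return gen_args.temperature, gen_args.top_p
```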

)

import ctypes
bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')

🐛 Correctness Issue

Unsafe Library Loading Without Error Handling.

Loading a shared library with a hardcoded path without error handling will crash the application if the library is missing or incompatible.

Current Code (Diff):

- bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ try:
+     bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ except OSError as e:
+     raise ImportError(f"Failed to load BitNet library: {e}. Please ensure 'bitnet_kernels/libbitnet.so' is available.")
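A hedged sketch of a loader with an explicit existence check and a clear failure mode (the helper name and message wording are assumptions):

```python
import ctypes
import os

def load_kernel_lib(lib_path):
    # Fail early with an actionable message instead of an opaque OSError
    # raised deep inside module import.
    if not os.path.exists(lib_path):
        raise ImportError(f"BitNet kernel library not found at '{lib_path}'; "
                          "build it first (see bitnet_kernels/).")
    try:
        return ctypes.CDLL(lib_path)
    except OSError as e:
        raise ImportError(f"Failed to load '{lib_path}': {e}") from e
```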

return int8_weight


def interleave_weight_int8(qweight, nbits=2):\

🐛 Correctness Issue

Syntax Error: Trailing backslash in function definition.

The function definition has a trailing backslash that will cause a Python syntax error when the code is executed, breaking any dependent code.

Current Code (Diff):

- def interleave_weight_int8(qweight, nbits=2):\
+ def interleave_weight_int8(qweight, nbits=2):

🔄 Dependencies Affected

gpu/convert_checkpoint.py

Function: convert_int8_to_int2

Issue: This function calls convert_weight_int8_to_int2 which uses interleave_weight_int8, and will fail due to the syntax error

Suggestion: No change needed in this file, fix the syntax error in the original function


gpu/test.py

Function: Unknown

Issue: Line 44 uses convert_weight_int8_to_int2 which calls interleave_weight_int8, and will fail due to the syntax error

Suggestion: No change needed in this file, fix the syntax error in the original function


"""
probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
probs_sum = torch.cumsum(probs_sort, dim=-1)
mask = probs_sum - probs_sort > p

🐛 Correctness Issue (disputed)

Mask logic in top-p sampling.

The flagged condition mask = probs_sum - probs_sort > p actually matches the reference Llama implementation: a token is masked only once the cumulative mass before it already exceeds p, so the kept set is the smallest prefix of sorted probabilities whose mass reaches p, and the top-ranked token always survives. The suggested replacement mask = probs_sum > p would also mask the boundary token and can zero out every probability when the top token alone exceeds p; the original expression should be kept.
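For reference, the nucleus-filtering step can be sketched in NumPy (values are illustrative); note how the probs_sum - probs_sort form keeps the token that crosses the threshold, so the top token always survives:

```python
import numpy as np

def top_p_filter(probs, p):
    # Sort probabilities descending and take the cumulative sum.
    order = np.argsort(probs)[::-1]
    probs_sort = probs[order]
    probs_sum = np.cumsum(probs_sort)
    # Drop a token only if the mass *before* it already exceeds p, so the
    # boundary token (and the top token) is always retained.
    mask = (probs_sum - probs_sort) > p
    probs_sort[mask] = 0.0
    probs_sort /= probs_sort.sum()
    # Scatter the renormalized values back to the original positions.
    kept = np.zeros_like(probs)
    kept[order] = probs_sort
    return kept

probs = np.array([0.6, 0.25, 0.1, 0.05])
print(top_p_filter(probs, p=0.5))  # only the 0.6 token survives
```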


time: float

def show(self) -> str:
tps = self.tokens / self.time

🐛 Correctness Issue

Division by Zero Risk.

The code divides by self.time without checking if it's zero, which would cause a runtime crash when displaying statistics.

Current Code (Diff):

-        tps = self.tokens / self.time
+        tps = self.tokens / self.time if self.time > 0 else 0

torch.manual_seed(42)
np.random.seed(42)

bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')

🐛 Correctness Issue

No error handling for C library loading.

Loading the C library without error handling will crash the application if the library file doesn't exist at the specified path.

Current Code (Diff):

- bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ try:
+     bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ except OSError as e:
+     raise RuntimeError(f"Failed to load BitNet library: {e}")

N = input1.shape[0]
K = input1.shape[1] * 4

bitnet_lib.bitlinear_int8xint2(*[ctypes.c_void_p(input0.data_ptr()), ctypes.c_void_p(input1.data_ptr()), ctypes.c_void_p(ret.data_ptr()), ctypes.c_void_p(s.data_ptr()), ctypes.c_void_p(ws.data_ptr()), ctypes.c_int(M), ctypes.c_int(N), ctypes.c_int(K), ctypes.c_void_p(stream.cuda_stream)])

🐛 Correctness Issue

No error handling for the C library function call.

The return value of the call is discarded, so a failure reported through a status code would pass silently. Note that checking a status code is only valid if bitlinear_int8xint2 actually returns one; ctypes assumes a c_int return type by default, and if the launcher returns void its restype should be set to None and errors surfaced through CUDA error checking instead.

Current Code (Diff):

-     bitnet_lib.bitlinear_int8xint2(*[ctypes.c_void_p(input0.data_ptr()), ctypes.c_void_p(input1.data_ptr()), ctypes.c_void_p(ret.data_ptr()), ctypes.c_void_p(s.data_ptr()), ctypes.c_void_p(ws.data_ptr()), ctypes.c_int(M), ctypes.c_int(N), ctypes.c_int(K), ctypes.c_void_p(stream.cuda_stream)])
+     result = bitnet_lib.bitlinear_int8xint2(*[ctypes.c_void_p(input0.data_ptr()), ctypes.c_void_p(input1.data_ptr()), ctypes.c_void_p(ret.data_ptr()), ctypes.c_void_p(s.data_ptr()), ctypes.c_void_p(ws.data_ptr()), ctypes.c_int(M), ctypes.c_int(N), ctypes.c_int(K), ctypes.c_void_p(stream.cuda_stream)])
+     if result != 0:  # Assuming 0 is success, adjust based on actual API
+         raise RuntimeError(f"BitNet library call failed with error code: {result}")
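Independent of status-code checking, declaring the FFI signature up front lets ctypes validate each call instead of silently coercing bad arguments. The signature below is inferred from the call site (five pointers, three ints, one stream pointer) and is an assumption:

```python
import ctypes

def bind_bitlinear(lib):
    # Declare the signature once; ctypes will then reject calls with the
    # wrong number or type of arguments. restype=None assumes a void
    # launcher; use a status type here only if the C API returns one.
    fn = lib.bitlinear_int8xint2
    fn.argtypes = [ctypes.c_void_p] * 5 + [ctypes.c_int] * 3 + [ctypes.c_void_p]
    fn.restype = None
    return fn
```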

# tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
return tokens

def encode_message(self, message: Message, return_target=False) -> List[int]:

🐛 Correctness Issue

Incorrect return type annotation.

The function is annotated to return List[int] but actually returns a tuple (List[int], List[int] or None), which could cause type errors when used by other components.

Current Code (Diff):

-     def encode_message(self, message: Message, return_target=False) -> List[int]:
+     def encode_message(self, message: Message, return_target=False) -> Union[List[int], tuple[List[int], Union[List[int], None]]]:
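Rather than a broad Union, typing.overload can narrow the return type per call site. A self-contained sketch, with Message replaced by plain str and a stand-in tokenizer for illustration:

```python
from typing import List, Literal, Optional, Tuple, overload

class TokenizerSketch:
    @overload
    def encode_message(self, message: str,
                       return_target: Literal[False] = ...) -> List[int]: ...
    @overload
    def encode_message(self, message: str,
                       return_target: Literal[True]) -> Tuple[List[int], Optional[List[int]]]: ...

    def encode_message(self, message: str, return_target: bool = False):
        tokens = [ord(c) for c in message]  # stand-in for real tokenization
        if return_target:
            return tokens, None
        return tokens
```

Type checkers then see List[int] for plain calls and the tuple form only when return_target=True is passed.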


return tokens, None

def encode_dialog_prompt(self, dialog: Dialog, completion=False, return_target=False) -> List[int]:

🐛 Correctness Issue

Incorrect return type annotation.

The function is annotated to return List[int] but can also return a tuple (List[int], List[int]) when return_target=True, which could cause type errors.

Current Code (Diff):

-     def encode_dialog_prompt(self, dialog: Dialog, completion=False, return_target=False) -> List[int]:
+     def encode_dialog_prompt(self, dialog: Dialog, completion=False, return_target=False) -> Union[List[int], tuple[List[int], List[int]]]:

if not os.path.exists(server_path):
server_path = os.path.join(build_dir, "bin", "llama-server")
else:
server_path = os.path.join(build_dir, "bin", "llama-server")

🐛 Correctness Issue

Missing server executable validation.

The script doesn't verify if the server executable exists before attempting to run it, which will cause a runtime error if the file is missing.

Current Code (Diff):

-        server_path = os.path.join(build_dir, "bin", "llama-server")
+        server_path = os.path.join(build_dir, "bin", "llama-server")
+        
+        if not os.path.exists(server_path):
+            print(f"Error: Server executable not found at {server_path}")
+            sys.exit(1)

exit(0)
logging.info("Compiling the code using CMake.")
- run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), [])], log_step="generate_build_files")
+ run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), []), "-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++"], log_step="generate_build_files")

🐛 Correctness Issue

Hardcoded compiler breaks cross-platform compatibility.

Forcing clang/clang++ unconditionally will break builds on systems where Clang is not installed. Since the flags exist to work around GCC rejecting Clang-only warning flags passed by submodules, guarding them behind a Clang availability check (or documenting the Clang requirement) would preserve the workaround without breaking GCC-only systems.

Current Code (Diff):

- run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), []), "-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++"], log_step="generate_build_files")
+ run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), [])], log_step="generate_build_files")
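A middle ground between always forcing Clang and never doing so is to guard the flags behind an availability check; a minimal sketch (the helper name is hypothetical):

```python
import shutil

def clang_override_args():
    # Force Clang only when both compilers are actually installed;
    # otherwise fall back to CMake's default compiler detection so
    # GCC-only systems still configure.
    if shutil.which("clang") and shutil.which("clang++"):
        return ["-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++"]
    return []
```

The returned list would simply be spliced into the existing cmake invocation.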



6 participants