Feature implementation from commits 0e7dadb..6197e9f #1
base: feature-base-1
Conversation
add two FAQs for Windows build questions.
add third-party demo
refine readme
Fix model architecture name
…crosoft#204)
* Update CMakeLists.txt: add a CMake option to compile the llama.cpp server, so the server can be built and deployed with BitNet.
* Create run_inference_server.py: same as run_inference, but targets llama.cpp's built-in server. In particular:
  - The build directory is determined based on whether the system is running on Windows.
  - A list of arguments (`--model`, `-m`, etc.) is assembled.
  - The argument list is parsed and passed to `subprocess.run()` to execute the command.
…d/aarch64 (microsoft#242)
GCC does not recognize Clang-specific warning flags such as -Wunreachable-code-break and -Wunreachable-code-return, which are passed by upstream submodules (e.g., ggml). This patch forces CMake to use Clang via command-line arguments, which resolves the compiler errors without patching nested submodule sources.
def from_name(cls, name: str):
    if name in transformer_configs:
        return cls(**transformer_configs[name])
    config = [k for k in transformer_configs if k in name.upper() or k in name]
🐛 Correctness Issue
Ambiguous model name matching logic.
The current substring matching can select the wrong config if one config name is a substring of another, causing runtime errors or incorrect model conversion.
Current Code (Diff):
- config = [k for k in transformer_configs if k in name.upper() or k in name]
+ config = [k for k in transformer_configs if k == name.upper() or k == name]
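For illustration, a minimal sketch of the ambiguity; the config names and model name here are hypothetical, not taken from the PR:

transformer_configs = {"7B": "config-a", "7B-chat": "config-b"}  # hypothetical names

name = "llama-7B-chat"
matches = [k for k in transformer_configs if k in name.upper() or k in name]
print(matches)  # ['7B', '7B-chat'] -- both match, so picking matches[0] selects the wrong config

exact = [k for k in transformer_configs if k == name.upper() or k == name]
print(exact)    # [] -- exact matching removes the ambiguity, but drops partial matching entirely

Note that the suggested fix is stricter, not just safer: names that previously resolved via substring matching would no longer resolve at all.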
    help="Path to input .safetensors file"
)
parser.add_argument(
    "--output", type=str, default="./checkpoints/model_state.pt",
🐛 Correctness Issue
Default output path assumes directory exists.
The default output path assumes the './checkpoints' directory already exists; the file write will fail if it doesn't.
Current Code (Diff):
- "--output", type=str, default="./checkpoints/model_state.pt",
+ "--output", type=str, default="./model_state.pt",
model_args_prefill = fast.ModelArgs(use_kernel=False)
model_args_decode = fast.ModelArgs(use_kernel=True)
tokenizer = Tokenizer("./tokenizer.model")
🐛 Correctness Issue
Hardcoded tokenizer path ignores parameter.
The code hardcodes the tokenizer path to './tokenizer.model' despite having an optional tokenizer_path parameter, which will cause failures if the tokenizer isn't in the expected location.
Current Code (Diff):
- tokenizer = Tokenizer("./tokenizer.model")
+ tokenizer = Tokenizer(tokenizer_path if tokenizer_path is not None else "./tokenizer.model")
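A tidier variant is to give the parameter itself the default, so no branch is needed at the call site; a sketch only, since the enclosing function's signature is not shown in the excerpt (Tokenizer comes from the surrounding module):

def load_model(tokenizer_path: str = "./tokenizer.model"):
    # Callers can override the location; the default preserves current behavior.
    tokenizer = Tokenizer(tokenizer_path)
    return tokenizer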
eos_id = self.tokenizer.eot_id
for niter in range(1, gen_length):
    kv_seqlen.add_(kv_seqlen < max_seq_length)
🐛 Correctness Issue
Implicit boolean tensor addition.
kv_seqlen.add_(kv_seqlen < max_seq_length) adds a boolean mask to kv_seqlen in place. PyTorch promotes the bool tensor to kv_seqlen's dtype here, so the line effectively implements a clamped increment (add 1 while below max_seq_length) rather than erroring, but the implicit promotion obscures that intent; the explicit cast below makes it unambiguous.
Current Code (Diff):
- kv_seqlen.add_(kv_seqlen < max_seq_length)
+ kv_seqlen.add_((kv_seqlen < max_seq_length).to(torch.int))
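A small demonstration of the promotion behavior (as observed on recent PyTorch releases; treat this as a sketch):

import torch

kv_seqlen = torch.tensor([3, 5], dtype=torch.int32)
max_seq_length = 5

# The bool mask promotes to int32 under add_, so entries below the cap are
# incremented and entries already at the cap stay clamped:
kv_seqlen.add_(kv_seqlen < max_seq_length)
print(kv_seqlen)  # tensor([4, 5], dtype=torch.int32)

# Equivalent, with the cast spelled out:
kv_seqlen.add_((kv_seqlen < max_seq_length).to(torch.int32))
print(kv_seqlen)  # tensor([5, 5], dtype=torch.int32)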
if use_sampling:
    temp = 0.7
    top_p = 0.95
    probs = torch.softmax(logits / temp, dim=-1)
    next_token = sample_utils.top_p(probs, top_p)
🐛 Correctness Issue
Hardcoded sampling parameters override GenArgs.
The code hardcodes temperature and top_p values instead of using the ones from GenArgs, making those parameters in GenArgs effectively unused.
Current Code (Diff):
- temp = 0.7
- top_p = 0.95
- probs = torch.softmax(logits / temp, dim=-1)
- next_token = sample_utils.top_p(probs, top_p)
+ temp = self.gen_args.temperature
+ top_p = self.gen_args.top_p
+ probs = torch.softmax(logits / temp, dim=-1)
+ next_token = sample_utils.top_p(probs, top_p)
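For reference, a minimal sketch of the intended wiring; the field names temperature and top_p come from the diff, while the GenArgs class shape is an assumption:

from dataclasses import dataclass

import torch

@dataclass
class GenArgs:
    temperature: float = 0.7
    top_p: float = 0.95

gen_args = GenArgs(temperature=1.0, top_p=0.9)
logits = torch.randn(4, 32000)
# Sampling parameters now come from configuration instead of hardcoded literals.
probs = torch.softmax(logits / gen_args.temperature, dim=-1)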
if use_sampling:
    temp = 0.7
    top_p = 0.95
    probs = torch.softmax(logits / temp, dim=-1)
    next_token = sample_utils.top_p(probs, top_p)
🐛 Correctness Issue
Duplicate hardcoded sampling parameters.
The code duplicates the hardcoded temperature and top_p values in the generation loop, ignoring the GenArgs parameters.
Current Code (Diff):
- temp = 0.7
- top_p = 0.95
- probs = torch.softmax(logits / temp, dim=-1)
- next_token = sample_utils.top_p(probs, top_p)
+ temp = self.gen_args.temperature
+ top_p = self.gen_args.top_p
+ probs = torch.softmax(logits / temp, dim=-1)
+ next_token = sample_utils.top_p(probs, top_p)
)

import ctypes
bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
🐛 Correctness Issue
Unsafe Library Loading Without Error Handling.
Loading a shared library with a hardcoded path without error handling will crash the application if the library is missing or incompatible.
Current Code (Diff):
- bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ try:
+     bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ except OSError as e:
+     raise ImportError(f"Failed to load BitNet library: {e}. Please ensure 'bitnet_kernels/libbitnet.so' is available.")
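Beyond the try/except, a common hardening is to resolve the library relative to the module file rather than the current working directory, so the import works no matter where the process is launched from; a sketch under that assumption:

import ctypes
import os

# Resolve relative to this file so loading does not depend on the cwd.
_LIB_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "libbitnet.so")
try:
    bitnet_lib = ctypes.CDLL(_LIB_PATH)
except OSError as e:
    raise ImportError(f"Failed to load BitNet library at {_LIB_PATH}: {e}")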
return int8_weight

def interleave_weight_int8(qweight, nbits=2):\
🐛 Correctness Issue
Syntax Error: Trailing backslash in function definition.
The function definition has a trailing backslash that will cause a Python syntax error when the code is executed, breaking any dependent code.
Current Code (Diff):
- def interleave_weight_int8(qweight, nbits=2):\
+ def interleave_weight_int8(qweight, nbits=2):
🔄 Dependencies Affected
gpu/convert_checkpoint.py
Function: convert_int8_to_int2
Issue: This function calls convert_weight_int8_to_int2 which uses interleave_weight_int8, and will fail due to the syntax error
Suggestion: No change needed in this file, fix the syntax error in the original function
gpu/test.py
Function: Unknown
Issue: Line 44 uses convert_weight_int8_to_int2 which calls interleave_weight_int8, and will fail due to the syntax error
Suggestion: No change needed in this file, fix the syntax error in the original function
| """ | ||
| probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True) | ||
| probs_sum = torch.cumsum(probs_sort, dim=-1) | ||
| mask = probs_sum - probs_sort > p |
🐛 Correctness Issue
Mask logic for top-p sampling.
At first glance the condition looks reversed, but probs_sum - probs_sort > p is the standard reference formulation: it masks only tokens whose cumulative probability before them already exceeds p, so the token that crosses the threshold is kept and at least one token always survives. The suggested probs_sum > p below also masks the crossing token and, when the single largest probability alone exceeds p, masks every token, leaving nothing to renormalize.
Current Code (Diff):
- mask = probs_sum - probs_sort > p
+ mask = probs_sum > p
🔄 Dependencies Affected
gpu/generate.py
Function: generate_text
Issue: If the mask is changed as suggested, top_p sampling can zero out every candidate token and break text generation
Suggestion: Keep the reference mask in sample_utils.py
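A worked example of the two masks on a sorted distribution (values chosen purely for illustration):

import torch

probs_sort = torch.tensor([0.80, 0.15, 0.05])  # sorted, descending
p = 0.7
probs_sum = torch.cumsum(probs_sort, dim=-1)   # tensor([0.80, 0.95, 1.00])

mask_ref = probs_sum - probs_sort > p  # tensor([False, True, True]): keeps the crossing token
mask_alt = probs_sum > p               # tensor([True, True, True]): masks every token

# After zeroing masked entries and renormalizing, mask_ref leaves the top
# token to sample from; mask_alt leaves an all-zero distribution.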
time: float

def show(self) -> str:
    tps = self.tokens / self.time
🐛 Correctness Issue
Division by Zero Risk.
The code divides by self.time without checking if it's zero, which would cause a runtime crash when displaying statistics.
Current Code (Diff):
- tps = self.tokens / self.time
+ tps = self.tokens / self.time if self.time > 0 else 0
torch.manual_seed(42)
np.random.seed(42)

bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
🐛 Correctness Issue
No error handling for C library loading.
Loading the C library without error handling will crash the application if the library file doesn't exist at the specified path.
Current Code (Diff):
- bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ try:
+     bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ except OSError as e:
+     raise RuntimeError(f"Failed to load BitNet library: {e}")
N = input1.shape[0]
K = input1.shape[1] * 4

bitnet_lib.bitlinear_int8xint2(*[ctypes.c_void_p(input0.data_ptr()), ctypes.c_void_p(input1.data_ptr()), ctypes.c_void_p(ret.data_ptr()), ctypes.c_void_p(s.data_ptr()), ctypes.c_void_p(ws.data_ptr()), ctypes.c_int(M), ctypes.c_int(N), ctypes.c_int(K), ctypes.c_void_p(stream.cuda_stream)])
🐛 Correctness Issue
No error handling for C library function call.
The C library function call has no error checking, which could lead to silent failures or crashes if the function encounters an error.
Current Code (Diff):
- bitnet_lib.bitlinear_int8xint2(*[ctypes.c_void_p(input0.data_ptr()), ctypes.c_void_p(input1.data_ptr()), ctypes.c_void_p(ret.data_ptr()), ctypes.c_void_p(s.data_ptr()), ctypes.c_void_p(ws.data_ptr()), ctypes.c_int(M), ctypes.c_int(N), ctypes.c_int(K), ctypes.c_void_p(stream.cuda_stream)])
+ result = bitnet_lib.bitlinear_int8xint2(*[ctypes.c_void_p(input0.data_ptr()), ctypes.c_void_p(input1.data_ptr()), ctypes.c_void_p(ret.data_ptr()), ctypes.c_void_p(s.data_ptr()), ctypes.c_void_p(ws.data_ptr()), ctypes.c_int(M), ctypes.c_int(N), ctypes.c_int(K), ctypes.c_void_p(stream.cuda_stream)])
+ if result != 0:  # Assuming 0 is success, adjust based on actual API
+     raise RuntimeError(f"BitNet library call failed with error code: {result}")
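One caveat on this suggestion: ctypes assumes every foreign function returns a C int unless told otherwise, so if bitlinear_int8xint2 is declared void in C, the captured result is an arbitrary value and the check would misfire. The return type should be declared first; a sketch, where the kernel's actual C signature is an assumption:

import ctypes

bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')

# If the kernel returns a status code, declare it so the check is meaningful;
# if it is void, set restype = None and skip the status check entirely.
bitnet_lib.bitlinear_int8xint2.restype = ctypes.c_int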
    # tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
    return tokens

def encode_message(self, message: Message, return_target=False) -> List[int]:
🐛 Correctness Issue
Incorrect return type annotation.
The function is annotated to return List[int] but actually returns a tuple (List[int], List[int] or None), which could cause type errors when used by other components.
Current Code (Diff):
- def encode_message(self, message: Message, return_target=False) -> List[int]:
+ def encode_message(self, message: Message, return_target=False) -> Union[List[int], tuple[List[int], Union[List[int], None]]]:
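If the method in fact always returns a pair, as the description above suggests, a plain Tuple annotation is simpler than the Union and stays consistent with the typing-module style (List[int]) the file already uses; a sketch:

from typing import List, Optional, Tuple

def encode_message(self, message, return_target: bool = False) -> Tuple[List[int], Optional[List[int]]]:
    ...

Either way, note that the one-line diff above also needs Union (or Tuple/Optional) imported from typing, which it does not add.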
return tokens, None

def encode_dialog_prompt(self, dialog: Dialog, completion=False, return_target=False) -> List[int]:
🐛 Correctness Issue
Incorrect return type annotation.
The function is annotated to return List[int] but can also return a tuple (List[int], List[int]) when return_target=True, which could cause type errors.
Current Code (Diff):
- def encode_dialog_prompt(self, dialog: Dialog, completion=False, return_target=False) -> List[int]:
+ def encode_dialog_prompt(self, dialog: Dialog, completion=False, return_target=False) -> Union[List[int], tuple[List[int], List[int]]]:
if not os.path.exists(server_path):
    server_path = os.path.join(build_dir, "bin", "llama-server")
else:
    server_path = os.path.join(build_dir, "bin", "llama-server")
🐛 Correctness Issue
Missing server executable validation.
The script doesn't verify that the server executable exists before attempting to run it, which will cause a runtime error if the file is missing. (Note also that both branches of the preceding if/else assign the same path, so the existence check there is a no-op.)
Current Code (Diff):
- server_path = os.path.join(build_dir, "bin", "llama-server")
+ server_path = os.path.join(build_dir, "bin", "llama-server")
+
+ if not os.path.exists(server_path):
+     print(f"Error: Server executable not found at {server_path}")
+     sys.exit(1)
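Since run_inference_server.py already branches on Windows for the build directory (per the commit message above), the existence check should account for the .exe suffix as well; a sketch, with build_dir assumed:

import os
import platform
import sys

build_dir = "build"  # assumed; the script derives this from the platform
exe_name = "llama-server.exe" if platform.system() == "Windows" else "llama-server"
server_path = os.path.join(build_dir, "bin", exe_name)

if not os.path.exists(server_path):
    print(f"Error: Server executable not found at {server_path}")
    sys.exit(1)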
exit(0)
logging.info("Compiling the code using CMake.")
- run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), [])], log_step="generate_build_files")
+ run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), []), "-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++"], log_step="generate_build_files")
🐛 Correctness Issue
Hardcoded compiler breaks cross-platform compatibility.
Forcing clang/clang++ will break builds on systems where these compilers aren't available or where specific compiler requirements exist. Note, however, that commit microsoft#242 (summarized above) adds these flags deliberately, to work around GCC rejecting Clang-only warning flags passed by the ggml submodule on ARM64; reverting them outright would reintroduce that build failure, so a conditional fix may be preferable.
Current Code (Diff):
- run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), []), "-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++"], log_step="generate_build_files")
+ run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), [])], log_step="generate_build_files")
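A middle ground between always and never forcing Clang is to make it conditional; in this sketch the detection logic is an assumption, not code from the PR:

import platform
import shutil

cmake_args = ["cmake", "-B", "build"]
# Force Clang only where the GCC warning-flag problem actually bites (ARM64),
# and only if clang is actually installed.
if platform.machine() in ("aarch64", "arm64") and shutil.which("clang"):
    cmake_args += ["-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++"]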
This PR contains changes from a range of commits from the original repository.
Commit Range: 0e7dadb..6197e9f
Files Changed: 22 (13 programming files)
Programming Ratio: 59.1%
Commits included:
... and 3 more commits