fix test/ops/self_attention.py #5
Merged
PanZezhong1725 merged 2 commits into InfiniTensor:main, Aug 19, 2025
Conversation
Collaborator
Many thanks, the issue has been reproduced.
PanZezhong1725 requested changes on Aug 19, 2025
PanZezhong1725 approved these changes on Aug 19, 2025
ge3m0r pushed a commit to ge3m0r/llaisys that referenced this pull request on Jan 18, 2026
fix test/ops/self_attention.py
KevinSusan pushed a commit to KevinSusan/llaisys_tt that referenced this pull request on Mar 16, 2026
…itance, logging)
- Fix InfiniTensor#1: Replace _session_worker dict with OrderedDict LRU (max_sticky_sessions=10000)
- Fix InfiniTensor#2: Add best-effort TOCTOU comment on KV-aware routing
- Fix InfiniTensor#3: Add logger.debug for tokenize failures, shallow-copy payload in submit()
- Fix InfiniTensor#4: KVCachePool(IKVCachePool), ChatService(IInferenceService) explicit inheritance
- Fix InfiniTensor#5: Merge double lock in request_stop()
- Fix InfiniTensor#6: Clean _prompt_tokens from payload after routing
KevinSusan pushed a commit to KevinSusan/llaisys_tt that referenced this pull request on Mar 16, 2026
…sor parallelism
- Communication layer: C API (comm.h), C++ dispatcher, NCCL backend
- commInit accepts external unique ID for multi-rank initialization
- llaisysCommGenerateUniqueId API for external ID generation
- Decoder AllReduce: after attn_o and mlp_down projections (Megatron-style)
- llaisysQwen2ModelSetTensorParallel C API
- Python weight splitting (column/row split for Megatron-style TP)
- Multi-process launcher (launch_tp.py + _tp_worker.py)
- Unit tests (test_comm_api.py) and integration tests (test_allreduce.py)
- Documentation: comm_design.md, PROGRESS.md, PROJECT_STATUS.md updated
xsmccc added a commit to xsmccc/llaisys that referenced this pull request on Apr 6, 2026
- §1: Add KV Cache INT8 (InfiniTensor#4) and CUDA Graph (InfiniTensor#5) to project intro (7→9 optimizations)
- §32: Rewrite optimization InfiniTensor#8 from 'failed CUDA Graph' to successful KV Cache INT8 (+55%)
- §32: Add optimization InfiniTensor#9 CUDA Graph static capture (+12.2%, 118→132 tok/s)
- §32: Update acceleration breakdown table (330× complete, FP32 4.4×)
- §24.5: Fix perf numbers (57.3→57.5, FP32 33.6→~30, add final 132 tok/s)
- §40: Update quantization Q&A with full pipeline data
- §43: Rewrite cudaGraph section with project-specific implementation details
- Clean up duplicate INT4 paragraph, fix title counts (七→九项)
After implementing the KV cache, I found that the tokens produced in the Prefill stage were correct but the tokens produced in the Decode stage were wrong. By inspecting the tensors, I traced the problem to the self-attention part and eventually localized it to the softmax: the softmax portion of my self-attention operator was incorrect when qlen != kvlen (i.e., when the KV cache is in use), yet it still passed the test in test/ops/self_attention.py. After adding past_len = total_len - seqlen to my implementation, inference produced correct results, but the self-attention test no longer passed. From this I conclude that test/ops/self_attention.py itself has a bug.
Analysis:
The mask content in the previous test:
The mask content that would actually be correct:
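The corrected mask logic described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name (`causal_mask`), not the actual test code; it assumes the past_len = total_len - seqlen offset mentioned earlier, so that query row i (global position past_len + i) may attend to all cached keys plus its own position:

```python
import torch

def causal_mask(seqlen: int, total_len: int) -> torch.Tensor:
    # With a KV cache, the query covers only the last `seqlen` positions
    # of a sequence of length `total_len`; earlier positions are cached.
    past_len = total_len - seqlen
    mask = torch.zeros(seqlen, total_len)
    # Query row i sits at global position past_len + i, so it must be
    # blocked from any key position j > past_len + i.
    blocked = (torch.arange(seqlen).unsqueeze(1) + past_len
               < torch.arange(total_len).unsqueeze(0))
    mask[blocked] = float("-inf")
    return mask

# seqlen=2, total_len=4: row 0 attends to keys 0..2, row 1 to keys 0..3.
# A mask built as if qlen == kvlen (the old test's behavior) would
# instead anchor the diagonal at the top-left and wrongly hide keys.
print(causal_mask(2, 4))
```

The key difference from the original test is the past_len offset: without it, the triangular mask is anchored as if the query started at position 0, which only coincides with the correct mask when qlen == kvlen.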

After fixing this part of the logic, both the self_attention and infer CI tests pass.
Below is a screenshot of the passing CI run:
