
fix test/ops/self_attention.py#5

Merged
PanZezhong1725 merged 2 commits into InfiniTensor:main from liulog:fix-self_attention-test
Aug 19, 2025

Conversation

@liulog
Contributor

@liulog liulog commented Aug 18, 2025

After implementing the kvcache, I found that the tokens produced in the Prefill stage were correct but the tokens produced in the Decode stage were wrong. Inspecting the tensors showed the problem was in the self-attention part, and I eventually traced it to the softmax: the softmax portion of my self-attention operator was incorrect when qlen != kvlen (i.e. when the kvcache is in use), yet it still passed the test in test/ops/self_attention.py. After adding past_len = total_len - seqlen to my implementation, inference became correct, but the self-attention test then failed. From this I concluded that test/ops/self_attention.py itself is also buggy.

Analysis:

# Previous implementation
temp_mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=0)
# Fixed implementation
temp_mask = torch.ones(S, S, dtype=torch.bool).tril(diagonal=0)[-L:, :]

The mask content in the previous test:

image

The correct content the mask should have:
image

After fixing this part of the logic, both the self-attention and infer CI tests pass.

Screenshot of the passing CI:
image
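Since the screenshots above are not reproduced here, the difference between the two masks can be sketched in plain Python. The helper names below are hypothetical (not part of the test file); the logic mirrors the two `tril` constructions. With qlen `L`, kvlen `S`, and `past_len = S - L`, query row `i` sits at absolute position `past_len + i` and should attend to every key `j <= past_len + i`:

```python
def causal_mask(L, S):
    """Correct causal mask for qlen L and kvlen S (kvcache case).

    Mirrors torch.ones(S, S, dtype=torch.bool).tril()[-L:, :]:
    query row i is at absolute position (S - L) + i and may attend
    to every key position j <= (S - L) + i.
    """
    past_len = S - L
    return [[j <= past_len + i for j in range(S)] for i in range(L)]

def old_mask(L, S):
    """Old (buggy) mask, mirroring torch.ones(L, S, dtype=torch.bool).tril():
    query row i may only attend to keys j <= i, ignoring past_len."""
    return [[j <= i for j in range(S)] for i in range(L)]

# Decode step: one new token (L=1) against a cache of 4 keys (S=4).
print(old_mask(1, 4))     # [[True, False, False, False]] -- wrongly blocks cached keys
print(causal_mask(1, 4))  # [[True, True, True, True]]    -- sees all past keys
```

In the prefill case (L == S, past_len == 0) the two constructions coincide, which is why the original test never caught the bug.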

@PanZezhong1725
Collaborator

Thank you very much; the issue has been reproduced.

@PanZezhong1725 PanZezhong1725 merged commit 2945515 into InfiniTensor:main Aug 19, 2025
0 of 2 checks passed
@liulog liulog deleted the fix-self_attention-test branch August 19, 2025 09:32
ge3m0r pushed a commit to ge3m0r/llaisys that referenced this pull request Jan 18, 2026
KevinSusan pushed a commit to KevinSusan/llaisys_tt that referenced this pull request Mar 16, 2026
…itance, logging)

- Fix InfiniTensor#1: Replace _session_worker dict with OrderedDict LRU (max_sticky_sessions=10000)
- Fix InfiniTensor#2: Add best-effort TOCTOU comment on KV-aware routing
- Fix InfiniTensor#3: Add logger.debug for tokenize failures, shallow-copy payload in submit()
- Fix InfiniTensor#4: KVCachePool(IKVCachePool), ChatService(IInferenceService) explicit inheritance
- Fix InfiniTensor#5: Merge double lock in request_stop()
- Fix InfiniTensor#6: Clean _prompt_tokens from payload after routing
KevinSusan pushed a commit to KevinSusan/llaisys_tt that referenced this pull request Mar 16, 2026
…sor parallelism

- Communication layer: C API (comm.h), C++ dispatcher, NCCL backend
- commInit accepts external unique ID for multi-rank initialization
- llaisysCommGenerateUniqueId API for external ID generation
- Decoder AllReduce: after attn_o and mlp_down projections (Megatron-style)
- llaisysQwen2ModelSetTensorParallel C API
- Python weight splitting (column/row split for Megatron-style TP)
- Multi-process launcher (launch_tp.py + _tp_worker.py)
- Unit tests (test_comm_api.py) and integration tests (test_allreduce.py)
- Documentation: comm_design.md, PROGRESS.md, PROJECT_STATUS.md updated
Copilot AI mentioned this pull request Mar 16, 2026
@StevenFryto StevenFryto mentioned this pull request Mar 17, 2026
xsmccc added a commit to xsmccc/llaisys that referenced this pull request Apr 6, 2026
- §1: Add KV Cache INT8 (InfiniTensor#4) and CUDA Graph (InfiniTensor#5) to project intro (7→9 optimizations)
- §32: Rewrite optimization InfiniTensor#8 from 'failed CUDA Graph' to successful KV Cache INT8 (+55%)
- §32: Add optimization InfiniTensor#9 CUDA Graph static capture (+12.2%, 118→132 tok/s)
- §32: Update acceleration breakdown table (330× complete, FP32 4.4×)
- §24.5: Fix perf numbers (57.3→57.5, FP32 33.6→~30, add final 132 tok/s)
- §40: Update quantization Q&A with full pipeline data
- §43: Rewrite cudaGraph section with project-specific implementation details
- Clean up duplicate INT4 paragraph, fix title counts (七→九项)
