fix test/ops/self_attention.py #5
Merged
PanZezhong1725 merged 2 commits into InfiniTensor:main, Aug 19, 2025
Conversation
Collaborator
Many thanks, the issue has been reproduced.
PanZezhong1725 requested changes on Aug 19, 2025
PanZezhong1725 approved these changes on Aug 19, 2025
ge3m0r pushed a commit to ge3m0r/llaisys that referenced this pull request on Jan 18, 2026
fix test/ops/self_attention.py
KevinSusan pushed a commit to KevinSusan/llaisys_tt that referenced this pull request on Mar 16, 2026
…itance, logging)
- Fix InfiniTensor#1: Replace _session_worker dict with OrderedDict LRU (max_sticky_sessions=10000)
- Fix InfiniTensor#2: Add best-effort TOCTOU comment on KV-aware routing
- Fix InfiniTensor#3: Add logger.debug for tokenize failures, shallow-copy payload in submit()
- Fix InfiniTensor#4: KVCachePool(IKVCachePool), ChatService(IInferenceService) explicit inheritance
- Fix InfiniTensor#5: Merge double lock in request_stop()
- Fix InfiniTensor#6: Clean _prompt_tokens from payload after routing
KevinSusan pushed a commit to KevinSusan/llaisys_tt that referenced this pull request on Mar 16, 2026
…sor parallelism
- Communication layer: C API (comm.h), C++ dispatcher, NCCL backend
- commInit accepts external unique ID for multi-rank initialization
- llaisysCommGenerateUniqueId API for external ID generation
- Decoder AllReduce: after attn_o and mlp_down projections (Megatron-style)
- llaisysQwen2ModelSetTensorParallel C API
- Python weight splitting (column/row split for Megatron-style TP)
- Multi-process launcher (launch_tp.py + _tp_worker.py)
- Unit tests (test_comm_api.py) and integration tests (test_allreduce.py)
- Documentation: comm_design.md, PROGRESS.md, PROJECT_STATUS.md updated
xsmccc added a commit to xsmccc/llaisys that referenced this pull request on Apr 6, 2026
- §1: Add KV Cache INT8 (InfiniTensor#4) and CUDA Graph (InfiniTensor#5) to project intro (7→9 optimizations)
- §32: Rewrite optimization InfiniTensor#8 from 'failed CUDA Graph' to successful KV Cache INT8 (+55%)
- §32: Add optimization InfiniTensor#9 CUDA Graph static capture (+12.2%, 118→132 tok/s)
- §32: Update acceleration breakdown table (330× complete, FP32 4.4×)
- §24.5: Fix perf numbers (57.3→57.5, FP32 33.6→~30, add final 132 tok/s)
- §40: Update quantization Q&A with full pipeline data
- §43: Rewrite cudaGraph section with project-specific implementation details
- Clean up duplicate INT4 paragraph, fix title counts (七→九项)
After implementing the KV cache, I found that the tokens produced in the Prefill stage were correct but the tokens produced in the Decode stage were wrong. By inspecting the tensors, I traced the problem to the self-attention part and eventually localized it to the softmax: the softmax portion of my self-attention operator was incorrect when qlen != kvlen (i.e., when the KV cache is in use), yet it still passed the test in test/ops/self_attention.py. After adding past_len = total_len - seqlen to my implementation, inference produced correct results, but the self-attention test no longer passed. From this I conclude that test/ops/self_attention.py itself has a bug.
Analysis:
The mask content in the previous test:
The mask content that would actually be correct:
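The corrected mask logic described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name (`causal_mask`), not the actual test code; it assumes the past_len = total_len - seqlen offset mentioned earlier, so that query row i (global position past_len + i) may attend to all cached keys plus its own position:

```python
import torch

def causal_mask(seqlen: int, total_len: int) -> torch.Tensor:
    # With a KV cache, the query covers only the last `seqlen` positions
    # of a sequence of length `total_len`; earlier positions are cached.
    past_len = total_len - seqlen
    mask = torch.zeros(seqlen, total_len)
    # Query row i sits at global position past_len + i, so it must be
    # blocked from any key position j > past_len + i.
    blocked = (torch.arange(seqlen).unsqueeze(1) + past_len
               < torch.arange(total_len).unsqueeze(0))
    mask[blocked] = float("-inf")
    return mask

# seqlen=2, total_len=4: row 0 attends to keys 0..2, row 1 to keys 0..3.
# A mask built as if qlen == kvlen (the old test's behavior) would
# instead anchor the diagonal at the top-left and wrongly hide keys.
print(causal_mask(2, 4))
```

The key difference from the original test is the past_len offset: without it, the triangular mask is anchored as if the query started at position 0, which only coincides with the correct mask when qlen == kvlen.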

After fixing this part of the logic, both the self_attention and infer CI tests pass.
Below is a screenshot of the passing CI run:
