Add SSE implementation to Matrix4x4.GetDeterminant by alexcovington · Pull Request #123954 · dotnet/runtime

alexcovington · 2026-02-03T18:32:12Z

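This PR adds an SSE implementation for `Matrix4x4.GetDeterminant`. Performance improves by about 15% by vectorizing the operations.

Performance results are from the existing `Perf_Matrix4x4.GetDeterminantBenchmark` [here](https://github.com/dotnet/performance/blob/f702e197f8dbf28294fc0483e7317f140f2fd6cc/src/benchmarks/micro/libraries/System.Numerics.Vectors/Perf_Matrix4x4.cs#L148):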

| Method                  | Job        | Toolchain                   | Mean     | Error     | StdDev    | Median   | Min      | Max      | Ratio | Allocated | Alloc Ratio |
|------------------------ |----------- |---------------------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|----------:|------------:|
| GetDeterminantBenchmark | Job-KGYWNE | \base\Core_Root\corerun.exe | 3.487 ns | 0.0178 ns | 0.0167 ns | 3.483 ns | 3.452 ns | 3.518 ns |  1.00 |         - |          NA |
| GetDeterminantBenchmark | Job-VWCPYO | \diff\Core_Root\corerun.exe | 2.971 ns | 0.0106 ns | 0.0099 ns | 2.970 ns | 2.960 ns | 2.992 ns |  0.85 |         - |          NA |
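For readers who want to see the arithmetic being vectorized, here is a minimal scalar sketch of a 4x4 determinant by cofactor expansion along the first row, reusing the six 2x2 sub-determinants of the last two rows. It is written in C++ for illustration only; the function name `determinant4x4` and the row-major `float[16]` layout are assumptions, and this is not the code in the PR.

```cpp
// Scalar reference: determinant of a row-major 4x4 matrix.
// Hypothetical helper for illustration; not taken from the PR.
float determinant4x4(const float m[16]) {
    const float a = m[0],  b = m[1],  c = m[2],  d = m[3];
    const float e = m[4],  f = m[5],  g = m[6],  h = m[7];
    const float i = m[8],  j = m[9],  k = m[10], l = m[11];
    const float n0 = m[12], n = m[13], o = m[14], p = m[15]; // n0 is the first entry of row 3

    // The six 2x2 sub-determinants built from rows 2 and 3.
    const float kp_lo = k * p - l * o;
    const float jp_ln = j * p - l * n;
    const float jo_kn = j * o - k * n;
    const float ip_lm = i * p - l * n0;
    const float io_km = i * o - k * n0;
    const float in_jm = i * n - j * n0;

    // Cofactor expansion along the first row, with alternating signs.
    return a * (f * kp_lo - g * jp_ln + h * jo_kn)
         - b * (e * kp_lo - g * ip_lm + h * io_km)
         + c * (e * jp_ln - f * ip_lm + h * in_jm)
         - d * (e * jo_kn - f * io_km + g * in_jm);
}
```

The four parenthesized sums have the same shape, which is what makes the computation a natural fit for 4-lane SIMD: each sum can occupy one lane.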

dotnet-policy-service · 2026-02-03T18:38:54Z

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

src/libraries/System.Private.CoreLib/src/System/Numerics/Matrix4x4.Impl.cs

grbell-ms · 2026-02-03T23:22:03Z

Reordering the shuffles so that they happen just before they are needed reduces register pressure, code size, and execution time on older systems with fewer xmm registers.

sharplab

tannergooding · 2026-02-03T23:35:46Z

Overall we really only care about the codegen for x86-64-v3 (AVX2+) and later hardware, which is the vast majority of the hardware users will encounter. It is acceptable if the codegen for x86-64-v2 (SSE4.2) is a bit suboptimal; it will still be much faster than the scalar version.

I'd be fine if we wanted to update this to break it apart into 2 groups of 6 shuffles instead; that would notably also remove the single two-argument shuffle we have. DirectX Math has a solid implementation and is already used for some other code paths in Matrix4x4: https://github.com/microsoft/DirectXMath/blob/main/Inc/DirectXMathMatrix.inl#L1024-L1060. There are sometimes a few extra optimizations that could be had as well, but nothing super major.
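As a rough illustration of the swizzle, multiply, and dot structure such an implementation has, here is a portable C++ sketch. It uses a plain 4-float struct in place of a SIMD register, so it shows the data flow rather than the codegen. `Vec4`, `swizzle`, and `determinant_lanes` are hypothetical names; this is neither the PR's code nor a verbatim copy of DirectXMath, and the exact shuffle grouping there may differ.

```cpp
// Hypothetical 4-lane float vector standing in for a 128-bit SIMD register.
struct Vec4 { float v[4]; };

// Single-argument shuffle: pick lanes X, Y, Z, W from one source vector.
template <int X, int Y, int Z, int W>
Vec4 swizzle(const Vec4& s) { return Vec4{{s.v[X], s.v[Y], s.v[Z], s.v[W]}}; }

Vec4 mul(const Vec4& l, const Vec4& r) {
    return Vec4{{l.v[0] * r.v[0], l.v[1] * r.v[1], l.v[2] * r.v[2], l.v[3] * r.v[3]}};
}
Vec4 add(const Vec4& l, const Vec4& r) {
    return Vec4{{l.v[0] + r.v[0], l.v[1] + r.v[1], l.v[2] + r.v[2], l.v[3] + r.v[3]}};
}
Vec4 sub(const Vec4& l, const Vec4& r) {
    return Vec4{{l.v[0] - r.v[0], l.v[1] - r.v[1], l.v[2] - r.v[2], l.v[3] - r.v[3]}};
}
float dot(const Vec4& l, const Vec4& r) {
    return l.v[0] * r.v[0] + l.v[1] * r.v[1] + l.v[2] * r.v[2] + l.v[3] * r.v[3];
}

// Determinant of a 4x4 matrix given as four row vectors.
float determinant_lanes(const Vec4 r[4]) {
    // Three swizzles each of rows 2 and 3 (lane order: x=0, y=1, z=2, w=3).
    const Vec4 r2_yxxx = swizzle<1, 0, 0, 0>(r[2]);
    const Vec4 r2_zzyy = swizzle<2, 2, 1, 1>(r[2]);
    const Vec4 r2_wwwz = swizzle<3, 3, 3, 2>(r[2]);
    const Vec4 r3_yxxx = swizzle<1, 0, 0, 0>(r[3]);
    const Vec4 r3_zzyy = swizzle<2, 2, 1, 1>(r[3]);
    const Vec4 r3_wwwz = swizzle<3, 3, 3, 2>(r[3]);

    // Each lane of p0, p1, p2 is a 2x2 sub-determinant of rows 2 and 3.
    const Vec4 p0 = sub(mul(r2_yxxx, r3_zzyy), mul(r2_zzyy, r3_yxxx));
    const Vec4 p1 = sub(mul(r2_yxxx, r3_wwwz), mul(r2_wwwz, r3_yxxx));
    const Vec4 p2 = sub(mul(r2_zzyy, r3_wwwz), mul(r2_wwwz, r3_zzyy));

    // Combine with swizzles of row 1: each lane becomes one 3x3 minor.
    const Vec4 minors = add(
        sub(mul(swizzle<3, 3, 3, 2>(r[1]), p0), mul(swizzle<2, 2, 1, 1>(r[1]), p1)),
        mul(swizzle<1, 0, 0, 0>(r[1]), p2));

    // Apply the alternating cofactor signs to row 0 and reduce with a dot product.
    const Vec4 sign = {{1.0f, -1.0f, 1.0f, -1.0f}};
    return dot(mul(r[0], sign), minors);
}
```

Every shuffle here takes a single source vector, which matches the point above about avoiding a two-argument shuffle.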

This PR adds an SSE implementation for `Matrix4x4.GetDeterminant`. Performance improves by about 15% by vectorizing the operations.

Performance results are from the existing `Perf_Matrix4x4.GetDeterminantBenchmark` [here](https://github.com/dotnet/performance/blob/f702e197f8dbf28294fc0483e7317f140f2fd6cc/src/benchmarks/micro/libraries/System.Numerics.Vectors/Perf_Matrix4x4.cs#L148):

```
| Method                  | Job        | Toolchain                   | Mean     | Error     | StdDev    | Median   | Min      | Max      | Ratio | Allocated | Alloc Ratio |
|------------------------ |----------- |---------------------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|----------:|------------:|
| GetDeterminantBenchmark | Job-KGYWNE | \base\Core_Root\corerun.exe | 3.487 ns | 0.0178 ns | 0.0167 ns | 3.483 ns | 3.452 ns | 3.518 ns |  1.00 |         - |          NA |
| GetDeterminantBenchmark | Job-VWCPYO | \diff\Core_Root\corerun.exe | 2.971 ns | 0.0106 ns | 0.0099 ns | 2.970 ns | 2.960 ns | 2.992 ns |  0.85 |         - |          NA |
```

---------

Co-authored-by: Alex Covington (Advanced Micro Devices Inc) <b-alexco@microsoft.com>

Add SSE implementation to Matrix4x4.GetDeterminant

b6d1d30

github-actions bot added the needs-area-label label Feb 3, 2026

dotnet-policy-service bot added the community-contribution label Feb 3, 2026

jkotas added the area-System.Numerics and tenet-performance labels and removed the needs-area-label label Feb 3, 2026

tannergooding reviewed Feb 3, 2026
