Add SSE implementation to Matrix4x4.GetDeterminant #123954
tannergooding merged 2 commits into dotnet:main
Tagging subscribers to this area: @dotnet/area-system-numerics
src/libraries/System.Private.CoreLib/src/System/Numerics/Matrix4x4.Impl.cs
Reordering the shuffles so that they happen just before they are needed reduces register pressure, code size, and execution time on older systems with fewer xmm registers.
Overall we really only care about the codegen for [...]. I'd be fine if we wanted to update this to break it apart into 2 groups of 6 shuffles instead; that would notably also remove the single two-argument shuffle we have. DirectX Math has a solid implementation and is already used for some other code paths in Matrix4x4: https://github.com/microsoft/DirectXMath/blob/main/Inc/DirectXMathMatrix.inl#L1024-L1060. There are sometimes a few extra optimizations that could be had as well, but nothing super major.
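As a rough illustration of the DirectXMath-style approach mentioned above, here is a minimal C# sketch. It is not the code merged in this PR: `DeterminantSketch` and its helper names are hypothetical, it uses the cross-platform `Vector128` API (assuming .NET 7 or later) rather than `Sse` intrinsics directly, and it writes each swizzle as a small helper instead of mirroring the exact shuffle grouping discussed in review.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical sketch for illustration only; not the runtime's implementation.
public static class DeterminantSketch
{
    // Swizzle helpers: select four lanes of v by constant index (x=0, y=1, z=2, w=3).
    private static Vector128<float> Yxxx(Vector128<float> v) =>
        Vector128.Shuffle(v, Vector128.Create(1, 0, 0, 0));

    private static Vector128<float> Zzyy(Vector128<float> v) =>
        Vector128.Shuffle(v, Vector128.Create(2, 2, 1, 1));

    private static Vector128<float> Wwwz(Vector128<float> v) =>
        Vector128.Shuffle(v, Vector128.Create(3, 3, 3, 2));

    public static float Determinant(Matrix4x4 m)
    {
        // Load the four rows of the matrix into 128-bit vectors.
        Vector128<float> r0 = Vector128.Create(m.M11, m.M12, m.M13, m.M14);
        Vector128<float> r1 = Vector128.Create(m.M21, m.M22, m.M23, m.M24);
        Vector128<float> r2 = Vector128.Create(m.M31, m.M32, m.M33, m.M34);
        Vector128<float> r3 = Vector128.Create(m.M41, m.M42, m.M43, m.M44);

        // 2x2 sub-determinants built from the last two rows, four lanes at a time.
        Vector128<float> p0 = Yxxx(r2) * Zzyy(r3) - Zzyy(r2) * Yxxx(r3);
        Vector128<float> p1 = Yxxx(r2) * Wwwz(r3) - Wwwz(r2) * Yxxx(r3);
        Vector128<float> p2 = Zzyy(r2) * Wwwz(r3) - Wwwz(r2) * Zzyy(r3);

        // Each lane becomes the 3x3 minor for the matching element of the first row.
        Vector128<float> minors = Wwwz(r1) * p0 - Zzyy(r1) * p1 + Yxxx(r1) * p2;

        // Cofactor expansion along the first row: alternate signs, then sum the lanes.
        Vector128<float> signedRow0 = r0 * Vector128.Create(1f, -1f, 1f, -1f);
        return Vector128.Dot(signedRow0, minors);
    }
}
```

The built-in method makes a convenient cross-check: for any matrix `m`, `DeterminantSketch.Determinant(m)` should agree with `m.GetDeterminant()` up to floating-point rounding.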
This PR adds an SSE implementation for `Matrix4x4.GetDeterminant`. Performance improves by about 15% by vectorizing the operations.

Performance results are from the existing `Perf_Matrix4x4.GetDeterminantBenchmark` [here](https://github.com/dotnet/performance/blob/f702e197f8dbf28294fc0483e7317f140f2fd6cc/src/benchmarks/micro/libraries/System.Numerics.Vectors/Perf_Matrix4x4.cs#L148):

```
| Method                  | Job        | Toolchain                   | Mean     | Error     | StdDev    | Median   | Min      | Max      | Ratio | Allocated | Alloc Ratio |
|------------------------ |----------- |---------------------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|----------:|------------:|
| GetDeterminantBenchmark | Job-KGYWNE | \base\Core_Root\corerun.exe | 3.487 ns | 0.0178 ns | 0.0167 ns | 3.483 ns | 3.452 ns | 3.518 ns |  1.00 |         - |          NA |
| GetDeterminantBenchmark | Job-VWCPYO | \diff\Core_Root\corerun.exe | 2.971 ns | 0.0106 ns | 0.0099 ns | 2.970 ns | 2.960 ns | 2.992 ns |  0.85 |         - |          NA |
```

---------

Co-authored-by: Alex Covington (Advanced Micro Devices Inc) <b-alexco@microsoft.com>