
Forge JIT Backend Integration #35

Merged

auto-differentiation-dev merged 145 commits into auto-differentiation:main from da-roth:forge on Feb 5, 2026
Conversation

@da-roth (Contributor) commented Jan 10, 2026

This PR integrates the Forge JIT backend for XAD, adding optional native code generation support. Forge is an optional dependency - everything builds and runs without it.

Changes

Build options added:

  • QLRISKS_ENABLE_FORGE: Enable Forge JIT backend
  • QLRISKS_ENABLE_FORGE_TESTS: Include Forge tests in test suite
  • QLRISKS_ENABLE_JIT_TESTS: Enable XAD JIT tests (interpreter backend, no Forge)
  • QLRISKS_BUILD_BENCHMARK / QLRISKS_BUILD_BENCHMARK_STANDALONE: Benchmark executables
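
Assuming standard CMake conventions, these would be toggled at configure time, e.g. `cmake -DQLRISKS_ENABLE_FORGE=ON -DQLRISKS_ENABLE_FORGE_TESTS=ON ..`.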

Files added:

  • test-suite/jit_xad.cpp: JIT infrastructure tests (interpreter backend)
  • test-suite/forgebackend_xad.cpp: Forge backend tests
  • test-suite/swaption_jit_pipeline_xad.cpp: JIT pipeline tests for LMM Monte Carlo
  • test-suite/swaption_benchmark.cpp: Boost.Test benchmarks
  • test-suite/benchmark_main.cpp: Standalone benchmark executable
  • test-suite/PlatformInfo.hpp: Platform detection utilities
  • .github/workflows/ql-benchmarks.yaml: Benchmark workflow

Files modified:

  • CMakeLists.txt: Forge integration options
  • test-suite/CMakeLists.txt: Conditional test/benchmark targets
  • .github/workflows/ci.yaml: Added forge-linux and forge-windows jobs

Benchmarks

The benchmark workflow (ql-benchmarks.yaml) runs swaption pricing benchmarks comparing FD, XAD tape, JIT scalar, and JIT-AVX methods on Linux and Windows.
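
For readers unfamiliar with the comparison, here is a minimal sketch of what the FD and XAD-tape methods do (illustrative only: `price` is a hypothetical stand-in, not the benchmark's swaption pricer; the XAD calls follow its documented adjoint-mode API):

```cpp
#include <XAD/XAD.hpp>

#include <cstddef>
#include <vector>

// Hypothetical stand-in for the swaption pricer (the real benchmark prices
// an LMM Monte Carlo swaption); templated so it runs on double and AReal.
template <class T>
T price(const std::vector<T>& x)
{
    T v = T(0.0);
    for (const auto& xi : x)
        v += xi * xi; // placeholder payoff
    return v;
}

// FD: one bumped revaluation per input, i.e. O(n) pricer calls.
std::vector<double> gradFD(std::vector<double> x, double h = 1e-6)
{
    const double base = price(x);
    std::vector<double> g(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += h;
        g[i] = (price(x) - base) / h;
        x[i] -= h;
    }
    return g;
}

// XAD tape: one augmented forward pass plus one adjoint sweep for all inputs.
std::vector<double> gradAAD(const std::vector<double>& x0)
{
    using mode = xad::adj<double>;
    mode::tape_type tape;
    std::vector<mode::active_type> x(x0.begin(), x0.end());
    tape.registerInputs(x);
    tape.newRecording();
    mode::active_type v = price(x);
    tape.registerOutput(v);
    derivative(v) = 1.0;
    tape.computeAdjoints();
    std::vector<double> g(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        g[i] = derivative(x[i]);
    return g;
}

int main()
{
    std::vector<double> x0(45, 1.0); // e.g. the 45-input case
    return gradFD(x0).size() == gradAAD(x0).size() ? 0 : 1;
}
```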

Also included is some initial work towards #33: the workflow has type-overhead jobs that compare double vs. xad::AReal pricing performance (no derivatives) on the same hardware, providing a baseline for measuring XAD type overhead.
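
A minimal sketch of that baseline measurement (an assumption-laden illustration: it relies on XAD permitting passive use of AReal without an active tape, and the payoff is a placeholder, not the benchmark's pricer):

```cpp
#include <XAD/XAD.hpp>

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Placeholder payoff, templated on the scalar type.
template <class T>
T payoff(const std::vector<T>& x)
{
    T v = T(0.0);
    for (const auto& xi : x)
        v += xi * xi;
    return v;
}

static double toDouble(double v) { return v; }
static double toDouble(const xad::AReal<double>& v) { return xad::value(v); }

// Time one pricing pass for a given scalar type. No tape is active, so
// AReal records nothing and the gap is (approximately) pure type overhead.
template <class T>
double onePassMs(std::size_t n)
{
    std::vector<T> x(n, T(1.0));
    const auto t0 = std::chrono::steady_clock::now();
    const double v = toDouble(payoff(x));
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << "value=" << v << " "; // keep the result observable
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    std::cout << "double: " << onePassMs<double>(1000000) << " ms\n";
    std::cout << "AReal:  " << onePassMs<xad::AReal<double>>(1000000) << " ms\n";
}
```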

Example benchmark run (Linux): [link]

@auto-differentiation-dev (Contributor) commented

Hi @da-roth,

Looking at the numbers again, maybe it is just a matter of increasing the Monte Carlo workload so that runtime isn't dominated by bootstrapping. That would better reflect a real-world application. For example, we could include a portfolio of swaptions and additional cashflows.

@da-roth (Contributor, Author) commented Feb 3, 2026

Hi @auto-differentiation-dev,

I did some investigating, and your remarks and intuitions were right. The QL code did some nasty re-computations of matrices during each step of the MC simulation. The code implementing this example is not optimal - but that makes it a really good working example for future work: it shows the impact of the double vs. AReal overhead, and the latest results also indicate where Forge is still suboptimal.

Anyway, I did some minor optimization so that the matrix is only computed once per path (a sketch of this hoisting follows the tables below), and I see this locally:
Timings, native double FD:

Paths | Method | Mean    | StdDev | Speedup
1K    | FD     |  4820.3 |   10.0 | ---
10K   | FD     |  6781.9 |    0.0 | ---
100K  | FD     | 26416.9 |    0.0 | ---

Timings, AReal with JIT = ON:

Paths | Method    | Mean   | StdDev | Setup* | Speedup
1K    | XAD       |  180.0 |    0.8 | ---    | ---
1K    | XAD-Split |  143.5 |    2.4 | 109.5  | 1.25x
1K    | JIT       |  190.7 |    5.0 | 128.7  | 0.94x
1K    | JIT-AVX   |  143.8 |    0.5 | 128.7  | 1.25x
10K   | XAD       |  847.8 |    4.0 | ---    | ---
10K   | XAD-Split |  449.3 |    1.5 | 110.9  | 1.89x
10K   | JIT       |  721.7 |    0.0 | 127.9  | 1.17x
10K   | JIT-AVX   |  258.5 |    1.6 | 127.9  | 3.28x
100K  | XAD       | 7435.5 |    0.0 | ---    | ---
100K  | XAD-Split | 3463.5 |    0.0 | 110.7  | 2.15x
100K  | JIT       | 6060.8 |    0.0 | 127.6  | 1.23x
100K  | JIT-AVX   | 1413.2 |    0.0 | 127.6  | 5.26x
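
To illustrate the kind of change (a hypothetical sketch - names and structure are stand-ins, not the actual QuantLib code): the step-invariant matrix construction moves out of the per-step loop so it runs once per path.

```cpp
#include <cstddef>
#include <vector>

// All names are hypothetical stand-ins, not the actual QuantLib code.
using Matrix = std::vector<std::vector<double>>;
using State  = std::vector<double>;

// Placeholder for the expensive, step-invariant matrix construction.
Matrix buildEvolutionMatrix(const std::vector<double>& params)
{
    return Matrix(params.size(), params);
}

State evolve(const State& s, const Matrix& m, double dw)
{
    State out(s.size(), 0.0);
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t j = 0; j < s.size(); ++j)
            out[i] += m[i][j] * s[j] * dw;
    return out;
}

void simulate(State state, const std::vector<double>& params,
              const std::vector<std::vector<double>>& dw)
{
    for (std::size_t p = 0; p < dw.size(); ++p) {
        // Hoisted out of the step loop: built once per path instead of
        // being rebuilt at every step (once overall if params never change).
        const Matrix m = buildEvolutionMatrix(params);
        for (std::size_t s = 0; s < dw[p].size(); ++s)
            state = evolve(state, m, dw[p][s]);
    }
}

int main()
{
    std::vector<double> params{0.1, 0.2};
    State state{1.0, 1.0};
    std::vector<std::vector<double>> dw(3, std::vector<double>(4, 0.01));
    simulate(state, params, dw);
}
```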

So the AAD + AReal overhead (of course amplified by the suboptimal implementation in QL) still leaves roughly a 7.5x benefit for XAD vs. native double FD in this example.

Interestingly, XAD-Split is faster than JIT scalar - I think this shows both how well XAD is optimized and the overhead of doing unnecessary computations.

The gap between JIT and JIT-AVX doesn't surprise me too much - I spent some time improving the throughput of setting inputs and getting outputs for the AVX path, so it benefits from something like 4 lanes plus some infrastructure improvements that could be applied to the scalar JIT as well. I'll port those over to the scalar JIT in the future.
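
Roughly, the lane idea looks like this (a hypothetical sketch, not Forge's actual API): the AVX kernel evaluates four paths per invocation, so the cost of setting inputs and reading outputs is amortized across lanes.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;

// Stand-in for a JIT-compiled AVX kernel: one slot per input, kLanes wide.
void kernelAvx(const std::vector<std::array<double, kLanes>>& in,
               std::array<double, kLanes>& out)
{
    for (std::size_t l = 0; l < kLanes; ++l) {
        out[l] = 0.0;
        for (const auto& slot : in)
            out[l] += slot[l] * slot[l]; // placeholder computation
    }
}

int main()
{
    const std::size_t nPaths = 10000, nInputs = 45;
    std::vector<std::vector<double>> pathInputs(
        nPaths, std::vector<double>(nInputs, 1.0));
    std::vector<double> pathOutputs(nPaths, 0.0);

    for (std::size_t p = 0; p + kLanes <= nPaths; p += kLanes) {
        std::vector<std::array<double, kLanes>> in(nInputs);
        for (std::size_t i = 0; i < nInputs; ++i)
            for (std::size_t l = 0; l < kLanes; ++l)
                in[i][l] = pathInputs[p + l][i]; // pack 4 paths per input
        std::array<double, kLanes> out;
        kernelAvx(in, out);
        for (std::size_t l = 0; l < kLanes; ++l)
            pathOutputs[p + l] = out[l]; // unpack 4 results at once
    }
}
```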

Let's see how the benchmarks look in the cloud. Would you like any changes here? I really like the example since it gives us insights for future improvements, but of course one could construct something with a higher speedup over native FD if desired (more inputs, digging further into avoiding unnecessary computations, etc.). Thinking out loud, my intuition is: the better XAD performs compared to native FD, the nearer JIT scalar will get to XAD-Split (and at some point it should be slightly faster, as we saw in the XAD repo's results).

Cheers, Daniel

@auto-differentiation-dev (Contributor) commented

Thanks @da-roth, that's all noted. We see that you're still working on this PR; let us know when you'd like us to review or comment.

Thanks.

@da-roth (Contributor, Author) commented Feb 4, 2026

Hi @auto-differentiation-dev,

yeah, I was doing some tests and investigations. I added a second case that uses 90 inputs (adding some credit data), so we now have a 45-input and a 90-input case. The output is a bit messy - I had to put it into a table manually - but I'll wait for your feedback before finalizing the combined report. We have this now:

[image: benchmark results table]

which leads to roughly these formulas (comparing the two cases) for the bootstrapping phase:

[image: fitted runtime formulas, bootstrapping phase]

and these for the MC simulation:

[image: fitted runtime formulas, MC simulation]

Hence, for the first phase XAD is already faster from 2 inputs on. For the second phase the breakeven is somewhere between 2 and 7 inputs (from JIT-AVX to XAD); this is of course a bit uncertain, since FD is not really split - it's just an extrapolation.
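
For intuition, a generic cost model behind such a breakeven (symbolic constants for illustration only - these are not the fitted formulas from the screenshots): with T the runtime of one native double valuation and n the number of inputs, one-sided FD needs n + 1 valuations,

T_FD(n) = (n + 1) * T,

while adjoint approaches pay a roughly input-independent factor c plus a small per-input cost d:

T_AAD(n) = (c + d * n) * T, with d << 1.

The breakeven input count solves (n + 1) = c + d * n, i.e. n* = (c - 1) / (1 - d), which is why the crossover lands at a small, method-dependent number of inputs.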

I think that looks quite nice now, bearing in mind that the QL code is really not that nice here. Any thoughts or wishes?

Cheers, Daniel

@auto-differentiation-dev (Contributor) commented

Thanks Daniel, this looks very solid. Really appreciate the high-quality work here; we know this kind of effort is time-consuming, and you've moved through it impressively fast.

To summarise the key result: a plain valuation run with a single Monte Carlo path and no sensitivities takes ~206.37 ms in this setup. Enabling XAD (turbo-boosted with Forge) and running 10k Monte Carlo paths with 90 sensitivities only roughly doubles the runtime.

That's a strong, concrete demonstration of the value of AAD via operator overloading combined with Record-Once / Replay-Many. We feel this would be even more compelling when framed in a risk context (e.g. ORE).
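
For readers new to the pattern, a minimal sketch of Record-Once / Replay-Many (all names hypothetical - this is not Forge's actual API): the taping and JIT compilation of the computation are paid once (the Setup* column earlier), and every Monte Carlo path then replays the compiled kernel with fresh inputs.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch (illustrative names, not Forge's actual API): the
// expensive step -- taping the computation and JIT-compiling it to native
// code -- happens once; every Monte Carlo path then replays the kernel.
using Kernel = std::function<void(const std::vector<double>& in,
                                  double& value,
                                  std::vector<double>& grad)>;

Kernel compileOnce()
{
    // Stand-in for tape recording + JIT compilation of f(x) = sum x_i^2.
    return [](const std::vector<double>& in, double& v,
              std::vector<double>& g) {
        v = 0.0;
        g.assign(in.size(), 0.0);
        for (std::size_t i = 0; i < in.size(); ++i) {
            v += in[i] * in[i];
            g[i] = 2.0 * in[i]; // adjoint of the recorded computation
        }
    };
}

int main()
{
    Kernel k = compileOnce(); // record + compile once (the Setup* column)
    std::vector<std::vector<double>> paths(10000, {1.0, 2.0, 3.0});
    double value;
    std::vector<double> grad;
    for (const auto& p : paths) // replay-many: one kernel call per path
        k(p, value, grad);
}
```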

Pending your go-ahead, we'll give it a final look and merge.

@da-roth (Contributor, Author) commented Feb 5, 2026

Hi @auto-differentiation-dev,

thank you! I did some output cleaning yesterday evening, and the combined report now shows this:

[image: combined benchmark report]

for the latest run ([link]).
Note: I renamed XAD-Split to XAD and the 'slow' XAD to XAD-Full, and the combined report now only shows QL native double FD, XAD (JIT=off), and Forge-AVX (XAD with JIT=on).

Agreed - I think there are a couple of nice insights here demonstrating the general value and also showing directions for improvement:

  • We have some nice levers for fine-tuning speed:
  1. Even the non-split XAD version might be attractive: it is faster than FD without requiring any refactoring, and I'd guess it won't run out of memory on production machines as quickly as on the free GitHub runners.
  2. Splitting into the curve-building and pricing phases can be done quite easily, without any model/pricer specifics.
  3. To get the additional JIT speed-up, specific code regions need to be revised and refactored to capture the necessary branches.
  4. With something like the mentioned ORE in mind: even if one isn't interested in AAD, the AReal overhead might be worth paying (if one doesn't want to build with two types), since one could potentially use Forge's non-AAD JIT kernel. While we don't see numbers for that here (and it isn't fine-tuned yet either), I'd expect it to be even faster than a single original double pass.
  • After this PR, I'll need to update the double vs. AReal PR with this example; it'll be nice to have numbers for the two phases independently.
  • Personally, I think it's really interesting to have an example where naive JIT gets outperformed even though XAD is also still facing the AReal overhead.

For some reason the Windows action was running out of memory; I didn't really notice that yesterday. I limited compilation to 2 parallel jobs on Windows, since MSVC seems to consume more memory. I'm done from my side - feel free to have your final look while I keep an eye on the Windows job.

Cheers, Daniel

@auto-differentiation-dev (Contributor) left a review


Hi @da-roth,

The PR is approved as such, but the Windows build is taking unbearably long with strong inlining. Let's try the suggested changes to disable strong inlining and see what we get.

auto-differentiation-dev merged commit 5645132 into auto-differentiation:main on Feb 5, 2026
24 checks passed
