
Forge JIT Backend Integration #35

Merged

auto-differentiation-dev merged 145 commits into auto-differentiation:main from da-roth:forge on Feb 5, 2026
Conversation

@da-roth (Contributor) commented Jan 10, 2026

This PR integrates the Forge JIT backend for XAD, adding optional native code generation support. Forge is an optional dependency - everything builds and runs without it.

Changes

Build options added:

  • QLRISKS_ENABLE_FORGE: Enable Forge JIT backend
  • QLRISKS_ENABLE_FORGE_TESTS: Include Forge tests in test suite
  • QLRISKS_ENABLE_JIT_TESTS: Enable XAD JIT tests (interpreter backend, no Forge)
  • QLRISKS_BUILD_BENCHMARK / QLRISKS_BUILD_BENCHMARK_STANDALONE: Benchmark executables
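
Assuming standard CMake conventions, these would be toggled at configure time, e.g. `cmake -DQLRISKS_ENABLE_FORGE=ON -DQLRISKS_ENABLE_FORGE_TESTS=ON ..`.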

Files added:

  • test-suite/jit_xad.cpp: JIT infrastructure tests (interpreter backend)
  • test-suite/forgebackend_xad.cpp: Forge backend tests
  • test-suite/swaption_jit_pipeline_xad.cpp: JIT pipeline tests for LMM Monte Carlo
  • test-suite/swaption_benchmark.cpp: Boost.Test benchmarks
  • test-suite/benchmark_main.cpp: Standalone benchmark executable
  • test-suite/PlatformInfo.hpp: Platform detection utilities
  • .github/workflows/ql-benchmarks.yaml: Benchmark workflow

Files modified:

  • CMakeLists.txt: Forge integration options
  • test-suite/CMakeLists.txt: Conditional test/benchmark targets
  • .github/workflows/ci.yaml: Added forge-linux and forge-windows jobs

Benchmarks

The benchmark workflow (ql-benchmarks.yaml) runs swaption pricing benchmarks comparing FD, XAD tape, JIT scalar, and JIT-AVX methods on Linux and Windows.
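
For readers unfamiliar with the comparison, here is a minimal sketch of what the FD and XAD-tape methods do (illustrative only: `price` is a hypothetical stand-in, not the benchmark's swaption pricer; the XAD calls follow its documented adjoint-mode API):

```cpp
#include <XAD/XAD.hpp>

#include <cstddef>
#include <vector>

// Hypothetical stand-in for the swaption pricer (the real benchmark prices
// an LMM Monte Carlo swaption); templated so it runs on double and AReal.
template <class T>
T price(const std::vector<T>& x)
{
    T v = T(0.0);
    for (const auto& xi : x)
        v += xi * xi; // placeholder payoff
    return v;
}

// FD: one bumped revaluation per input, i.e. O(n) pricer calls.
std::vector<double> gradFD(std::vector<double> x, double h = 1e-6)
{
    const double base = price(x);
    std::vector<double> g(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += h;
        g[i] = (price(x) - base) / h;
        x[i] -= h;
    }
    return g;
}

// XAD tape: one augmented forward pass plus one adjoint sweep for all inputs.
std::vector<double> gradAAD(const std::vector<double>& x0)
{
    using mode = xad::adj<double>;
    mode::tape_type tape;
    std::vector<mode::active_type> x(x0.begin(), x0.end());
    tape.registerInputs(x);
    tape.newRecording();
    mode::active_type v = price(x);
    tape.registerOutput(v);
    derivative(v) = 1.0;
    tape.computeAdjoints();
    std::vector<double> g(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        g[i] = derivative(x[i]);
    return g;
}

int main()
{
    std::vector<double> x0(45, 1.0); // e.g. the 45-input case
    return gradFD(x0).size() == gradAAD(x0).size() ? 0 : 1;
}
```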

Also included is some initial work towards #33: the workflow has type-overhead jobs that compare double vs. xad::AReal pricing performance (no derivatives) on the same hardware, providing a baseline for measuring XAD type overhead.
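
A minimal sketch of that baseline measurement (an assumption-laden illustration: it relies on XAD permitting passive use of AReal without an active tape, and the payoff is a placeholder, not the benchmark's pricer):

```cpp
#include <XAD/XAD.hpp>

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Placeholder payoff, templated on the scalar type.
template <class T>
T payoff(const std::vector<T>& x)
{
    T v = T(0.0);
    for (const auto& xi : x)
        v += xi * xi;
    return v;
}

static double toDouble(double v) { return v; }
static double toDouble(const xad::AReal<double>& v) { return xad::value(v); }

// Time one pricing pass for a given scalar type. No tape is active, so
// AReal records nothing and the gap is (approximately) pure type overhead.
template <class T>
double onePassMs(std::size_t n)
{
    std::vector<T> x(n, T(1.0));
    const auto t0 = std::chrono::steady_clock::now();
    const double v = toDouble(payoff(x));
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << "value=" << v << " "; // keep the result observable
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    std::cout << "double: " << onePassMs<double>(1000000) << " ms\n";
    std::cout << "AReal:  " << onePassMs<xad::AReal<double>>(1000000) << " ms\n";
}
```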

Example benchmark run (Linux): [link]

@auto-differentiation-dev (Contributor) commented

Hi @da-roth,

Looking at the numbers again, maybe it is just a matter of increasing the Monte Carlo workload so that runtime isn't dominated by bootstrapping. That would better reflect a real-world application. For example, we could include a portfolio of swaptions and additional cashflows.

@da-roth (Contributor, Author) commented Feb 3, 2026

Hi @auto-differentiation-dev,

I did some investigating, and your remarks and intuitions were right. The QL code did some nasty re-computations of matrices during each step of the MC simulation. The code implementing this example is not optimal - but that makes it a really good working example for future work: it shows the impact of the double vs. AReal overhead, and the latest results also indicate where Forge is still suboptimal.

Anyway, I did some minor optimization so that the matrix is only computed once per path (a sketch of this hoisting follows the tables below), and I see this locally:
Timings, native double FD:

Paths | Method | Mean    | StdDev | Speedup
1K    | FD     |  4820.3 |   10.0 | ---
10K   | FD     |  6781.9 |    0.0 | ---
100K  | FD     | 26416.9 |    0.0 | ---

Timings, AReal with JIT = ON:

Paths | Method    | Mean   | StdDev | Setup* | Speedup
1K    | XAD       |  180.0 |    0.8 | ---    | ---
1K    | XAD-Split |  143.5 |    2.4 | 109.5  | 1.25x
1K    | JIT       |  190.7 |    5.0 | 128.7  | 0.94x
1K    | JIT-AVX   |  143.8 |    0.5 | 128.7  | 1.25x
10K   | XAD       |  847.8 |    4.0 | ---    | ---
10K   | XAD-Split |  449.3 |    1.5 | 110.9  | 1.89x
10K   | JIT       |  721.7 |    0.0 | 127.9  | 1.17x
10K   | JIT-AVX   |  258.5 |    1.6 | 127.9  | 3.28x
100K  | XAD       | 7435.5 |    0.0 | ---    | ---
100K  | XAD-Split | 3463.5 |    0.0 | 110.7  | 2.15x
100K  | JIT       | 6060.8 |    0.0 | 127.6  | 1.23x
100K  | JIT-AVX   | 1413.2 |    0.0 | 127.6  | 5.26x
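
To illustrate the kind of change (a hypothetical sketch - names and structure are stand-ins, not the actual QuantLib code): the step-invariant matrix construction moves out of the per-step loop so it runs once per path.

```cpp
#include <cstddef>
#include <vector>

// All names are hypothetical stand-ins, not the actual QuantLib code.
using Matrix = std::vector<std::vector<double>>;
using State  = std::vector<double>;

// Placeholder for the expensive, step-invariant matrix construction.
Matrix buildEvolutionMatrix(const std::vector<double>& params)
{
    return Matrix(params.size(), params);
}

State evolve(const State& s, const Matrix& m, double dw)
{
    State out(s.size(), 0.0);
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t j = 0; j < s.size(); ++j)
            out[i] += m[i][j] * s[j] * dw;
    return out;
}

void simulate(State state, const std::vector<double>& params,
              const std::vector<std::vector<double>>& dw)
{
    for (std::size_t p = 0; p < dw.size(); ++p) {
        // Hoisted out of the step loop: built once per path instead of
        // being rebuilt at every step (once overall if params never change).
        const Matrix m = buildEvolutionMatrix(params);
        for (std::size_t s = 0; s < dw[p].size(); ++s)
            state = evolve(state, m, dw[p][s]);
    }
}

int main()
{
    std::vector<double> params{0.1, 0.2};
    State state{1.0, 1.0};
    std::vector<std::vector<double>> dw(3, std::vector<double>(4, 0.01));
    simulate(state, params, dw);
}
```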

So the AAD + AReal overhead (of course amplified by the suboptimal implementation in QL) still leaves roughly a 7.5x benefit for XAD vs. native double FD in this example.

Interestingly, XAD-Split is faster than JIT scalar - I think this shows both how well XAD is optimized and the overhead of doing unnecessary computations.

The gap between JIT and JIT-AVX doesn't surprise me too much - I spent some time improving the throughput of setting inputs and getting outputs for the AVX path, so it benefits from something like 4 lanes plus some infrastructure improvements that could be applied to the scalar JIT as well. I'll port those over to the scalar JIT in the future.
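
Roughly, the lane idea looks like this (a hypothetical sketch, not Forge's actual API): the AVX kernel evaluates four paths per invocation, so the cost of setting inputs and reading outputs is amortized across lanes.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;

// Stand-in for a JIT-compiled AVX kernel: one slot per input, kLanes wide.
void kernelAvx(const std::vector<std::array<double, kLanes>>& in,
               std::array<double, kLanes>& out)
{
    for (std::size_t l = 0; l < kLanes; ++l) {
        out[l] = 0.0;
        for (const auto& slot : in)
            out[l] += slot[l] * slot[l]; // placeholder computation
    }
}

int main()
{
    const std::size_t nPaths = 10000, nInputs = 45;
    std::vector<std::vector<double>> pathInputs(
        nPaths, std::vector<double>(nInputs, 1.0));
    std::vector<double> pathOutputs(nPaths, 0.0);

    for (std::size_t p = 0; p + kLanes <= nPaths; p += kLanes) {
        std::vector<std::array<double, kLanes>> in(nInputs);
        for (std::size_t i = 0; i < nInputs; ++i)
            for (std::size_t l = 0; l < kLanes; ++l)
                in[i][l] = pathInputs[p + l][i]; // pack 4 paths per input
        std::array<double, kLanes> out;
        kernelAvx(in, out);
        for (std::size_t l = 0; l < kLanes; ++l)
            pathOutputs[p + l] = out[l]; // unpack 4 results at once
    }
}
```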

Let's see how the benchmarks look in the cloud. Would you like any changes here? I really like the example since it gives us insights for future improvements, but of course one could construct something with a higher speedup over native FD if desired (more inputs, digging further into avoiding unnecessary computations, etc.). Thinking out loud, my intuition is: the better XAD performs compared to native FD, the nearer JIT scalar will get to XAD-Split (and at some point it should be slightly faster, as we saw in the XAD repo's results).

Cheers, Daniel

@auto-differentiation-dev (Contributor) commented

Thanks @da-roth, that's all noted. We see that you're still working on this PR; let us know when you'd like us to review or comment.

Thanks.

@da-roth (Contributor, Author) commented Feb 4, 2026

Hi @auto-differentiation-dev,

yeah, I was doing some tests and investigations. I added a second case that uses 90 inputs (adding some credit data), so we now have a 45-input and a 90-input case. The output is a bit messy - I had to put it into a table manually - but I'll wait for your feedback before finalizing the combined report. We have this now:

[image: benchmark results table]

which leads to roughly these formulas (comparing the two cases) for the bootstrapping phase:

[image: fitted runtime formulas, bootstrapping phase]

and these for the MC simulation:

[image: fitted runtime formulas, MC simulation]

Hence, for the first phase XAD is already faster from 2 inputs on. For the second phase the breakeven is somewhere between 2 and 7 inputs (from JIT-AVX to XAD); this is of course a bit uncertain, since FD is not really split - it's just an extrapolation.
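
For intuition, a generic cost model behind such a breakeven (symbolic constants for illustration only - these are not the fitted formulas from the screenshots): with T the runtime of one native double valuation and n the number of inputs, one-sided FD needs n + 1 valuations,

T_FD(n) = (n + 1) * T,

while adjoint approaches pay a roughly input-independent factor c plus a small per-input cost d:

T_AAD(n) = (c + d * n) * T, with d << 1.

The breakeven input count solves (n + 1) = c + d * n, i.e. n* = (c - 1) / (1 - d), which is why the crossover lands at a small, method-dependent number of inputs.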

I think that looks quite nice now, bearing in mind that the QL code is really not that nice here. Any thoughts or wishes?

Cheers, Daniel

@auto-differentiation-dev (Contributor) commented

Thanks Daniel, this looks very solid. Really appreciate the high-quality work here; we know this kind of effort is time-consuming, and you've moved through it impressively fast.

To summarise the key result: a plain valuation run with a single Monte Carlo path and no sensitivities takes ~206.37 ms in this setup. Enabling XAD (turbo-boosted with Forge) and running 10k Monte Carlo paths with 90 sensitivities only roughly doubles the runtime.

That's a strong, concrete demonstration of the value of AAD via operator overloading combined with Record-Once / Replay-Many. We feel this would be even more compelling when framed in a risk context (e.g. ORE).
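
For readers new to the pattern, a minimal sketch of Record-Once / Replay-Many (all names hypothetical - this is not Forge's actual API): the taping and JIT compilation of the computation are paid once (the Setup* column earlier), and every Monte Carlo path then replays the compiled kernel with fresh inputs.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch (illustrative names, not Forge's actual API): the
// expensive step -- taping the computation and JIT-compiling it to native
// code -- happens once; every Monte Carlo path then replays the kernel.
using Kernel = std::function<void(const std::vector<double>& in,
                                  double& value,
                                  std::vector<double>& grad)>;

Kernel compileOnce()
{
    // Stand-in for tape recording + JIT compilation of f(x) = sum x_i^2.
    return [](const std::vector<double>& in, double& v,
              std::vector<double>& g) {
        v = 0.0;
        g.assign(in.size(), 0.0);
        for (std::size_t i = 0; i < in.size(); ++i) {
            v += in[i] * in[i];
            g[i] = 2.0 * in[i]; // adjoint of the recorded computation
        }
    };
}

int main()
{
    Kernel k = compileOnce(); // record + compile once (the Setup* column)
    std::vector<std::vector<double>> paths(10000, {1.0, 2.0, 3.0});
    double value;
    std::vector<double> grad;
    for (const auto& p : paths) // replay-many: one kernel call per path
        k(p, value, grad);
}
```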

Pending your go-ahead, we'll give it a final look and merge.

@da-roth (Contributor, Author) commented Feb 5, 2026

Hi @auto-differentiation-dev,

thank you! I did some output cleaning yesterday evening, and the combined report now shows this:

[image: combined benchmark report]

for the latest run ([link]).
Note: I renamed XAD-Split to XAD and the 'slow' XAD to XAD-Full, and the combined report now only shows QL native double FD, XAD (JIT=off), and Forge-AVX (XAD with JIT=on).

Agreed - I think there are a couple of nice insights here demonstrating the general value and also showing directions for improvement:

  • We have some nice levers for fine-tuning speed:
  1. Even the non-split XAD version might be attractive: it is faster than FD without requiring any refactoring, and I'd guess it won't run out of memory on production machines as quickly as on the free GitHub runners.
  2. Splitting into the curve-building and pricing phases can be done quite easily, without any model/pricer specifics.
  3. To get the additional JIT speed-up, specific code regions need to be revised and refactored to capture the necessary branches.
  4. With something like the mentioned ORE in mind: even if one isn't interested in AAD, the AReal overhead might be worth paying (if one doesn't want to build with two types), since one could potentially use Forge's non-AAD JIT kernel. While we don't see numbers for that here (and it isn't fine-tuned yet either), I'd expect it to be even faster than a single original double pass.
  • After this PR, I'll need to update the double vs. AReal PR with this example; it'll be nice to have numbers for the two phases independently.
  • Personally, I think it's really interesting to have an example where naive JIT gets outperformed even though XAD is also still facing the AReal overhead.

For some reason the Windows action was running out of memory; I didn't really notice that yesterday. I limited compilation to 2 parallel jobs on Windows, since MSVC seems to consume more memory. I'm done from my side - feel free to have your final look while I keep an eye on the Windows job.

Cheers, Daniel

@auto-differentiation-dev (Contributor) left a review


Hi @da-roth,

The PR is approved as such, but the Windows build is taking unbearably long with strong inlining. Let's try the suggested changes to disable strong inlining and see what we get.

auto-differentiation-dev merged commit 5645132 into auto-differentiation:main on Feb 5, 2026
24 checks passed
