Forge JIT Backend Integration #35
auto-differentiation-dev merged 145 commits into auto-differentiation:main
Conversation
This reverts commit c84d01c.
Hi @da-roth, Looking at the numbers again, maybe it is just a matter of increasing the Monte Carlo workload so runtime isn't dominated by bootstrapping. That would better reflect a real-world setup/application. For example, we could include a portfolio of swaptions and additional cashflows.
Hi @auto-differentiation-dev , did some investigations, and your remarks and intuitions were right. The QL code did some nasty re-computations of matrices during each step in the MC simulation. The code that implements this example is not that optimal, but that actually makes it a really good working example for future work: it shows the impact of the overhead between double and AReal, and the latest results also indicate where Forge is still suboptimal. Anyway, I did a minor optimization so that the matrix is only computed once per path, and I see this locally:
Timings AReal with JIT = ON:
So AAD + AReal overhead (of course amplified by the unoptimal implementation in QL) still gives roughly a 7.5x benefit for XAD vs native double in this example. Interestingly, XAD-Split is faster than JIT - I think that shows how well XAD is optimized, and the cost of doing unnecessary computations. The gap between JIT and JIT-AVX doesn't surprise me too much - I spent some time improving the throughput of setting inputs and getting outputs for the AVX path, so we have something like 4 lanes plus some infrastructural improvements that can be applied to the scalar JIT as well. I'll apply those to the scalar JIT in the future. Let's see how the benchmarks look in the cloud. Would you wish any changes here? I really like the example since it gives us all the insights for future improvements, but of course one could create something with a higher speed-up compared to native FD if wished (more inputs, digging further into avoiding unnecessary computations, etc.). Thinking out loud, my intuition is: the better XAD performs compared to native FD, the nearer scalar JIT will get to XAD-Split (and at some point it will be slightly faster, as we saw in the XAD repo's results). Cheers, Daniel
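As a rough illustration of the once-per-path optimization described above (all names here are hypothetical stand-ins, not the actual QuantLib code), the fix amounts to hoisting a step-invariant matrix computation out of the per-step loop:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Stand-in for an expensive matrix computation that depends only on the
// path's model parameters, not on the time step.
Matrix buildCovariance(double vol, int n) {
    Matrix m(n, std::vector<double>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            m[i][j] = vol * std::exp(-0.1 * (i > j ? i - j : j - i));
    return m;
}

// Toy state update using the matrix.
double evolve(double s, const Matrix& cov, int step) {
    return s + cov[0][0] * 0.01 * (step % 2 ? 1.0 : -1.0);
}

// Naive version: rebuilds the matrix at every MC step.
double pathNaive(double s0, double vol, int steps, int n) {
    double s = s0;
    for (int t = 0; t < steps; ++t)
        s = evolve(s, buildCovariance(vol, n), t);
    return s;
}

// Optimized version: the matrix is computed once per path and reused.
double pathHoisted(double s0, double vol, int steps, int n) {
    const Matrix cov = buildCovariance(vol, n);  // hoisted out of the loop
    double s = s0;
    for (int t = 0; t < steps; ++t)
        s = evolve(s, cov, t);
    return s;
}
```

With an active AAD type, the redundant rebuilds are doubly costly, since every matrix entry would also be recorded on the tape each step.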
Thanks @da-roth, that's all noted. We see that you're still working on this PR; let us know when you'd like us to review or comment. Thanks.
Hi @auto-differentiation-dev , yeah, I was doing some tests and investigations. I added a second case that uses 90 inputs (adding some credit data), so we now have a 45-input and a 90-input case. The output is a bit messy (I had to put it into a table manually), but I'll wait for your feedback before finalizing the combined report. We have this now: I think that looks quite nice, keeping in mind that the QL code is really not that nice here. Any thoughts or wishes? Cheers, Daniel
Thanks Daniel, this looks very solid. Really appreciate the high-quality work here; we know this kind of effort is time-consuming, and you've moved through it impressively fast. To summarise the key result: a plain valuation run with a single Monte Carlo path and no sensitivities takes ~206.37 ms in this setup. Enabling XAD (turbo-boosted with Forge) and running 10k Monte Carlo paths with 90 sensitivities only roughly doubles the runtime. That's a strong, concrete demonstration of the value of AAD via operator overloading combined with Record-Once / Replay-Many. We feel this would be even more compelling when framed in a risk context (e.g. ORE). Pending your go-ahead, we'll give it a final look and merge.
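The reason 90 sensitivities only roughly double the runtime can be sketched with a minimal hand-rolled reverse-mode tape (purely illustrative; this is not XAD's API): the computation is recorded once via operator overloading, and a single backward sweep then yields the sensitivity to every input at once, so the cost is largely independent of the number of inputs.

```cpp
#include <cassert>
#include <vector>

// Toy reverse-mode tape: each node stores its parents and local partials.
struct Tape {
    struct Node { int a, b; double wa, wb; };
    std::vector<Node> nodes;
    int record(int a = -1, double wa = 0.0, int b = -1, double wb = 0.0) {
        nodes.push_back({a, b, wa, wb});
        return static_cast<int>(nodes.size()) - 1;
    }
    // One backward pass propagates the output adjoint to ALL inputs.
    std::vector<double> adjoints(int out) const {
        std::vector<double> adj(nodes.size(), 0.0);
        adj[out] = 1.0;
        for (int i = out; i >= 0; --i) {
            if (nodes[i].a >= 0) adj[nodes[i].a] += nodes[i].wa * adj[i];
            if (nodes[i].b >= 0) adj[nodes[i].b] += nodes[i].wb * adj[i];
        }
        return adj;
    }
};

// Minimal active type recording onto the tape via operator overloading.
struct AD { Tape* t; int idx; double v; };
AD input(Tape& t, double v) { return {&t, t.record(), v}; }
AD operator*(AD x, AD y) { return {x.t, x.t->record(x.idx, y.v, y.idx, x.v), x.v * y.v}; }
AD operator+(AD x, AD y) { return {x.t, x.t->record(x.idx, 1.0, y.idx, 1.0), x.v + y.v}; }
```

For f(x, y) = x*y + x the sweep returns df/dx = y + 1 and df/dy = x in one pass; a bump-and-revalue FD baseline would instead need one extra full valuation per input.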
Hi @auto-differentiation-dev , thank you! Did some output cleaning yesterday evening, and the combined report now shows this: Agreed, I think there are a couple of nice insights stating the general value and also showing the directions for improvement:
For some reason the Windows action was running out of memory; I didn't really notice that yesterday. I limited the compilation to 2 jobs on Windows since MSVC seems to consume more memory. I'm done from my side, so feel free to have your final look while I keep an eye on the Windows job. Cheers, Daniel
auto-differentiation-dev
left a comment
Hi @da-roth,
The PR is approved as such, but the Windows build is taking unbearably long with strong inlining. Let's try the suggested changes to disable strong inlining and see what we get.
Merged commit 5645132 into auto-differentiation:main




This PR integrates the Forge JIT backend for XAD, adding optional native code generation support. Forge is an optional dependency - everything builds and runs without it.
Changes
Build options added:
Files added:
Files modified:
Benchmarks
The benchmark workflow (ql-benchmarks.yaml) runs swaption pricing benchmarks comparing FD, XAD tape, JIT scalar, and JIT-AVX methods on Linux and Windows.
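For context on the FD baseline in these benchmarks, a bump-and-revalue gradient (sketched below with a hypothetical stand-in pricer, not the benchmark's actual code) needs one extra full valuation per bumped input, so its cost grows linearly with the number of sensitivities, unlike the tape and JIT methods:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Stand-in payoff; the real benchmark prices swaptions under QuantLib.
double price(const std::vector<double>& x) {
    double p = 0.0;
    for (double xi : x) p += std::sin(xi) * xi;
    return p;
}

// Forward-difference bump-and-revalue: N+1 valuations for N inputs.
std::vector<double> fdGradient(std::vector<double> x, double h = 1e-6) {
    std::vector<double> g(x.size());
    const double base = price(x);        // 1 base valuation
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += h;
        g[i] = (price(x) - base) / h;    // +1 valuation per bumped input
        x[i] -= h;
    }
    return g;
}
```

With 90 inputs that is 91 valuations per pricing, which is why AAD's near-constant overhead per sensitivity is the headline result here.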
Also included some initial work towards #33 - the workflow has type overhead jobs that compare double vs xad::AReal pricing performance (no derivatives) on the same hardware, providing a baseline for measuring XAD type overhead.
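One way such a type-overhead job can be structured (a sketch under assumed names; `Active` below is a trivial stand-in for `xad::AReal<double>` so the snippet stays self-contained) is a single pricing kernel templated on the scalar type, timed once per type, so the measured gap is pure type overhead with no tape recording involved:

```cpp
#include <cassert>
#include <chrono>

// Trivial wrapper standing in for an AAD active type (no real tape here).
struct Active {
    double v;
    Active(double x = 0.0) : v(x) {}
    Active operator+(Active o) const { return Active(v + o.v); }
    Active operator*(Active o) const { return Active(v * o.v); }
};
double value(double x) { return x; }
double value(Active x) { return x.v; }

// One pricing kernel, instantiated for both scalar types.
template <class Real>
double kernel(int paths) {
    Real acc(0.0);
    for (int i = 0; i < paths; ++i)
        acc = acc + Real(1e-3) * Real(i % 7);  // stand-in pricing loop
    return value(acc);
}

// Times one instantiation and returns its result.
template <class Real>
double timedKernel(int paths, double* ms) {
    auto t0 = std::chrono::steady_clock::now();
    double r = kernel<Real>(paths);
    auto t1 = std::chrono::steady_clock::now();
    *ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return r;
}
```

In the real jobs the second instantiation would presumably be `kernel<xad::AReal<double>>`; since both instantiations perform the same arithmetic, any timing difference isolates the active type's cost.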
Example benchmark run (Linux) Link