cranelift: Don't recompute allocatable register set #8457

jameysharp · 2024-04-24T06:30:22Z

The implementation of PRegSet::from(self.vcode.machine_env()) does a surprising amount of work: it lazily initializes a OnceLock in the backend, then loops over six vectors of registers and adds them all to the PRegSet.

We could likely implement that better, but in the meantime at least we can avoid repeating that work for every single machine instruction. The PRegSet itself only takes a few words to store so it's cheap to just keep it around.

I discovered this because when the call to self.vcode.machine_env() is in the middle of the loop, that prevents holding a mutable borrow on parts of self.vcode, which I want to be able to do in another PR.

The implementation of `PRegSet::from(self.vcode.machine_env())` does a surprising amount of work: it lazily initializes a `OnceLock` in the backend, then loops over six vectors of registers and adds them all to the `PRegSet`. We could likely implement that better, but in the meantime at least we can avoid repeating that work for every single machine instruction. The `PRegSet` itself only takes a few words to store so it's cheap to just keep it around. I discovered this because when the call to `self.vcode.machine_env()` is in the middle of the loop, that prevents holding a mutable borrow on parts of `self.vcode`, which I want to be able to do in another PR.

cfallin · 2024-04-24T15:00:38Z

This one really surprised me so I went looking at history to figure out if we ever measured it: it looks like first appeared in #6957; previously the PRegSet was computed once when the lowering backend was created; and we didn't benchmark that. The PR did note caching the MachineEnv but we missed caching the derived information.

Out of curiosity have you seen a perf improvement with this PR? (We should do it even if not, but it seems to me it should improve compile time)

cfallin

also I'll go ahead and r+ since I dug into this (sorry to preempt reviewers!).

jameysharp · 2024-04-26T03:56:07Z

According to hyperfine, this commit is 1.02 ± 0.01 times faster when running wasmtime compile -C cache=n ../sightglass/benchmarks/spidermonkey/benchmark.wasm. That's actually a bigger effect than I expected since I thought operand collection is a pretty small part of the overall time.

jameysharp requested a review from a team as a code owner April 24, 2024 06:30

jameysharp requested review from elliottt and removed request for a team April 24, 2024 06:30

github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. labels Apr 24, 2024

cfallin approved these changes Apr 24, 2024

View reviewed changes

jameysharp added this pull request to the merge queue Apr 26, 2024

Merged via the queue into bytecodealliance:main with commit 4822904 Apr 26, 2024

jameysharp deleted the cache-allocatable branch April 26, 2024 04:15

jameysharp mentioned this pull request May 14, 2024

PregSet machine env bytecodealliance/regalloc2#177

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cranelift: Don't recompute allocatable register set #8457

cranelift: Don't recompute allocatable register set #8457

Uh oh!

jameysharp commented Apr 24, 2024

Uh oh!

cfallin commented Apr 24, 2024

Uh oh!

cfallin left a comment

Uh oh!

jameysharp commented Apr 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cranelift: Don't recompute allocatable register set #8457

cranelift: Don't recompute allocatable register set #8457

Uh oh!

Conversation

jameysharp commented Apr 24, 2024

Uh oh!

cfallin commented Apr 24, 2024

Uh oh!

cfallin left a comment

Choose a reason for hiding this comment

Uh oh!

jameysharp commented Apr 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants