Cranelift, Part 2: Compiler Efficiency, CFGs, and a Branch Peephole Optimizer
This post is the second in a three-part series about Cranelift. In the first post, I described the context around Cranelift and our project to replace its backend code-generation infrastructure, and detailed the instruction-selection problem and how we solve it. The remaining two posts will be deep-dives into some interesting engineering problems.
In this post, I want to dive into the compiler performance aspect of our work more deeply. (In the next post we'll explore correctness.) There are many interesting aspects of compilation speed I could talk about, but one particularly difficult problem is the handling of control flow: how do we translate structured control flow at the Wasm level into control-flow graphs at the IR level, and finally to branches in a linear stream of instructions at the machine-code level?
Doing this translation efficiently requires careful attention to the overall pass structure, with the largest wins coming when one can completely eliminate a category of work. We'll see this in how we combine several passes in a traditional lowering design (critical-edge splitting, block ordering, redundant-block elimination, branch relaxation, branch target resolution) into inline transforms that happen during other passes (lowering of the CLIF, or Cranelift IR, into machine-specific IR; and later, binary emission).
This post basically describes the
MachBuffer
,
a "smart machine-code buffer" that knows about branches and edits them
on-the-fly as we emit them, and the
BlockLoweringOrder
,
which allows us to lower code in final basic-block order, with split
critical edges inserted implicitly, by traversing a never-materialized
implicit graph. The work was done mostly in Cranelift PR
#1718, which
resulted in a ~10% compile-time improvement and a ~25%
compile+run-time improvement on a CPU-intensive benchmark (bz2
).
Control-Flow Graphs
Before we discuss any of that, we need to review control-flow graphs (CFGs)! The CFG is a fundamental data structure used in almost all modern compilers. In brief, it represents how execution (i.e., program control) may flow through instructions, using graph nodes to represent linear sequences of instructions and graph edges to represent all possible control-flow transfers at branch instructions.
At the end of the instruction selection process, which we learned about in the previous post, we have a function body lowered into VCode that consists of basic blocks. A basic block is a contiguous sequence of instructions that has no outbound branches except at the end, and has no inbound branches except at the beginning. In other words, it is "straight-line" code: execution always starts at the top and proceeds to the end. An example control-flow graph (CFG) consisting of four basic blocks is shown below:
Control-flow graphs are excellent data structures for compilers to
use. By making the flow of execution explicit as graph edges, rather
than reasoning about instructions in order in memory as the processor
sees them, many analyses can be performed more easily. For example,
dataflow analysis
problems can be solved easily because the CFG makes traversal of
possible control-flow transfers easy. Graph-based representations of
the program also allow easier moving and insertion of code: it is
less error-prone to manipulate an explicit graph than to reason about
implicit control-flow (e.g. fallthrough from a not-taken conditional
branch). Finally, the graph representation factors out the question of
block ordering, which can be important for performance; we can
address this problem separately by choosing how we serialize the graph
nodes (blocks). For these reasons, most compiler IRs, including
Cranelift's CLIF and VCode
, are CFG-based.
(Historical note: control-flow graphs were invented by the late Frances Allen, who largely established the algorithmic foundations that modern compilers use. Her paper A catalogue of optimizing transformations1 covers essentially all of the important optimizations used today and is well worth a read.)
CPUs and Branch Instructions
To represent a CFG's end-of-block branches at the instruction level,
we can use two-way branches: these are instructions that branch
either to one basic-block target if some condition is true, or another
if the condition is false. (Basic blocks can also end in simple
unconditional single-target branches.) We wrote such a branch as if r0, L1, L2
above; this means that the block L0
will be followed in
execution either by L1
or L2
, depending on the value in r0
.2
Branches with Fallthrough
However, CPUs rarely have such two-way branch instructions. Instead, conditional control-flow in common ISAs is almost always provided with a conditional branch with fallthrough. This is an instruction that, if some condition is true, branches to another location; otherwise, does nothing, and allows execution to continue sequentially. This is a better fit for a hardware implementation for a number of reasons: it's easier to encode one target than two (the destination of the jump might be quite far away for some branches, and instructions have limited bits available), and it's usually the case that the compiler can place one of the successor blocks immediately afterward anyway.
Now, this isn't much of a problem if we just want a working compiler; instead of a two-way branch
if r0, L1, L2
We can write a sequence of branches
br_if r0, L1
goto L2
where br_if
branches to L1
or falls through to the unconditional
goto
. But this is not so efficient in many cases. Consider what
would happen if we laid out basic blocks in the order L0
, L2
,
L1
, L3
:
L0:
...
br_if r0, L1
goto L2
L2:
...
goto L3
L1:
...
goto L3
L3:
...
return
There are two redundant unconditional branches (goto
instructions),
each of which uselessly branches to the following instruction. We can
remove both of them with no ill effects, taking advantage instead of
fallthrough, or allowing execution to proceed directly from the end
of one block to the start of the next one:
L0:
...
br_if r0, L1
// ** Otherwise, fall through to L2 **
L2:
...
goto L3
L1:
...
// ** Always fall through to L3 **
L3:
...
return
This seems like an easy enough problem to solve: we just need to recognize when a branch is redundant and remove it, right? Well, yes, but we can do much better than that in some cases; we'll dig into this problem in significantly more depth below!
Machine-code Encoding: Branch Offsets
So far, we've written our machine instructions in a way that humans
can read, using labels to refer to locations in the instruction
stream. At the hardware level, however, these labels do not exist;
instead, the machine code branches contain target addresses (usually
encoded as relative offsets from the branch instruction). In other
words, we do not see goto L3
, but rather goto +32
.
This gives rise to several complications when emitting machine code
from a list of instruction struct
s. At the most basic level, we
have to resolve labels to offsets and then patch the branches
appropriately. This is analogous to (but at a lower level than) the
job of a linker:
we resolve symbols to concrete values after deciding placement, and
then edit the code according to relocations to refer to those
symbols. In other words, whenever we emit a branch, we make a note (a
relocation, or "label use" in our MachBackend
) to go back later and
patch it with the resolved label offset.
The second, and more interesting, problem arises because not all branch instructions can necessarily refer to all possible labels! As a concrete example, on AArch64, conditional branches have a ±1 MB range, and unconditional branches have a ±128 MB range. This arises out of instruction-encoding considerations: particularly in fixed-instruction-size ISAs (such as ARM, MIPS, and RISC V), less than a full machine word of bits are available for the immediate jump offset that is embedded in the instruction word. (The instruction itself is always a machine-word wide, and we need some bits for the opcode and condition code too!) On x86, we have limits for a different reason: the variable-width encoding allows either a one-byte offset (allowing a ±128 byte range) or four-byte offset (allowing a ±2 GB range).
To make a branch to a far-off label, then, on some machines we need to either use a different sort of branch than the default choice for the instruction selector, or we need to use a form of indirection, by targetting the original branch to another branch, the latter in a special form. The former is tricky because we do not know whether a target will be in-range until all code is lowered and placement is computed; so we need to either optimistically or pessimistically lower branches to the shortest or longest form (respectively) and possibly switch later. To make matters worse, as we edit branches to use a shorter or longer form, their length may change, moving other targets into or out of range; in the most general solution, this is a "fixpoint problem", where we iterate until no more changes occur.
Challenges in Lowering CFGs to Machine Code
So far, we have a way to produce correct machine code. To emit the final code for a two-target branch, we can emit a conditional- followed by unconditional-branch machine instruction. To resolve branch targets correctly, we can assume that any target could be anywhere in memory, and always use the long form of a branch; then we just need to come back in one final pass and fill in the offsets when we know them.
We can do much better than this, though! Below I'll describe four problems and the ways that they are traditionally solved.
Problem 1: Efficient use of Fallthroughs
We described above how branch fallthroughs allow us to omit some
some unconditional branches once we know for sure the order that basic
blocks will appear in the final binary. In particular, the simple
lowering of a two-way branch if r0, label_if_true, label_if_false
to
two one-way branches
br_if r0, label_if_true
goto label_if_false
label_if_false:
...
label_if_true:
...
has a completely redundant and useless goto
! In general, if a branch
target is the very next instruction, we can delete that branch.
However, there are slightly more complex cases where we can also find some improvements. Consider the inverted version of the above:
br_if r0, label_if_false
goto label_if_true
label_if_false:
...
label_if_true:
...
No branch here branches to its fallthrough, so one might think that
both branches are necessary. But in practice, on most CPU
architectures, all conditional branches have inverted forms. For
example, the x86 instruction JE
(jump if equal) can be inverted to
JNE
(jump if not equal). If we are allowed to edit branch conditions
as well, then we can rewrite the above as:
br_if_not r0, label_if_true
label_if_false:
...
label_if_true:
...
This turns out to remove many additional branches in practice.
Problem 2: Empty Blocks
It is sometimes the case that after optimizations, a basic block is completely empty aside from a final unconditional branch. This can occur when all of the code in an if- or else-block is optimized away or moved elsewhere in the function body. It can also occur when a block was inserted to split a critical edge (see below).
Thus, a common optimization is jump threading: when one branch points directly to another, we can just edit the first branch to point to the final target. Generalized, we can "chase through" any number of branches to eliminate intermediate steps. For example:
...
goto L1
L1:
goto L2
L2:
goto L3
L3:
...
can become:
...
goto L3 // <--- edited branch
L1:
goto L2
L2:
goto L3
L3:
...
note that the intermediate branches were not removed: they may still be the targets of other branches. We skip over them when starting from the first branch. However, if we know some other way that these branches are unused, we can then delete them, reducing code size.
Problem 3: Branch Relaxation
As we noted above, the "branch relaxation" problem is that we must choose one of multiple forms for each branch instruction, each of which may have a different range (maximal distance from current program-counter location). This is complex because the needed range depends on the final locations of the branch and its target, which in turn depends on the size of instructions in the machine code; but some of those instructions are themselves branches. We thus have a circular dependency.
There will always be some way to branch to an arbitrary location in the processor's address space, so there is always the trivial but inefficient solution of using worst-case branch forms. However, we can usually do much better, because the majority of branches will be to relatively small offsets.
The usual approach to solving this problem involves a "fixpoint computation": an iterative loop that continues to make improvements until none are left. This is where the "relaxation" of branch relaxation comes from: we modify branch instructions to have more optimal forms as we discover that targets are within range; and as we do this, we recompute code offsets and see if this enables any other relaxations. As long as the relationship between branch range and branch instruction size is monotonic (smaller required range allows for shorter instruction), this will always converge to a unique fixpoint; but it is potentially expensive, and involves sticky data-structure design questions if we want the code editing and/or offset recomputation to be fast.
Problem 4: Critical Edges
For a number of reasons, we usually want to split critical edges in the control-flow graph. A critical edge is any control-flow transfer edge that comes from a block with multiple out-edges, and goes to a block with multiple in-edges. We sometimes need to insert some code to run whenever the program follows a critical edge: e.g., the register allocator may need to "fix up" the machine state, moving values around in registers as expected by the target block. Consider where we might insert such code: we can't insert it prior to the jump, because this would execute no matter what out-edge is taken. Similarly, we can't insert it at the target of the jump, because this would execute for any entry into the target block, not just transfers over the particular edge.
The solution is to "split" the critical edge: that is, create a new basic block, edit the branch to point to this block, and then create an unconditional branch in the block to the original target. This basic block is a place where we can insert whatever fixup code we need, and it will execute only when control flow transfers from the one specific block to the other. A critical-edge split is illustrated in the following figure:
There are multiple ways in which we could handle this problem: we could preemptively split every critical edge; or we could split them on demand, only when we need to insert code. The latter would require editing the CFG in place, and for various reasons, we would prefer to avoid doing this: it invalidates many analysis results, and complicates data structures. It is also much simpler to reason about many algorithms if we can assume that edges are already split. However, splitting every edge will leave many empty blocks, because we usually do not need to insert any fixup code on an edge. In addition, splitting an edge raises the question of where to insert the split-block. If we take the simplest approach and append it to the end of the function, we probably significantly reduce the number of branch-fallthrough simplifications we can make; a smarter heuristic that placed the block near its predecessor or successor would be better.
Traditional Approach: In-Place Edits
The traditional approach to all of these problems is to decompose the
task into a number of passes and perform in-place edits with those
passes. For example, in LLVM, IR is lowered into a machine-specific
form (MachineFunction
of MachineBasicBlock
s) with an explicit
notion of layout and with machine-level branch instructions; then
edits are made, taking care to update branches when the layout
changes.
For example, the following sequence of passes should handle most of the above issues:
-
Split all critical edges, placing the split-block after the predecessor. (In LLVM, the
SplitCriticalEdges
pass.) -
Perform other optimizations, and register allocation; these may use the split-blocks.
-
Perform jump-threading transform; this will remove control-flow transfers through empty blocks. (In LLVM, the
BranchFolding
pass.) -
Compute reachability, and delete "dead blocks" (blocks that are no longer reachable). (Also done by
BranchFolding
in LLVM.) -
Compute a block order that tries to minimize jump distances and places at least one successor directly after every block when possible. (In LLVM, the
MachineBlockPlacement
pass.) -
Linearize the code from the CFG nodes into a single stream of machine instructions using this block order. (In LLVM, blocks are initially lowered into the
MachineFunction
and then reordered byMachineBlockPlacement
.) -
Remove branches to fallthrough blocks, and invert conditionals that create additional fallthroughs.
-
Compute block offsets based on machine-code size of current instruction sequence, assuming worst-case size for every branch.
-
Scan over branches, checking whether block locations allow for shorter forms due to nearer targets. Update branches and recompute block offsets if so. Continue until fixpoint. (In LLVM, the
BranchRelaxation
does this.) -
Fill in branch targets using final offsets. Branches are now in a form ready for machine-code emission.
Clearly this will work, and with some care (especially in the block-placement heuristics), it will produce very good code. But the above steps require many in-place edits. This is both slow (we are re-doing some work every time we edit the code) and forces us to use data structures that allow for such edits (e.g., linked list), which imposes a tax on every other operation on the IR. Is there a better way?
Cranelift's New Approach: Streaming Edits
It would be ideal if we could avoid some of the code-transform passes described above; can we? It turns out that one can actually do the functional equivalent of all of the above as part of other, pre-existing work:
-
We can decide the final block order ahead of time, and do our CLIF-to-VCode lowering in this order, so VCode never needs to be reordered; it is already linearized. We can also insert critical-edge splits as part of this lowering.
-
We can do all of the other work -- inverting conditionals, threading jumps, removing dead blocks, and handling various branch sizes -- in a streaming approach during machine-code emission! The key insight is that we can do a sort of "peephole optimization": we can immediately delete and re-emit branches at the tail of the emission buffer. By tracking some auxiliary state during the single emission scan, such as reachability, labels at current emission point, a list of unresolved label-refs earlier in code, and a "deadline" for short-range branches, we can do everything we need to do without ever backing up more than a few contiguous branches at the end of the buffer.
Let's go into each of those in more detail!
Step 1: Decide Final order and Split Edges while Lowering
As part of the instruction-selection pipeline described in the previous post), we need to iterate over the basic blocks in the CLIF and, for each block, lower its code to VCode instructions. We would like to do this iteration in the same order as our final machine code layout so that we don't need to reorder the VCode later.
The only constraint that the lowering algorithm imposes is that we examine value uses before value defs, which we can ensure by visiting a block before any of its dominators. That leaves a lot of freedom in how we do the lowering.
If that were the whole problem, we could just do a postorder traversal and be done with it. In fact, the problem is complicated by one other factor: critical-edge splitting!
Recall that we described above that we must either preemptively split all critical edges or else find a way to edit-in-place later. To avoid the complexities of edit-in-place, we choose to split them all. Note that this is cheap as far as our CFG lowering is concerned, because our later branch optimizations will remove empty blocks almost for free. (The register allocator's analyses may become more expensive with a higher block count, but in practice we have not found this to be much of a problem.)
The challenge is in generating these blocks in the correct place on the fly. To generate the lowering order, we define a virtual graph that is never actually materialized, whose nodes are implied by the CLIF blocks and edges (every CLIF edge becomes a split-edge block) and whose edges are defined only by a successor function. To generate the lowering order, we perform a depth-first search over the virtual graph, recording the postorder. This postorder is guaranteed to see uses before defs, as required. The DFS itself is a pretty good heuristic for block placement: it will tend to group structured-control-flow code together into its hierarchical units.
There are additional details in the implementation that ensure we split only critical edges rather than all edges, that record block-successor information directly as we produce lowered blocks so that the subsequent backend stages do not need to recompute it, and some other small optimizations.
This is illustrated in the following figure, showing a CLIF-level CFG transformed with split edges and merged edge-blocks then linearized at a conceptual level; and the successor function actually defined to drive the DFS. Note that the naïve lowering of the split-edge CFG would result in 14 branches (due to 14 CFG edges); the final lowered machine code contains only 4 branches, while providing a slot for any needed fixup instructions on any CFG edge.
Step 2: Edit Branches while Emitting
Once we have lowered VCode
, we need to emit machine code! In a
conventional design, this would require linearization, a bunch of
branch optimizations, and branch-target resolution before we ever
produced a byte of machine-code. But we can do much better.
In Cranelift's design, a machine backend just emits every conditional
branch naïvely as a two-way combination into a machine-code buffer we
call the MachBuffer
. Critically, however, this MachBuffer
is not
merely a Vec<u8>
: it knows (many) things about its content,
including where its branches are, what the branches' targets are, and
how to invert the branches if necessary.
The MachBuffer
will perform streaming edits on the code as it is
emitted, editing only a suffix of the buffer (contiguous bytes up to
the end, or current emission point), in order to convert two-way
branch combos when possible into simpler forms.
The abstraction that the machine backend sees is:
-
The
MachBuffer
allows us to emit machine-code bytes. -
We can tell the
MachBuffer
that a certain range of machine-code bytes that we just emitted are a branch, either conditional or unconditional, how to invert it if conditional, and a label as the branch target. -
We can bind a label to the current emission point.
-
We parameterize the
MachBuffer
on aLabelUse
trait implementation which defines all the different kinds of branch-target references, how to patch the machine code with a resolved offset, and how to emit a veneer, i.e., a longer-form branch that the original branch can indirect through in order to reach further.
And that's it! The MachBuffer
does all of the work behind the
scenes: when we emit a branch, it sometimes chomps some bytes; and
when we define a label, it sometimes scans through a list of deferred
fixups to patch earlier machine code.
A (simplified) illustrated example is shown below. The machine backend
emits two-way branches naïvely by always emitting a conditional and
unconditional branch (e.g. at the end of basic block L0
). It also
provides metadata to the MachBuffer
that describes where the labels
are, where the branches are, where the branches are targetted (as
labels), and how to invert conditional branches. The MachBuffer
is
able to perform the listed streaming edits as code is emitted,
producing the final machine code at the right with no intermediate
buffering or additional passes. We'll describe how this editing occurs
in more detail below.
Branch Peephole Optimizations
The key insight of the MachBuffer
design is that we can edit
branches at the tail of the buffer as code is emitted by tracking
the "most recent" branches: specifically, the branches that are
contiguous to the tail of the buffer.
The first optimization that we do is branch inversion, which can
sometimes eliminate unconditional branches. In the example above, when
the backend has emitted the machine-code bytes for all of the L0
basic block, the MachBuffer
will know that the last two branches,
contiguous to the tail, are a conditional branch to L1
and an
unconditional branch to L2
. When we then see the label L1
(which
is the first branch's target) bound to this offset, we can apply a
simple rule: a conditional that jumps over an immediately following
unconditional can be inverted, and the unconditional branch
removed. Note that, critically, because these branches are contiguous
to the tail, the edit will not affect any offsets that have already
been resolved; we are free to contract the code-size here, and offsets
of subsequently-emitted code will be correct without further fixups.
Said another way, our approach never moves code. Rather, it only sometimes chomps or adjusts a just-emitted branch, right away, before code-emission carries on past the branch.
The next optimizations we do are jump threading and dead-block
removal. Recall from above that jump threading means that
intermediate steps in a chain of jumps can be removed: a jump to a
jump to X becomes just a jump to X. We resolve this by keeping an
up-to-date alias table that tracks label-to-label aliases. The table
is updated whenever the MachBuffer
is informed that an unconditional
jump was emitted and a label was bound to its address. We then
indirect through the alias table when resolving labels to final
offsets. The second task, dead-block removal, occurs as a side-effect
of tracking reachability of the current buffer tail. Any offset that
(i) immediately follows an unconditional jump, and (ii) has no
labels bound to it, is unreachable; an unconditional jump at an
unreachable offset can be elided. (Actually, any code at an
unreachable offset can be removed, but for simplicity and to make it
easier to reason about correctness, we restrict the MachBuffer
's
edits to code explicitly marked as branch instructions only.)
In order for this to work correctly, we need to track all labels that have been bound to the current buffer tail and adjust them if we chomp (truncate) the buffer or redirect a label. For this reason, the label-binding, label-use resolution, and branch-chomping are all tightly integrated into a set of interacting data structures:
To summarize, we track:
- Emitted bytes;
- All labels bound to the current offset;
- A table of all labels to a bound offset or "unbound";
- A table of all labels to another label as an alias or "unaliased";
- A list of the last contiguous branches, each of which is conditional or not, with inverted form if so, and label-use, and labels that are bound to this branch instruction;
- A list of other label-uses for fixup.
As we emit code, we append to the emitted-bytes buffer. We lazily invalidate the "labels at current offset" set by tracking the offset for which that set is valid; appending new code implicitly clears it.
As the machine backend tells the MachBuffer
about branches, we
append to the list of the last contiguous branches. This, too, is
invalidated when code is emitted that is not a branch.
When a label-use is noted and the label is already resolved, we fix up the buffer right away. Note that once a label resolves to an offset, that offset cannot change; so this fixup can be done once and the metadata discarded.
All branch simplification happens when a label is bound: this is
when new actions become possible. We perform the following algorithm
(see
MachBuffer::optimize_branches()
for details):
- Loop as long as there are some branches in the latest-branches list:
- If the current buffer tail is beyond the end of the latest branch, done (and clear list).
- If the latest branch (which ends at current tail) has a target that resolves to current tail, chomp it and restart loop.
- If the latest branch is unconditional and does not branch to
itself:
- Update any labels pointing at this branch to point at its target instead.
- Restart loop if any labels were moved.
- If latest branch is unconditional, follows another unconditional branch, and no labels are bound at this branch, then chomp it (unreachable) and restart loop.
- If latest branch is unconditional, follows a conditional branch, and conditional branch target is current tail, then invert conditional and chomp the unconditional, and restart loop.
- When loop is done, clear latest-branches list; no more can be simplified.
This may look to have undesirable algorithmic complexity, but in fact it is tightly bounded: we make a forward-progress argument as labels only move down alias chains and fixed work is done per branch (each is only examined once and acted upon or purged). Overall, the algorithm runs in linear time.
This linear-time algorithm that edits locally, avoids any code-movement, and streams into a buffer in final form, is far better than the multi-pass edit-in-place design of a traditional backend, both asymptotically and in practice (CPUs love streaming algorithms and minimized data movement). It seems to produce code nearly as good as a much more complex branch simplifier at a much lower cost.
Correctness
The algorithm for simplifying branches is one of the most critical to correctness in the (post-optimizer) compiler backend; probably only second to the register allocator. It is very subtle, and bugs can be disastrous: incorrect control flow could cause anything to happen, from impossible-to-debug incorrect program results (ask me how I know! And here too!) to serious security vulnerabilities.
Because of this, we have taken extensive care to ensure
correctness. In fact, more than a third of the lines in the
MachBuffer
implementation
are a proof of correctness, based on several core invariants
described
here. At
each data-structure mutation, we show that (i) the invariants still
hold, and (ii) the code mutation did not alter execution semantics.
Because there is significant wisdom in the Knuth quote "I have only proved it correct, not tried it" (there are always gaps between specification and reality, and unless one generates an implementation from a machine-checked proof, then the English prose or its translation to code may have bugs too3), all invariants are also fully checked on each label-bind event in debug builds of Cranelift. The various fuzzing harnesses that hammer on the new backend will thus be driving these checks continuously.
Other Concerns
There are many subtleties to the branch-simplification and code layout
problems that were not discussed here! Most prominently, we have not
covered branch veneers or the topic of branch ranges at all, though
we saw the problem of "branch relaxation" above. The MachBuffer
handles out-of-range branches by tracking a "deadline" (the last point
at which any currently outstanding label may be bound without causing
a branch to go out of range); if we hit the deadline, we emit an
island of branch veneers, which are commonly just long-range
unconditional branches, for each unresolved label and resolve the
labels to those branches. This extends the deadlines. In practice
island emission will almost never occur, so it is acceptable to
pessimize this case (add an extra indirection) to avoid the need to go
back and edit the original branch into a longer-range form.
We also haven't covered constant pools; these are handled with the same "island" mechanism, allowing emitted machine code to refer to nearby constant data.
Conclusion, and Next Time
This has been a deep dive into the world of branch simplification, with an emphasis on how we engineered Cranelift's new backend to provide very good compilation speed taking control-flow handling and branch lowering/simplification as an example. We believe that there may be other significant opportunities to rethink, and carefully engineer, core algorithms in the compiler backend with specific attention to maximizing streaming behavior, minimizing indirection, and minimizing passes over data. This is an interesting and exciting engineering pursuit largely because it goes beyond the world of "theoretical standard compiler-book algorithms" and calls on problem solving to find clever new design tricks.
As we described near the end of this post, correctness is also an important focus -- perhaps the most important focus -- of any compiler engineering effort. Given that, I plan to write the next (and final) post in this series about how we engineered for correctness by taking a deep-dive into the register allocator checker, which is a novel symbolic checker (which can be seen as an application of abstract interpretation) that allows us to prove that any particular register-allocator execution gave a correct allocation result. I'll talk about how this checker, driven by a fuzzing frontend, found some really subtle and interesting bugs that we likely never would have found in production otherwise. With that, until next time, happy compiling!
For discussions about this post, please feel free to join us on our Zulip instance in this thread.
Thanks to Benjamin Bouvier for reviewing this post and providing very helpful feedback! Thanks also to bjorn3 for correcting a typo in a figure.
Frances E. Allen and John Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers (Prentice-Hall, 1972), pp. 1--30.
This is a bit of a simplification of branches in the IR: in CLIF (and in most other CFG-based compiler IRs), there are several branch types. Another is the "switch" or "branch table" branch that chooses between N possible targets with an integer index. There are also simple single-target unconditional branches; and a return instruction is also a "branch" of sorts in that it ends a basic block, though it has no successors. The important takeaway is that IR-level branches are an abstraction level above machine-code control flow, allowing for a direct choice between several or many targets as one operation.
See PR #2083 above, which is a bug that arose after I wrote the correctness proof, because the proof assumed target-aliasing supported arbitrarily-long branch chains but it actually followed only one level. This was a deliberate earlier implementation choice to avoid infinite loops on branch cycles. It turns out that it's possible to just avoid cycles in the alias table by construction; we carefully prove that this is so and then allow redirect-chasing through chains of branches. For extra paranoia, because a non-terminating compiler is bad and we are merely human, we still limit redirect-chasing to 1 million branches and panic beyond that (because one can never be too careful); this is a limit that will never be hit when using the Wasm frontend (due to limits on function size) and is extremely unlikely to be hit elsewhere.