Compilers and compiler infrastructure
- (Day job) weval WebAssembly partial evaluator and a new execution tier in the SpiderMonkey JS engine, Dec 2022 -- present.
Developed ahead-of-time compilation of JavaScript in a two-step process: first, adding a new interpreter tier to SpiderMonkey, Portable Baseline, that supports execution of inline caches using an interpreter over the IC-specific IR (CacheIR), allowing for no-codegen execution of JS engine fastpaths; then, developing weval, a tool that uses the first Futamura projection, a kind of partial evaluation, to turn this interpreter into a compiler automatically. The result is compiled Wasm bytecode produced from JavaScript source, including generated fastpaths for type-specific cases, without requiring any profiling/warmup, type feedback, or type annotations (runtime binding of fastpaths uses indirect call-based stub dispatch instead). This allows for faster (3-5x) JS execution on Wasm-first platforms that do not allow JIT compilation or otherwise have a native JS engine. (See my talk for more details, and a miniature illustration of the Futamura-projection trick below.)
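To illustrate the first Futamura projection in miniature (a conceptual sketch only, not weval's actual interface or SpiderMonkey's interpreter):

    // A tiny stack-machine interpreter. Specializing `interp` with respect to
    // a *fixed* program is the first Futamura projection: the dispatch loop
    // unrolls, the opcode matches fold away, and a compiled version of the
    // program remains.
    #[derive(Clone, Copy)]
    enum Op {
        Push(i64),
        Add,
        Mul,
    }

    fn interp(program: &[Op], input: i64) -> i64 {
        let mut stack = vec![input];
        for op in program {
            match *op {
                Op::Push(n) => stack.push(n),
                Op::Add => {
                    let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                    stack.push(a + b);
                }
                Op::Mul => {
                    let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                    stack.push(a * b);
                }
            }
        }
        stack.pop().unwrap()
    }

    // Partially evaluating `interp` against the fixed program
    // [Push(2), Push(3), Add, Mul] leaves a residual function equivalent to:
    fn residual(input: i64) -> i64 {
        input * (2 + 3)
    }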
- (Day job) Cranelift compiler framework, Dec 2019 -- present.
Developed a new framework for compiler backends, and the initial AArch64 (ARM64) backend using this framework. Developed a new register allocator (regalloc2), including a novel symbolic checker that uses abstract interpretation for highly effective fuzzing. Designed and implemented a new pattern-matching DSL (domain-specific language), ISLE, which we now use to write all of our backend lowering rules (a hand-written analogue of such a rule is sketched below). Implemented an e-graph (equivalence graph)-based mid-end optimization framework, putting all important rewrites into one fixpoint loop, and built infrastructure to use ISLE to express rewrites here as well. (To our knowledge this is the first production use of e-graphs in a general-purpose compiler.) Published formal verification work using an extension of the ISLE DSL to convert instruction-selection rules to SMT and verify them. Over the past few years the compiler has become much faster, generates much better code, and has grown to four backends (x86-64, aarch64, riscv64, s390x). From 2020 until late 2022 I acted as tech lead; I am spending more time on other projects now but remain actively involved.
All of this work was done with excellent and brilliant collaborators: Julian Seward and Benjamin Bouvier, Nick Fitzgerald, Jamey Sharp, Trevor Elliott, Andrew Brown, and many others!
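For flavor, here is a hand-written Rust analogue of the kind of lowering rule ISLE lets us state declaratively (the IR and machine-instruction types here are invented for illustration; they are not Cranelift's real ones):

    // Rule: (iadd x (iconst k)) => (add_imm x k), when k fits in an immediate.
    // Select an add-immediate instruction rather than materializing the
    // constant in a register.
    enum IrInst {
        Iconst(i64),
        Iadd(Box<IrInst>, Box<IrInst>),
    }

    enum MachInst {
        MovImm(i64),
        AddImm { rn: Box<MachInst>, imm: i64 },
        Add { rn: Box<MachInst>, rm: Box<MachInst> },
    }

    fn lower(inst: &IrInst) -> MachInst {
        match inst {
            IrInst::Iadd(x, y) => match y.as_ref() {
                // The interesting case: right operand is a small constant.
                IrInst::Iconst(k) if (0..4096).contains(k) => MachInst::AddImm {
                    rn: Box::new(lower(x)),
                    imm: *k,
                },
                // Fallback: a plain register-register add.
                _ => MachInst::Add {
                    rn: Box::new(lower(x)),
                    rm: Box::new(lower(y)),
                },
            },
            IrInst::Iconst(k) => MachInst::MovImm(*k),
        }
    }

ISLE compiles a whole set of such rules, written declaratively, into an efficient matching automaton, rather than relying on hand-maintained nested matches like this one.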
- (Day job) waffle WebAssembly SSA compiler framework, Nov 2021 -- present.
For use in Wasm-to-Wasm transform tools, I built an SSA compiler framework with a Wasm frontend (constructing IR from Wasm) and a Wasm backend (lowering to Wasm from IR). The backend in particular involves nontrivial algorithmic work to reconstruct structured control flow and to do "register allocation" over Wasm locals, assigning SSA values to them (a miniature version of this problem appears below). Work on ergonomics and performance is ongoing, but correctness is the first priority: the project includes rigorous differential fuzzing to ensure that a roundtrip through IR and optimizations is semantics-preserving.
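A miniature of the locals-assignment problem (a hypothetical simplification, not waffle's actual code; the real allocator must also handle control flow, types, and merges): given SSA values with live ranges sorted by start point, assign each to a Wasm local, reusing locals whose ranges have ended.

    // Linear-scan-style assignment of SSA values to Wasm local indices.
    // Input: (start, end) live ranges sorted by start. Output: one local
    // index per value.
    fn assign_locals(live_ranges: &[(u32, u32)]) -> Vec<u32> {
        let mut free: Vec<u32> = Vec::new();          // locals available for reuse
        let mut active: Vec<(u32, u32)> = Vec::new(); // (end, local) still live
        let mut next_local: u32 = 0;
        let mut assignment = Vec::with_capacity(live_ranges.len());

        for &(start, end) in live_ranges {
            // Expire values whose live range ended before this one starts,
            // returning their locals to the free pool.
            active.retain(|&(e, local)| {
                if e <= start {
                    free.push(local);
                    false
                } else {
                    true
                }
            });
            let local = free.pop().unwrap_or_else(|| {
                next_local += 1;
                next_local - 1
            });
            active.push((end, local));
            assignment.push(local);
        }
        assignment
    }

    fn main() {
        // Ranges 0 and 1 overlap; range 2 can reuse local 1; range 3 reuses
        // local 0. Prints [0, 1, 1, 0].
        println!("{:?}", assign_locals(&[(0, 4), (1, 3), (3, 6), (4, 5)]));
    }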
- (Day job) Port of SpiderMonkey to Wasm/WASI, early 2021.
In order to execute JavaScript applications on a Wasm-centric platform, we needed to embed a JS interpreter inside a Wasm module. I ported SpiderMonkey to the wasm32-wasi platform, which mainly required stubbing out the right platform-specific parts. Subsequent work investigated techniques to build IC (inline cache) chains, which ordinarily are not used in SpiderMonkey's portable interpreter; someday I hope to get back to this!
- Hardware design language with novel semantics for pipeline idioms (see Autopiper under Computer architecture below).
Program analysis, programming language design, and memory management/ownership
- (Academic research) PhD dissertation contributing two main ideas: (i) an alias analysis that works on loops, determining when pointers refer to unique objects in each iteration of a given loop ("distinctness"), in particular reasoning about common data structures such as lists, key-value maps, and sets (a small example of such a loop is sketched below); and (ii) a way to combine static analysis and dynamic observation in a principled way, essentially deriving a hybrid static-dynamic system automatically and systematically from the static analysis's inference rules.
This research culminated in results showing that "irregular" programs, i.e., those that are not just loop nests over arrays but are built around other interesting data structures, often have more parallelism available than a conventional autoparallelizing compiler would reveal.
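The kind of loop the distinctness analysis targets, as a hypothetical Rust example (not code from the dissertation):

    use std::collections::HashMap;

    struct Record {
        hits: u64,
    }

    // Each iteration receives a pointer to a *distinct* Record: the values
    // of a map are unique objects, so no two iterations touch the same
    // object, and the loop body could safely be distributed across threads.
    // Proving this property automatically, for loops over lists, maps, and
    // sets, is what the distinctness analysis does.
    fn bump_all(table: &mut HashMap<String, Record>) {
        for rec in table.values_mut() {
            rec.hits += 1;
        }
    }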
- (Side-project with academic publication) Deferred borrows: a proposal to extend Rust's ownership and borrowing system to allow for logically memory-safe references to items in collections (vectors, hashmaps, and the like) without the restrictions that the borrow system would impose. Logical memory safety means that a reference will always refer to the same item: in particular, it is tied (at the type-system level) to a particular collection, and the collection is constrained in a way that prevents the item from being removed.
These "deferred borrows" build on path-dependent types and provide most of the benefits of true borrows, i.e., non-dangling references to objects in a way that cannot be broken, without imposing onerous limits on which borrows can be simultaneously active (e.g., no more than one mutable borrow at a time).
Computer architecture
- (Side research project) Autopiper, a high-level synthesis (HLS) tool that compiles a sequential, imperative programming language to a digital logic hardware description in Verilog. This project was an experiment in bridging sequential stateful computation semantics and pipelined hardware with a notion of "transactional state": the compiler reasons about state across time and pipeline stages, and can generate bypass logic and speculation (with recovery from misspeculation), and may eventually perform other optimizations automatically. This was an attempt to formalize many of the design patterns that a computer architect working on a high-performance out-of-order microarchitecture would use, effectively "deriving from first principles" a CPU core from ISA semantics. A lofty goal, and somewhat unfinished, but all the primitives are there!
- (Academic research) Heterogeneous Block Architecture, a proposal for a high-performance, energy-efficient CPU microarchitecture that incorporates a conventional out-of-order superscalar core and a VLIW (very long instruction word) in-order core that runs prescheduled code based on dynamic recordings of the superscalar core's activity. The work proposes a framework to combine multiple execution substrates (the two types of core) into one hybrid core that transparently runs a single thread of the user's program, using dynamically-formed atomic superblocks and hardware that communicates control flow and live-in/live-out values between blocks.
To evaluate this work, I built a very detailed microarchitectural simulator that runs unmodified x86-64 programs. The simulator is an execute-at-execute design, meaning that it splits instructions into uops and sends these through a simulated pipeline with real program values. It performs speculation, computes values down a speculative program path, and recovers from misspeculation as the hardware does. (A simpler approach often used instead is to execute the program normally, record a trace of its instructions, and then feed these into a timing model that only has to be mostly correct and need not handle speculation. This simpler approach is faster but not quite accurate enough for detailed microarchitectural work.) A miniature contrast of the two styles appears below. Unfortunately the full simulator is not publicly available, but I have released x86instlib, the x86 model underlying the decode and execute stages (in turn based on parts of PTLsim).
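The essential difference between the two simulation styles, in miniature (illustrative types only; the real simulator models an x86-64 pipeline in far more detail):

    // Execute-at-execute: uops carry real operand values, and results are
    // computed inside the timing model itself. This lets the simulator
    // follow a mispredicted branch down the wrong path with concrete
    // wrong-path values and then model recovery. A trace-driven model, by
    // contrast, replays a pre-recorded stream of *committed* instructions
    // and never sees wrong-path execution at all.
    struct Uop {
        srcs: [u64; 2], // actual register values, not just register names
        result: u64,    // filled in by the simulated execute stage
    }

    fn execute_stage(mut uop: Uop) -> Uop {
        uop.result = uop.srcs[0].wrapping_add(uop.srcs[1]); // e.g., an ADD uop
        uop
    }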
- Dynamic translation-based stack machine: a pipelined CPU in Verilog with special hardware register-stack handling to allow execution of a virtual stack machine, together with dynamic translation firmware that JITs the bytecode to the native core. This was an undergrad computer architecture final project.
Systems software
- (Day job) contributions to Wasmtime, a WebAssembly virtual machine / runtime that uses Cranelift as its compiler backend. In addition to my work on Cranelift, I've worked on Wasmtime's memory management and instantiation performance: using copy-on-write mmaps and making as much data-structure initialization as possible lazy, in order to get instantiation down into the low-microsecond range. I also built the "epoch interruption" mechanism, a much lower-cost alternative to "fuel" that allows for bounded runtime with periodic yields/interrupts, in order to work well with high-concurrency event-loop architectures (a usage sketch follows below).
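A minimal sketch of how an embedder uses epoch interruption (written against a recent wasmtime crate API; details vary by version): a background thread bumps the engine-wide epoch, and any store whose deadline has passed is interrupted at the next epoch check, with no per-instruction fuel accounting.

    use std::{thread, time::Duration};
    use wasmtime::{Config, Engine, Instance, Module, Store};

    fn main() -> wasmtime::Result<()> {
        let mut config = Config::new();
        config.epoch_interruption(true);
        let engine = Engine::new(&config)?;

        // Bump the epoch every millisecond; this timer tick is the only
        // periodic cost, shared across all stores.
        let timer_engine = engine.clone();
        thread::spawn(move || loop {
            thread::sleep(Duration::from_millis(1));
            timer_engine.increment_epoch();
        });

        // A guest that loops forever; epoch interruption bounds its runtime.
        let module = Module::new(
            &engine,
            r#"(module (func (export "run") (loop br 0)))"#,
        )?;
        let mut store = Store::new(&engine, ());
        store.set_epoch_deadline(1); // interrupt after one epoch tick
        let instance = Instance::new(&mut store, &module, &[])?;
        let run = instance.get_typed_func::<(), ()>(&mut store, "run")?;
        assert!(run.call(&mut store, ()).is_err()); // infinite loop is trapped
        Ok(())
    }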
- (Day job) contributions to protobuf (protocol buffers, a serialization format often used as an in-memory data structure as well) as a member of the team at Google. I co-developed arena allocator support and led its launch internally, which involved modifying the (often very hot) generated code while avoiding any performance or heap-space regressions in production workloads; making use of the more efficient allocation infrastructure sometimes led to significant performance wins (the arena idea is sketched below). The discussions surrounding API semantics and compatibility were especially interesting.
I also developed several language bindings for the protobuf library, including a C extension for Ruby and a prototype C++ extension for Node.js, and adapted an internal pure-JavaScript protobuf runtime and code generator for external release.
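The arena idea, reduced to a conceptual Rust sketch (protobuf's real support lives in the generated C++ code and its Arena class; this only shows the allocation pattern):

    // Objects are allocated out of a region owned by the arena and freed all
    // at once when the arena is dropped, replacing many individual
    // malloc/free pairs on hot paths with cheap bump-style allocation.
    struct Arena<T> {
        slots: Vec<T>,
    }

    struct Handle(usize);

    impl<T> Arena<T> {
        fn new() -> Self {
            Arena { slots: Vec::new() }
        }

        fn alloc(&mut self, value: T) -> Handle {
            self.slots.push(value);
            Handle(self.slots.len() - 1)
        }

        fn get(&self, h: &Handle) -> &T {
            &self.slots[h.0]
        }
        // Dropping the Arena frees every allocation in one shot.
    }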
- (Side project) speck, the small, portable, efficiently-coded kernel: a hobby operating system developed in 2001--2002, when I was much younger. It runs on bare x86 and implements preemptive multitasking, message passing, and dynamic kernel-module loading.
Open-source software and libraries
- (Side project) boolean_expression (docs), a library for building and evaluating Boolean expressions and BDDs (binary decision diagrams), and performing expression simplification via cubelists.
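For a sense of the domain, a self-contained sketch of a Boolean expression tree and evaluator (illustrative only, not the crate's actual types or API; see its docs for those):

    use std::collections::HashMap;

    // A Boolean expression over named variables.
    enum Expr {
        Var(&'static str),
        Const(bool),
        Not(Box<Expr>),
        And(Box<Expr>, Box<Expr>),
        Or(Box<Expr>, Box<Expr>),
    }

    // Evaluate an expression under a variable assignment.
    fn eval(e: &Expr, env: &HashMap<&str, bool>) -> bool {
        match e {
            Expr::Var(v) => env[v],
            Expr::Const(b) => *b,
            Expr::Not(x) => !eval(x, env),
            Expr::And(a, b) => eval(a, env) && eval(b, env),
            Expr::Or(a, b) => eval(a, env) || eval(b, env),
        }
    }

A BDD represents the same function canonically (given a variable order), which is what makes equivalence checking and simplification tractable.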