Performance Internals
Sema's default execution path is a bytecode VM (see Bytecode VM). A tree-walking interpreter is still available via --tw for debugging and comparison. Early optimizations brought the 1 Billion Row Challenge benchmark from ~25s to ~9.6s on 10M rows using a "mini-eval" — a minimal evaluator inlined in the stdlib that bypassed the full trampoline. The mini-eval was later removed for architectural reasons (semantic drift from the real evaluator, and blocking the path to a bytecode VM). Fast-path optimizations in the real evaluator partially recovered tree-walker performance, bringing it to ~2,700ms on 1M rows (vs ~960ms with the mini-eval). The bytecode VM now achieves ~1,150ms on 1M rows and ~12,600ms on 10M rows, more than recovering the mini-eval's performance through compilation rather than inlining. This page documents each optimization, its history, and measured impact.
All benchmarks were run on Apple Silicon (M-series), processing the 1BRC dataset (semicolon-delimited weather station readings, one per line).
Benchmark Summary
| Stage | 1M rows | 10M rows | Technique | Status |
|---|---|---|---|---|
| Baseline | 2,501 ms | ~25,000 ms | Naive implementation | — |
| + COW assoc | 1,800 ms | ~18,000 ms | In-place map mutation | ✅ Active |
| + Env reuse | 1,626 ms | 16,059 ms | Lambda env recycling (mini-eval) | ❌ Removed |
| + Mini-eval | ~960 ms | ~9,600 ms | Inlined builtins, custom parser | ❌ Removed |
| + String interning | — | — | Spur-based dispatch | ✅ Active |
| + hashbrown | — | — | Amortized O(1) accumulator | ✅ Active |
| Post-removal | ~2,700 ms | ~29,700 ms | Callback architecture + fast paths | ✅ Tree-walker (--tw) |
| Bytecode VM | ~1,150 ms | ~12,600 ms | Bytecode VM (default) | ✅ Default backend |
Note: The mini-eval and its associated optimizations (env reuse, inlined builtins, custom number parser, SIMD split fast path) were removed to unblock the bytecode VM, which is now the default backend as of v1.14.0. The bytecode VM provides a ~2.4× speedup over the tree-walker (~1,150ms vs ~2,700ms on 1M rows), more than recovering the mini-eval's performance through compilation. The tree-walker remains available via
--tw; it usessema_core::call_callbackto route stdlib → real evaluator. Fast-path optimizations (self-evaluating short-circuit, inline NativeFn dispatch, thread-local EvalContext, deferred cloning) partially recovered tree-walker performance.
VM compute benchmarks (Feb 2026, post-stdlib intrinsics): TAK 1,248ms, upvalue-counter 450ms, deriv 887ms. The deriv benchmark — dominated by
car/cdr/cons/pair?— improved 22% from stdlib intrinsic opcodes. The 1BRC numbers above are I/O-bound and less affected by VM compute optimizations.
Per-Instruction Inline Cache (Mar 2026)
The VM's global variable lookup was originally served by a 256-slot direct-mapped cache. Each LoadGlobal/CallGlobal hashed the variable name to a slot, leading to collisions on hot paths where multiple globals mapped to the same slot.
The per-instruction inline cache assigns a dedicated cache slot to each LoadGlobal/CallGlobal instruction at compile time. Cache entries are (spur_bits, env_version, value) tuples — the spur_bits guard provides cross-VM closure safety, and the env version counter invalidates stale entries on any global mutation.
Impact (Apple Silicon, release build, hyperfine --warmup 2 --runs 5):
| Benchmark | Before (direct-mapped) | After (per-instruction) | Speedup |
|---|---|---|---|
| higher-order-fold | 6,116 ms | 2,617 ms | 2.34× |
| deriv | 2,356 ms | 1,449 ms | 1.63× |
| closure-storm | 1,302 ms | 1,145 ms | 1.14× |
| tak | 1,728 ms | 1,749 ms | ~1.0× |
| mandelbrot | 311 ms | 313 ms | ~1.0× |
| upvalue-counter | 574 ms | 575 ms | ~1.0× |
The biggest wins are on global-call-heavy workloads: higher-order-fold calls stdlib HOFs (map, filter, foldl) in a tight loop — each call requires a global lookup. deriv similarly uses many global functions for symbolic differentiation. Benchmarks dominated by local computation (tak, mandelbrot, upvalue-counter) show no change, as expected.
Micro-Benchmark Suite (Feb 2026)
All benchmarks run on Apple Silicon (M-series), 10 runs + 3 warmup, via scripts/bench.sh.
| Benchmark | Tree-walker | Bytecode VM | VM speedup |
|---|---|---|---|
| tak | 21,222 ms | 1,248 ms | 17.0× |
| nqueens | 20,735 ms | 2,028 ms ¹ | 10.2× |
| deriv | 3,473 ms | 887 ms | 3.9× |
| upvalue-counter | 5,762 ms | 450 ms | 12.8× |
| closure-storm | 2,373 ms | 1,041 ms | 2.3× |
| higher-order-fold | 2,292 ms | 1,081 ms | 2.1× |
| hashmap-bench | 8,612 ms | 3,645 ms | 2.4× |
| bench-features | 12,427 ms | 1,144 ms | 10.9× |
| string-pipeline | 1,551 ms | 613 ms | 2.5× |
| mandelbrot | 2,223 ms | 212 ms | 10.5× |
| throw-catch | 2,195 ms | 197 ms | 11.2× |
¹ nqueens was previously broken on the VM due to a forward-reference bug in inner defines (fixed Mar 2026). The VM result above now reflects correct execution.
The VM achieves 2–17× speedups across the board, with the largest gains on recursion-heavy benchmarks (tak, nqueens, bench-features, upvalue-counter) where call overhead dominates. Closure-heavy and string benchmarks show more modest ~2–3× gains.
1. Copy-on-Write Map Mutation
Problem: Every (assoc map key val) call cloned the entire BTreeMap, even when no other reference existed. For the 1BRC accumulator (~400 weather stations), this was O(400) per row × millions of rows.
Solution: Use Rc::try_unwrap to check if the reference count is 1. If so, take ownership and mutate in place. Otherwise, clone.
// crates/sema-stdlib/src/map.rs
match Rc::try_unwrap(m) {
Ok(map) => map, // refcount == 1: we own it, mutate in place
Err(m) => m.as_ref().clone(), // shared: must clone
}The key insight is pairing this with Env::take() — by removing the accumulator from the environment before passing it to assoc, the refcount drops to 1, enabling the in-place path. User code looks like:
(file/fold-lines "data.csv"
(lambda (acc line)
(let ((parts (string/split line ";")))
(assoc acc (first parts) (second parts))))
{})The fold-lines implementation moves (not clones) acc into the lambda env on each iteration, keeping the refcount at 1.
Impact: ~30% of the total speedup. Eliminated the O(n) full-map clone, leaving only the O(log n) BTreeMap insert per row.
Literature:
- This is the same copy-on-write strategy used by Swift's value types. (Clojure's persistent data structures solve a related problem — avoiding full copies — but via structural sharing rather than refcount-based COW.)
- Phil Bagwell, "Ideal Hash Trees" (2001) — the paper behind Clojure/Scala persistent collections
- Rust's
Rc::make_mutprovides the same semantics with less ceremony
2. Lambda Environment Reuse (removed)
Status: This optimization was part of the mini-eval's hot path in
io.rs. It was removed when the mini-eval was deleted. The currentfile/fold-linesusessema_core::call_callback, which routes through the real evaluator — each call creates a freshEnvvia the standardapply_lambdapath.
What it was: For simple lambdas (known arity, no rest params), the mini-eval created the lambda environment once and reused it across all iterations, overwriting bindings in place. Combined with a reusable line_buf, this eliminated per-iteration allocations for Env, string interning, and line buffers.
Why it was removed: The env reuse logic was tightly coupled to the mini-eval's direct lambda dispatch. The callback architecture routes through the real evaluator's apply_lambda, which always creates a fresh child Env — this is correct and avoids subtle bugs from env mutation leaking across calls.
Impact when active: ~15% speedup (2,501ms → 1,626ms combined with COW assoc).
What remains: The reusable line_buf (String::with_capacity(64) cleared each iteration) is still present in file/fold-lines — only the env reuse was lost.
3. Evaluator Callback Architecture (replacing Mini-Eval)
Status: The mini-eval was deleted and replaced with a callback architecture. Stdlib now calls the real evaluator via
sema_core::call_callback.
What the mini-eval was: sema-stdlib previously contained its own minimal evaluator (sema_eval_value) that handled common forms via direct recursive calls, inlining builtins like +, =, assoc, string/split, and string/to-number to skip Env lookup and NativeFn dispatch entirely.
Why it was removed:
- Semantic drift: The mini-eval diverged from the real evaluator — new special forms, error handling, and features had to be duplicated or were silently missing.
- Blocking bytecode VM: A bytecode compiler can't target two evaluators. Removing the mini-eval ensures a single evaluation path that the VM can replace.
The callback architecture: sema-stdlib cannot depend on sema-eval (circular dependency). Instead, sema-eval registers a thread-local callback (set_call_callback) at startup, and stdlib functions call sema_core::call_callback to invoke the real evaluator. A thread-local EvalContext (with_stdlib_ctx) is shared across calls to avoid per-call context allocation.
// crates/sema-stdlib/src/io.rs — file/fold-lines via callback
sema_core::with_stdlib_ctx(|ctx| {
let mut line_buf = String::with_capacity(64);
loop {
line_buf.clear();
let n = reader.read_line(&mut line_buf)?;
if n == 0 { break; }
// Calls the real evaluator (eval_value) via thread-local callback
acc = sema_core::call_callback(ctx, &func, &[acc, Value::string(&line_buf)])?;
}
Ok(acc)
})Performance trade-off: ~960ms → ~2,900ms on 1M rows (~3× regression). The overhead comes from the full trampoline evaluator: call stack management, span tracking, and Trampoline dispatch on every sub-expression.
Fast-path optimizations that partially recovered performance:
- Self-evaluating fast path:
eval_valueshort-circuits for integers, floats, strings, keywords, and symbols — skipping depth tracking and step limits for the most common forms. - Inline NativeFn dispatch: When the evaluator sees a
Value::NativeFnin call position, it calls the function pointer directly without going throughcall_callbackindirection. - Thread-local shared EvalContext:
with_stdlib_ctxreuses a singleEvalContextacross all stdlib → evaluator callbacks, avoiding per-call allocation ofRefCell/Cellfields. - Deferred cloning:
eval_value_inneravoids cloning the expression and environment on the first trampoline iteration, only cloning if a tail call (Trampoline::Eval) is returned.
Remaining gap: The ~3× regression cannot be fully closed within the tree-walking architecture. The bytecode VM (now the default) — the reason the mini-eval was removed — gets to ~1.2× of the mini-eval on 1M (~1,150ms vs ~960ms) and is ~2.4× faster than the tree-walker on the same workload.
Literature:
- Inline caching, pioneered by Smalltalk-80 and refined in V8's hidden classes, solves the same dispatch overhead problem but at a different architectural level
- Most production Lisps (SBCL, Chez Scheme) compile to native code, making dispatch overhead negligible — Sema's callback overhead is inherent to tree-walking interpreters
- Lua 5.x's bytecode VM inlines common operations (
OP_ADD,OP_GETTABLE) into the dispatch loop — this is the approach Sema's bytecode VM (sema-vm) takes
4. String Interning (lasso)
Problem: Symbol/keyword equality was O(n) string comparison. Environment lookups keyed by String required comparing the full string on each BTreeMap node visit. Special form dispatch compared against 30+ string literals on every list evaluation.
Solution: Replace Rc<String> in Value::Symbol and Value::Keyword with Spur — a u32 handle from the lasso string interner. Environment bindings keyed by Spur for direct integer lookup.
// Before: O(n) string comparison
Value::Symbol(Rc<String>)
env: BTreeMap<String, Value>
// After: O(1) integer comparison
Value::Symbol(Spur) // u32
env: BTreeMap<Spur, Value>Special form dispatch uses pre-cached Spur constants:
// crates/sema-eval/src/special_forms.rs
struct SpecialFormSpurs {
quote: Spur,
if_: Spur,
define: Spur,
// ... 30 more
}
// Dispatch: integer comparison, no string resolution
if head_spur == sf.if_ {
return Some(eval_if(args, env));
}Caveat: The initial implementation was actually slower (2,518ms vs 1,580ms baseline) because resolve() was allocating a new String on every symbol lookup. Fixed by adding with_resolved(spur, |s| ...) which provides a borrowed &str without allocation, and switching Env to use Spur keys directly.
Impact: 1,580ms → 1,400ms (11% faster) after fixing the allocation issue.
Literature:
- String interning is as old as Lisp itself — McCarthy's original LISP 1.5 (1962) interned atoms in the "object list" (oblist)
- Java interns all string literals and provides
String.intern(). The JVM'sinvokedynamicuses interned method names for O(1) dispatch - The string-interner and lasso crates are the two main Rust options; lasso was chosen for its
Rodeothread-local interner which fits Sema's single-threaded architecture
5. hashbrown HashMap
Problem: The 1BRC accumulator uses a map keyed by weather station name (~400 entries). BTreeMap provides O(log n) lookup, but the accumulator is accessed on every row. With 10M rows, the log₂(400) ≈ 9 comparisons per lookup adds up.
Solution: Added a Value::HashMap variant backed by hashbrown (the same hash map used inside Rust's std::collections::HashMap, but exposed directly for no_std compatibility and raw API access).
;; User code: opt into HashMap for the accumulator
(file/fold-lines "data.csv"
(lambda (acc line) ...)
(hashmap/new)) ; amortized O(1) vs O(log n)
;; Convert back to sorted BTreeMap for output
(hashmap/to-map acc)BTreeMap remains the default for {} map literals because deterministic ordering matters for equality, printing, and test assertions. hashbrown is opt-in for performance-critical paths.
Impact: 1,400ms → 1,340ms (4% faster). Modest because BTreeMap with 400 entries and short string keys is already fast.
Literature:
- hashbrown uses SwissTable, designed by Google for their C++
absl::flat_hash_map. See CppCon 2017: Matt Kulukundis "Designing a Fast, Efficient, Cache-friendly Hash Table" - Clojure's
{:key val}maps use HAMTs (hash array mapped tries) which provide O(~1) lookup with structural sharing. Sema's approach is simpler: full COW on theRc<HashMap>rather than structural sharing, which is viable because the refcount-1 fast path almost always hits
6. SIMD Byte Search (memchr) (removed)
Status: The memchr-based two-part split fast path was part of the mini-eval's inlined
string/splitand was removed with it. The currentstring/splitinsema-stdlib/src/string.rsuses Rust's standardstr::split()followed bymapandcollect. Thememchrcrate remains a dependency ofsema-stdlibbut is no longer used in the split hot path.
What it was: A SIMD-accelerated (SSE2/AVX2/NEON) byte search via the memchr crate, combined with a two-part split fast path that avoided Vec allocation when splitting on a single-byte separator with exactly one occurrence (the common case in 1BRC: "Berlin;12.3" → ["Berlin", "12.3"]).
Impact when active: Negligible for SIMD specifically (1BRC strings are 10–30 bytes), but the two-part fast path avoided iterator/Vec overhead.
Literature:
- memchr is maintained by Andrew Gallant (BurntSushi), author of ripgrep. It uses a generic SIMD framework to dispatch to the best available instruction set at runtime
7. Custom Number Parser (removed)
Status: This was part of the mini-eval's inlined
string/to-numberand was removed with it. The currentstring/to-numberinsema-stdlib/src/string.rsuses Rust's standardstr::parse::<i64>()with fallback tostr::parse::<f64>().
What it was: A hand-rolled decimal parser that handled only [-]digits[.digits], using a precomputed powers-of-10 lookup table for 1–4 fractional digits. It returned None for complex cases (scientific notation, infinity, NaN), falling back to the standard parser.
Impact when active: Part of the combined mini-eval speedup. Difficult to isolate, but avoided the overhead of Rust's dec2flt algorithm.
Literature:
- Rust's float parser is based on the Eisel-Lemire algorithm (2020), which is fast for a general-purpose parser but still does more work than necessary for simple decimals
- Daniel Lemire's fast_float C++ library (and its Rust port) takes a similar "fast path for common cases" approach
8. Enlarged I/O Buffer
Problem: BufReader's default 8KB buffer means frequent syscalls for large files.
Solution: 256KB buffer for file/fold-lines.
let mut reader = std::io::BufReader::with_capacity(256 * 1024, file);Impact: Minor. CPU was the bottleneck, not I/O. But it's a free win — larger buffers amortize syscall overhead and improve sequential read throughput on modern SSDs.
9. Bytecode VM Optimizations
The bytecode VM (the default backend) applies several optimizations beyond basic bytecode compilation. These are documented in detail in Bytecode VM; highlights below.
Intrinsic Recognition
The compiler recognizes calls to known builtins and emits inline opcodes instead of function calls:
Arithmetic & comparison (phase 1):
| Source | Compiled to | What it replaces |
|---|---|---|
(+ a b) | AddInt | CallGlobal("+", 2) → hash lookup → NativeFn downcast → args Vec → function call |
(- a b) | SubInt | Same overhead |
(* a b) | MulInt | Same overhead |
(< a b) | LtInt | Same overhead |
(> a b) | Gt | Same overhead |
(not x) | Not | Same overhead |
Stdlib: list operations & type predicates (phase 2, Feb 2026):
| Source | Compiled to | What it replaces |
|---|---|---|
(car x) / (first x) | Car | Same overhead — pop list, push first element |
(cdr x) / (rest x) | Cdr | Same — pop list, push tail |
(cons h t) | Cons | Same — pop head+tail, push new list |
(null? x) | IsNull | Same — push #t if nil or empty list |
(pair? x) | IsPair | Same — push #t if non-empty list |
(list? x) | IsList | Same — push #t if list |
(number? x) | IsNumber | Same — push #t if int or float |
(string? x) | IsString | Same — push #t if string |
(symbol? x) | IsSymbol | Same — push #t if symbol |
(length x) | Length | Same — push collection length as int |
(append a b) | Append | Same — concatenate two lists (2-arg only) |
(get m k) | Get | Same — map lookup, nil default (2-arg only) |
(contains? m k) | ContainsQ | Same — push #t if key exists in map |
This eliminates global hash lookup, Rc downcast, argument Vec allocation, and function pointer dispatch — the entire call overhead — for the most common operations. The *Int opcodes include NaN-boxed small-int fast paths that operate directly on raw u64 bits, avoiding Clone/Drop overhead entirely.
All standard arithmetic and comparison operators are inlined. The *Int variants include NaN-boxed fast paths; the generic opcodes (Div, Gt, Le, Ge) handle int/float coercion correctly.
Impact: Phase 1: TAK 4,352ms → 1,250ms (-71%), upvalue-counter 1,232ms → 450ms (-63%). Phase 2: deriv 1,123ms → 879ms (-22%), closure-storm 1,135ms → 1,029ms (-9%). The deriv benchmark is dominated by car/cdr/cons/pair? — exactly the functions that became intrinsics.
Constant Folding
An optimization pass (optimize.rs) runs on the CoreExpr IR between lowering and variable resolution. It folds compile-time-evaluable expressions:
- Arithmetic:
(+ 1 2)→3,(* 3 4)→12 - Comparisons:
(< 1 2)→#t,(= 3 3)→#t - Boolean:
(not #t)→#f - Control flow:
(if #t a b)→a,(and #f x)→#f,(or #t x)→#t - Dead code:
(begin 42 x)→(begin x)(pure constants before the last expression are eliminated)
Impact: Eliminates unnecessary instructions at compile time. Runtime impact on benchmarks is negligible (hot loops operate on variables), but reduces code size and improves startup for programs with constant subexpressions.
Peephole: (if (not X) ...) → JumpIfTrue
The compiler pattern-matches (if (not expr) then else) and emits the condition with an inverted jump, eliminating both the not call and one opcode dispatch:
;; Before: CallGlobal("not") + JumpIfFalse
;; After: JumpIfTrue (condition compiled directly)Fused CallGlobal
Non-tail calls to global functions use a single CallGlobal instruction that combines LoadGlobal + Call, using call_vm_closure_direct to set up the call frame without needing the function value on the stack.
Per-Instruction Inline Cache
Each LoadGlobal/CallGlobal instruction gets a dedicated cache slot at compile time, eliminating hash collisions. See the inline cache section above for benchmark results.
Specialized Local Access
Slots 0–3 have dedicated zero-operand opcodes (LoadLocal0..LoadLocal3, StoreLocal0..StoreLocal3), saving 2 bytes per access to the most common local variable slots.
Rejected Optimizations
Not everything we tried worked:
| Approach | Result | Why |
|---|---|---|
| HashMap for Env | Slower | BTreeMap is faster for the very small maps (1–3 entries) typical of let scopes. HashMap's hashing overhead exceeds BTreeMap's few integer comparisons at that size. |
| im-rc / rpds (persistent collections) | Slower | Structural sharing fights the COW optimization — the whole point is to avoid sharing and mutate in place when refcount is 1. |
| bumpalo / typed-arena | Incompatible | Values need to escape the arena (returned from functions, stored in environments). Arena allocation only works for temporaries. |
| compact_str / smol_str | Redundant | Once symbols/keywords are interned as Spur, small-string optimization for them is pointless. String values are still Rc<String> but they're not in the hot path for dispatch. |
Note: "Full evaluator callback" was previously listed here as rejected (4x slower than mini-eval). It is now the tree-walker's architecture — the ~2.7× overhead vs the mini-eval is accepted as the cost of architectural correctness. The bytecode VM (now the default) bypasses this overhead by compiling directly to bytecode.
Architecture Diagram
The hot path for file/fold-lines with the current callback architecture (tree-walker):
file/fold-lines
├── BufReader (256KB buffer, reused line_buf)
└── Per-line loop:
├── read_line → reused buffer (no alloc)
├── call_callback → real evaluator (eval_value)
│ ├── self-evaluating fast path (ints, floats, strings skip depth tracking)
│ ├── NativeFn inline dispatch (direct call, no callback indirection)
│ ├── apply_lambda → fresh Env per call (no env reuse)
│ ├── string/split → std str::split (no SIMD fast path)
│ ├── string/to-number → std parse::<i64> / parse::<f64>
│ └── assoc → COW in-place mutation (Rc refcount == 1)
├── thread-local EvalContext (shared, not per-call)
└── acc moved, not cloned → preserves refcount == 1The bytecode VM (the default backend) bypasses the tree-walker callback entirely. Instead of call_callback → eval_value → trampoline, the VM compiles the lambda body to bytecode once and executes it in a tight instruction dispatch loop, eliminating trampoline overhead, per-call span tracking, and repeated AST traversal. Pass --tw to opt back into the tree-walker.