Bytecode VM
Default Backend
The bytecode VM is the default execution path as of v1.14.0. The tree-walking interpreter is still available via --tw for debugging and comparison. Both runtimes coexist and share the global environment.
Overview
Sema's default backend is a bytecode VM. The VM compiles Sema source code into stack-based bytecode for faster execution, delivering up to 17× speedup over the tree-walker on compute-heavy workloads (e.g., TAK benchmark: 1.25s vs 21.4s).
The tree-walking interpreter (sema-eval) is preserved as the macro expansion engine and eval fallback, and remains user-accessible via --tw — the two runtimes coexist, sharing the global environment and EvalContext.
Compilation Pipeline
Source text
→ Reader (tokenize + parse → Value AST)
→ Macro expand (tree-walker evaluates macros)
→ Lower (Value AST → CoreExpr IR)
→ Resolve (CoreExpr → ResolvedExpr with slot/upvalue/global analysis)
→ Compile (ResolvedExpr → bytecode Chunks)
→ VM execution (dispatch loop)Phase 1: Lowering (Value → CoreExpr)
The lowering pass converts the Value AST into CoreExpr, a desugared intermediate representation. All ~40 special forms are lowered to ~55 CoreExpr variants. Several forms desugar into simpler ones:
| Source Form | Lowers To |
|---|---|
cond | Nested If |
when | If with nil else |
unless | If with swapped branches |
case | Let + nested If with Or comparisons |
defun | Define + Lambda |
Tail position analysis happens during lowering. The Call node carries a tail: bool flag, set based on position:
- Tail: last expression in
lambdabody,begin,let/let*/letrecbody,ifbranches,condclauses,and/orlast operand - Not tail:
trybody (handler must be reachable),doloop body, non-last expressions
Phase 2: Variable Resolution (CoreExpr → ResolvedExpr)
The resolver walks the CoreExpr tree and classifies every variable reference as one of:
| Resolution | Opcode | Description |
|---|---|---|
Local { slot } | LoadLocal / StoreLocal | Variable in the current function's frame |
Upvalue { index } | LoadUpvalue / StoreUpvalue | Captured from an enclosing function |
Global { spur } | LoadGlobal / StoreGlobal | Module-level binding |
This is the key optimization over the tree-walker: instead of hash-based environment chain lookup (O(scope depth) per access), variables are accessed by direct slot index (O(1)).
Upvalue Capture
Closures use the Lua/Steel upvalue model. When a lambda references a variable from an enclosing function:
- The resolver marks the outer local as captured
- An
UpvalueDescentry is added to the inner lambda:ParentLocal(slot)if capturing from the immediate parent,ParentUpvalue(index)if capturing through an intermediate function
(lambda (x) ; x = Local slot 0
(lambda () ; captures x: UpvalueDesc::ParentLocal(0)
(lambda () ; captures through chain: UpvalueDesc::ParentUpvalue(0)
x))) ; resolves to Upvalue { index: 0 }Phase 3: Bytecode Compilation (ResolvedExpr → Chunk)
The instruction format echoes the IBM 704 (1955)
The 704's Type A instruction packed four fields into a single 36-bit word: prefix (opcode), decrement (constant parameter), tag (register selector), and address (operand location). Sema's bytecode uses a strikingly similar structure — each instruction is an opcode byte followed by inline operands (constant indices, slot numbers, jump offsets). The 704 also had a CAS (Compare Accumulator with Storage) instruction that performed a 3-way branch in a single operation: skip 0, 1, or 2 instructions depending on less-than, equal, or greater-than. This is pattern matching as a hardware primitive — the ancestor of the conditional jump patterns Sema's compiler generates for cond and match.
The compiler (compiler.rs) transforms ResolvedExpr into bytecode Chunks. The Compiler struct wraps an Emitter (bytecode builder) and collects Function templates for inner lambdas.
Compilation strategies:
- Constants:
Nil,True,Falseget dedicated opcodes. All other constants useConst+ constant pool. - Variables:
LoadLocal/StoreLocalfor locals,LoadUpvalue/StoreUpvaluefor captures,LoadGlobal/StoreGlobal/DefineGlobalfor globals. - Control flow:
ifusesJumpIfFalse+Jumpfor short-circuit.and/oruseDup+ conditional jumps to preserve the last truthy/falsy value. - Lambdas: compiled to separate
Functiontemplates, referenced byMakeClosureinstruction with upvalue descriptors. doloops: compile to backwardJumpwithJumpIfTruefor exit test.try/catch: adds entries to the chunk's exception table, no inline opcodes.- Named let: compiled as
MakeClosure+Call— the loop body becomes a function.
Runtime-delegated forms — forms that can't be compiled to pure bytecode are compiled as calls to __vm-* global functions registered by sema-eval:
| Source Form | Delegate |
|---|---|
eval | __vm-eval |
import | __vm-import |
load | __vm-load |
defmacro | __vm-defmacro-form (passes entire form as quoted constant) |
define-record-type | __vm-define-record-type |
delay | __vm-delay (passes unevaluated body as quoted constant) |
force | __vm-force |
prompt, message, deftool, defagent, macroexpand | Corresponding __vm-* delegates |
Public API: compile(exprs, n_locals, known_natives) returns CompileResult { chunk, functions, native_table }. When known_natives is provided, calls to those globals emit CallNative for direct dispatch.
Compiler Optimizations
- Intrinsic recognition: Known builtins are compiled to inline opcodes instead of function calls, eliminating global lookup,
Rcdowncast, argumentVecallocation, and function pointer dispatch. Arithmetic/comparison:+,-,*,/,<,>,<=,>=,=,not. List/predicates:car/first,cdr/rest,cons,null?,pair?,list?,number?,string?,symbol?,length. - Peephole:
(if (not X) ...): The pattern(if (not X) A B)compiles toJumpIfTrueinstead ofNot+JumpIfFalse, eliminating one instruction. - Fused
CallGlobal: Non-tail calls to global functions use a fusedCallGlobalinstruction that combinesLoadGlobal+Callinto a single opcode with(u32 spur, u16 argc)operands. - Specialized
LoadLocal/StoreLocal: Slots 0–3 have dedicated zero-operand opcodes (LoadLocal0..LoadLocal3,StoreLocal0..StoreLocal3), saving 2 bytes per instruction for the most frequently accessed locals.
Phase 4: VM Execution
The VM (vm.rs) is a stack-based dispatch loop.
Core structs:
VM { stack: Vec<Value>, frames: Vec<CallFrame>, globals: Rc<Env>, functions: Vec<Rc<Function>> }
CallFrame { closure: Rc<Closure>, pc: usize, base: usize, open_upvalues: Vec<Option<Rc<UpvalueCell>>> }Key design points:
- Unsafe hot path: The dispatch loop uses
unsafeunchecked stack operations (pop_unchecked) and raw pointer bytecode reads viaread_u16!/read_i32!/read_u32!macros for performance. Opcode decoding usesstd::mem::transmuteon the#[repr(u8)]Openum. Debug builds retain bounds checks viadebug_assert!. - Closure interop: VM closures are wrapped as
Value::NativeFnvalues so the tree-walker can call them. Each NativeFn carries anRc<dyn Any>payload containingVmClosurePayload(closure + function table), and the VM usesraw_tag()+downcast_refto avoidRcrefcount bumps on the hot path. Each NativeFn wrapper creates a fresh VM instance to execute the closure's bytecode. - Upvalue cells:
UpvalueCellwithRc<RefCell<Value>>for shared mutable state —StoreLocalsyncs to open upvalue cells,LoadLocalreads from cells when present. - Exception handling:
Throwopcode triggers handler search via the chunk's exception table. Stack is restored to saved depth, error value pushed, PC jumps to handler.
Entry points: VM::execute() takes a closure and EvalContext. compile_program() is the pipeline for normal compilation: Value AST → lower → optimize → resolve → compile → CompiledProgram. compile_program_with_spans() adds span/source-file support for debug (DAP breakpoints).
VM Optimizations
- Two-level dispatch loop: An outer loop caches frame-local state (code pointer, constants pointer, base offset) into local variables. The inner loop dispatches opcodes without re-fetching frame data. Frame state is only reloaded when control flow changes frames (
Call,TailCall,Return, exceptions). - NaN-boxed int fast paths:
AddInt/SubInt/MulInt/LtInt/EqIntoperate directly on raw NaN-boxed bits — sign-extending the payload, performing the arithmetic, and re-boxing, without ever constructing aValue. - 16-entry direct-mapped global cache: Global lookups are cached in a
[(spur, env_version, Value); 16]array indexed byspur & 0xF. Cache hits skip theEnvmap/new lookup entirely; cache lines are invalidated by env version mismatch. - Raw pointer bytecode reads:
read_u16!,read_i32!, andread_u32!macros read operands via raw pointer arithmetic on the code buffer, avoiding bounds checks in release builds. - Unsafe unchecked stack operations:
pop_uncheckedskips length checks (the compiler guarantees stack correctness).debug_assert!guards catch violations in debug builds. - Cold path factoring: The
handle_err!macro factors exception handling out of the hot instruction sequence, keeping the fast path compact for better instruction-cache behavior.
Opcode Set
The VM uses a stack-based instruction set with variable-length encoding. Each opcode is one byte, followed by operands (u16, u32, or i32).
Constants & Stack
| Opcode | Operands | Description |
|---|---|---|
Const | u16 index | Push constants[index] |
Nil | — | Push nil |
True | — | Push #t |
False | — | Push #f |
Pop | — | Discard top of stack |
Dup | — | Duplicate top of stack |
Variable Access
| Opcode | Operands | Description |
|---|---|---|
LoadLocal | u16 slot | Push locals[slot] |
StoreLocal | u16 slot | locals[slot] = pop |
LoadUpvalue | u16 index | Push upvalues[index].get() |
StoreUpvalue | u16 index | upvalues[index].set(pop) |
LoadGlobal | u32 spur | Push globals[spur] |
StoreGlobal | u32 spur | globals[spur] = pop |
DefineGlobal | u32 spur | Define new global binding |
LoadLocal0..LoadLocal3 | — | Push locals[0..3] (zero-operand fast path) |
StoreLocal0..StoreLocal3 | — | locals[0..3] = pop (zero-operand fast path) |
Control Flow
| Opcode | Operands | Description |
|---|---|---|
Jump | i32 offset | Unconditional relative jump |
JumpIfFalse | i32 offset | Pop, jump if falsy |
JumpIfTrue | i32 offset | Pop, jump if truthy |
Functions
| Opcode | Operands | Description |
|---|---|---|
Call | u16 argc | Call function with argc args |
TailCall | u16 argc | Tail call (reuse frame for TCO) |
Return | — | Return top of stack |
MakeClosure | u16 func_id, u16 n_upvalues, ... | Create closure from function template |
CallNative | u16 native_id, u16 argc | Direct native function call (no env lookup) |
CallGlobal | u32 spur, u16 argc | Fused global lookup + call |
Data Constructors
| Opcode | Operands | Description |
|---|---|---|
MakeList | u16 n | Pop n values, push list |
MakeVector | u16 n | Pop n values, push vector |
MakeMap | u16 n_pairs | Pop 2n values, push sorted map |
MakeHashMap | u16 n_pairs | Pop 2n values, push hash map |
Arithmetic & Comparison
| Opcode | Description |
|---|---|
Add, Sub, Mul, Div | Generic arithmetic (int/float/string) |
Negate, Not | Unary operators |
Eq, Lt, Gt, Le, Ge | Generic comparison |
AddInt, SubInt, MulInt | Specialized int fast paths |
LtInt, EqInt | Specialized int comparison |
Exception Handling
| Opcode | Description |
|---|---|
Throw | Pop value, raise as exception |
Exception handling uses an exception table on the Chunk rather than inline opcodes. Each entry specifies a PC range, handler address, and stack depth to restore.
Crate Structure
The bytecode VM lives in the sema-vm crate, which sits between sema-reader and sema-eval in the dependency graph:
sema-core ← sema-reader ← sema-vm ← sema-evalsema-vm depends on sema-core (for Value, Spur, SemaError) and sema-reader (for parsing in test helpers). It does not depend on sema-eval — the evaluator will depend on the VM, not the other way around.
Source Files
| File | Purpose |
|---|---|
opcodes.rs | Op enum — 64 bytecode opcodes |
chunk.rs | Chunk (bytecode + constants + spans), Function, UpvalueDesc |
emit.rs | Emitter — bytecode builder with jump backpatching |
disasm.rs | Human-readable bytecode disassembler |
core_expr.rs | CoreExpr and ResolvedExpr IR enums |
lower.rs | Value AST → CoreExpr lowering pass |
resolve.rs | Variable resolution (local/upvalue/global analysis) |
compiler.rs | Bytecode compiler (ResolvedExpr → Chunk) |
vm.rs | VM dispatch loop, call frames, closures, exception handling |
optimize.rs | Constant folding and dead code elimination on CoreExpr IR |
serialize.rs | Bytecode serialization/deserialization for .semac file format |
Current Limitations
- The compiler emits inline opcodes for common builtins (
+,-,*,/,<,>,<=,>=,=,not,car/first,cdr/rest,cons,null?,pair?,list?,number?,string?,symbol?,length) via intrinsic recognition. If a user redefines these globals, the intrinsic will still fire (these are treated as non-rebindable primitives). CallNativeoptimization requires passingknown_nativesat compile time (done automatically byeval_str_compiled); without it, all global calls useCallGlobal
CLI Usage
The bytecode VM is the default. Pass --tw to opt back into the tree-walker.
# Run a file with the bytecode VM (default)
sema examples/hello.sema
# Run with the tree-walker for debugging/comparison
sema --tw examples/hello.sema
# REPL also defaults to VM; use --tw to switch
sema
sema --twBoth paths share the global Env and EvalContext.
Design Decisions
Why Keep the Tree-Walker?
The tree-walking interpreter remains for:
- Macro expansion — macros can call
evalat expansion time, requiring a full evaluator before compilation evalfallback — dynamicevalneeds to parse, expand, compile, and execute at runtime- Debugging — tree-walking is easier to step through and inspect
Why Not Delete CoreExpr After Resolution?
The pipeline uses two IR types: CoreExpr (variables as names) and ResolvedExpr (variables as slots). This provides type-level safety — the compiler can only receive resolved expressions, preventing accidental use of unresolved variable references.