Skip to content

Lisp Dialect Benchmark

How does Sema compare to other Lisp dialects on a real-world I/O-heavy workload? This page benchmarks fifteen Lisp dialects on the 1 Billion Row Challenge — read weather-station measurements and compute min/mean/max per station. It is not a synthetic micro-benchmark; it exercises I/O, string parsing, hash-table accumulation, and numeric aggregation in a tight loop.

A benchmark ranks implementations, not just runtimes

Each dialect's optimized entry uses a comparable best effort — a hand-rolled integer×10 temperature parser and, where it helps the runtime, block/byte I/O. Even so, results partly reflect how each program is written, not pure runtime throughput. The dialect notes say where each number comes from; the simple table shows the same workload written the obvious way.

Benchmark

One same-machine run: macOS 15.6, Apple M2 Max, native Homebrew runtimes, 10,000,000 rows (~124 MiB), best of 3, single-threaded. Sema is the v1.19.2 PGO build. All fifteen implementations produce byte-identical output. PicoLisp is omitted — no native Homebrew formula.

Optimized — best effort per dialect

Each implementation tuned to a comparable level (hand-rolled int×10 parser; block/byte I/O where the runtime benefits). Relative to the fastest (Fennel).

DialectTime (ms)RelativeRuntime
Fennel/LuaJIT5321.0xJIT compiler
SBCL8991.7xNative compiler
Racket1,4342.7xJIT (Chez backend)
Chez Scheme1,5152.8xNative compiler
Gambit2,2984.3xNative compiler (C)
Clojure2,8055.3xJVM (JIT)
Guile4,3558.2xBytecode VM + JIT
Janet5,0289.5xBytecode VM
Chicken5,77210.8xNative compiler (C)
Gauche7,15313.4xBytecode VM
Sema8,09615.2xBytecode VM
Emacs Lisp8,16715.4xBytecode VM
ECL8,93316.8xNative compiler (C)
newLISP9,01917.0xInterpreter
Kawa18,39534.6xJVM (JIT)

Simple / idiomatic

The same workload written the obvious way in each dialect — built-in number parser, per-line I/O, standard data structures. No hand-rolled parsers, no block reads. Closer to "raw runtime on naive code." Relative to the fastest (Gambit).

DialectTime (ms)Relative
Gambit2,3511.0x
Chez Scheme2,4811.1x
Fennel/LuaJIT2,6791.1x
Clojure2,9021.2x
SBCL2,9971.3x
Guile5,1862.2x
newLISP8,2063.5x
Chicken9,0943.9x
Janet9,9504.2x
Sema10,0264.3x
ECL13,5995.8x
Emacs Lisp16,2196.9x
Gauche16,4767.0x
Kawa17,7937.6x

The gap between the two tables is itself the story. Where optimized ≪ simple (Fennel, Racket, Janet, Gauche), most of the win came from a hand-rolled parser or block I/O. Where they're close (Clojure, Sema, newLISP), the runtime was already doing the work and there was little left to hand-tune.

Dialect notes

What's behind each number — and which results are runtime ceilings versus implementation choices.

Fennel / LuaJIT — the JIT runs away with it

Fennel compiled to LuaJIT is the fastest entry, ahead of SBCL (532 ms). LuaJIT's tracing JIT compiles the hot byte-scan loop to native code; with a string.byte integer parser and 1 MiB block reads it chews through ~250 MB/s. It's the clearest "runtime does the heavy lifting" result — but note its simple version is 2.7 s (5× slower), so the win is the optimized byte loop being unusually JIT-friendly, not a free lunch.

SBCL — native code + (safety 0)

SBCL compiles to native machine code; in a type-specialized hot path there is no interpreter loop. With (declare (optimize (speed 3) (safety 0))), block read-sequence I/O, an integer×10 parser, and in-place setf struct mutation, the inner loop runs near C speed. 25+ years of compiler work (descended from CMUCL). Its 1.3× → 1.0x optimization gain (simple 3.0 s → optimized 0.9 s) is the largest in the suite.

Racket — byte I/O over the Chez backend

Racket reads 1 MiB byte blocks, scans for ;/newline with O(1) subbytes slicing, and parses int×10. Its CS backend (Chez under the hood) plus byte strings put it third overall, just ahead of Chez itself — a notable result for a runtime usually thought of as "the teaching language."

Chez Scheme — the other native compiler

Chez compiles to native code via a nanopass framework. With a custom char-by-char parser and make-hashtable/string-hash it lands just behind Racket. The remaining gap to SBCL is mostly per-line string allocation versus SBCL's block parser.

Gambit — compiled Scheme via C

gsc compiles Scheme to C to a native binary. It got the same int×10 parser as the other Schemes, but the win was negligible here — read-line + substring + string hashing dominate the loop, so I/O, not number parsing, is the bottleneck.

Clojure — JVM tax + warmup

Clojure's time includes JVM startup and JIT warmup, real costs for a single-shot script. line-seq + a transient map is idiomatic but not zero-cost, and Double/parseDouble handles the full IEEE-754 spec. Steady-state throughput is better than the wall-clock suggests; it trades raw speed for compactness.

Guile — Scheme bytecode VM + JIT

Guile 3 has a bytecode VM with a native JIT on supported platforms. With a hand-rolled int×10 parser it's the fastest of the "VM" tier here, ahead of Janet and Chicken.

Janet — the closest architectural peer

Janet is the most architecturally comparable to Sema: an embeddable scripting language, bytecode VM, GC-based, no native compiler. Head to head, Janet (5.0 s) lands ~1.6× ahead of Sema (8.1 s). Two things help Janet — its strings are byte strings (O(1) slicing, no UTF-8 navigation), and it has a tracing GC instead of Rc reference counting, so a slurp + byte-scan + int parser goes a long way. This is the comparison to watch as Sema's runtime evolves.

Chicken — compiled Scheme, I/O bound

Chicken compiles Scheme to C via csc -O3 with an int×10 parser. The remaining gap is per-line I/O allocation and Chicken's continuation-passing-style C ("Cheney on the MTA"), whose trampolining the C compiler can't fully optimize away.

Gauche — byte scanning over UTF-8 strings

Gauche stores strings as UTF-8 indexed by character, so a substring/string-index implementation pays O(k) navigation per slice to convert character positions to byte offsets — a trap that can make a mature, well-engineered runtime look slow. The implementation here sidesteps it: read the whole file into a u8vector, scan bytes directly, parse int×10. That lands Gauche mid-pack at 7.2 s, ahead of Sema — and is a good reminder that on a char-indexed runtime, byte-oriented I/O is the difference between near-last and respectable.

Sema — the interpreter floor

Sema (8.1 s) sits behind Gauche, Janet, Chicken, and Guile. It's a bytecode interpreter with NaN-boxed immutable values and Rc reference counting — no JIT, no native codegen — and its implementation is at the interpreter floor: the tricks that help the compiled dialects don't transfer here. Sema's native string/split and string->float already beat an interpreted hand-parser, and vectors are immutable with no mutable cell, so every row rebuilds the stat vector. The next gains are runtime, not implementation: a tracing GC (to kill the per-row Rc churn) and mutable vectors / byte-buffer + string-slice APIs (so a genuinely byte-oriented implementation, like the fast dialects use, becomes possible). See Performance Internals.

Emacs Lisp — buffer-based I/O

Emacs loads the whole file into a buffer with insert-file-contents-literally and parses int×10 directly from buffer characters with no substring extraction — strong for a venerable bytecode VM, essentially tied with Sema.

ECL — Common Lisp via C

ECL compiles Common Lisp through C with compile-file, with an int×10 parser. The gap to SBCL is ECL's less aggressive native code generation.

newLISP — a small, simple interpreter

newLISP's accumulation uses a hash, but on this 40-station dataset the data structure hardly matters — with so few stations even a linear scan is cheap, and per-row interpreter overhead dominates either way. A faithful picture of a deliberately minimal interpreter.

Kawa — JVM Scheme, slower than expected

Kawa compiles Scheme to JVM bytecode. Even with Java interop (BufferedReader, java.util.HashMap), Scheme-on-JVM data representation, startup, and JIT warmup leave it last.

What this benchmark doesn't show

This is one workload. Different benchmarks would reorder things:

  • CPU-bound computation (fibonacci, sorting): the native compilers and JITs would pull further ahead; the I/O here amortizes some of the interpreter gap.
  • Startup time: included in wall-clock but not isolated — it hits the JVM dialects (Clojure, Kawa) hardest.
  • Memory usage: not measured; JVM runtimes carry a higher baseline than small standalone ones like Janet or Sema.
  • Multi-threaded: Clojure, SBCL, Janet, and Guile can parallelize; Sema is single-threaded (its async/channels are cooperative, not parallel).
  • Developer experience: Clojure's REPL, Racket's DrRacket, and SBCL's SLIME are far more mature than Sema's.

Methodology

  • Dataset: 10,000,000 rows (~124 MiB), 40 weather stations, from the 1BRC spec.
  • Environment: macOS 15.6 / Apple M2 Max, native Homebrew runtimes (June 2026). Sema 1.19.2 (PGO). Gauche 0.9.15. Others are the current Homebrew formulae / downloaded binaries.
  • Measurement: wall-clock, best of 3 consecutive runs per dialect, via benchmarks/1brc/run-native-benchmarks.py (all dialects measured together in one session). Sema is timed as the prebuilt PGO binary (make build-pgo, run with SEMA_SKIP_BUILD=1).
  • Verification: all fifteen implementations produce byte-identical normalized output (sorted stations, 1-decimal rounding) — checked every run.
  • Implementations: each optimized entry uses a comparable best effort (hand-rolled int×10 parser; block/byte I/O where the runtime benefits); the simple table uses each dialect's naive idiom. PicoLisp is omitted (no native Homebrew formula).

Reproducing

bash
# Generate test data (or use benchmarks/data/bench-10m.txt)
python3 benchmarks/1brc/generate-test-data.py 10000000 benchmarks/data/bench-10m.txt

# Build the PGO Sema binary, then run the native matrix against it
make build-pgo
SEMA_SKIP_BUILD=1 ./benchmarks/1brc/run-native-benchmarks.py benchmarks/data/bench-10m.txt

Implementation source: benchmarks/1brc/ (optimized) and benchmarks/1brc/simple/ (simple/idiomatic).