Feature · LLM · Structured Extraction
Unstructured in, typed data out.
Pass a schema, get back a typed Sema map. The schema is both the instruction to the model and the validator for the response — one artifact, two jobs. When the output doesn't match, the LLM fixes its own mistake.
Schema-as-contract · self-correcting re-ask · per-field validators · vision extraction
The pipeline
Text in. Typed map out. Self-correcting.
One call does what normally takes a prompt, a JSON parser, a validation layer, and a retry loop.
Schema as contract
One artifact. Two jobs.
The schema map you pass to llm/extract is sent to the model as JSON instructions and used to validate the response. No separate prompt to maintain, no separate validator to keep in sync.
- Bare shorthand. {:name :string :age :number}: fast to write; the type is sent to the model as a hint.
- Descriptor maps. {:amount {:type :number :validate #(> % 0)}}: full type checking, optional fields, custom predicates.
- Validated types. :string, :number, :boolean / :bool, and :list / :array are type-checked against the response. :optional skips required-field checks.
    (llm/extract
      {:vendor :string :amount :number :date :string}
      "I bought coffee for $4.50 at Blue Bottle on Jan 15, 2025")
    ;; => {:amount 4.5
    ;;     :date "2025-01-15"
    ;;     :vendor "Blue Bottle"}
Self-correcting
When the output is wrong, it tells the LLM.
If validation fails, the errors are sent back to the model so it can fix its own mistake — up to :retries times (default 2). Disable validation entirely with :validate #f when you trust the model.
- Validate-and-reask loop. The LLM sees what went wrong and regenerates. No manual re-prompting.
- Per-field messages. :message text is fed into the re-ask prompt as human-readable guidance for the model.
- Asynchronous by default. In an async context, the initial attempt offloads to the scheduler so sibling tasks overlap.
    (llm/extract
      {:age {:type :number
             :validate #(and (>= % 0) (<= % 150))
             :message "age must be between 0 and 150"}}
      "She is 30 years old.")
    ;; => {:age 30}
    ;; (model returns 30, passes validation)
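The :retries and :validate options mentioned above could be passed as in the sketch below. It assumes llm/extract accepts an options map as a third argument, the way llm/classify does elsewhere on this page; that position is an assumption, not something this page confirms. The retry count and the input strings are illustrative.

```scheme
;; Assumption: the options map is the third argument to llm/extract,
;; mirroring the llm/classify examples on this page.

;; Allow up to 4 re-asks instead of the default 2.
(llm/extract
  {:age {:type :number
         :validate #(and (>= % 0) (<= % 150))
         :message "age must be between 0 and 150"}}
  "She is 30 years old."
  {:retries 4})

;; Skip validation and accept the model's first answer as-is.
(llm/extract
  {:vendor :string :amount :number}
  "Coffee at Blue Bottle, $4.50"
  {:validate #f})
```

With validation off there is nothing to trigger a re-ask, so a response with a missing field or a wrong type is returned unchecked.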
Per-field validators
Predicates are just Sema functions.
:validate accepts any function — including short lambdas like #(> % 0). The :message becomes part of the re-ask prompt when validation fails.
:amount {:type :number :validate #(> % 0)}
:vendor {:type :string :validate #(> (string/length %) 0)}
:age {:type :number :validate #(and (>= % 0) (<= % 150)) :message "0–150"}
:nickname {:type :string :optional #t}
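Put together, the field descriptors above would sit inside a single schema map. The following is a sketch built only from forms shown on this page; the input sentence and the :message wording are made up for illustration.

```scheme
;; One schema combining a required string, a validated number,
;; and an optional field.
(llm/extract
  {:vendor   {:type :string :validate #(> (string/length %) 0)}
   :amount   {:type :number
              :validate #(> % 0)
              :message "amount must be greater than 0"}
   :nickname {:type :string :optional #t}}
  "Paid $12.80 to Corner Bakery for lunch.")
```

The input never mentions a nickname, and because :nickname is marked :optional, its absence should not count as a validation failure.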
Classification
Sort text into typed categories.
llm/classify sends the categories and the text, gets back one label. Pass keywords → get a keyword. Pass strings → get a string. Use a cheap fast model for the classification step.
- Typed output. (list :positive :negative :neutral) in → :positive out. No string matching.
- Cheap model option. {:model "claude-haiku-4-5"}: classification doesn't need a frontier model.
- Async-aware. Offloads to the scheduler in async context, just like llm/extract.
    (llm/classify (list :positive :negative :neutral)
                  "This product is amazing!")
    ;; => :positive

    (llm/classify (list :spam :ham)
                  "WINNER!!! Claim your prize"
                  {:model "claude-haiku-4-5-20251001"})
    ;; => :spam
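Both examples above pass keywords. The string case described earlier would look like the sketch below; the category names and the input text are hypothetical.

```scheme
;; String categories in, one of those strings out.
(llm/classify (list "billing" "technical" "sales")
              "My invoice shows the wrong amount."
              {:model "claude-haiku-4-5"})
```

String labels can be convenient when the categories come from configuration or user input rather than from code.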
Vision extraction
Receipts, invoices, screenshots.
llm/extract-from-image applies the same schema semantics to images. Pass a file path or a bytevector. Media type is auto-detected — PNG, JPEG, GIF, WebP, PDF. Works across Anthropic, OpenAI, Gemini, and Ollama.
- File path or bytevector. "receipt.png" or (file/read-bytes "invoice.jpg"): both work.
- Auto-detected media type. Magic bytes, no manual MIME configuration. Supports PNG, JPEG, GIF, WebP, PDF.
- Multi-modal chat. message/with-image for freeform image conversations with llm/chat.
    (llm/extract-from-image
      {:total :number :date :string}
      "receipt.png")
    ;; => {:total 42.50 :date "2026-06-23"}

    (define img (file/read-bytes "invoice.jpg"))
    (llm/extract-from-image
      {:invoice_number :string :date :string :total :string}
      img)
    ;; => {:date "2025-03-15"
    ;;     :invoice_number "12345"
    ;;     :total "$139.96"}
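Since llm/extract-from-image is described as applying the same schema semantics, descriptor maps with validators should carry over to images. The sketch below relies on that statement; it is an assumption here, not a confirmed example, and the :message text is illustrative.

```scheme
;; Assumption: descriptor maps behave for images as they do for text.
(llm/extract-from-image
  {:total {:type :number
           :validate #(> % 0)
           :message "total must be a positive number"}
   :date  {:type :string}}
  "receipt.png")
```

Typing :total as :number avoids currency-formatted strings like the "$139.96" in the invoice example above.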
The argument
What you'd write without it.
The same extraction in a typical Python setup: prompt engineering, JSON parsing, error handling, manual validation, re-prompting logic. Sema does all of that in one call.
    from pydantic import BaseModel
    from langchain_openai import ChatOpenAI
    from langchain_core.messages import HumanMessage

    llm = ChatOpenAI()  # reads OPENAI_API_KEY from the environment

    class Receipt(BaseModel):
        vendor: str
        amount: float
        date: str

    def extract_receipt(text):
        for attempt in range(3):
            resp = llm.invoke([
                HumanMessage(content=(
                    "Extract vendor, amount, date.\n"
                    f"Text: {text}\n"
                    "Return JSON only."))])
            try:
                return Receipt.model_validate_json(resp.content)
            except Exception as e:
                # Feed the validation error back into the next attempt.
                text += f"\nError: {e}. Fix."
        raise ValueError("failed")
    (llm/extract
      {:vendor :string :amount :number :date :string}
      "I bought coffee for $4.50 at Blue Bottle on Jan 15, 2025")
    ;; => {:amount 4.5
    ;;     :date "2025-01-15"
    ;;     :vendor "Blue Bottle"}
Extract your first field.
One call. No prompt engineering. No JSON parsing.