Feature · LLM · Structured Extraction

Unstructured in, typed data out.

Pass a schema, get back a typed Sema map. The schema is both the instruction to the model and the validator for the response — one artifact, two jobs. When the output doesn't match, the LLM fixes its own mistake.

$ sema -e '(llm/extract {:name :string :age :number} "Alice is 30")'

Schema-as-contract · self-correcting re-ask · per-field validators · vision extraction

The pipeline

Text in. Typed map out. Self-correcting.

One call does what normally takes a prompt, a JSON parser, a validation layer, and a retry loop.

input · unstructured text
  "I bought coffee for $4.50 at Blue Bottle on Jan 15, 2025"
schema · one artifact, two jobs
  :vendor :string
  :amount :number
  :date :string
llm · JSON mode + validate
  1. send schema as instructions
  2. parse JSON → Sema map
  3. validate → re-ask if wrong
output · typed Sema map
  :vendor "Blue Bottle"
  :amount 4.5
  :date "2025-01-15"

Schema as contract

One artifact. Two jobs.

The schema map you pass to llm/extract is sent to the model as JSON instructions and used to validate the response. No separate prompt to maintain, no separate validator to keep in sync.

  • Bare shorthand. {:name :string :age :number} — fast to write, type sent to the model as a hint.
  • Descriptor maps. {:amount {:type :number :validate #(> % 0)}} — full type checking, optional fields, custom predicates.
  • Validated types. :string, :number, :boolean/:bool, :list/:array — type-checked against the response. :optional skips required-field checks.
receipt.sema · basic extraction
(llm/extract
  {:vendor :string
   :amount :number
   :date   :string}
  "I bought coffee for $4.50
   at Blue Bottle on Jan 15, 2025")

;; => {:amount 4.5
;;     :date "2025-01-15"
;;     :vendor "Blue Bottle"}
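
To make "one artifact, two jobs" concrete, here is a minimal Python sketch of how a single schema could drive both the instructions and the type check. It illustrates the idea only and is not Sema's implementation: `render_instructions`, `check_types`, and the `TYPES` table are hypothetical names, and the schema is modeled as a plain dict of field name to type name.

```python
import json

# Hypothetical mapping from schema type names to Python types.
TYPES = {"string": str, "number": (int, float), "boolean": bool, "list": list}

def render_instructions(schema):
    """Job 1: turn the schema into instructions for the model."""
    return ("Return only a JSON object with these fields: "
            + json.dumps(schema, sort_keys=True))

def check_types(schema, data):
    """Job 2: validate a parsed response against the same schema."""
    errors = []
    for field, type_name in schema.items():
        if field not in data:
            errors.append(f"{field}: missing")
        elif isinstance(data[field], bool) and type_name != "boolean":
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"{field}: expected {type_name}")
        elif not isinstance(data[field], TYPES[type_name]):
            errors.append(f"{field}: expected {type_name}")
    return errors

schema = {"vendor": "string", "amount": "number", "date": "string"}
prompt = render_instructions(schema)
good = check_types(schema, json.loads(
    '{"vendor": "Blue Bottle", "amount": 4.5, "date": "2025-01-15"}'))
bad = check_types(schema, json.loads(
    '{"vendor": "Blue Bottle", "amount": "4.50"}'))
```

Because both functions read the same dict, adding or renaming a field changes the instructions and the validation together, which is the drift the schema-as-contract approach is meant to avoid.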

Self-correcting

When the output is wrong, it tells the LLM.

If validation fails, the errors are sent back to the model so it can fix its own mistake — up to :retries times (default 2). Disable validation entirely with :validate #f when you trust the model.

  • Validate-and-reask loop. The LLM sees what went wrong and regenerates. No manual re-prompting.
  • Per-field messages. :message text is fed into the re-ask prompt — human-readable guidance for the model.
  • Async-aware. In an async context, the initial attempt offloads to the scheduler so sibling tasks overlap.
validate.sema · custom predicate + message
(llm/extract
  {:age {:type :number
          :validate #(and (>= % 0)
                              (<= % 150))
          :message "age must be
                      between 0 and 150"}}
  "She is 30 years old.")

;; => {:age 30}
;; (model returns 30, passes validation)
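
The validate-and-reask loop can be sketched in Python as follows. This is a rough model of the behavior described above, not Sema's actual code: `extract_with_reask`, `fake_model`, and `validate_age` are hypothetical, the model is a stub that answers wrongly on its first call, and `retries=2` is read here as "up to two re-asks after the initial attempt".

```python
import json

def extract_with_reask(ask_model, validate, text, retries=2):
    """Call the model, validate, and feed errors back up to `retries` times."""
    prompt = text
    for attempt in range(retries + 1):  # one initial attempt plus re-asks
        data = json.loads(ask_model(prompt))
        errors = validate(data)
        if not errors:
            return data
        # The re-ask prompt carries the validation errors back to the model.
        prompt = (f"{text}\nYour previous answer was invalid: "
                  f"{'; '.join(errors)}. Try again.")
    raise ValueError(f"still invalid after {retries} re-asks: {errors}")

# Stub model: wrong on the first call, corrected once it sees the error.
calls = []
def fake_model(prompt):
    calls.append(prompt)
    return '{"age": 30}' if "invalid" in prompt else '{"age": -5}'

def validate_age(data):
    age = data.get("age")
    ok = isinstance(age, (int, float)) and 0 <= age <= 150
    return [] if ok else ["age must be between 0 and 150"]

result = extract_with_reask(fake_model, validate_age, "She is 30 years old.")
```

In this run the stub is called twice: the first answer fails the range check, the error text is appended to the prompt, and the second answer passes.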

Per-field validators

Predicates are just Sema functions.

:validate accepts any function — including short lambdas like #(> % 0). The :message becomes part of the re-ask prompt when validation fails.

positive amount
  :amount {:type :number
            :validate #(> % 0)}
non-empty string
  :vendor {:type :string
            :validate
              #(> (string/length %) 0)}
range check + message
  :age {:type :number
        :validate
          #(and (>= % 0) (<= % 150))
        :message "0–150"}
optional field
  :nickname {:type :string
              :optional #t}
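
As a rough Python analogue of the descriptor maps above, the sketch below runs a predicate per field, skips missing optional fields, and collects the messages that would go into a re-ask. The function `validate_fields` and the dict layout are assumptions made for illustration; Sema's own descriptors are the keyword maps shown above.

```python
def validate_fields(schema, data):
    """Return one message per failing field; an empty list means success."""
    errors = []
    for field, spec in schema.items():
        if field not in data:
            # Missing fields are only an error when they are required.
            if not spec.get("optional", False):
                errors.append(f"{field}: required field is missing")
            continue
        predicate = spec.get("validate")
        if predicate is not None and not predicate(data[field]):
            # Prefer the author's message; fall back to a generic one.
            errors.append(f"{field}: {spec.get('message', 'failed validation')}")
    return errors

schema = {
    "amount": {"type": "number", "validate": lambda x: x > 0},
    "age": {"type": "number",
            "validate": lambda x: 0 <= x <= 150,
            "message": "must be between 0 and 150"},
    "nickname": {"type": "string", "optional": True},
}

ok = validate_fields(schema, {"amount": 4.5, "age": 30})
failed = validate_fields(schema, {"amount": -1, "age": 200, "nickname": "Al"})
```

The first call passes even though `nickname` is absent, because that field is optional. The second call yields one message per failing predicate.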

Classification

Sort text into typed categories.

llm/classify sends the categories and the text, gets back one label. Pass keywords → get a keyword. Pass strings → get a string. Use a cheap fast model for the classification step.

  • Typed output. (list :positive :negative :neutral) in → :positive out. No string matching.
  • Cheap model option. {:model "claude-haiku-4-5"} — classification doesn't need a frontier model.
  • Async-aware. Offloads to the scheduler in async context, just like llm/extract.
classify.sema · sentiment + spam
(llm/classify
  (list :positive :negative :neutral)
  "This product is amazing!")
;; => :positive

(llm/classify
  (list :spam :ham)
  "WINNER!!! Claim your prize"
  {:model "claude-haiku-4-5-20251001"})
;; => :spam
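
The "typed in, typed out" behavior can be approximated in Python by mapping the model's reply back to the original category object. This is only a sketch under assumptions: `classify`, `label_of`, and the stub model are hypothetical, an `Enum` stands in for Sema keywords, and labels are assumed to be lowercase.

```python
from enum import Enum

class Sentiment(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"

def label_of(category):
    """Text label shown to the model for a category."""
    return category.value if isinstance(category, Enum) else str(category)

def classify(categories, text, ask_model):
    """Return the member of `categories` the model picked, keeping its type."""
    by_label = {label_of(c): c for c in categories}
    prompt = (f"Classify the text as exactly one of: {', '.join(by_label)}.\n"
              f"Text: {text}\nAnswer with the label only.")
    reply = ask_model(prompt).strip().lower()
    if reply not in by_label:
        raise ValueError(f"unknown label from model: {reply!r}")
    return by_label[reply]

# Stub model that always answers "positive", with stray whitespace and case.
stub = lambda prompt: " Positive\n"
typed = classify(list(Sentiment), "This product is amazing!", stub)
plain = classify(["positive", "negative", "neutral"], "This product is amazing!", stub)
```

Enum members in give an enum member out, and strings in give a string out, so callers never compare against raw model text.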

Vision extraction

Receipts, invoices, screenshots.

llm/extract-from-image applies the same schema semantics to images. Pass a file path or a bytevector. Media type is auto-detected — PNG, JPEG, GIF, WebP, PDF. Works across Anthropic, OpenAI, Gemini, and Ollama.

  • File path or bytevector. "receipt.png" or (file/read-bytes "invoice.jpg") — both work.
  • Auto-detected media type. Magic bytes, no manual MIME configuration. Supports PNG, JPEG, GIF, WebP, PDF.
  • Multi-modal chat. message/with-image for freeform image conversations with llm/chat.
vision.sema · image → typed data
(llm/extract-from-image
  {:total :number
   :date  :string}
  "receipt.png")

;; => {:total 42.50 :date "2026-06-23"}

(define img (file/read-bytes "invoice.jpg"))
(llm/extract-from-image
  {:invoice_number :string
   :date  :string
   :total :string}
  img)
;; => {:date "2025-03-15"
;;     :invoice_number "12345"
;;     :total "$139.96"}
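
Magic-byte detection for the listed formats can be sketched in a few lines of Python. The signatures below are the standard file headers for PNG, JPEG, GIF, WebP, and PDF. The function name `detect_media_type` is hypothetical, and how Sema performs the check internally is not shown in this document.

```python
def detect_media_type(data: bytes) -> str:
    """Guess a media type from the leading bytes of a file."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    raise ValueError("unrecognized media type")

kind = detect_media_type(b"%PDF-1.7\n")
```

Checking the bytes rather than the file extension means a misnamed file is still sent with the right media type.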

The argument

What you'd write without it.

The same extraction in a typical Python setup: prompt engineering, JSON parsing, error handling, manual validation, re-prompting logic. Sema does all of that in one call.

extract.py · Pydantic + LangChain
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

class Receipt(BaseModel):
    vendor: str
    amount: float
    date: str

llm = ChatOpenAI(model="gpt-4o-mini")  # model name is illustrative

def extract_receipt(text):
    for attempt in range(3):
        resp = llm.invoke([
            HumanMessage(content=(
                "Extract vendor, amount, date. "
                f"Text: {text} "
                "Return JSON only."))])
        try:
            return Receipt.model_validate_json(
                resp.content)
        except Exception as e:
            # Feed the error back into the next prompt.
            text += f" Error: {e}. Fix."
    raise ValueError("failed")
About 20 lines. Manual prompt. Manual JSON parsing. Manual retry. Manual error feedback. The schema and the prompt can drift.
extract.sema · one call
(llm/extract
  {:vendor :string
   :amount :number
   :date   :string}
  "I bought coffee for $4.50
   at Blue Bottle on Jan 15, 2025")

;; => {:amount 4.5
;;     :date "2025-01-15"
;;     :vendor "Blue Bottle"}
4 lines. Schema is the prompt and the validator. Re-ask is automatic. Typed map out of the box.

Extract your first field.

One call. No prompt engineering. No JSON parsing.

run · $ sema -e '(llm/extract {:name :string :age :number} "John is 42")'
install · $ cargo install sema-lang