Feature · LLM · RAG & Vector Store
Retrieval is a language feature.
Four primitives — llm/embed, vector-store/*, llm/rerank, llm/complete — compose into the full RAG pipeline. No framework. No vector database to stand up. No orchestration library. Just functions that fit in one screen.
embed → retrieve → rerank → answer · no external DB · disk-persisted
The pipeline
Embed. Retrieve. Rerank. Answer.
Bi-encoder search for high recall, cross-encoder reranking for precision. The industry-standard pattern, as four function calls.
Embed
Vectors as bytevectors.
llm/embed takes a string or a list (batch). Returns packed f64 bytevectors with a fast path for similarity computations — no per-element unboxing overhead. It's a first-class value: (map llm/embed ...) just works.
- Batch embeddings. (llm/embed ["a" "b" "c"]) returns a list of vectors in one network call.
- Seven providers. Jina, Voyage, Cohere, Nomic, Together AI, Fireworks AI, OpenAI. Auto-configured from env vars. Separate from your chat provider.
- Async-aware. Inside (async/...), embeddings offload to the scheduler so sibling tasks overlap.
(define texts
  (list "Rust is a systems language"
        "Python is great for ML"
        "Lisp is homoiconic"))

;; Batch: one network call
(define vecs (llm/embed texts))

;; Similarity between two vectors
(llm/similarity (first vecs) (last vecs))
;; => 0.72
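To make the packed representation concrete, here is a rough Python sketch of the idea: floats stored as one contiguous f64 byte buffer and compared with cosine similarity. Python is used only for illustration. The helper names (pack_f64, unpack_f64, cosine) and the little-endian layout are assumptions, not Sema's internals, and the page does not state which metric llm/similarity computes; cosine is assumed here.

```python
import math
import struct

def pack_f64(values):
    """Pack a sequence of floats into one contiguous little-endian f64 buffer."""
    return struct.pack(f"<{len(values)}d", *values)

def unpack_f64(buf):
    """Read a packed buffer back as a tuple of floats (8 bytes per element)."""
    return struct.unpack(f"<{len(buf) // 8}d", buf)

def cosine(a, b):
    """Cosine similarity between two packed f64 buffers of equal dimension."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    xs, ys = unpack_f64(a), unpack_f64(b)
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    return dot / norm if norm else 0.0

v1 = pack_f64([1.0, 0.0, 1.0])
v2 = pack_f64([1.0, 1.0, 0.0])
print(len(v1))         # 24 bytes: three f64 values
print(cosine(v1, v2))  # close to 0.5
```

Keeping the vector as a single byte buffer means a similarity loop reads raw doubles instead of chasing one boxed number per element, which is the overhead the packed format avoids.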
Store & retrieve
No database. Just a file.
An in-memory vector store with disk persistence. Index once, query forever. vector-store/open loads from disk automatically. The JSON format is portable across platforms — base64-encoded embeddings, full metadata, diffable in git.
- Index once. vector-store/save writes to disk. Next run loads instantly, with no re-embedding.
- Metadata on every doc. Store source paths, page numbers, timestamps alongside the vector.
- Dimension-mismatch safety. Mixing embedding models in one store raises a clear error at search time.
(vector-store/open "docs" "my-docs.json")

;; Each text doubles as its own document id
(for-each
  (lambda (text)
    (vector-store/add "docs" text (llm/embed text) {:text text}))
  texts)

(vector-store/save "docs")

(define hits
  (vector-store/search "docs" (llm/embed "Which is homoiconic?") 5))
;; => ({:id "Lisp is homoiconic"
;;      :score 0.94
;;      :metadata {:text "Lisp is homoiconic"}}
;;     ...)
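For readers curious what a file-backed store of this kind involves, the following Python sketch shows one way to round-trip embeddings through JSON as base64 text and to reject a query whose dimension differs from the stored vectors. The schema (id, embedding, metadata keys), the function names, and the brute-force cosine search are all illustrative assumptions; Sema's actual file layout and search implementation are not documented on this page.

```python
import base64
import json
import math
import struct

def encode_vec(values):
    """Base64-encode a vector as packed little-endian f64 bytes."""
    return base64.b64encode(struct.pack(f"<{len(values)}d", *values)).decode("ascii")

def decode_vec(text):
    """Inverse of encode_vec."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

def dump_store(docs):
    """Serialize {id: (vector, metadata)} to a JSON string."""
    return json.dumps(
        [{"id": i, "embedding": encode_vec(v), "metadata": m}
         for i, (v, m) in docs.items()],
        indent=2, sort_keys=True)

def load_store(text):
    """Rebuild the {id: (vector, metadata)} mapping from JSON."""
    return {d["id"]: (decode_vec(d["embedding"]), d["metadata"])
            for d in json.loads(text)}

def search(docs, query, k):
    """Brute-force cosine search; refuses vectors of a different dimension."""
    hits = []
    for doc_id, (vec, meta) in docs.items():
        if len(vec) != len(query):
            raise ValueError(
                f"dimension mismatch: query has {len(query)}, "
                f"{doc_id!r} has {len(vec)}")
        dot = sum(a * b for a, b in zip(vec, query))
        norm = math.hypot(*vec) * math.hypot(*query)
        hits.append({"id": doc_id, "score": dot / norm if norm else 0.0,
                     "metadata": meta})
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:k]

store = {
    "a": ([1.0, 0.0], {"text": "first"}),
    "b": ([0.0, 1.0], {"text": "second"}),
}
restored = load_store(dump_store(store))
print(restored == store)                         # the round trip is lossless
print(search(restored, [0.9, 0.1], 1)[0]["id"])  # id of the nearest document
```

Encoding the raw f64 bytes rather than printing decimal numbers keeps the round trip exact, and one JSON object per document keeps diffs readable when the file is checked into git.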
Rerank
Retrieve many. Rerank to a few.
Bi-encoders embed query and document independently — fast, but coarse. Cross-encoders read them together — slow, but precise. Sema's llm/rerank calls a hosted cross-encoder (Cohere, Jina, Voyage, Nomic, Together AI, or Fireworks AI) to reorder your candidates. The :index field maps back to the original list.
(define candidates
  (vector-store/search "docs" (llm/embed question) 12))

(define reranked
  (llm/rerank question
    (map (lambda (c) (:text (:metadata c))) candidates)
    {:top-k 4 :provider :cohere}))
;; => ({:index 0 :score 0.467
;;      :document "file-read-lines"}
;;     {:index 1 :score 0.304
;;      :document "read-line"} ...)

;; Override per call:
(llm/rerank q docs {:top-k 5 :provider :voyage :model "rerank-2.5"})
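The role of the :index field is easiest to see in a toy version. The Python sketch below reranks a candidate list and then uses each result's index to recover the original candidate, including fields the reranker never saw. The word-overlap scorer is a deliberately crude stand-in for a hosted cross-encoder, and every name in it (toy_score, rerank, the sample candidates) is hypothetical.

```python
def toy_score(query, document):
    """Stand-in relevance score: fraction of query words found in the document.
    A real cross-encoder would score the (query, document) pair with a model."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rerank(query, documents, top_k):
    """Score every document against the query and keep the best top_k.
    Each result records the document's position in the input list."""
    scored = [{"index": i, "score": toy_score(query, doc), "document": doc}
              for i, doc in enumerate(documents)]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]

candidates = [
    {"id": "string-split", "metadata": {"text": "split a string on a separator"}},
    {"id": "file-read-lines", "metadata": {"text": "read a file into a list of lines"}},
    {"id": "http-get", "metadata": {"text": "fetch a url"}},
]
question = "read a file into lines"
reranked = rerank(question, [c["metadata"]["text"] for c in candidates], top_k=2)

# Map each result back to its original candidate through "index".
best = [candidates[r["index"]]["id"] for r in reranked]
print(best)  # ['file-read-lines', 'string-split']
```

Because the reranker only receives plain strings, the index is the link back to ids, scores, and metadata held by the retrieval step.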
The whole pipeline
Four functions. One screen.
Index a directory of docs, embed the query, retrieve candidates, rerank for precision, build context, generate a grounded answer. This is the code from examples/llm/rag-docs-search.sema — run it with sema and it indexes Sema's own documentation.
;; 1. Index (run once, cached to disk)
(vector-store/open "docs" "/tmp/sema-docs.vec")
(when (= (vector-store/count "docs") 0)
  (let* ((files (file/glob "crates/sema-docs/entries/stdlib/**/*.md"))
         (docs  (map (lambda (p)
                       {:name (path/stem p)
                        :path p
                        :text (string/take (file/read p) 900)})
                     files))
         (vecs  (flat-map llm/embed (list/chunk 64 (map :text docs)))))
    (map (lambda (doc vec) (vector-store/add "docs" (:name doc) vec doc))
         docs vecs)
    (vector-store/save "docs")))

;; 2. Retrieve
(define question "How do I read a file and split it into lines?")
(define hits (vector-store/search "docs" (llm/embed question) 12))

;; 3. Rerank
(define reranked
  (llm/rerank question
    (map (lambda (c) (:text (:metadata c))) hits)
    {:top-k 4}))

;; 4. Answer
;; :index points back into hits; the text lives under :metadata
(define context
  (string/join
    (map (lambda (r) (:text (:metadata (nth hits (:index r))))) reranked)
    "\n\n---\n\n"))

(println
  (llm/complete
    (prompt
      (system "Answer using only the context.")
      (user (format "Context:\n~a\n\nQ: ~a" context question)))
    {:max-tokens 400}))
;; => "Use file/read-lines to read all lines,
;;     then string/split or map over the result."
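The indexing step above embeds documents in chunks of 64 and flattens the results into one list of vectors. As a language-neutral illustration of that batching pattern, here is a small Python sketch. The embedder is a fake that records batch sizes instead of calling a provider, and the names chunk, embed_all, and fake_embed are hypothetical.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_all(texts, embed_batch, batch_size=64):
    """Embed texts batch by batch and flatten the results, preserving order."""
    vectors = []
    for batch in chunk(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

calls = []

def fake_embed(batch):
    """Hypothetical embedder: records the batch size, returns one vector per text."""
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]

texts = [f"doc {i}" for i in range(150)]
vectors = embed_all(texts, fake_embed)
print(calls)         # [64, 64, 22]
print(len(vectors))  # 150
```

Batching bounds the size of each request while still returning exactly one vector per document, in the same order as the input, so documents and vectors can be paired up positionally afterwards.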
No infrastructure
No Pinecone. No pgvector. No Chroma.
The vector store runs in-process with disk persistence. No connection strings, no Docker Compose, no infrastructure to maintain. Seven embedding providers and six reranker providers, all auto-configured from environment variables.
- Embedding providers. Jina, Voyage, Cohere, Nomic, Together AI, Fireworks AI, OpenAI, or any OpenAI-compatible endpoint via :base-url.
- Reranker providers. Cohere, Jina, Voyage, Nomic, Together AI, Fireworks AI. Same API key, per-call override.
- Separate from chat. llm/configure-embeddings lets you use Voyage for embeddings and Anthropic for chat.
The argument
What you'd assemble without it.
A typical Python RAG stack: LangChain for orchestration, Chroma or Pinecone for vectors, sentence-transformers for embeddings, a separate Cohere call for reranking, and prompt templates to glue it together. Sema replaces all of it with four function calls.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from cohere import Client as Cohere

embeddings = HuggingFaceEmbeddings()
splitter = RecursiveCharacterTextSplitter()
# docs: the raw document text, loaded elsewhere
vectorstore = Chroma.from_texts(
    splitter.split_text(docs), embeddings)
cohere = Cohere(api_key=...)
llm = ChatOpenAI()

def rag(question):
    docs = vectorstore.similarity_search(question, k=12)
    # Cohere reranks plain strings and returns indices into that list
    reranked = cohere.rerank(
        model="rerank-english-v3.0",
        query=question,
        documents=[d.page_content for d in docs],
        top_n=4)
    context = "\n\n".join(
        docs[r.index].page_content for r in reranked.results)
    return llm.invoke(
        f"Context: {context}\nQ: {question}").content
(define hits (vector-store/search "docs" (llm/embed question) 12))

(define reranked
  (llm/rerank question
    (map (lambda (c) (:text (:metadata c))) hits)
    {:top-k 4}))

;; :index points back into hits; the text lives under :metadata
(define context
  (string/join
    (map (lambda (r) (:text (:metadata (nth hits (:index r))))) reranked)
    "\n\n---\n\n"))

(llm/complete
  (prompt
    (system "Answer using context.")
    (user (format "Context:\n~a\n\nQ: ~a" context question)))
  {:max-tokens 400})
Search your first document.
Run the example. It indexes Sema's own docs and answers questions.