Vector Search

Local, dependency-free semantic search over a text field — a deterministic offline embedder plus exact flat cosine kNN, hybrid filters, and persisted embeddings with staleness tracking.

OMGDB ships a local semantic search layer that turns a string field into a vector and ranks documents by cosine similarity to a query — with no external service, network call, or API key. Text is embedded by a pluggable Embedder; the bundled HashingEmbedder is a deterministic, offline bag-of-words embedder. Ranking is a flat (exact) cosine kNN over the collection.

This makes semantic search reproducible and self-contained: the same input always yields the same vector, so results are stable across runs and machines. It is the foundation that context packs build on for token-budgeted, cited retrieval.

Limitation: The only embedder in the source today is HashingEmbedder, a baseline (the hashing trick plus L2 normalization) and explicitly not a neural model. A real model, e.g. all-MiniLM-L6-v2 via ONNX, is planned as a drop-in replacement implementing the same Embedder trait, but is not yet implemented. Likewise, search is flat/exact with a full collection scan; there is no approximate-nearest-neighbor (ANN) index such as HNSW, and no vector quantization.

How it works

Text becomes a fixed-length f32 vector through three pieces:

Embedder — a trait abstracting text-to-vector: dim() (vector length), embed(text) (the vector), and model_id() (a stable name+version+shape identifier).
HashingEmbedder — the bundled implementation. It tokenizes, hashes each token into a bucket, and L2-normalizes.
cosine(a, b) — cosine similarity in [-1, 1], returning 0.0 if the lengths differ or either vector is all-zero.
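
As a rough illustration of these pieces, the sketch below shows one way the trait and the similarity function could look. The method names come from the list above; the exact signatures (`usize`, `Vec<f32>`, `String`) and the name `cosine_sketch` are assumptions for illustration, not the crate's actual definitions.

```rust
/// Hypothetical shape of the text-to-vector abstraction.
/// The real trait's signatures may differ.
pub trait Embedder {
    /// Length of every vector this embedder produces.
    fn dim(&self) -> usize;
    /// Embed `text` into a vector of length `dim()`.
    fn embed(&self, text: &str) -> Vec<f32>;
    /// Stable identifier covering name, version, and shape.
    fn model_id(&self) -> String;
}

/// Cosine similarity in [-1, 1]. Returns 0.0 when the lengths differ
/// or either vector is all-zero, as described in the text.
pub fn cosine_sketch(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    // Clamp to absorb floating-point drift just outside [-1, 1].
    (dot / (norm_a * norm_b)).clamp(-1.0, 1.0)
}
```

Returning 0.0 for mismatched or zero vectors keeps ranking total: such a pair simply scores as unrelated instead of producing NaN or an error.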

The HashingEmbedder

HashingEmbedder is deterministic and dependency-free. For each input it:

Splits text on any non-alphanumeric character and drops empty tokens.
Lowercases each token.
Hashes it with 64-bit FNV-1a, takes the result modulo dim to pick a bucket, and increments that bucket’s count.
L2-normalizes the resulting vector.
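
The four steps above can be sketched in a few lines. This is a minimal illustration, not the bundled implementation: the function names are hypothetical, and it assumes tokens are hashed as UTF-8 bytes, that "alphanumeric" means Unicode alphanumeric, and that an input with no tokens yields an all-zero vector.

```rust
/// 64-bit FNV-1a over raw bytes (standard offset basis and prime).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Sketch of the steps: split, lowercase, hash into a bucket, normalize.
fn hashing_embed_sketch(text: &str, dim: usize) -> Vec<f32> {
    let dim = dim.max(1);
    let mut v = vec![0.0f32; dim];
    let tokens = text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty());
    for tok in tokens {
        let tok = tok.to_lowercase();
        // Assumption: the token is hashed as its UTF-8 bytes.
        let bucket = (fnv1a64(tok.as_bytes()) % dim as u64) as usize;
        v[bucket] += 1.0;
    }
    // L2-normalize, leaving an all-zero vector untouched.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```

Under these assumptions, `hashing_embed_sketch("Database SEARCH", 256)` and `hashing_embed_sketch("database, search!", 256)` produce the same vector, because case and punctuation are discarded before hashing.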

The default dimensionality is 256 (HashingEmbedder::default()); HashingEmbedder::new(dim) clamps dim to at least 1. The model_id is the string hashing-v1/dim={dim}, so two embedders with different dimensions are treated as different models for provenance purposes. All CLI commands construct HashingEmbedder::default() (dim 256).
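
The clamping and identifier rules can be illustrated with a small stand-in type. `HashingConfigSketch` is a hypothetical name used only here; the behavior it models (minimum dimension of 1, default of 256, and the `hashing-v1/dim={dim}` string) is taken from the paragraph above.

```rust
/// Hypothetical stand-in for the embedder's configuration.
#[derive(Debug, Clone, PartialEq)]
struct HashingConfigSketch {
    dim: usize,
}

impl HashingConfigSketch {
    /// Clamp the requested dimension to at least 1.
    fn new(dim: usize) -> Self {
        Self { dim: dim.max(1) }
    }

    /// Identifier that changes whenever the dimension changes.
    fn model_id(&self) -> String {
        format!("hashing-v1/dim={}", self.dim)
    }
}

impl Default for HashingConfigSketch {
    /// The documented default dimensionality.
    fn default() -> Self {
        Self::new(256)
    }
}
```

Folding the dimension into the identifier means a vector produced at one width is never mistaken for a vector from the same algorithm at another width.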

Because it is a pure bag-of-words hash, the embedder captures lexical overlap, not deep semantics. A document that shares more tokens with the query generally ranks higher, though distinct tokens can collide in the same bucket and add noise, more so at small dim. The version suffix in model_id (v1) is bumped whenever a change to tokenization or hashing alters the produced vectors.

Searching

omgdb vsearch ranks documents in a collection by cosine similarity of a string field to a query and prints up to --k hits, best-first. Each hit is one line: the score formatted to four decimal places, a tab, then the matching document as canonical JSON.
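
For anyone parsing this output, the line format can be reproduced as follows. `format_hit` is a hypothetical helper, and `doc_json` stands in for the document's canonical JSON, which this sketch does not generate.

```rust
/// Format one hit as `<score>\t<document>` with the score at four decimals.
/// `doc_json` is assumed to already be canonical JSON.
fn format_hit(score: f32, doc_json: &str) -> String {
    format!("{:.4}\t{}", score, doc_json)
}
```

Splitting each line on the first tab recovers the score and the document without needing to parse the JSON first.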

Synopsis

omgdb vsearch <path> <collection> <field> <query> [--k N] [--filter JSON]

Flag	Description	Default
`--k`	Number of results to return.	`5`
`--filter`	MongoDB-style filter (JSON) to pre-filter candidates — enables hybrid search.	none

Example

omgdb create app.omgdb
omgdb insert app.omgdb docs '{"text":"embedded vector database search engine"}'
omgdb insert app.omgdb docs '{"text":"a recipe for chocolate cake"}'

# Rank docs in `docs` by similarity of their `text` field to the query (top 5).
omgdb vsearch app.omgdb docs text "database search" --k 5

Each output line is <score>\t<document>, best-first. The database-related document scores higher than the recipe because it shares more tokens with the query.

Note: Only string fields are embedded. Documents that lack a string value at the given field are silently skipped by search. A document missing _id reports its id as null.
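
The ranking itself can be pictured as a flat scan over candidate vectors. The sketch below is illustrative only: `knn_sketch` is a hypothetical function, ids are simplified to `u32`, and `None` models a document with no string value at the field. It also assumes every vector has the same length and is already L2-normalized, in which case the dot product equals the cosine similarity.

```rust
/// Flat (exact) kNN sketch: score every candidate, sort best-first, keep k.
/// Assumption: all vectors share one length and are L2-normalized,
/// so the dot product equals the cosine similarity.
fn knn_sketch(
    query: &[f32],
    candidates: &[(u32, Option<Vec<f32>>)],
    k: usize,
) -> Vec<(u32, f32)> {
    let mut hits: Vec<(u32, f32)> = candidates
        .iter()
        .filter_map(|(id, v)| {
            // Documents without an embedding are skipped silently.
            let v = v.as_ref()?;
            let score: f32 = query.iter().zip(v).map(|(a, b)| a * b).sum();
            Some((*id, score))
        })
        .collect();
    // Highest score first, then keep at most k results.
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits.truncate(k);
    hits
}
```

Every candidate is scored on every call, which is what makes the search exact and also what makes its cost grow linearly with the collection.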

Hybrid search

Supplying --filter performs hybrid search: a structured MongoDB-style pre-filter is applied first, then the surviving documents are ranked semantically — both in a single scan pass. Only documents that match the filter are eligible to rank.

# Pre-filter to published docs, then rank those by relevance.
omgdb vsearch app.omgdb docs text "database search" --filter '{"status":"published"}'

Even if a draft document is just as relevant as a published one, the filter excludes it entirely, so it never appears in the results. The filter uses the same syntax as the rest of OMGDB; see query operators.
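
A single-pass filter-then-rank loop might look like the sketch below. Everything here is hypothetical scaffolding: `Doc` is a toy document type, `matches` stands in for a compiled filter, and `score` stands in for the embed-and-cosine step.

```rust
/// Toy document type used only for this illustration.
struct Doc {
    id: u32,
    status: &'static str,
    text: Option<&'static str>,
}

/// Hybrid sketch: in one pass, drop documents that fail the structured
/// predicate, skip those without text, and score the rest.
fn rank_where_sketch<P, S>(docs: &[Doc], matches: P, score: S, k: usize) -> Vec<(u32, f32)>
where
    P: Fn(&Doc) -> bool,
    S: Fn(&str) -> f32,
{
    let mut hits = Vec::new();
    for d in docs {
        if !matches(d) {
            continue; // filtered out before any scoring work is done
        }
        if let Some(t) = d.text {
            hits.push((d.id, score(t)));
        }
    }
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits.truncate(k);
    hits
}
```

Checking the predicate before scoring means documents excluded by the filter cost no embedding work in this sketch.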

Persisting embeddings: `vsync`

Search recomputes embeddings on the fly and does not read any persisted index (see the caveat below). The vsync/vstale pair is a separate facility for materializing embeddings into the store with auditable provenance and answering “which embeddings need re-syncing?”.

omgdb vsync embeds the string field of every document in a collection and persists each vector — together with a provenance envelope — into the sibling collection <collection>.__vectors. That target is an ordinary, op-log-backed, inspectable collection: it shows up in inspect, survives a store reopen, and passes the integrity check.

Synopsis

omgdb vsync <path> <collection> <field>

vsync is idempotent. Re-running it upserts by _id (replacing an existing record, or inserting a new one) and refreshes any document whose text has changed. Documents lacking a string value at the field are skipped. It prints the count written:

omgdb vsync app.omgdb docs text
# stdout: synced 1 embedding(s) into `docs.__vectors`

Each persisted record has the shape {_id, provenance, vector}, where provenance is a document with keys model, dim, contentHash, and sourceField, and vector is an array of doubles.
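
The upsert-by-`_id` behavior can be modeled with an ordinary map. This is a simplified sketch with hypothetical names: ids are `u32`, the target collection is a `BTreeMap`, provenance is omitted, and `embed` stands in for the embedder. It also assumes the returned count is the number of records written in the pass, which the text does not spell out.

```rust
use std::collections::BTreeMap;

/// Name of the sibling collection that holds persisted vectors.
fn vector_ns_sketch(ns: &str) -> String {
    format!("{}.__vectors", ns)
}

/// Sketch of an idempotent sync: upsert one record per document that has
/// a string value, keyed by id. Documents without text are skipped.
/// Assumption: the count is the number of records written in this pass.
fn sync_sketch<E>(
    docs: &[(u32, Option<&str>)],
    target: &mut BTreeMap<u32, Vec<f32>>,
    embed: E,
) -> usize
where
    E: Fn(&str) -> Vec<f32>,
{
    let mut written = 0;
    for (id, text) in docs {
        if let Some(t) = *text {
            // `insert` replaces an existing record or adds a new one.
            target.insert(*id, embed(t));
            written += 1;
        }
    }
    written
}
```

Because each record is keyed by id and the embedder is deterministic, running the sync twice over unchanged documents leaves the target in the same state.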

The provenance envelope

Every persisted embedding stores enough to trace it to its producer and detect staleness:

Field	Description
`model`	The `model_id` of the embedder that produced the vector (e.g. `hashing-v1/dim=256`).
`dim`	The embedding dimensionality.
`contentHash`	A 16-hex-digit FNV-1a hash of the exact embedded text.
`sourceField`	The document field the text was taken from.

This makes AI-derived state auditable rather than opaque: a stored vector can always be traced back to the model and the exact text it came from.

Detecting stale embeddings: `vstale`

omgdb vstale reports the _ids in a collection whose persisted embedding is stale relative to its source. An embedding is stale when:

No persisted vector exists for the document, or
The embedder’s model_id differs from the recorded model (different model or configuration), or
The dimensionality differs, or
The source text has changed (the content hash no longer matches).
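
These four conditions translate into a short predicate. The sketch below is an assumption-laden illustration: the struct, field types, and function names are hypothetical, the content hash is assumed to be FNV-1a over the text's UTF-8 bytes, and the 16 hex digits are assumed to be lowercase and zero-padded.

```rust
/// 64-bit FNV-1a over raw bytes (standard offset basis and prime).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Assumed encoding of `contentHash`: 16 lowercase hex digits.
fn content_hash_sketch(text: &str) -> String {
    format!("{:016x}", fnv1a64(text.as_bytes()))
}

/// Provenance fields named in the text; the Rust types are assumptions.
#[allow(dead_code)]
struct Provenance {
    model: String,
    dim: usize,
    content_hash: String,
    source_field: String,
}

/// True when any of the four staleness conditions holds.
fn is_stale_sketch(
    persisted: Option<&Provenance>,
    model_id: &str,
    dim: usize,
    text: &str,
) -> bool {
    match persisted {
        None => true, // no persisted vector yet
        Some(p) => {
            p.model != model_id
                || p.dim != dim
                || p.content_hash != content_hash_sketch(text)
        }
    }
}
```

Comparing a hash of the text instead of the text itself keeps the persisted record small while still detecting an edit to the source field.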

Synopsis

omgdb vstale <path> <collection> <field>

vstale prints the stale _ids (canonical JSON) to stdout and a summary count (N stale embedding(s)) to stderr.

Example

omgdb create app.omgdb
omgdb insert app.omgdb docs '{"text":"alpha beta"}'

omgdb vstale app.omgdb docs text   # stderr: "1 stale embedding(s)" (no persisted vector yet)
omgdb vsync  app.omgdb docs text   # stdout: "synced 1 embedding(s) into `docs.__vectors`"
omgdb vstale app.omgdb docs text   # stderr: "0 stale embedding(s)"

Editing a document’s source text marks only that document stale; re-running vsync clears it. Because the vectors live in an ordinary op-log-backed collection, they survive a reopen and remain consistent.

Note: vsearch and the context-pack commands recompute embeddings on the fly from the live document field; they do not consult the persisted <collection>.__vectors collection. vsync/vstale provide persistence, provenance, and staleness tracking — not a backing index that query-time search reads from. Using persisted vectors as the search backing store is planned but not yet implemented.

Library API

The omgdb-vector crate exposes these as plain Rust functions over a Store:

// `store` is assumed to be an already-open Store; imports are omitted here.
let e = HashingEmbedder::default();

// Flat cosine kNN: Vec<(Value /* _id */, f32 /* score */)>, best-first.
let results = search(&store, "docs", "text", "database search", 5, &e);

// Hybrid search with a compiled pre-filter.
let filter = omgdb_query::Filter::compile(
    &Value::from_json_str(r#"{"status":"published"}"#).unwrap(),
).unwrap();
let hits = search_where(&store, "docs", "text", "database search", 5, &e, &filter);

Function	Purpose
`search` / `search_where`	Flat cosine kNN, optionally with a structured pre-filter (hybrid).
`sync_vectors`	Persist embeddings + provenance into `<ns>.__vectors`; returns count written.
`list_stale_vectors`	The `_id`s whose persisted embedding is stale.
`cosine`	Cosine similarity of two slices.
`vector_ns`	The sibling collection name `{ns}.__vectors`.
`context_pack` / `context_pack_where`	Token-budgeted, cited retrieval bundle (see context packs).

MCP tools

The vector surface is also exposed over MCP for agents. Both tools are read-only and idempotent, and are gated at the read scope, so they are available even on a read-only MCP server.

Tool	Args	Description
`vsearch`	`path`, `collection`, `field`, `query`, optional `k` (default 5), optional `filter`	Semantic search over a text field; `filter` enables hybrid search.
`context_pack`	`path`, `collection`, `field`, `query`, optional `budget` (default 1000), optional `filter`	Token-budgeted, cited context pack.

See MCP for connecting an agent.

Limitations and caveats

Limitation: Search is flat/exact cosine kNN with a full collection scan — there is no ANN index. Cost is O(N) per query, and every document’s field is re-embedded on each vsearch call.

The bundled HashingEmbedder is a lexical bag-of-words baseline, not a semantic neural model. A real ONNX model (all-MiniLM-L6-v2) is a planned drop-in implementing the same Embedder trait — not yet implemented.
Only string fields are embedded and searched; documents lacking a string value at the field are silently skipped by sync_vectors, search, and context packs.
Both content hashing (staleness detection) and the embedder’s bucketing use a 64-bit FNV-1a hash, chosen for reproducibility, not collision resistance. The contentHash is a 16-hex-digit string.
vsearch and context packs do not yet read the persisted <collection>.__vectors collection; those vectors exist for provenance, auditing, and staleness reporting only.

Related

Context packs: token-budgeted, cited retrieval built on the same ranking.
Query operators: the filter syntax used by hybrid search.
MCP: running semantic search and context packs from an agent.
