Indexes

Secondary indexes in OMGDB — how the ordered single-field index accelerates equality, range, and multikey queries, and what is not yet supported.

OMGDB supports secondary indexes on a single top-level field of a collection. An index is an ordered structure that lets `find` answer certain queries by examining a small set of candidate documents instead of scanning the whole collection. Like every other piece of logical state in OMGDB, an index is a derived artifact: it is recorded in the append-only operation log and rebuilt in memory when the store is opened.

This page explains how to create an index, which predicates an index accelerates, how indexes are persisted and rebuilt, and the current limitations. For the predicate operators themselves, see query operators; for inspecting which plan a query uses, see introspection.

Creating an index

Create a secondary index with the create-index command. It takes the store path, the collection (namespace) name, and the field to index:

omgdb create-index app.omgdb users age

created index on `users.age`

The command opens the store, backfills the index from all existing documents in the collection, appends a create_index operation to the log, fsyncs, and applies the index in memory. Creating an index over a populated collection is therefore a one-time cost paid at creation; subsequent inserts, replaces, and deletes maintain the index incrementally.

Argument	Description
`path`	Store directory, e.g. `app.omgdb`.
`collection`	Collection (namespace) name.
`field`	Single top-level field name to index.

Note: The field name is a single top-level key. Dotted paths such as addr.city are not index targets — see the limitations below.

What the index accelerates

The index is an ordered structure built on each value’s order-preserving key. Concretely it is a BTreeMap<order-key, set-of-_id-keys> per (collection, field) pair. Because it is ordered, the same structure serves both point lookups and range scans. find chooses to use it for two kinds of top-level, single-field predicate: equality and range. Array (multikey) fields are not a separate predicate kind — they are handled by the equality path, as described below.
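As a rough illustration of the structure described above, the sketch below models one index as a `BTreeMap` from order key to a set of `_id` keys. The names `FieldIndex`, `OrderKey`, and `IdKey` are hypothetical, and the order key is simplified to an `i64`; OMGDB's actual key encoding is not reproduced here.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins: OMGDB's real order-key and `_id` encodings are not shown here.
type OrderKey = i64;
type IdKey = String;

/// Ordered index for one (collection, field) pair.
#[derive(Default)]
struct FieldIndex {
    buckets: BTreeMap<OrderKey, BTreeSet<IdKey>>,
}

impl FieldIndex {
    /// Record that document `id` holds a value with order key `key`.
    fn insert(&mut self, key: OrderKey, id: &str) {
        self.buckets.entry(key).or_default().insert(id.to_string());
    }

    /// Forget one (key, id) pair and drop the bucket once it is empty.
    fn remove(&mut self, key: OrderKey, id: &str) {
        let now_empty = match self.buckets.get_mut(&key) {
            Some(ids) => {
                ids.remove(id);
                ids.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.buckets.remove(&key);
        }
    }

    /// Point lookup: candidate ids for one key.
    fn lookup(&self, key: OrderKey) -> Vec<IdKey> {
        self.buckets
            .get(&key)
            .map(|ids| ids.iter().cloned().collect())
            .unwrap_or_default()
    }

    /// Range scan, inclusive at both ends, in key order.
    fn range(&self, lo: OrderKey, hi: OrderKey) -> Vec<IdKey> {
        if lo > hi {
            return Vec::new(); // BTreeMap::range panics on an inverted range
        }
        self.buckets
            .range(lo..=hi)
            .flat_map(|(_, ids)| ids.iter().cloned())
            .collect()
    }
}
```

Because the map is ordered, `lookup` and `range` share the same storage; a hash map would serve only the first.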

Equality

A single top-level equality predicate ({ field: value }, which compiles to $eq) on an indexed field is answered by a direct lookup into the index, then re-filtered for exactness:

{"age": 30}
{"role": "admin"}

Range

A single top-level range predicate on an indexed field is answered by an ordered range scan of the index. $gt/$gte supply the lower bound and $lt/$lte supply the upper bound, so a two-sided range becomes a single index range scan:

{"age": {"$gte": 18, "$lte": 65}}
{"age": {"$gt": 18}}
{"age": {"$lt": 30}}
{"age": {"$gte": 18, "$lt": 40}}
{"age": {"$gte": 40.5}}

The index range is a superset of the exact answer: the order key is order-preserving but not injective (for example 2 and 2.0 share a key), and the index bounds are inclusive at the key level. find always re-filters the index candidates with the full filter, which corrects both the strict-versus-inclusive distinction ($gt/$lt vs $gte/$lte) and any order-equal key collisions. The result is therefore identical to a full scan — only faster.
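The two-step shape (an inclusive key-level scan followed by an exact re-filter) can be sketched as follows. This is a minimal model under stated assumptions: values are plain integers, the order key equals the value, and `Limit`, `candidates`, `find_range`, and `build` are hypothetical names. Key collisions such as 2 versus 2.0 are not modeled; in this scheme they would be resolved by the same re-filter step.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::ops::Bound;

type Index = BTreeMap<i64, BTreeSet<String>>; // order key -> _id keys (hypothetical shapes)
type Docs = BTreeMap<String, i64>;            // _id -> value of the indexed field

/// One side of a range as written in the filter.
#[derive(Clone, Copy)]
enum Limit {
    Open,           // no bound on this side
    Inclusive(i64), // $gte or $lte
    Strict(i64),    // $gt or $lt
}

/// Key-level bound: strictness is dropped, so the scan is always inclusive.
fn key_bound(limit: Limit) -> Bound<i64> {
    match limit {
        Limit::Open => Bound::Unbounded,
        Limit::Inclusive(v) | Limit::Strict(v) => Bound::Included(v),
    }
}

/// Step 1: ordered scan that may over-select.
fn candidates(index: &Index, lo: Limit, hi: Limit) -> Vec<String> {
    let (lo_key, hi_key) = (key_bound(lo), key_bound(hi));
    if let (Bound::Included(a), Bound::Included(b)) = (lo_key, hi_key) {
        if a > b {
            return Vec::new(); // BTreeMap::range panics on an inverted range
        }
    }
    index
        .range((lo_key, hi_key))
        .flat_map(|(_, ids)| ids.iter().cloned())
        .collect()
}

/// Step 2: the exact predicate, applied to every candidate.
fn matches(value: i64, lo: Limit, hi: Limit) -> bool {
    let above = match lo {
        Limit::Open => true,
        Limit::Inclusive(b) => value >= b,
        Limit::Strict(b) => value > b,
    };
    let below = match hi {
        Limit::Open => true,
        Limit::Inclusive(b) => value <= b,
        Limit::Strict(b) => value < b,
    };
    above && below
}

/// Scan, then re-filter, so the result equals what a full scan would return.
fn find_range(index: &Index, docs: &Docs, lo: Limit, hi: Limit) -> Vec<String> {
    candidates(index, lo, hi)
        .into_iter()
        .filter(|id| docs.get(id).map_or(false, |v| matches(*v, lo, hi)))
        .collect()
}

/// Build a sample index and document map from (id, value) rows.
fn build(rows: &[(&str, i64)]) -> (Index, Docs) {
    let mut index = Index::new();
    let mut docs = Docs::new();
    for &(id, value) in rows {
        index.entry(value).or_default().insert(id.to_string());
        docs.insert(id.to_string(), value);
    }
    (index, docs)
}
```

With rows `("a", 17), ("b", 18), ("c", 30)`, a `$gt: 18` query yields candidates `b` and `c` from the scan, and the re-filter then drops `b`.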

Limitation: Range acceleration requires scalar bounds. If a bound value is an array or object, find falls back to a full collection scan.

Multikey (array fields)

When a document’s indexed field is an array, the index stores one key per array element plus one key for the array as a whole. This makes the index multikey: a single top-level equality predicate accelerates both array-contains equality and whole-array equality.

{"tags": "rag"}
{"tags": ["rag", "db"]}

Note: Only the bare-value (equality) forms above use the index. $in (for example {"tags": {"$in": ["db"]}}) is not index-accelerated — it compiles to a non-equality atom the planner does not inspect, so it falls back to a full collection scan and re-filters.

Because a single document can appear in several buckets (one per element), range scans deduplicate document _ids before returning candidates. As always, the candidates are re-filtered for exactness.
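A minimal sketch of the multikey behavior, assuming string-only values and a made-up tagged-string order key (`order_key`); `Value`, `Multikey`, and the other names are hypothetical and do not reflect OMGDB's internal types. It shows the element keys plus whole-array key, the equality re-filter, and why a range scan needs deduplication.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, PartialEq)]
enum Value {
    Str(String),
    List(Vec<String>),
}

/// Hypothetical order key: a tagged string. OMGDB's real encoding is not shown here.
fn order_key(v: &Value) -> String {
    match v {
        Value::Str(s) => format!("s:{}", s),
        Value::List(items) => format!("l:{}", items.join("\u{1f}")),
    }
}

/// One key for the whole value, plus one key per element when it is an array.
fn index_keys(v: &Value) -> Vec<String> {
    let mut keys = vec![order_key(v)];
    if let Value::List(items) = v {
        keys.extend(items.iter().map(|s| order_key(&Value::Str(s.clone()))));
    }
    keys
}

/// Exact equality used for the re-filter: array-contains or whole-value match.
fn matches_eq(doc: &Value, want: &Value) -> bool {
    match (doc, want) {
        (Value::List(items), Value::Str(s)) => items.contains(s),
        _ => doc == want,
    }
}

#[derive(Default)]
struct Multikey {
    docs: BTreeMap<String, Value>,             // _id -> indexed field value
    index: BTreeMap<String, BTreeSet<String>>, // order key -> _id keys
}

impl Multikey {
    fn insert(&mut self, id: &str, value: Value) {
        for key in index_keys(&value) {
            self.index.entry(key).or_default().insert(id.to_string());
        }
        self.docs.insert(id.to_string(), value);
    }

    /// Equality: one bucket lookup, then re-filter for exactness.
    fn find_eq(&self, want: &Value) -> Vec<String> {
        let bucket = match self.index.get(&order_key(want)) {
            Some(ids) => ids,
            None => return Vec::new(),
        };
        bucket
            .iter()
            .filter(|id| matches_eq(&self.docs[*id], want))
            .cloned()
            .collect()
    }

    /// Range over keys: a document may sit in several buckets, so ids go into a set.
    fn range_candidates(&self, lo: &str, hi: &str) -> BTreeSet<String> {
        if lo > hi {
            return BTreeSet::new(); // BTreeMap::range panics on an inverted range
        }
        self.index
            .range(lo.to_string()..=hi.to_string())
            .flat_map(|(_, ids)| ids.iter().cloned())
            .collect()
    }
}
```

In this model a document tagged `["rag", "db"]` lands in three buckets, so a key range covering both element keys meets it twice; collecting into a `BTreeSet` returns it once.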

How indexes are persisted

An index is a derived artifact, not independent on-disk state. Creating one appends a create_index operation ({ns, field}) to oplog.ndjson; that record is fsynced before the index is applied in memory. The index contents themselves are never written to disk.
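The ordering constraint (record durable first, memory second) can be sketched with the standard library alone. The record shape, the function names, and the `live` list standing in for in-memory index state are all assumptions made for illustration; the actual `oplog.ndjson` schema is not reproduced here.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append one line to the log and fsync before returning.
fn append_durably(log_path: &Path, line: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(log_path)?;
    file.write_all(line.as_bytes())?;
    file.write_all(b"\n")?;
    file.sync_all() // the record is on disk before the caller proceeds
}

/// Log a create_index record, then apply it in memory.
/// `live` is a hypothetical stand-in for the in-memory set of indexes.
fn create_index(
    log_path: &Path,
    live: &mut Vec<(String, String)>,
    ns: &str,
    field: &str,
) -> io::Result<()> {
    // Hypothetical record shape; naive formatting with no JSON escaping.
    let record = format!(
        "{{\"op\":\"create_index\",\"ns\":\"{}\",\"field\":\"{}\"}}",
        ns, field
    );
    append_durably(log_path, &record)?; // on failure, memory is left untouched
    live.push((ns.to_string(), field.to_string()));
    Ok(())
}
```

If the append or the fsync fails, the `?` returns early and the in-memory state never diverges from what the log can rebuild.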

When the store is opened, OMGDB replays the entire log: each create_index record reconstructs the index, and inserts/replaces/deletes that follow it maintain it. This makes indexes part of OMGDB’s core invariants:

I1 (text completeness): the log fully determines logical state — including which indexes exist and their contents.
I2 (rebuild equivalence): reopening the store (or compacting and reopening) reproduces the exact same indexes. The .verify integrity check re-reads the log, re-folds it, and asserts that the rebuilt indexes equal the live in-memory indexes.
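To make the two invariants concrete, the sketch below folds a small log into documents and indexes. It is a simplified model, not OMGDB's code: `Op`, `State`, and `fold` are hypothetical names, documents hold integer fields only, and an insert of an existing `_id` is treated as a replace.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Doc = BTreeMap<String, i64>;             // field -> value (integers only here)
type Index = BTreeMap<i64, BTreeSet<String>>; // order key -> _id keys

/// Hypothetical log records; the real oplog.ndjson schema is not reproduced here.
enum Op {
    Insert { ns: String, id: String, doc: Doc },
    Delete { ns: String, id: String },
    CreateIndex { ns: String, field: String },
}

#[derive(Default, PartialEq, Debug)]
struct State {
    docs: BTreeMap<(String, String), Doc>,      // (ns, _id) -> document
    indexes: BTreeMap<(String, String), Index>, // (ns, field) -> index
}

impl State {
    fn apply(&mut self, op: &Op) {
        match op {
            Op::Insert { ns, id, doc } => {
                self.unindex(ns, id); // a repeated _id acts as a replace
                for ((ins, field), index) in self.indexes.iter_mut() {
                    if ins == ns {
                        if let Some(v) = doc.get(field) {
                            index.entry(*v).or_default().insert(id.clone());
                        }
                    }
                }
                self.docs.insert((ns.clone(), id.clone()), doc.clone());
            }
            Op::Delete { ns, id } => {
                self.unindex(ns, id);
                self.docs.remove(&(ns.clone(), id.clone()));
            }
            Op::CreateIndex { ns, field } => {
                // Backfill from every document already in the collection.
                let mut index = Index::new();
                for ((dns, id), doc) in &self.docs {
                    if dns == ns {
                        if let Some(v) = doc.get(field) {
                            index.entry(*v).or_default().insert(id.clone());
                        }
                    }
                }
                self.indexes.insert((ns.clone(), field.clone()), index);
            }
        }
    }

    /// Remove a document's entries from every index on its collection.
    fn unindex(&mut self, ns: &str, id: &str) {
        let old = match self.docs.get(&(ns.to_string(), id.to_string())) {
            Some(doc) => doc.clone(),
            None => return,
        };
        for ((ins, field), index) in self.indexes.iter_mut() {
            if ins.as_str() != ns {
                continue;
            }
            if let Some(v) = old.get(field) {
                let now_empty = match index.get_mut(v) {
                    Some(ids) => {
                        ids.remove(id);
                        ids.is_empty()
                    }
                    None => false,
                };
                if now_empty {
                    index.remove(v);
                }
            }
        }
    }
}

/// Rebuild all state from the log alone.
fn fold(log: &[Op]) -> State {
    let mut state = State::default();
    for op in log {
        state.apply(op);
    }
    state
}
```

In this model, a log that creates the index after two inserts (backfill) and a log that creates it before them (incremental maintenance) fold to equal `State` values, which is the kind of equality a rebuild check can assert.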

Compaction rewrites the log to a minimal form that includes one create_index record per index (with keys sorted), so an index survives compaction and is re-derived on the next open. See storage for the durability and compaction model.

Note: In the current milestone all materialized state (data, the validation catalog, and indexes) lives only in memory and is rebuilt by replaying the log on every open. There are no persistent binary indexes or caches yet, so startup cost scales with total log size until compaction shrinks it.

Inspecting the plan with `explain`

Use explain to see whether a filter will use an index or fall back to a full scan. It compiles the filter and reports the chosen plan as a plain-language string.

omgdb explain app.omgdb users '{"age":{"$gte":18,"$lte":65}}'

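When the field is indexed, explain reports an index plan. An equality filter produces the first form below; a range filter such as the one above produces the second: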

index scan: equality on `users.age` (secondary index), then filter

index range scan: range on `users.age` (secondary index), then filter

When no usable index exists, it reports a full scan — and if a top-level equality is present on an un-indexed field, it appends a suggestion:

full collection scan of `users` (1000 documents), then filter; no index on `age` — suggest `omgdb create-index users age`

See introspection for explain, the diagnose “why-not” debugger, and other planning tools.

Limitations

The index implementation is single-field and single-predicate. The following are explicitly not supported.

Limitation: Compound (multi-field) indexes are not implemented. Each index covers exactly one field, and there is no index intersection or composite key. A multi-field filter uses at most one single-field index — the first matching top-level equality, otherwise the first matching range — and re-filters the candidates. A query that has no single-field predicate it can use falls back to a full scan.
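The selection rule described above (first usable equality, otherwise first usable range, otherwise a full scan) can be sketched as a small function. `Pred`, `Plan`, and `choose_plan` are hypothetical names; `Pred::Other` stands for any atom the planner does not inspect, and the input slice stands for the predicates in the root `AND`.

```rust
use std::collections::BTreeSet;

/// A compiled top-level predicate, reduced to what plan selection needs.
enum Pred {
    Eq(String),    // equality on this field
    Range(String), // range on this field
    Other,         // anything the planner does not inspect
}

#[derive(Debug, PartialEq)]
enum Plan {
    IndexEq(String),
    IndexRange(String),
    FullScan,
}

/// Pick at most one single-field index: equality first, then range.
fn choose_plan(root_and: &[Pred], indexed: &BTreeSet<String>) -> Plan {
    // Only single-segment field names with an index are eligible.
    let eligible = |field: &String| !field.contains('.') && indexed.contains(field);

    for pred in root_and {
        if let Pred::Eq(field) = pred {
            if eligible(field) {
                return Plan::IndexEq(field.clone());
            }
        }
    }
    for pred in root_and {
        if let Pred::Range(field) = pred {
            if eligible(field) {
                return Plan::IndexRange(field.clone());
            }
        }
    }
    Plan::FullScan
}
```

Whichever plan is chosen, the remaining predicates are still applied by the re-filter step, so the plan affects only how many documents are examined.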

Limitation: Unique and partial indexes are not implemented. An index never enforces uniqueness and never restricts which documents it covers; it is purely an access accelerator. Use schema validation for the validation rules that exist today.

Additional constraints to be aware of:

Constraint	Behavior
Logical operators	A top-level `$or`, `$nor`, or `$not`, or an equality/range predicate nested inside any logical operator, is not seen by the planner and forces a full scan. Only single-field predicates in the root `AND` are eligible.
Dotted paths	A dotted-path predicate such as `{"addr.city":"athens"}` is never index-accelerated; only single-segment field names are eligible.
Range bounds	Range acceleration requires scalar bounds; an array or object bound forces a full scan.
Primary `_id` lookups	Documents are keyed by the canonical-JSON form of their `_id` in an ordered map, so a lookup by `_id` is O(log n). There is no separate hash index on `_id`.

For the full operator reference and array (multikey) matching semantics, see query operators.