Introspection — OMGDB Docs

Self-inspection commands that let an agent understand an OMGDB database without guessing — describe, inspect, dump, explain, and diagnose.

OMGDB is built so an agent can understand a database it has never seen before, without guessing. The introspection surface answers four questions directly: what collections and fields exist, how many documents are there, what is the canonical state, and why does a query behave the way it does. Every command reads through the public store API and produces output meant to be consumed by a human or an LLM.

There are five commands. `describe` and `inspect` report structure; `dump` exports state; `explain` and `diagnose` account for query behavior. All of them operate on a store directory passed as the first positional argument (the examples use `app.omgdb`).

## describe — a live Markdown manual

```sh
omgdb describe app.omgdb
```

`describe` renders a Markdown “manual” of the whole database. The database name is taken from the store directory’s file name (falling back to the literal `database` if it can’t be resolved). The output contains, in order:

1. A heading `# Database: <name>` followed by a blank line.
2. A line `<N> collection(s).`
3. For each collection: a `## <ns> (<count> document(s))` heading, an inferred-schema table, and one sample document.

The schema table has three columns (`field`, `types`, and `present`) and one row per top-level field. The `present` column is a `<present>/<count>` ratio showing how many of the sampled documents contained that field. When a collection has at least one document, a `Sample:` line is followed by a fenced `json` block containing the first document rendered as canonical JSON.

````markdown
# Database: mydb

1 collection(s).

## users (2 document(s))

| field | types | present |
|-------|-------|---------|
| age | long | 1/2 |
| name | string | 2/2 |

Sample:

```json
{"name":"ana","age":30}
```
````

> **Note:** `describe` emits only collections, the inferred-schema table, and one sample document per collection. It does **not** report secondary indexes or vector/embedding state — use [inspect](#inspect-collections-and-counts) to see all collections (including the `__vectors` sidecars from [vector search](/docs/vector-search)), and [explain](#explain-the-query-plan) to learn which indexes a query can use.

An empty collection still prints its table header, but with no rows and no sample block.
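
An agent that consumes this output programmatically might pull the collection names and counts out of the per-collection headings. The following is a minimal Python sketch, assuming only the heading format described above; `parse_describe_headings` is a hypothetical helper, not part of OMGDB.

```python
import re

# Matches headings of the form "## <ns> (<count> document(s))".
_HEADING = re.compile(r"^## (?P<ns>.+) \((?P<count>\d+) document\(s\)\)$")


def parse_describe_headings(markdown: str) -> dict[str, int]:
    """Return {collection name: document count} from describe output."""
    counts = {}
    for line in markdown.split("\n"):
        match = _HEADING.match(line)
        if match:
            counts[match.group("ns")] = int(match.group("count"))
    return counts


sample = """# Database: mydb

1 collection(s).

## users (2 document(s))
"""
print(parse_describe_headings(sample))  # {'users': 2}
```

When only counts are needed, `inspect --json` is likely the simpler source, since it requires no Markdown parsing.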

## Schema inference: types and presence

The schema table comes from a scan over the collection. For each document, every **top-level** field is recorded: the distinct value types observed, and a presence counter. The result per field is:

| Column | Meaning |
|--------|---------|
| `field` | The top-level field name. |
| `types` | The distinct type-name tokens observed, sorted and de-duplicated. A field seen as both a number and a string lists both. |
| `present` | How many sampled documents contained the field, over the total document count. |

Type tokens are MongoDB-style names (the same ones used by the [`$type` query operator](/docs/query-operators)): `null`, `bool`, `long` (an i64 integer), `double` (an f64 float), `string`, `binData` (bytes), `array`, `object`, `objectId`, and `date`. Note that integers report `long`, not `int`.

> **Limitation:** Schema inference is top-level only. Nested object fields and array element fields are not descended into, so a field whose value is an embedded document appears as a single `object` type with no breakdown of its inner keys.
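
The scan described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not OMGDB's implementation: documents are modeled as plain Python dicts, `type_token` and `infer_schema` are hypothetical names, the `objectId` and `date` tokens are omitted because plain JSON values cannot represent them, and sorting rows by field name is an assumption based on the example table.

```python
from collections import defaultdict


def type_token(value) -> str:
    """Map a Python value to one of the type tokens listed above (partial)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must precede int: bool is an int subclass
        return "bool"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, (bytes, bytearray)):
        return "binData"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"no token for {type(value).__name__}")


def infer_schema(docs):
    """Return (field, sorted type tokens, 'present/total') rows, top-level only."""
    types = defaultdict(set)
    present = defaultdict(int)
    for doc in docs:
        for field, value in doc.items():  # nested keys are never visited
            types[field].add(type_token(value))
            present[field] += 1
    total = len(docs)
    return [(f, sorted(types[f]), f"{present[f]}/{total}") for f in sorted(types)]


for row in infer_schema([{"name": "ana", "age": 30}, {"name": "bo"}]):
    print(row)
# ('age', ['long'], '1/2')
# ('name', ['string'], '2/2')
```

An embedded document such as `{"address": {"city": "x"}}` yields a single `address` row of type `object`, which mirrors the top-level-only limitation.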

## inspect — collections and counts

```sh
omgdb inspect app.omgdb
omgdb inspect app.omgdb --json
```

`inspect` is the lightweight counterpart to `describe`: it lists every collection and its document count, without sampling fields. By default it prints one text line per collection in the form `<ns>: <n> docs` (or `(empty store)` when there are no collections).

With `--json` it emits a machine-readable object as canonical JSON:

```json
{"collections":[{"name":"users","count":2}]}
```

| Flag | Description |
|------|-------------|
| `--json` | Emit `{"collections":[{name,count}]}` as canonical JSON instead of text lines. |

Because vector embeddings are stored as ordinary collections, a synced field shows up here as a sidecar: for example, a `docs` collection synced via `vsync` produces a `docs.__vectors` entry visible in `inspect`.
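
Since the `--json` output is a single JSON object, consuming it takes little code. The Python sketch below assumes the `{"collections":[{name,count}]}` shape shown above and the `.__vectors` suffix from the sidecar example; `parse_inspect` and `vector_sidecars` are hypothetical helper names, and the sample input is made up for illustration.

```python
import json


def parse_inspect(output: str) -> dict[str, int]:
    """Turn `inspect --json` output into {collection name: document count}."""
    report = json.loads(output)
    return {c["name"]: c["count"] for c in report["collections"]}


def vector_sidecars(counts: dict[str, int]) -> list[str]:
    """List collections that look like vector sidecars (suffix `.__vectors`)."""
    return sorted(name for name in counts if name.endswith(".__vectors"))


# Made-up sample input in the documented shape.
output = (
    '{"collections":[{"name":"docs","count":3},'
    '{"name":"docs.__vectors","count":3},{"name":"users","count":2}]}'
)
counts = parse_inspect(output)
print(counts["users"])          # 2
print(vector_sidecars(counts))  # ['docs.__vectors']
```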

## dump — deterministic canonical export

```sh
omgdb dump app.omgdb
```

`dump` produces a deterministic, line-oriented export of the entire logical state. It writes one line per document, formatted as the collection name, a tab, then the document as canonical JSON:

```text
<collection>\t<canonical-json>
```

Documents are emitted in collection iteration order, then in `_id` order within each collection. The output is stable across runs: two successive `dump` calls on an unchanged store are byte-for-byte identical. This determinism is the basis of invariant I3, a property that lets you snapshot, diff, and compare database states reliably (see transactions and durability for the broader integrity model).

```text
c	{"_id":1,"v":"a"}
c	{"_id":2,"v":"b"}
```
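
Because the format is one tab-separated line per document and the output is deterministic, snapshots can be compared with ordinary text tools. The Python sketch below shows one possible approach; `parse_dump`, `diff_dumps`, and `digest` are hypothetical helpers, and the code assumes every document fits on a single `\n`-terminated line, as the format above implies.

```python
import hashlib


def parse_dump(text: str) -> list[tuple[str, str]]:
    """Split dump output into (collection, canonical JSON) pairs."""
    pairs = []
    for line in text.split("\n"):
        if line:
            collection, doc = line.split("\t", 1)  # split on the first tab only
            pairs.append((collection, doc))
    return pairs


def diff_dumps(before: str, after: str) -> tuple[list[str], list[str]]:
    """Return (removed lines, added lines) between two dumps."""
    old = {line for line in before.split("\n") if line}
    new = {line for line in after.split("\n") if line}
    return sorted(old - new), sorted(new - old)


def digest(text: str) -> str:
    """Fingerprint a dump; equal dumps always give equal digests."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


before = 'c\t{"_id":1,"v":"a"}\nc\t{"_id":2,"v":"b"}\n'
after = 'c\t{"_id":1,"v":"a"}\nc\t{"_id":2,"v":"z"}\n'
removed, added = diff_dumps(before, after)
print(removed)  # ['c\t{"_id":2,"v":"b"}']
print(added)    # ['c\t{"_id":2,"v":"z"}']
```

Splitting on `"\n"` rather than using `str.splitlines()` is deliberate: `splitlines()` also breaks on characters such as U+2028, which may legally appear unescaped inside JSON strings.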

## explain — the query plan

```sh
omgdb explain app.omgdb users '{"name":"ana"}'
```

`explain` compiles a MongoDB-style filter and reports, in plain language, how `find` will execute it: an index scan, an index range scan, or a full collection scan. The filter argument is a JSON string.

There are three possible plans:

| Plan string | When it applies |
|-------------|-----------------|
| `` index scan: equality on `<ns>.<field>` (secondary index), then filter `` | A single top-level equality predicate is on an indexed field. |
| `` index range scan: range on `<ns>.<field>` (secondary index), then filter `` | A single top-level range predicate (`$gt`/`$gte`/`$lt`/`$lte`) with scalar bounds is on an indexed field. |
| `` full collection scan of `<ns>` (<N> documents), then filter `` | No usable index; `N` is the live document count. |

When the query falls back to a full scan and it contains a top-level equality on a field that is not indexed, the plan appends a self-repair hint:

```text
full collection scan of `users` (1000 documents), then filter; no index on `name` — suggest `omgdb create-index users name`
```

The hint names the exact command to create the index that would accelerate the query. See indexes for which predicates are eligible for index acceleration; in short, only a single top-level equality or a single top-level scalar range on a directly indexed field is accelerated, and everything else scans.
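
An agent can act on the self-repair hint by extracting the suggested command from the plan text. The Python sketch below assumes the hint always takes the form shown in the example above (the word `suggest` followed by a backtick-quoted `omgdb create-index` command); `suggested_command` is a hypothetical helper name.

```python
import re

# Captures the backtick-quoted command that follows the word "suggest".
_HINT = re.compile(r"suggest `(omgdb create-index [^`]+)`")


def suggested_command(plan: str):
    """Return the suggested create-index command, or None if there is no hint."""
    match = _HINT.search(plan)
    return match.group(1) if match else None


plan = (
    "full collection scan of `users` (1000 documents), then filter; "
    "no index on `name` \u2014 suggest `omgdb create-index users name`"
)
print(suggested_command(plan))  # omgdb create-index users name
```

A plan without a hint, such as a bare full-scan line, yields `None`, so the caller can tell "no index would help here" apart from "an index is suggested".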

## diagnose — the why-not debugger

```sh
omgdb diagnose app.omgdb users '{"age":{"$gte":25}}'
```

`explain` tells you how a query runs; `diagnose` tells you why it returns what it returns. It is a “why-not” debugger: for each top-level field predicate it counts how many documents satisfy that predicate alone, so an agent can immediately see which condition is the limiting (or eliminating) one. The output is canonical JSON with this shape:

```json
{
  "collection": "users",
  "totalDocuments": 100,
  "matched": 12,
  "predicates": [
    { "field": "role", "matched": 80 },
    { "field": "age", "matched": 12 }
  ]
}
```

The fields are:

| Field | Meaning |
|-------|---------|
| `collection` | The namespace diagnosed. |
| `totalDocuments` | Total documents in the collection. |
| `matched` | How many documents satisfy the full filter. |
| `predicates` | One entry per top-level field predicate, with the field name and its standalone match count. |

When a single predicate matches zero documents, that entry additionally reports the field’s observed value range, `observedMin` and `observedMax` under the engine’s total value order, so the agent can see how far the predicate’s threshold is from any real data:

```json
// filter: {"role":"admin","age":{"$gt":90}}
{
  "collection": "users",
  "totalDocuments": 2,
  "matched": 0,
  "predicates": [
    { "field": "role", "matched": 2 },
    { "field": "age", "matched": 0, "observedMin": 30, "observedMax": 40 }
  ]
}
```

Here the report makes the failure obvious: `role` alone matches both documents, but `age > 90` matches none, and the observed range (30–40) shows the threshold is far above any stored value.

> **Tip:** Logical operators are not diagnosed per field. Any top-level key beginning with `$` (such as `$or`, `$and`, `$nor`) is skipped in the `predicates` list, so `diagnose` is most useful on the conjunction of field conditions that makes up the body of a filter.
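
The per-predicate counting can be illustrated with a small Python model. This is a sketch of the idea, not OMGDB's implementation: it handles only equality and the `$eq`/`$gt`/`$gte`/`$lt`/`$lte` operators, treats documents as plain dicts, ignores `$`-prefixed top-level keys entirely (including when computing `matched`), and relies on Python's own comparison rather than the engine's total value order, so mixed-type fields are out of scope. The function names are hypothetical.

```python
import operator

# Comparison operators handled by this sketch; anything else raises KeyError.
_OPS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def _matches(doc, field, cond) -> bool:
    """True if doc satisfies the single predicate {field: cond}."""
    if field not in doc:
        return False
    if isinstance(cond, dict):  # operator form, e.g. {"$gt": 90}
        return all(_OPS[op](doc[field], bound) for op, bound in cond.items())
    return doc[field] == cond   # plain equality


def diagnose(collection, docs, flt):
    """Count matches for each top-level field predicate on its own."""
    fields = [key for key in flt if not key.startswith("$")]
    predicates = []
    for field in fields:
        entry = {"field": field,
                 "matched": sum(_matches(d, field, flt[field]) for d in docs)}
        observed = [d[field] for d in docs if field in d]
        if entry["matched"] == 0 and observed:
            entry["observedMin"] = min(observed)
            entry["observedMax"] = max(observed)
        predicates.append(entry)
    matched = sum(all(_matches(d, f, flt[f]) for f in fields) for d in docs)
    return {"collection": collection, "totalDocuments": len(docs),
            "matched": matched, "predicates": predicates}


docs = [{"role": "admin", "age": 30}, {"role": "admin", "age": 40}]
report = diagnose("users", docs, {"role": "admin", "age": {"$gt": 90}})
print(report["predicates"])
# [{'field': 'role', 'matched': 2},
#  {'field': 'age', 'matched': 0, 'observedMin': 30, 'observedMax': 40}]
```

With the two sample documents, the sketch reproduces the zero-match report shown above: the `age` entry is the eliminating predicate, and its observed range shows how far the threshold sits from the data.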

## LLM-targeted errors and did-you-mean suggestions

Introspection extends into error reporting. When a filter passed to `find`, `explain`, or `diagnose` uses an operator OMGDB does not recognize, the compiler does not fail silently: it returns an `UnknownOperator` error naming the bad token and, when a close match exists, a deterministic “did you mean” suggestion. The suggestion is the closest known operator within a Levenshtein edit distance of 2.

```json
{"age":{"$gtee":1}}
// error: $gtee (did you mean `$gte`?)
```

This lets an agent self-repair a malformed query in a single step rather than guessing. The known operators are: `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin`, `$exists`, `$type`, `$not`, `$and`, `$or`, `$nor`, `$size`, `$all`, `$mod`, `$elemMatch`, and `$regex`. See [query operators](/docs/query-operators) for their full semantics.
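
The suggestion rule can be sketched with a textbook Levenshtein distance. The Python code below is an illustration rather than OMGDB's implementation: `levenshtein` and `suggest` are hypothetical names, and breaking ties by the operator's string order is an assumption made here to keep the result deterministic (the page does not say how ties are resolved).

```python
KNOWN_OPERATORS = [
    "$eq", "$ne", "$gt", "$gte", "$lt", "$lte", "$in", "$nin", "$exists",
    "$type", "$not", "$and", "$or", "$nor", "$size", "$all", "$mod",
    "$elemMatch", "$regex",
]


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suggest(token: str, max_distance: int = 2):
    """Closest known operator within max_distance, else None."""
    # min() over (distance, name) tuples: ties fall to string order (assumed).
    distance, best = min((levenshtein(token, op), op) for op in KNOWN_OPERATORS)
    return best if distance <= max_distance else None


print(suggest("$gtee"))  # $gte
```

For `$gtee`, the nearest operator is `$gte` at distance 1 (one deletion), ahead of `$gt` and `$lte` at distance 2, so the suggestion matches the error message shown above.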

## Putting it together

A typical agent workflow uses these commands in sequence:

```sh
# 1. Learn the shape of an unfamiliar database.
omgdb describe app.omgdb

# 2. Get a quick count of every collection.
omgdb inspect app.omgdb --json

# 3. Write a query; check how it will run.
omgdb explain app.omgdb users '{"age":{"$gte":25}}'

# 4. If it unexpectedly returns nothing, find out why.
omgdb diagnose app.omgdb users '{"age":{"$gte":25}}'

# 5. Snapshot the exact state for diffing.
omgdb dump app.omgdb
```

Together these turn a database from an opaque blob into a self-describing artifact: an agent can read its structure, count its contents, reason about query plans, debug empty result sets, and capture a deterministic snapshot — all without prior knowledge of the schema.