Transactions & Durability

How OMGDB provides ACID transactions, crash-safe durability via append+fsync, torn-tail recovery, compaction, integrity verification, and log repair.

OMGDB’s durability model follows directly from its storage design: the append-only operation log (oplog.ndjson) is the canonical source of truth, and every logical state is rebuilt by replaying it on open. This page describes how that log delivers ACID transactions and crash safety, and the CLI tools that compact, verify, and repair it.

Two facts shape everything below. First, the engine is single-writer: a store takes an exclusive advisory lock on its directory at open, so at most one process mutates the log at a time. Second, every mutation is append, fsync, then apply in memory: the durable log is updated before in-memory state changes, so in-memory state never reflects a write that is not already durable.

ACID transactions

A transaction groups several writes so they become durable and visible together, or not at all. You run one through Store::transaction, passing a closure that receives a Txn handle:

let mut s = Store::open(&dir)?;
s.transaction(|t| {
    t.insert_one("c", doc(&[("k", Value::I64(1))]))?;
    t.insert_one("c", doc(&[("k", Value::I64(2))]))?;
    Ok(())
})?;
assert_eq!(s.count("c"), 2);

// durable + replayed from the log
let s2 = Store::open(&dir)?;
assert_eq!(s2.count("c"), 2);

The Txn handle exposes insert_one, replace_one, delete_by_id, and find_by_id — the same write/read surface as the store, but buffered.

Read-your-writes overlay

A Txn reads from a consistent committed snapshot (base) plus an overlay of its own pending writes. The overlay is a map from (collection, _id-key) to Some(document) (an upsert) or None (a delete). Reads consult the overlay first, then the base — so a transaction always sees its own uncommitted changes:

s.transaction(|t| {
    let id = t.insert_one("c", doc(&[("k", Value::I64(9))]))?;
    let got = t.find_by_id("c", &id).cloned();
    assert_eq!(got.and_then(|d| d.get("k").cloned()), Some(Value::I64(9)));
    assert!(t.delete_by_id("c", &id)?);
    assert!(t.find_by_id("c", &id).is_none());
    Ok(())
})?;
assert_eq!(s.count("c"), 0); // insert then delete in one txn nets nothing

Duplicate-_id checks and validation rules apply against the overlay-merged view, so an insert that collides with an earlier insert in the same transaction is rejected immediately.
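The overlay lookup described above can be sketched in a few lines. This is a minimal illustration, not OMGDB's implementation: `TxnView`, the string-based `Key` and `Doc` aliases, and the method names are hypothetical stand-ins for the real `Txn` and document types.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins: the real document and _id types are richer.
type Key = (String, String); // (collection, _id-key)
type Doc = String;

/// A transaction view: a frozen committed snapshot plus pending writes.
struct TxnView<'a> {
    base: &'a HashMap<Key, Doc>,
    // Some(doc) = pending upsert, None = pending delete.
    overlay: HashMap<Key, Option<Doc>>,
}

impl<'a> TxnView<'a> {
    fn new(base: &'a HashMap<Key, Doc>) -> Self {
        TxnView { base, overlay: HashMap::new() }
    }

    /// Consults the overlay first, then falls back to the base snapshot.
    fn find(&self, coll: &str, id: &str) -> Option<&Doc> {
        let key = (coll.to_string(), id.to_string());
        match self.overlay.get(&key) {
            Some(pending) => pending.as_ref(), // a pending upsert or delete wins
            None => self.base.get(&key),
        }
    }

    /// Rejects a duplicate _id against the overlay-merged view.
    fn insert(&mut self, coll: &str, id: &str, doc: Doc) -> Result<(), String> {
        if self.find(coll, id).is_some() {
            return Err(format!("duplicate _id {id} in {coll}"));
        }
        self.overlay.insert((coll.to_string(), id.to_string()), Some(doc));
        Ok(())
    }

    /// Records a pending delete; returns true if a visible document existed.
    fn delete(&mut self, coll: &str, id: &str) -> bool {
        if self.find(coll, id).is_none() {
            return false;
        }
        self.overlay.insert((coll.to_string(), id.to_string()), None);
        true
    }
}
```

Because the base is only borrowed immutably, nothing in the sketch can change committed state; all pending changes live in the overlay until commit.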

Atomic commit

When the closure returns Ok, the buffered operations are written to the log framed by begin and commit markers, all tagged with the same transaction id, then fsynced once:

Step	Record(s) written
Open	one `begin` marker (txn id = the LSN the begin record takes)
Body	the buffered `insert` / `replace` / `delete` ops, in order, each tagged with the txn id
Close	one `commit` marker
Durability	a single `sync()` (fsync) after the commit marker, then ops are applied in memory

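The transaction id is the log sequence number (LSN) the begin record occupies, which is unique within the log. Because the whole group is fsynced before any in-memory apply, commit is atomic and durable as a unit. If a crash interrupts the write before the commit marker is durable, replay discards the incomplete group (see Crash before commit is discarded).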
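The framing in the table above can be sketched as a pure function that turns a buffer of operations into tagged records. The `Op` and `Record` shapes here are hypothetical, and the sketch assumes that every record, markers included, takes the next consecutive LSN.

```rust
/// Hypothetical record shapes, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Begin,
    Insert { ns: String, id: u64 },
    Delete { ns: String, id: u64 },
    Commit,
}

#[derive(Debug, Clone, PartialEq)]
struct Record {
    lsn: u64,
    txn: Option<u64>,
    op: Op,
}

/// Frames buffered ops as begin / body / commit, starting at `next_lsn`.
/// An empty buffer (a read-only transaction) produces no records at all.
fn frame_txn(next_lsn: u64, buffered: Vec<Op>) -> Vec<Record> {
    if buffered.is_empty() {
        return Vec::new();
    }
    let txn_id = next_lsn; // the LSN the begin record takes
    let mut out = Vec::with_capacity(buffered.len() + 2);
    let mut lsn = next_lsn;
    let ops = std::iter::once(Op::Begin)
        .chain(buffered)
        .chain(std::iter::once(Op::Commit));
    for op in ops {
        // Assumption: each record, markers included, consumes one LSN.
        out.push(Record { lsn, txn: Some(txn_id), op });
        lsn += 1;
    }
    out
}
```

A writer would append all returned records and call its sync once, after the commit marker, before applying anything in memory.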

Note: A read-only transaction (one that buffers no operations) writes nothing to the log — no begin, no commit. There is no log churn for pure reads.

Abort

If the closure returns Err, the transaction aborts by writing nothing: no records reach the log and no in-memory state changes. The store never emits an explicit abort record, and it never leaves a dangling begin behind, because the begin marker is only written at commit time together with the rest of the group.

let r: Result<(), StoreError> = s.transaction(|t| {
    t.insert_one("c", doc(&[("k", Value::I64(1))]))?;
    Err(StoreError::Io(std::io::Error::other("boom")))
});
assert!(r.is_err());
assert_eq!(s.count("c"), 0);

Note: The abort op token is honoured on replay (a begin/…/abort group is discarded), but only external log producers ever write it. OMGDB’s own abort path writes nothing at all.

Isolation

The &mut self borrow in transaction() serializes transactions: each runs to completion before the next can start, with no interleaving. Combined with the single-writer directory lock, the effective isolation is serializable — stronger than snapshot isolation — for this single-writer engine. The Txn reads from a frozen committed snapshot plus its overlay, so a transaction never observes a concurrent writer (there are none).

Limitation: Isolation comes entirely from single-writer serialization. There is no multi-reader / multi-writer concurrency: a second process opening the same store directory is cleanly refused with StoreError::Locked.

Durability & crash recovery

The write path

Every durable mutation — autocommitted (insert_one, delete_by_id, replace_one, create_index, define_collection) or transactional — follows the same ordering:

append framed record(s) to oplog.ndjson  ->  flush + fsync (sync_data)  ->  apply in memory

The fsync happens before the in-memory state changes. On POSIX, after creating the log file (and after a compaction rename) OMGDB also fsyncs the store directory so the file’s directory entry survives a crash.

Limitation: Directory fsync (sync_dir) is a no-op on Windows. The guarantee that a newly created or renamed file’s directory entry survives a crash therefore holds only on POSIX filesystems (for example ext4 or XFS).
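The append, sync, apply ordering can be illustrated with the standard library alone. `MiniStore` is a hypothetical toy, not OMGDB's `Store`; it only shows that an I/O error returns before in-memory state is touched.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical minimal store: a log file plus an in-memory list of lines.
struct MiniStore {
    log: File,
    applied: Vec<String>,
}

impl MiniStore {
    fn open(path: &Path) -> io::Result<Self> {
        let log = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(MiniStore { log, applied: Vec::new() })
    }

    /// Append, then sync, then apply. Any error returns early,
    /// so `applied` never gets ahead of what reached the disk.
    fn write(&mut self, line: &str) -> io::Result<()> {
        self.log.write_all(line.as_bytes())?;
        self.log.write_all(b"\n")?; // the newline terminates the record
        self.log.sync_data()?; // durable before memory changes
        self.applied.push(line.to_string());
        Ok(())
    }
}
```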

Torn-tail recovery

A record is durable only once its terminating newline reaches disk. On open, replay splits the file’s bytes at the last newline: everything up to and including it is the durable region; any trailing fragment after it is an incomplete crash-time write and is silently dropped (and flagged via truncated_tail).

This works even when the torn fragment is not valid UTF-8 — for example, a multi-byte character cut mid-encoding. Non-UTF-8 in the unterminated tail is ignored, not an error:

let mut bytes = fs::read(&path).unwrap();
bytes.extend_from_slice(&[0xE2, 0x82]); // first two bytes of '€' (3-byte char), truncated
fs::write(&path, &bytes).unwrap();
let replay = read_log(&path).unwrap();
assert_eq!(replay.records.len(), 1);
assert!(replay.truncated_tail);

The distinction matters: a complete (newline-terminated) record containing invalid UTF-8 is genuine corruption (LogError::Corruption: "a complete record contains invalid UTF-8"), whereas invalid UTF-8 only in the unterminated tail is a recoverable torn tail. The default open path (read_log) is fail-stop on any complete corrupt record — a CRC mismatch, bad UTF-8 in a complete record, a malformed record, or a non-contiguous LSN — so corruption is never silently dropped.
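The split at the last newline, and the different treatment of invalid UTF-8 in complete records versus the tail, can be sketched as follows. The function names are hypothetical and the sketch ignores CRC and LSN checks; it only decodes lines.

```rust
/// Splits raw log bytes into the durable region (up to and including the
/// last newline) and a flag saying whether a torn tail was dropped.
fn split_durable(bytes: &[u8]) -> (&[u8], bool) {
    match bytes.iter().rposition(|&b| b == b'\n') {
        Some(i) => (&bytes[..=i], i + 1 < bytes.len()),
        None => (&bytes[..0], !bytes.is_empty()),
    }
}

/// Decodes the complete lines. Invalid UTF-8 inside a complete line is an
/// error; anything in the unterminated tail is never decoded at all.
fn complete_lines(bytes: &[u8]) -> Result<(Vec<&str>, bool), String> {
    let (durable, truncated_tail) = split_durable(bytes);
    let mut lines = Vec::new();
    for raw in durable.split_inclusive(|&b| b == b'\n') {
        let body = &raw[..raw.len() - 1]; // drop the terminating newline
        let text = std::str::from_utf8(body)
            .map_err(|_| "a complete record contains invalid UTF-8".to_string())?;
        lines.push(text);
    }
    Ok((lines, truncated_tail))
}
```

Because the tail is cut off before any decoding happens, a multi-byte character truncated mid-encoding can never surface as a UTF-8 error.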

The write-poison guard

If any write to the log fails, the LogWriter sets a poisoned flag and refuses all further appends and syncs. A failed flush/fsync poisons it too (a partial flush may have reached disk).

This prevents record splicing: without the guard, the next record’s bytes could be appended onto a torn, unterminated fragment, together forming a complete line whose CRC no longer matches its content — which would make the whole log fail-stop unreadable. By refusing to write after a failure, the torn bytes stay an unterminated tail (recoverable) rather than becoming a complete-but-invalid line. Recovery is to reopen the store and replay the intact prefix:

assert!(w.append(1, sample_insert(1)).is_err());
assert!(w.poisoned);
// a second append is refused without writing anything more — no splice
let replay = read_log(&path).unwrap();
assert!(replay.records.is_empty());
assert!(replay.truncated_tail);
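A poison guard of this kind can be sketched around any `std::io::Write` sink. `GuardedWriter` and the `TornSink` test double are hypothetical; OMGDB's LogWriter does more (framing, CRC, fsync), but the refusal logic would look roughly like this.

```rust
use std::io::{self, Write};

/// Hypothetical wrapper that refuses all writes after the first failure.
struct GuardedWriter<W: Write> {
    inner: W,
    poisoned: bool,
}

fn write_line<W: Write>(inner: &mut W, line: &[u8]) -> io::Result<()> {
    inner.write_all(line)?;
    inner.write_all(b"\n")?;
    inner.flush()
}

impl<W: Write> GuardedWriter<W> {
    fn new(inner: W) -> Self {
        GuardedWriter { inner, poisoned: false }
    }

    fn append(&mut self, line: &[u8]) -> io::Result<()> {
        if self.poisoned {
            // Refuse without touching the sink, so nothing can be spliced
            // onto a torn fragment left by the earlier failure.
            return Err(io::Error::other("log writer is poisoned; reopen the store"));
        }
        let result = write_line(&mut self.inner, line);
        if result.is_err() {
            self.poisoned = true; // a write or flush failure poisons the writer
        }
        result
    }
}

/// Test double: accepts `budget` bytes, then fails, leaving a torn fragment.
struct TornSink {
    written: Vec<u8>,
    budget: usize,
}

impl Write for TornSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.budget == 0 {
            return Err(io::Error::other("disk full"));
        }
        let n = buf.len().min(self.budget);
        self.written.extend_from_slice(&buf[..n]);
        self.budget -= n;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Even if the sink recovers later, the writer stays poisoned, so the torn bytes remain an unterminated tail that the next open can drop.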

Crash before commit is discarded

Replay applies transaction semantics: a begin opens a pending buffer keyed by txn id, tagged ops accumulate in it, and a matching commit applies the buffer. A begin with no matching commit — exactly what a crash mid-transaction leaves on disk — is dropped entirely, so a partial transaction never partially applies:

let d = doc(&[("k", Value::I64(1))]);
let mut w = oplog::LogWriter::open(&log_path, 0).unwrap();
w.append_txn(0, Some(1), oplog::Op::Begin).unwrap();
w.append_txn(0, Some(1), oplog::Op::Insert { ns: "c".into(), id: Value::I64(1), doc: d }).unwrap();
w.sync().unwrap();
// ... simulated crash here: the begin and insert are durable, but no commit record was written
let s = Store::open(&dir).unwrap();
assert_eq!(s.count("c"), 0);

A crash-truncation test confirms this for every byte-prefix of a reference log: truncating at any offset always opens cleanly, always passes the integrity check, and yields a document count that never decreases as the prefix grows and never exceeds the full count. This shows that begin/op/commit groups apply atomically: a group cut short by truncation contributes nothing.
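The replay rule can be sketched with a pending buffer per transaction id. The `Op` and `Record` types are hypothetical and cover a single collection; in this sketch a tagged op with no preceding begin is simply ignored, which may differ from how the real reader treats it.

```rust
use std::collections::{BTreeMap, HashMap};

#[derive(Debug, Clone)]
enum Op {
    Begin,
    Insert { id: u64, doc: String },
    Delete { id: u64 },
    Commit,
    Abort,
}

struct Record {
    txn: Option<u64>,
    op: Op,
}

fn apply(state: &mut BTreeMap<u64, String>, op: &Op) {
    match op {
        Op::Insert { id, doc } => {
            state.insert(*id, doc.clone());
        }
        Op::Delete { id } => {
            state.remove(id);
        }
        Op::Begin | Op::Commit | Op::Abort => {}
    }
}

/// Folds records into state. Transactional ops are buffered per txn id and
/// only applied when the matching commit is seen.
fn replay(records: &[Record]) -> BTreeMap<u64, String> {
    let mut state = BTreeMap::new();
    let mut pending: HashMap<u64, Vec<Op>> = HashMap::new();
    for rec in records {
        match (&rec.op, rec.txn) {
            (Op::Begin, Some(t)) => {
                pending.insert(t, Vec::new());
            }
            (Op::Commit, Some(t)) => {
                if let Some(ops) = pending.remove(&t) {
                    for op in &ops {
                        apply(&mut state, op);
                    }
                }
            }
            (Op::Abort, Some(t)) => {
                pending.remove(&t); // discard the whole group
            }
            (op, Some(t)) => {
                if let Some(buf) = pending.get_mut(&t) {
                    buf.push(op.clone());
                }
            }
            (op, None) => apply(&mut state, op), // autocommitted record
        }
    }
    // Anything still in `pending` is a begin with no commit: dropped.
    state
}
```

Replaying every record-prefix of a log with this function gives a document count that only changes at commit boundaries, which is the property the crash-truncation test checks at byte granularity.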

Compaction

Over time the log accumulates superseded inserts, tombstones, and committed-transaction framing. omgdb compact rewrites the log to its minimal canonical form:

one define record per collection spec (in collection order),
one create_index record per index (keys sorted),
one insert record per live document (in _id order),

discarding all superseded inserts, deletes, and aborted/dangling transactions. The logical state is unchanged — a fresh replay yields exactly the same data.

omgdb compact app.omgdb
# compacted: 412 -> 137 records
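The core of the rewrite, folding a history down to one insert per live document in `_id` order, can be sketched like this. It is a simplification under stated assumptions: a single collection, hypothetical `Op` variants, and no define or create_index records.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
enum Op {
    Insert { id: u64, doc: String },
    Replace { id: u64, doc: String },
    Delete { id: u64 },
}

/// Folds a log into its minimal form: one insert per live document,
/// emitted in _id order (the BTreeMap keeps keys sorted).
fn compact(log: &[Op]) -> Vec<Op> {
    let mut live: BTreeMap<u64, String> = BTreeMap::new();
    for op in log {
        match op {
            Op::Insert { id, doc } | Op::Replace { id, doc } => {
                live.insert(*id, doc.clone()); // later writes supersede earlier ones
            }
            Op::Delete { id } => {
                live.remove(id); // tombstones vanish along with their target
            }
        }
    }
    live.into_iter().map(|(id, doc)| Op::Insert { id, doc }).collect()
}
```

Compacting an already compacted log returns it unchanged, which is one way to check that the output really is canonical.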

Compaction is crash-safe and read-back verified. The new log is written to oplog.ndjson.compacting and fsynced, then re-read with full CRC verification; the read-back rejects any truncated tail or record-count mismatch. Only then is the temp file atomically renamed over oplog.ndjson and the directory fsynced. The original stays intact until the rename, so a rename failure is recoverable by reopening the original. An orphaned .compacting temp left by a crash mid-compaction is removed on the next open.

// assumes `s` already holds two live documents in "c"
let gone = s.insert_one("c", doc(&[("k", Value::I64(3))]))?;
s.delete_by_id("c", &gone)?;
let before = s.integrity_check()?.records;
let report = s.compact()?;
assert_eq!(report.records_after, 2, "two surviving inserts");
assert!(report.records_after < before);
assert!(s.integrity_check()?.consistent);
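The write-temp, fsync, read-back, rename sequence can be sketched with the standard library. The function name is hypothetical, the read-back here only checks for a torn tail and the record count (the real check also verifies CRCs), and lines are assumed to contain no embedded newlines.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

#[cfg(unix)]
fn sync_dir(dir: &Path) -> io::Result<()> {
    File::open(dir)?.sync_all() // makes the rename's directory entry durable
}

#[cfg(not(unix))]
fn sync_dir(_dir: &Path) -> io::Result<()> {
    Ok(()) // no-op off POSIX in this sketch
}

/// Writes `lines` to `<log>.compacting`, fsyncs, verifies by reading back,
/// and only then renames the temp file over the log.
fn replace_log_atomically(log: &Path, lines: &[String]) -> io::Result<()> {
    let mut name = log.as_os_str().to_owned();
    name.push(".compacting");
    let tmp = PathBuf::from(name);

    let mut f = File::create(&tmp)?;
    for line in lines {
        f.write_all(line.as_bytes())?;
        f.write_all(b"\n")?;
    }
    f.sync_all()?;
    drop(f);

    // Read-back check: no unterminated tail and the expected record count.
    let back = fs::read(&tmp)?;
    let complete = back.is_empty() || back.ends_with(b"\n");
    let count = back.iter().filter(|&&b| b == b'\n').count();
    if !complete || count != lines.len() {
        fs::remove_file(&tmp)?;
        return Err(io::Error::other("read-back verification failed"));
    }

    // The original log is untouched until this rename succeeds.
    fs::rename(&tmp, log)?;
    let dir = match log.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    sync_dir(dir)
}
```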

Integrity verification

omgdb verify re-reads the on-disk log (verifying every record’s CRC), folds it from scratch, and checks that the rebuilt data, validation catalog, and secondary indexes equal the live in-memory state. It is the runtime proof that the log reproduces the state.

omgdb verify app.omgdb
# OK: 137 record(s), 134 document(s) in 3 collection(s); log reproduces state

On success it prints OK: <records> record(s), <documents> document(s) in <collections> collection(s); log reproduces state. If a torn trailing record was skipped on open, a WARN: line is printed to stderr but the result is still consistent. If replay does not reproduce the live state, verify fails with INCONSISTENT: replaying the log does not reproduce the live state.

To keep live state byte-for-byte equivalent to a fresh replay (including a replay of a compacted log), the in-memory state is normalized: empty collections are dropped from the data map and empty index value-buckets are removed, exactly as a compacted-log replay would produce them.
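Why normalization matters can be seen in a small sketch. The `Collections` alias and both functions are hypothetical; the point is only that a live state which keeps an empty collection around would compare unequal to a replay that never saw one.

```rust
use std::collections::BTreeMap;

// Hypothetical shape: collection name -> (_id -> document).
type Collections = BTreeMap<String, BTreeMap<u64, String>>;

/// Drops collections that hold no documents, as a compacted-log replay would.
fn normalize(state: &mut Collections) {
    state.retain(|_, docs| !docs.is_empty());
}

/// Deletes a document and re-normalizes so live state stays replay-equivalent.
fn delete(state: &mut Collections, coll: &str, id: u64) -> bool {
    let removed = state
        .get_mut(coll)
        .map_or(false, |docs| docs.remove(&id).is_some());
    normalize(state);
    removed
}
```

With this in place, inserting and then deleting the only document in a collection leaves a state equal to that of a store that never had the collection.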

Tip: verify opens the store, so it must not be run against a store another process has open. See introspection for describe and dump, the other self-inspection commands.

Repair

A corrupt log cannot be opened as a store at all (open is fail-stop). omgdb repair operates on the raw log file directly using the lenient recovery reader, which stops at the first corrupt record and reports the intact prefix.

By default repair is a dry-run report — it modifies nothing:

omgdb repair app.omgdb
# CORRUPT at byte 8421 of 9013: CRC mismatch
# recoverable prefix: 137 record(s)
# re-run with `--truncate --yes` to drop the corrupt tail (a backup is kept)

If the log is already intact it prints OK: log is intact (<n> record(s)); nothing to repair.

To actually recover, pass both --truncate and --yes. The --yes flag confirms the destructive change; --truncate without it bails with refusing to modify the log without `--yes`.

Flag	Effect
(none)	Report the corruption byte offset, reason, and recoverable record count. Modifies nothing.
`--truncate`	Intent to truncate to the recoverable prefix. Requires `--yes` or it refuses.
`--truncate --yes`	Back up the original to `oplog.ndjson.corrupt.bak`, then truncate the log to the recoverable prefix.

omgdb repair app.omgdb --truncate --yes
# repaired: kept 137 record(s); corrupt original backed up to app.omgdb/oplog.ndjson.corrupt.bak

Recovery first preserves the full original as oplog.ndjson.corrupt.bak, then writes the recoverable prefix to a temp file, fsyncs it, and atomically renames it over the log.

Limitation: Run repair only when the store is closed (no process holds the lock). It rewrites the raw log file directly and discards every record after the first defect — review the dry-run report before passing --truncate --yes.
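A lenient scan that stops at the first defect and reports the intact prefix can be sketched as below. Everything about the framing is an assumption made for illustration: the `<2 hex digits> <payload>` line format and the byte-sum checksum are toys, not OMGDB's record format or its CRC.

```rust
/// Toy checksum standing in for the real CRC: wrapping sum of payload bytes.
fn toy_sum(payload: &[u8]) -> u8 {
    payload.iter().fold(0u8, |a, &b| a.wrapping_add(b))
}

/// Hypothetical framing: `<2 hex digits> <payload>\n`.
fn frame(payload: &str) -> String {
    format!("{:02x} {}\n", toy_sum(payload.as_bytes()), payload)
}

fn check_line(line: &[u8]) -> Result<(), &'static str> {
    let text = std::str::from_utf8(line).map_err(|_| "invalid UTF-8")?;
    let (sum, payload) = text.split_once(' ').ok_or("malformed record")?;
    let expected = u8::from_str_radix(sum, 16).map_err(|_| "malformed record")?;
    if expected != toy_sum(payload.as_bytes()) {
        return Err("checksum mismatch");
    }
    Ok(())
}

#[derive(Debug, PartialEq)]
struct Scan {
    records: usize,                        // intact records before the defect
    prefix_len: usize,                     // byte length of the intact prefix
    defect: Option<(usize, &'static str)>, // (byte offset, reason)
    torn_tail: bool,
}

/// Walks complete lines from the start and stops at the first bad one.
fn scan(bytes: &[u8]) -> Scan {
    let mut records = 0;
    let mut offset = 0;
    while offset < bytes.len() {
        let Some(rel) = bytes[offset..].iter().position(|&b| b == b'\n') else {
            // Unterminated tail: torn, not corrupt.
            return Scan { records, prefix_len: offset, defect: None, torn_tail: true };
        };
        if let Err(reason) = check_line(&bytes[offset..offset + rel]) {
            return Scan { records, prefix_len: offset, defect: Some((offset, reason)), torn_tail: false };
        }
        records += 1;
        offset += rel + 1;
    }
    Scan { records, prefix_len: offset, defect: None, torn_tail: false }
}
```

Truncating the bytes to `prefix_len` and scanning again reports no defect, which mirrors what `--truncate --yes` leaves behind. Note that every record after the defect is lost, even if it would have been valid on its own.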

The invariants

The transaction and durability machinery upholds three invariants OMGDB tests against:

Invariant	Statement
I1 — text completeness	The op-log text fully determines the logical state; replaying it on open reconstructs the entire state, and each complete record is self-describing canonical JSON plus a CRC.
I2 — rebuild equivalence	Reopening the store (or compacting and reopening) reproduces the exact same logical state. `integrity_check` / `omgdb verify` asserts this at runtime, and the crash-truncation matrix confirms it for every byte-prefix of a real log.
I3 — export stability	The log is a stable, rebuildable artifact: compaction rewrites it to a canonical minimal form (defines, then sorted `create_index`, then inserts in `_id` order) whose replay yields identical state.

For the on-disk format these invariants rest on — record framing, the CRC, LSN density, and the directory layout — see storage.