Skip to content

LSM on object storage

NamiDB uses an LSM tree (log-structured merge) layered on top of an object store. The classic LSM building blocks — memtable, WAL, immutable SSTs, leveled compaction — all map naturally onto the object storage model.

The layers

┌──────────────────────────────────────────────┐
│ Memtable │ in RAM, write side
│ (concurrent skip-list, latest per key) │
├──────────────────────────────────────────────┤
│ WAL │ in the bucket
│ (append-only segments, durable on commit) │
├──────────────────────────────────────────────┤
│ L0 SSTs ← memtable flushes │ in the bucket
│ L1 SSTs ← compaction of L0 │ in the bucket
│ L2 SSTs ← compaction of L1 │ in the bucket
│ ... │
└──────────────────────────────────────────────┘

Write path

  1. Cypher mutation parses, plans, and executes against the current snapshot.
  2. WAL append — every record is appended to the current WAL segment and flushed before the call returns. This is what gives you durability on commit_batch.
  3. Memtable insert — the record lands in the in-RAM memtable so subsequent reads see it.
  4. Manifest CAS — the LSN watermark advances. A new manifest version is published.

When the memtable grows past a threshold (or the flush interval ticks), it becomes an L0 SST — a Parquet (nodes) or custom-format (edges) file, uploaded to the bucket, then referenced from the next manifest version. The WAL segments that contributed to that flush become eligible for truncation.

Read path

  1. Open a snapshot at the current manifest version.
  2. Plan the query (cost-based optimizer, predicate pushdown, factorization).
  3. For each node-pattern: probe memtable, then L0, then L1, …, merging the visible versions per LSN.
  4. For each edge-pattern: walk the CSR adjacency (in-memory cache keyed by SST id), reading edge property streams on demand.
  5. Apply filters, projections, joins. Return rows.

Compaction

Compaction is leveled:

  • L0 SSTs may overlap on key range.
  • L1+ SSTs have non-overlapping key ranges.
  • When L_n gets too many files, NamiDB merges them into L_n+1, dropping tombstones and superseded versions.

Compaction outputs are uploaded as new objects, then atomically swapped into the manifest via CAS. The old files stay readable until no snapshot references them — then GC reclaims them.

What’s different from a disk-resident LSM

  • No fsync on the file — the durability primitive is “the PUT returned 200 OK”. The object store handles physical durability.
  • No write amplification on small WAL appends — each WAL append is one PUT (or a streaming upload for large batches).
  • Reads are GETs, not file-system reads. Ranged reads with Range: headers let us pull only the row groups, only the property columns, only the bloom filter pages we need (RFC-003).

See also