LSM on object storage

NamiDB uses an LSM tree (log-structured merge) layered on top of an object store. The classic LSM building blocks — memtable, WAL, immutable SSTs, leveled compaction — all map naturally onto the object storage model.

The layers

┌──────────────────────────────────────────────┐
│  Memtable                                    │  in RAM, write side
│  (concurrent skip-list, latest per key)      │
├──────────────────────────────────────────────┤
│  WAL                                         │  in the bucket
│  (append-only segments, durable on commit)   │
├──────────────────────────────────────────────┤
│  L0 SSTs   ← memtable flushes                │  in the bucket
│  L1 SSTs   ← compaction of L0                │  in the bucket
│  L2 SSTs   ← compaction of L1                │  in the bucket
│  ...                                         │
└──────────────────────────────────────────────┘

Write path

Cypher mutation parses, plans, and executes against the current snapshot.
WAL append — every record is appended to the current WAL segment and flushed before the call returns. This is what gives you durability on commit_batch.
Memtable insert — the record lands in the in-RAM memtable so subsequent reads see it.
Manifest CAS — the LSN watermark advances. A new manifest version is published.

When the memtable grows past a threshold (or the flush interval ticks), it becomes an L0 SST — a Parquet (nodes) or custom-format (edges) file, uploaded to the bucket, then referenced from the next manifest version. The WAL segments that contributed to that flush become eligible for truncation.

Read path

Open a snapshot at the current manifest version.
Plan the query (cost-based optimizer, predicate pushdown, factorization).
For each node-pattern: probe memtable, then L0, then L1, …, merging the visible versions per LSN.
For each edge-pattern: walk the CSR adjacency (in-memory cache keyed by SST id), reading edge property streams on demand.
Apply filters, projections, joins. Return rows.

Compaction

Compaction is leveled:

L0 SSTs may overlap on key range.
L1+ SSTs have non-overlapping key ranges.
When L_n gets too many files, NamiDB merges them into L_n+1, dropping tombstones and superseded versions.

Compaction outputs are uploaded as new objects, then atomically swapped into the manifest via CAS. The old files stay readable until no snapshot references them — then GC reclaims them.

What’s different from a disk-resident LSM

No fsync on the file — the durability primitive is “the PUT returned 200 OK”. The object store handles physical durability.
No write amplification on small WAL appends — each WAL append is one PUT (or a streaming upload for large batches).
Reads are GETs, not file-system reads. Ranged reads with Range: headers let us pull only the row groups, only the property columns, only the bloom filter pages we need (RFC-003).