LSM on object storage
NamiDB uses an LSM tree (log-structured merge) layered on top of an object store. The classic LSM building blocks — memtable, WAL, immutable SSTs, leveled compaction — all map naturally onto the object storage model.
The layers
┌──────────────────────────────────────────────┐│ Memtable │ in RAM, write side│ (concurrent skip-list, latest per key) │├──────────────────────────────────────────────┤│ WAL │ in the bucket│ (append-only segments, durable on commit) │├──────────────────────────────────────────────┤│ L0 SSTs ← memtable flushes │ in the bucket│ L1 SSTs ← compaction of L0 │ in the bucket│ L2 SSTs ← compaction of L1 │ in the bucket│ ... │└──────────────────────────────────────────────┘Write path
- Cypher mutation parses, plans, and executes against the current snapshot.
- WAL append — every record is appended to the current WAL segment
and
flushed before the call returns. This is what gives you durability oncommit_batch. - Memtable insert — the record lands in the in-RAM memtable so subsequent reads see it.
- Manifest CAS — the LSN watermark advances. A new manifest version is published.
When the memtable grows past a threshold (or the flush interval ticks), it becomes an L0 SST — a Parquet (nodes) or custom-format (edges) file, uploaded to the bucket, then referenced from the next manifest version. The WAL segments that contributed to that flush become eligible for truncation.
Read path
- Open a snapshot at the current manifest version.
- Plan the query (cost-based optimizer, predicate pushdown, factorization).
- For each node-pattern: probe memtable, then L0, then L1, …, merging the visible versions per LSN.
- For each edge-pattern: walk the CSR adjacency (in-memory cache keyed by SST id), reading edge property streams on demand.
- Apply filters, projections, joins. Return rows.
Compaction
Compaction is leveled:
- L0 SSTs may overlap on key range.
- L1+ SSTs have non-overlapping key ranges.
- When L_n gets too many files, NamiDB merges them into L_n+1, dropping tombstones and superseded versions.
Compaction outputs are uploaded as new objects, then atomically swapped into the manifest via CAS. The old files stay readable until no snapshot references them — then GC reclaims them.
What’s different from a disk-resident LSM
- No fsync on the file — the durability primitive is “the PUT returned 200 OK”. The object store handles physical durability.
- No write amplification on small WAL appends — each WAL append is one PUT (or a streaming upload for large batches).
- Reads are GETs, not file-system reads. Ranged reads with
Range:headers let us pull only the row groups, only the property columns, only the bloom filter pages we need (RFC-003).