.to_arrow()` | per def | One column per `PropertyDef` declared in the schema at write time. | | `__overflow_json` | `Utf8` | **yes** | Mandatory column. Stores undeclared properties as a JSON object string; `null` when there are none. | | `__schema_version` | `UInt64` | no | Snapshot of the manifest’s schema version at flush time. Lets the reader pin its decode rules. | The `__overflow_json` column is **always present** in the Parquet schema (even when every row is null), so a reader can rely on it being there unconditionally without consulting the manifest first. This closes the revision-1 open question about open-schema ingest: undeclared properties flow through ingest → memtable → SST → reader without any silent drop, and the SDK layer (Python / TS) reconstructs them on read into the caller’s native map type. `__schema_version` is the manifest version the writer used to map property names → columns. Two SSTs written under different schema versions can co-exist; the reader chooses how to reconcile them (typically: newer schema\_version wins; older SST’s columns are mapped back to their names via the manifest of the version that produced them — handled by RFC-004). **Reserved column names.** Every declared property `p` is materialised as the Parquet column **`prop_`** (i.e. the `prop_` prefix is part of the on-disk column name, not an editorial shorthand). Names that would collide with the engine-managed columns — `node_id`, `tombstone`, `lsn`, `__overflow_json`, `__schema_version` — are reserved. The writer **rejects** any `PropertyDef` whose `name`: * starts with the prefix `prop_` (would double-prefix on disk), * starts with the prefix `__` (engine-private namespace), or * equals one of `{node_id, tombstone, lsn}`. Enforcement happens at schema-declaration time in `namidb-core::schema::PropertyDef::new` *and* is asserted again at flush time with `Error::SchemaConflict`. A reader that observes a column whose name violates the namespace rules treats the SST as corrupted. #### 2.2 Sort order Rows inside a node SST are sorted by **`(node_id ascending)`**. The memtable already gives us this order — `MemKey::Node { label, id }` sorts lexicographically by `id` inside a single label scope. The Parquet writer asserts this invariant; out-of-order rows are a writer bug and abort the flush. This sort order matters because: * The read path’s merge needs an ordered stream per SST to do a k-way merge of `(SST_0, …, SST_n, frozen_memtable, live_memtable)` without buffering everything in memory. * The Parquet page index gives O(log n) lookup for a target `node_id` using just min/max of each page, which is the basis of the warm point-lookup path. Within an SST a `node_id` appears at most once: the memtable already collapses repeated upserts of the same key. #### 2.3 Encodings and compression | Setting | Value | Rationale | | ------------------------ | ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Compression | `Zstd` level `6` | Sweet spot per Parquet benchmarks: \~25–35 % ratio on string-heavy graph data, \~250 MB/s decode. | | Dictionary encoding | enabled (per column) | Mandatory for low-cardinality `Utf8` (e.g. country, status). `parquet-rs` falls back automatically when the dictionary stops paying. | | Row group size | **128 K rows** (target) | Page-index granularity per §7.5 of the plan. **Assumed average row size 256 B → 32 MiB row groups warm.** Rotation algorithm: writer closes a row group as soon as **either** 128 K rows have been buffered **or** the in-memory uncompressed byte size of the buffered rows would exceed 256 MiB (whichever fires first). Short flushes that would produce a row group smaller than 16 MiB are merged with the previous group of the same SST when possible (policy in RFC-003). | | Data page size | `1 MiB` | Largest page object\_store will fetch with a single ranged GET. | | Write batch size | `8192` rows | Standard Arrow batch sentinel. | | Statistics | column min/max **on** | Per-page and per-row-group; used by reader for predicate pushdown. | | Page index | **on** | Required by `parquet-rs 55` to enable per-page min/max-driven row-group pruning. | | Bloom (Parquet built-in) | **disabled** | We bring our own side-car bloom over `node_id`. Parquet’s bloom adds bytes for no win here. | | Page checksums | enabled | Cheap, defends against torn S3 reads. | These are defaults; the writer accepts a `NodeSstWriterOptions` to override compression level (e.g. `Zstd-18` for cold archives), so a future compaction worker can re-compress L≥2 SSTs without changing the file format. **NaN / `±Inf` handling.** `f32` / `f64` columns may contain `NaN`, `+Inf`, and `-Inf` freely — the raw bytes always land in the page data, the row is **never rejected** by the writer. The contract is only at the stats layer: a column whose page contains any `NaN` or `±Inf` produces `min = None` and `max = None` in `PropertyColumnStats`. Predicate pushdown gracefully falls back to per-row evaluation when min/max is unavailable. The page’s `null_count` remains accurate. #### 2.4 Tombstones Deletes are stored as rows with `tombstone = true` and **null** property columns. The reader treats a tombstone as “this `node_id` does not exist at `>= lsn`” — overriding any older SST that contains the same `node_id`. Tombstones live until they are absorbed by a compaction that proves no older SST or WAL segment still references the key (full-tree compaction; will be defined in RFC-005 compaction policy). We **do not** use Parquet’s “delete files” (Iceberg-style positional deletes). They require a separate index and a second manifest hop; we get the same semantics by treating `tombstone` as a regular column. **Nullable invariant.** Tombstone rows carry `null` in every declared property column. To make that representable regardless of the `PropertyDef.nullable` flag, the SST-level Arrow field for every declared property is **always nullable** in the Parquet schema, even when the schema declares `nullable = false`. Non-null contracts on declared properties are an **ingest-time** invariant (enforced before a row reaches the memtable), not an SST-time invariant. The writer implementation rejects a non-tombstone row with a null value in a `nullable = false` column at the ingest API boundary; once the row is in the SST, only the `tombstone` flag determines whether the column is “validly null”. #### 2.5 Footer + statistics extraction When the writer closes a Parquet file it emits the standard Parquet footer plus a sidecar `NodeSstStats` struct that goes into the manifest’s `SstDescriptor` (see §4 below):
```rust
pub struct NodeSstStats {
pub row_count: u64,
pub tombstone_count: u64,
pub min_node_id: [u8; 16],
pub max_node_id: [u8; 16],
pub min_lsn: u64,
pub max_lsn: u64,
pub property_stats: Vec,
pub schema_version_min: u64,
pub schema_version_max: u64,
}
```
`PropertyColumnStats` carries `null_count`, `min`, `max` for ordered types, and `ndv_estimate` (HyperLogLog++ sketch, 1 KiB) for cardinality hints. These are read from the Parquet footer column stats — no extra scan required. The bloom filter is **not** part of this struct (it goes to the side-car file; see §4.2). ### 3. Edge SST — CSR binary format #### 3.1 Why custom A property graph SST for edges has structure that Parquet does not natively express well: * The natural unit of read is “give me all neighbours of node `s` reached by `edge_type`”, which is a **variable-length list** keyed by `s`. * We want **O(1)** access from `src_id` to the offset of its neighbour list; Parquet’s repetition-level / definition-level list encoding gives us O(row\_group) at best. * Edge property values are co-located with the neighbour they describe. Co-locating them in independent Parquet files would lose the joint ordering invariant we rely on to make a single ranged GET hot. A custom format also lets us implement two specific NamiDB optimisations (open from PDF gap #9 / RFC-001 §“contribución propia”): * **Power-law-aware encoding.** High-degree source nodes get a separate block layout (large delta, dense bitmap of neighbours when degree exceeds a threshold). Long-tail sources use the default split-top64/bottom64 encoding. * **Edge-direction packing.** The same edge data is stored twice per edge type per flush: a **forward** SST sorted by `src_id`, and an **inverse** SST sorted by `dst_id`. Both use the same wire format, distinguished by `flags.INVERSE_PARTNER`. The inverse SST stores the *pair* `(dst, src)` in its neighbour positions, so reading “all in-edges of `v`” is the same code path as reading “all out-edges of `v`”. The choice of “always write inverse partner at flush time” (vs “only at compaction”) trades **2× write amplification** for **bounded in-edge-query latency on freshly-flushed data**. Without inverse partners, a `MATCH (n)-[:KNOWS]->(:Person {name: 'Bob'})` query that hits L0 SSTs has to scan every neighbour list — for a 10 M-edge graph that is single-digit seconds, blowing the budget. Bench data during may later motivate making inverse generation optional per edge type, but the default in v1 is “always”. #### 3.2 File layout All multi-byte integers are **little-endian**. All offsets are absolute byte offsets from the start of the file unless stated otherwise.
```text
┌─────────────────────────────────────┐ offset 0
│ File header (64 bytes, FROZEN) │
├─────────────────────────────────────┤
│ Section: key_ids │ kind = 0x0001
│ sorted UUIDv7s (16 B each) │
│ "src_ids" in fwd / "dst_ids" inv │
├─────────────────────────────────────┤
│ Section: offsets │ kind = 0x0002
│ one entry per key_id + 1 sentinel│
│ bitpacked u24/u32/u40/u48 │
├─────────────────────────────────────┤
│ Section: partners │ kind = 0x0003
│ split or dense per-group blocks │
│ "neighbours (dst)" in fwd │
│ "neighbours (src)" in inv │
├─────────────────────────────────────┤
│ Section: per_edge_lsn │ kind = 0x0004
│ u64 LE, one per edge in order │
├─────────────────────────────────────┤
│ Section: per_edge_tombstones (opt) │ kind = 0x0005
│ bitmap, 1 bit per edge │
│ (omitted when HAS_TOMBSTONES = 0)│
├─────────────────────────────────────┤
│ Section: fence_index (opt) │ kind = 0x0006
│ sparse index over key_ids │
│ (present when key_count > 65 536)│
├─────────────────────────────────────┤
│ Section: property_stream × N (opt) │ kind = 0x0100 (each)
│ one section per declared prop + │
│ `__overflow_json` if any present │
│ Zstd-compressed Arrow IPC chunk │
├─────────────────────────────────────┤
│ Footer (variable size) │
│ section table + body fields + │
│ 20-byte trailer (xxhash + len + │
│ magic) │
└─────────────────────────────────────┘ offset = file_size
```
Section **order on disk is not normative** — the footer’s section table determines the true byte ranges. The diagram above shows the typical layout the writer produces in v1.0. Each section is **independently addressable**: the footer carries a `Section` table mapping `SectionKind → (offset, length, xxhash3, codec)`. A reader that needs only `(key_ids, offsets, partners)` for a label scan fetches exactly those three ranged GETs and ignores the property streams. ##### 3.2.1 File header (64 bytes, frozen) The 64-byte header is **frozen for the lifetime of `format_major`**. Adding any field here requires a major bump. Forward-compatible extension happens only through new footer sections (see §5.2).
```text
offset size field value
─────── ──── ──────────────────────────── ────────────────────────────
0 8 magic b"TGEDGE\0\0"
8 1 format_major u8, current = 1
9 1 format_minor u8, current = 0
10 2 header_size u16 = 64 (sanity check)
12 4 flags u32 bitfield (see below)
16 16 edge_type_id blake3(edge_type)[..16]
32 16 src_label_id blake3(src_label)[..16]
48 16 dst_label_id blake3(dst_label)[..16]
```
Flags: | Bit | Name | Meaning | | ---- | ----------------- | ------------------------------------------------------------------------------------ | | 0 | `HAS_PROPERTIES` | At least one property column present in §6+. | | 1 | `HAS_TOMBSTONES` | At least one tombstone bit is `1` (cheap shortcut). | | 2 | `SKEW_BUCKETS` | Section 3 contains at least one skew-bucket group. | | 3 | `INVERSE_PARTNER` | This file is the inverse-direction CSR; §1 path uses `edges-inv-`. | | 4-31 | reserved | Must be zero in v1.0; reserved bits a v1.x reader does not recognise abort the read. | A reader **must** reject the file if `format_major > 1` or `header_size ≠ 64`. It must treat `format_minor > 0` as forward-compatible per the rules in §5.2. It must reject any non-zero reserved bit it does not understand. ##### 3.2.2 Section 1: key\_ids `key_ids` is the strictly increasing, deduplicated array of key `node_id`s present in this SST. **Semantics depend on `flags.INVERSE_PARTNER`:** * `INVERSE_PARTNER = 0` (forward): keys are `src_id`s; partners in §3.2.4 are `dst_id`s. Reads “out-edges of `s`”. * `INVERSE_PARTNER = 1` (inverse): keys are `dst_id`s; partners are `src_id`s. Reads “in-edges of `d`”. Fixed-size 16-byte UUIDv7 records. The writer asserts: every `key_id` must be `>` the previous one (`Ord` on `[u8; 16]`), no duplicates. This array is the binary-searchable handle into the offsets / partners structure. Length is `key_count`; section size is `16 * key_count`. ##### 3.2.3 Section 2: offsets `offsets[i]` is the byte offset (inside Section 3, relative to the start of Section 3) at which the partner group of `key_ids[i]` begins. `offsets[key_count]` is a sentinel == size of Section 3. Encoding is bitpacked with a width chosen by the writer at close time based on the maximum offset value: | Section 3 size | Bits per offset | Format | | ----------------- | --------------- | --------- | | < 2²⁴ B (16 MiB) | 24 | 3-byte LE | | < 2³² B (4 GiB) | 32 | `u32` LE | | < 2⁴⁰ B (1 TiB) | 40 | 5-byte LE | | < 2⁴⁸ B (256 TiB) | 48 | 6-byte LE | A fixed-width layout is preferable to varint here because the read path needs random access (`offsets[i]`) without scanning. The chosen width is recorded in the footer (`offsets_bits` field). ##### 3.2.4 Section 3: partners (neighbours / sources) Each per-key group is laid out as one of two block kinds. The writer picks the kind per group based on degree. Every block opens with a 1-byte tag and a varint `deg`. v1 defines two tags; future v1.x readers may support more.
```text
┌──────────────┬─────────┬────────────────────────────────────────────┐
│ deg: varint │ tag: u8 │ payload │
└──────────────┴─────────┴────────────────────────────────────────────┘
tag = 0x01 → split block (split-top64/bottom64 encoding)
tag = 0x10 → dense block (raw 16-byte partners)
tag = others → reject as Error::Corrupted in v1.0
```
The writer picks the block kind per key group based on the rule below; the picked kind is independent of `flags.SKEW_BUCKETS`, which is set in the header whenever **any** group of the file emitted a dense block (`tag = 0x10`). ###### Selection rule For a group of degree `d` with partners sorted ascending by the full 128-bit id, the writer computes the encoded byte cost of the split block (deterministically, using the encoding below). It then emits:
```plaintext
let split_cost = …; // see "Split block — encoding" below
let dense_cost = 16 * d; // always
if d > skew_threshold || split_cost >= dense_cost {
emit dense block (tag = 0x10)
} else {
emit split block (tag = 0x01)
}
```
`skew_threshold` is bench-driven; the v1 default is `max(1024, 4 * sqrt(key_count))`. The `split_cost >= dense_cost` clause is the “always-correct fallback”: for pathological partner distributions (spanning the full `u64` range with near-uniform deltas) the split encoding can balloon to 18 B per partner, so the dense block bounds the worst case at 16 B per partner regardless. ###### Split block — encoding (`tag = 0x01`) UUIDv7 splits cleanly into a top 64 bits (ms timestamp + 4-bit version * 12-bit sub-ms entropy) that is nearly monotonic over time, and a bottom 64 bits that is uniformly random. We exploit the top half for compression and write the bottom half raw. Payload, partners sorted ascending by the full 128-bit id:
```text
top64[0]: varint // absolute top64 of partner[0]
bot64[0]: u64 LE // raw bottom64 of partner[0]
top64_delta[j]: varint // = top64[j] - top64[j-1] (j ∈ 1..deg)
bot64[j]: u64 LE // raw bottom64 of partner[j]
```
Encoded cost in bytes: `split_cost = len_varint(top64[0]) + 8 + Σ_{j=1..deg-1} (len_varint(top64_delta[j]) + 8)`. Typical cost (partners clustered within seconds of each other, so `top64_delta <= 127`): **9 B per partner**. Cost when partners span months but were created in the same year: **13–14 B per partner**. Absolute worst case (artificial, e.g. `u64::MAX` deltas): **18 B per partner** — the writer detects this via the selection rule above and emits a dense block instead. `top64_delta[j]` may legally be `0` (two partners created in the same ms with the same 12-bit sub-ms entropy). The bottom-64 ordering breaks the tie; the writer asserts strictly increasing 128-bit partner id, so two partners with both halves equal are a writer bug. ###### Dense block — raw partners (`tag = 0x10`)
```text
┌──────────────┬─────────┬────────────────────────────────────────────┐
│ deg: varint │ tag: u8 │ partners: [u8; 16 * deg] │
└──────────────┴─────────┴────────────────────────────────────────────┘
```
Always-correct fallback. Used for super-nodes (`deg > skew_threshold`) and for any group where the split encoding would not be smaller. Future v1.x readers may support `tag = 0x11..` (e.g. Roaring on `hash(partner) mod 2³²`) and a writer that emits them; v1.0 readers reject any tag not in `{0x01, 0x10}` with `Error::Corrupted`. ##### 3.2.5 Section 4: per-edge LSN For every edge, in the same order as the partner enumeration of Section 3, one `u64` LE with the LSN at which that edge was applied. Length is `edge_count * 8`. Used for: * Conflict resolution at read time when the same `(key, partner)` pair shows up in older and newer SSTs (the newer LSN wins). * Compaction merge to filter shadowed edges. ##### 3.2.6 Section 5: per-edge tombstone bitmap `ceil(edge_count / 8)` bytes; bit *j* is the tombstone flag of edge *j* in the partner enumeration order of Section 3. A tombstone edge keeps its position in the partner array; the reader filters it out unless explicitly asked for history (branching / replay). **Section-omission rule.** If no edge in the SST is tombstoned the writer **omits** this section: it sets `flags.HAS_TOMBSTONES = 0` in the header, and the footer’s section table contains no entry of kind `per_edge_tombstones`. A v1.X reader that finds `HAS_TOMBSTONES = 0` must treat every edge as non-tombstoned without looking for the section; a reader that finds `HAS_TOMBSTONES = 1` but no entry in the section table treats the SST as corrupted. **Forward / inverse consistency invariant.** When an edge `(s, d, lsn)` is tombstoned in the writer’s frozen memtable, its corresponding entry in **both** the forward partner (key = `s`, partner = `d`) and the inverse partner (key = `d`, partner = `s`) is tombstoned at the same LSN. The writer enforces this by reading the tombstone bit from a single canonical source (the frozen memtable’s `MemOp::Tombstone`) during the construction of each partner, never from independent computations over the two transpositions. Tests #5 and #6 (§7) lock this invariant down. ##### 3.2.7 Sections 6..N: property streams One section per declared property `q` on this edge type. Each section holds a Zstd-compressed Arrow IPC chunk with a single column whose row *j* corresponds to edge *j* in the partner enumeration order. We choose Arrow IPC (not Parquet) for property streams because: * Each section already lives inside the CSR file’s footer table, so we do not need Parquet’s column metadata. * Arrow IPC’s record batch layout maps 1:1 to a column; zero-copy decode with `arrow-ipc::reader::StreamReader`. * Reusing Arrow primitives means a property column for an edge looks identical to a property column for a node — same `DataType ↔ ArrowDataType` mapping as in `namidb-core::schema`. Schema-undeclared properties on edges land in a single `__overflow_json` property stream with `name = "__overflow_json"`. Unlike node SSTs (where the `__overflow_json` *column* is always present in the Parquet schema, possibly all-null), the edge SST `__overflow_json` **section** is **only emitted when at least one edge has overflow data**. When no overflow is present, the writer omits the section and `HAS_PROPERTIES` reflects only the declared properties (or is `0` if there are none). A reader that needs overflow data and finds no `__overflow_json` section reads every edge’s overflow as `null`. ##### 3.2.8 Footer The footer is the last bytes of the file. It has a fixed-length **trailer** (always 20 bytes at the very end) and a variable-length **body** that precedes the trailer.
```text
┌──────────────────────────────────────────────────┐ ← footer body start
│ Section table: section_count × SectionEntry │
│ SectionEntry { │
│ kind: u16, // discriminator │
│ offset: u64, // from file byte 0 │
│ length: u64, // bytes │
│ codec: u8, // 0=none, 1=zstd │
│ reserved: u8, │
│ xxhash3_64: u64, // over the on-disk │
│ // bytes of the │
│ // section as stored │
│ name_len: u8, │
│ name: [u8; name_len], // utf8 │
│ } │
├──────────────────────────────────────────────────┤
│ section_count: u32 │
│ key_count: u64 │
│ edge_count: u64 │
│ offsets_bits: u8 // 24 / 32 / 40 / 48 │
│ min_key_id: [u8; 16] │
│ max_key_id: [u8; 16] │
│ min_lsn: u64 │
│ max_lsn: u64 │
│ schema_version_min: u64 │
│ schema_version_max: u64 │
├──────────────────────────────────────────────────┤ ← trailer start
│ footer_xxhash3_64: u64 (covers footer body) │
│ footer_len: u32 (body + trailer length) │
│ magic: 8 bytes b"TGEDGE\xFE\xEF" │
└──────────────────────────────────────────────────┘ ← end of file
```
Precise definitions: * **`footer_xxhash3_64`** is computed over the **footer body** only: from the first byte of the section table up to and including `schema_version_max` (i.e. all bytes between the body-start marker and the trailer-start marker above). It does *not* cover any byte of the trailer itself. * **`footer_len`** is the total byte length of footer body + trailer (i.e. the offset from the trailer’s last byte to the body’s first byte, inclusive). Equivalently: `footer_len = file_size - body_start`. A reader uses this to find the body start once it has the trailer. Section `kind` discriminators (u16): | Value | Kind | Notes | | ------ | --------------------- | ----------------------------------------------------------------------------------------------------------------------------- | | 0x0001 | key\_ids | Mandatory. | | 0x0002 | offsets | Mandatory. | | 0x0003 | partners | Mandatory. | | 0x0004 | per\_edge\_lsn | Mandatory. | | 0x0005 | per\_edge\_tombstones | Optional (see §3.2.6). | | 0x0006 | fence\_index | Optional; required when `key_count > 65 536`. See §3.2.9. | | 0x0100 | property\_stream | Optional; **one entry per property**, distinguished by `name`. Reserved names: `__overflow_json` for schema-undeclared props. | | Others | reserved | A v1.0 reader skips unknown kinds outside the reserved ranges (forward-compat per §5.2). | All property streams share the same `kind = 0x0100`; the `name` field discriminates them. The writer rejects any property declaration whose `name` collides with a reserved column name (see §2.1) at SST creation time — `__overflow_json` is the only legal entry beginning with `__`. A reader locates the footer by: 1. Ranged GET for the last 4 KiB of the object (covers any footer up to \~4 KiB; for SSTs with few sections this is enough). 2. Read the trailing 8-byte magic at the end of the response. If absent, expand to the last 64 KiB and retry. If still absent the file is corrupt. 3. From the trailer read `footer_len`. If the prefetched window is too small, issue a second ranged GET for `[file_size - footer_len, file_size)`. 4. Verify `footer_xxhash3_64` against the body bytes. Mismatch → `Error::Corrupted`. 5. Validate that every `SectionEntry`’s `[offset, offset + length)` range lies strictly within `[64, file_size - footer_len)`. Any overflow → `Error::Corrupted`. The section table is sorted ascending by `offset`. A reader can linear-scan by `kind` when looking up a specific section. ##### 3.2.9 Fence-pointer index (optional) The naive lookup of “find `s` in `key_ids`” requires either fetching the entire `key_ids` section (16 B × `key_count`) or doing a remote binary search (≈`log2(key_count)` ranged GETs over 16-byte windows). For `key_count = 1 M` the first option costs a 16 MiB cold GET; the second costs \~20 round-trips. Neither is acceptable for the §14.1 cold-query budget when SSTs grow past a few hundred thousand keys. The fence-pointer index solves this with a sparse local index over `key_ids`. The writer emits one fence entry **every `fence_stride` keys** (default `fence_stride = 256`). Each entry stores the key value and the byte offset of that key within the `key_ids` section.
```text
┌──────────────────────────────────────────────────┐
│ fence_stride: u32 (e.g. 256) │
│ entry_count: u32 (= ceil(key_count / stride)) │
│ entries: [ FenceEntry ; entry_count ] │
│ FenceEntry { │
│ key: [u8; 16], // = key_ids[i * fence_stride] │
│ key_ids_offset: u64, // = i * fence_stride * 16 │
│ // (relative to byte 0 │
│ // of section key_ids) │
│ } │
└──────────────────────────────────────────────────┘
```
Total size: `4 + 4 + entry_count * 24` bytes. For 1 M keys with stride 256 → 3 906 entries → ≈94 KiB — cacheable by foyer on first probe. **Writer rule.** A fence index is **emitted** when `key_count > 65 536`. Below this threshold the entire `key_ids` section is small enough (≤ 1 MiB) to fetch and binary-search in memory cheaply. **Reader algorithm for “find offset of key `k` inside `key_ids`”:**
```text
if footer has no fence_index section:
fetch the full key_ids section (≤ 1 MiB by construction)
binary search in memory
else:
fetch the fence_index section once (cached)
binary search the fence entries to find the bracket
[fence[i].key, fence[i+1].key) containing k
issue one ranged GET for
key_ids[fence[i].key_ids_offset .. fence[i+1].key_ids_offset]
binary search that window in memory
```
Total cold cost: **2 GETs** (fence + key\_ids window) regardless of `key_count`. Warm cost: 1 GET (the window; fence is cached). The fence index is a v1.0 optional artefact: an older reader that ignores the section still works correctly via the naive path. #### 3.3 Statistics extraction When the writer closes either partner of an edge SST it emits:
```rust
pub enum EdgeDirection {
/// Keys are src_id; partners are dst_id. File path uses `edges-fwd-`.
Forward,
/// Keys are dst_id; partners are src_id. File path uses `edges-inv-`.
Inverse,
}
pub struct EdgeSstStats {
pub direction: EdgeDirection,
pub key_count: u64,
pub edge_count: u64,
pub tombstone_count: u64,
pub min_key_id: [u8; 16],
pub max_key_id: [u8; 16],
pub min_lsn: u64,
pub max_lsn: u64,
pub degree_histogram: DegreeHistogram,
pub property_stats: Vec,
pub schema_version_min: u64,
pub schema_version_max: u64,
}
pub struct DegreeHistogram {
/// 64 log2-spaced buckets:
/// counts[i] = #keys with deg in [2^i, 2^(i+1))
pub counts: [u32; 64],
pub max_degree: u64,
pub sum_degree: u64,
}
```
For a **forward** partner, `degree_histogram` describes out-degree. For an **inverse** partner, it describes in-degree. The cost-based optimizer reads the histogram of the partner it is about to traverse. The bloom filter is **not** part of this struct (side-car; see §4.2). #### 3.4 Read access patterns and ranged GETs This section quantifies the GET count for the common access patterns of v1, to make the columnar layout’s cost explicit. Notation: * `D` = direct-cached descriptor reads (`current.json` + manifest body), amortised across queries. * `B` = bloom side-car GET (1 GET per SST when min/max does not already exclude). Cached by foyer; second visit free. * `F` = SST footer GET (last 4 KiB; cached per SST). * `Khdr` = SST header GET (first 64 B + the section table prefix; can be coalesced with `F` in one ranged GET for SSTs ≤ \~16 MiB). **Pattern A — point lookup `node_id = v`.**
```plaintext
D + (per candidate SST) [B + F + ranged GET into the matching page]
```
Cold per SST: \~3 GETs. With foyer warm, `B + F` are free; only the page GET remains (\~1 GET). **Pattern B — out-edge expansion of a known src `s` (forward SST).** The reader resolves `s → index_in_key_ids → offset_in_partners → range of partners` using the fence index (§3.2.9) when present, or the full `key_ids` section otherwise. For SSTs **without** a fence index (`key_count ≤ 65 536`, so `key_ids ≤ 1 MiB`):
```plaintext
D + B + F + GET key_ids (≤ 1 MiB)
+ GET offsets[i..i+1]
+ GET partners[off..off+len]
```
Cold per SST: **5 GETs**; the `key_ids` and `offsets` ranges coalesce into a single ranged GET when `key_count * 16 + offsets_bytes ≤ 1 MiB` (true for L0 SSTs after a single flush). Warm: `B + F + key_ids + offsets` are foyer-cached; only the `partners` GET remains (**1 GET warm**). For SSTs **with** a fence index (`key_count > 65 536`):
```plaintext
D + B + F + GET fence_index (~100 KiB)
+ GET key_ids window (≤ fence_stride * 16 ≈ 4 KiB by default)
+ GET offsets[i..i+1]
+ GET partners[off..off+len]
```
Cold per SST: **6 GETs**, all independent and parallelisable. Warm: 1 GET (partners). **Pattern C — in-edge expansion of a known dst `d` (inverse SST).** Identical to Pattern B, just with the inverse partner SST. **Pattern D — edge expansion with property predicate (e.g. `where edge.since > date`).** Pattern B + 1 additional GET on the property stream’s range corresponding to the partners we touched. Cold per SST: \~6 GETs; property stream GET coalesces with `partners` when both ranges lie in the same MiB window. The “concurrent ranged GETs” feature of `object_store::aws` lets us fire patterns B/D’s GETs in parallel; cold p50 wall time is bounded by the slowest GET, not their sum. With `S3 Express One Zone` the per-GET floor drops from \~30 ms to \~5 ms — directly on the §14.1 budget. ### 4. Embedded statistics + bloom side-car in the manifest #### 4.1 Extended `SstDescriptor` This RFC promotes `SstDescriptor` from the minimal version in RFC-001 to the form below. Everything in this struct is JSON-cheap (a few hundred bytes per SST excluding `property_stats`, which scales with column count). For 100 K SSTs the manifest stays under \~10 MiB, the budget at which we switch JSON → Arrow IPC (recorded as an Open Question in RFC-001).
```rust
pub struct SstDescriptor {
// ── identity ──
pub id: Uuid,
pub kind: SstKind, // Nodes | EdgesFwd | EdgesInv
pub scope: String, // label or edge_type
pub level: SstLevel,
pub path: String, // relative to namespace
// ── physical ──
pub size_bytes: u64,
pub row_count: u64, // node rows or edge rows
pub created_at: DateTime,
// ── key range (raw bytes; serialised as base64 in JSON) ──
pub min_key: [u8; 16], // node_id (Nodes) or key_id (Edges)
pub max_key: [u8; 16],
pub min_lsn: u64,
pub max_lsn: u64,
pub schema_version_min: u64,
pub schema_version_max: u64,
// ── stats embedded ──
pub property_stats: Vec,
pub kind_specific: KindSpecificStats,
// ── bloom side-car pointer (None when the SST is small enough
// that scanning is cheaper than probing; see §4.2) ──
pub bloom: Option,
}
pub enum SstKind {
Nodes,
EdgesFwd,
EdgesInv,
// Vectors lands in RFC-007; reserved here so reader code can match
// exhaustively against the v1 set.
}
pub enum KindSpecificStats {
Nodes { tombstone_count: u64 },
Edges {
// key_count == row_count for nodes; for edges key_count is
// distinct src/dst count (depending on direction).
key_count: u64,
tombstone_count: u64,
degree_histogram: DegreeHistogram,
},
}
```
JSON serialisation: `min_key` / `max_key` are 16-byte arrays serialised as **base64** (`base64::STANDARD`). All other fields use their natural JSON encoding. #### 4.2 BloomDescriptor (side-car pointer) The bloom filter for an SST lives in its own object next to the SST body. The manifest only carries a pointer to it plus the parameters needed to probe it without re-reading the body:
```rust
pub struct BloomDescriptor {
pub path: String, // object_store path of the side-car
pub size_bytes: u32, // total side-car file size (header + blocks + trailer)
pub key_count: u64, // number of keys inserted into the filter
pub bits_per_key: u8, // default 10 → ~1 % FPR
pub block_count: u32, // 256-bit (32-byte) blocks
pub xxhash3_64: u64, // checksum over the side-car body (per §4.2 wire spec)
}
```
We use the **split-block bloom filter (SBBF)** construction Parquet adopted — a single 64-bit hash per key drives a deterministic 8-bit mask inside one 256-bit block. There is no separate `k_hashes` parameter (the “k” is fixed at 8 by construction). The hash function is **xxHash3-64** (same library, same seed = 0 as elsewhere in this RFC). Block selection from a hash `h`:
```plaintext
let block_index = ((h >> 32) * block_count as u64) >> 32;
```
The 8-bit mask inside the chosen block is the standard SBBF mask (see Putze et al., 2010; identical to Parquet’s `bloom_filter_algorithm = SPLIT_BLOCK`). Implementations crib the constants from `parquet-rs 55::bloom_filter`. The total side-car size is exactly `28 (header) + 32 * block_count + 8 (trailer xxhash)`. For 10 bits / key the writer rounds up `block_count = ceil(key_count * bits_per_key / 256)`; e.g. 1 M keys ⇒ `block_count = 39 063` ⇒ side-car = 1 250 052 bytes ≈ 1.19 MiB. ##### Side-car wire format
```text
┌──────────────────────────────────────┐ offset 0
│ magic: 8 bytes b"TGBLOOM\0" │
│ format_major: u8 = 1 │
│ format_minor: u8 = 0 │
│ reserved: u16 = 0 │
│ bits_per_key: u8 │
│ reserved2: u8 = 0 │ // was k_hashes pre-rev3; kept
│ │ // for alignment, value MUST be 0
│ reserved3: u16 = 0 │
│ block_count: u32 │
│ key_count: u64 │
├──────────────────────────────────────┤
│ blocks: [SbbfBlock; block_count] │
│ SbbfBlock = [u8; 32] │
├──────────────────────────────────────┤
│ xxhash3_64 over the entire file │
│ minus the trailing 8 bytes: │
│ trailing: u64 LE │
└──────────────────────────────────────┘
```
Split-block bloom filters (SBBF) at 10 bits/key give \~1 % FPR — the same parameters Parquet uses internally. For 1 M keys the side-car is ≈1.25 MiB; for a typical SST of 100 K–200 K keys it is ≈125–250 KiB. **A reader probes the bloom by**: 1. (Optional) `min_key`/`max_key` overlap test — manifest-only, no GET. If no overlap, skip the SST. 2. Resolve `bloom.path` to an absolute object\_store path. 3. Issue one ranged GET for the side-car body; foyer caches it after the first probe per process. 4. Verify `xxhash3_64`. Run k hashes against the appropriate `SbbfBlock`. If absent, skip the SST. The bloom over `node_id` (for node SSTs) and over `key_id` (for edge SSTs of either direction) is therefore the gate between “manifest says maybe” and “let’s pay for the SST body GET”. For very small SSTs (`size_bytes < 256 KiB`), the writer **omits the bloom side-car** entirely — `SstDescriptor.bloom` is set to `None` and no `.bloom` object exists on object storage. A 200-key SST is faster to scan than to probe. Readers seeing `bloom = None` skip the bloom step (and skip the corresponding ranged GET) but still respect the manifest’s `min_key`/`max_key` overlap test. #### 4.3 PropertyColumnStats
```rust
pub struct PropertyColumnStats {
pub name: String,
pub null_count: u64,
pub min: Option,
pub max: Option,
pub ndv_estimate: Option, // 1 KiB HLL++; None for vectors/json
}
pub enum StatScalar {
Bool(bool),
Int32(i32),
Int64(i64),
Float32(f32), // NaN / Inf are stat-disqualifying; field is None
Float64(f64), // idem
Utf8(String),
Binary(Bytes),
Date32(i32),
TimestampMicrosUtc(i64),
}
```
Vector columns (`FloatVector { dim }`) and `Json` columns produce no `min`/`max` (they remain `None`) but still contribute a `null_count`. The `__overflow_json` column always produces `min`/`max = None`, `null_count` only — its `ndv_estimate` is also `None` (HLL over JSON documents has no operational use here). ### 5. Wire compatibility #### 5.1 Node SSTs Parquet itself carries its own version + magic (`PAR1`) and is forward-compatible across `parquet-rs` minor versions. We pin `parquet = "55"` workspace-wide. Reading SSTs written by future NamiDB builds works as long as we do not introduce new logical column conventions; if we do, we will bump a `node_sst_format` field in `SstDescriptor.kind_specific` so old readers can refuse. `__overflow_json` is required by **all** v1 node SSTs (even when every row is null). A reader that loads an SST missing this column refuses with `Error::Corrupted { detail: "node SST missing __overflow_json" }`. #### 5.2 Edge SSTs Edge SSTs are **owned** by NamiDB. The compatibility contract is: | Condition observed by reader v1.X (X ≥ 0) | Action | | --------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | | `format_major > 1` | Refuse: `Error::Corrupted`. | | `format_major < 1` | Refuse: `Error::Corrupted` (no v0 exists). | | `format_major = 1`, `header_size ≠ 64` | Refuse: `Error::Corrupted`. | | `format_major = 1`, `format_minor ≤ X` | Read normally. | | `format_major = 1`, `format_minor > X` | Read normally; skip any footer section whose `kind` is not in this reader’s v1.X table; refuse if any section table entry crosses the file end. | | Unknown reserved bit in `flags` | Refuse: an unknown flag implies an unknown invariant. | | Unknown `partners` block tag | Refuse: `Error::Corrupted`. | | Unknown footer section `kind` outside the reserved ranges | Skip the section (forward-compat). | A writer **must not** introduce a breaking change to the 64-byte header or to any existing section’s internal layout without bumping `format_major`. Adding a new footer section kind is a `format_minor` bump only. Removing a footer section kind is a major bump. #### 5.3 Side-cars The bloom side-car follows the same major / minor convention. v1.0 readers refuse any bloom side-car with `format_major > 1`. ### 6. Implementation plan (Rust crate layout) Inside `crates/namidb-storage/src/sst/`:
```text
sst/
├── mod.rs # re-exports + common types (SstId, BloomDescriptor, …)
├── stats.rs # PropertyColumnStats, DegreeHistogram, HLL sketch
├── bloom.rs # SBBF build + probe; side-car wire format
├── nodes.rs # NodeSstWriter, NodeSstReader (Parquet)
└── edges/
├── mod.rs # public API: EdgeSstWriter, EdgeSstReader, EdgeDirection
├── header.rs # 64-byte header struct + serde
├── footer.rs # section table + xxhash3 + magic (per §3.2.8)
├── writer.rs # streaming writer (forward + inverse in one pass)
├── reader.rs # ranged-GET reader; section cache; bloom integration
├── encoding.rs # bitpacked offsets, split-top64/bottom64 neighbours,
│ # selection rule (split vs dense)
├── fence_index.rs # writer + reader for the optional fence-pointer index
└── inverse.rs # in-memory transpose of a FrozenMemtable edge bucket
```
New types lift into `namidb-storage::lib.rs` as part of the public crate API for downstream crates (query engine). The `manifest` module is updated in lockstep to carry the extended `SstDescriptor`. Two new workspace dependencies are required: * `xxhash-rust` — feature `xxh3`, used for all SST + bloom checksums. * `base64` — used by the `min_key` / `max_key` JSON encoding in the manifest. ### 7. Test plan The following tests land alongside this RFC’s implementation. Test budget: bring the workspace from 36 → ≥ 70 passing tests. 1. **Round-trip property nodes.** Build a memtable of `Person` rows, freeze, write Parquet SST to `object_store::memory::InMemory`, read back, assert byte-for-byte equality of property values. 2. **Overflow round-trip.** Write a node with one declared and one undeclared property; assert the undeclared one round-trips through `__overflow_json` losslessly. 3. **Tombstone semantics.** Insert + delete + insert at increasing LSNs; read back; assert the latest LSN wins and that deleted-then-reinserted nodes are present. 4. **Edge CSR round-trip (forward).** Build a graph with 100 K edges across 10 K sources, write forward CSR, read back, assert neighbour lists equal. 5. **Edge CSR inverse partner.** Same graph; assert the inverse SST, when probed by `dst`, returns each src that originally pointed to that dst, in sorted order. 6. **Inverse partner == transposed forward.** Build a graph; write both partners; for every edge, assert it is present in both. 7. **Edge skew bucket.** Construct one super-node with degree 5 000, the rest degree ≤ 4; assert the writer emitted `tag = 0x10` for that group and `tag = 0x01` otherwise; reader returns the full list. 8. **Split-encoded compression win.** Generate 1 000 partners with all their top64 equal (same ms); assert the encoded size is `< 9 * 1000 + small_overhead` (vs 16 KiB raw). 8a. **Split-to-dense fallback.** Construct a 100-partner group whose partners are spaced so that every `top64_delta` would require ≥ 9 varint bytes; assert the writer emitted `tag = 0x10` (dense) for that group and that the bytes-on-disk for the group are exactly `1 (varint deg) + 1 (tag) + 100 * 16`. 8b. **Reserved column name rejected.** Build a `SchemaBuilder` with a `PropertyDef { name: "tombstone", … }`; assert `Error::SchemaConflict`. 8c. **Fence-pointer index round-trip.** Build an edge SST with `key_count = 200 000` (above the fence threshold); assert that the footer contains a `fence_index` section, that the reader cold-path issues exactly 2 GETs for a `src` lookup (fence + window), and that the result matches a naive linear scan. 8d. **Fence-pointer index absent below threshold.** Build an edge SST with `key_count = 1 000`; assert no `fence_index` section in the footer and that the reader takes the “fetch full key\_ids” branch. 8e. **Tombstone consistency fwd ↔ inv.** Flush a memtable with one tombstoned edge `(s, d, lsn)`; assert that the forward partner has `tombstone_bit[j] = 1` at the position corresponding to `(s → d)` and the inverse partner has `tombstone_bit[k] = 1` at the position corresponding to `(d → s)`, with both LSNs equal. 9. **Random-access edge lookup.** Open the reader, query `src=X`, assert only the expected ranged GETs hit the store (use the `object_store::memory::InMemory` plus a counting wrapper). Validate pattern B’s GET count. 10. **Stats correctness.** After writing an SST, the returned stats match those independently computed from the source data (`row_count`, `min/max`, `tombstone_count`, `degree_histogram`). 11. **Bloom correctness.** Bloom contains every inserted key (FPR check on a held-out set is ≤ 2 × theoretical). 12. **Bloom side-car wire.** Write a bloom side-car, corrupt one byte, assert `Error::Corrupted` at probe time. 13. **Small SST omits bloom.** Write an SST with `size_bytes < 256 KiB`; assert `bloom.path == ""` in the descriptor and the reader uses the in-body scan path. 14. **Footer corruption.** Truncate the last 16 bytes of an edge SST, assert `Error::Corrupted`. 15. **xxHash3 mismatch detected.** Flip one byte inside a section’s body, assert the section’s checksum verification fails when read. 16. **Forward-compat skip.** Write an SST with a synthetic `section_kind = 0x0FFF` of payload `"ignored"`; the v1 reader must ignore it and still return correct data. 17. **Major mismatch refused.** Manually flip `format_major = 2` in the header, assert the reader returns `Error::Corrupted`. 18. **Header size mismatch refused.** Flip `header_size = 80`, assert the reader refuses. 19. **Unknown reserved flag refused.** Set flag bit 5, assert `Error::Corrupted`. 20. **LocalStack integration.** A single end-to-end test (`#[ignore]`) that writes both a node SST and a forward+inverse edge pair through `object_store::aws` against LocalStack and reads them back, including pattern B GET-count assertions against the LocalStack request log. ## Alternatives considered ### A. Parquet for edges too Use Parquet `list>` keyed by `src_id`. Rejected: * A list-shaped column needs Parquet repetition levels, which add \~2 bytes per edge of metadata and a definition-level mask that has to be walked on read. * Random access to “neighbours of src = X” still costs O(row\_group), not O(1), because Parquet has no random access into a list cell. * We lose the ability to encode the skew optimisation cleanly (would require a sibling sparse column). ### B. Lance v2 for edges Lance v2 is excellent for vectors and blobs but is not optimised for the adjacency-list shape. Its strengths (zero-copy random access into blob columns; Vamana / IVF integration) do not map onto CSR. We will use Lance for the **vector** SST kind in RFC-007. ### C. SlateDB’s SST verbatim SlateDB is a KV store. Its SST format is well-tuned for `(key, value)` pairs but does not carry the columnar invariants we need for property columns or for CSR offsets. Reusing it would force every read to do a key-decode pass that we get for free in Parquet. ### D. Iceberg manifest + Parquet data files We considered structuring SSTs as an Iceberg table. Rejected for v1 because: * Iceberg’s manifest layout is heavier than ours and adds a level of indirection irrelevant to a single-writer LSM. * Iceberg’s snapshot semantics overlap with ours but with different retention semantics; we would not be able to express “branch with fork retention” without subverting Iceberg’s vacuum. * We will revisit in **RFC-014 Iceberg integration** as an *export* surface — write an Iceberg view *of* the SSTs, not store them as one. ### E. Embedded bloom (revision-1 plan) Originally we planned to inline the bloom inside the manifest `SstDescriptor`. Rejected during revision 2: a 1.25 MiB raw bloom becomes 1.65 MiB base64, and a 100 K-SST namespace would produce a \~165 GB manifest. Side-car keeps the manifest under the JSON budget while still allowing the bloom to be fetched lazily and cached by foyer. ### F. Single-direction edge SSTs Skipping the inverse partner halves write amplification at flush time. Rejected for v1: a `MATCH (n)-[:KNOWS]->(:Person {name: 'Bob'})` query against L0 SSTs degenerates to `O(|E|)` neighbour scans, which destroys the §14.1 cold-query budget. Single-direction may be reintroduced as a per-edge-type override (e.g. for write-heavy log-shaped edges) once we have bench data. ### G. Per-section CRC32 (revision-1 plan) Originally CRC32 IEEE, matching the WAL. Rejected during revision 2 in favour of xxHash3-64 for SSTs only. Rationale: S3 already provides strong integrity (HTTP MD5 / CRC32C) end-to-end, so SST checksums exist to defend against client-side / memory-side corruption. xxHash3 is \~3-5 × faster than CRC32 IEEE at the same defence quality for the fail-modes that matter at this layer. WAL keeps CRC32 because its fail-modes include torn 4 KiB writes where CRC’s burst-error guarantees are useful. ## Drawbacks 1. **Bloom side-car costs a GET.** A query that does not have min/max pruning available pays one extra ranged GET per candidate SST on the cold path. Mitigation: foyer caches every bloom side-car after first touch (typical size 125 KiB–1.5 MiB), and the “small-SST omit-bloom” rule means the cost only applies to SSTs large enough to benefit anyway. Bench-targeted. 2. **Inverse partner doubles write amp on the edges path.** Mitigation: per-edge-type override is a v1.1 follow-up. The expectation is that most graph workloads are write-once-read-many, so the asymmetry between flush and query cost is acceptable. 3. **Custom edge format is more code to maintain.** We are taking on a wire format that we now own forever. Mitigations: a small `format_major / minor` invariant, exhaustive round-trip tests, and a `namidb-storage` CLI subcommand (`inspect-sst `) to dump the header / footer for ops debugging (lands with the writer). 4. **Parquet’s per-row-group footer overhead** dominates for very small node SSTs (< 10 K rows). Mitigation: the writer aggregates short flushes to ≥ 128 K rows when possible — see flush path RFC-003 for the policy. 5. **Skew block ships only `tag = 0x10` dense.** Roaring integration (`tag = 0x11`) lands once bench data justifies it; until then a super-node with 1 M out-edges uses 16 MiB of dense storage per SST. Acceptable for prototype. 6. **`f32` / `f64` stats skip NaN / Inf.** This matches Parquet’s strict stats but means we silently drop min / max when a column contains them. Predicate pushdown gracefully falls back to per-row evaluation. Tracked. 7. **Manifest growth is bounded but not constant.** With 100 K SSTs the manifest is still \~10 MiB (dominated by `property_stats`, `degree_histogram`, key ranges). The JSON → Arrow IPC switch lands when bench data warrants it; until then we set a hard 10 MiB cap and the writer fails the commit if a new manifest would exceed it (with a clear error pointing at the migration RFC). ## Open questions 1. **Bloom probe in pattern A vs page-index probe.** For point lookups on node SSTs the Parquet page index already gives row-group pruning at min/max granularity. The bloom helps when min/max intervals overlap. Bench may show that the bloom is unnecessary for node SSTs and only edge SSTs need it. Leaving the bloom on for both kinds in v1 keeps the read path uniform. 2. **HLL sketch byte budget.** 1 KiB per column per SST is the current default. Could be lowered to 256 bytes (less accuracy) if manifest growth becomes a bottleneck before the IPC migration. Bench-driven. 3. **`tag = 0x11` (Roaring) timing.** Promote when a workload has a super-node with degree ≥ 1 M *and* benches show ≥ 2 × savings. 4. **Edge property layout: per-edge vs per-key chunks.** Today property stream row *j* maps to edge *j* in partner enumeration order. An alternative is to chunk by key group so that all properties of a single src’s out-edges are contiguous. The current choice maximises columnar scan efficiency for `WHERE edge.prop ...` predicates; the alternative would maximise per-key locality. Defer until query engine benches. 5. **JSON → Arrow IPC manifest threshold.** Currently set at 10 MiB. This RFC’s structures keep manifests well under that for 100 K SSTs; the threshold will be re-evaluated when a namespace approaches it. Tracked as RFC-003 follow-up. 6. **Per-edge-type inverse opt-out.** Defer to bench data; a “log-shaped” edge type (e.g. immutable events) might never need in-edge expansion, and could opt out of inverse partner generation at schema declaration time. ## References * Parquet specification, . * Lemire, Boytsov, **Decoding billions of integers per second through vectorisation** (Software: Practice & Experience, 2015). Varint / bitpacking implementation reference. * Chambi et al., **Better bitmap performance with Roaring bitmaps** (SP\&E, 2016). Section 3.2.4 skew layout. * Putze, Sanders, Singler, **Cache-, Hash- and Space-Efficient Bloom Filters** (J. Exp. Algorithmics, 2010). SBBF foundations. * Heule, Nunkesser, Hall, **HyperLogLog in practice** (EDBT 2013). * Y. Collet, **xxHash3** specification, . * Jin et al., **Kùzu** (CIDR 2023). Property graph + CSR + factorised representation in a single binary; reference architecture. * Hu et al., **EmptyHeaded** (SIGMOD 2017). WCOJ over factorised intermediate results — relevant for how SST stats inform planning. * **DuckDB DataChunk + Parquet integration**, . Reference for column-store integration over Parquet. * **turbopuffer architecture**, . Embedded-stats-in-manifest pattern + bloom side-car pattern. * **SlateDB SST format**, . We diverge by being column-oriented and CSR-aware. * **Apache Iceberg manifest spec**, . Reference for the design we did *not* adopt as the primary layout, but will use as an *export* surface in RFC-014. * **UUIDv7 specification**, RFC 9562. Layout that underlies the split-top64 / bottom64 encoding in §3.2.4.
# RFC 003: Read-path ranged reads + Parquet page index
> **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** —
> *Mirrored from [`docs/rfc/003-read-path-ranged-reads.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/003-read-path-ranged-reads.md) in the engine repo. Source of truth lives there.* **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** — ## Summary Replace the full-body `object_store::get()` that every cold `lookup_node`/`edge_lookup` issues today with a byte-ranged fetch driven by the Parquet **page index** and the existing per-row-group min/max stats. The goal is to bring cold `lookup_node` p50 on real S3 (50–100 MB/s, \~1 ms RTT from a co-located EC2 instance; far worse from a developer laptop) inside the the envelope of `<500 ms p50` at 10 M nodes — which the in-process / LocalStack bench cannot exercise because localhost bandwidth hides the issue. ## Motivation A previous iteration closed the bench gate against LocalStack: row-group pruning on the `node_id` column reduced per-lookup decode from O(rows\_in\_sst) to O(rows\_per\_row\_group), and the resulting numbers at 10 M nodes are: | Metric | Target | Measured (LocalStack, MacBook) | | ---------------------- | ------- | ------------------------------ | | Cold `lookup_node` p50 | <500 ms | **381 ms** | | Warm `lookup_node` p50 | <10 ms | **9.27 ms** | Both gates pass — but the cold number is misleading because `Snapshot::get_sst_body` still fetches the **entire** Parquet body (currently 300–500 MB for a 10 M-node SST with zstd compression). LocalStack on a single host moves that in \~350 ms; real S3 moves it in 2–10 s depending on co-location. The same code path against `s3.us-east-1.amazonaws.com` from a developer laptop would consistently violate the gate, even though the test harness reports green. The root mismatch is structural: a point lookup needs \~tens of KB of column data from a single row group, not the whole SST. The Parquet 2.0 page index gives us exactly that — per-column-chunk per-page min/max offset + length — but we currently ignore it. Cost of doing nothing: any production deploy against real S3 ships with a hidden \~10× regression on cold lookups vs the bench gate. That blocks the SaaS demo and the public launch. ## Design ### Surface change `NodeSstReader::open` and `EdgeSstReader::open` today take `body: Bytes`. The new path takes an `Arc` + `Path` + `ObjectMeta` (size known from the manifest descriptor) and uses `parquet::arrow::async_reader::ParquetObjectReader` under the hood. The existing `Bytes`-backed constructors stay for the in-process test path and for the eager `scan_label` use case (which already needs every row group).
```rust
// New constructor (additive).
impl NodeSstReader {
pub async fn open_async(
label: LabelDef,
store: Arc,
path: Path,
size_hint: u64,
) -> Result { /* ... */ }
}
```
The async reader exposes a parallel `targeted_scan_async(&[u8; 16]) -> RecordBatch` that: 1. **Footer fetch.** `ParquetObjectReader` issues one `get_range` for the trailing \~8 KB of the SST to read the Parquet footer + column-chunk metadata. For a 500 MB SST this is \~one round-trip and \~8 KB transferred. 2. **Row-group pruning.** Same min/max stats check we already have in `targeted_scan`. Pick the single row group that straddles the target key (writer guarantees strict ascending `node_id` so there is at most one). 3. **Page index fetch.** If `with_page_index(true)`, the reader fetches the `OffsetIndex` + `ColumnIndex` for the chosen row group (\~few KB). Combined with the per-page min/max from the column index, we identify the single data page in the `node_id` column that contains the target. 4. **Page fetch.** A single `get_range` of the chosen page’s bytes (\~1–8 KB depending on rows-per-page). Decode, find the row offset within the page, project the same row offset across the other columns’ pages — each is one more `get_range`. For a `Person` label with \~6 declared properties + 2 system columns, that’s 8 ranged GETs of \~1–8 KB each, or a `get_ranges` batched call. Total wire footprint per cold lookup: **\~50–100 KB** (vs \~500 MB today) and **3–4 round trips** (footer, page index, batched column pages). On S3 us-east-1 from EC2 (\~1 ms RTT), that’s \~5–20 ms. From a laptop (\~30 ms RTT), \~100–150 ms — both inside the 500 ms gate with comfortable margin. ### Cache integration The current `SstCache` keys on the full path and stores the entire body. Under the new design we shift to **range-keyed caching**: keys become `(path, offset, length)` or a normalised `(path, kind)` for the three structurally-fixed regions: * `:footer` — the trailing footer + column metadata block (size known after the first fetch). * `:row_group_:column_` — per-column-chunk pages for the hot row group. Warm lookups against the same SST and same row group hit memory without re-fetching. This is essentially a buffer pool keyed by Parquet’s logical units instead of by file. Foyer continues to back it; the `weighter` adds the key length plus the value length and the budget stays in real bytes. ### Edge SST counterpart The edge SST format is custom (RFC-002 §3) and already ships a fence-pointer index for `key_count > 65 536`. The same idea applies: today `EdgeSstReader::open` reads the full body; the async variant reads the footer + fence index + the per-key partner block. Wire format is unchanged — only the reader navigates differently. ### Manifest descriptor extension `SstDescriptor.size_bytes: u64` already exists in the manifest. No schema change needed; the reader passes that as the `size_hint` so `ParquetObjectReader` can position the trailing footer read without a HEAD request. ## Alternatives considered ### A. Persist a separate “index” side-car per SST A `.idx` blob with `node_id → (row_group, offset)` mapping, written at flush time. Cold lookup = 1 GET of the side-car + 1 GET of the chosen row group. Rejected: the side-car would essentially duplicate the Parquet column index, and we’d carry two sources of truth that must stay in sync. The Parquet page index is already on disk inside the body — re-using it is free. ### B. Maintain a sorted in-memory key→row-group map per SST Build it on `open()` and cache. Cold lookup pays one full-SST decode the first time, warm is instant. Rejected: the first lookup is what we’re trying to fix. Building the map requires reading the footer + column index anyway, so we may as well consume that information directly instead of caching it in a parallel structure. ### C. Smaller row groups (e.g., 4 K rows) Today’s row group is 128 K rows. Smaller groups would amortise less per-group overhead and let us decode less per pruned hit. Rejected as a complete solution: ratio improvement is linear in the row-group shrink but at some point per-group metadata cost dominates the body. Real fix is page-level granularity, not finer row groups. ### D. Materialise hot keys into a separate SST per layer LSM-style “block index” promoted to its own file. Rejected: adds a writer-side component (when to promote? what to evict?) and another manifest descriptor. The Parquet page index already gives us per-page granularity for free; promoting hot keys is premature. ## Drawbacks 1. **Two read paths to maintain.** The async ranged path coexists with the eager `Bytes`-backed path used by `scan_label` / `scan_edge_type` / compaction. We accept the surface area because compaction genuinely needs every row group and would issue worse access patterns if forced through the ranged reader. 2. **Foyer cache keying changes.** Existing tests that assert `SstCache.usage() > 0` after a warm cycle keep working (the cache holds page bytes instead of body bytes) but the bytes-per-entry distribution shifts dramatically — smaller entries, more of them. Eviction tuning may need a second pass. 3. **Round-trip count on S3.** A cold lookup goes from 1 wide GET to \~3 narrow GETs. For backends with HEAD+GET RTT penalties (some self-hosted gateways) this could regress wall-clock time despite the bandwidth win. Mitigation: support `object_store::get_ranges` (which coalesces) and benchmark explicitly against real S3 + LocalStack before declaring victory. 4. **Bench harness debt.** `benches/read_latency.rs` today exercises the cached `Bytes` path. The harness needs a new bench (`cold_ranged_from_s3`) that exercises the async reader and reports both the LocalStack and real-S3 numbers — otherwise we re-introduce the LocalStack-only blind spot this RFC was written to close. ## Open questions 1. **Coalescing strategy for column pages.** `object_store::get_ranges` issues a single multi-range request when the backend supports it; for backends that don’t (some S3 gateways), it falls back to parallel single-range GETs. Need to measure which dominates for our typical 8-column projection. 2. **Page index always-on?** The writer can produce the page index unconditionally (\~negligible footer overhead) or only when row count exceeds a threshold. Cheap to always emit — recommend on by default and revisit only if footer size becomes a problem. 3. **Bloom filter probe ordering.** Today: manifest min/max → bloom → body GET. New flow: manifest min/max → bloom → footer GET (cheap) → row-group prune → page GET. Bloom still saves a footer round trip on a true miss, so keep it first. But if the bloom misses are rare in practice (well-tuned FPR), we may want to skip it and go straight to the footer fetch which is similarly small. 4. **Property-stream evolution interaction.** RFC-002 §3.2.7 (declared edge property streams) is a follow-up. When per-property streams ship, the ranged read pattern extends naturally: one extra `get_range` per requested property stream. No new design needed, just one more knob on the column projection. 5. **`scan_label` / `scan_edge_type` retention of the eager path.** Confirm that range scans always stay on the body-fetch path or whether they should also use ranged reads when the result set is small. Probably “always eager for now, revisit when the query engine surfaces predicate push-down.” ## References * RFC-002 §4.1 (SstDescriptor format, `size_bytes` already in the manifest). * Apache Parquet [Page Index spec](https://github.com/apache/parquet-format/blob/master/PageIndex.md). * `object_store::ObjectStore::get_ranges` ([docs.rs](https://docs.rs/object_store/0.13.0/object_store/trait.ObjectStore.html#method.get_ranges)). * `parquet::arrow::async_reader::ParquetObjectReader` ([docs.rs](https://docs.rs/parquet/55.2.0/parquet/arrow/async_reader/struct.ParquetObjectReader.html)).
# RFC 004: Cypher subset compatibility scope
> **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** —
> *Mirrored from [`docs/rfc/004-cypher-subset.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/004-cypher-subset.md) in the engine repo. Source of truth lives there.* **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** — ## Summary Declara el subconjunto exacto de Cypher 25 / openCypher / GQL ISO/IEC 39075:2024 que el parser `namidb-query` acepta en la primera iteración del query engine (v0). La meta es **parsear sin error las 12 queries de LDBC SNB Interactive Complex que no dependen de `shortestPath`/`allShortestPaths`**, dejando IC13 e IC14 explícitamente fuera de scope hasta RFC-009 (WCOJ + recursive patterns) y cerrando 100 % de la superficie cubierta con tests que viven con el código. El subset es **deliberadamente menor** que el de Neo4j Community 5.x y que el de Kùzu: privilegiamos compatibilidad estricta sobre features, evitamos APOC, evitamos subqueries `CALL` y evitamos `FOREACH`. El compromiso es: *“lo que el parser acepta corre o devuelve un error tipado claro — nunca un warning silencioso que cambia la semántica”*. ## Motivation Cypher es un lenguaje grande. El Cypher 25 specification (mayo 2025) define unas 80 cláusulas/expresiones de primer nivel, openCypher TCK suma \~10 000 casos de test. Una implementación completa toma 18+ meses (Memgraph tardó \~2 años en cubrir el 80 % útil; Kuzu nunca llegó al 100 % antes del archive). Sin un scope explícito el parser se vuelve un agujero negro de tiempo: * Cada feature nuevo demanda decisiones de semántica (e.g. `MERGE` con multi-label, `OPTIONAL MATCH` con left-anti-join, `WITH *`). * Cada feature nuevo demanda tests, error messages, lowering al IR del logical plan. * Sin un gate de “qué está adentro y qué afuera” no podemos honestamente comunicar al usuario qué funciona. **Referencia de scope:** LDBC SNB Interactive Complex Q1–Q14. Cubrir 12 de las 14 queries en el parser deja scope ejecutable para las etapas siguientes (lowering, optimizer, executor) sin abrir nuevos frentes de compatibilidad. ## Design ### Versión declarada del estándar * **Base normativa:** GQL ISO/IEC 39075:2024 (publicado 11 abril 2024) + openCypher 9 (el último cuya specification es libre de patentes). * **Cypher 25 (Neo4j):** trataremos como referencia de naming y syntax pero **no** implementaremos nada exclusivo de Neo4j (e.g. `db.*` functions, APOC). * **Cuando hay conflicto** entre GQL y openCypher, **GQL wins**. Razón: evitar lock-in vendor-specific, posicionarnos junto a la dirección que Memgraph, RisingWave y la comunidad académica están tomando. ### Subconjunto v0 (in-scope) #### Clauses | Clause | v0 | Notas | | -------------------------- | -- | ----------------------------------------------------------------------------------------------------------------- | | `MATCH` | ✅ | Patrón fijo o variable-length `*n..m` con bounds finitos. | | `OPTIONAL MATCH` | ✅ | Semantics left-outer-join. | | `WHERE` | ✅ | Predicados arbitrarios sobre el scope visible. | | `RETURN` | ✅ | Projection list con aliases (`AS`). `DISTINCT` soportado. `*` no soportado en v0 (se exige projection explícita). | | `WITH` | ✅ | Pipe que reinicia el scope. Soporta `WHERE` interior y aliases. | | `ORDER BY` | ✅ | Multi-key `ASC`/`DESC`. | | `SKIP` / `LIMIT` | ✅ | Solo literales o `$param`. Sin expresiones. | | `UNWIND` | ✅ | Lista → rows. | | `CREATE` | ✅ | Nodes y edges con properties literales o `$param`. | | `MERGE` | ✅ | `MERGE... ON CREATE SET... ON MATCH SET...`. | | `SET` | ✅ | Property assign, label add. | | `DELETE` / `DETACH DELETE` | ✅ | Single binding por delete. | | `REMOVE` | ✅ | Property remove, label remove. | | `UNION` / `UNION ALL` | ✅ | Mismo arity y mismos aliases. | #### Patterns | Element | v0 | Notas | | ----------------------------------------------------- | -- | --------------------------------------------------------------------------------- | | Node pattern `(a:Label {prop: val})` | ✅ | Multi-label `(a:A:B)`. Map property filter inline. | | Relationship pattern `-[r:TYPE]->` | ✅ | Direction `-->`, `<--`, `--`. | | Relationship type alternation `-[r:TYPE_A\|TYPE_B]->` | ✅ | | | Variable-length `-[r:KNOWS*1..3]->` | ✅ | Bounds finitos requeridos. `*` solo o `*n..` (sin upper bound) → error explícito. | | Pattern chain `(a)-[]-(b)-[]-(c)` | ✅ | | | Pattern de múltiples partes `MATCH (a), (b)` | ✅ | | | Anonymous variable “, `[]` | ✅ | | #### Expressions | Categoría | v0 | | ---------------------------------------------------------------------- | ------------------------------------- | | Literals: int, float, string, bool, null, list `[1,2,3]`, map `{k: v}` | ✅ | | Parameters `$name` | ✅ | | Variable reference `a`, property access `a.prop` | ✅ | | Operators arith `+ - * / % ^` | ✅ | | Operators string `+` (concat), `=~` (regex) | ✅ | | Operators bool `AND OR NOT XOR` | ✅ | | Comparison `= <> < > <= >=` | ✅ | | `IS NULL` / `IS NOT NULL` | ✅ | | `IN` (membership lista) | ✅ | | `STARTS WITH`, `ENDS WITH`, `CONTAINS` | ✅ | | Function call `length(x)`, `count(a)`, `collect(a.prop)` | ✅ (built-ins listados abajo) | | `CASE WHEN... THEN... ELSE END` | ✅ (forma simple y forma multi-branch) | | List comprehension `[x IN list WHERE pred \| expr]` | ✅ | | Pattern comprehension `[(a)-[]->(b) \| b.name]` | ✅ | | Pattern predicates `WHERE (a)-[]->(b)` | ✅ | #### Built-in functions (mínimas para Q1–Q12) **Aggregations:** `count(*)`, `count(x)`, `count(DISTINCT x)`, `sum`, `avg`, `min`, `max`, `collect`, `collect(DISTINCT x)`. **Scalar:** `id(n)`, `labels(n)`, `type(r)`, `keys(n)`, `properties(n)`, `length(p)`, `size(coll)`, `head(coll)`, `last(coll)`, `tail(coll)`, `coalesce(x, y,...)`. **String:** `toLower`, `toUpper`, `trim`, `substring`, `replace`, `split`, `toString`, `toInteger`, `toFloat`. **Numeric:** `abs`, `ceil`, `floor`, `round`, `rand`, `sign`. **Temporal:** `date`, `datetime`, `duration` (forma constructor solo con ISO 8601 strings; no la álgebra completa todavía). **Pattern:** `exists(pattern)`, `nodes(path)`, `relationships(path)`. #### Tipos `INTEGER` (64-bit signed), `FLOAT` (64-bit), `STRING`, `BOOLEAN`, `NULL`, `LIST` (heterogénea permitida — typecheck en runtime), `MAP`, `NODE`, `RELATIONSHIP`, `PATH`, `DATE`, `DATETIME` (sin timezone), `DURATION`. Out-of-scope v0: `BYTES`, `POINT`, `LOCALDATETIME`, `ZONEDDATETIME`, `LOCALTIME`, `TIME`. #### Semántica de NULL Three-valued logic estándar Cypher: * `NULL = NULL` → `NULL` (no `true`). * `NULL AND false` → `false`, `NULL AND true` → `NULL`. * `WHERE` filter rechaza rows con predicado `NULL` (como `false`). * `IS NULL` / `IS NOT NULL` son las únicas formas de testear NULL. #### Error model `ParseError { code: ErrorCode, message: String, span: SourceSpan, help: Option }` donde `ErrorCode` es un enum exhaustivo (`E001_UnexpectedToken`, `E002_UnboundedVariableLength`, `E003_ReservedKeyword`,…). Mensaje sigue el formato de `ariadne` con caret highlighting y `help:` opcional. Múltiples errores se reportan en la misma pasada via `chumsky::recovery`. ### Out-of-scope explícito v0 Lista exhaustiva — cualquier feature que NO esté aquí ni en el subset in-scope falla con error de “feature no soportada” + número de RFC futuro donde aterriza. | Feature | Por qué afuera | Aterriza en | | ------------------------------------------------------------------- | ---------------------------------------------------------------------------- | --------------------------------------------------------- | | `shortestPath(...)` | Recursive pattern matching. Requiere WCOJ + planner especial. | RFC-009 | | `allShortestPaths(...)` | Idem. | RFC-009 | | `CALL {... }` (subqueries) | Subquery scoping rules son sutiles, no necesarias para LDBC SNB Interactive. | RFC futuro | | `CALL procedure.name(...)` | No tenemos procedure registry. APOC explícitamente out. | RFC futuro | | `FOREACH` | Imperativo, raramente útil. | RFC futuro | | `USE database` | Cross-database queries. Single namespace por sesión en. | RFC-010 (cloud) | | `LOAD CSV` | Bulk ingest path es `WriterSession`. | Nunca; usar el ingest API. | | `CREATE INDEX` / `CREATE CONSTRAINT` | DDL fuera de Cypher; lo manejará el schema API directo. | RFC futuro | | `EXPLAIN` / `PROFILE` | Pendiente pero ya con scope: vienen una vez exista LogicalPlan. | RFC futuro | | Transacciones explícitas (`BEGIN`/`COMMIT`/`ROLLBACK` Cypher-level) | El cliente las maneja externamente via `WriterSession.commit_batch`. | Nunca via Cypher en v0. | | Variable-length sin upper bound (`*1..`) | Sin upper bound el optimizador no puede limitar el blowup. | Posible relajación con WCOJ. | | Pattern de longitud cero (`*0..n`) | Trivial pero abre dudas semánticas (auto-loops). | RFC futuro. | | `MATCH p = (a)-[*]->(b) RETURN p` (paths como first-class) | Requiere materialización del path; útil pero no crítico para Q1–Q12. | RFC futuro. | | Tipos `POINT`, `TIME`, `ZONEDDATETIME` | Sin uso en LDBC SNB Interactive. | RFC futuro cuando aterricen verticales geo / time-series. | | `db.*` / `apoc.*` namespaces | Vendor-specific Neo4j; no portables. | Nunca. | ### Mapping a LDBC SNB Interactive Complex Q1–Q14 Cada query se evalúa contra el subset y se marca `IN` (parsea en v0) o `OUT` (queda excluida hasta el RFC indicado). | Query | Features requeridas | v0 | | ----------------------------------------------- | ------------------------------------------------------- | --------------- | | **IC1** — Friends by name (transitive) | `MATCH... *1..3... WHERE... ORDER BY... LIMIT` | ✅ IN | | **IC2** — Recent messages by friends | `MATCH 2-hop... WHERE timestamp <... ORDER BY... LIMIT` | ✅ IN | | **IC3** — Friends in two countries | `MATCH... WHERE country IN [...]` | ✅ IN | | **IC4** — New topics on friend posts | `MATCH 2-hop + WITH + collect + UNWIND + WHERE NOT IN` | ✅ IN | | **IC5** — New groups (membership count) | `MATCH... WITH... count + ORDER BY` | ✅ IN | | **IC6** — Tag co-occurrence | `MATCH 2-hop... WITH tag, count... ORDER BY` | ✅ IN | | **IC7** — Recent likers | `MATCH... ORDER BY... LIMIT` | ✅ IN | | **IC8** — Recent replies | `MATCH... ORDER BY... LIMIT` | ✅ IN | | **IC9** — Recent messages by friends-of-friends | `MATCH *2..2... WHERE... ORDER BY... LIMIT` | ✅ IN | | **IC10** — Friend recommendation | `MATCH 2-hop... WITH common_count... ORDER BY` | ✅ IN | | **IC11** — Job referral | `MATCH... WHERE... ORDER BY` | ✅ IN | | **IC12** — Expert search by tag class | `MATCH 2-hop + tag class hierarchy + count + ORDER BY` | ✅ IN | | **IC13** — Single shortest path | `shortestPath((a)-[*]-(b)` | ❌ OUT — RFC-009 | | **IC14** — All shortest paths weighted | `allShortestPaths` + weight calc | ❌ OUT — RFC-009 | **Cobertura v0:** 12/14 (85.7 %). IC13–IC14 son los únicos excluidos y ambos requieren recursive pattern matching que el WCOJ planner desbloquea. ### Estructura del crate `namidb-query`
```plaintext
crates/namidb-query/src/
├── lib.rs # reexports públicos
├── parser/
│ ├── mod.rs # entry point: parse(&str) -> Result>
│ ├── lexer.rs # &str → Vec<(Token, SourceSpan)>
│ ├── ast.rs # tipos AST (Query, Clause, Pattern, Expression,...)
│ ├── grammar.rs # chumsky combinators
│ ├── display.rs # Display impl canonical (round-trip)
│ └── error.rs # ParseError, ErrorCode, SourceSpan
└── tests/ # integration tests parser
```
LogicalPlan, optimizer y executor viven en módulos hermanos cubiertos por RFCs hermanas — quedan fuera del scope de RFC-004. ### Dependencias agregadas | Dep | Versión | Por qué | | --------- | ------- | ----------------------------------------------------------------------------------- | | `chumsky` | 0.10 | Parser combinators con error recovery y AST-friendly. Justificado en §Alternativas. | | `ariadne` | 0.5 | Pretty error messages (caret, span highlight, multi-error). | No agregamos `nom`, `pest`, `lalrpop`, ni `antlr-rs`. Justificación en §Alternativas. ## Alternatives considered ### A. Hand-written recursive descent parser **Pro:** máxima velocidad de parsing, control absoluto de error messages, sin dependency tree. **Con:** \~3 000–5 000 LoC para cubrir el subset declarado, \~30–50 % del tiempo se va en boilerplate de precedence + error recovery, refactor caro cuando agregamos features. **Veredicto:** Rechazado. Es la opción “Postgres” — válida cuando el parser es el producto principal. Para nosotros el producto es el storage + executor, el parser es overhead. ### B. `nom` parser combinators **Pro:** maduro (\~10 yrs), rápido, gran comunidad Rust. **Con:** error messages requieren mucha plumbing manual (`VerboseError` ayuda pero queda lejos de `ariadne`), no tiene recovery built-in, tipo de combinators byte-stream-first (no token-stream-first) — friction natural con un lexer tokenizado separado. **Veredicto:** Rechazado. Es la mejor opción si el parser fuera la única prioridad pero el dev experience de errores es inferior a chumsky. ### C. `chumsky` 0.10+ **Pro:** parser combinators con error recovery first-class (`recovery::skip_then_retry_until`, `nested_delimiters`), AST-friendly (retorna `Result>` con todos los errores no solo el primero), buena integración con `ariadne` para pretty errors, version 1.0 cerca. **Con:** versión 0.10 cambió API significativamente vs 0.9 — un breaking change vertical futuro probable. Slower que `nom` en benchmarks micro (\~2×). **Veredicto:** **Aceptado**. Velocidad de parsing es irrelevante en nuestro workload (la query string viene del usuario una vez, se parsea, se cachea). Error quality es lo que importa. ### D. ANTLR4 + generador de parser Rust (antlr-rust) **Pro:** openCypher distribuye una gramática ANTLR oficial; reusarla evita re-litigar precedencia y syntax edge cases; cobertura del estándar “para free”. **Con:** `antlr-rust` no está bien mantenido (último release 2022), la gramática openCypher cubre features que están out-of-scope (`shortestPath`, `CALL`, `FOREACH`,…) y filtrarlos post-parse es más caro que parsear el subset directo. ANTLR genera código que es pesado de leer; el debugging cuando algo sale mal es difícil. **Veredicto:** Rechazado. Reusar la gramática openCypher como referencia informal — sí. Generar Rust desde ella — no. ### E. LALRPOP (LR(1) generator) **Pro:** maduro, rápido, parser determinístico. **Con:** Cypher no es LR(1) limpio (ambigüedad pattern vs expression dentro de `WHERE` clauses), forzar grammar a LALR causa hacks. Error recovery en LR(1) es notoriamente difícil. **Veredicto:** Rechazado. LR genera grammars rígidas; quereremos evolucionar rápido (futuro). ### F. Lexer separado vs lexer inline en chumsky chumsky soporta parsear directo desde `&str` sin lexer (es lo idiomático en muchos ejemplos). Decisión: **lexer separado**. **Razones:** * Comments (`//`, `/* */`) y whitespace son más limpios de manejar en lexer. * Keyword vs identifier es ambigüedad léxica (`COUNT` puede ser función o identifier en algunos contextos) — resolverlo a nivel de token simplifica el grammar. * Spans más precisos: cada token lleva su span; el parser solo conecta tokens, no recomputa offsets. * Test independiente: el lexer puede testearse sin tocar el parser, y viceversa. Costo: \~150 LoC extra de lexer. Aceptable. ## Drawbacks 1. **Subset muy chico** comparado con Neo4j (5 % de la superficie) — early adopters que vienen de Neo4j chocarán con “feature not supported” en cada feature avanzado. Mitigación: error message indica qué RFC futuro lo cubre, link a roadmap público. 2. **Cypher 25 está evolucionando**: GQL ISO/IEC 39075 puede ganar ammendments. Mitigación: rebase del subset cada release; RFC-004 se trata como living document (Status puede pasar a `superseded` cuando aparezca RFC-004.1 o RFC-004 v1). 3. **`chumsky` 0.10 → 1.0** breaking change esperado en próximos meses. Mitigación: encapsular el uso detrás de `parser::grammar::*` privado, refactor confinado a un módulo. 4. **`MERGE` con multi-label patterns** tiene semantics ambiguas (Neo4j y Memgraph difieren). Decisión: en v0 `MERGE` requiere exactamente un label por node pattern. `MERGE (a:A:B)` retorna error parser-level. Documentado en error code `E007_MergeMultiLabel`. 5. **`OPTIONAL MATCH` con variable-length** no está bien definida en el estándar (¿qué pasa con OPTIONAL en `*0..n`?). Decisión v0: rechazar la combinación en el parser. `E008_OptionalVariableLength`. Aterriza con RFC-009. ## Open questions * **Q1: `RETURN *`** — en v0 no se soporta. ¿Lo agregamos más adelante cuando llegue el binding scope resolver? Likely sí, es feature high-value low-cost. * **Q2: `WITH *`** — idem. Decisión deferida. * **Q3: User-defined functions** — el plan §13.2 menciona RFC futura pero no está numerada. ¿`namidb.fn.*` namespace? ¿WASM sandbox? Out of scope v0; lo deciden. * **Q4: `LOAD CSV`** — fuera explícitamente; pero usuarios que vienen de Neo4j lo van a buscar. ¿Documentamos un equivalente “`namidb-cli ingest --csv...`” o lo dejamos al SDK Python? Decisión separada de este RFC. * **Q5: Identifiers con backticks** — `MATCH (a:`Foo Bar`)`. openCypher los permite, GQL los exige para identifiers con espacios o reserved words. Decisión: **soportar siempre** (mejor superset de standards). ## References * GQL ISO/IEC 39075:2024 — * openCypher 9 specification — * Cypher 25 (Neo4j) — * LDBC SNB Interactive Workload, v0.4 — Erling et al., SIGMOD 2015. * Memgraph Cypher subset — * Kuzu Cypher compatibility — (snapshot pre-archive oct 2025). * chumsky 0.10 documentation — * ariadne — * `recursive-descent` vs `combinators` discussion in Rust DBMS community — Niko Matsakis, “Why I built lalrpop” (2017); Geal blogposts on nom.
# RFC 008: Logical Plan IR
> **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** —
> *Mirrored from [`docs/rfc/008-logical-plan-ir.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/008-logical-plan-ir.md) in the engine repo. Source of truth lives there.* **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** — ## Summary Define la representación intermedia (IR) que el query engine usa entre el AST de Cypher (RFC-004) y el executor. El IR es un árbol de operadores relacionales extendidos para grafos — el shape estándar de los DBMS modernos (DuckDB, DataFusion, Materialize, Kùzu) — adaptado al modelo property-graph. Esta RFC fija los operadores, su semántica, las reglas de lowering desde Cypher, el tipo runtime `RuntimeValue` y la API del executor naïve inicial (tree-walking, eager `Vec`). Out-of-scope explícito en la versión inicial: streaming/morsel-driven execution, cost-based optimizer, WCOJ planner, parallelism, distribución multi-namespace, query result caching. ## Motivation Sin un IR estable, lowering, optimizer y executor terminan acoplados a la forma del AST. Esto duele en tres dimensiones: 1. **Optimizer imposible de injertar.** Para rewrite predicate pushdown / join-order / projection-elimination necesitamos un árbol que sepa hablar de `Filter(input, pred)` y `Project(input, items)` como operadores independientes, no como cláusulas anidadas. 2. **Executor morsel-driven no puede compartir código.** El executor vectorizado va a operar sobre el mismo árbol de operadores que el naïve; solo cambia la representación de filas (Arrow `RecordBatch` vs `Vec`) y la estrategia de scheduling. Si el IR es estable, la versión vectorizada reemplaza solo la implementación de `Operator::execute`. 3. **EXPLAIN/PROFILE necesitan algo qué imprimir.** Sin IR, el `EXPLAIN` tendría que recorrer el AST y traducirlo on-the-fly cada vez. Con IR imprimimos el árbol directo. El costo de hacerlo de entrada (vs diferirlo) es \~700–1 000 LoC y una iteración de design. El costo de diferirlo es refactor obligatorio cuando entren optimizer y executor vectorizado — peor. ## Design ### Tipo runtime: `RuntimeValue` `namidb-core::Value` cubre los escalares (`Null/Bool/I64/F64/Str/Bytes/Vec`) pero le faltan los compuestos que Cypher necesita: `LIST`, `MAP`, `NODE`, `RELATIONSHIP`. Definimos `RuntimeValue` standalone en `namidb-query` para mantener `core` agnóstico del query layer:
```rust
pub enum RuntimeValue {
Null,
Bool(bool),
Integer(i64),
Float(f64),
String(String),
List(Vec),
Map(BTreeMap),
Node(Box),
Rel(Box),
// Date / DateTime / Duration: stubs iniciales; semantics completas más adelante.
Date(i32), // days since 1970-01-01
DateTime(i64), // microseconds since 1970-01-01T00:00:00Z
}
```
`NodeValue` y `RelValue` envuelven `NodeView` / `EdgeView` del storage:
```rust
pub struct NodeValue {
pub id: NodeId,
pub label: String,
pub properties: BTreeMap,
}
pub struct RelValue {
pub edge_type: String,
pub src: NodeId,
pub dst: NodeId,
pub properties: BTreeMap,
}
```
Conversiones `From` y `From` mapean `core::Value → RuntimeValue` row-wise; esto introduce una copia pero es aceptable en la versión inicial (la versión vectorizada futura va a operar directo sobre Arrow batches sin esta conversión). ### Tipo runtime: `Row`
```rust
pub struct Row {
pub bindings: BTreeMap,
}
```
Una `Row` es el estado completo de un binding scope en el current scope. `MATCH (a)-[r]->(b) RETURN a.name, r.weight, b.id` produce rows con tres bindings vivos (`a`, `r`, `b`) hasta el `RETURN`, que projecta a una nueva row con solo `a.name`, `r.weight`, `b.id`. Decisión `BTreeMap` (no `HashMap`): determinismo en orden de iteración para tests y `EXPLAIN` output. Lookup `O(log k)` con `k = #bindings` — inmaterial vs el costo de IO. ### Operadores del IR Cada operador es una variante de `LogicalPlan`. El árbol es child-pointer single-input excepto `Union` (dos inputs). Aristas implícitas: cada operador “produce rows” para su parent.
```rust
pub enum LogicalPlan {
/// Producer de rows: scan completo de todos los nodes con `label`.
/// `alias` es el binding que cada NodeValue ocupa en la row de salida.
NodeScan {
label: String,
alias: String,
},
/// Variante O(1): scan de un único node por id. Usado cuando el AST
/// llega con `(p:Person {id: $personId})` — lowering detecta el filtro
/// trivial y lo convierte en `NodeById` en vez de `Filter(NodeScan, ...)`.
NodeById {
label: String,
alias: String,
id: Expression, // typically Parameter("personId") or Literal(NodeId)
},
/// Toma rows del `input`, expande la binding `source` por sus edges
/// `direction`/`edge_type`, materializa el destino bajo `target_alias`
/// y opcionalmente bind la rel en `rel_alias`.
Expand {
input: Box,
source: String,
edge_type: Option,
direction: RelationshipDirection,
rel_alias: Option,
target_alias: String,
/// Cuando el AST trae variable-length `*min..max`, este campo
/// guarda los bounds; lowering decide si genera un único `Expand`
/// con length o (a futuro) un sub-plan recursivo.
length: Option,
},
/// Selecciona rows que satisfacen `predicate`.
Filter {
input: Box,
predicate: Expression,
},
/// Reemplaza el row con una nueva proyección. Mantiene scope abierto
/// vía la lista de items (cada item es expression + optional alias).
/// Si `discard_input_bindings = true`, las bindings no proyectadas
/// se borran (RETURN-style). Si `false`, se conservan (WITH-style).
Project {
input: Box,
items: Vec,
distinct: bool,
discard_input_bindings: bool,
},
/// Agrupa por `group_by` y aplica las funciones aggregate.
Aggregate {
input: Box,
group_by: Vec<(Expression, String)>, // (key expression, output alias)
aggregations: Vec<(String, AggregateExpr)>, // (output alias, agg)
},
/// Sort + skip + limit fundidos. Si solo hay sort, `skip = 0`,
/// `limit = u64::MAX`. Si solo hay limit, `keys` es vacío.
TopN {
input: Box,
keys: Vec,
skip: u64,
limit: u64,
},
/// Distinct sobre el set completo de columnas visibles.
Distinct {
input: Box,
},
/// UNION o UNION ALL.
Union {
left: Box,
right: Box,
all: bool,
},
/// Expande una expression-list a multiple rows, una por elemento.
Unwind {
input: Box,
list: Expression,
alias: String,
},
/// Driver inicial sin filas — produce exactamente un row vacío.
/// Necesario para queries que abren con UNWIND o WITH literal, ni
/// para subqueries que arrancan independientes.
Empty,
}
pub struct ProjectionItem {
pub expression: Expression,
pub alias: String,
}
pub struct OrderKey {
pub expression: Expression,
pub direction: OrderDirection,
}
pub enum AggregateExpr {
Count { arg: Option, distinct: bool },
Sum { arg: Expression, distinct: bool },
Avg { arg: Expression, distinct: bool },
Min { arg: Expression },
Max { arg: Expression },
Collect { arg: Expression, distinct: bool },
}
```
### Semántica NULL (three-valued logic) Misma que Cypher 25 / GQL: * `NULL OP NULL = NULL` para todo `OP ∈ {=, <>, <, >, ...}`. * `NULL AND false = false`, `NULL AND true = NULL`, `NULL AND NULL = NULL`. * `NULL OR true = true`, `NULL OR false = NULL`, `NULL OR NULL = NULL`. * `Filter(predicate)` descarta rows cuyo predicate evalúa a `NULL` (igual que `false`). * `IS NULL` / `IS NOT NULL` son los **únicos** operadores que devuelven `Bool` para input `NULL`. * Aggregate functions (excepto `count(*)`) **ignoran NULL** en sus inputs. * Comparison entre tipos incompatibles (e.g. `1 = "x"`) → `NULL` (no error). * Division by zero entre enteros → error runtime. Entre floats → `NaN` (siguiendo IEEE 754; downstream `<` con `NaN` retorna `NULL`). ### Semántica de scope Cada clause `MATCH`/`OPTIONAL MATCH`/`UNWIND`/`WITH`/`CREATE`/`MERGE` extiende el scope con nuevas bindings. * `WITH` **cierra** el scope: bindings no proyectadas se descartan. Es el único punto de re-arranque limpio. Cypher fuerza un `WITH` entre dos `MATCH` que comparten bindings — esto se controla en el AST, no en el IR. * `OPTIONAL MATCH` propaga `NULL` en todas las bindings cuando el match no tiene resultado. Implementado como `Filter` + outer-join semantics a futuro — inicialmente se baja a `Expand` con flag `optional` que produce rows con bindings `NULL` cuando no encuentra targets. * Las bindings de una `OrderBy` clausula siguiente a `RETURN` (o `WITH`) son las de la proyección, no las pre-proyección. Eso obliga a lower `RETURN ... ORDER BY` como `Project + TopN`, no `TopN + Project`. ### Evaluation order garantizado El executor ejecuta el árbol bottom-up, depth-first. Si un operador tiene dos entradas (`Union`) ejecuta `left` antes que `right`. Side-effects en el executor están prohibidos inicialmente (no hay `SET` / `CREATE` / `DELETE` todavía); cuando lleguen van a operadores dedicados (`SetProperty`, `CreateNode`, `DeleteNode`) que ejecutan strictly after todos los reads de la query (o lazy según RFC futuro). ### Lowering rules Para cada cláusula Cypher del subset RFC-004: | Cypher | LogicalPlan | | ----------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- | | `MATCH (a:L)` (no patterns más) | `NodeScan { label: "L", alias: "a" }` | | `MATCH (a:L {id: $x})` (igualdad sobre id) | `NodeById { label: "L", alias: "a", id: Parameter("x") }` | | `MATCH (a:L {id: $x})` (igualdad sobre otra prop) | `Filter(NodeScan, a.prop = $x)` | | `MATCH (a)-[r:R]->(b)` | `Expand { input: , source: a, edge_type: R, dir: Right, rel_alias: r, target_alias: b }` | | `MATCH (a) WHERE p` | `Filter(, p)` | | `RETURN x, y AS z` | `Project { items: [x, z=y], discard_input=true }` | | `RETURN DISTINCT x` | `Project { distinct: true, ... }` | | `WITH x, y AS z` | `Project { items: [x, z=y], discard_input=true }` (mismo que RETURN — diferencia es solo si hay clauses siguientes) | | `WITH x WHERE p` | `Filter(Project(...), p)` | | `ORDER BY k1, k2 SKIP s LIMIT l` (después de Project) | `TopN { keys: [k1, k2], skip: s, limit: l }` | | `UNION ALL` | `Union { all: true }` | | `UNION` | `Distinct(Union { all: false })` | | `UNWIND list AS x` | `Unwind { input: , list, alias: x }` | | `MATCH a, b` (multiple patterns, mismo `MATCH`) | Cross product: lower `b` con `input = lowered(a)` y sin shared bindings. | La regla específica para `OPTIONAL MATCH`: * `OPTIONAL MATCH (a)-[r]->(b)` con `a` ya bindeada del scope anterior: lower como `Expand { ..., optional: true }`. Si no hay match, produce un row con `r = NULL` y `b = NULL` (preserva el row input). * Sin variable-length permitido (parser ya lo rechaza, ver `RFC-004 §Drawbacks 5`). ### EXPLAIN format
```plaintext
Project [name=a.firstName, age=a.age]
TopN keys=[a.age DESC] skip=0 limit=10
Filter (a.age > 18)
Expand source=a edge_type=KNOWS dir=-> target=b
NodeScan label=Person alias=a
```
Cada operador se imprime en una línea con el nombre del operador, sus parámetros entre `[...]` o `nombre=value`, y los hijos indentados con dos espacios. `EXPLAIN` produce esto; `PROFILE` (a futuro) lo decora con runtime stats (`rows_out`, `time_ms`, `bytes_read`). ### API del executor
```rust
pub async fn execute(
plan: &LogicalPlan,
snapshot: &Snapshot<'_>,
params: &BTreeMap,
) -> Result, ExecError>;
```
Trae todo a memoria. Eager. Single-thread (tokio current\_thread). `ExecError` cubre: binding not found, type error, parameter not provided, storage error. ## Alternatives considered ### A. AST → directamente executor (no IR) **Pro:** menos código, menos boilerplate. **Con:** acopla executor a AST. Optimizer requeriría refactor masivo. EXPLAIN tendría que reconstruir el plan en string-time. **Veredicto:** rechazado. La inversión IR-first es \~300 LoC extra que ahorra >1000 LoC más adelante. ### B. Push-based dataflow (Materialize-style) **Pro:** modelo dataflow nativo, encaja con streaming y continuous queries. **Con:** mucho más complejo. Cada operador es un actor con state + input/output channels. Overhead alto para queries one-shot. Diferencial solo aparece en multi-query / streaming scenarios. **Veredicto:** rechazado; potencial RFC futuro si entramos a streaming/CDC. ### C. Volcano-style iterator (`trait Operator { fn next(); }`) **Pro:** estándar en DBMS clásicos (Postgres, MySQL pre-pipelined). Lazy, low-memory per operator. Streaming natural. **Con:** sin parallelism. Function-call overhead por row. La industria moderna (DuckDB, Velox) lo abandonó. **Veredicto:** rechazado. Inicialmente eager `Vec` es más simple y suficiente; a futuro vamos directo a morsel-driven, no Volcano. ### D. DataFusion como IR **Pro:** maduro, optimizer “para free”, compatibilidad con Arrow. **Con:** DataFusion es relacional, no graph-shaped. Adaptar `Expand`, multi-hop, WCOJ a DataFusion es trabajo grande y nunca natural. **Veredicto:** rechazado como **IR único**. A futuro lo cableamos como **bridge para SQL surface paralelo** (graph queries en nuestro IR, SQL surface en DataFusion, mismo executor). ### E. Single-input vs multi-input operators Decisión: single-input excepto `Union`. `Join` (Hash, NL, LFTJ) es explícito multi-input pero **no aparece inicialmente** (lowering produce `Expand` chain, no joins). Joins entran cuando el optimizer re-ordene. ## Drawbacks 1. **`RuntimeValue` introduce conversión row-by-row vs Arrow.** Aceptable inicialmente (correctness-first); la versión vectorizada elimina la conversión midiendo sobre `RecordBatch` directo. Mientras tanto, hot loops convierten `BTreeMap → BTreeMap` por cada NodeView accedida. 2. **`Empty` operator + `NodeById` son corner cases.** Podrían vivir como casos especiales del `NodeScan`, pero declararlos explícitos en el IR los hace inspeccionables en `EXPLAIN` y trivial de optimizar después. 3. **OPTIONAL MATCH como flag en `Expand`** mezcla orthogonality (left outer join semantics) con sintaxis (cypher-specific clause). A futuro probablemente lo refactorizamos a `LeftOuterExpand` o un explicit `LeftJoin` operator cuando el optimizer lo necesite. 4. **`Distinct` sobre el row entero** no permite optimizar `DISTINCT col` donde solo necesitamos uniqueness de una columna. Optimización diferida. 5. **Lowering errors no son recuperables** — un solo `BindingNotFound` aborta el plan. En contraste, parser tiene multi-error recovery. Aceptable: semantic errors son menos frecuentes que typos sintácticos y queremos fail-fast. ## Addendum — `SemiApply`, `Argument`, `PatternList` Tres operadores adicionales al IR para soportar pattern predicates, pattern comprehensions y back-references a outer scope: * **`Argument { bindings: Vec }`** — single-row placeholder cuyas bindings se cargan desde el outer scope. Aparece como leaf de subplans dentro de `SemiApply` o `PatternList`. El executor materializa `vec![row]` donde `row` copia las bindings nombradas desde el outer. * **`SemiApply { input, subplan, negated }`** — semi-join existencial. Para cada row producida por `input`, ejecuta `subplan` parametrizado por el row (vía `outer_row`); mantiene la row iff el subplan emitió ≥1 (positivo) ó =0 (negated). Reemplaza la semántica `Filter(Exists(...))` con un operador dedicado. Pendiente: convertir nested-loop semi-apply a hash-semijoin cuando hay >N rows. * **`PatternList { input, subplan, projection, alias }`** — materializa una `RuntimeValue::List` por outer row. Para cada row, ejecuta `subplan` parametrizado por la row, evalúa `projection` sobre cada inner row, colecta a una lista y bindea a `alias` en la row outer. Es el lowering de `[(pattern) WHERE p | proj]` cuando aparece como top-level projection item. ### Lowering rules adicionales * **WHERE con EXISTS**: descompone el AND-tree del predicate. Cada término que es `Exists(pattern)` o `NOT Exists(pattern)` se extrae a un `SemiApply` chained sobre el input plan; los residuos se reconstruyen como `Filter` encima de la chain. Casos no soportados en v0: `Exists` dentro de `OR`, `CASE`, doble negación, etc. → `UnsupportedFeature`. * **Pattern comprehension top-level**: hoist a `PatternList` con alias sintético `__pcN`, substitute la comprehension expression por `Variable(__pcN)` en el item de la projection. * **Aggregate nesting** (e.g. `head(collect(x))`): el lowering walk recursivo cada item expression, hoist cada aggregate function call a un alias sintético `__aggN` con la `AggregateExpr` correspondiente, substituye la call por `Variable(__aggN)`. Group keys = items que no contienen ningún `__aggN`. Items con agg-nesting se evalúan sobre la row post-Aggregate. * **`RETURN *` / `WITH *`**: expande `ExpressionKind::Star` a una projection item por cada binding nombrada visible en `LowerCtx` (skip `__anon*`). Cierra RFC-004 Q1. * **Back-reference de head pattern**: cuando `(a)` reutiliza una binding ya en scope y no hay input plan, emite `Argument { bindings: [a] }` en vez de `Empty`. Esto permite que un subplan de `SemiApply`/ `PatternList` reciba la binding outer al ejecutarse. ### Out-of-scope todavía (pendiente para versiones futuras) * Pattern comprehensions nested dentro de scalar functions (`size([(a)-[]->(b)|b.name])`). * `EXISTS` fuera del AND-root del WHERE (dentro de OR/CASE/etc). * Path bindings (`p = (a)-[*]->(b)`) + path materialization. * Write clauses (CREATE/MERGE/SET/REMOVE/DELETE). ## Open questions * **Q1: ~~Pattern predicates como sub-plans.~~** ✅ Cerrada vía `SemiApply` + `Argument`. La optimización a hash-semijoin queda pendiente. * **Q2: Variable-length patterns sin variable-length operator.** Inicialmente podemos pasar `length: Option` al `Expand` y dejar que el executor itere `length.min..=length.max` iterations. Eso funciona pero no escala. ¿Variable-length explícito como operador separado (`Traverse`) a futuro con WCOJ? Probable sí. * **Q3: Materialización de paths.** `MATCH p = (a)-[*]->(b)` requiere que `p` sea materializable como List. Diferido. * **Q4: ~~`WITH *` y `RETURN *`.~~** ✅ Cerrada vía `expand_star_items` en el lowering. * **Q5: Hoist de pattern comprehensions nested.** Hoy solo top-level en projection items. Hoist nested requiere planning de orden de evaluación y bookkeeping de scopes intermedios. Diferido. ## References * DuckDB logical/physical plans — (architecture notes en el repo de DuckDB). * Kuzu morsel-driven execution — Boncz et al., CIDR 2024 paper . * Materialize/Differential Dataflow operators — McSherry et al., 2013. * Volcano model — Goetz Graefe, “Volcano—An Extensible and Parallel Query Evaluation System”, IEEE TKDE 1994. * Cypher openCypher 9 §Section 3 (Linear queries semantics). * GQL ISO/IEC 39075:2024 §17 (Linear queries) y §18 (Composite queries).
# RFC 009: Write clauses + execution model
> **Status:** accepted **Author(s):** Matías Fonseca **Supersedes:** —
> *Mirrored from [`docs/rfc/009-write-clauses.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/009-write-clauses.md) in the engine repo. Source of truth lives there.* **Status:** accepted **Author(s):** Matías Fonseca **Supersedes:** — ## Summary El read-path está completo: `parse → lower → execute` contra `Snapshot` (read-only) corre MATCH / Expand / Filter / Project / TopN / Aggregate / Distinct / Union / Unwind / SemiApply / PatternList / list & pattern comprehensions / `RETURN *` / `EXPLAIN`, y los 4 IC representativos del LDBC SNB Interactive (IC2/7/8/9) producen resultados correctos sobre un mini-graph. Este RFC extiende el query engine al **write-path**: las cláusulas `CREATE`, `MERGE`, `SET`, `REMOVE`, `DELETE` (y `DETACH DELETE`) parsean y producen AST válido, pero el lowering reporta `UnsupportedFeature`. Esta RFC cierra ese gap. ## Motivation Sin write-path, el subset Cypher de namidb no es completo: cualquier usuario que quiera cargar datos debe hacerlo via la API Rust `WriterSession::upsert_node/upsert_edge` directamente. Eso rompe el pitch “developer-first universal con embed + Cypher” y bloquea de plano: * LDBC SNB **Update queries** (IU1 insertPerson, IU2 addPostLike, IU3 addCommentLike, IU4 addForum, IU5 addForumMembership, IU6 addPost, IU7 addComment, IU8 addFriendship). Sin estas no hay pipeline LDBC end-to-end (load → run → measure). * Quickstart docs (“crea un nodo, agrega una arista, lee de vuelta”) — hoy requieren un `Cargo.toml` + `tokio::main` boilerplate. * Loading desde scripts Cypher (.cypher files con CREATE chains que Neo4j / Kuzu aceptan de fábrica). Costo de no hacerlo ahora: cada query LDBC IU permanece como dead-letter; los benchmarks de siguen necesitando harnesses ad-hoc que escriben via API Rust en vez de via la abstracción Cypher idiomática; ningún consumidor externo puede probar namidb sin escribir Rust. ## Design ### Operadores nuevos en `LogicalPlan`
```rust
pub enum LogicalPlan {
// ... read operators ...
Create {
input: Box,
elements: Vec,
},
Merge {
input: Box,
pattern: CreateElement,
on_match_sets: Vec,
on_create_sets: Vec,
},
Set {
input: Box,
items: Vec,
},
Remove {
input: Box,
items: Vec,
},
Delete {
input: Box,
targets: Vec,
detach: bool,
},
}
```
Helpers:
```rust
pub enum CreateElement {
Node {
alias: String,
label: String,
properties: Vec<(String, Expression)>,
},
Rel {
alias: Option,
edge_type: String,
source_alias: String,
target_alias: String,
direction: RelationshipDirection,
properties: Vec<(String, Expression)>,
},
}
pub enum SetOp {
Property { target_alias: String, key: String, value: Expression },
Replace { target_alias: String, value: Expression }, // a = {...}
Merge { target_alias: String, value: Expression }, // a += {...}
Labels { target_alias: String, labels: Vec }, // a:Label[:Label]
}
pub enum RemoveOp {
Property { target_alias: String, key: String },
Labels { target_alias: String, labels: Vec },
}
```
`children()` retorna `[input]` para los 5 nuevos. `operator_name()` retorna `"Create"`, `"Merge"`, `"Set"`, `"Remove"`, `"Delete"` (prefijado por `Detach` cuando aplica). ### Lowering rules * **CREATE** clause sin MATCH previo: `Empty → Create`. Cuando hay MATCH previo, `Create` se encadena: `... → Create { input, elements }`. Las bindings nuevas (node aliases + rel aliases) se introducen en `LowerCtx` antes del próximo clause. * **MERGE** clause: solo una pattern part en v0. Se baja a `Merge { input, pattern, on_match_sets, on_create_sets }`. Las bindings del pattern se introducen en `LowerCtx`. * **SET**: cada item se traduce a un `SetOp`; el operador `Set` lee el binding del row y muta. * **REMOVE**: similar a SET; cada `RemoveOp` se aplica. * **DELETE / DETACH DELETE**: las expressions de `targets` se evalúan per-row para producir Node/Rel/Path; el operador lo tombstones. * Una query solo-write (sin MATCH) arranca con `LogicalPlan::Empty` para proveer una “single driver row”. Esto reusa el patrón ya usado por UNWIND. Bindings de salida: al final del query, las bindings del último write clause + las del último read clause permanecen visibles si hay un RETURN posterior (Cypher 25 permite `CREATE (a:Person {name: 'Ada'}) RETURN a`). ### Executor split: read vs write Mantengo dos entry points distintos:
```rust
// Read-only path
pub async fn execute(
plan: &LogicalPlan,
snapshot: &Snapshot<'_>,
params: &Params,
) -> Result, ExecError>;
// Write-aware path
pub async fn execute_write(
plan: &LogicalPlan,
writer: &mut WriterSession,
params: &Params,
) -> Result;
pub struct WriteOutcome {
pub rows: Vec,
pub nodes_created: u64,
pub edges_created: u64,
pub nodes_deleted: u64,
pub edges_deleted: u64,
pub properties_set: u64,
}
```
`execute_write`: 1. Walk down the plan. Read operators (NodeScan/Expand/Filter/…) usan `writer.snapshot()` interno (re-pinned por clause). 2. Write operators (Create/Merge/Set/Remove/Delete) llaman `writer.upsert_node/upsert_edge/tombstone_node/tombstone_edge` per-row. 3. Al final, **auto-commit**: `writer.commit_batch().await` antes de retornar `WriteOutcome`. Garantiza durabilidad de toda la query como unidad. `execute_write` queda separado de `execute` por dos razones: * Type safety — `&mut WriterSession` vs `&Snapshot<'_>` no son intercambiables. * Permite que `execute` se siga ejecutando contra snapshots persistidos (read-replicas) en SaaS sin acoplar el writer side. ### Read-your-own-writes: NO en v0 Una query como:
```cypher
CREATE (a:Person {name: 'Ada'})
MATCH (p:Person) RETURN p.name
```
verá rows = whatever existía pre-CREATE. La nueva Ada **no** está visible al MATCH. Razón: * Implementar visibility intra-query require overlay sobre Snapshot (memtable+SST+pending\_payloads). El WriterSession actual ya tiene `pending_payloads` pero solo se aplican al memtable post-`commit_batch`. * La complejidad de read-your-own-writes choca con la semántica de cluster-distributed eventual consistency que querremos en SaaS. * La gran mayoría de queries write-then-read son separadas por commits (sesiones interactivas). LDBC IU queries son monolíticas pero write-only. Mitigación: una vez se introduzca transactional consistency real, overlay la memtable + pending → read-your-own-writes “just works”. Hasta entonces, error explícito si detectamos write-then-read en el mismo plan tree (advisor warning, no hard fail). ### MERGE semantics
```plaintext
MERGE (n:Label {key: value})
ON MATCH SET n.lastSeen = $now
ON CREATE SET n.firstSeen = $now, n.lastSeen = $now
```
Ejecución: 1. Intenta matchear el pattern (igual que MATCH). Si encuentra ≥1 row: * Para cada row matched, aplica `on_match_sets`. * Output rows reflejan los matches. 2. Si encuentra 0 rows: * Genera el pattern (igual que CREATE). * Aplica `on_create_sets` al row del CREATE. * Output rows reflejan la creación. Limitaciones v0: * Solo una pattern part por MERGE (no multi-element). RFC-004 ya rechazaba multi-label en parser. * No locks/serializability. Una MERGE concurrente con otra writer puede crear duplicados — esto queda para una RFC futura. ### DETACH DELETE semantics
```plaintext
MATCH (a:Person {id: $id}) DETACH DELETE a
```
Para cada `a` matched, antes de tombstone el node, enumera todas las edges incidentes vía `out_edges(*, a.id) + in_edges(*, a.id)` para CADA edge\_type declarado en el manifest schema, y las tombstones primero. Luego tombstone el node. DELETE sin DETACH falla con `ExecError::Mutation` si el node tiene edges (mensaje explícito sugiriendo DETACH). ### Path binding (caso simple)
```rust
pub enum RuntimeValue {
// ...
Path(Vec), // alternating Node, Rel, Node, Rel, ..., Node
}
```
Para `MATCH p = (a)-[r]->(b) RETURN p`: * `PatternPart.binding = Some(p)` se baja a `Expand { ..., path_binding: Some("p") }`. * El executor, al producir cada row, materializa `[a_value, r_value, b_value]` y bindea a `p`. * Para chains más largos `p = (a)-[r1]->(b)-[r2]->(c)`, el executor acumula a través del Expand chain. Variable-length paths (`p = (a)-[*1..3]->(b)`) requieren materializar listas de longitud variable y quedan diferidos. `fingerprint_value` se extiende con un caso `Path(items)` para que Distinct + collect distinct funcionen sobre paths. ## Alternatives considered **A. Single executor entry que toma `&mut WriterSession` siempre.** Rechazada: Snapshot read path es claramente diferente del write path (no mutación, lifetime más corto, posible read-only replica). Forzar WriterSession en TODOS los reads acopla los SaaS paths. **B. Lazy commit (caller decides cuándo flush).** Rechazada para v0: hace que `execute_write` retorne un handle a un “pending transaction” y requiere transaction API formal. La sentencia “una query es una transacción” es predecible y suficiente para LDBC IU + quickstart. **C. Read-your-own-writes via overlay.** Considerada pero deferida: el overlay sobre Snapshot requiere mantener un view temporal “memtable + pending\_payloads + el plan write effects acumulados hasta ahora”. Es \~300 LoC y complica el reasoning sobre snapshot lifetimes. Vuelve a futuro con el transactional model. **D. MERGE con locks.** Considerada y rechazada para v0: requiere coordinación a nivel WriterSession (single-writer per namespace ya nos da serialización a nivel de namespace, pero MERGE necesita serialization local entre clauses). Vive bien con LWW pero introduce flakiness en tests si dos writers race. Mientras tenga single-writer-per-namespace (que tiene), MERGE es safe. **E. Mantener Create/Merge/Set/Remove/Delete como UnsupportedFeature.** Rechazada: bloquea LDBC IU y quickstart developer experience indefinidamente. El opportunity-cost de no tenerlos es mayor que la complejidad de implementarlos ahora. **F. Soportar variable-length path bindings de entrada.** Rechazada: materializar lista de longitud variable + interaccionar con `Expand` multi-hop es \~150 LoC más y un test surface considerable. El caso simple cubre la mayoría de quickstart docs; var-len queda diferido. ## Drawbacks 1. **No read-your-own-writes** rompe expectativas de usuarios que vienen de Neo4j / Kuzu. Mitigación: documentar explícitamente en README + retornar warning en `WriteOutcome` si se detectó el pattern; cerrar a futuro. 2. **Auto-commit per query** no permite multi-statement transactions. Para LDBC IU es suficiente (cada IU es atomic by design); para workloads ETL más complejos no. Mitigación: a futuro se introduce explicit `BEGIN TRANSACTION ... COMMIT` clauses con session API. 3. **MERGE sin locks** depende del single-writer-per-namespace invariant. Si en multi-tenant SaaS hacemos multi-writer sharded namespaces, MERGE necesita revisitarse. Documentado. 4. **DETACH DELETE enumeration is O(edge\_types × incident\_edges).** Para nodos high-degree (super-nodes) puede ser caro. Acceptable para v0; optimización vive junto con el catálogo de edge\_types activos. 5. **`WriteOutcome` counters son aproximados.** Counters incrementan por cada operación del executor, no por cada cambio real de estado (e.g. SET de la misma propiedad al mismo valor cuenta como 1 property\_set aunque sea no-op). Documentado. ## Open questions * **Q1: WriteOutcome.rows.** ¿Una query write-only (CREATE sin RETURN) retorna `Vec` vacío? Cypher dice sí. ¿Y con RETURN? `RETURN a` después de CREATE retorna el row con `a` bound. Implementar igual que un Project encima del Create. * **Q2: Schema discovery via CREATE.** Si CREATE introduce una label o edge\_type nueva, ¿se autopopula el schema en el manifest? RFC-002 permite schema implícita via property names. Sí — el executor introspecciona la label + edge\_type y los agrega si no existen. Requiere que `WriterSession` exponga un schema extension API; hoy el commit\_batch no toca schema. Pieza adicional. * **Q3: Multi-statement Cypher.** `CREATE (a) ; CREATE (b)` (con semicolon). Hoy parser lo acepta como query terminator pero no como separator entre statements. ¿Statement separator es necesario para Cypher scripts? Diferido. ## References * openCypher 9 §6 (Write clauses), §7 (Reading + writing clauses). * GQL ISO/IEC 39075:2024 §19 (Linear data modifications). * Neo4j MERGE semantics: * Kuzu storage write path: kuzudb/kuzu README §“Bulk loading + transactions”. * DuckDB inserts as plans: * RFC-008 (Logical Plan IR + addendum). * RFC-002 (SST format) — schema introspection at storage layer.
# RFC 010: Cost-Based Optimizer — Foundation
> **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** —
> *Mirrored from [`docs/rfc/010-cost-based-optimizer.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/010-cost-based-optimizer.md) in the engine repo. Source of truth lives there.* **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** — ## Summary Fija la base del cost-based optimizer (CBO) que cierra el gate de LDBC SNB Interactive dentro de 2× de Kuzu. El alcance de **esta** RFC es solamente la **fundación**: un catálogo de estadísticas derivado del `Manifest`, una rutina de estimación de cardinalidad por operador, una rutina de estimación de selectividad de predicates, y `EXPLAIN VERBOSE` con números. Los rewrites estructurales (predicate pushdown, join reorder, hash-conversión de `SemiApply`/`CrossProduct`) son out-of-scope explícito de esta RFC; se encadenan en RFCs siguientes sobre esta base. La RFC se publica como **draft pre-implementation** para alinear shape y fórmulas antes de quemar decisiones de rewrite. Las cifras concretas salen del fixture LDBC SNB micro-graph (6 Person / 8 Message / 4 Comment) y de las estructuras `PropertyColumnStats` / `DegreeHistogram` que ya viven en cada `SstDescriptor`. Out-of-scope explícito de esta RFC: * Predicate pushdown / filter merging. * Join-order DP/greedy sobre `Expand` chains y `CrossProduct`. * Conversión `SemiApply` → `HashSemiJoin` y `CrossProduct` con shared bindings → `HashJoin` (ver RFC-011). * Histogramas equi-depth o quantiles para selectividad de rangos precisa. * HyperLogLog actually populated por el writer (hoy `ndv_estimate` es siempre `None` — la plomería existe, el cómputo todavía no). * Adaptive / runtime cost feedback. * PROFILE con observed cardinality post-ejecución. * Cost model multi-namespace / partition-aware. ## Motivation El executor naïve inicial ejecuta cualquier `LogicalPlan` válido correcta y deterministamente. El problema visible es: 1. **Multi-pattern MATCH naïve.** `MATCH (a:Person {id: $x}), (b:Message {id: $y})` se baja a `CrossProduct { NodeById, NodeById }`. Sin reorder el outer puede ser el lado pesado; con `SemiApply` el outer nested-loop reejecuta el subplan |outer| veces sin cache. 2. **EXPLAIN sin números.** El árbol del plan es indentado pero no dice cuántas filas espera procesar cada operador. Sin números, ningún rewrite tiene base para decidir “este Expand explota a 10 K rows, conviene pushear el Filter antes”. 3. **`PropertyColumnStats` + `DegreeHistogram` sin consumer.** Las dos estructuras viven en cada `SstDescriptor`. El writer las puebla con `min`/`max`/`null_count` (HLL todavía no), pero ningún consumer las lee — son data dormida. El costo de no hacerlo ahora es: * Los rewrites posteriores tendrían que inventar su propio cost model inline, contaminando cada paso con lookups de stats. * Las decisiones de pushdown/reorder se tomarían a ciegas (heurísticas sin números), reproduciendo el problema “Cypher.runtime=slotted” de Neo4j: optimizaciones que parecen razonables pero pierden en queries reales. * LDBC SNB SF1 (gate) no se puede preparar sin una baseline numérica que diga *dónde* el plan actual gasta tiempo. Hacerlo ahora cuesta \~1 500 LoC (módulo `cost::`, EXPLAIN VERBOSE, smoke tests) y abre la puerta a rewrites sin refactor. ## Design ### 1. Catálogo de estadísticas (`StatsCatalog`) crates/namidb-query/src/cost/stats.rs
```rust
pub struct StatsCatalog {
labels: BTreeMap,
edge_types: BTreeMap,
/// Total nodes across all labels — usado como denominador para
/// estimaciones de patrones anónimos (label desconocido).
total_nodes: u64,
/// Total edges across all edge types — análogo para edges
/// anónimos.
total_edges: u64,
}
pub struct LabelStats {
pub name: String,
/// Σ row_count - tombstone_count sobre SSTs del label (no incluye
/// memtable: el catálogo se construye desde Manifest committed).
pub node_count: u64,
/// Propiedad → estadísticas por columna. Se mergean per-name a
/// través de todos los SSTs del label.
pub properties: BTreeMap,
}
pub struct PropStats {
pub null_count: u64,
pub non_null_count: u64,
pub min: Option, // reusado de storage::sst::stats
pub max: Option,
/// NDV decodificado del HLL fused; `None` cuando el writer no
/// pobló el sketch (caso default en v0).
pub ndv: Option,
}
pub struct EdgeTypeStats {
pub name: String,
/// Σ row_count - tombstone_count sobre SSTs `EdgesFwd` del tipo.
pub edge_count: u64,
/// avg_degree para src → dst, derivado de degree_histogram fused.
/// Si no hay SST `EdgesFwd`, es 0.
pub avg_out_degree: f64,
pub max_out_degree: u64,
/// idem para EdgesInv (dst → src).
pub avg_in_degree: f64,
pub max_in_degree: u64,
/// Schema-declared endpoints. `None` cuando no hay schema explícito
/// (caso típico hoy: las queries inferieron edge_type del pattern).
pub src_label: Option,
pub dst_label: Option,
}
```
**Construcción:**
```rust
impl StatsCatalog {
pub fn from_manifest(m: &Manifest) -> Self;
pub fn empty() -> Self; // fallback cuando el query corre sin Snapshot
pub fn label(&self, name: &str) -> Option<&LabelStats>;
pub fn edge_type(&self, name: &str) -> Option<&EdgeTypeStats>;
pub fn total_nodes(&self) -> u64;
pub fn total_edges(&self) -> u64;
}
```
**Merge de stats per-label**: itera `m.ssts` filtrando por `kind == SstKind::Nodes && scope == label`. Para cada `SstDescriptor`: * `node_count += row_count - tombstone_count` (`KindSpecificStats::Nodes`). * Para cada `PropertyColumnStats`: * `null_count += sst.null_count`. * `non_null_count += (row_count - tombstone_count - null_count)`. * `min = stat_min(self.min, sst.min)` (lex-order según tipo). * `max = stat_max(self.max, sst.max)`. * `ndv`: cuando los SSTs traen HLL (v1 follow-up), fuse; v0 → `None`. **Merge de stats per-edge\_type**: itera `m.ssts` filtrando por `(EdgesFwd, edge_type)` y `(EdgesInv, edge_type)`: * `edge_count = Σ row_count(EdgesFwd) - Σ tombstone_count(EdgesFwd)`. * `avg_out_degree = sum_degree(EdgesFwd) / key_count(EdgesFwd)` (Σ y Σ). * `max_out_degree = max(max_degree(EdgesFwd))` across SSTs. * idem para `EdgesInv` → `avg_in_degree`, `max_in_degree`. * `src_label` / `dst_label`: lookup `m.schema.edge_type(name)`; si no hay declaración, `None`. **Coste de construcción**: O(|ssts|). En un manifest real típico (1 M nodos / 1 M edges sobre R2) son \~10² SSTs — micro-segundos. Para SF1 LDBC (\~3 M nodes / 17 M edges) serán \~10³ SSTs, sigue siendo sub-milisegundo. El catálogo se construye **una vez por `Snapshot`** y se reutiliza para todas las optimizaciones del plan; no es hot-path. **Edge case — schema vacío + zero SSTs (CLI ephemeral `namidb run`):** el catálogo retorna `LabelStats::empty()` para cualquier label solicitado. La cardinalidad cae al fallback default (ver §3.4) y EXPLAIN VERBOSE marca el nodo con `(no stats)`. Esto permite que `namidb explain --verbose` funcione sin datos cargados, útil para debugging del plan shape. ### 2. Selectividad de predicates (`cost::selectivity`) Función pura: dado un `Expression`, un `LabelStats` (o tabla de `LabelStats` por alias) y un mapa de tipos opcional, retorna la fracción esperada de filas que satisface el predicate.
```rust
pub fn selectivity(
expr: &Expression,
bindings: &BindingStats,
) -> f64;
pub struct BindingStats<'a> {
/// alias → LabelStats. None cuando el alias no está bound a un
/// label conocido (Argument / Project synthetic / etc).
pub by_alias: BTreeMap,
}
```
**Reglas (v0):** | Predicate | Estimación | | -------------------------------------------------------- | -------------------------------------------------------------- | | `prop = literal` | `1 / ndv(prop)` si hay HLL; `0.1` fallback (10 %). | | `prop <> literal` | `1 - eq_sel(prop, literal)`. | | `prop < literal` | rango sobre `[min, max]` si min/max + tipo numérico; `0.33`. | | `prop <= literal` / `prop > literal` / `prop >= literal` | mismo trato que `<`. | | `prop BETWEEN low AND high` | rango bilateral; `0.25` fallback. | | `prop IN [list]` | `min(1, len(list) / ndv)`; `min(1, len(list) * 0.1)` fallback. | | `prop IS NULL` | `null_count / (null_count + non_null_count)`; `0.05` fallback. | | `prop IS NOT NULL` | `1 - is_null_sel`. | | `prop STARTS WITH 'p'` | `0.1` (sin tries / sin index). | | `prop CONTAINS 'p'` / `ENDS WITH 'p'` | `0.1`. | | `prop LIKE 'pattern'` (no soportado) | n/a. | | `__label_eq(alias, L)` | fold pre-Filter — siempre `1.0` (el operador ya garantiza). | | `AND` | producto: `sel(left) * sel(right)`. Asume independencia. | | `OR` | unión: `sel(left) + sel(right) - sel(left)*sel(right)`. | | `NOT (pred)` | `1 - sel(pred)`. | | `XOR` | `sel(left) + sel(right) - 2*sel(left)*sel(right)`. | | Cualquier otro caso | `0.5` (unknown). | **Independencia**: asumimos columnas independientes — clásico Selinger ‘79. Es un fallback, no un teorema; selectividades correlacionadas quedan para más adelante (multi-column histograms). **Rangos**: para `prop < lit` y un `PropStats { min, max }` numérico, `sel = clamp01((lit - min) / (max - min))`. Si `min == max`, retorna `1.0` cuando `min < lit` y `0.0` otherwise (degenerate column). **Tipos no comparables** (e.g. `min: Utf8`, `lit: Int64`): el selector cae al fallback `0.33`. La selectividad nunca propaga errores — robustez sobre exactitud. **Tabla rationale**: los defaults de la columna derecha siguen el folklore PostgreSQL `default_statistics_target=100` calibrado para queries OLTP, no porque sean “verdad”, sino porque son el menor mal en ausencia de stats reales. En particular el `0.1` para `eq` es el “selectividad agresiva” que prefiere planes index-friendly cuando hay duda. Se documentan acá para auditarlas después. ### 3. Estimación de cardinalidad por operador Función pura sobre el árbol:
```rust
pub fn estimate(plan: &LogicalPlan, catalog: &StatsCatalog) -> Cardinality;
pub struct Cardinality {
/// Filas estimadas que emite este nodo.
pub rows: f64,
/// Cardinalidad de los inputs, en mismo orden que `plan.children()`.
pub children: Vec,
/// Bindings que el operador deja "vivos" downstream, junto con la
/// `LabelStats` asociada cuando se conoce. Heredado por el padre.
pub bindings: BTreeMap,
}
pub struct BindingMeta {
/// Cuando el binding está bound a un nodo de un label conocido,
/// referenciamos esa LabelStats por nombre. (No anidamos el
/// borrow porque `Cardinality` es owned.)
pub label: Option,
/// Cuando el binding es de un edge.
pub edge_type: Option,
}
```
#### 3.1 Operadores leaf | Operador | Cardinalidad | | ----------------------- | --------------------------------------------------------------------- | | `Empty` | `1.0` (single driver row; consistente con `RETURN 1+1` retornando 1). | | `Argument { bindings }` | `1.0` (placeholder de outer; siempre exactamente una fila). | | `NodeScan { label }` | `catalog.label(label).node_count` (0 si no hay stats). | #### 3.2 Operadores con un input | Operador | Cardinalidad | | -------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | | `NodeById { input, .. }` | `min(input.rows, 1.0)` cuando `input.rows >= 1`. Si `input` es `Empty`, `1.0`. Punto-lookup, asume hit típico. | | `Expand { input, edge_type, direction, optional }` | `input.rows * branch_factor(edge_type, direction)` + `if optional && branch == 0 { input.rows }`. | | `Filter { input, predicate }` | `input.rows * selectivity(predicate, bindings)`. | | `Project { input, distinct: false }` | `input.rows` (projection no cambia cardinalidad). | | `Project { input, distinct: true }` | `dedup_estimate(input)` — `min(input.rows, Π ndv(item))` cuando los items son props con NDV; fallback `input.rows^0.7`. | | `Distinct { input }` | `dedup_estimate(input)`. | | `Aggregate { input, group_by, .. }` | si `group_by.is_empty()`: `1.0`. Si no: `Π ndv(group_by_i)` truncado a `input.rows`; fallback `input.rows ^ 0.5`. | | `TopN { input, skip, limit, .. }` | `min(input.rows - skip, limit)` clamp a `[0, input.rows]`. | | `Unwind { input, list }` | `input.rows * avg_list_length(list)` — para `list = Literal::List(xs)` usamos `xs.len()`; para `Parameter` o `Variable` usamos default `5.0`. | | `PatternList { input, subplan, .. }` | `input.rows` (emite una row por outer; la lista es value, no rows). | | `Argument`-like wrappers | identidad. | #### 3.3 Operadores con dos inputs | Operador | Cardinalidad | | ---------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | | `CrossProduct { left, right }` | `left.rows * right.rows`. Si comparten un binding, un rewrite posterior lo convierte a `HashJoin` y la fórmula cambia. | | `Union { left, right, all: true }` | `left.rows + right.rows`. | | `Union { left, right, all: false }` | `dedup_estimate(left + right)` aproximado como `max(left.rows, right.rows) + 0.5 * min(...)`. | | `SemiApply { input, subplan, negated: false }` | `input.rows * min(1.0, subplan.rows)` — naïve probabilidad de match. | | `SemiApply { input, subplan, negated: true }` | `input.rows * max(0.0, 1.0 - subplan.rows)`. | #### 3.4 `branch_factor(edge_type, direction)` para `Expand` Cuando `edge_type` está declarado:
```plaintext
if direction == Right (out):
branch = catalog.edge_type(et).avg_out_degree
elif direction == Left (in):
branch = catalog.edge_type(et).avg_in_degree
elif direction == Both:
branch = avg_out_degree + avg_in_degree
```
Cuando `edge_type` es `None` (anonymous `-[]-`): suma sobre todos los edge\_types `Σ avg_*_degree`. Fallback default `2.0` cuando no hay stats. Para `Expand { length: Some(l) }` (variable-length): `branch ^ l.max` hasta cap `MAX_VARLEN_BRANCH = 10_000` (para que `*1..6` no exploten el estimate a infinito en grafos densos). Esta es la fórmula naive de DuckDB-graph; mejora con Markov / random-walk va a futuro con WCOJ. #### 3.5 Operadores write Los `Create/Merge/Set/Remove/Delete` retornan `0.0` rows (el executor no emite tuplas; emite `WriteOutcome`). Su `input` mantiene su cardinalidad para EXPLAIN VERBOSE pero el operador write en sí es “sink”. #### 3.6 Bindings y heredado * `NodeScan { label, alias }` introduce `alias → BindingMeta { label, .. }`. * `NodeById` idem. * `Expand { target_alias, target_label, .. }` introduce `target_alias` con `label = target_label` cuando el lowering lo declaró. * `Project { distinct, items, discard_input_bindings: true }` reemplaza el set de bindings con los aliases del proyectado (sin LabelStats asociada, salvo que el item sea `Variable(x)` con `x` un alias pre-existente). * `Project { discard_input_bindings: false }` (WITH) merge: agrega los aliases de los items sobre los heredados. * Los demás operadores (`Filter/TopN/Distinct/Unwind/...`) heredan bindings sin modificar. ### 4. `EXPLAIN VERBOSE` Nueva función `explain_verbose(plan, catalog) -> String` que extiende `explain(plan)` con cardinalidad estimada por nodo y costo total. **Formato:**
```plaintext
TopN keys=[m.creationDate DESC, m.id ASC] limit=20 (est=20)
Project [...] (est=20)
Expand source=p edge_type=KNOWS dir=-> target=friend (est=180)
NodeById label=Person id=$personId (est=1)
Empty (est=1)
```
**Convenciones:** * `(est=N)` redondea `f64` a entero positivo (ceil cuando `0 < x < 1`, para no mostrar “est=0” a un operador que sí emite filas). * Para nodos cuyo `LabelStats` no existe en el catálogo, se agrega `(no stats)` después del `(est=...)`. * El header del root incluye total: `# Estimated rows: N` antes del árbol. **Total cost (informativo):** Σ rows sobre todos los nodos. No es un cost en sentido fuerte (no factoriza CPU vs IO), es una baseline para comparar plans pre y post-rewrite. Futuras iteraciones refinarán si necesario. **Parser**: `EXPLAIN VERBOSE `. Sintaxis: * `EXPLAIN` sin VERBOSE: comportamiento actual (sin números). * `EXPLAIN VERBOSE`: agrega `Query.explain_verbose: bool` flag (además del `explain: bool` existente). `Display for Query` round-trips. * `EXPLAIN VERBOSE` exige stats; cuando se invoca sin Snapshot (CLI ephemeral), usa `StatsCatalog::empty()` y todos los nodos se marcan `(no stats)`. No es error. ### 5. Integración con CLI `namidb explain --verbose ` activa el flag. La query string tampoco necesita el `VERBOSE` prefix (`--verbose` lo inyecta). `namidb run ` sigue siendo read/write como hoy; el cost model **no** afecta la ejecución en esta versión (no hay rewrites todavía). ### 6. API pública del crate crates/namidb-query/src/cost/mod.rs
```rust
pub mod stats;
pub mod selectivity;
pub mod cardinality;
pub use stats::{StatsCatalog, LabelStats, EdgeTypeStats, PropStats};
pub use selectivity::selectivity;
pub use cardinality::{estimate, Cardinality, BindingMeta};
// Re-exports desde lib.rs
pub use crate::cost::{StatsCatalog, estimate};
pub use crate::plan::explain_verbose;
```
## Alternativas consideradas ### A. Inferir stats del primer `scan_label` (lazy) Levantar el catálogo cada vez que el optimizer toca un operador con label desconocido. Rechazado: triple-pago de IO si dos ramas del plan hablan del mismo label, y rompe el invariante “todo el plan se optimiza antes de empezar a ejecutar” (necesario para correctness de pushdown). ### B. Cost en BigDecimal / fixed-point `f64` puede acumular error de redondeo en plans de 10+ operadores. Rechazado: el error relativo de un f64 sobre 10 operaciones está en \~10^-13, varios órdenes de magnitud por debajo del error de modelo (asumir independencia ya introduce 10–50 %). El folklore PostgreSQL / DuckDB usa f64; no inventemos un problema que no existe. ### C. Sketch-only (sin min/max), HLL-everywhere Hace los rangos imposibles. Rechazado: rangos numéricos sobre `creationDate` aparecen en 7/14 LDBC IC; sin min/max el estimate del filter colapsa al fallback 0.33 y matamos el optimizer en queries date-bounded. ### D. Cost model basado en bytes (DuckDB-style “rows × width”) Multiplicar `rows` × `avg_row_bytes` para tener algo cercano a IO. Rechazado por ahora: el executor naïve mantiene todo en memoria; no hay disco-spill ni vectorización donde el ancho importe. Con morsels y Arrow vectorization sí, y ahí refinamos. ### E. Manifest-side reporta `StatsCatalog` ya armado Mover `from_manifest` al crate `namidb-storage`. Rechazado: el catálogo lo consume el query layer; mantenerlo en `namidb-query` preserva separation of concerns y permite que el storage lib quede agnóstico de PropStats con NDV (que es concepto de query). El storage expone `Manifest`, `SstDescriptor`, `PropertyColumnStats`, `DegreeHistogram` — primitivas, no agregados. ### F. Pre-construir el catálogo cuando el manifest se carga El `Snapshot::new` podría construir `StatsCatalog` y exponerlo via `Snapshot::stats()`. Considerado, **deferido**: requeriría exportar el tipo cross-crate. Por ahora el caller (executor o CLI) construye el catálogo a partir de `snapshot.manifest().manifest`. La API `from_manifest(&Manifest)` queda pura. ## Drawbacks 1. **HLL no poblado → eq selectivity siempre 0.1.** Hoy el writer no emite sketches, así que para `prop = literal` el optimizer usa fallback aunque haya min/max. Es aceptable v0; HLL real va en follow-up (writer side \~200 LoC, cost-side cero). 2. **`avg_degree` es promedio, no mediana.** Distribuciones power-law (típicas de social graphs: LDBC SNB tiene exponente \~2.3) hacen que el promedio sea engañoso — un fan-out de 100 K en un super-nodo eleva el avg sin que la mayoría de nodos lo cumpla. Hoy `degree_histogram` está disponible pero no lo usamos en la fórmula (los buckets log₂ están ahí para join-order percentile-based futuro). Documentado. 3. **Selectividad asume independencia entre columnas.** En LDBC SNB, `Person.firstName` y `Person.lastName` son altamente correlacionados con `id`; un `WHERE firstName='Alice' AND lastName='Smith'` puede ser mucho más selectivo que el producto. A futuro introducimos multi-column stats. 4. **No hay sample-based cardinality.** PostgreSQL y CockroachDB hacen sampling para columnas con histogramas. Acá no — el writer no muestrea y el cost path no lo invoca. Llega a futuro con el morsel executor donde sampling es \~free. 5. **Stats viven en el manifest committed → no incluye memtable.** Las queries que corren contra una `Snapshot` con memtable activo (caso normal de single-writer) usan estimates del manifest sin contar las filas no-flushed. Cuando el writer está callado, es \~OK; cuando hay ingest activo, el catálogo subestima. Aceptable v0: el writer flush-cadence típico es ≤1 GB de memtable, así que el under-estimate está acotado. A futuro el vectorized executor agregará `memtable_stats` live. 6. **`Cardinality` paraleliza el árbol del plan.** En vez de mutar `LogicalPlan` con annotations inline, retornamos un árbol paralelo `Cardinality`. Es \~2× memoria del plan pero mantiene `LogicalPlan` inmutable (otros consumers — EXPLAIN, executor, future PROFILE — no tienen que filtrar las annotations). Trade-off explícito. ## Open questions * **OQ1.** ¿Selectividad debe ser `f64` o `Probability` (tipo wrapper con clamp a \[0,1])? Hoy es `f64`; v1 considera wrapper si vemos un bug por overflow. * **OQ2.** ¿`StatsCatalog::from_manifest` debe ser `async` (por si en el futuro lee sketches HLL desde un side-car)? Por ahora se mantiene síncrono — todo lo que necesita está in-line en el manifest. Si HLL side-car aterriza, se rompe la API y lo trabajamos. * **OQ3.** ¿Cost total debe ser `Σ rows` (estimate-based) o `Σ rows × per-operator-weight` (CPU model)? Por ahora usamos el primero. El segundo llega cuando midamos costo real por operador en el morsel executor. * **OQ4.** Cómo expresamos “shared bindings entre lados de CrossProduct” en el modelo. Hoy `CrossProduct` cardinality es `L × R`. Cuando se introduzca `HashJoin`, queremos algo como `(L × R) / max(ndv(shared_key, L), ndv(shared_key, R))`. La estructura de `BindingMeta` ya carga el alias; falta agregar acceso a PropStats del binding desde Cardinality. ## References * Selinger et al., *Access Path Selection in a Relational Database Management System* (SIGMOD ‘79) — origen del cost-based optimizer y del fallback 0.1. * Heimel et al., *Hardware-Oblivious Parallelism for In-Memory Column-Stores* — defaults modernos para selectividad sin index. * PostgreSQL `default_statistics_target` documentation — fuente de los fallbacks numéricos. * Kuzu paper (Mhedhbi & Salihoglu, SIGMOD ‘23) — cardinality estimation para graph join enumeration via WCOJ; referencia para trabajo futuro. * DuckDB CBO blog series (Mark Raasveldt 2023) — uso de stats inline-en-Parquet para skipping y join-order; mismo patrón que acá. * HyperLogLog++ paper (Heule et al., EDBT ‘13) — formato del sketch cuando aterrice el writer. * `docs/rfc/008-logical-plan-ir.md` — operadores que esta RFC anota. * `docs/rfc/009-write-clauses.md` — write ops que retornan 0 rows. ## Plan de implementación 1. Crate `namidb-query`: * `src/cost/mod.rs` — re-exports. * `src/cost/stats.rs` — `StatsCatalog`, `LabelStats`, `EdgeTypeStats`, `PropStats` + `from_manifest`. \~300 LoC + 6-8 unit tests. * `src/cost/selectivity.rs` — `selectivity(expr, bindings) -> f64`. \~250 LoC + 10-12 unit tests (eq/range/IN/AND/OR/NOT/IS NULL/ STARTS WITH/fallback). * `src/cost/cardinality.rs` — `estimate(plan, catalog) -> Cardinality`. \~350 LoC + 8-10 unit tests cubriendo cada operator. 2. `src/plan/explain.rs`: * `explain_verbose(plan, catalog) -> String`. \~80 LoC + 5 tests. 3. `src/parser/grammar.rs`: * Reconocer `VERBOSE` como soft keyword después de `EXPLAIN`. * `Query.explain_verbose: bool`. Display round-trips. * \~30 LoC + 3 tests. 4. CLI: * `namidb explain --verbose `. \~15 LoC. 5. Tests integration: * `tests/cost_smoke.rs` — micro-graph → `StatsCatalog::from_manifest`, `estimate(plan)` vs `execute(plan).len()`. Documentar gap. 6-8 tests cubriendo IC2/IC7/IC8/IC9 + filter selectivity sweep. Snapshot esperado: * `cargo test --workspace --exclude namidb-py`: 348 → \~390 passed. * `cargo clippy --workspace --all-targets -- -D warnings`: clean. * `cargo fmt --all -- --check`: clean. * LoC nuevo: \~1 500 src + \~500 tests. * Sin cambios en `namidb-storage` (consumer-only).
# RFC 011: Predicate Pushdown + Filter Normalization
> **Status:** draft **Author(s):** Matías Fonseca AND split + literal folding + `__label_eq` elimination) **Builds on:** RFC-010 (cost model foundation) **Supersedes:** —
> *Mirrored from [`docs/rfc/011-predicate-pushdown.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/011-predicate-pushdown.md) in the engine repo. Source of truth lives there.* **Status:** draft **Author(s):** Matías Fonseca AND split + literal folding + `__label_eq` elimination) **Builds on:** RFC-010 (cost model foundation) **Supersedes:** — ## Summary Primer rewrite estructural del optimizer: empuja cada predicate del `LogicalPlan` lo más cerca posible de los operadores leaf (NodeScan, NodeById, Argument, Empty), reduciendo la cardinalidad que los operadores caros (`Expand`, `CrossProduct`, `SemiApply`, `PatternList`) procesan. El alcance es **solamente el predicate pushdown a nivel `LogicalPlan`**: el rewriter manipula la estructura del árbol; **no** baja predicates al storage layer (Parquet predicate pushdown queda diferido), **no** reordena joins, **no** convierte `SemiApply`/`CrossProduct` a hash joins. Acompañando el pushdown, incluye tres normalizaciones del Filter tree que el lowering deja sub-óptimo y que el pushdown necesita para funcionar: 1. **AND-split** — `Filter(a AND b AND c)` se descompone en tres conjuntos pushables independientemente. 2. **Adjacent merge** — dos `Filter` consecutivos post-pushdown se fusionan en uno con AND, para minimizar nodos en EXPLAIN. 3. **Literal fold** — `Filter(true)` se elimina (`input` directo); `Filter(false)` se preserva (no se sustituye por `Empty` porque el plan podría seguir requiriendo bindings sin filas; el executor maneja 3VL). 4. **`__label_eq` cleanup** — el `Filter(__label_eq(target, L))` que el lowering inyecta defensivamente arriba de un `Expand` con `target_label=Some(L)` se elimina (el operador ya garantiza el label en la capa storage). El contrato público cambia: `lower(query)` sigue siendo puro (unchanged), pero **`execute` / `execute_write` ahora aplican `optimize` por default**. EXPLAIN VERBOSE muestra el plan optimizado; EXPLAIN RAW (nueva sintaxis) muestra el plan literal del lowering. Out-of-scope explícito: * Parquet predicate pushdown al storage layer. * Join-order DP/greedy sobre `Expand` chains y `CrossProduct`. * Conversión `SemiApply`/`CrossProduct` con shared bindings → HashJoin (ver RFC-012). * Projection pushdown / column pruning. * Boolean simplification más allá de `true`/`false` literales (De Morgan, Karnaugh, common-subexpression). * HLL populated por el writer (RFC-010 §“Drawbacks 1”). ## Motivation La **fundación** del optimizer (catálogo de stats real, selectividad, cardinalidad, EXPLAIN VERBOSE) ya está. El gap visible es: el plan que el lowering produce hoy es estructuralmente naïve y deja trabajo grueso sobre la mesa. Ejemplo concreto, consulta LDBC SNB IC-shape:
```cypher
MATCH (a:Person)-[:KNOWS]->(b:Person)
WHERE a.age > 30 AND b.firstName = 'Alice'
RETURN b.id
```
Lowering produce:
```plaintext
Project [b.id]
Filter (a.age > 30 AND b.firstName = 'Alice')
Filter (__label_eq(b, "Person"))
Expand source=a edge_type=KNOWS dir=-> target=b
NodeScan label=Person alias=a
```
Sobre el micro-graph LDBC (6 Person, avg\_degree=1, age:\[25,40]), eso expande 6 Person × 1 = 6 pairs antes de filtrar — pequeño, pero la forma es la misma sobre SF1 (3 M Person, avg\_degree≈30): 90 M pairs materializados antes de filtrar al \~17 % (`age > 30 ⇒ sel≈0.67`, `firstName='Alice' ⇒ sel≈0.10⇒0.067` total). Con pushdown:
```plaintext
Project [b.id]
Expand source=a edge_type=KNOWS dir=-> target=b
Filter (a.age > 30)
NodeScan label=Person alias=a
(Filter b.firstName = 'Alice' queda arriba — refiere target_alias)
```
Filter sobre `a` baja por debajo del Expand (1 M Person → 670 k); el filter sobre `b` queda por estructura (refiere el alias introducido por el Expand). El Expand procesa 670 k × 30 = 20 M pairs en lugar de 90 M. **4.5× menos trabajo procesado**, sin tocar storage. El costo de no hacerlo: * Cada query LDBC con WHERE compuesto paga el costo de un plan estructuralmente subóptimo. SF1 gate inalcanzable. * Join reorder y hash conversión operan sobre el plan optimizado de filters; sin pushdown previo, los algoritmos de reorder ven `CrossProduct + Filter combinado` en lugar de `Filter ⇒ Subtree`, lo cual oscurece el grafo de joins. * EXPLAIN VERBOSE hoy muestra `(est=N)` que es matemáticamente correcto pero no refleja el plan que el motor podría correr — el número es engañoso porque el plan está mal estructurado. Hacerlo ahora cuesta \~1 500 LoC src + \~700 LoC tests y desbloquea join reorder y hash conversion. ## Design ### 1. API pública crates/namidb-query/src/optimize/mod.rs
```rust
/// Apply the full optimizer pipeline to `plan`. Idempotent — calling
/// `optimize(optimize(p, c), c)` returns a structurally identical plan.
///
/// Today the pipeline consists of `predicate_pushdown` followed by
/// `normalize_filters` (AND-split, adjacent-merge, literal fold,
/// `__label_eq` elimination) iterated to fixpoint (cap 8 rounds).
pub fn optimize(plan: LogicalPlan, catalog: &StatsCatalog) -> LogicalPlan;
/// Push every `Filter` predicate as close to the leaves as possible.
/// Splits AND-conjunctions and dispatches each conjunct independently
/// based on the aliases it references.
pub fn predicate_pushdown(plan: LogicalPlan) -> LogicalPlan;
/// Tidy the Filter tree: merge adjacent Filters into a single AND,
/// fold `Filter(true)` away, drop the `__label_eq` defensive filter
/// when the immediate child is an Expand already constraining the
/// target label.
pub fn normalize_filters(plan: LogicalPlan) -> LogicalPlan;
```
`StatsCatalog` se acepta para que el pipeline pueda usar estimates para decisiones futuras (`predicate_pushdown` actual no lo necesita — es estructural — pero ya queda en la firma para evitar romper el contrato cuando un rewrite futuro lo necesite).
```rust
// crates/namidb-query/src/lib.rs (nuevo)
pub use optimize::{optimize, predicate_pushdown, normalize_filters};
/// Convenience: lower + optimize. Used by the executor and EXPLAIN
/// VERBOSE by default. Tests that want the raw lowering should call
/// `lower(query)` directly.
pub fn plan(query: &Query, catalog: &StatsCatalog) -> Result {
Ok(optimize(lower(query)?, catalog))
}
```
`execute(plan, snapshot, params)` y `execute_write(plan, writer, params)` no cambian — siguen aceptando un `LogicalPlan` listo. El cambio es en los **call sites** (CLI, walker bench, tests integration): donde antes hacían `let p = lower(&query)?;`, ahora hacen `let p = plan(&query, &catalog)?;`. Tests internos que prueban operadores específicos (lowering tests, executor unit tests) siguen usando `lower(query)` directamente. ### 2. Algoritmo `predicate_pushdown` Single-pass top-down con accumulator. Cada llamada recursiva pasa un `Vec` de predicados pendientes que el caller quiere empujar hacia abajo. Cada nodo del plan decide cuáles puede absorber y cuáles devuelve a su parent vía un `Filter` materializado encima.
```rust
fn pushdown_at(plan: LogicalPlan, pending: Vec) -> LogicalPlan {
match plan {
// Leaf nodes: aplicar pending arriba y terminar.
LogicalPlan::Empty
| LogicalPlan::Argument { .. }
| LogicalPlan::NodeScan { .. } => apply_filters(plan, pending),
// Filter node: descomponer y propagar.
LogicalPlan::Filter { input, predicate } => {
let mut acc = pending;
for term in split_and_terms(&predicate) {
acc.push(term);
}
pushdown_at(*input, acc)
}
// Operadores que introducen aliases — particionar pending por
// alias-set y propagar lo pushable; el resto queda arriba.
LogicalPlan::Expand { /* ... */ } => { /* §2.1 */ }
LogicalPlan::NodeById { /* ... */ } => { /* §2.2 */ }
LogicalPlan::CrossProduct { /* ... */ } => { /* §2.3 */ }
LogicalPlan::Project { /* ... */ } => { /* §2.4 */ }
LogicalPlan::Aggregate { /* ... */ } => { /* §2.5 */ }
LogicalPlan::Union { /* ... */ } => { /* §2.6 */ }
LogicalPlan::Unwind { /* ... */ } => { /* §2.7 */ }
LogicalPlan::SemiApply { /* ... */ } => { /* §2.8 */ }
LogicalPlan::PatternList { /* ... */ } => { /* §2.8 */ }
// Barreras — Distinct / TopN no se cruzan porque cambian
// cardinalidad de forma que el filter pre/post no es semánticamente
// equivalente.
LogicalPlan::TopN { /* ... */ }
| LogicalPlan::Distinct { /* ... */ } => { /* §2.9 */ }
// Writes son barreras: pending queda arriba, recurse solo en su
// input con pending vacío.
LogicalPlan::Create { /* ... */ }
| LogicalPlan::Merge { /* ... */ }
| LogicalPlan::Set { /* ... */ }
| LogicalPlan::Remove { /* ... */ }
| LogicalPlan::Delete { /* ... */ } => { /* §2.10 */ }
}
}
```
#### 2.1 `Expand { source, target_alias, rel_alias, target_label, optional, .. }` El `Expand` introduce `target_alias` y opcionalmente `rel_alias`. Un predicate puede empujarse al input sii **no referencia ningún alias introducido por el Expand**. La distinción entre `optional` y non-optional **no afecta la pushability**: el rule es estructural. Lo que sí cambia bajo `optional` es la **forma del Filter que queda arriba** — un predicate sobre `target_alias` post-OPTIONAL ya está evaluando 3VL contra `NULL` correctamente; no necesitamos invertir su semántica. (El lowering, además, folds los property/label filters INSIDE el Expand cuando `optional=true`, así que la situación clásica “`Filter(b.x > 0) ⇒ OptionalExpand(target=b)`” sólo ocurre con WHERE explícito del usuario, que es el caso 3VL correcto.)
```rust
let introduced: BTreeSet = {
let mut s = BTreeSet::new();
s.insert(target_alias.clone());
if let Some(r) = &rel_alias { s.insert(r.clone()); }
s
};
let (pushable, stay) = pending.into_iter()
.partition(|e| expression_aliases(e).is_disjoint(&introduced));
let new_input = pushdown_at(*input, pushable);
let new_expand = LogicalPlan::Expand {
input: Box::new(new_input), source, edge_type, direction, rel_alias,
target_alias, target_label, length, optional,
};
apply_filters(new_expand, stay)
```
#### 2.2 `NodeById { input, alias, .. }` Introduce `alias`. Idéntico a `Expand` pero con un set de un elemento. #### 2.3 `CrossProduct { left, right }` Cada conjunct puede ir a `left`, `right`, o quedarse arriba:
```rust
let left_aliases = produced_aliases(&left);
let right_aliases = produced_aliases(&right);
let mut to_left = Vec::new();
let mut to_right = Vec::new();
let mut keep_top = Vec::new();
for term in pending {
let refs = expression_aliases(&term);
let hits_left = !refs.is_disjoint(&left_aliases);
let hits_right = !refs.is_disjoint(&right_aliases);
match (hits_left, hits_right) {
(true, false) => to_left.push(term),
(false, true) => to_right.push(term),
(true, true) => keep_top.push(term),
(false, false) => keep_top.push(term), // constant — safe to keep up
}
}
```
**Mixed-side equality** (e.g. `a.x = b.y` con `a∈left, b∈right`) queda en `keep_top`. La inspección de `keep_top` para detectar **join-candidate** queda como hint visual en EXPLAIN VERBOSE — no modificamos el IR ni introducimos un `HashJoin` (queda diferido). La detección es:
```rust
fn is_join_candidate(expr: &Expression, left: &BTreeSet, right: &BTreeSet) -> bool {
if let ExpressionKind::Binary { op: BinaryOp::Eq, left: l, right: r } = &expr.kind {
let la = expression_aliases(l);
let ra = expression_aliases(r);
let l_side = la.is_subset(left) && ra.is_subset(right);
let r_side = la.is_subset(right) && ra.is_subset(left);
return l_side || r_side;
}
false
}
```
EXPLAIN VERBOSE anota cada Filter inmediatamente sobre un CrossProduct con `[join candidate]` cuando `is_join_candidate` true. #### 2.4 `Project { items, distinct, discard_input_bindings }` Un alias del input sobrevive arriba del Project **sii** algún `items[i]` tiene la forma `Variable(x)` con `items[i].alias == x` (identity projection sin renaming). Si el predicate refiere solo aliases identidad-proyectados, podemos bajarlo. Si refiere un alias introducido por la projection (e.g. `expr AS y`), queda arriba — debajo del Project el alias `y` no existe.
```rust
let preserved: BTreeSet = items.iter().filter_map(|it| {
if let ExpressionKind::Variable(id) = &it.expression.kind {
if id.name == it.alias { return Some(id.name.clone()); }
}
None
}).collect();
let (pushable, stay) = pending.into_iter()
.partition(|e| expression_aliases(e).is_subset(&preserved));
```
WITH \* (a futuro) podrá relajar esto. Hoy es conservador. #### 2.5 `Aggregate { group_by, aggregations }` Análogo a Project, pero con una distinción: predicates que refieren **aliases de agregaciones** son HAVING semánticos y nunca bajan. Para group\_by keys que son identity (`Variable(x)` con alias `x`), pushdown OK como pre-aggregate filter.
```rust
let preserved: BTreeSet = group_by.iter().filter_map(|(e, alias)| {
if let ExpressionKind::Variable(id) = &e.kind {
if id.name == *alias { return Some(id.name.clone()); }
}
None
}).collect();
let agg_aliases: BTreeSet = aggregations.iter().map(|(a, _)| a.clone()).collect();
let (pushable, stay) = pending.into_iter().partition(|e| {
let refs = expression_aliases(e);
refs.is_subset(&preserved) && refs.is_disjoint(&agg_aliases)
});
```
#### 2.6 `Union { left, right, all }` Pushable a ambos lados sii **todos los aliases referenciados existen en ambos**. Caso típico: post-Union los dos lados proyectan el mismo schema, así que un Filter sobre la projection sale a ambos sin ambigüedad. Si un alias falta en un lado, queda arriba.
```rust
let l_aliases = produced_aliases(&left);
let r_aliases = produced_aliases(&right);
let (pushable, stay) = pending.into_iter().partition(|e| {
let refs = expression_aliases(e);
refs.is_subset(&l_aliases) && refs.is_subset(&r_aliases)
});
let new_left = pushdown_at(*left, pushable.clone());
let new_right = pushdown_at(*right, pushable);
```
(Cloning pushable es OK — predicates suelen ser pequeños.) #### 2.7 `Unwind { list, alias }` Introduce `alias`. Predicate sobre `alias` queda arriba; otros bajan al input. #### 2.8 `SemiApply` / `PatternList` Ambos toman un `input` (outer) y un `subplan` (inner, parametrizado por la row outer). El **subplan nunca recibe pushdown** del rewriter — son scopes nested y el pushdown cross-scope requiere correlation analysis (decorrelation), out-of-scope. * `SemiApply`: no introduce nuevos aliases visibles arriba (es un semi-join, no proyecta). Pending fluye entero a `input`. Subplan intacto. * `PatternList`: introduce `alias` (el valor list). Predicates sobre `alias` quedan arriba; otros bajan a `input`. #### 2.9 `TopN` / `Distinct` (barreras de cardinalidad) NO se cruzan. Razones: * `TopN limit=L`: `Filter(p) ⇒ TopN(L)` retorna ≤ L filas filtradas; `TopN(L) ⇒ Filter(p)` retorna L filas pre-filter y luego filtra. Cardinalidades distintas; rows distintas. * `Distinct`: para predicates puros (deterministas, sin side-effects) el resultado **set** es el mismo, pero permitir el cruce nos obliga a verificar la pureza de cada subexpresión. Más seguro mantener como barrera v0.
```rust
LogicalPlan::TopN { input, keys, skip, limit } => {
let new_input = pushdown_at(*input, vec![]);
let new = LogicalPlan::TopN { input: Box::new(new_input), keys, skip, limit };
apply_filters(new, pending)
}
```
#### 2.10 Write ops (`Create / Merge / Set / Remove / Delete`) Barreras. Pending queda arriba (en la práctica el lowering nunca emite un `Filter` encima de un write — el patrón `MATCH ... WHERE ... SET` produce `Set { input: Filter { input: ... } }`, no `Filter { input: Set { ... } }`. La barrera es defensiva). ### 3. Algoritmo `normalize_filters` Bottom-up. Cuatro reglas, aplicadas en orden: 1. **Recursividad sobre children primero** (post-order). 2. **`Filter { input: Filter { input: x, predicate: p1 }, predicate: p2 }`** → `Filter { input: x, predicate: p1 AND p2 }`. 3. **`Filter { input, predicate: Literal::Boolean(true) }`** → `input`. 4. **`Filter { input: Expand { ..., target_alias=A, target_label=Some(L) }, predicate: __label_eq(A, L) }`** → `Expand { ... }` (el filter se elimina). La regla 4 también aplica recursivamente: si después de eliminar el filter, hay otro `__label_eq` apilado abajo, se elimina. La regla 2 fusiona las cláusulas que el split en pushdown dejó separadas. `Filter(false)` queda como está — el executor evalúa el predicate literal y descarta cada row; el optimizer no convierte a `Empty` porque eso requiere reasoning sobre los bindings que el plan necesita introducir (e.g. para un downstream Aggregate count(\*) = 0). ### 4. Helpers
```rust
/// Set of aliases (Variable identifiers) referenced anywhere in `expr`.
/// Property accesses contribute their target alias. Pattern subqueries
/// (`Exists`, `PatternComprehension`) and list comprehensions are
/// treated as opaque — we return ALL bindings they could possibly
/// reference, by collecting free variables in the inner expression
/// without descending into nested patterns. Conservative: when in
/// doubt, the alias set is wider, so the predicate stays higher up.
fn expression_aliases(expr: &Expression) -> BTreeSet;
/// Set of aliases that `plan` makes visible to its parent.
fn produced_aliases(plan: &LogicalPlan) -> BTreeSet;
/// AND-flatten: `a AND b AND c` → vec![a, b, c]. Used by pushdown to
/// split a compound predicate.
fn split_and_terms(expr: &Expression) -> Vec;
/// Concatenate `terms` with binary AND, preserving source order.
/// Returns None if `terms` is empty.
fn and_chain(terms: Vec) -> Option;
/// If `terms` non-empty, wrap `plan` in a `Filter(AND(terms))`.
/// Otherwise return `plan` unchanged.
fn apply_filters(plan: LogicalPlan, terms: Vec) -> LogicalPlan;
```
`produced_aliases` enumera los aliases por tipo de operador: | Operador | Produce | | ------------------- | ------------------------------------------------- | | `NodeScan/NodeById` | `{alias}` | | `Argument` | bindings literales | | `Expand` | `produced(input) ∪ {target_alias, rel_alias?}` | | `Filter` | `produced(input)` | | `Project` | \`items.iter().map( | | `Aggregate` | `group_by.aliases ∪ aggregations.aliases` | | `TopN`/`Distinct` | `produced(input)` | | `Union` | `produced(left) ∩ produced(right)` (schema-aware) | | `Unwind` | `produced(input) ∪ {alias}` | | `Empty` | `∅` | | `CrossProduct` | `produced(left) ∪ produced(right)` | | `SemiApply` | `produced(input)` | | `PatternList` | `produced(input) ∪ {alias}` | | Writes | `produced(input) ∪ alias(elements)` | ### 5. Fixpoint `optimize` corre `predicate_pushdown` + `normalize_filters` en loop hasta que dos iteraciones consecutivas producen árboles idénticos (`PartialEq` already derived on `LogicalPlan`). Cap en 8 rondas para prevenir loops infinitos en caso de bug (cada ronda debería estrictamente reducir la altura del Filter tree o ser idempotente, así que >2 rondas indicaría error). Cap se loggea pero no panic.
```rust
pub fn optimize(plan: LogicalPlan, _catalog: &StatsCatalog) -> LogicalPlan {
let mut current = plan;
for _ in 0..8 {
let next = normalize_filters(predicate_pushdown(current.clone()));
if next == current { return next; }
current = next;
}
current
}
```
### 6. EXPLAIN integration #### 6.1 EXPLAIN VERBOSE muestra el plan optimizado `explain_query_verbose(query, catalog)` ahora llama `plan(query, catalog)` internamente y renderiza el árbol post-optimize. El total estimate (header `# Estimated rows`) y per-node `(est=…)` reflejan el plan que el motor realmente correría. Esto cambia el contrato previo donde EXPLAIN VERBOSE mostraba el lowering crudo — los tests existentes que dependían de esa forma específica se actualizan. #### 6.2 EXPLAIN RAW (nueva sintaxis) `EXPLAIN RAW ` y `EXPLAIN RAW VERBOSE ` muestran el plan sin optimizar. Útil para debugging del lowering y para verificar que el optimizer hizo algo:
```plaintext
> EXPLAIN VERBOSE MATCH (a:Person) WHERE a.age > 30 RETURN a
# Estimated rows: 2
Project [a=a] (est=2)
Filter (a.age > 30) (est=2)
NodeScan label=Person alias=a (est=6)
> EXPLAIN RAW VERBOSE MATCH (a:Person) WHERE a.age > 30 RETURN a
# Estimated rows: 2
Project [a=a] (est=6)
Filter (a.age > 30) (est=2)
NodeScan label=Person alias=a (est=6)
```
En el RAW (lowering crudo) el Filter está bajo el Project, y la estimación del Project asume que el Filter ya filtró — pero el operador Project itera 6 rows con el Filter arriba siendo evaluado después, lo cual es exactamente lo que muestra el árbol. En el optimizado el Filter está debajo del Project, así el Project itera 2. #### 6.3 Join-candidate annotation Cuando un Filter inmediato sobre un CrossProduct contiene una igualdad cross-side, EXPLAIN VERBOSE agrega `[join candidate]` al final de la línea del Filter:
```plaintext
Filter (a.name = b.name) [join candidate] (est=...)
CrossProduct (est=...)
Filter (a.age > 30) (est=...)
NodeScan label=Person alias=a (est=...)
Filter (b.age < 50) (est=...)
NodeScan label=Person alias=b (est=...)
```
Un rewrite posterior detecta el flag y convierte a HashJoin. ### 7. Parser
```text
EXPLAIN [RAW] [VERBOSE]
```
* `EXPLAIN ` — lowering crudo, sin estimates. * `EXPLAIN VERBOSE ` — **optimizado**, con estimates. * `EXPLAIN RAW ` — lowering crudo, sin estimates (alias explícito del comportamiento legacy). * `EXPLAIN RAW VERBOSE ` — lowering crudo, con estimates. `RAW` es un soft-keyword reconocido sólo entre `EXPLAIN` y `VERBOSE` (o `EXPLAIN` y el inicio de la query). No es token reservado, no rompe queries con una variable llamada `raw`. `Query.explain_raw: bool` se agrega al AST junto al `explain_verbose: bool` existente. ### 8. CLI
```bash
namidb explain [--verbose] [--raw]
```
* `--raw`: alias de `EXPLAIN RAW` (skip optimize). * `--verbose`: ya existe, agrega VERBOSE. Si la query string ya contiene los prefixes, se respeta la mezcla (flag + prefix son OR’eados). ## Alternativas consideradas ### A. Selinger-style cost-based exhaustivo Enumerar todas las posiciones donde el Filter puede ir y elegir la de menor costo. Rechazado: para un plan con N operadores el espacio es O(N) posiciones por predicate; con K predicates eso es O(N×K). Para LDBC IC con N≈10, K≈5 son \~50 posiciones — tractable, pero la mejor posición siempre es “lo más bajo posible” para predicates puros (propiedad bien conocida: predicate pushdown commutes con cardinalidad reduction). El cost-based enum sólo aporta cuando los predicates tienen side effects (no en SQL/Cypher) o cuando hay correlations que podrían favorecer NO bajar (cross-column correlation — out of scope hasta que aterricen multi-column histograms). ### B. Rewrite-rule engine genérico (egg / datalog) Codificar las reglas como rewrites declarativos y dejar que un engine los aplique a fixpoint. Rechazado para v0: el catálogo inicial de reglas es 4 normalizaciones + 1 algoritmo (pushdown). Un engine genérico cuesta \~2 000 LoC de infra para 4 reglas; manual rewrite es \~500 LoC. Cuando lleguemos a 20+ reglas, evaluamos egg. ### C. Pushdown integrado en `lower` Hacer que el lowering produzca directamente el plan optimizado. Rechazado: el lowering tiene una responsabilidad clara (AST → LogicalPlan correcto). Mezclar optimizer rompe testabilidad unitaria y oculta bugs de lowering detrás de bugs de pushdown. ### D. Filter pushdown solo en `WhereClause` Procesar el WHERE en `attach_where` y bajar ahí mismo, antes de generar el Filter. Rechazado: solo cubre el WHERE explícito. Los property filters (`{key: value}` inline en patterns) producen Filters **arriba** del Expand también, y el pushdown necesita verlos todos uniformemente. Además, join reorder opera sobre el plan post-pushdown — necesita el árbol normalizado. ### E. Bajar al storage layer ahora (Parquet predicate pushdown) `scan_label(label, predicates: Vec<...>)` lee los row groups del Parquet con stats min/max + Bloom + Bitmap pushdown. Rechazado por ahora: requiere extender la API de `Snapshot` con un tipo `ScanPredicate` neutral, hacer el match en `parquet_loader.rs`, y agregar tests storage-side. \~800 LoC adicionales que duplican el costo y son ortogonales al rewrite estructural. Queda diferido, con esta RFC como pre-requisito (los predicates ya están en su posición ideal cuando un rewrite futuro los traduzca a Parquet). ## Drawbacks 1. **Cambio de contrato silencioso para callers existentes.** Los tests integration que comparan `lower(query)` con un árbol esperado siguen funcionando. Los tests que comparan `execute(plan, snapshot)` también — el plan optimizado produce el mismo set de rows. Pero tests que comparan EXPLAIN VERBOSE output cambian (el plan optimizado tiene forma diferente). Mitigación: snapshot tests existentes en `explain.rs` se actualizan. 2. **`expression_aliases` es conservador con subqueries.** Para `Filter(EXISTS((a)-[]-(b)) AND x > 0)`, el predicate `EXISTS(...)` contribuye al alias set TODOS los aliases que la pattern podría referenciar. Si el predicate tiene un sub-EXISTS sobre `a` y un `x > 0` sobre `a.x` (distintos a y x), el pushdown podría ser más fino — hoy es conservador, queda como mejora futura. 3. **No bajamos a través de `TopN` / `Distinct`.** Es una decisión conservadora; el caso “filter sobre TopN” con predicate puro es pushable, pero queremos primero verificar pureza. Trabajo trivial, queda diferido. 4. **Adjacent merge fusiona Filters con spans inconsistentes.** El nuevo `Filter` con AND-chain tiene `span` cuya extension cubre los spans originales (lo que ya hace `rebuild_and_chain` en `lower.rs`). Para error messages downstream (e.g. error en el executor), el span podría apuntar a una región más amplia que el conjunct específico que falló. Mitigación: el span de cada sub-Expression dentro del AND-tree se preserva — el reporter usa ese span, no el del Filter root. 5. **El optimizer corre sobre TODOS los queries, incluyendo write.** Los write ops son barreras (predicates no bajan a través de ellas), pero el rewriter sigue visitándolos para procesar su `input`. Costo: \~O(operadores). Para queries grandes (\~100 nodos del plan), eso es <1 ms — negligible vs la query execution time. 6. **No hay way de skip optimizer en `execute`.** Si un caller necesita evitar el optimizer (e.g. para reproducir un bug del lowering), debe llamar `lower(query)?` directamente y luego `execute(plan, ...)`. La función `plan(query, catalog)` es el atajo conveniente, no el único path. ## Open questions * **OQ1.** ¿Debería `optimize` aceptar un `OptimizerSettings` con flags individuales (`enable_pushdown`, `enable_normalize`, `enable_label_eq_cleanup`)? Hoy no — un único toggle “todo o nada” vía si se llama `optimize` o `lower` directamente. Cuando agreguemos más rewrites, evaluamos un settings struct. * **OQ2.** ¿`__label_eq` cleanup también debería eliminar el filter cuando el predicate target está bound por un `NodeScan` con label declarado? Hoy sí — el operador ya garantiza el label vía `scan_label(L)`. La regla extendida cubre ambos casos sin riesgo. * **OQ3.** Pure-predicate detection para abrir TopN/Distinct: los predicates Cypher son siempre side-effect-free en v0 (sin funciones externas). Podríamos bajarlos sin verificar. Decisión: ser conservadores hasta que aterricen funciones externas (RFC futuro). ## References * Mumick & Pirahesh, *Implementation of Magic-sets in a Relational Database System* (1994) — origen de las técnicas de pushdown. * Selinger et al., *Access Path Selection in a Relational Database Management System* (SIGMOD ‘79) — cost-based optimizer fundacional. * DuckDB *Predicate Pushdown Through Joins* (Mark Raasveldt, 2022) — caso moderno de pushdown sobre join trees vectorizados. * CockroachDB optimizer notes (Andy Kimball, 2018) — pushdown a través de Cypher-shaped query trees. * `docs/rfc/010-cost-based-optimizer.md` — fundación que esta RFC usa. * `docs/rfc/008-logical-plan-ir.md` — operadores que esta RFC rewritea. ## Plan de implementación 1. **Crate `namidb-query`**: * `src/optimize/mod.rs` — re-exports + `optimize(plan, catalog)`. * `src/optimize/pushdown.rs` — `predicate_pushdown` + helpers (`expression_aliases`, `produced_aliases`, `split_and_terms`, `and_chain`, `apply_filters`). \~800 LoC + 20-25 unit tests. * `src/optimize/normalize.rs` — `normalize_filters` con las 4 reglas. \~250 LoC + 8-10 unit tests. * `src/lib.rs` — re-export `plan(query, catalog)`. 2. **`src/parser/grammar.rs`**: * Reconocer `RAW` como soft keyword entre `EXPLAIN` y `VERBOSE`/query body. * `Query.explain_raw: bool`. Display round-trips. * \~25 LoC + 3 tests. 3. **`src/plan/explain.rs`**: * `explain_query_verbose(query, catalog)` aplica `optimize` antes de renderizar. * Nueva función `explain_query_raw(query)` para `EXPLAIN RAW`. * Helper `is_join_candidate` y annotación inline. * \~80 LoC + 5 tests. 4. **CLI** (`namidb-cli/src/main.rs`): * Flag `--raw`; pasar a través el flag del query string. * \~15 LoC. 5. **Executor wiring** (`src/exec/walker.rs`, `src/exec/writer.rs`): * Cualquier call site externo que llamaba `lower(&query)?` antes de `execute(...)` ahora llama `plan(&query, &catalog)?`. CLI y integration tests son los call sites principales. 6. **Tests integration** (`tests/cost_smoke.rs` + nuevo `tests/optimize_smoke.rs`): * Filter sobre source baja debajo de Expand (LDBC IC2 micro). * Filter sobre target NO baja debajo de Expand. * Filter sobre OPTIONAL target NO baja. * CrossProduct: predicates split a left / right / top. * `__label_eq` cleanup verifiable en EXPLAIN output. * Plan optimizado y plan crudo producen el mismo result set. * Plan optimizado tiene `estimate(...) ≤ estimate(crudo)`. Snapshot esperado: * `cargo test --workspace --exclude namidb-py`: 413 → \~445 passed. * `cargo clippy --workspace --all-targets -- -D warnings`: clean. * `cargo fmt --all -- --check`: clean. * LoC nuevo: \~1 100 src + \~600 tests. * Sin cambios en `namidb-storage`.
# RFC 012: HashJoin
> **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-011 (predicate pushdown) **Supersedes:** —
> *Mirrored from [`docs/rfc/012-hash-join.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/012-hash-join.md) in the engine repo. Source of truth lives there.* **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-011 (predicate pushdown) **Supersedes:** — ## Summary Cosecha el `[join candidate]` annotation que el predicate pushdown (RFC-011) dejó sembrado en EXPLAIN VERBOSE: convierte el shape `Filter(a.x = b.y) ⇒ CrossProduct(L, R)` a un `HashJoin { build, probe, on: [(a.x, b.y)] }` que el executor materializa con build/probe phases. Operación O(N×M) → O(N+M). Alcance: * Nuevo operador `LogicalPlan::HashJoin { build, probe, on, residual }`. * Rewriter `convert_cross_to_hash` en el pipeline de `optimize`, posterior al pushdown. * Estimador de cardinalidad para `HashJoin`. * Executor que materializa hash table en build-side, streams probe-side. * EXPLAIN VERBOSE renders `HashJoin on=[(a.x, b.y)]` con build/probe como children. Out-of-scope explícito: * **HashSemiJoin via decorrelation** — convertir `SemiApply` cuando el subplan tiene correlación con bindings del outer. Requiere correlation analysis del subplan (separar la equality que correlaciona del resto, ejecutar subplan no-correlado, hash por la output column de correlación, probe outer). Iteración independiente. * **Sort-merge join** — alternativa cuando los inputs ya están sorted. El executor no propaga ordering, así que solo HashJoin v0. * **Broadcast / partitioned join** — distribuido. * **Hash table spilling a disk** — cuando build no entra en memoria. v0 asume single-node build fits in memory. Documentado como drawback. * **Equality detection sin AND-root** — solo extraemos cross-side equalities del top-level AND-tree del Filter. `Filter(a.x = b.y OR ...)` queda sin convertir. ## Motivation Tras predicate pushdown el plan para `MATCH (a:Person), (b:Person) WHERE a.firstName = b.firstName` queda:
```plaintext
Filter (a.firstName = b.firstName) [join candidate]
CrossProduct (est=1000000) // 1000 * 1000
NodeScan(Person, a) (est=1000)
NodeScan(Person, b) (est=1000)
```
EXPLAIN VERBOSE flag-ea el shape como `[join candidate]` pero el plan sigue ejecutando nested-loop. Sobre LDBC SF1 (3M Person), eso son 9×10¹² pairs antes de filtrar — query nunca termina. Con HashJoin:
```plaintext
HashJoin on=[(a.firstName, b.firstName)] (est=10000)
build:
NodeScan(Person, a) (est=1000)
probe:
NodeScan(Person, b) (est=1000)
```
Build phase: \~1M Person → 1M hash table entries (\~80MB en RAM con ndv(firstName)≈1k). Probe phase: 1M Person, lookup en hash → 1k matches por probe (asumiendo distribución uniforme sobre 1k buckets). Total: \~1M matches finales. **O(N+M) en vez de O(N×M)**, factor ahorro de \~3 órdenes de magnitud. Sin HashJoin, las queries multi-pattern de LDBC SNB Interactive (IC2 con EXISTS, IC9 con sub-pattern) toman seconds/minutos sobre micro-graphs y nunca terminan sobre SF1. ## Design ### 1. IR: `LogicalPlan::HashJoin`
```rust
/// Inner hash join (RFC-012). Equivalent to
/// `Filter(on AND residual) ⇒ CrossProduct(build, probe)` but
/// executed in two phases: build a hash table over `build`'s rows
/// keyed by each `JoinKey::build_side` expression; then stream
/// `probe`, evaluating `JoinKey::probe_side` and looking up matches.
///
/// The optimizer picks the side with smaller estimated cardinality
/// as `build` so the hash table stays compact.
///
/// `residual` is any non-equi predicate that survived from the
/// pre-conversion Filter (e.g. a >= b in `WHERE a.x = b.y AND a.z >= b.w`).
/// It is evaluated on the *joined* row in 3VL.
HashJoin {
build: Box,
probe: Box,
on: Vec,
residual: Option,
},
```
```rust
#[derive(Clone, Debug, PartialEq)]
pub struct JoinKey {
/// Expression evaluated on each `build`-side row to compute the
/// hash-table key. References only aliases produced by `build`.
pub build_side: Expression,
/// Expression evaluated on each `probe`-side row. References only
/// aliases produced by `probe`.
pub probe_side: Expression,
}
```
`children()` returns `vec![build, probe]` in that order — keeps EXPLAIN rendering predictable. `operator_name()` returns `"HashJoin"`. `contains_write()` returns `false` (joins are read-side). ### 2. Conversion rewriter (`optimize::join_conversion`)
```rust
pub fn convert_cross_to_hash(
plan: LogicalPlan,
catalog: &StatsCatalog,
) -> LogicalPlan;
```
Bottom-up rewrite. The trigger shape after the post-pushdown plan is:
```plaintext
Filter { input: CrossProduct { left, right }, predicate }
```
Algorithm: 1. **Recurse** into children first (so we convert any inner joins before considering the current node). 2. **Match the trigger**. If the current plan is a `Filter` whose immediate child is a `CrossProduct`: a. **AND-split** the predicate into conjuncts. b. Compute `produced_aliases(left)` and `produced_aliases(right)`. c. For each conjunct `c`: * If `c` is `Binary { op: Eq, left: lhs, right: rhs }` and (`expression_aliases(lhs) ⊆ left_aliases ∧ expression_aliases(rhs) ⊆ right_aliases`) → push `(lhs, rhs)` to `on`. * Mirror case (`lhs ⊆ right ∧ rhs ⊆ left`) → push `(rhs, lhs)` so build\_side always lines up with `build` operand. We canonicalize. * Otherwise → push to `residual_terms`. d. If `on.is_empty()` → no conversion possible; emit the original `Filter ⇒ CrossProduct` unchanged. e. **Build vs probe decision**: compute `estimate(left, catalog).rows` and `estimate(right, catalog).rows`. Whichever has fewer rows becomes `build`. If equal, prefer left as build (deterministic). Swap `on` keys if we picked right as build. f. **Coalesce residual**: `residual = and_chain(residual_terms)` → `Option`. g. Emit `HashJoin { build, probe, on, residual }`. 3. Otherwise (or if step 2 doesn’t apply): preserve the operator and recurse over its children. #### Edge cases * **Eq with a literal on one side** (`a.x = 5`): not cross-side, falls through to residual / pushdown. Already handled by RFC-011. * **Eq with both sides referencing only one alias** (`a.x = a.y`): same-side, stays in residual. Already pushable to that side’s subtree. * **Eq with parameter** (`a.x = $param`): `expression_aliases($param) = ∅`. Falls through to residual; selectivity has it. * **AND-only predicate**: `a.x = b.y AND a.z > b.w` → `on = [(a.x, b.y)]`, `residual = Some(a.z > b.w)`. * **Multiple cross-side eqs**: `a.x = b.x AND a.y = b.y` → `on = [(a.x, b.x), (a.y, b.y)]`. Coalesced into a multi-key hash join. * **No eq at all** (`Filter(a.x > b.x)`): `on.is_empty()`, no conversion. Plan remains nested-loop. The rewriter must NOT trigger on Filters whose immediate child is not a CrossProduct — those are pushdown leftovers and irrelevant. ### 3. Conversion entry point (`optimize::optimize`) Integrates into the existing pipeline:
```rust
pub fn optimize(plan: LogicalPlan, catalog: &StatsCatalog) -> LogicalPlan {
let mut current = plan;
for _ in 0..MAX_FIXPOINT_ROUNDS {
let next = normalize_filters(predicate_pushdown(current.clone()));
let next = convert_cross_to_hash(next, catalog); // NEW
if next == current { return next; }
current = next;
}
current
}
```
Order matters: pushdown runs first so any pushable filter has been moved out of the way; the only Filters remaining above CrossProduct are by definition cross-side mixers. The rewriter then has the cleanest possible signal. ### 4. Cardinality estimate Add an arm to `cost::cardinality::estimate_inner` for `HashJoin`:
```rust
LogicalPlan::HashJoin { build, probe, on, residual } => {
let b = estimate_inner(build, catalog);
let p = estimate_inner(probe, catalog);
// Selinger '79: inner equi-join cardinality.
// rows = (|build| * |probe|) / max(ndv(build_key), ndv(probe_key))
// For multi-key, assume independence: divide by product.
let mut divisor = 1.0_f64;
for key in on {
let build_ndv = ndv_for_expr_opt(&key.build_side, catalog, &b.bindings).unwrap_or(1.0);
let probe_ndv = ndv_for_expr_opt(&key.probe_side, catalog, &p.bindings).unwrap_or(1.0);
divisor *= build_ndv.max(probe_ndv).max(1.0);
}
let mut rows = (b.rows * p.rows / divisor).max(0.0);
// Residual reduces further. Use the existing selectivity machinery.
if let Some(res) = residual {
let mut combined = b.bindings.clone();
for (k, v) in &p.bindings { combined.insert(k.clone(), v.clone()); }
let bs = make_binding_stats(catalog, &combined);
rows *= selectivity(res, &bs);
}
let mut bindings = b.bindings.clone();
for (k, v) in &p.bindings { bindings.insert(k.clone(), v.clone()); }
Cardinality {
rows,
children: vec![b, p],
bindings,
operator: "HashJoin",
}
}
```
#### Why Selinger and not “min(|L|,|R|)” A common shortcut estimate is `min(|L|, |R|)` for foreign-key joins. That’s correct when the join key is unique on one side and present in every row of the other. For graph joins on arbitrary properties the distribution is wider — Selinger captures both extremes: * Unique on both sides → `min(|L|, |R|)` (since the join key has ndv = |L| ≈ |R|). * Replicated key → much larger output. We fall back to `divisor = 1` (= no reduction → CrossProduct cardinality) when ndv is `None`. That keeps the estimate sound (never under-estimates) at the cost of being pessimistic for queries the catalog doesn’t know about. ### 5. Executor Two-phase implementation in `exec::walker`:
```rust
async fn execute_hash_join(
build: &LogicalPlan,
probe: &LogicalPlan,
on: &[JoinKey],
residual: &Option,
snapshot: &Snapshot<'_>,
params: &Params,
) -> Result, ExecError> {
// Build phase.
let build_rows = execute_inner(build, snapshot, params, /*outer=*/ None).await?;
let mut table: HashMap, Vec> = HashMap::new();
for row in build_rows {
let mut key = Vec::with_capacity(on.len());
let mut has_null = false;
for jk in on {
let v = evaluate(&jk.build_side, &row, params)?;
if matches!(v, RuntimeValue::Null) {
has_null = true;
break;
}
key.push(v);
}
if has_null { continue; } // NULL keys never match (3VL).
table.entry(key).or_default().push(row);
}
// Probe phase.
let probe_rows = execute_inner(probe, snapshot, params, None).await?;
let mut out = Vec::new();
for prow in probe_rows {
let mut key = Vec::with_capacity(on.len());
let mut has_null = false;
for jk in on {
let v = evaluate(&jk.probe_side, &prow, params)?;
if matches!(v, RuntimeValue::Null) { has_null = true; break; }
key.push(v);
}
if has_null { continue; }
if let Some(matches) = table.get(&key) {
for brow in matches {
let mut combined = brow.clone();
for (k, v) in &prow.bindings { combined.bindings.insert(k.clone(), v.clone()); }
if let Some(res) = residual {
match evaluate(res, &combined, params)? {
RuntimeValue::Bool(true) => out.push(combined),
_ => {} // False or NULL drops.
}
} else {
out.push(combined);
}
}
}
}
Ok(out)
}
```
#### NULL semantics `a.x = b.y` is `NULL` when either side is `NULL` (Cypher 3VL). `Filter` drops rows where the predicate evaluates to NULL. Our hash join replicates that: any NULL component in the join key skips both the build insert and the probe lookup. Test coverage explicitly exercises this. #### Hash key representation `Vec` as the HashMap key. `RuntimeValue` implements `Hash + Eq` through derive (numeric, string, bool variants are straightforward). `RuntimeValue::Float` requires the bit-level canonical form to make NaN sort to one bucket — already in the existing `Hash` impl since the value layer. #### Memory footprint Build hash table size: roughly `|build| * (avg_key_size + avg_row_size)`. For SF1-scale build of 3M Person × 200B/row + 50B key = \~750MB. That fits comfortably in a 8GB machine. **Drawback**: no spill, so jobs that pick the wrong build side OOM. Defended by the catalog-based build-vs-probe decision. Future RFC adds spill. #### Bindings combine When we emit a joined row, we take the build row, then `.extend()` its bindings with the probe row’s bindings. If the two sides share a binding name (shouldn’t happen in well-formed plans, but defensive), probe wins. This matches the lowering invariant: two pattern parts share no fresh aliases (lowering uses `CrossProduct` precisely when they don’t). ### 6. EXPLAIN rendering `write_header` for HashJoin:
```plaintext
HashJoin on=[(a.firstName, b.firstName)] residual=(a.id < b.id)
```
`residual` omitted when None. Multi-key:
```plaintext
HashJoin on=[(a.x, b.x), (a.y, b.y)]
```
The `[join candidate]` annotation that the predicate pushdown emitted on the original `Filter ⇒ CrossProduct` disappears post-conversion — the operator IS the join now. This is verifiable with a test. `EXPLAIN VERBOSE` cardinality numbers show the dramatic improvement: the HashJoin estimate is much smaller than the pre-conversion CrossProduct estimate. ### 7. Interaction with subsequent rewrites * **Predicate pushdown above HashJoin**: predicates that reference only build or only probe aliases can be pushed below the HashJoin into the respective subtree. The pushdown rule for HashJoin is identical to CrossProduct (split by side; mixed-eq is now in `on` or `residual`, doesn’t reach pushdown). We add an arm to `predicate_pushdown` to support this. * **Join reorder**: when reorder triggers, it can swap build and probe by re-evaluating `estimate` — the on-keys remain valid (just the symbol order in each pair swaps). * **HashSemiJoin**: a `SemiApply` with a correlated subplan is decorrelated by extracting the correlation key, executing the subplan unparametrised, hashing on the correlation key, and probing the outer. The IR delta is small (`HashSemiJoin` is just `HashJoin` - emit-only-outer + optional negation flag). ## Alternatives considered ### A. Nested-loop with Bloom filter probe DuckDB-style: build a Bloom filter on `build`’s join keys, probe each `probe` row by Bloom-checking, and fall through to nested-loop on the positives. **Rejected**: Bloom-filter false-positive rate \~1% means 99% of work is short-circuited, but the remaining 1% is still O(N×M) — for N,M = 10⁶ that’s 10¹⁰ comparisons. HashJoin is strictly better when memory fits. ### B. Sort-merge join If both sides are already sorted on the join key, a sort-merge join avoids materialising a hash table. **Rejected for v0**: the executor does not propagate ordering. Adding ordering metadata to the IR is a separate, larger change (morsel-driven executor — vectorised, often comes with sort ordering as a metadata column). ### C. Stream both sides and use partition-hash-join Modern approach (DuckDB, ClickHouse, MapReduce-style). Both inputs partition by hash; each partition joins independently. **Rejected**: parallelism over partitions is a morsel feature, out-of-scope here. Single-threaded HashJoin v0 is the simplest correct approach. ### D. Rely on graph-native joins (WCOJ / Worst-Case Optimal Join) For cyclic / multi-way joins, WCOJ (RFC-009-eve) outperforms binary hash joins. **Rejected aquí**: WCOJ es RFC-009’s concern. Binary hash joins cover the LDBC SNB interactive queries que esta RFC ataca (IC2, IC4, IC10 with cross-pattern equi-joins) — WCOJ gains kick in on truly cyclic queries (IC9 path patterns). ### E. Defer join conversion until query execution time Adaptive: run a tiny sample of both sides, pick the join algorithm at runtime. **Rejected**: adds runtime branching to the executor and defeats the EXPLAIN/PROFILE story. Adaptive execution can revisit a futuro. ## Drawbacks 1. **Unbounded hash table memory**. Build side fits in RAM is an assumption. For LDBC SF1 we’re fine (build of 3M rows × 250B = \~750MB), but pathological queries (joins on rare keys with replicated rows) could blow memory. Mitigated by the build-vs-probe decision; not eliminated. Spill to disk queda como follow-up. 2. **No correlated subquery conversion**. SemiApply with a correlated subplan (the typical `EXISTS` shape) still nested-loops. Una iteración independiente lo resuelve. 3. **No multi-pattern join graph**. With 3+ pattern parts the optimizer sees a tree of CrossProducts; converting bottom-up means we lose the chance to pick a globally-optimal join order. Join reorder (RFC-016) addresses this by enumerating join trees BEFORE the conversion rewriter. 4. **The build side is materialised in full**. For very large build, this prevents streaming output. Modern hash joins emit matched rows as the probe streams — we do too, but only after the full build is in memory. 5. **Conversion conservative on residual**: when AND-split leaves non-eq conjuncts, we keep them as `residual` on the HashJoin. This means the residual evaluates on every joined row, which can be expensive. Future rewrites (`predicate_pushdown` over HashJoin) can further push residual conjuncts below the join if they reference only one side. Already supported by the standard pushdown rules once HashJoin is a recognised plan node. ## Open questions * **OQ1**. `RuntimeValue::Float` as hash key — NaN canonical hashing must be enforced. Today’s `Hash` impl uses raw bits, which makes distinct NaN bit patterns hash differently. We normalise during `evaluate` for the join key? Or in the `Hash` impl? Decided pragmatically: normalise on insert/lookup via a helper. * **OQ2**. Should HashJoin emit a “join-key” column in the output so downstream operators can dedup cheaply? Today the executor does not — the build row’s aliases survive verbatim; downstream Distinct re-hashes. Defer. * **OQ3**. The cost-model assumes independence between multi-key components. For LDBC IC9 with `WHERE a.firstName = b.firstName AND a.lastName = b.lastName`, the two equalities are strongly correlated. The estimate over-reduces. Multi-column histograms a futuro lo arreglan. ## References * Selinger et al., *Access Path Selection in a Relational Database Management System* (SIGMOD ‘79) — origin of cost-based join order and cardinality estimates we reuse here. * DuckDB’s *Push-Based Execution* (Mark Raasveldt 2022) — modern reference implementation for HashJoin. * *Worst-Case Optimal Joins* (Ngo, Porat, Ré, Rudra; PODS ‘14) — WCOJ baseline para trabajo futuro. Mentioned to contrast our scope. * `docs/rfc/008-logical-plan-ir.md` — IR this RFC extends. * `docs/rfc/010-cost-based-optimizer.md` — cost model this RFC consumes via `StatsCatalog`. * `docs/rfc/011-predicate-pushdown.md` — the `[join candidate]` annotation que esta RFC cosechas. ## Plan de implementación 1. **`crates/namidb-query/src/plan/logical.rs`** (\~80 LoC + 3 tests): * Agregar `LogicalPlan::HashJoin` variant + `JoinKey` struct. * Actualizar `children()`, `operator_name()`, `contains_write()`, test del IR. 2. **`crates/namidb-query/src/optimize/join_conversion.rs`** (\~400 LoC + 12-15 tests): * `convert_cross_to_hash(plan, catalog)` recursivo. * Helper `extract_cross_side_equalities(predicate, left_aliases, right_aliases) -> (Vec, Vec)`. * Helper `pick_build_side(left, right, catalog) -> Side`. 3. **`crates/namidb-query/src/optimize/pushdown.rs`** (\~50 LoC + 4 tests): * Agregar HashJoin arm en `pushdown_at`: split por side, push pushable conjuncts a build o probe. Same shape as CrossProduct. 4. **`crates/namidb-query/src/optimize/mod.rs`** (\~10 LoC): * Llamar `convert_cross_to_hash` post-pushdown en `optimize` pipeline. 5. **`crates/namidb-query/src/cost/cardinality.rs`** (\~60 LoC + 4 tests): * Nuevo arm para `HashJoin` con la fórmula de §4. 6. **`crates/namidb-query/src/exec/walker.rs`** (\~150 LoC + 5 tests): * `execute_hash_join` con build/probe phases. 7. **`crates/namidb-query/src/plan/explain.rs`** (\~30 LoC + 3 tests): * `write_header` arm para HashJoin. * `plan_has_stats` arm. 8. **`crates/namidb-query/tests/cost_smoke.rs`** (+6 integration tests). Snapshot esperado: * `cargo test --workspace --exclude namidb-py`: 509 → \~555 passed. * `cargo clippy --workspace --all-targets -- -D warnings`: clean. * `cargo fmt --all -- --check`: clean. * LoC nuevo: \~800 src + \~400 tests. * Sin cambios en `namidb-storage`.
# RFC 013: Parquet predicate pushdown
> **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-011 (predicate pushdown), RFC-002 §4 (SST stats) **Supersedes:** —
> *Mirrored from [`docs/rfc/013-parquet-predicate-pushdown.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/013-parquet-predicate-pushdown.md) in the engine repo. Source of truth lives there.* **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-011 (predicate pushdown), RFC-002 §4 (SST stats) **Supersedes:** — ## Summary Cosecha los `min/max` que el writer ya emite por row-group (RFC-002 §4 * read-back desde el Parquet footer) para evitar decodificar row-groups que no pueden contener filas que satisfacen el WHERE. El predicate pushdown estructural (RFC-011) empujó cada Filter lo más cerca del leaf (NodeScan) que pudo; este siguiente paso convierte el `Filter` inmediatamente sobre `NodeScan` en `predicates` *del NodeScan*, que el storage layer consume durante el scan para descartar row-groups completos sin decodificar. Alcance: * Tipo `ScanPredicate` en `namidb-storage::sst::predicates` (Eq / Lt / LtEq / Gt / GtEq / Between / IsNull / IsNotNull / In) contra una sola columna y un literal canónico (`StatScalar`). * `eval_row_group(predicate, &PropertyColumnStats) -> RowGroupVerdict` conservador: `Absent` solo si los stats demuestran imposibilidad, `MaybePresent` cuando los stats faltan o no son concluyentes. * `NodeSstReader::scan_with_predicates(&[ScanPredicate])` lee el Parquet metadata, evalúa cada predicate contra los stats por row-group, skipea cualquier row-group con verdict `Absent` para CUALQUIER predicate. * `Snapshot::scan_label_with_predicates(label, &[ScanPredicate])` y `Snapshot::scan_label(label)` mantenido como wrapper de `scan_label_with_predicates(label, &[])` (compat). * `LogicalPlan::NodeScan` gana `predicates: Vec` field. Default vacío para callers existentes. * Rewriter en `optimize::pushdown`: cuando llega a `NodeScan` con `pending` no-vacío, intenta convertir cada conjunct a un `ScanPredicate` sobre `alias.property`. Los pushables van al NodeScan.predicates; los no-pushables permanecen como `Filter`. * Executor pasa los predicates al storage en el callsite de `walker::execute_node_scan`. * Cardinality evalúa selectividad de los predicates ya empujados sobre el `node_count` del catalog (consistente con RFC-010 §3.1). * EXPLAIN VERBOSE: `NodeScan label=Person alias=a predicates=[a.age > 30, a.firstName = "Alice"]`. Out-of-scope explícito: * **Parquet row-level filtering** (Arrow `filter` operator dentro del reader). El executor ya tiene `Filter` y aplicarlo dos veces duplicaría trabajo. Solo hacemos *row-group* pruning en storage. * **Predicates sobre Expand / edge SST**. La RFC-002 sí define stats por edge SST (`DegreeHistogram`) pero las edges no tienen stats por propiedad arbitraria. Se mantiene fuera del v0. * **OR predicates**. Cada `ScanPredicate` es un single-column AND conjunct. `WHERE a.x = 1 OR a.y = 2` no se empuja en v0 — requeriría unión de row-groups con bookkeeping del verdict por-row. Conjunctive-only. * **Predicates cross-alias** (`WHERE a.age = b.age`). El storage no conoce `b`; el Filter cross-alias permanece arriba del NodeScan y cuando el HashJoin rewrite ya lo está convirtiendo a HashJoin, esto NO le quita nada. * **Predicates derivados de parámetros**. Los parámetros se resuelven en runtime; el storage layer no los ve. Si el lowering conoce el valor (constante), se baja como literal; si es un parameter abierto, el Filter permanece arriba. Una posible extensión post-v0: resolver parameters en `optimize::optimize` cuando `&Params` se pase al pipeline. * **Page-index pruning** (Parquet 2.0 column index + offset index). Vale para SSTs grandes con row-groups grandes pero requiere otra layer de stats. Skip v0; el writer ya emite chunk-level stats (`EnabledStatistics::Chunk`) — suficiente. * **Bloom filter check sobre eq predicates**. El writer emite bloom filters para `node_id` pero no para propiedades arbitrarias. Extender el bloom a properties es un trade-off de espacio (bloom bytes son \~7×rows) — esto queda diferido si las stats min/max no alcanzan. * **Cardinality estimate con dependencia entre predicates**. Cada predicate aplica selectividad independiente, como RFC-010 §3.2. El catalog HLL daría correlación pero v0 mantiene la asunción de independencia. ## Motivation Sobre LDBC SNB SF1 (3M Person nodes en \~30 SSTs de 100k rows c/u, con row-groups de 8192 rows = \~366 row-groups por label), una query `MATCH (a:Person) WHERE a.creationDate > '2020-01-01' RETURN a` que hoy: * Lee TODOS los SSTs (cada uno \~10–40 MB sobre S3). * Decodifica TODOS los row-groups (\~366 Parquet decompressions). * Filtra row-level en el executor: descarta \~99% de las filas. Con predicate pushdown: * Lee TODOS los SSTs **footer + page index** (\~64 KiB c/u, no body). * Por cada SST consulta `min/max(creationDate)` por row-group. * Skip los row-groups cuyo `max(creationDate) < '2020-01-01'`. * Decodifica solo los row-groups que pueden contener matches (\~10–30 row-groups en lugar de 366). **Ahorro de IO sobre S3 (dominante en cloud):** 10× típico para queries selectivas. Ahorro de CPU en decoding (Parquet deserialización): factor \~30×. Esto es la última pieza del pushdown end-to-end del query layer al storage. Sin ella, los `min/max` que el writer emite no se usan: solo están en el catalog para el cost model. ## Design ### 1. `namidb-storage::sst::predicates`
```rust
/// A single-column conjunctive predicate that the SST reader can
/// evaluate against per-row-group stats to skip entire row-groups
/// without decoding them.
///
/// Each variant references a column **by its declared property name**
/// (not by Parquet leaf path). The reader resolves to leaf index at
/// scan time.
#[derive(Clone, Debug, PartialEq)]
pub enum ScanPredicate {
Eq { column: String, value: StatScalar },
Lt { column: String, value: StatScalar },
LtEq { column: String, value: StatScalar },
Gt { column: String, value: StatScalar },
GtEq { column: String, value: StatScalar },
/// `Between { low, high }` is INCLUSIVE both sides. Equivalent to
/// `Gte(low) AND Lte(high)`.
Between { column: String, low: StatScalar, high: StatScalar },
IsNull { column: String },
IsNotNull { column: String },
In { column: String, values: Vec },
}
```
The literal type — `StatScalar` — is the same one the writer emits into `PropertyColumnStats`. This guarantees comparison between predicate and stats lives in a single ordering. #### Verdict
```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RowGroupVerdict {
/// Stats prove no row in this row-group can satisfy the predicate.
Absent,
/// Stats are insufficient (missing min/max, type mismatch) OR
/// stats overlap — rows may or may not match. Decode the
/// row-group.
MaybePresent,
}
```
Conservatism: we only return `Absent` when the stats *prove* the row-group contains no match. Missing min/max ⇒ `MaybePresent`. Type mismatch (predicate is `Utf8` but stats are `Int32`) ⇒ `MaybePresent` (defensive; treats the predicate as inapplicable rather than asserting). #### Evaluation
```rust
pub fn eval_row_group(
predicate: &ScanPredicate,
stats: &PropertyColumnStats,
) -> RowGroupVerdict;
```
Algorithm per predicate (column known to match `stats.name`): * `Eq(v)`: * If `min ≤ v ≤ max` → MaybePresent. Else Absent. * If min/max missing → MaybePresent. * `Lt(v)`: * If `min < v` → MaybePresent (some row may be < v). Else Absent (all rows ≥ v). * `LtEq(v)`: `min ≤ v` → MaybePresent. Else Absent. * `Gt(v)`: `max > v` → MaybePresent. Else Absent. * `GtEq(v)`: `max ≥ v` → MaybePresent. Else Absent. * `Between { low, high }`: equivalent to `GtEq(low) AND LtEq(high)`. Apply both; AND of verdicts = Absent if either is Absent. * `IsNull`: `null_count > 0` → MaybePresent. Else Absent. * `IsNotNull`: row-group has at least one non-null value (`null_count < row_count`); we don’t have `row_count` per stats here but we DO have `null_count` and if `null_count == 0` we’re sure non-nulls exist (defensive: always MaybePresent in v0 when stats exist; falls back to row-level Filter for the per-row check). * `In { values }`: build the closed interval `[min(values), max(values)]`; apply same logic as `Between`. False-positives are accepted (some intermediate value may not be in the `In` list — caught by the Filter operator above the NodeScan that the rewriter leaves intact as residual when In is partial). Comparison between `StatScalar` variants follows the obvious type matching: `Int32 vs Int32`, `Float64 vs Float64`, `Utf8 vs Utf8`, etc. Cross-type comparison (e.g. `Int32 vs Float64`) returns `MaybePresent` in v0; the optimizer doesn’t generate cross-type predicates because the property type is declared in the schema. NULL handling: `min` and `max` in `PropertyColumnStats` are computed over non-null values only (writer convention; see RFC-002 §4.1). So a column where every value is NULL has `min=None, max=None, null_count=N`. We evaluate `IsNull` from null\_count alone, and any ordered predicate (`Eq/Lt/Gt/...`) on a min/max=None column returns MaybePresent (defensive — the row-group has no non-null rows that could satisfy the predicate, but evaluating the predicate at row-level will correctly drop NULLs via 3VL). ### 2. `NodeSstReader::scan_with_predicates`
```rust
impl NodeSstReader {
pub fn scan_with_predicates(
&self,
predicates: &[ScanPredicate],
) -> Result>;
}
```
Algorithm: 1. Build the Parquet reader once: `ParquetRecordBatchReaderBuilder::try_new(body)`. 2. Read the metadata; collect the leaf index for every property column referenced by any predicate. 3. For each row-group: * For each predicate, locate the column leaf in the row-group’s ColumnChunkMetaData → `cc.statistics()`. Map to a per-row-group `PropertyColumnStats` synthesizing only the fields evaluation needs (`null_count`, `min`, `max` — same coding as `compute_property_stats`). * Evaluate each predicate with `eval_row_group`. If ANY returns `Absent`, skip the row-group. 4. Collect surviving row-group indices into `keep`. 5. If `keep` is empty, return `Vec::new()` (no decode). 6. Otherwise build the reader `with_row_groups(keep)` and decode as in `scan()`. Cost: \~few µs per row-group of metadata inspection (the metadata is already in memory from `try_new`). When all row-groups survive we fall through to the same path as `scan()` and pay no extra IO. ### 3. `Snapshot::scan_label_with_predicates`
```rust
impl Snapshot<'_> {
pub async fn scan_label(&self, label: &str) -> Result> {
self.scan_label_with_predicates(label, &[]).await
}
pub async fn scan_label_with_predicates(
&self,
label: &str,
predicates: &[ScanPredicate],
) -> Result>;
}
```
The new variant: * Iterates the memtable as `scan_label` does today, but additionally evaluates each predicate against the materialised NodeView’s property map. Memtable values are decoded already, so this is cheap (no IO). * For each SST scoped to `label`, calls `reader.scan_with_predicates(predicates)` instead of `reader.scan()`. * Returns the same `BTreeMap` semantics: tombstones win; last-write-wins by LSN. Memtable predicate evaluation is row-by-row in v0 and uses the same 3VL semantics as the executor’s Filter (`Bool(true)` → keep, `Bool(false)` / `Null` → drop). ### 4. `LogicalPlan::NodeScan` change
```rust
NodeScan {
label: String,
alias: String,
/// Predicates that have been pushed into the scan from a Filter
/// directly above it. Empty for the lowering output; populated
/// by `optimize::pushdown` when conjuncts qualify (see §5).
/// The executor passes them verbatim to
/// `Snapshot::scan_label_with_predicates`.
predicates: Vec,
}
```
`PartialEq`, `Clone`, `Debug` derive over the new field. All existing constructions of `NodeScan` (lowering, tests) now use `predicates: Vec::new()`. `operator_name()` remains `"NodeScan"`. EXPLAIN VERBOSE renders the predicates inline (see §6). ### 5. Rewriter in `optimize::pushdown` The existing leaf arm:
```rust
LogicalPlan::Empty | LogicalPlan::Argument { .. } | LogicalPlan::NodeScan { .. } => {
apply_filters(plan, pending)
}
```
becomes:
```rust
LogicalPlan::NodeScan { label, alias, predicates } => {
let (pushable, residual) = classify_pending_for_scan(pending, &alias, &label_def);
let mut merged = predicates;
merged.extend(pushable);
apply_filters(
LogicalPlan::NodeScan { label, alias, predicates: merged },
residual,
)
}
```
`classify_pending_for_scan(pending, alias, label_def)` returns: * `pushable: Vec` — conjuncts that are single-column comparisons on `alias.` with a literal/parameter (only literals in v0; parameters deferred) and reference NO other alias. The property must be declared in `label_def`. * `residual: Vec` — everything else; stays as `Filter` above the NodeScan. The classification function lives in `optimize::parquet_pushdown::classify` (new module) and is unit tested independently. The integration into `pushdown_at` is a single arm change. Why fold the parquet pushdown into the same `pushdown_at` pass instead of a separate post-pass: the `pending` accumulator already carries every conjunct that the existing pushdown was about to materialise as `Filter` over `NodeScan`. Classifying them at the leaf is the natural place — it costs O(|pending|) per leaf and avoids a second tree walk. ### 6. EXPLAIN VERBOSE
```plaintext
Project [a] (est=1500)
NodeScan label=Person alias=a predicates=[a.age > 30, a.firstName = "Alice"] (est=1500)
```
Plain `EXPLAIN` (no VERBOSE) also renders the predicates — they are part of the operator shape, not annotations. `EXPLAIN RAW` shows the pre-optimize lowering, where `NodeScan` has `predicates: vec![]` and the conjuncts live in a `Filter` above. `predicates=[...]` rendering uses each predicate’s `Display`: * `Eq { column, value }` → `a.col = ` * `Lt { column, value }` → `a.col < ` * `Between { column, low, high }` → `a.col BETWEEN AND ` * `IsNull { column }` → `a.col IS NULL` * `IsNotNull { column }` → `a.col IS NOT NULL` * `In { column, values }` → `a.col IN [, , ...]` The `alias.col` prefix comes from the NodeScan’s `alias`. Literals render via their `StatScalar` Display. ### 7. Cardinality The `NodeScan` arm in `cost::cardinality::estimate_inner` becomes:
```rust
LogicalPlan::NodeScan { label, alias, predicates } => {
let base = catalog.label(label).map(|l| l.node_count as f64).unwrap_or(0.0);
let sel = predicates_selectivity(predicates, catalog, label);
let rows = base * sel;
// bindings, leaf as before
}
```
`predicates_selectivity` reuses the existing `selectivity` machinery from `cost::selectivity` by translating each `ScanPredicate` to its `Expression` analogue (`Eq → BinaryOp::Eq`, etc.) and calling `selectivity(&expr, &binding_stats)` where `binding_stats` is seeded from the property stats in the catalog. Multi-predicate combines under the independence assumption (RFC-010 §3.2). Trade-off: we double-evaluate selectivity (once at NodeScan level for the pushed predicates, once at any residual Filter above). This is correct — Filter applies on top of the already-reduced NodeScan estimate. ### 8. Edge SSTs Out of scope (see §“Out-of-scope”). `EdgesFwd/Inv` SSTs ship `DegreeHistogram` but no per-property stats. Adding edge-property stats requires the writer to track them per edge\_type — a separate RFC if/when needed. ## Alternatives considered ### A. Filter pushdown using DataFusion’s Expr DataFusion has a full Expr language with a `PhysicalPlanner` that converts pushable Expr to Parquet `RowFilter`. **Rejected**: bringing DataFusion as a dependency for the pushdown ergonomics alone is a mismatch — we’d still need a translator from our `Expression` to their `Expr`, and the rest of our executor wouldn’t share the path. A future morsel/vectorized iteration may revisit, but pushdown alone doesn’t pay it. ### B. Runtime adaptive sampling Detect that a query is selective by sampling the first N rows and deciding pushdown on the fly. **Rejected**: defeats EXPLAIN/PROFILE story (plan changes at runtime), and our static stats are good enough for the v0 regime. ### C. Encode predicates in a server-side filter pushdown to S3 Select S3 Select supports SQL filters server-side but requires the body to be in CSV/JSON. Parquet Select is not GA. **Rejected**: incompatible with our storage format. Future feasibility check goes with edge storage RFCs. ### D. Build a custom bloom filter per property column for eq pushdown The writer would emit a bloom over each property’s hashed values. Eq predicates probe the bloom before reading the row-group. **Deferred**: RFC-002 explicitly limits blooms to `node_id` (for point lookups). A property bloom is \~7×rows bytes — for 100k row SSTs that’s \~700 KiB per property column. The space cost only pays off on cardinalities that min/max-based pruning misses (which is rare — eq on high-NDV columns is already covered by min/max for the values inside a row-group and a NodeId-bloom alike). Track for follow-up if real workloads show it. ## Drawbacks 1. **No row-level filter in storage**. Surviving row-groups still decode in full; the executor’s Filter then drops non-matching rows. For row-groups with \~50% selectivity this double-touches values. Mitigated by the executor’s Filter living in the same process — it’s cheap. Row-level pushdown in storage would couple Arrow’s `filter` operator to the reader (morsel direction). 2. **Single-column predicates only**. Multi-column predicates (`a.x + a.y > 100`) stay as `Filter`. v0 accepted. 3. **No parameter substitution in v0**. `WHERE a.age > $minAge` keeps the Filter above the NodeScan since we don’t resolve `$minAge` until execution. A later optimization passes `&Params` into `optimize::optimize` to constant-fold them; deferred. 4. **OR predicates are not pushed**. `WHERE a.age > 30 OR a.firstName = "Alice"` is one conjunct in the AND-split (the OR root), and pushability requires single-column ⇒ rejected. The Filter survives. Could be added by extending `ScanPredicate` to a tree, but row-group verdict combination for OR gets messier (union of MaybePresent verdicts). 5. **`IsNotNull` is conservative**. We don’t have per-row-group row\_count to verify `null_count < row_count`. Always returns MaybePresent when stats exist; the executor’s Filter drops nulls at row level. Negligible cost. 6. **Cross-type predicate comparison returns MaybePresent**. By design (defensive). The optimizer doesn’t construct cross-type predicates because schemas declare property types — but if a future path introduces them, this is the safety net. 7. **Memtable predicate eval is row-by-row** (not vectorised). The memtable is typically small (<10k rows before flush), so this is negligible vs SST decoding. Morsel-driven execution can revisit. ## Open questions * **OQ1**. `predicates_selectivity` reuse path: should it translate `ScanPredicate` → `Expression` and call `selectivity`, or have its own simpler arm? Decided: translate (single source of truth for selectivity heuristics). * **OQ2**. Should `NodeById` also accept predicates? Today `NodeById` is a point-lookup. Predicates on the same alias COULD be applied during the lookup. v0: no — the lookup decodes one row group with at most one row anyway; Filter on top is fine. Track as follow-up if benchmarks show a hot path. * **OQ3**. Should the writer emit per-row-group HLL sketches (not just per-SST)? This would enable approx-NDV reasoning per row-group for eq pushdown. Deferred; the current per-SST HLL is sufficient for query-level cardinality estimates. ## References * *Parquet 2.0 column index + offset index* — Apache Parquet specification §6.2 (column index for page-level pruning). * DuckDB’s *predicate pushdown into Parquet readers* (Raasveldt 2022) — modern reference implementation. * `docs/rfc/002-sst-format.md` §4 — stats embedded in SST. * `docs/rfc/008-logical-plan-ir.md` — IR this RFC extends. * `docs/rfc/010-cost-based-optimizer.md` — cost model this RFC reuses. * `docs/rfc/011-predicate-pushdown.md` — the structural pushdown this RFC builds on. ## Plan de implementación 1. **`crates/namidb-storage/src/sst/predicates.rs`** (\~250 LoC + 18 tests): * `ScanPredicate` enum + `RowGroupVerdict` + `eval_row_group` evaluator. Helpers `scalar_cmp(a, b) -> Ordering` and `scalar_eq(a, b) -> bool` (delegating to PartialOrd / PartialEq of `StatScalar`). * Module `pub` in `sst/mod.rs`. * Unit tests cubren: Eq in/out range, Lt boundary, GtEq with NULL min, IsNull positive/negative, In with single/multi values, Between, missing min/max → MaybePresent, type mismatch → MaybePresent. 2. **`crates/namidb-storage/src/sst/nodes.rs`** (\~150 LoC + 5 tests): * `NodeSstReader::scan_with_predicates(&[ScanPredicate])` implementing §2 algorithm. `scan()` becomes a wrapper of `scan_with_predicates(&[])`. * Helper `row_group_stats_for_column(rg, col_name, prop_def) -> Option` reusing the mapping from `compute_property_stats`. * Unit tests: predicate skips all row-groups, predicate skips some, no predicates fall through to full scan, `IsNull` with NULL row-group survives, multi-predicate AND, no-stats fallback keeps row-group. 3. **`crates/namidb-storage/src/read.rs`** (\~80 LoC + 3 tests): * `Snapshot::scan_label_with_predicates(label, &[ScanPredicate])` implementing §3 algorithm. `scan_label(label)` wraps it with `&[]`. * Memtable predicate eval helper using `node_view_matches_predicates` (NULL-safe 3VL). * Unit tests: memtable filtering, SST filtering, predicate over tombstoned row, ND ndv (just kidding — verifies catalog isn’t used in scan path). 4. **`crates/namidb-storage/src/sst/mod.rs` + `lib.rs`**: * `pub mod predicates` + re-export `ScanPredicate`, `RowGroupVerdict`, `eval_row_group`. 5. **`crates/namidb-query/src/plan/logical.rs`** (\~30 LoC + 1 test): * `LogicalPlan::NodeScan` adds `predicates: Vec`. * Type alias `pub use namidb_storage::sst::predicates::ScanPredicate` at module root so the rest of the query crate doesn’t need to know the storage path. * Updates to `children()` (no children added — predicates are flat), `operator_name()` (still “NodeScan”), `contains_write()` (still false). * Test ensures NodeScan with predicates equals NodeScan with same predicates and not equal when predicates differ. 6. **`crates/namidb-query/src/optimize/parquet_pushdown.rs`** (\~250 LoC + 14 tests): * `classify_pending_for_scan(pending: Vec, alias: &str, label_def: &LabelDef) -> (Vec, Vec)`. * Conversion `try_into_scan_predicate(expr, alias) -> Option` case-analysing each `Expression::kind`. Supports: BinaryOp {Eq/Lt/LtEq/Gt/GtEq} with `PropertyAccess(alias, prop)` on one side and `Literal(lit)` on the other; the literal converts to `StatScalar` via a helper. `IS NULL / IS NOT NULL`. `IN [list]` when every element is a literal. `BETWEEN` decomposes to Gte+Lte at lowering time so the AND-split already gives us two conjuncts. * Tests: eq pushable, eq with non-matching alias rejected, eq with literal-on-left, range pushable, IS NULL pushable, IS NOT NULL pushable, IN with all literals, IN with non-literal rejected, cross-alias rejected, non-declared property rejected, parameter rejected, complex arithmetic rejected, idempotency. 7. **`crates/namidb-query/src/optimize/pushdown.rs`** (\~30 LoC + 4 tests): * NodeScan arm in `pushdown_at` now consults `parquet_pushdown::classify_pending_for_scan`. The non-pushable conjuncts materialise as `Filter` above; the pushable accumulate into `predicates`. * Tests verify: filter eq on declared prop ends up in NodeScan predicates; filter on parameter stays as Filter; filter on undeclared prop stays as Filter; filter on different alias stays as Filter (and would have been pushed elsewhere by the CrossProduct arm). 8. **`crates/namidb-query/src/optimize/mod.rs`** (\~10 LoC): * `pub mod parquet_pushdown` + re-export `classify_pending_for_scan` so tests can reach it. * The `optimize` pipeline doesn’t add a new pass; the NodeScan arm change in `pushdown_at` covers it. 9. **`crates/namidb-query/src/optimize/normalize.rs`** (\~5 LoC): * `recurse_children` arm for NodeScan recurses on… nothing (NodeScan is a leaf). The change is to preserve `predicates` when the arm clones the variant — trivial. 10. **`crates/namidb-query/src/exec/walker.rs`** (\~10 LoC): * `execute_node_scan` callsite (line \~140) passes `predicates` to `snapshot.scan_label_with_predicates(label, predicates).await?`. 11. **`crates/namidb-query/src/cost/cardinality.rs`** (\~40 LoC + 3 tests): * NodeScan arm applies multiplicative selectivity over predicates using the existing `selectivity::selectivity` and `BindingStats` machinery. * Tests: NodeScan with eq predicate estimate drops below base; NodeScan with range predicate estimate proportional to range; NodeScan with empty predicates equals base. 12. **`crates/namidb-query/src/plan/explain.rs`** (\~50 LoC + 2 tests): * `write_header` arm for NodeScan with predicates renders as §6. Predicate Display uses `format_scan_predicate(p, alias)` helper, also unit-tested. 13. **`crates/namidb-query/src/plan/lower.rs`** (\~10 LoC): * Lowering creates `NodeScan { predicates: vec![] }`. Mecánica; sin test new. 14. **`crates/namidb-query/tests/cost_smoke.rs`** (+8 integration tests): * `parquet_pushdown_moves_eq_to_scan` * `parquet_pushdown_moves_range_to_scan` * `parquet_pushdown_keeps_cross_alias_in_filter` * `parquet_pushdown_keeps_undeclared_property_in_filter` * `parquet_pushdown_renders_in_explain` * `parquet_pushdown_estimate_drops_below_full_scan` * `parquet_pushdown_executes_with_parity_to_raw` * `parquet_pushdown_skips_all_row_groups_when_out_of_range` Snapshot esperado: * `cargo test --workspace --exclude namidb-py`: 528 → \~580 passed. * `cargo clippy --workspace --all-targets --exclude namidb-py -- -D warnings`: clean. * `cargo fmt --all -- --check`: clean. * LoC nuevo: \~900 src + \~500 tests + \~650 RFC.
# RFC 014: HashSemiJoin via decorrelation
> **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-011 (predicate pushdown), RFC-012 (HashJoin) **Supersedes:** —
> *Mirrored from [`docs/rfc/014-hash-semi-join.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/014-hash-semi-join.md) in the engine repo. Source of truth lives there.* **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-011 (predicate pushdown), RFC-012 (HashJoin) **Supersedes:** — ## Summary Cosecha el ÚLTIMO operador nested-loop del read-path: `SemiApply`. Hoy `SemiApply { input, subplan, negated }` ejecuta `subplan` UNA VEZ por cada row de `input` — O(N·M) sobre `EXISTS` subqueries. Para LDBC SF1 (3M outer rows × 100 avg degree por subplan = 3×10⁸ ops) eso es infactible. Esta RFC decorrelaciona el subplan (sustituye el `Argument` leaf por una `NodeScan` independiente), construye un hash table una sola vez, y filtra el outer probando contra él: O(N+M). Alcance: * Nuevo operador `LogicalPlan::HashSemiJoin { outer, inner, on, negated, residual }`. Forma EXACTAMENTE como `HashJoin` excepto que: * `outer` ↔ probe semantic (NO se duplican rows del inner — máx 1 output row por outer row matching); * `negated` flag para `AntiSemiJoin` (NOT EXISTS). * Rewriter `convert_semi_apply_to_hash_semi_join(plan, &catalog)` bottom-up. Detecta `SemiApply` cuyo subplan: 1. tiene EXACTAMENTE un `Argument` leaf, 2. cuyas `bindings` son un SUBSET de 1 alias `X`, 3. cuyo label puede inferirse del outer scope (NodeScan o Expand con `target_label`). Sustituye `Argument(X)` por `NodeScan { label: , alias: X, predicates: vec![] }` y emite `HashSemiJoin` con `on=[JoinKey{ build: Property(X,"id"), probe: Property(X,"id") }]`. * Executor `execute_hash_semi_join`: build phase materializa un `BTreeSet` (no full row buffering — solo necesitamos “any match”); probe phase emite outer row si lookup acierta (semi) o si NO acierta (anti). * Cardinality `HashSemiJoin` estima rows como `outer.rows · min(1.0, inner.rows / outer_X_distinct)` (semi-join retains at most all outer rows). * EXPLAIN VERBOSE: `HashSemiJoin on=[(a.id, a.id)] negated=false` o `AntiHashSemiJoin` cuando `negated=true`. Out-of-scope: * **SemiApply cuyo subplan tiene Argument con MÚLTIPLES bindings**. Requiere multi-column hash key — diferente shape. Iteración futura. * **SemiApply cuyo subplan no contiene Argument** (subplan independiente del outer). Es trivialmente “ejecutar una vez”, pero requiere otro rewrite path (cache+broadcast). Iteración futura. * **PatternList decorrelation**. Mismo shape pero materializa lista en vez de boolean. Iteración futura. * **Pushdown sobre HashSemiJoin**. Heredado del existing pushdown (`hash_semi_join` arm en `optimize::pushdown::pushdown_at`): conjuncts del outer pueden bajar al outer-side, conjuncts del inner-only no aplican (el inner no contribuye bindings al output). * **Multi-pattern EXISTS** (`EXISTS { (a)-[]->(b)-[]->(c) }`). El subplan tiene un solo Argument leaf; el cuerpo es un chain de Expands. Funciona automáticamente — el rewriter solo reemplaza el Argument; el resto del subplan se mantiene literal y se ejecuta como inner en build phase. ## Motivation Sin decorrelation, una query como:
```cypher
MATCH (a:Person)
WHERE EXISTS { (a)-[:KNOWS]->(b:Person) }
RETURN a.firstName
```
produce el plan:
```plaintext
Project [a.firstName]
SemiApply { negated: false }
NodeScan { Person, a } (outer)
Expand { source=a, edge_type=KNOWS, target=b } (subplan)
Argument { bindings: [a] }
```
Sobre micro-graph 6 Persons / 6 KNOWS, esto ya cuesta 6 × scan\_label = 6 evaluaciones del subplan (que itera todos los Persons + edges per outer row). Sobre LDBC SF1 (3M Persons / 100 avg degree), cuesta 3M × scan\_label = infinito. Con decorrelation el plan optimizado es:
```plaintext
Project [a.firstName]
HashSemiJoin on=[(a.id, a.id)]
NodeScan { Person, a } (outer)
Expand { source=a, edge_type=KNOWS, target=b } (inner)
NodeScan { Person, a } (decorrelated leaf)
```
Build phase ejecuta el inner UNA vez: 6M edges. Probe phase: 3M outer rows × O(1) lookup = 3M ops. Total: 6M build + 3M probe = 9M ops. **Mejora de 3M / 30 = 100 000× para este caso típico**. ## Design ### 1. IR: `LogicalPlan::HashSemiJoin`
```rust
HashSemiJoin {
/// The "probe" side. Bindings from `outer` are the ones that
/// survive into the output.
outer: Box,
/// The "build" side. Bindings introduced by `inner` are
/// DROPPED — only used to decide whether each outer row matches.
inner: Box,
/// Equi-join keys. `build_side` is evaluated on each `inner`
/// row at build time, `probe_side` on each `outer` row at
/// probe time. Single-key in v0 (multi-key is OK if needed).
on: Vec,
/// `false`: keep outer rows that have at least one inner match
/// (`EXISTS`). `true`: keep outer rows with NO inner match
/// (`NOT EXISTS`).
negated: bool,
/// Residual predicate evaluated on the joined row (outer
/// bindings + inner bindings, 3VL). Optional — most simple
/// EXISTS lower to no residual.
residual: Option,
}
```
Semantics: * `outer.bindings` ⊆ output bindings. Inner bindings are dropped (the semi-join semantics). * Build phase: for each `inner` row, evaluate `JoinKey::build_side` expressions; if any key component is NULL, skip the row (3VL). Build a `BTreeSet>` (canonical key fingerprint, reusing `dedup_rows`’s helper). * Probe phase: for each `outer` row, evaluate `JoinKey::probe_side` expressions; if any is NULL, skip (3VL). Lookup in the set. Emit outer row iff `(matched, negated)` matches the desired truth table: * `(true, false)` → keep (EXISTS). * `(false, true)` → keep (NOT EXISTS). * else → drop. * Residual: when present, evaluate on the JOINED row (outer ∪ build’s full row recovered from a secondary `Vec` map). For v0 we default `residual = None` since the lowering of bare EXISTS doesn’t generate one. ### 2. Rewriter `optimize::decorrelation::convert_semi_apply_to_hash_semi_join` Pre-walk the plan to populate `outer_labels: BTreeMap>` — alias → declared label for every NodeScan and labeled Expand target in scope. Then walk top-down. For each `SemiApply { input, subplan, negated }`: 1. Recurse into `input` and `subplan` (independent decorrelation). 2. Detect `subplan` shape: * Has exactly ONE `Argument { bindings: [X] }` leaf (depth-first descent through Expand/Filter/NodeById; reject if multiple Arguments or any operator that’s not in the decorrelation-safe list). * The Argument’s `bindings` is exactly `[X]` (a single alias). * `outer_labels[X] == Some(L)` for some label `L`. 3. Build `inner` by walking the subplan, replacing the unique `Argument` with `NodeScan { label: L, alias: X, predicates: vec![] }`. 4. Emit `HashSemiJoin { outer: input, inner: new_subplan, on: vec![JoinKey { build_side: Property(X, "id"), probe_side: Property(X, "id") }], negated, residual: None }`. Decorrelation-safe operators (descend into to find Argument): * `Expand` * `Filter` (residual conjuncts stay on the inner; semantics is “the subplan still filters its rows; HashSemiJoin probes whether any filtered row matches the outer key”) * `NodeById` (only if its `input` is the Argument leaf) * `Project` with `discard_input_bindings: false` (rare in subplans but possible) * `NodeScan`, `Empty` — never contain Argument; the Argument has to be at the leaf. If any other operator appears (`Aggregate`, `TopN`, `Distinct`, `Union`, `CrossProduct`, `HashJoin`, write ops, `SemiApply` itself), the rewriter bails and the original SemiApply is kept. v0 keeps the detection conservative — false negatives leave performance on the table but never produce incorrect plans. Idempotency: `HashSemiJoin` is not a `SemiApply`, so the second pass of the fixpoint won’t re-trigger. Verified in unit tests. ### 3. Executor
```rust
async fn execute_hash_semi_join(
outer: &LogicalPlan,
inner: &LogicalPlan,
on: &[JoinKey],
negated: bool,
residual: &Option,
snapshot: &Snapshot<'_>,
params: &Params,
outer_bindings: Option<&Row>,
) -> Result, ExecError>
```
Phase 1 (build): execute `inner` once (no outer context). For each inner row, evaluate every `JoinKey::build_side` expression. If ANY is NULL, skip the row (3VL). Otherwise, push the fingerprint into a `BTreeSet>`. Phase 2 (probe): execute `outer`. For each outer row, evaluate every `JoinKey::probe_side`. Compute matched = `set.contains(fingerprint)`. Keep iff `(matched, negated)` is `(true, false)` or `(false, true)`. Residual: when `residual.is_some()`, the build phase additionally stores the full inner row alongside the fingerprint, and the probe phase iterates matching inner rows to evaluate the residual on the joined binding map. v0 ships `residual = None` from the rewriter so this path is exercised only by future RFC iterations. ### 4. Cardinality
```rust
LogicalPlan::HashSemiJoin { outer, inner, on, negated, residual: _ } => {
let o = estimate_inner(outer, catalog);
let i = estimate_inner(inner, catalog);
// P(at least one inner match for an outer row) ≈
// 1 - (1 - i.rows/distinct(inner_key))^(o.rows/distinct(outer_key))
// Simplification: i.rows / max(distinct_outer, 1.0) treated as the
// probability a random outer row matches.
let frac_match = (i.rows / o.rows.max(1.0)).min(1.0);
let rows = if negated {
o.rows * (1.0 - frac_match)
} else {
o.rows * frac_match
};
...
}
```
The estimate is folklore for now — multi-key correlation and inner NDV are revisited a futuro. The output is clamped to `[0, o.rows]`. ### 5. EXPLAIN VERBOSE
```plaintext
HashSemiJoin on=[(a.id, a.id)] (est=4)
NodeScan label=Person alias=a (est=6)
Expand source=a edge_type=KNOWS target=b (est=12)
NodeScan label=Person alias=a (est=6)
```
When `negated=true`, the operator name is `AntiHashSemiJoin` (mirrors the existing `AntiSemiApply` rendering). ### 6. Integration with the pipeline * `optimize::optimize` (in `optimize::mod`) runs `convert_semi_apply_to_hash_semi_join` AFTER `convert_cross_to_hash` in the same fixpoint round. Order: pushdown → normalize → cross-to-hash → semi-to-hashsemi. * `optimize::pushdown::pushdown_at` gets a `HashSemiJoin` arm: conjuncts that reference only `outer` bindings push to the outer side; conjuncts that reference inner-only bindings are nonsensical (inner bindings are dropped) so they stay above the HashSemiJoin defensively. * `optimize::normalize` gets a `HashSemiJoin` arm: recurse on outer and inner. ### 7. Bindings analysis A subtle point: `Argument { bindings: [X] }` may carry multiple names when the subplan needs more than one outer variable. v0 rejects those (`arg_bindings.len() != 1`). They will be common in deeper LDBC queries with chained patterns; una iteración futura lifts the restriction via multi-key joins. When the subplan introduces NEW bindings (e.g. `Expand` introduces `target_alias`), those are local to the subplan and DO NOT leak to the outer output — same as the existing SemiApply semantics. ## Alternatives considered ### A. Hash table over outer + iterate inner Build over outer (keyed by `X.id`), iterate inner and emit `outer_row` once when matched. **Rejected**: requires deduplication on the inner side (the same outer might match multiple inner rows emitting duplicates). Cheaper to build over inner. Actually we go the other way: **build over INNER** (because inner is the small EXISTS side typically — friends-of-friend, etc.) and probe outer. The build side stores the SET of key values; each outer is checked once. Result preserves outer row order. ### B. Adaptive at runtime Decide per-query whether to decorrelate based on the actual sizes of outer/inner. **Rejected**: defeats EXPLAIN/PROFILE story. The cost model picks the side (build vs probe) statically. ### C. Apply pushdown into the subplan A more aggressive rewrite: pushing outer predicates INTO the inner subplan so the inner produces only the relevant subset. **Rejected for v0**: requires correlation analysis beyond the equality on the `X.id` (parameter propagation). Una iteración futura may revisit. ## Drawbacks 1. **Restricted to single-Argument-binding subplans**. Multi-binding correlation is common in real LDBC queries — `EXISTS { (a)-[]->(b) ... b.x = a.y }`. v0 keeps these as nested-loop SemiApply. 2. **Inner duplication**. The decorrelated inner enumerates every X in the corpus (not just those referenced by the outer). When the outer is much smaller than the inner’s universe (e.g. outer is a single row), nested-loop SemiApply is cheaper — `inner.rows / outer.rows` becomes lopsided. v1 could compare estimates and choose accordingly; v0 always decorrelates when shape matches. 3. **NULL on join key drops outer / inner rows silently**. Same as `HashJoin` 3VL semantics. Documented inline. For typical `EXISTS { (a)-[]->(b) }` this never triggers because node ids are never NULL. 4. **Cost-model estimate is folklore**. The independence and uniform- distribution assumptions over-simplify. RFC-010 §“Drawbacks 1” tracks the broader observation; a futuro puede refinarse. ## Open questions * **OQ1**. Should the rewriter try to lift `Filter` arms from inside the subplan to the outer-side when the conjunct references only outer bindings? Today the subplan is rewritten verbatim. Defer. * **OQ2**. How does `HashSemiJoin` interact with `PatternList`? `PatternList` is semantically a multi-row apply that materialises a list. Decorrelation produces a `HashJoin` (NOT semi-join) with array aggregation per outer key. Separate RFC. ## References * Selinger et al., *Access Path Selection in a Relational Database Management System* (SIGMOD ‘79) — semi-join cardinality. * Galindo-Legaria & Joshi, *Orthogonal Optimization of Subqueries and Aggregation* (SIGMOD ‘01) — formal decorrelation rewrites. * `docs/rfc/008-logical-plan-ir.md` — IR this RFC extends. * `docs/rfc/012-hash-join.md` — HashJoin executor this RFC mirrors. ## Plan de implementación 1. **`crates/namidb-query/src/plan/logical.rs`** (\~50 LoC + 2 tests): * `LogicalPlan::HashSemiJoin` variant. * `operator_name` returns `"HashSemiJoin"` / `"AntiHashSemiJoin"`. * `children` → `[outer, inner]`. `contains_write` → false (rewriter never touches subtrees with writes). 2. **`crates/namidb-query/src/optimize/decorrelation.rs`** (\~250 LoC + 8 tests): * `convert_semi_apply_to_hash_semi_join(plan, &catalog)`. * `outer_label_map(plan)` walks NodeScan/Expand collecting alias → label. * `find_unique_argument(plan)` returns `Option<(&Argument bindings, parent path)>`. * `replace_argument(plan, x, label)` substitutes the unique Argument with a fresh `NodeScan { label, alias: X, predicates: vec![] }`. * Tests cover: simple EXISTS → decorrelates, NOT EXISTS → negated, no-Argument subplan → kept as SemiApply, multi-binding Argument → kept, label unknown → kept, EXISTS with extra Filter → still decorrelates (filter remains in inner), nested SemiApply → outer SemiApply NOT touched if its subplan has SemiApply, idempotency. 3. **`crates/namidb-query/src/optimize/mod.rs`** (\~10 LoC): * `pub mod decorrelation`. * `optimize` pipeline runs `convert_semi_apply_to_hash_semi_join(plan, catalog)` AFTER `convert_cross_to_hash`. 4. **`crates/namidb-query/src/optimize/pushdown.rs`** (\~30 LoC + 3 tests): * HashSemiJoin arm (split pending by outer/inner-aliases). 5. **`crates/namidb-query/src/optimize/normalize.rs`** (\~5 LoC): * HashSemiJoin arm in `recurse_children`. 6. **`crates/namidb-query/src/cost/cardinality.rs`** (\~40 LoC + 3 tests): * HashSemiJoin arm with §4 formula. 7. **`crates/namidb-query/src/exec/walker.rs`** (\~80 LoC + 2 tests): * `execute_hash_semi_join` build/probe phases. 8. **`crates/namidb-query/src/exec/writer.rs`** (\~10 LoC): * HashSemiJoin arm (defensive, never produced by writes). 9. **`crates/namidb-query/src/plan/explain.rs`** (\~30 LoC + 2 tests): * `HashSemiJoin` rendering. `negated` flag selects `AntiHashSemiJoin`. 10. **`crates/namidb-query/tests/cost_smoke.rs`** (+5 integration tests): * `decorrelation_converts_simple_exists`, * `decorrelation_preserves_results`, * `decorrelation_handles_not_exists`, * `decorrelation_keeps_multi_binding_subplan_as_semi_apply`, * `decorrelation_renders_hash_semi_join_in_explain`. Snapshot esperado: * `cargo test --workspace --exclude namidb-py`: 596 → \~625 passed. * `cargo clippy --workspace --all-targets -- -D warnings`: clean. * `cargo fmt --all -- --check`: clean. * LoC nuevo: \~500 src + \~250 tests + \~400 RFC.
# RFC 015: Projection pushdown / column pruning
> **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-013 (Parquet predicate pushdown) **Supersedes:** —
> *Mirrored from [`docs/rfc/015-projection-pushdown.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/015-projection-pushdown.md) in the engine repo. Source of truth lives there.* **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-013 (Parquet predicate pushdown) **Supersedes:** — ## Summary Hoy `NodeSstReader::scan()` decodifica TODAS las columnas Parquet declaradas en el `LabelDef`, aunque la query referencie sólo una fracción. Sobre Person en LDBC SF1 (\~12 columnas, \~3M filas), un `RETURN a.firstName` decodifica 12× más datos del necesario. Esta RFC cierra el end-to-end del pushdown: 1. **Analyze** — walk del plan top-down recolectando, por alias, el conjunto de propiedades que las expresiones referencian (RETURN, WHERE residual, ORDER BY, predicados de filtro intermedios). 2. **Annotate** — `LogicalPlan::NodeScan` gana `projection: Option>` (None = todas las columns, default para back-compat). El rewriter lo populates con el set inferido del analyze step. 3. **Storage** — `NodeSstReader::scan_with_predicates_and_projection` construye un `ProjectionMask` de Parquet que sólo lee las column leafs necesarias. Las engine columns (`node_id`, `tombstone`, `lsn`, `__schema_version`, `__overflow_json`) se incluyen siempre. 4. **Reader** — el resto del path (`Snapshot::scan_label_*`) se adapta a transparentar la projection. 5. **EXPLAIN VERBOSE** — `NodeScan label=Person alias=a projection=[firstName]` cuando hay projection no-trivial. Sobre LDBC SF1 con `RETURN a.firstName` esperamos: 12× menos bytes leídos desde S3 + 12× menos decoding CPU. Alcance: * Property-column pruning para NodeScan. Edge SSTs y NodeById quedan out-of-scope v0. * Análisis conservador: cuando una expresión usa `Variable(a)` sin PropertyAccess (e.g. `RETURN a`), la projection es `None` (lee todas las columnas, incluso `__overflow_json`). * Análisis de subplans de SemiApply/PatternList/HashSemiJoin va recursivo dentro de cada scope (los inner emiten su propio projection). * Predicates ya pushados a `NodeScan.predicates` (RFC-013) cuentan como referencias a sus columnas — el storage los necesita para filtrarlas. Out-of-scope: * **EdgesFwd/Inv property streams**. Edges aún no emiten per-property streams (RFC-002 §3.2.7 follow-up). Sin streams separados no hay granularidad de proyección. * **NodeById**. Decodifica un row group con max 1 row; el ahorro de IO es marginal y el overhead de la projection mask para un point-lookup no se justifica v0. * **Overflow column elision**. Cuando la query NO referencia propiedades no-declaradas, podríamos omitir `__overflow_json`. v0 lo mantiene siempre (defensivo). * **Projection pushdown dentro de Project**. Cuando un `Project` deja bindings vivas (`discard_input_bindings: false`), todas las columnas downstream son potencialmente referenciadas. v0 trata Project no-discarding como barrera. * **Pruning de schema-version / lsn columns**. Solo de propiedades declaradas. Las engine columns son baratas (UInt64 chunks RLE-comprimidos) y removerlas rompería la semántica de tombstone/winner. ## Motivation Plan ejemplo pre-rewrite:
```plaintext
Project [a.firstName] (est=3000000)
NodeScan label=Person alias=a predicates=[] (est=3000000)
```
`NodeScan` decodifica `prop_firstName`, `prop_lastName`, `prop_birthday`, `prop_creationDate`, `prop_locationIP`, `prop_browserUsed`, `prop_gender`, `prop_email`, `prop_speaks`, … 12+ columns Parquet. El executor luego accede solo `row[a].get("firstName")`. Con projection pushdown:
```plaintext
Project [a.firstName] (est=3000000)
NodeScan label=Person alias=a projection=[firstName] predicates=[] (est=3000000)
```
`NodeSstReader::scan_with_predicates_and_projection` construye un `ProjectionMask::leaves(schema, &[firstName_leaf])` y Parquet sólo lee las column pages relevantes. **Reducción de bytes leídos: \~10× sobre Person SF1**. La mejora se acumula con el parquet predicate pushdown (ya descartó row-groups; ahora descartamos columnas dentro de los row-groups que sobreviven). ## Design ### 1. IR change
```rust
NodeScan {
label: String,
alias: String,
predicates: Vec,
/// Optional projection: only these property columns are
/// materialised. `None` = include every declared property
/// (back-compat). The rewriter populates this from analysis;
/// lowering emits `None`.
projection: Option>,
}
```
`PartialEq`, `Clone`, `Debug` derive over the new field. All existing constructions of `NodeScan` upgrade to `projection: None`. Two NodeScans with different projection are considered different plans (matters for the `optimize` fixpoint termination check). ### 2. Analysis Walk the plan TOP-DOWN with a `RequiredSet`:
```rust
#[derive(Default, Clone)]
struct RequiredSet {
/// Properties accessed for each alias still in scope.
by_alias: BTreeMap