This is the abridged developer documentation for NamiDB # NamiDB Documentation > Reference documentation for NamiDB — a cloud-native graph database that stores its state in an S3-compatible bucket. Embedded, server, and cloud deployments share one engine and one storage layout. NamiDB is a graph database engine. Its state — manifest, WAL, SSTs, schema — lives as plain objects in an S3-compatible bucket. The engine ships in three shapes (embedded library, HTTP daemon, managed cloud); all three speak the same Cypher and write to the same bucket layout. These docs cover the engine end-to-end. If you’ve never used NamiDB before, the **[30-second quickstart](/en/get-started/quickstart)** is the fastest taste; the **[S3 walkthrough](/en/get-started/your-graph-in-s3)** shows the headline use case. ## Documentation map ### [Get started](/en/get-started/what-is-namidb) What NamiDB is, the three deployment shapes, how to install each client, and the two quickstart paths. * [What is NamiDB](/en/get-started/what-is-namidb) * [30-second quickstart](/en/get-started/quickstart) * [Your graph in S3](/en/get-started/your-graph-in-s3) * [Choose a deployment](/en/get-started/choose-deployment) * [Install](/en/get-started/install) ### [Concepts](/en/concepts/bucket-is-the-database) The mental models behind the engine — why object storage as source of truth works, what a snapshot is, how single-writer-per-namespace is enforced, where the caches live. * [The bucket is the database](/en/concepts/bucket-is-the-database) * [Three deployments, one engine](/en/concepts/three-deployments) * [Namespaces & multi-tenancy](/en/concepts/namespaces) * [Snapshots & epoch fencing](/en/concepts/snapshots-and-epoch-fencing) * [LSM on object storage](/en/concepts/lsm-on-object-storage) * [Caches](/en/concepts/caches) ### [Cypher reference](/en/cypher/supported-subset) The exact GQL (ISO/IEC 39075:2024) + openCypher 9 subset NamiDB parses, plans, and executes today. 
Includes the 12 in-scope LDBC SNB Interactive queries. * [Supported subset](/en/cypher/supported-subset) * [Read queries](/en/cypher/read-queries) * [Write queries](/en/cypher/write-queries) * [Functions & operators](/en/cypher/functions-and-operators) * [LDBC SNB IC01–IC12](/en/cypher/ldbc-snb) ### [SDK reference](/en/sdk/python) One engine, four surfaces. Same Cypher across all. * [Python](/en/sdk/python) — `pip install namidb` * [Rust (embedded)](/en/sdk/rust) — `cargo add namidb` * [CLI](/en/sdk/cli) — `namidb parse / explain / run` * [HTTP API](/en/sdk/http) — `namidb-server` REST endpoints ### [Operations](/en/operations/configuration) Environment variables, the URI grammar, all six storage backends, the Docker Compose self-host recipe, observability, backups, and tuning knobs. * [Configuration](/en/operations/configuration) · [URI grammar](/en/operations/uri-grammar) * Storage backends: [AWS S3](/en/operations/storage/aws-s3) · [R2](/en/operations/storage/cloudflare-r2) · [GCS](/en/operations/storage/gcs) · [Azure](/en/operations/storage/azure) · [MinIO/Tigris/LocalStack](/en/operations/storage/minio-tigris-localstack) · [Local](/en/operations/storage/local-filesystem) * [Self-host with Docker Compose](/en/operations/self-host-docker-compose) * [Observability](/en/operations/observability) · [Backup & restore](/en/operations/backup-restore) · [Tuning](/en/operations/tuning) ### [Cloud](/en/cloud/what-is-namidb-cloud) Managed multi-tenant SaaS on `namidb.com`, per-namespace scale-to-zero, closed beta. * [What is NamiDB Cloud](/en/cloud/what-is-namidb-cloud) * [Request access](/en/cloud/request-access) * [Cloud vs self-hosted](/en/cloud/cloud-vs-self-hosted) ### [Internals (RFCs)](/en/internals/rfcs) 18 design RFCs covering the storage engine, SST format, query engine, cost-based optimizer, factorization, and caches. 
The canonical source of “why the engine looks like it does.” * [RFC index](/en/internals/rfcs) * Storage engine: [RFC-001](/en/internals/rfcs/001-storage-engine) · [RFC-002](/en/internals/rfcs/002-sst-format) · [RFC-003](/en/internals/rfcs/003-read-path-ranged-reads) * Query language: [RFC-004](/en/internals/rfcs/004-cypher-subset) · [RFC-008](/en/internals/rfcs/008-logical-plan-ir) · [RFC-009](/en/internals/rfcs/009-write-clauses) * Optimizer: [RFC-010 → 016](/en/internals/rfcs) * Executor & caches: [RFC-017](/en/internals/rfcs/017-factorization) · [RFC-018](/en/internals/rfcs/018-csr-adjacency) · [RFC-019](/en/internals/rfcs/019-node-view-cache-shared) · [RFC-020](/en/internals/rfcs/020-edge-sst-caches) ### [Community](/en/community/contributing) How to engage with NamiDB development, the RFC process, security reporting, license terms. * [Contributing](/en/community/contributing) * [RFC process](/en/community/rfc-process) * [Security](/en/community/security) * [License (BSL 1.1)](/en/community/license) ### [Changelog](/en/changelog) All notable changes across the engine, Python bindings, server, and CLI. 
## Common entry points * **I want to write Cypher right now** → [30-second quickstart](/en/get-started/quickstart) * **I want to use my own S3 bucket** → [Your graph in S3](/en/get-started/your-graph-in-s3) * **I want a REST endpoint to call from my app** → [HTTP API](/en/sdk/http) + [Self-host recipe](/en/operations/self-host-docker-compose) * **I want to know what Cypher works** → [Supported subset](/en/cypher/supported-subset) * **I want to read how the storage engine works** → [RFC-001](/en/internals/rfcs/001-storage-engine) → [RFC-002](/en/internals/rfcs/002-sst-format) * **I want to migrate from Kùzu / Neo4j** → [Cypher reference](/en/cypher/supported-subset) and open an issue if anything is missing * **I want the managed product** → [NamiDB Cloud](/en/cloud/what-is-namidb-cloud) ## Project info | Resource | Where | | ------------- | -------------------------------------------------------------------------- | | Engine repo | [github.com/namidb/namidb](https://github.com/namidb/namidb) | | PyPI package | [pypi.org/project/namidb](https://pypi.org/project/namidb) | | Issues & RFCs | [github.com/namidb/namidb/issues](https://github.com/namidb/namidb/issues) | | Security | `security@namidb.com` | | General | `hello@namidb.com` | | Website | [namidb.com](https://namidb.com) | | License | [BSL 1.1 → Apache 2.0 after 3y](/en/community/license) | # Claude Code skill (namidb-guide) > Install the official NamiDB skill so Claude Code auto-fires the right context whenever you write Cypher, open a tg.Client, or call /v0/cypher. The docs repo ships a Claude Code Skill — **`namidb-guide`** — that loads automatically when Claude Code detects a NamiDB-related prompt. 
It contains: * A tight system-level summary of the engine (the bucket-as-database model, single-writer-per-namespace, six URI schemes, v0.3 breaking `_id` change, common pitfalls) * A Cypher subset reference (every clause, function, operator) * A URI grammar reference (every scheme, every credential mechanism) * An SDK reference (Python sync + async, Rust, CLI, HTTP) * Copy-pasteable Cypher snippets (CRUD, LDBC IC01/IC02 shapes, bulk ingest, EXPLAIN) ## What triggers it The skill’s `description` field is tuned to fire when you mention or write any of: * Cypher keywords (`MATCH`, `CREATE`, `MERGE`, `RETURN`, …) * `tg.Client(...)`, `import namidb`, `client.cypher`, `merge_nodes`, `merge_edges` * `parse_uri`, `WriterSession`, `commit_batch`, `namidb::storage` * NamiDB URI schemes (`memory://`, `file://`, `s3://`, `gs://`, `az://`) — especially with `?ns=` * `namidb-server`, `/v0/cypher`, `NAMIDB_*` env vars * Manifest CAS / epoch fencing errors (`412 Precondition Failed`, `epoch fenced`) * LDBC SNB Interactive queries (IC01–IC12) It also auto-activates inside files matching `**/*.cypher`, `**/*.py`, `**/*.rs`, `**/Cargo.toml`, `**/pyproject.toml`, `**/docker-compose.y*ml`. ## Install — three paths The skill is hosted directly on this site at [`docs.namidb.com/skill/namidb-guide/`](https://docs.namidb.com/skill/namidb-guide/SKILL.md), with a downloadable archive at [`docs.namidb.com/skill/namidb-guide.tar.gz`](https://docs.namidb.com/skill/namidb-guide.tar.gz). Both are rebuilt on every push to the docs site. * One-line install (recommended) Downloads and unpacks the skill into your user-level Claude Code directory so it fires in every Claude Code session: ```bash mkdir -p ~/.claude/skills curl -fsSL https://docs.namidb.com/skill/namidb-guide.tar.gz \ | tar -xzC ~/.claude/skills/ ``` Verify: ```bash ls ~/.claude/skills/namidb-guide/ # SKILL.md references/ examples/ ``` To update later, re-run the same command — `tar -xz` overwrites in place. 
* Per-file (inspect first) If you want to read the files before installing them, pull each one individually: ```bash BASE=https://docs.namidb.com/skill/namidb-guide DEST=~/.claude/skills/namidb-guide mkdir -p "$DEST"/{references,examples} curl -fsSL "$BASE/SKILL.md" -o "$DEST/SKILL.md" curl -fsSL "$BASE/references/cypher-subset.md" -o "$DEST/references/cypher-subset.md" curl -fsSL "$BASE/references/uris.md" -o "$DEST/references/uris.md" curl -fsSL "$BASE/references/sdks.md" -o "$DEST/references/sdks.md" curl -fsSL "$BASE/examples/queries.md" -o "$DEST/examples/queries.md" ``` You can also browse the files in your browser at the same URLs before installing. * Plugin (when available) Once published to a Claude Code plugin marketplace, install via: ```bash /plugin install namidb-guide ``` The `.claude-plugin/plugin.json` ships with the skill bundle. Marketplace publication is on the roadmap; for now prefer the one-line install above. ## Verify it’s loaded In a Claude Code session, run: ```plaintext /skills ``` You should see `namidb-guide` listed. Or test the auto-fire by typing: ```plaintext write a Cypher query that returns the 10 oldest Person nodes from my s3 namespace ``` Claude should respond with a v0.3-correct query using `_id` (not `id`) and reference `tg.Client("s3://...?ns=...")`. ## What it looks like inside ```plaintext .claude/skills/namidb-guide/ ├── SKILL.md # entry point (frontmatter + body) ├── references/ │ ├── cypher-subset.md # every clause / function / operator │ ├── uris.md # URI grammar, credential matrix │ └── sdks.md # Python / Rust / CLI / HTTP └── examples/ └── queries.md # copy-pasteable snippets ``` The `SKILL.md` body is kept under 500 lines (it counts toward token cost for the entire session once loaded). Heavier reference content sits in the supporting files — Claude pulls them in on demand. 
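
As a rough sanity check of an install against the layout shown above, the sketch below looks for the five files from the tree and warns when `SKILL.md` reaches 500 lines. The function name `check_skill` is hypothetical and not part of the skill bundle. It counts the whole file, frontmatter included, so it is slightly stricter than the "body under 500 lines" guideline.

```python
from pathlib import Path

# Files taken from the directory tree documented above.
EXPECTED_FILES = [
    "SKILL.md",
    "references/cypher-subset.md",
    "references/uris.md",
    "references/sdks.md",
    "examples/queries.md",
]
MAX_SKILL_LINES = 500  # guideline quoted above; applied here to the whole file


def check_skill(root: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    problems = []
    for rel in EXPECTED_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing: {rel}")
    skill_md = root / "SKILL.md"
    if skill_md.is_file():
        line_count = len(skill_md.read_text(encoding="utf-8").splitlines())
        if line_count >= MAX_SKILL_LINES:
            problems.append(
                f"SKILL.md has {line_count} lines (expected under {MAX_SKILL_LINES})"
            )
    return problems
```

For a user-level install, calling `check_skill(Path.home() / ".claude" / "skills" / "namidb-guide")` should return an empty list when the one-line install succeeded.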
## Frontmatter — what we set and why ```yaml --- name: namidb-guide description: | Use when the user is working with NamiDB, the cloud-native graph database whose state lives in an S3-compatible bucket. Triggers on Cypher queries, tg.Client(...) calls, /v0/cypher endpoint calls, NamiDB URI schemes, and NAMIDB_* env vars. when_to_use: | Also fire when troubleshooting manifest CAS errors, tuning the cost-based optimizer or caches, or migrating from Kùzu / Neo4j. paths: - "**/*.cypher" - "**/*.cql" - "**/*.py" - "**/*.rs" - "**/Cargo.toml" - "**/pyproject.toml" - "**/docker-compose.y*ml" --- ``` | Field | Why | | ------------- | ---------------------------------------------------------------------- | | `name` | Stable identifier; `kebab-case`, ≤ 64 chars | | `description` | What Claude reads to decide whether to fire — **specificity matters** | | `when_to_use` | Secondary triggers appended to `description` (1,536-char combined cap) | | `paths` | File globs that auto-activate the skill in those contexts | We deliberately do **not** set `disable-model-invocation` or `user-invocable: false` — we want both auto-fire and manual `/skills` invocation. ## Feedback Spot something wrong in `SKILL.md` or the references? The skill is the **engine team’s recommended LLM context for NamiDB** — keeping it sharp matters. Email **`hello@namidb.com`** with the diff or the issue. Engine-specific issues (parser, optimizer, storage) belong on the public engine repo at [github.com/namidb/namidb](https://github.com/namidb/namidb). ## See also * [Overview](/en/agents/overview) * [llms.txt](/en/agents/llms-txt) — for non-Claude-Code agents * [Cursor, Codex, AGENTS.md](/en/agents/cursor-codex-rules) — for other AI front-ends * Anthropic’s [Claude Code Skills spec](https://code.claude.com/docs/en/skills) — authoritative reference # Cursor, Codex, AGENTS.md > Drop-in rule files for Cursor, OpenAI Codex, GitHub Copilot, and any agent that reads an AGENTS.md / CLAUDE.md context file. 
If your AI front-end is not Claude Code, you can still get high-quality NamiDB context with a small rule file in your project root. The patterns below are all opinionated, NamiDB-tuned templates: drop them in your repo and they take effect immediately.

## Universal template: `AGENTS.md`

The emerging [**AGENTS.md** convention](https://agentsmd.org/) (adopted by Codex, Cursor, Continue, Aider, and others) is the most portable starting point. Drop this at your repo root:

AGENTS.md

```md
This project uses **NamiDB**, a cloud-native graph database with Cypher / GQL on top of an S3-compatible bucket.

## Read this first

Pull the canonical engine context once per session:
https://docs.namidb.com/llms-full.txt

## NamiDB ground truth (do not deviate)

- The bucket is the database. State lives as plain S3 objects. No Raft. No etcd. No DynamoDB lock table.
- `_id` is the internal NodeId since v0.3. `id` is a plain user property. Older training data mixes these up: always flag.
- Six URI schemes: memory://, file://, s3://, gs://, az://. The `?ns=` parameter is required everywhere except memory://.
- Cypher writes commit on return (WAL + manifest CAS). Call flush() periodically to push memtable → L0.
- Single writer per namespace. Two writers racing → loser gets 412 PreconditionFailed. That's the fencing protocol.
- Bulk ingest: use `merge_nodes`/`merge_edges` (Python) or the Rust `upsert_node`/`upsert_edge` API, not per-row Cypher CREATE.

## Do NOT generate

- Unsupported Cypher: CALL {} subqueries, LOAD CSV, CREATE INDEX/CONSTRAINT, unbounded `*` paths (use `*1..N`).
- Suggestions for external lock services (DynamoDB, etcd, ZooKeeper).
- Code using the v0.2 `{id: ...}` map-literal form for the internal NodeId. Write `{_id: ...}` instead, and read it with `n._id` or the function form `id(n)`.

## Project conventions

[Add your team's specifics here: namespace naming, IAM template, preferred storage backend, deploy region, etc.]
```

## Cursor: `.cursorrules`

Cursor reads `.cursorrules` at the repo root.
Same content as `AGENTS.md`, just renamed: ```bash cp AGENTS.md .cursorrules ``` Or keep them in sync with a symlink: ```bash ln -s AGENTS.md .cursorrules ``` ## GitHub Copilot: `.github/copilot-instructions.md` Copilot reads a markdown file from your repo. Drop the same content at `.github/copilot-instructions.md` — it’ll be inlined into every completion request. ## Codex CLI: `~/.codex/AGENTS.md` or per-repo `AGENTS.md` Codex CLI uses `AGENTS.md` directly at the repo root (or `~/.codex/AGENTS.md` for user-global). The universal template above is plug-and-play. ## Continue / Aider / Cline All three support a markdown rules file at the repo root. Use `AGENTS.md` or their tool-specific filename: | Tool | File | | -------- | ------------------------------------------ | | Continue | `.continue/context.md` or `AGENTS.md` | | Aider | `.aider.conf.yml` (with `read: AGENTS.md`) | | Cline | `.clinerules` | ## Pulling fresh context each session For ephemeral environments (CI, code reviewers, batch jobs), pull fresh context at the top of every run instead of pinning a static snapshot: ```bash # Bash curl -s https://docs.namidb.com/llms-full.txt > /tmp/namidb-context.md ``` ```python # Python import urllib.request context = urllib.request.urlopen( "https://docs.namidb.com/llms-full.txt" ).read().decode() ``` Then prepend `context` to the agent’s system prompt. ## Pattern: prompt-caching the context Both Anthropic and OpenAI support prompt caching. Cache the `llms-full.txt` block once and pay only the delta on subsequent calls in the same session. * Anthropic ```python client.messages.create( model="claude-opus-4-7", max_tokens=2048, system=[ { "type": "text", "text": namidb_context, "cache_control": {"type": "ephemeral"}, } ], messages=[...], ) ``` See [Anthropic prompt caching docs](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching). 
* OpenAI ```python client.responses.create( model="gpt-5", instructions=namidb_context, # auto-cached on repeated calls input=user_question, ) ``` OpenAI’s Responses API auto-caches large `instructions` blocks. ## Validating that the rules took effect Quick sanity check across any tool: > Write a Cypher query that creates 5 Person nodes with random UUIDs as internal NodeId, then a KNOWS edge between consecutive pairs. A correctly-rule’d agent will: * Use `{_id: ...}` (NOT `{id: ...}`) * Use `MERGE` or `CREATE` per the v0.3 grammar * Use `tg.Client("s3://...?ns=...")` (if Python) or `parse_uri(...)` (if Rust) * Not invent unsupported clauses like `CREATE INDEX` or `CALL { ... }` If any of those fail, the rules aren’t being read by the agent — check the file name, location, and that your tool’s “rules” feature is enabled. ## See also * [Overview](/en/agents/overview) * [llms.txt](/en/agents/llms-txt) — the corpus those rules reference * [Claude Code skill](/en/agents/claude-code-skill) — the recommended path for Claude Code users # llms.txt and llms-full.txt > Machine-readable digests of the entire docs site, following the llmstxt.org standard. One URL, ingest everything. The docs site generates three machine-readable digests at build time via the [`starlight-llms-txt`](https://delucis.github.io/starlight-llms-txt/) plugin. All three follow the [llmstxt.org](https://llmstxt.org/) proposal. 
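
A minimal sketch of how an agent harness might choose among the three digests from the model's context window. The function name `pick_digest` is hypothetical. The thresholds follow this page's guidance (small digest at ≤ 32k, full digest at ≥ 128k); reading "32k" as 32,768 tokens and routing the in-between range to the `llms.txt` manifest are assumptions.

```python
BASE_URL = "https://docs.namidb.com"

SMALL_CONTEXT_MAX = 32 * 1024   # assumption: "32k" read as 32,768 tokens
FULL_CONTEXT_MIN = 128_000      # assumption: "128k" read as 128,000 tokens


def pick_digest(context_tokens: int) -> str:
    """Return the digest URL that fits a model with the given context window."""
    if context_tokens <= SMALL_CONTEXT_MAX:
        # Small-context models get the filtered conceptual core.
        return f"{BASE_URL}/llms-small.txt"
    if context_tokens >= FULL_CONTEXT_MIN:
        # Large-context models can ingest the whole corpus in one shot.
        return f"{BASE_URL}/llms-full.txt"
    # In between (assumption): start from the manifest and fetch pages on demand.
    return f"{BASE_URL}/llms.txt"
```

For example, `pick_digest(200_000)` returns the `llms-full.txt` URL, while `pick_digest(16_000)` returns the `llms-small.txt` URL.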
## The three URLs | URL | Format | Use for | | ----------------------------------------------------------- | -------------------------------------- | ------------------------------------------ | | [`/llms.txt`](https://docs.namidb.com/llms.txt) | Manifest of pages + short descriptions | Agent picks which pages to fetch on demand | | [`/llms-full.txt`](https://docs.namidb.com/llms-full.txt) | All EN docs concatenated as markdown | One-shot ingestion of the whole corpus | | [`/llms-small.txt`](https://docs.namidb.com/llms-small.txt) | Filtered subset (core concepts only) | Small-context models (≤ 32k) | ## When to use which * One-shot context Pull `llms-full.txt` into the model’s prompt or system prompt at the start of the session. Works for Opus, GPT-5, Gemini 2.5 — any model with ≥ 128k context. ```bash curl -s https://docs.namidb.com/llms-full.txt > namidb-context.md # Then feed namidb-context.md into your prompt / system message ``` * Agent / RAG Pull `llms.txt` to get the page manifest, then have the agent fetch individual `.md` URLs on demand (every page also serves raw markdown at the same URL with `.md` appended). ```bash curl -s https://docs.namidb.com/llms.txt # Inspect the page list, then for any page e.g.: curl -s https://docs.namidb.com/en/cypher/write-queries.md ``` * Small context Use `llms-small.txt` when the model has ≤ 32k context. It keeps the conceptual core (the bucket, snapshots, namespaces, Cypher subset, SDKs) and drops the deep RFCs and operational reference. 
```bash curl -s https://docs.namidb.com/llms-small.txt | wc -w # roughly ~12k words ``` ## What’s in `llms-full.txt` The English locale, organized by section: * **Get started** (5 pages) * **Concepts** (6 pages) * **Cypher reference** (5 pages) * **SDK reference** (4 pages: Python, Rust, CLI, HTTP) * **Operations** (12 pages: config, URI grammar, 6 storage backends, self-host, observability, backup, tuning) * **Cloud** (3 pages) * **Internals (RFCs)** (17 RFCs) * **Community** (4 pages) * **Changelog** The Spanish locale is **excluded** from the LLM digest by design — having both locales in one corpus just dilutes the signal for agents that don’t speak Spanish anyway. The Spanish HTML site stays fully browsable. ## Provider-specific ingestion notes ### Claude Code If you’re using Claude Code in a repo that touches NamiDB, **install the [Claude Code skill](/en/agents/claude-code-skill) instead** — it’s a tighter, in-context summary tuned for IDE workflows. Use `llms-full.txt` only when the skill isn’t installed or you’re working with a different LLM front-end. ### Anthropic API directly ```python import anthropic, urllib.request namidb_context = urllib.request.urlopen( "https://docs.namidb.com/llms-full.txt" ).read().decode() client = anthropic.Anthropic() resp = client.messages.create( model="claude-opus-4-7", max_tokens=2048, system=[ { "type": "text", "text": namidb_context, "cache_control": {"type": "ephemeral"}, } ], messages=[{"role": "user", "content": "Write a Cypher query that …"}], ) ``` The `cache_control` block enables prompt caching so subsequent calls in the same session don’t re-pay the ingestion cost. ### Cursor / Cody / Continue Most editor-embedded agents support a “rules” or “context” file. See [Cursor, Codex, AGENTS.md](/en/agents/cursor-codex-rules) for drop-in templates that reference `llms-full.txt`. 
### Codex (OpenAI) ```python from openai import OpenAI import urllib.request namidb_context = urllib.request.urlopen( "https://docs.namidb.com/llms-full.txt" ).read().decode() client = OpenAI() resp = client.responses.create( model="gpt-5", instructions=f"You have full NamiDB documentation in context:\n\n{namidb_context}", input="Write a Cypher query that …", ) ``` ## How it’s built astro.config.mjs ```js import starlightLlmsTxt from 'starlight-llms-txt'; starlight({ plugins: [ starlightLlmsTxt({ projectName: 'NamiDB', description: 'Cloud-native graph database…', exclude: ['es/**'], customSets: [ { label: 'Cypher reference', paths: ['en/cypher/**'] }, // … ], }), ], }) ``` Re-runs on every `pnpm build` and on every Vercel push. Always matches the live site. ## See also * [Overview](/en/agents/overview) — why we ship all this * [Claude Code skill](/en/agents/claude-code-skill) — the recommended path for Claude Code users * [llmstxt.org](https://llmstxt.org/) — the canonical spec # NamiDB for AI agents > How Claude Code, Codex, Cursor, and any LLM-powered tool can ingest the NamiDB docs and the official Claude Code skill that ships with this site. These docs are designed to be **machine-readable as a first-class output**, not just human-readable. There are three artifacts to know about: [llms.txt + llms-full.txt ](/en/agents/llms-txt)One-URL ingestion of the entire docs corpus, following the llmstxt.org standard. [Claude Code skill ](/en/agents/claude-code-skill)A namidb-guide Skill that auto-fires whenever you write Cypher or open a NamiDB client. [Cursor / Codex / AGENTS.md ](/en/agents/cursor-codex-rules)Copy-paste rule files for the most common code-editor AI integrations. ## Why this matters AI coding assistants are how most developers will *actually touch* NamiDB for the first time — through Claude Code, Cursor’s tab completion, Copilot, Codex, or any of the embedded-LLM IDEs. 
If those tools don’t know about NamiDB, they’ll generate plausible-looking Cypher that doesn’t run, mix up `_id` (engine NodeId) with `id` (user property), or invent URI schemes we don’t support. We treat that as a **docs failure**, not a model failure. So the docs site exposes:

| Artifact | Public URL | For |
| ------------------------------- | ---------------------------------------------------------------------------------------------------- | -------------------------------------------------------- |
| `llms.txt` | [`docs.namidb.com/llms.txt`](https://docs.namidb.com/llms.txt) | Manifest of every page, with descriptions. Lightweight. |
| `llms-full.txt` | [`docs.namidb.com/llms-full.txt`](https://docs.namidb.com/llms-full.txt) | Entire EN docs corpus concatenated as one markdown file. |
| `llms-small.txt` | [`docs.namidb.com/llms-small.txt`](https://docs.namidb.com/llms-small.txt) | Filtered subset for smaller context windows. |
| **Claude Code skill (tarball)** | [`docs.namidb.com/skill/namidb-guide.tar.gz`](https://docs.namidb.com/skill/namidb-guide.tar.gz) | One-line install via `curl ... \| tar -xz`. |
| **Claude Code skill (browse)** | [`docs.namidb.com/skill/namidb-guide/SKILL.md`](https://docs.namidb.com/skill/namidb-guide/SKILL.md) | Inspect or cherry-pick individual reference files. |
| Raw markdown | Each docs page also serves Markdown at the same URL with `.md` appended | Single-page agent ingestion. |

## What an agent should know about NamiDB

When you tell an agent to work with NamiDB, the bare minimum it needs to internalize is:

1. **The bucket is the database.** State lives as plain objects in an S3-compatible store. No control plane, no Raft.
2. **`_id` is the engine’s internal NodeId** since v0.3. `id` is now a plain user property. **Most old training data still mixes these up.** Always flag.
3. **Six URI schemes**: `memory://`, `file://`, `s3://`, `gs://`, `az://`. The `?ns=` query parameter is required for everything except `memory://`.
4. 
**Cypher writes commit on return.** No explicit transaction begin / commit. Call `flush()` periodically to push memtable → L0 SSTs. 5. **Single writer per namespace.** Two writers racing → loser gets `412 Precondition Failed`. This is correctness, not a bug. The Claude Code skill encodes all of this so you don’t have to re-explain it every session. ## Per-deployment quick orientation for agents Python Surface to look for: `import namidb as tg`, `tg.Client(uri)`, `client.cypher(...)`, `client.acypher(...)`, `client.merge_nodes(...)`, `client.flush()`. Prefer `merge_nodes/edges` over per-row `CREATE` for ingestion. Rust Surface: `use namidb::storage::{parse_uri, WriterSession}`, `WriterSession::open(...)`, `commit_batch()`, `flush()`. `namidb-query::{parse, lower, execute}` for Cypher paths. HTTP `POST /v0/cypher` with bearer auth + JSON body `{"query": "...", "params": {...}}`. `GET /v0/health`, `POST /v0/admin/flush`. CLI `namidb run --store "..."`, `namidb explain --verbose "..."`, `namidb parse "..."`. ## What we don’t expect agents to do * **Synthesize unsupported Cypher.** `CALL { ... }` subqueries, user-defined procedures, `LOAD CSV`, `CREATE INDEX`/`CONSTRAINT`, unbounded `*` paths are out of scope. If the agent generates them, it’s drift from older Neo4j training data. * **Recommend Raft / external lock services.** NamiDB is coordination-free via manifest CAS. Don’t suggest DynamoDB lock tables, ZooKeeper, etcd. * **Suggest mock-database tests for storage logic.** The engine has `memory://` and `file://` for ephemeral / local testing — those share the same CAS protocol as `s3://`. ## See also * [llms.txt and friends](/en/agents/llms-txt) — every URL, every format * [Install the Claude Code skill](/en/agents/claude-code-skill) — three install paths * [Cursor, Codex, AGENTS.md](/en/agents/cursor-codex-rules) — drop-in rule files # Changelog > All notable changes across the NamiDB engine, Python bindings, server, and CLI. 
This page mirrors [`CHANGELOG.md`](https://github.com/namidb/namidb/blob/main/CHANGELOG.md) in the engine repo. The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/); versioning loosely follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html). Pre-1.0 While the engine is pre-1.0, **breaking changes can land in minor versions**. They are always called out in the **Breaking** section of the release. ## v0.3.0 — 2026-05-18 · Cypher v0.2.1 limitation sweep Closes the six query-engine limitations documented in the v0.2.1 README (`MATCH (n)` rejected, MERGE with relationship broken, `id` reserved, etc.). One of them — the `id` reservation — is breaking. ### Fixed * `lower::combine` now emits `CrossProduct` between two non-Empty plans instead of dropping the earlier one, so `MATCH (a:A) MATCH (b:B) CREATE (a)-[:R]->(b)` propagates both bindings to `CREATE`. * `find_merge_matches` indexes `Vec` by alias so `MERGE (a)-[r:R]->(b)` works against the CREATE-shaped pattern. * `execute_expand` accepts `edge_type=None`; `MATCH (a)-[r]->(b)` and `-[*1..N]->` work without an explicit relationship type. * `LogicalPlan::NodeScan.label` is now `Option`; `MATCH (n)` without a label fans out across every observed label. ### Breaking * **`id` is now a user property; the internal NodeId moves to `_id`.** Previously `id` hijacked Cypher map literals as the internal NodeId sigil. After this release, `id` is a plain user property; the internal NodeId is `_id`. The Cypher `id(n)` function keeps returning the internal NodeId. **Migration.** Rename `{id: $uuid}` → `{_id: $uuid}` anywhere the intent is the internal NodeId. Use `n._id` (accessor) or `id(n)` (function) to read it. `n.id` now reads the user property (or `Null`). ## v0.2.1 — 2026-05-18 · CI fix Tag `py-v0.2.0` built every wheel but the publish step was skipped. v0.2.1 ships the same code with the test expectations brought up to date. No engine changes. 
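
For the v0.3.0 `id` → `_id` migration described above, a rough sketch of a textual rewrite for existing Cypher strings. The helper name `rewrite_id_keys` is hypothetical and not part of any NamiDB package. It only touches `id` keys inside map literals and leaves `n.id`, `id(n)`, and existing `_id` keys alone. It cannot tell whether an `id` key was meant as the internal NodeId or as a user property, and it would also rewrite matching text inside string literals, so treat the output as a diff to review rather than something to apply blindly.

```python
import re

# Matches an `id` key that opens a map literal or follows a comma inside one,
# e.g. `{id: ...}` or `{name: 'x', id: ...}`. Property access (`n.id`), the
# `id(n)` function, and keys such as `_id` or `uid` do not match.
_ID_KEY = re.compile(r"([{,]\s*)id(\s*:)")


def rewrite_id_keys(cypher: str) -> str:
    """Rename map-literal `id` keys to `_id`, leaving everything else untouched."""
    return _ID_KEY.sub(r"\1_id\2", cypher)
```

Running it over a query such as `MATCH (n:Person {id: $uuid}) RETURN n.id, id(n)` changes only the map-literal key, which matches the migration note: `n.id` keeps reading the user property and `id(n)` keeps returning the internal NodeId.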
## v0.2.0 — 2026-05-18 · self-host story ### Added * **`file://` storage backend** with full manifest CAS via per-namespace `flock` + atomic `rename(2)`. * **`gs://` storage backend** for Google Cloud Storage. * **`az://` storage backend** for Azure Blob Storage. * **`namidb-server` crate and binary** — Rust HTTP daemon exposing a REST API. Endpoints: `POST /v0/cypher`, `GET /v0/health`, `GET /v0/version`, `POST /v0/admin/flush`. Bearer-token auth, periodic flush, Dockerfile, full JSON ↔ Cypher type mapping. * **`docker-compose.yml`** at the repo root — MinIO + bucket-init + `namidb-server`. * **Shared URI parser** used by the Python client, the CLI, and the server. * **Architecture and deployment diagrams** with dark-mode variants. ### Changed * **CLI `namidb run` learns `--store `** for durable runs against any backend. Defaults to `memory://default` when omitted. * **Python `tg.Client(uri)`** delegates URI parsing to the shared Rust implementation. ### Fixed * Plan-explain indent expectation aligned with the tree-renderer. ## v0.1.0 — initial public release First public release under [Business Source License 1.1](/en/community/license) (Change Date: 2029-05-18, Change License: Apache License 2.0). ### Engine * Cypher / GQL parser covering a strict subset of GQL (ISO/IEC 39075:2024) + openCypher 9. End-to-end execution of LDBC SNB Interactive Complex Read queries IC01–IC12. * Writes via Cypher: `CREATE`, `MERGE`, `SET`, `DELETE`, `DETACH DELETE`, `REMOVE`. Durable on `commit_batch`. * Cost-based optimizer with predicate pushdown, projection pushdown, join reorder, hash-join conversion, hash semi-join, and Parquet row-group pruning. * Morsel-driven vectorized executor with optional factorized intermediate representation (RFC-017). ### Storage * Columnar storage on object storage: Parquet node SSTs, custom edge-SST format with CSR adjacency (RFC-002), zstd compression, bloom filters, fence-pointer indices. * Coordination-free correctness via manifest CAS. 
* Tiered caches (`AdjacencyCache`, `NodeViewCache`, `SstCache`). ### Clients * Python bindings (`pip install namidb`), abi3 wheels. * CLI: `namidb parse`, `namidb explain --verbose`, `namidb run`. ### Project * Workspace of 8 crates. * 18 design RFCs in [`docs/rfc/`](/en/internals/rfcs). * LDBC-shaped synthetic benchmark harness with a paired Kùzu runner. # Cloud vs self-hosted > When to pick managed NamiDB Cloud and when to self-host the engine. NamiDB is the **same engine** whether you run it yourself or consume the managed Cloud. The decision is operational, not technical. ## Pick **Cloud** when… * You want **zero operational work** — no daemons, no buckets, no flush schedules, no Docker images to chase. * You have **many tenants** that need isolated graphs. * You want **per-namespace scale-to-zero** so you only pay when active. * You’d rather **call when something breaks** than page yourself. ## Pick **self-hosted** when… * You have a **strong data-residency requirement** — your bucket has to live in your account, in a specific region. * You’re **already running heavy infrastructure** and prefer one more service over one more vendor. * You want **deep control** over caches, flush cadence, compaction thresholds, and observability tooling. * Your workload is **single-tenant** with predictable cost — the self-host overhead is small and Cloud’s per-tenant pricing isn’t a fit. * You’re **air-gapped** or have a regulator that doesn’t accept SaaS. ## You can switch later The bucket layout is identical. To migrate **self-hosted → Cloud**, we provide a one-time `aws s3 sync` recipe + DNS/URI cutover. To migrate **Cloud → self-hosted**, we provide the same in reverse. **No lock-in.** ## Hybrid is fine too A common pattern: * **Cloud** for the production multi-tenant SaaS that needs isolated graphs per customer. * **Self-hosted** `namidb-server` for the internal data team’s ad-hoc analytics, pointing at the same bucket layout in a read-only mirror. Same engine. 
Same Cypher. Same client SDKs.

## See also

* [What is NamiDB Cloud](/en/cloud/what-is-namidb-cloud) · [Request access](/en/cloud/request-access)
* [Choose a deployment](/en/get-started/choose-deployment)

# Request access

> How to join the NamiDB Cloud closed beta.

NamiDB Cloud is in **closed beta**. Onboarding is rolling out incrementally to keep the early-access experience hands-on.

## Apply

The fastest path:

1. Sign up on [namidb.com](https://namidb.com).
2. Mention your use case in one line — agent memory, knowledge graph, social graph, fraud, recommendations, etc.
3. Give your approximate scale — nodes / edges / write QPS.

Or email **`hello@namidb.com`** directly with the same information.

## What happens next

1. We review applications weekly.
2. Approved accounts get a provisioning email with API keys and a pinned region.
3. You can be **writing Cypher within minutes** — same Python SDK, pointing at the Cloud-issued URI.

## What we’re optimising the beta for

* **Agent memory** workloads (knowledge graph + vector hybrid).
* Teams replacing **Neo4j or Kùzu** in production.
* **Multi-tenant SaaS** that needs one graph per customer.

If you’re elsewhere on the map, we still want to hear from you — the beta cohort is mixed by design.

## Reach out

* Web: [namidb.com](https://namidb.com)
* Email: `hello@namidb.com`
* GitHub: [github.com/namidb/namidb](https://github.com/namidb/namidb)

## See also

* [What is NamiDB Cloud](/en/cloud/what-is-namidb-cloud)
* [Cloud vs self-hosted](/en/cloud/cloud-vs-self-hosted)

# What is NamiDB Cloud

> Managed multi-tenant SaaS on namidb.com — per-namespace scale-to-zero, encrypted-at-rest tenants, hosted control plane.

**NamiDB Cloud** is the hosted, multi-tenant SaaS form of NamiDB, operated on `namidb.com`. Same engine as the open-source library. **What you get on top** is the operational story: zero-ops onboarding, per-namespace scale-to-zero, encrypted-at-rest tenants, regional availability, observability, and on-call humans.
## What it gives you Scale-to-zero per namespace Compute is allocated when a namespace is actively queried and released when it isn’t. You pay for what you use, not for a reserved cluster. Multi-tenant by default First-class tenancy. One namespace per tenant, isolated at the storage + compute + auth layers. Managed bucket The bucket and its lifecycle policies, regional placement, and encryption-at-rest keys are managed by the NamiDB team. Zero-ops No daemons to bump, no flush schedules to tune, no Docker images to chase. The team that wrote the engine runs it. ## What’s the same as self-hosted * **Same Cypher**, byte-for-byte. Queries you tested locally work as-is. * **Same SDK surface** — Python, Rust, HTTP, CLI. * **Same bucket layout** — you can `aws s3 sync` to take your data home. No lock-in. ## What’s different | Aspect | Self-hosted | NamiDB Cloud | | ------- | ------------------------ | --------------------------------- | | Auth | Bearer token | API keys + IAM | | Tenants | You provision namespaces | First-class, billed per namespace | | Compute | You run `namidb-server` | NamiDB team runs it | | Storage | Your bucket | Managed bucket (data exportable) | | Region | Wherever your bucket is | Multi-region offerings | | On-call | You | NamiDB team | ## How billing works (preview) NamiDB Cloud’s pricing is **per active namespace** with a free tier for development. The metering dimensions: * **Active namespace-hours** (compute scale-to-zero metric) * **Storage GB-months** * **Query CPU-seconds** (above the included plan amount) * **Egress GB** (where applicable per region) Closed-beta participants get full visibility into projected billing before public pricing lands. ## Request access [Sign up here](https://namidb.com) or email `hello@namidb.com`. ## See also * [Cloud vs self-hosted](/en/cloud/cloud-vs-self-hosted) * [Request access](/en/cloud/request-access) # Contributing > How to engage with NamiDB development. Workflow, coding standards, the RFC process. 
We develop in the open. This page is the public mirror of [`CONTRIBUTING.md`](https://github.com/namidb/namidb/blob/main/CONTRIBUTING.md) in the engine repo. ## TL;DR 1. **Read the RFCs** — [`docs/rfc/`](/en/internals/rfcs). They are the canonical source of design decisions on the storage engine, query engine, and surrounding subsystems. 2. **Open issues to discuss** before sending large PRs. 3. **Small PRs are welcome any time** — typo fixes, docs improvements, perf tweaks, test additions. ## Workflow * `main` is the development branch. Releases are tagged. * Every PR runs `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, and `cargo test --workspace`. * Commits should be signed (`git commit -S`). * PR titles follow Conventional Commits: `feat:`, `fix:`, `docs:`, `refactor:`, `test:`, `bench:`, `chore:`. ## Coding standards * **Rust edition 2021**, MSRV 1.85 (kept current). * `unsafe` only in hot paths with documented invariants and `// SAFETY:` comments. * All public APIs documented (`cargo doc --workspace --no-deps` must succeed). * Errors via `thiserror`; avoid `anyhow` in library crates (OK in binaries / tests). * Tracing instrumentation on `pub` async functions (`#[tracing::instrument]`). * Tests live next to code (`#[cfg(test)] mod tests`) for unit; integration tests in the crate’s `tests/` directory. ## Testing * Property tests with `proptest` for invariants. * Loom for concurrency-critical code paths where appropriate. * Local integration with an S3-compatible endpoint via `docker compose -f tests/docker-compose.s3.yml up` (LocalStack). * Benchmarks with `criterion`; results expected to be reproducible. 
## Communication * Email: `hello@namidb.com` * Security: `security@namidb.com` * GitHub Issues: [github.com/namidb/namidb/issues](https://github.com/namidb/namidb/issues) ## See also * [RFC process](/en/community/rfc-process) * [Security](/en/community/security) * [License](/en/community/license) # License > BSL 1.1, auto-converting to Apache 2.0 after three years. Plus a commercial license for embedded redistribution. NamiDB is licensed under the **[Business Source License 1.1 (BSL)](https://github.com/namidb/namidb/blob/main/LICENSE)**. ## What this means in practice * **Free** for development, testing, internal production use, and any use that does not compete with a hosted NamiDB offering from the Licensor (LESAI, Corp.). * **Automatically converts to Apache License 2.0** three years after each release — the Change Date is per-release. * A **separate commercial license** is available for teams that need to embed or redistribute NamiDB outside the bounds of BSL, **including offering it as a hosted database service**. ## When to email about a commercial license * You are building a **competing managed-NamiDB SaaS**. * You are **redistributing NamiDB embedded inside a closed-source product** at scale. * You need **OSI-approved-license-only** procurement and want Apache 2.0 applied earlier than the BSL Change Date. For any of the above: **`info@namidb.com`**. ## Why BSL The same reason [CockroachDB](https://www.cockroachlabs.com/blog/oss-relicensing-cockroachdb/), [MariaDB MaxScale](https://mariadb.com/bsl11/), and [Sentry](https://blog.sentry.io/relicensing-sentry/) chose it: keep the engine open and forkable, but protect the cloud business model that funds the open development. The engine is open. **The cloud is the business.** ## See also * [LICENSE in the engine repo](https://github.com/namidb/namidb/blob/main/LICENSE) * [Contributing](/en/community/contributing) # RFC process > How design decisions land in NamiDB. 
For anything bigger than a bug fix or a few-line refactor, we ask for an **RFC** (Request For Comments). ## How to write one 1. Copy [`docs/rfc/_template.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/_template.md) to `docs/rfc/NNN-short-name.md` (next free `NNN`). 2. Open a PR with **only the RFC**, in **Draft** state. 3. Solicit feedback. Iterate. 4. Maintainers mark `accepted` or `rejected`. Implementation PRs reference the RFC number in their description. ## Why we run this way * **The RFC is the design.** The code is the implementation of the design. When the two diverge, we update the RFC. * **The RFC is the changelog of decisions.** Future maintainers can read the RFCs and understand why the engine looks like it does. * **Reviewers can argue with text** before anyone has written a thousand lines they’re emotionally invested in. ## What lives in `docs/rfc/` Every accepted RFC. See [the index](/en/internals/rfcs) for the full list (we’re at RFC-020 today, with RFC-021 in flight for the read fan-out work). ## See also * [Internals (RFCs)](/en/internals/rfcs) * [Contributing](/en/community/contributing) # Security > How to report a vulnerability and our disclosure policy. ## Reporting a vulnerability **Email `security@namidb.com`** with details. Please do **not** open a public GitHub issue for security reports. ## What we ask for * A clear description of the vulnerability. * Steps to reproduce. * Affected versions (engine, Python, server, Docker image). * Suggested mitigation if you have one. ## What we’ll do 1. **Acknowledge within 72 hours.** 2. Triage and confirm. 3. Develop a fix on a private branch. 4. Coordinate a release window with you. 5. Publish a security advisory + patched releases. 6. Credit you in the advisory (unless you’d rather stay anonymous). 
## Scope In-scope: * The NamiDB engine (all crates in [`github.com/namidb/namidb`](https://github.com/namidb/namidb)) * The Python bindings * `namidb-server` and its Docker image * The CLI Out-of-scope (please report to the relevant vendor): * Third-party dependencies (`tokio`, `object_store`, etc.) — report upstream first; we’ll bump versions on disclosure. * NamiDB Cloud (`namidb.com`) — covered under our Cloud bug bounty; contact `security@namidb.com` for scope. ## Disclosure NamiDB follows **coordinated disclosure**. We aim to ship a patched release before public disclosure, then publish a CVE + advisory once patched versions are available. ## See also * Full `SECURITY.md` in the engine repo: [github.com/namidb/namidb/blob/main/SECURITY.md](https://github.com/namidb/namidb/blob/main/SECURITY.md) # The bucket is the database > Why NamiDB stores all engine state — manifest, WAL, SSTs — as plain objects in an S3-compatible bucket, and what that buys you. NamiDB has **no external control plane**. No Raft cluster. No ZooKeeper. No DynamoDB lock table. No etcd. The bucket is the database — every byte of engine state is a plain object in the S3-compatible store you opened with `tg.Client("s3://...")`. ## What lives in the bucket ```plaintext s3://my-bucket/data/{namespace}/ ├── manifest.json # CAS root: epoch, current SST list, LSN watermark ├── wal/ # Write-ahead log segments │ ├── 0000-0042.wal │ └── 0043-current.wal ├── sst/ # Sorted-string tables │ ├── node/L0/... # Parquet node SSTs │ ├── node/L1/... │ ├── edge/L0/... # Custom edge SSTs with CSR adjacency │ └── edge/L1/... └── schema/ # Label & property schemas └── current.json ``` Three categories: 1. **The manifest** — a single, tiny JSON object that names every SST currently live for the namespace, plus the epoch, plus the LSN watermark. **All writes coordinate through manifest CAS.** 2. **The WAL** — append-only segments. Every write is durable as soon as a `commit_batch` call returns. 3. 
**SSTs** — immutable columnar files. Nodes go to Parquet; edges go to a custom CSR-aware format ([RFC-002](/en/internals/rfcs/002-sst-format)).

## What replaces the consensus tier

**S3 conditional writes.** Since 2024, S3 honours `If-Match` / `If-None-Match` headers on `PutObject`. NamiDB writes a new manifest with `If-Match: <etag>`, where `<etag>` is the ETag of the manifest version the writer read; the first writer wins, the rest get a `412 Precondition Failed` and retry. That single primitive replaces:

| Without conditional writes                  | With conditional writes           |
| ------------------------------------------- | --------------------------------- |
| External lock service (DynamoDB, ZooKeeper) | Manifest CAS on the object itself |
| Raft / Paxos quorum for the manifest        | Conditional `PutObject`           |
| A separate metadata DB                      | A `manifest.json` per namespace   |

## What this buys you

* **Durability** is whatever S3 already gives you. 99.999999999%, multi-AZ.
* **Backups** are `aws s3 sync`. There is no separate metadata to capture.
* **Restore** is `aws s3 sync` in the other direction.
* **Cost scales to zero** when no client opens the namespace. No compute is running. No DynamoDB capacity is reserved.
* **Tenants are folders.** Each `?ns=...` is a sub-tree in the bucket.
* **Two processes** can open the same namespace. The one that wins the manifest CAS at commit time gets to write; the other fences cleanly (epoch increment) and re-reads.

## What you give up

* **Write throughput per namespace** is bounded by one writer at a time. This is a feature for correctness but a ceiling for raw write rate. Sharding by namespace is the answer when you need more.
* **Read latency** is bounded below by the S3 GET latency for the hot path. Cross-snapshot caches ([RFC-018](/en/internals/rfcs/018-csr-adjacency), [RFC-019](/en/internals/rfcs/019-node-view-cache-shared), [RFC-020](/en/internals/rfcs/020-edge-sst-caches)) hide most of it for repeated queries.
* **Strong cross-namespace transactions** are out of scope. Each namespace is an isolated unit.
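
The conditional-write primitive described on this page can be illustrated with an in-memory stand-in for the bucket. This is a minimal sketch, not NamiDB's implementation and not a real S3 client: `FakeBucket`, `PreconditionFailed`, and `commit` are hypothetical names, and the manifest is reduced to a version number plus a flat SST list.

```python
import json


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 response from the object store."""


class FakeBucket:
    """Hypothetical in-memory bucket that honours If-Match on PUT."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)
        self._next_etag = 0

    def get(self, key):
        return self._objects[key]  # (etag, body)

    def put(self, key, body, if_match=None):
        current_etag = self._objects.get(key, (None, None))[0]
        if if_match is not None and if_match != current_etag:
            raise PreconditionFailed(key)
        self._next_etag += 1
        etag = f"etag-{self._next_etag}"
        self._objects[key] = (etag, body)
        return etag


def commit(bucket, key, new_sst, max_attempts=5):
    """Read-modify-write the manifest, retrying when another writer wins."""
    for attempt in range(1, max_attempts + 1):
        etag, body = bucket.get(key)
        manifest = json.loads(body)
        manifest["version"] += 1
        manifest["ssts"].append(new_sst)
        try:
            bucket.put(key, json.dumps(manifest), if_match=etag)
            return attempt
        except PreconditionFailed:
            continue  # lost the race: re-read and rebuild
    raise RuntimeError("gave up after repeated CAS conflicts")


bucket = FakeBucket()
bucket.put("manifest.json", json.dumps({"version": 1, "ssts": []}))

# Writer B captures the ETag, then writer A commits first.
stale_etag, stale_body = bucket.get("manifest.json")
commit(bucket, "manifest.json", "sst-a")
try:
    bucket.put("manifest.json", stale_body, if_match=stale_etag)
except PreconditionFailed:
    # Writer B re-reads, rebuilds on top of A's manifest, and retries.
    attempts = commit(bucket, "manifest.json", "sst-b")

final = json.loads(bucket.get("manifest.json")[1])
```

Writer B's stale PUT is rejected rather than silently overwriting writer A's commit, which is the property that lets a single object play the role of a lock service in this model.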
## See also * [Snapshots & epoch fencing](/en/concepts/snapshots-and-epoch-fencing) * [LSM on object storage](/en/concepts/lsm-on-object-storage) * [RFC-001 — Storage engine](/en/internals/rfcs/001-storage-engine) # Caches > NamiDB's tiered, cross-snapshot caches: AdjacencyCache (CSR), NodeViewCache, and SstCache. NamiDB ships three process-wide caches that **share data across snapshots**. All are `Arc`-shared and byte-budgeted. | Cache | What it holds | RFC | Env var | | ------------------ | ----------------------------------------------------------------- | -------------------------------------------------------- | ------------------- | | **AdjacencyCache** | CSR adjacency arrays per edge SST | [RFC-018](/en/internals/rfcs/018-csr-adjacency) | `NAMIDB_ADJACENCY` | | **NodeViewCache** | Decoded `NodeView` (id → label, props) lookups | [RFC-019](/en/internals/rfcs/019-node-view-cache-shared) | `NAMIDB_NODE_CACHE` | | **SstCache** | Decoded SST body + edge property streams + parsed `EdgeSstReader` | [RFC-020](/en/internals/rfcs/020-edge-sst-caches) | `NAMIDB_SST_CACHE` | All three default to **ON**. Set the env var to `0` or `off` to disable. ## Why cross-snapshot A long-running write workload produces a new manifest version on every commit. Without cross-snapshot caches, every snapshot transition would invalidate the entire cache and re-pay the decode cost. NamiDB’s caches are keyed by the **immutable SST id**, not by the snapshot. As long as an SST is referenced by any live snapshot, its decoded artifacts stay in cache. ## Sizing The default budgets are tuned for a “comfortable laptop”: \~512 MiB per cache. For server workloads, bump them: ```bash export NAMIDB_ADJACENCY_BUDGET_MB=4096 export NAMIDB_NODE_CACHE_BUDGET_MB=2048 export NAMIDB_SST_CACHE_BUDGET_MB=8192 ``` For embedded use, you may want them smaller — they’re all eviction-bounded. 
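
To make the "keyed by the immutable SST id" and "byte-budgeted" ideas above concrete, here is a small sketch of a byte-bounded LRU cache shared by two snapshots. It is a simplification under stated assumptions: `ByteBudgetCache` and `decode` are hypothetical names, eviction is plain LRU, and the rule that artifacts stay cached while a live snapshot references the SST is not modelled.

```python
from collections import OrderedDict


class ByteBudgetCache:
    """Hypothetical LRU cache bounded by total bytes, keyed by SST id."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0
        self._entries = OrderedDict()  # sst_id -> decoded bytes

    def get_or_decode(self, sst_id, decode):
        if sst_id in self._entries:
            self.hits += 1
            self._entries.move_to_end(sst_id)  # mark as most recently used
            return self._entries[sst_id]
        self.misses += 1
        value = decode(sst_id)
        self._entries[sst_id] = value
        self.used_bytes += len(value)
        # Evict least recently used entries until the budget is respected,
        # but never evict the entry that was just inserted.
        while self.used_bytes > self.budget_bytes and len(self._entries) > 1:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        return value


def decode(sst_id):
    return bytes(100)  # pretend every decoded SST occupies 100 bytes


cache = ByteBudgetCache(budget_bytes=250)
snapshot_v1 = ["sst-1", "sst-2"]
snapshot_v2 = ["sst-1", "sst-2", "sst-3"]  # one commit later

for sst_id in snapshot_v1 + snapshot_v2:
    cache.get_or_decode(sst_id, decode)
```

Because the key is the SST id rather than the snapshot, the second snapshot reuses both entries decoded for the first and only pays for the newly added SST.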
## Observability The Python client exposes: ```python print(client.cache_stats()) # { # "adjacency": {"hits": ..., "misses": ..., "bytes": ...}, # "node_view": {...}, # "sst": {...} # } ``` Hook this into your dashboards to spot working-set vs budget mismatches. ## Hybrid cache (optional) NamiDB embeds [`foyer`](https://crates.io/crates/foyer) for an optional **memory + NVMe** tier — keeps the hot working set in RAM and spills warm pages to local NVMe. Useful when bucket round-trips are slow (cross-region) or expensive (egress fees). ## See also * [RFC-018 — CSR adjacency](/en/internals/rfcs/018-csr-adjacency) * [RFC-019 — NodeView cache](/en/internals/rfcs/019-node-view-cache-shared) * [RFC-020 — Edge SST caches](/en/internals/rfcs/020-edge-sst-caches) * [Operations / Tuning](/en/operations/tuning) # LSM on object storage > How NamiDB's log-structured merge tree maps onto the S3 object model — memtable, WAL, levels, compaction. NamiDB uses an **LSM tree** (log-structured merge) layered on top of an object store. The classic LSM building blocks — memtable, WAL, immutable SSTs, leveled compaction — all map naturally onto the object storage model. ## The layers ```plaintext ┌──────────────────────────────────────────────┐ │ Memtable │ in RAM, write side │ (concurrent skip-list, latest per key) │ ├──────────────────────────────────────────────┤ │ WAL │ in the bucket │ (append-only segments, durable on commit) │ ├──────────────────────────────────────────────┤ │ L0 SSTs ← memtable flushes │ in the bucket │ L1 SSTs ← compaction of L0 │ in the bucket │ L2 SSTs ← compaction of L1 │ in the bucket │ ... │ └──────────────────────────────────────────────┘ ``` ## Write path 1. **Cypher mutation** parses, plans, and executes against the current snapshot. 2. **WAL append** — every record is appended to the current WAL segment and `flush`ed before the call returns. This is what gives you durability on `commit_batch`. 3. 
**Memtable insert** — the record lands in the in-RAM memtable so subsequent reads see it. 4. **Manifest CAS** — the LSN watermark advances. A new manifest version is published. When the memtable grows past a threshold (or the flush interval ticks), it becomes an **L0 SST** — a Parquet (nodes) or custom-format (edges) file, uploaded to the bucket, then referenced from the next manifest version. The WAL segments that contributed to that flush become eligible for truncation. ## Read path 1. Open a **snapshot** at the current manifest version. 2. Plan the query (cost-based optimizer, predicate pushdown, factorization). 3. For each node-pattern: probe **memtable**, then **L0**, then **L1**, …, merging the visible versions per LSN. 4. For each edge-pattern: walk the **CSR adjacency** (in-memory cache keyed by SST id), reading edge property streams on demand. 5. Apply filters, projections, joins. Return rows. ## Compaction Compaction is **leveled**: * L0 SSTs may overlap on key range. * L1+ SSTs have non-overlapping key ranges. * When L\_n gets too many files, NamiDB merges them into L\_n+1, dropping tombstones and superseded versions. Compaction outputs are uploaded as new objects, then atomically swapped into the manifest via CAS. The old files stay readable until no snapshot references them — then GC reclaims them. ## What’s different from a disk-resident LSM * **No fsync on the file** — the durability primitive is “the PUT returned 200 OK”. The object store handles physical durability. * **No write amplification on small WAL appends** — each WAL append is one PUT (or a streaming upload for large batches). * **Reads are GETs**, not file-system reads. **Ranged reads** with `Range:` headers let us pull only the row groups, only the property columns, only the bloom filter pages we need ([RFC-003](/en/internals/rfcs/003-read-path-ranged-reads)). 
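
The read path and compaction rules described on this page can be sketched with plain dictionaries standing in for the memtable and SSTs. This is an illustrative model only: `lookup`, `compact`, and `TOMBSTONE` are hypothetical names, each entry is an `(lsn, value)` pair, and there are no bloom filters, key ranges, or ranged reads.

```python
TOMBSTONE = object()  # sentinel marking a deleted key


def lookup(key, memtable, levels):
    """Probe the memtable, then L0, L1, ...; the entry with the highest LSN wins."""
    best = memtable.get(key)  # (lsn, value) or None
    for level in levels:
        for sst in level:
            entry = sst.get(key)
            if entry is not None and (best is None or entry[0] > best[0]):
                best = entry
    if best is None or best[1] is TOMBSTONE:
        return None
    return best[1]


def compact(ssts):
    """Merge SSTs into one, keeping the newest version per key and dropping tombstones."""
    merged = {}
    for sst in ssts:
        for key, (lsn, value) in sst.items():
            if key not in merged or lsn > merged[key][0]:
                merged[key] = (lsn, value)
    return {k: v for k, v in sorted(merged.items()) if v[1] is not TOMBSTONE}


memtable = {"alice": (9, {"age": 31})}
l0 = [{"bob": (7, TOMBSTONE)}, {"alice": (5, {"age": 30})}]
l1 = [{"bob": (2, {"age": 40}), "carol": (1, {"age": 25})}]

alice = lookup("alice", memtable, [l0, l1])   # newest version lives in the memtable
bob = lookup("bob", memtable, [l0, l1])       # the L0 tombstone hides the L1 value
compacted = compact(l0 + l1)
```

Dropping tombstones is safe in this toy run only because every level takes part in the merge; a real engine would presumably need to keep a tombstone while an older level could still hold the key.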
## See also * [RFC-001 — Storage engine](/en/internals/rfcs/001-storage-engine) * [RFC-002 — SST format](/en/internals/rfcs/002-sst-format) * [RFC-003 — Read-path ranged reads](/en/internals/rfcs/003-read-path-ranged-reads) # Namespaces & multi-tenancy > A NamiDB namespace is the unit of isolation, durability, and write serialisation. Multi-tenancy is "one namespace per tenant". A **namespace** is the fundamental unit of isolation in NamiDB. Every URI carries a `?ns=…` query parameter that names it. ```python client_a = tg.Client("s3://my-bucket?ns=tenant-acme") client_b = tg.Client("s3://my-bucket?ns=tenant-globex") ``` These two clients share **nothing** at the engine level — different manifests, different SSTs, different WALs, different schema. The only thing they share is your bucket prefix. ## What lives at a namespace * **Schema** — labels, property types, indexes * **Data** — every node, edge, property, vector * **WAL** — append-only log segments * **SSTs** — immutable columnar files (Parquet for nodes, custom for edges) * **Manifest** — the CAS root that names everything currently live * **Epoch** — a fencing counter for single-writer enforcement ## Single-writer per namespace Each namespace has **one active writer at a time**, fenced by epoch CAS. If two processes try to mutate the same namespace, only one commits — the loser sees a `412 Precondition Failed` on the manifest PUT, re-reads, and either retries or backs off. This is what gives you correctness without a consensus tier. It’s also why “more write throughput” means “more namespaces”, not “more writers per namespace”. 
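
As a rough illustration of how a `?ns=…` URI can resolve to an isolated prefix, the sketch below parses the query parameter with Python's standard library. `namespace_prefix` is a hypothetical helper, not part of the NamiDB client; the `data/{namespace}/` layout is taken from the bucket tree shown in these docs, and rejecting a missing `ns` is an assumption (the real URI grammar may define a default).

```python
from urllib.parse import urlparse, parse_qs


def namespace_prefix(uri):
    """Map e.g. s3://my-bucket?ns=tenant-acme to ('my-bucket', 'data/tenant-acme/')."""
    parsed = urlparse(uri)
    values = parse_qs(parsed.query).get("ns", [])
    if len(values) != 1 or not values[0]:
        raise ValueError(f"expected exactly one ?ns= parameter in {uri!r}")
    namespace = values[0]
    if "/" in namespace:
        raise ValueError("namespace must be a single path segment")
    return parsed.netloc, f"data/{namespace}/"


acme = namespace_prefix("s3://my-bucket?ns=tenant-acme")
globex = namespace_prefix("s3://my-bucket?ns=tenant-globex")
```

Both tenants resolve to the same bucket but to disjoint prefixes, which is why a per-prefix manifest is enough to fence them from each other.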
## Multi-tenancy is namespacing The canonical NamiDB tenancy model: ```plaintext s3://my-bucket/data/ ├── tenant-acme/ ← one namespace ├── tenant-globex/ ← another └── tenant-initech/ ← another ``` Each tenant: * Has its own manifest, WAL, SSTs * Can be backed up, restored, deleted as a single S3 prefix * Scales to zero independently * Is fenced from every other tenant This is the same pattern that’s worked for [turbopuffer](https://turbopuffer.com/) (vectors) and other object-storage-native systems. ## When to make a new namespace * **New tenant** — always. * **New environment** (dev / staging / prod) — usually. * **Schema divergence** — when two parts of your app have unrelated graphs. * **You want independent backups** — namespaces back up cleanly with `aws s3 sync`. ## When NOT to * “I want more write throughput on this one workload.” Sharding writes across multiple namespaces only helps if your queries also fit inside one shard. Cross-namespace joins are not a thing in NamiDB. ## See also * [The bucket is the database](/en/concepts/bucket-is-the-database) * [Snapshots & epoch fencing](/en/concepts/snapshots-and-epoch-fencing) * [URI grammar](/en/operations/uri-grammar) # Snapshots & epoch fencing > How NamiDB serialises writers without a consensus tier — manifest CAS, epoch counters, and snapshot reads. ## Snapshots Every read in NamiDB happens against a **snapshot** — an immutable view of the namespace at a specific manifest version. 
The snapshot pins:

* The exact set of SSTs that were live at that version
* The schema at that version
* A floor LSN

You get a snapshot from a `WriterSession`:

```rust
let snap = writer.snapshot();
let rows = execute(&plan, &snap, &Params::new()).await?;
```

Or implicitly per Cypher call from the Python client:

```python
# Read against an internally-captured snapshot.
result = client.cypher("MATCH (p:Person) RETURN p.name")
```

**Multiple snapshots can coexist.** A long-running analytical query and a hot write can run concurrently — the query reads from its snapshot, the writer mutates the memtable and produces a new manifest version. When the query finishes, GC can reclaim SSTs no longer referenced.

## The manifest

The manifest is a small JSON object at `{namespace}/manifest.json` that names everything currently live for the namespace. The schema (simplified):

```json
{
  "version": 42,
  "epoch": 7,
  "lsn_watermark": 18374,
  "schema_id": "...",
  "ssts": {
    "node": { "L0": [...], "L1": [...] },
    "edge": { "L0": [...], "L1": [...] }
  }
}
```

**Every write commit produces a new manifest version.** The previous version stays addressable (it’s referenced by in-flight snapshots) until GC removes it.

## Manifest CAS

Mutators do:

```text
1. Read manifest.json, capture its ETag.
2. Build the new version locally (apply WAL segments, list new SSTs).
3. PUT manifest.json with If-Match: <etag>.
4. If 412 Precondition Failed → re-read, rebuild, retry.
5. If 200 OK → broadcast new version, increment local epoch.
```

This is the same recipe that works for vectors and analytics on object storage. The S3 conditional write replaces the consensus tier.

## Epoch fencing

Each manifest carries an **epoch counter** that increments on every commit. A writer that has been idle while another writer advanced the epoch will fail its next commit attempt and must re-bootstrap from the latest manifest.
This fences out **stale writers** — for example, a process that lost network connectivity, then reconnected, can no longer commit against the old epoch. It must re-read and re-plan its writes. ## What you don’t get (yet) * **Cross-namespace transactions.** Each namespace is its own CAS unit. There is no two-phase commit across namespaces. Use one namespace if you need transactional consistency across data. * **Reader scale beyond one process.** Today, `namidb-server` serialises requests behind a tokio `Mutex` (single-writer-per-namespace lifted to the request layer). [RFC-021](https://github.com/namidb/namidb/blob/main/docs/rfc/) removes the mutex from the read path so a single daemon can fan out reads to every core. ## See also * [The bucket is the database](/en/concepts/bucket-is-the-database) * [RFC-001 — Storage engine](/en/internals/rfcs/001-storage-engine) * [RFC-002 — SST format](/en/internals/rfcs/002-sst-format) # Three deployments, one engine > The same Rust core ships as an embedded library, an HTTP daemon, and a managed cloud. They write to the same bucket layout. NamiDB ships one engine in three shapes. They all converge on a single S3-compatible bucket as the source of truth. ```plaintext ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ Embedded │ │ Server │ │ Cloud │ │ (Python / │ │ (REST API │ │ (managed │ │ Rust lib) │ │ daemon) │ │ multi- │ │ │ │ │ │ tenant) │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ │ │ └─────────────────┴─────────────────┘ │ ▼ ┌──────────────────────┐ │ Object storage │ │ (S3 / R2 / GCS / │ │ Azure / local FS) │ └──────────────────────┘ ``` ## What’s shared * **The Cypher / GQL parser** — identical query language across all three deployments. * **The cost-based optimizer** — same plan rewrites, same selectivity estimates, same `EXPLAIN VERBOSE` output. * **The vectorized executor** — morsel-driven, optionally factorized. 
* **The storage format** — SST layout, manifest schema, WAL framing are identical across deployments. * **The bucket layout** — Embedded and Server write to the same paths. You can boot an embedded notebook against the same `s3://…` URI a production daemon is serving. ## What differs | Aspect | Embedded | Server | Cloud | | -------------- | ---------------------------------- | --------------------------------- | ------------------------------ | | Transport | In-process function calls | HTTP REST (`/v0/cypher`) | HTTP REST + per-tenant routing | | Auth | None (in-process) | Bearer token | API keys + IAM | | Concurrency | 1 writer per process, many readers | 1 writer per daemon, many readers | Managed | | Failure domain | Your process | The daemon | Managed cell | | Scale-to-zero | Trivial (stop your app) | Stop the daemon | Built-in, per-namespace | | Ops surface | None | Docker / systemd / k8s | None (managed) | ## Why this matters * You **don’t have to choose early**. Start embedded, migrate to a daemon when you need a network boundary. Same code path, same data. * You can **mix them**: a long-lived `namidb-server` for ad-hoc query work, an embedded library in your hot service. * **Cloud is opt-in**. The engine is open and self-hostable forever. ## See also * [Choose a deployment](/en/get-started/choose-deployment) * [Python SDK](/en/sdk/python) · [Rust SDK](/en/sdk/rust) · [HTTP API](/en/sdk/http) # Functions & operators > Built-in functions, scalar/aggregate, type coercions, string and list operations. 
## Identifier & metadata functions | Function | Returns | | --------------------------------- | ---------------------------------------- | | `id(n)` | Internal `NodeId` (i64-flavoured opaque) | | `n._id` | Same as `id(n)`; accessor form | | `labels(n)` | `list` of the node’s labels | | `type(r)` | `string` edge type | | `properties(n)` / `properties(r)` | `map` | | `keys(n)` | `list` of property names | ## Aggregations | Function | Notes | | ------------------------------------------------- | ----------------------- | | `count(*)`, `count(expr)`, `count(DISTINCT expr)` | | | `sum`, `avg`, `min`, `max` | numeric / lexicographic | | `collect(expr)` | list aggregation | ## String | Function | Example | | ------------------------------------------------------ | ---------------------------- | | `toUpper(s)`, `toLower(s)` | | | `trim(s)`, `ltrim(s)`, `rtrim(s)` | | | `substring(s, start[, length])` | | | `replace(s, search, replacement)` | | | `split(s, delim)` | returns `list` | | `startsWith(s, p)`, `endsWith(s, p)`, `contains(s, p)` | | | `s + t` | string concatenation via `+` | ## Numeric | Function | Notes | | --------------------------------------- | ----- | | `abs`, `sign`, `floor`, `ceil`, `round` | | | `sqrt`, `log`, `exp`, `pow(base, exp)` | | | `toInteger`, `toFloat` | | ## Date & time | Function | Notes | | ------------------------------------- | ----------------- | | `date()`, `date($iso_string)` | `Date` | | `datetime()`, `datetime($iso_string)` | UTC µs `DateTime` | | `duration({days: 1, hours: 2})` | | ## List & map | Function | Notes | | ---------------------------------------- | ------------------ | | `size(list)` | length | | `head(list)`, `tail(list)`, `last(list)` | | | `range(start, end[, step])` | inclusive end | | `[x IN list WHERE pred]` | list comprehension | | `keys(map)` | `list` | | `coalesce(a, b, c, ...)` | first non-null | ## Logical & comparison operators | Operator | Notes | | ------------------------------- | --------------------------- 
| | `AND`, `OR`, `NOT`, `XOR` | short-circuit on `AND`/`OR` | | `=`, `<>`, `<`, `<=`, `>`, `>=` | | | `IS NULL`, `IS NOT NULL` | | | `IN [list]` | hash-set membership | | `=~ ''` | regex match | | `+`, `-`, `*`, `/`, `%` | numeric | | `+` | also string concat | ## CASE ```cypher RETURN CASE WHEN p.age < 18 THEN 'minor' WHEN p.age < 65 THEN 'adult' ELSE 'senior' END AS bucket ``` ## Path functions | Function | Notes | | --------------------- | --------------- | | `length(path)` | number of edges | | `nodes(path)` | `list` | | `relationships(path)` | `list` | ## See also * [Read queries](/en/cypher/read-queries) * [Write queries](/en/cypher/write-queries) # LDBC SNB Interactive — IC01–IC12 > The 12 in-scope Complex Read queries from the LDBC Social Network Benchmark, with the canonical NamiDB shape. NamiDB targets the **12 in-scope Complex Read** queries from the [LDBC Social Network Benchmark — Interactive](https://ldbcouncil.org/benchmarks/snb/) workload. All twelve parse, plan, and execute end-to-end on the v0.3 engine. 
## Quick reference

| #    | Name                                                | Pattern                       |
| ---- | --------------------------------------------------- | ----------------------------- |
| IC01 | Friends with a given name                           | 1–3 hop `KNOWS` traversal     |
| IC02 | Recent messages from friends                        | `KNOWS` + `POST` + ORDER BY   |
| IC03 | Friends and friends-of-friends in countries X and Y | 2-hop + filter                |
| IC04 | New topics                                          | tag aggregation over a window |
| IC05 | New groups                                          | forum membership window       |
| IC06 | Tag co-occurrence                                   | edge join + group by          |
| IC07 | Recent likers                                       | `LIKES` + recency             |
| IC08 | Recent replies                                      | `REPLY_OF` traversal          |
| IC09 | Recent messages by friends and FoF                  | 1–2 hop `KNOWS` + posts       |
| IC10 | Friend recommendation                               | 2-hop similarity              |
| IC11 | Friend’s job referrals                              | `KNOWS` + `WORKS_AT`          |
| IC12 | Expert search                                       | `KNOWS` + tag intersection    |

## IC01 — Find friends with a given name

```cypher
MATCH path = (p:Person {_id: $personId})-[:KNOWS*1..3]-(friend:Person {firstName: $name})
WHERE friend._id <> $personId
RETURN friend._id AS friendId,
       friend.lastName AS friendLastName,
       length(path) AS distance
ORDER BY distance ASC, friendLastName ASC, friendId ASC
LIMIT 20
```

## IC02 — Recent messages from friends

```cypher
MATCH (p:Person {_id: $personId})-[:KNOWS]-(friend)
MATCH (friend)<-[:HAS_CREATOR]-(message)
WHERE message.creationDate <= $maxDate
RETURN friend._id AS friendId,
       friend.firstName AS firstName,
       message._id AS messageId,
       message.content AS content,
       message.creationDate AS creationDate
ORDER BY creationDate DESC, messageId ASC
LIMIT 20
```

## (IC03–IC12 in the bench harness)

The full canonical query texts live in [`crates/namidb-query/tests/fixtures/`](https://github.com/namidb/namidb/tree/main/crates/namidb-query/tests/fixtures) in the engine repo. The bench harness under [`bench/`](https://github.com/namidb/namidb/tree/main/bench) runs all twelve against a synthetic dataset and prints a comparison vs Kùzu.
## Running the bench locally ```bash git clone https://github.com/namidb/namidb.git cd namidb/bench python kuzu_runner.py # generate dataset cargo run --release -p namidb-bench ``` See [`bench/README.md`](https://github.com/namidb/namidb/blob/main/bench/README.md) for the full reproduction recipe. ## See also * [Supported subset](/en/cypher/supported-subset) * [RFC-008 — Logical plan IR](/en/internals/rfcs/008-logical-plan-ir) * [RFC-010 — Cost-based optimizer](/en/internals/rfcs/010-cost-based-optimizer) * [RFC-017 — Factorization](/en/internals/rfcs/017-factorization) # Read queries > MATCH, WHERE, RETURN, ORDER BY, LIMIT, OPTIONAL MATCH, EXISTS — with concrete examples. This page walks the read side of Cypher in NamiDB, with examples that all run on the current engine. ## MATCH ```cypher MATCH (p:Person) RETURN p.name LIMIT 10 ``` `MATCH (n)` without a label fans out across every label observed in the namespace (memtable + persisted SSTs + declared schema). ```cypher MATCH (n) RETURN labels(n), count(*) AS n ``` ## Pattern matching ```cypher MATCH (a:Person {name: $name})-[:KNOWS]->(b:Person) RETURN b.name AS friend ``` Property predicates lifted into the curly-brace pattern are pushed down to the SST scan and used for bloom-filter / fence-pointer pruning. ## Variable-length paths ```cypher MATCH (a:Person {_id: $start})-[:KNOWS*1..3]->(b) RETURN DISTINCT b.name AS friend_of_friend ``` The depth bounds (`1..3`) are required — unbounded `*` is not in scope. ## Filters ```cypher MATCH (p:Person) WHERE p.age >= $min AND p.country IN ['EC', 'SV'] RETURN p.name AS name, p.age AS age ORDER BY p.age DESC LIMIT 100 ``` Predicates are pushed down through joins ([RFC-011](/en/internals/rfcs/011-predicate-pushdown)) and through Parquet row-group pruning ([RFC-013](/en/internals/rfcs/013-parquet-predicate-pushdown)). 
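Row-group pruning can be pictured as a min/max check done before any rows are read. The sketch below is plain Python rather than the engine's code: `RowGroup`, its fields, and the sample statistics are hypothetical stand-ins for Parquet column statistics, and only the `p.age >= $min` predicate from the query above is modelled.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group's `age` statistics."""
    name: str
    min_age: int
    max_age: int


def prune_for_min_age(groups, min_age):
    """Keep only row groups that could contain a row with age >= min_age."""
    # A group whose maximum is below the bound cannot match,
    # so it is skipped without being read.
    return [g for g in groups if g.max_age >= min_age]


groups = [
    RowGroup("rg0", 1, 17),
    RowGroup("rg1", 15, 40),
    RowGroup("rg2", 41, 90),
]

survivors = prune_for_min_age(groups, min_age=18)
print([g.name for g in survivors])
```

Pruning is conservative: `rg1` survives even though some of its rows fall below the bound, so the row-level filter still has to run on whatever is read.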
## Optional match ```cypher MATCH (p:Person {_id: $id}) OPTIONAL MATCH (p)-[:OWNS]->(c:Car) RETURN p.name AS name, collect(c.model) AS cars ``` `c.model` will be `null`/empty if no `OWNS` edge exists. ## Existence subqueries ```cypher MATCH (p:Person) WHERE EXISTS { MATCH (p)-[:WORKS_AT]->(:Company {country: $country}) } RETURN p.name ``` `EXISTS { ... }` decorrelates into a **hash semi-join** ([RFC-014](/en/internals/rfcs/014-hash-semi-join)) — no nested loop required. ## Aggregations ```cypher MATCH (p:Person)-[:KNOWS]->(b) RETURN p.country AS country, count(DISTINCT b) AS reach ORDER BY reach DESC LIMIT 20 ``` Vectorised aggregation runs morsel-at-a-time; group-by uses a hash-aggregator. ## EXPLAIN ```cypher EXPLAIN VERBOSE MATCH (a:Person)-[:KNOWS]->(b) RETURN b ORDER BY b.id LIMIT 20 ``` prints the chosen logical plan with **selectivity** and **cost** annotations, plus the physical-operator tree. ```bash namidb explain --verbose "MATCH (a:Person)-[:KNOWS]->(b) RETURN b LIMIT 5" ``` ## See also * [Supported subset](/en/cypher/supported-subset) * [Write queries](/en/cypher/write-queries) * [RFC-008 — Logical plan IR](/en/internals/rfcs/008-logical-plan-ir) # Supported subset > The exact GQL (ISO/IEC 39075:2024) + openCypher 9 surface NamiDB parses, plans, and executes today. NamiDB targets a **strict subset** of GQL (ISO/IEC 39075:2024) plus openCypher 9. Every query in this section parses, plans, and executes end-to-end on the v0.3 engine. 
## Read clauses | Clause | Status | Notes | | ------------------------------- | ------ | ---------------------------------------------------- | | `MATCH (n)` | ✅ | label-less match fans out across all observed labels | | `MATCH (n:Label {prop: $p})` | ✅ | property predicates lift into the scan | | `MATCH (a)-[r:TYPE]->(b)` | ✅ | typed and untyped edges | | `MATCH (a)-[r:TYPE*1..N]->(b)` | ✅ | bounded variable-length paths | | `OPTIONAL MATCH` | ✅ | | | `WHERE` | ✅ | full predicate language; pushdown to SST + Parquet | | `RETURN` | ✅ | aliases, expressions, aggregations | | `ORDER BY ... LIMIT n / SKIP n` | ✅ | top-K pushdown into the executor | | `WITH ... AS ...` | ✅ | pipeline composition | | `UNION ALL` | ✅ | | | `EXISTS { ... }` | ✅ | decorrelated to hash semi-join | ## Write clauses | Clause | Status | Notes | | ---------------------- | ------ | ------------------------------------ | | `CREATE (n:L {props})` | ✅ | new NodeId per row | | `CREATE (a)-[:T]->(b)` | ✅ | requires both endpoints bound | | `MERGE (n:L {props})` | ✅ | upsert semantics | | `MERGE (a)-[r:T]->(b)` | ✅ | both endpoints must be bound (v0.3) | | `SET n.prop = expr` | ✅ | per-property mutation | | `SET n += {map}` | ✅ | bulk property update | | `DELETE n` | ✅ | deletes the node (requires no edges) | | `DETACH DELETE n` | ✅ | deletes node + its edges | | `REMOVE n.prop` | ✅ | per-property tombstone | | `REMOVE n:Label` | ✅ | label tombstone | Every write is durable on commit: WAL append + manifest CAS happen before the call returns. ## Built-in functions | Function | Status | | ---------------------------------------------------------------- | --------------------------------- | | `id(n)`, `n._id` | ✅ — returns the internal `NodeId` | | `labels(n)`, `type(r)` | ✅ | | `count(*)`, `count(expr)`, `sum`, `avg`, `min`, `max`, `collect` | ✅ | | `coalesce`, `case ... when ... 
end` | ✅ | | `size(list)`, `length(path)` | ✅ | | `toString`, `toInteger`, `toFloat`, `toBoolean` | ✅ | | `startsWith`, `endsWith`, `contains` | ✅ | | `properties(n)` | ✅ | ## Internal `_id` and the `id` property Since v0.3, **`_id` is the engine’s internal `NodeId`** and `id` is a plain user property. ```cypher // Address the internal NodeId MATCH (n:Person {_id: $uuid}) RETURN n RETURN id(n) // function form, same value RETURN n._id // accessor form // Use `id` as a user property CREATE (n:Person {id: 'external-42', name: 'Alice'}) MATCH (n:Person) WHERE n.id = 'external-42' RETURN n ``` See the [v0.3.0 release notes](/en/changelog#v030) for the migration story from v0.2. ## Parameters Parameters are named: each `$name` placeholder is bound by name from the params map: ```cypher MATCH (p:Person) WHERE p.age >= $min AND p.country = $country RETURN p.name AS name ``` Driver-side (Python): ```python result = client.cypher(query, params={"min": 18, "country": "EC"}) ``` ## Not supported (yet) * `CALL { ... }` subqueries * User-defined procedures / functions * Index hints (`USING INDEX`) * `LOAD CSV` (use `merge_nodes` / `merge_edges` bulk APIs instead) * Schema-defining clauses (`CREATE CONSTRAINT`, `CREATE INDEX`) — schema is currently inferred from writes; explicit DDL is on the roadmap. * Path patterns longer than the LDBC IC09 / IC11 shapes — variable-length paths beyond `[*1..N]` are out of scope today. ## See also * [Read queries](/en/cypher/read-queries) · [Write queries](/en/cypher/write-queries) * [LDBC SNB IC01–IC12 walkthrough](/en/cypher/ldbc-snb) * [RFC-004 — Cypher subset](/en/internals/rfcs/004-cypher-subset) · [RFC-009 — Write clauses](/en/internals/rfcs/009-write-clauses) # Write queries > CREATE, MERGE, SET, DELETE, DETACH DELETE, REMOVE — durable on commit, WAL + manifest CAS. NamiDB’s write side is **Cypher-native**. Every write call inside the Python / Rust / HTTP clients commits as it returns — WAL append + manifest CAS happen synchronously. 
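As a rough mental model of "WAL append + manifest CAS", the sketch below commits a batch against an in-memory stand-in for a bucket. Everything in it is hypothetical (`FakeBucket`, `put_if_match`, `commit`, the key names, the writer ids); it is not NamiDB's storage code, and it only illustrates why a writer holding a stale manifest ETag loses the race.

```python
import itertools


class PreconditionFailed(Exception):
    """Raised when the stored ETag differs from the one the caller expected."""


class FakeBucket:
    """Hypothetical in-memory stand-in for an object store with If-Match."""

    def __init__(self):
        self._objects = {}              # key -> (etag, body)
        self._etags = itertools.count(1)

    def get(self, key):
        return self._objects[key]       # (etag, body)

    def put(self, key, body):
        etag = next(self._etags)
        self._objects[key] = (etag, body)
        return etag

    def put_if_match(self, key, body, expected_etag):
        current_etag, _ = self._objects[key]
        if current_etag != expected_etag:
            raise PreconditionFailed(key)
        return self.put(key, body)


def commit(bucket, writer, seen_etag, seen_version, batch):
    """Append the batch to a WAL object, then CAS the manifest forward."""
    new_version = seen_version + 1
    # The WAL object is written first; in this sketch the key carries the
    # writer id so a losing writer cannot overwrite the winner's segment.
    bucket.put(f"wal/{writer}-{new_version:08d}.wal", batch)
    return bucket.put_if_match("manifest", new_version, seen_etag)


bucket = FakeBucket()
bucket.put("manifest", 0)

# Two writers read the same manifest version before either commits.
etag_a, version_a = bucket.get("manifest")
etag_b, version_b = bucket.get("manifest")

commit(bucket, "writer-a", etag_a, version_a, ["CREATE (a:Person)"])  # wins
try:
    commit(bucket, "writer-b", etag_b, version_b, ["CREATE (b:Person)"])
    loser_fenced = False
except PreconditionFailed:
    loser_fenced = True  # stale ETag: reload the manifest and retry

print(bucket.get("manifest")[1], loser_fenced)
```

The losing writer's WAL object is left orphaned in this toy model; how the real engine cleans up or reuses such segments is not covered here.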
Durability vs visibility A successful write is **durable** when the call returns. It is **visible to other writers** when the next manifest version is published (immediately, since the writer just published it). New readers see it on their next snapshot acquisition. ## CREATE ```cypher CREATE (a:Person {name: 'Alice', age: 30}) ``` Allocates a fresh internal `NodeId`. To control the NodeId explicitly, pass `_id`: ```cypher CREATE (a:Person {_id: $uuid, name: 'Alice'}) ``` ### Edges ```cypher MATCH (a:Person {_id: $a}), (b:Person {_id: $b}) CREATE (a)-[r:KNOWS {since: 2020}]->(b) ``` The pattern requires both endpoints to be bound; NamiDB does not auto-create endpoints in a `CREATE` for an edge. ## MERGE `MERGE` is the upsert primitive — match if exists, create otherwise. ```cypher MERGE (p:Person {_id: $uuid}) ON CREATE SET p.created_at = datetime() ON MATCH SET p.last_seen = datetime() ``` ```cypher MATCH (a:Person {_id: $a}), (b:Person {_id: $b}) MERGE (a)-[r:KNOWS]->(b) ON CREATE SET r.since = datetime() ``` ## SET ```cypher MATCH (p:Person {_id: $id}) SET p.age = $age, p.last_seen = datetime() ``` Bulk update via map: ```cypher MATCH (p:Person {_id: $id}) SET p += {age: $age, country: $country, last_seen: datetime()} ``` ## DELETE / DETACH DELETE ```cypher MATCH (p:Person {_id: $id}) DELETE p // errors if p has edges MATCH (p:Person {_id: $id}) DETACH DELETE p // deletes p + its edges ``` ## REMOVE ```cypher MATCH (p:Person {_id: $id}) REMOVE p.old_field MATCH (n:Person) REMOVE n:Pending // remove a label ``` ## Combining writes with reads ```cypher MATCH (a:Person {_id: $a}) WITH a MATCH (b:Person {_id: $b}) CREATE (a)-[r:KNOWS {since: 2020}]->(b) RETURN r ``` ## Bulk inserts (Python) For high-volume ingestion, prefer the bulk staging APIs over per-row `CREATE`: ```python import uuid import namidb as tg client = tg.Client("s3://my-bucket?ns=prod&region=us-east-1") client.merge_nodes( "Person", [{"id": str(uuid.uuid4()), "name": f"p{i}", "age": 20 + i} for i in 
range(10_000)], ) client.merge_edges( "KNOWS", [{"src": "uuid-a", "dst": "uuid-b", "since": 2020}], ) client.commit() # WAL + manifest CAS client.flush() # memtable -> L0 SSTs ``` These stage into the current batch (same lifecycle as `upsert_*`) and amortise a single tokio-runtime + mutex round-trip across thousands of rows. ## See also * [Supported subset](/en/cypher/supported-subset) * [Python SDK](/en/sdk/python) · [Rust SDK](/en/sdk/rust) · [HTTP API](/en/sdk/http) * [RFC-009 — Write clauses](/en/internals/rfcs/009-write-clauses) # Choose a deployment > Embedded vs Server vs Cloud. Decision matrix and the questions to ask yourself. NamiDB ships one engine in three shapes. They write to the same bucket layout, so you can mix and match. ## TL;DR | If you… | Use | | ------------------------------------------------------------------- | ------------ | | are building a notebook, a single-process service, or a CI fixture | **Embedded** | | want a network boundary with bearer-token auth between app and DB | **Server** | | want zero-ops, per-namespace scale-to-zero, multi-tenant by default | **Cloud** | ## The three shapes Embedded Your application imports NamiDB directly and points at a bucket you control. **The “DuckDB for graphs” mode.** * Lowest latency, zero network overhead * Works in Python (`pip install namidb`) and Rust * Two replicas of your service can independently open the same namespace — epoch-CAS fences out stale writers automatically * No new auth surface to wire Server `namidb-server` opens a namespace and exposes it over HTTP. * One Rust binary, one Docker image * Bearer-token auth (`--auth-token`) * Periodic memtable → L0 flush loop * Right when DB and app live on different machines * REST endpoints: `/v0/cypher`, `/v0/health`, `/v0/admin/flush` Cloud Hosted multi-tenant SaaS on `namidb.com`. 
* Per-namespace scale-to-zero * Encrypted-at-rest tenants * Managed control plane, NamiDB team on-call * Closed beta — [request access](/en/cloud/request-access) ## Decision matrix | Dimension | Embedded | Server | Cloud | | ---------------------- | ------------------------------ | ------------------------------------------------------------------------------------------------------ | ----------------------- | | **Setup effort** | Pip install | Docker image + bucket | Sign up | | **Latency** | Lowest (in-process) | + 1 network hop | + 1 hop, managed | | **Concurrent writers** | 1 / namespace (your process) | 1 / namespace (the daemon) | 1 / namespace (managed) | | **Concurrent readers** | Many | Many (1 mutex today, [RFC-021](https://github.com/namidb/namidb/blob/main/docs/rfc/) for full fan-out) | Many | | **Auth model** | None (in-process) | Bearer token | API keys + IAM | | **Storage** | Any URI you own | Any URI you own | Managed bucket | | **Multi-tenant** | Yes (one namespace per tenant) | Yes (one daemon per tenant) | First-class | | **Pay model** | Storage + your compute | Storage + your compute | Per-namespace metering | | **Self-host?** | Yes | Yes | No | ## Questions to ask yourself 1. **Where does the data live, and who else needs it?** If only your single Python service ever reads or writes, **Embedded** is the lowest-friction call. If multiple services need the same namespace and you want one authoritative endpoint, **Server** gives you that network boundary. 2. **Is read fan-out your bottleneck today?** `namidb-server` serialises requests behind a tokio `Mutex` today ([RFC-021](https://github.com/namidb/namidb/blob/main/docs/rfc/) removes that). If you need horizontal read scale right now, run multiple `namidb-server` processes against the same bucket — each can serve reads off the same manifest version. Only one will be allowed to commit writes. 3. **Do you want to operate any infrastructure?** **Cloud** is the right answer when the answer is “no”. 
Multi-tenant from day one, scale-to-zero, no flush schedules to tune, no Docker images to bump. 4. **Are tenants first-class in your app?** Embedded and Server both let you do “one namespace per tenant” manually. Cloud makes it the unit of billing, isolation, and scale-to-zero. ## You can mix them Same engine, same bucket layout, **same `s3://…` URI**. A common pattern: * **Embedded** in your application for the hot read path. * A **`namidb-server`** in the same VPC for ad-hoc analytics from a notebook or a teammate’s laptop, both pointing at the same namespace. `namidb-server` will fence the embedded writer if it tries to mutate at the same time — single-writer-per-namespace is enforced via manifest CAS, regardless of how the writer was reached. ## Next * [Embedded — Python SDK](/en/sdk/python) * [Embedded — Rust SDK](/en/sdk/rust) * [Server — HTTP API](/en/sdk/http) * [Self-host with Docker Compose](/en/operations/self-host-docker-compose) * [Cloud — what it is](/en/cloud/what-is-namidb-cloud) # Install > Install NamiDB as a Python package, a Rust dependency, a CLI, or a Docker image. NamiDB ships in four flavours, all driven by the same Rust core. ## Python Pre-built wheels (abi3) for Python ≥ 3.9 on Linux (x86\_64 + aarch64), macOS (arm64), and Windows (x86\_64). Intel macOS falls back to sdist. ```bash pip install namidb # core pip install 'namidb[pandas]' # + DataFrame interop pip install 'namidb[polars]' # + Polars interop ``` `pyarrow >= 14` is a hard transitive dependency. ### From source ```bash pip install maturin git clone https://github.com/namidb/namidb.git cd namidb/crates/namidb-py maturin develop --release --extras test ``` [→ Full Python SDK reference](/en/sdk/python) ## Rust (embedded) Add the umbrella crate to your `Cargo.toml`: ```toml [dependencies] namidb = "0.3" tokio = { version = "1", features = ["full"] } ``` MSRV is **Rust 1.85**. The workspace exposes a stable facade crate so downstream code only needs one line. 
[→ Full Rust SDK reference](/en/sdk/rust) ## CLI From source (one-shot install): ```bash git clone https://github.com/namidb/namidb.git cd namidb cargo install --path crates/namidb-cli ``` ```bash namidb run "CREATE (a:Person {name: 'Alice'})" namidb run --store s3://my-bucket?ns=prod \ "MATCH (p:Person) RETURN count(*) AS n" namidb explain --verbose \ "MATCH (a:Person)-[:KNOWS]->(b) RETURN b ORDER BY b.id LIMIT 20" ``` [→ Full CLI reference](/en/sdk/cli) ## HTTP server (`namidb-server`) * Cargo ```bash cargo install --path crates/namidb-server ``` * Docker ```bash docker build -t namidb-server:0.3 \ -f crates/namidb-server/Dockerfile . ``` Run it: ```bash namidb-server \ --store 's3://my-bucket?ns=prod&region=us-east-1' \ --listen 0.0.0.0:8080 \ --auth-token "$NAMIDB_AUTH_TOKEN" \ --flush-interval 30s ``` Every flag is also an env var: `NAMIDB_STORE`, `NAMIDB_LISTEN`, `NAMIDB_AUTH_TOKEN`, `NAMIDB_FLUSH_INTERVAL`. No token, no production If `--auth-token` is unset, the server boots in **unauthenticated** mode and prints a loud warning. Do not expose that port to the public internet. [→ Full HTTP API reference](/en/sdk/http) · [→ Self-host with Docker Compose](/en/operations/self-host-docker-compose) ## Verify the install * Python ```python import namidb as tg client = tg.Client("memory://test") print(client.cypher("RETURN 1 AS n").rows()) # [{'n': 1}] ``` * CLI ```bash namidb run "RETURN 1 AS n" ``` * HTTP ```bash namidb-server --store memory://test --listen 127.0.0.1:8080 & curl -s http://127.0.0.1:8080/v0/health | jq . 
``` ## Supported platforms | Surface | Linux x86\_64 | Linux aarch64 | macOS arm64 | macOS x86\_64 | Windows x86\_64 | | ---------------------- | -------------- | -------------- | ----------- | ------------- | --------------- | | Python wheel | ✅ | ✅ | ✅ | sdist | ✅ | | Rust crate | ✅ | ✅ | ✅ | ✅ | ✅ | | `namidb-server` binary | ✅ | ✅ | ✅ | ✅ | ✅ | | Docker image | ✅ (multi-arch) | ✅ (multi-arch) | — | — | — | # 30-second quickstart > Six lines of Python against an in-memory namespace. No credentials, no setup — just the engine. The fastest possible taste of NamiDB. Ephemeral, in-process, no setup. ## Install * Python ```bash pip install namidb ``` * Rust ```bash cargo add namidb ``` * CLI ```bash cargo install --git https://github.com/namidb/namidb namidb-cli ``` ## Hello, graph * Python ```python import namidb as tg client = tg.Client("memory://acme") client.cypher("CREATE (a:Person {name: 'Alice'})") client.cypher("CREATE (b:Person {name: 'Bob'})") client.cypher( "MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'}) " "CREATE (a)-[:KNOWS {since: 2020}]->(b)" ) result = client.cypher("MATCH (p:Person) RETURN p.name AS name") print(result.rows()) # [{'name': 'Alice'}, {'name': 'Bob'}] ``` * Rust ```rust use namidb_query::{execute, lower, parse, Params}; use namidb_storage::{parse_uri, WriterSession}; #[tokio::main] async fn main() -> anyhow::Result<()> { let (store, paths) = parse_uri("memory://acme")?; let mut writer = WriterSession::open(store, paths).await?; // ... upsert nodes / edges, then commit_batch + flush ... 
let snap = writer.snapshot(); let query = parse("MATCH (a:Person) RETURN count(*) AS n")?; let plan = lower(&query)?; let rows = execute(&plan, &snap, &Params::new()).await?; println!("{rows:?}"); Ok(()) } ``` * CLI ```bash namidb run "CREATE (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})" namidb run "MATCH (p:Person) RETURN p.name" ``` * HTTP ```bash # In one shell: start the server namidb-server --store memory://acme --listen 127.0.0.1:8080 # In another: curl -s -X POST http://127.0.0.1:8080/v0/cypher \ -H 'Content-Type: application/json' \ -d '{"query": "CREATE (a:Person {name: \"Alice\"}) RETURN a.name"}' ``` ## Make it persistent Swap the URI. The same six lines of code work against any backend: * Local file ```python client = tg.Client("file:///var/lib/namidb?ns=prod") ``` * AWS S3 ```python client = tg.Client("s3://my-bucket/data?ns=prod&region=us-east-1") ``` * Cloudflare R2 ```python client = tg.Client( "s3://my-bucket?ns=prod" "&endpoint=https://.r2.cloudflarestorage.com" "&region=auto" ) ``` * GCS ```python client = tg.Client("gs://my-bucket/data?ns=prod") ``` * Azure Blob ```python client = tg.Client("az://account/container?ns=prod") ``` ## Next steps * [Your graph in S3](/en/get-started/your-graph-in-s3) — the headline use case, end-to-end. * [Choose a deployment](/en/get-started/choose-deployment) — Embedded vs Server vs Cloud. * [Cypher reference](/en/cypher/supported-subset) — exactly which Cypher / GQL the engine understands today. * [Operations / URI grammar](/en/operations/uri-grammar) — all backends, every query-string flag. # What is NamiDB > A graph database engine for the era of object storage, columnar execution, and AI agents. One engine. Three deployments. The bucket is the database. NamiDB is a **graph database engine** built from first principles for the era of object storage, columnar execution, and AI agents. 
It is **embedded** like DuckDB, **multi-tenant** by namespace, and runs the same engine whether you import it as a library, run it as a daemon, or consume it on our hosted cloud. **Object storage is the source of truth.** ## Why now Three things changed. They changed everything. 1. **Object storage grew up.** In 2024, S3 shipped conditional writes (`If-Match` / `If-None-Match`) — the last missing primitive. For the first time, you can build a coordinated, durable system where object storage *is* the database — no Raft, no ZooKeeper, no etcd. 2. **The best columnar graph engine left the market.** In October 2025, Apple acquired Kùzu and archived the repository. The most thoughtful columnar graph engine ever published went quiet. A hole opened. 3. **Agents need graphs.** Vector search is necessary. It is not sufficient. Knowledge graphs are the substrate of agent memory, deep retrieval, and reasoning under uncertainty. The next decade of AI will run on relationships. So we are building the database for that decade. NamiDB is open under [BSL 1.1](/en/community/license) (converts to Apache 2.0 after three years per release). **The engine is open; the cloud is the business** — a hosted multi-tenant product on [`namidb.com`](https://namidb.com) is the funding model that keeps the engine moving. ## The shape **NamiDB writes Cypher to your S3 bucket.** No control plane to provision. No Raft to tune. No etcd to babysit. Conditional writes (`If-Match` / `If-None-Match`) on object storage replace the consensus tier — the bucket itself is the source of truth. 
Your graph database is *just files in your bucket*: * **Durability** is whatever S3, R2, GCS, or Azure already give you * **Cost** scales to zero when nobody queries * **Backups** are `aws s3 sync` * **Tenants** are folders The engine is the same whether you run it as a library, as a Rust daemon over HTTP, or on our hosted multi-tenant cloud — and it works equally well against **AWS S3**, **Cloudflare R2**, **GCS**, **Azure Blob**, **MinIO**, or your local disk. ## Three deployments, one engine Embedded `pip install namidb`. The “DuckDB for graphs” mode — your application imports the library and talks to a bucket directly. Lowest latency, no extra hop, no network boundary, no auth surface. Server `namidb-server` opens a namespace and exposes it over HTTP with bearer-token auth. Right when the DB lives on a different machine than the app, or you want a network boundary. Cloud Hosted multi-tenant SaaS on `namidb.com` with per-namespace scale-to-zero, encrypted-at-rest tenants, and a managed control plane. Closed beta — [request access](/en/cloud/request-access). All three speak the same Cypher, return the same types, and write to the same bucket layout. **You can boot an embedded notebook against the same `s3://…` URI a production daemon is serving.** ## What’s in the engine today * **Cypher + GQL parsing** — strict subset of GQL (ISO/IEC 39075:2024) + openCypher 9. The 12 in-scope LDBC SNB Interactive Complex Read queries (IC01–IC12) parse, plan and execute end-to-end. * **Writes via Cypher** — `CREATE`, `MERGE`, `SET`, `DELETE`, `DETACH DELETE`, `REMOVE`. Durable on `commit_batch` (WAL append + manifest CAS). * **Cost-based optimizer** — predicate pushdown, projection pushdown, join reorder, hash-join conversion, hash semi-join (`EXISTS` decorrelation), Parquet row-group pruning. `EXPLAIN VERBOSE` prints the chosen plan with selectivity and cost annotations. 
* **Vectorized execution** — morsel-driven executor with optional **factorized intermediate representation** ([RFC-017](/en/internals/rfcs/017-factorization)) for path-heavy queries. * **Columnar storage on object storage** — Parquet node SSTs, custom edge-SST format with CSR adjacency ([RFC-002](/en/internals/rfcs/002-sst-format)), zstd compression, bloom filters, fence-pointer indices. * **Coordination-free correctness** — single-writer-per-namespace with epoch fencing via manifest CAS. * **Tiered caches** — process-wide `AdjacencyCache`, `NodeViewCache`, and `SstCache`. Cross-snapshot reuse with `Arc`-shared, byte-budgeted memory. * **Six storage backends** — `memory://`, `file://`, `s3://`, `gs://`, `az://`, and any S3-compatible endpoint (R2, MinIO, Tigris, LocalStack). * **Python bindings**, **CLI**, and **`namidb-server`** HTTP daemon. ## Where to next 30-second taste [Quickstart](/en/get-started/quickstart) — six lines of Python, no credentials. Your graph in S3 [Your graph in S3](/en/get-started/your-graph-in-s3) — point at an AWS bucket. Restart the process. The graph is still there. Choose a deployment [Choose a deployment](/en/get-started/choose-deployment) — Embedded vs Server vs Cloud, with a decision matrix. Deep dive [The RFCs](/en/internals/rfcs) — 18 design documents covering the storage engine, SST format, optimizer, factorization, caches. # Your graph in S3 > Point at an AWS S3 bucket. Write Cypher. Restart the process. The graph is still there. The bucket is the database. The headline use case. **Your graph database lives in your S3 bucket.** 1. **Install the client** * Python ```bash pip install namidb ``` * Rust Cargo.toml ```toml [dependencies] namidb = "0.3" tokio = { version = "1", features = ["full"] } ``` 2. **Export AWS credentials** (or rely on an EC2 / EKS / Lambda IAM role) ```bash export AWS_ACCESS_KEY_ID=AKIA... export AWS_SECRET_ACCESS_KEY=... 
export AWS_DEFAULT_REGION=us-east-1 ``` The only IAM permissions NamiDB needs on the bucket are `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket`. That’s it. No DynamoDB lock table, no separate metadata service. 3. **Open (or bootstrap) a namespace** * Python ```python import namidb as tg client = tg.Client("s3://my-bucket/data?ns=prod&region=us-east-1") ``` * Rust ```rust use namidb::storage::{parse_uri, WriterSession}; let (store, paths) = parse_uri("s3://my-bucket/data?ns=prod&region=us-east-1")?; let mut writer = WriterSession::open(store, paths).await?; ``` 4. **Write Cypher** * Python ```python client.cypher("CREATE (a:Person {name: 'Alice', age: 30})") client.cypher("CREATE (b:Person {name: 'Bob', age: 25})") client.cypher( "MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'}) " "CREATE (a)-[:KNOWS {since: 2020}]->(b)" ) result = client.cypher( "MATCH (p:Person) WHERE p.age >= $min RETURN p.name AS name, p.age AS age", params={"min": 18}, ) print(result.to_pandas()) ``` 5. **Restart the process. Open a notebook on another machine with the same URI.** The graph is still there. **The bucket is the database.** ## Why this works * **Conditional writes (`If-Match` / `If-None-Match`)** on S3 replace external consensus. The manifest object is mutated by compare-and-swap, so two writers can race and only one wins — the loser retries against the new version. * **Single-writer-per-namespace** with epoch fencing. Multiple readers scale freely; only one mutator at a time per namespace. * **Durability is whatever S3 gives you**: 99.999999999% (11 nines) of object durability, multi-AZ replication. * **Cost scales to zero** when no client opens the namespace — no compute is running, no DynamoDB table is provisioned. ## Use Cloudflare R2 for zero egress R2 charges no egress and has full S3-compatible conditional writes. 
Same scheme, with the R2 endpoint and `region=auto`: ```python import namidb as tg client = tg.Client( "s3://my-bucket?ns=prod" "&endpoint=https://.r2.cloudflarestorage.com" "&region=auto" ) ``` If you’re running NamiDB **outside AWS** — on Cloudflare Workers, Fly.io, your VPS, your laptop — R2 is almost always the right call. ## What about backups? ```bash aws s3 sync s3://my-bucket/data/ ./backup-2026-05-19/ ``` That’s it. NamiDB never writes to anywhere else. ## Multi-tenancy Each namespace is a folder under your bucket prefix: ```plaintext s3://my-bucket/data/ ├── tenant-acme/ │ ├── manifest.json │ ├── wal/ │ └── sst/ ├── tenant-globex/ │ ├── manifest.json │ ├── wal/ │ └── sst/ └── tenant-initech/ ├── ... ``` Each `?ns=…` opens an isolated namespace. Operationally: **tenants are folders.** ## Next * [Cypher reference](/en/cypher/supported-subset) — what you can query today. * [Operations / URI grammar](/en/operations/uri-grammar) — every backend and flag. * [Self-host with Docker Compose](/en/operations/self-host-docker-compose) — network-bounded REST API in front of the bucket. # Internals (RFCs) > 18 design documents on NamiDB's storage engine, SST format, query engine, optimizer, and caches. NamiDB’s design lives in 18 (and counting) Request-For-Comments documents. Each captures the **why** of a major engine decision — the context, the alternatives considered, and the rationale. For a high-level orientation, start with [**RFC-001 — Storage engine**](/en/internals/rfcs/001-storage-engine) and [**RFC-002 — SST format**](/en/internals/rfcs/002-sst-format). 
## Storage engine | RFC | Topic | | ---------------------------------------------------- | ---------------------- | | [001](/en/internals/rfcs/001-storage-engine) | Storage engine | | [002](/en/internals/rfcs/002-sst-format) | SST format | | [003](/en/internals/rfcs/003-read-path-ranged-reads) | Read-path ranged reads | | [018](/en/internals/rfcs/018-csr-adjacency) | CSR adjacency cache | | [019](/en/internals/rfcs/019-node-view-cache-shared) | NodeView cache | | [020](/en/internals/rfcs/020-edge-sst-caches) | Edge SST caches | ## Query language | RFC | Topic | | --------------------------------------------- | --------------- | | [004](/en/internals/rfcs/004-cypher-subset) | Cypher subset | | [008](/en/internals/rfcs/008-logical-plan-ir) | Logical plan IR | | [009](/en/internals/rfcs/009-write-clauses) | Write clauses | ## Optimizer | RFC | Topic | | -------------------------------------------------------- | -------------------------- | | [010](/en/internals/rfcs/010-cost-based-optimizer) | Cost-based optimizer | | [011](/en/internals/rfcs/011-predicate-pushdown) | Predicate pushdown | | [012](/en/internals/rfcs/012-hash-join) | Hash join | | [013](/en/internals/rfcs/013-parquet-predicate-pushdown) | Parquet predicate pushdown | | [014](/en/internals/rfcs/014-hash-semi-join) | Hash semi-join | | [015](/en/internals/rfcs/015-projection-pushdown) | Projection pushdown | | [016](/en/internals/rfcs/016-join-reorder) | Join reorder | ## Executor | RFC | Topic | | ------------------------------------------- | ------------- | | [017](/en/internals/rfcs/017-factorization) | Factorization | ## Read fan-out (in flight) | RFC | Topic | | --- | --------------------------------------------- | | 021 | Read-path mutex removal (in flight on `main`) | ## How RFCs work For any change bigger than a few-line refactor, the contributor writes an RFC, opens a Draft PR with **only the RFC**, and gets feedback before writing any code. 
See [RFC process](/en/community/rfc-process) for the full workflow. The canonical source is [`docs/rfc/`](https://github.com/namidb/namidb/tree/main/docs/rfc) in the engine repo. The pages in this section are mirrors with a docs-friendly nav. # RFC 001: Storage Engine Architecture > **Status:** draft **Author(s):** NamiDB founding team > *Mirrored from [`docs/rfc/001-storage-engine.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/001-storage-engine.md) in the engine repo. Source of truth lives there.* ## Summary NamiDB stores property-graph data in an LSM-tree-shaped storage engine whose **source of truth is object storage** (S3 and compatibles). Coordination among writers and readers is provided by conditional writes (`If-Match` / `If-None-Match` / ETag) rather than by an external consensus service. The engine is single-writer per namespace, with **epoch fencing** enforced by manifest CAS. Reads are served from a three-tier hybrid cache (memory + NVMe + optional S3 Express One Zone) over the immutable, columnar SSTs that the storage layer produces. This RFC defines the on-disk layout, the manifest protocol, the write path (WAL → memtable → SST flush), the read path, and the compaction strategy. ## Motivation Existing graph databases fall into two camps: 1. **Single-node embedded / RAM-bound** (Kuzu, LadybugDB, HelixDB, Memgraph, FalkorDB). These store data on local disk or RAM. None can scale a single namespace beyond one machine, none has a “your S3 bucket is the source of truth” story, and none supports scale-to-zero pricing. 2. **Shared-nothing distributed** (Neo4j, TigerGraph, NebulaGraph, Neptune). These require operational expertise to run, are expensive, and tie compute to storage. 
We want the architecture proven by **turbopuffer** (vector), **SlateDB** (KV), **WarpStream** (Kafka), and **Neon** (Postgres) — compute and storage fully separated, object storage as durable substrate, single-writer fencing via cloud CAS — applied to **property graphs**. As of May 2026 nobody has shipped this. Hard requirements: * **Durability of S3** (11 nines) with no extra coordination service. * **Scale-to-zero per namespace** for SaaS economics. * **Cold query < 500 ms p50** at 10M edges. * **Warm query < 10 ms p50.** * **Snapshot isolation** for reads + ability to branch a graph at a point in time. * **Single binary** to run in embedded, server, and SaaS modes. ## Design ### Tiered storage and access model ```plaintext ┌─────────────────────────────────────┐ │ Memory cache (Arrow batches) │ Sub-ms p50, w-TinyLFU / SIEVE ├─────────────────────────────────────┤ │ NVMe disk cache (foyer-rs) │ ~1-10 ms p50 ├─────────────────────────────────────┤ │ S3 Express One Zone (optional) │ Single-digit ms, hot tier ├─────────────────────────────────────┤ │ S3 Standard / R2 / GCS / MinIO │ Source of truth, 11-nines durability └─────────────────────────────────────┘ ``` ### Logical layout in object storage ```plaintext // ├── manifest/ │ ├── current.json # tiny pointer file: { "version": v, "etag": "..." } │ └── v00000001.json # immutable manifest snapshot │ └── v00000002.json │ └── ... ├── wal/ │ ├── 00000001.wal # 64MB segments, append-only │ ├── 00000002.wal │ └── ... ├── sst/ │ ├── level0/ │ │ ├── 01J5XY...-nodes-Person.parquet │ │ ├── 01J5XY...-edges-KNOWS.csr │ │ └── 01J5XY...-vector-Document.lance │ ├── level1/ │ │ └── ... │ └── ... └── snapshots/ └── 2026-01-01T00:00:00Z.json # optional named snapshots / branches ``` Within a namespace: * Filenames are **ULIDs** (`uuid::Uuid::now_v7()`) for natural ordering by creation time. * Each SST belongs to one **(label, edge type, or vector index, level)** combination. 
* Manifests are **fully self-describing**: they list every SST currently part of version `v`, along with statistics (row count, byte size, key range, bloom filter, partition tag, histogram). ### Manifest protocol (the heart of the design) The manifest is the single object that determines “what is the current state of the database”. All writers race to update it; only one wins per epoch. #### Manifest file format (`v.json`) ```jsonc { "version": 42, "epoch": 7, "writer_id": "uuid-of-writer-process", "created_at": "2026-01-15T10:00:00.000Z", "schema_version": 11, "labels": [ { "name": "Person", "node_id_type": "Uuid", "properties": [ { "name": "name", "type": "Utf8", "nullable": false }, { "name": "age", "type": "Int32", "nullable": true } ] } ], "edge_types": [ { "name": "KNOWS", "src_label": "Person", "dst_label": "Person", "properties": [{ "name": "since", "type": "Date32", "nullable": true }] } ], "ssts": [ { "id": "01J5XY7K...", "kind": "Nodes", "label": "Person", "level": 0, "path": "sst/level0/01J5XY7K...-nodes-Person.parquet", "size_bytes": 134217728, "row_count": 1048576, "min_key": "00...", "max_key": "ff...", "created_at": "2026-01-15T10:00:00Z" } ], "wal_segments": [ { "id": "00000042.wal", "path": "wal/00000042.wal", "last_lsn": 1234567 } ] } ``` #### `current.json` (pointer file) ```jsonc { "version": 42, "manifest_path": "manifest/v00000042.json", "manifest_etag": "etag-of-v42-object" } ``` #### CAS protocol for committing a new manifest When the writer wants to advance the database from version `v` to `v+1`: 1. Read `current.json` (with its ETag) → `current_etag`. 2. Read `manifest/v.json` → previous state. 3. Build the new manifest in memory, incrementing `version` to `v+1` and updating `epoch` if needed. 4. **PUT** `manifest/v.json` with `PutMode::Create` (i.e. `If-None-Match: *`). If this fails, another writer has the same version assigned; abort, reload, retry. 5. **PUT** `current.json` with `PutMode::Update(version = current_etag)` (i.e. 
`If-Match: `). If this fails, another writer raced ahead; abort, reload, retry. This sequence ensures: * The manifest file for any given version `v` is written **at most once** (write-once contents). * The `current.json` pointer is the linearization point. Whoever wins the `If-Match` swap is the sole owner of version `v+1`. #### Epoch fencing Each writer process picks an `epoch` at startup, taken from `current_manifest.epoch + 1` and immediately committed via the CAS protocol above (a “zero-op” manifest update that only bumps epoch). After that, every other writer that tries to advance the version against this `current.json` will lose the CAS race and discover the new epoch on retry — at which point its operations must be rejected by the local writer’s epoch check. In code: ```rust pub struct WriterFence { pub epoch: u64, pub writer_id: Uuid, } impl WriterFence { pub fn assert_alive(&self, current_epoch: u64) -> Result<()> { if current_epoch > self.epoch { return Err(Error::Fenced { mine: self.epoch, current: current_epoch }); } Ok(()) } } ``` Every WAL append, every memtable flush, every SST commit calls `assert_alive` against the latest known manifest before its CAS. This is how single-writer is enforced **without** Raft, ZooKeeper, or any local file lock. ### Write path ```plaintext client write API │ ▼ [1] WriterFence.assert_alive(current_epoch) │ ▼ [2] Buffer in WAL batcher (group commit, 100ms or 1MB) │ ▼ [3] WAL flush: PUT wal/.wal segment to object store │ (PutMode::Create on first byte to detect concurrent writer) │ ▼ [4] Apply to in-memory memtable (Arrow-backed skiplist) │ ▼ [5] Acknowledge to client (durability == WAL acknowledged) │ ▼ (in background, when memtable > threshold) [6] Freeze memtable → flush to SST(s) in level 0 │ ▼ [7] Manifest CAS: add new SSTs, mark WAL segments as flushed │ ▼ (in background, scheduled) [8] Compaction worker: merge L0 → L1, manifest CAS, GC obsolete SSTs ``` WAL durability is the user-facing acknowledgement. 
After a WAL group commit returns success, the data is durable. ### Read path ```plaintext query API │ ▼ [1] Read current.json → snapshot version v │ ▼ [2] Optimizer builds plan against manifest v (immutable for this query) │ ▼ [3] Operators issue async fetches against cached SSTs / WAL │ (foyer tries memory → disk → S3 Express → S3 Standard) ▼ [4] Stream Arrow batches to client ``` The **manifest is immutable for the lifetime of the query**, giving snapshot isolation for free. SSTs are immutable until GC. The only mutable thing in the system is `current.json`. ### Branching (Neon-style) Branching is “named manifest aliases”: ```plaintext manifest/branches/my-branch.json → { "version": 42, "manifest_path": "manifest/v00000042.json" } ``` A branch shares SSTs with its parent (CoW); new writes go to SSTs owned by the branch. GC respects branch references. ### Compaction strategy * **Default: leveled compaction.** Level 0 = output of memtable flush (overlapping ranges allowed). Level `L > 0` is partitioned by key range (no overlap). * **Trigger:** L0 has > 4 SSTs, or `bytes(L_i) > 10 * bytes(L_{i-1})`. * **Worker:** stateless. A compaction worker reads SSTs, merges them, writes new SSTs, then does a manifest CAS to swap. If the CAS fails (manifest moved underneath), the worker discards its output and retries. * **GC:** SSTs unreferenced by `current.json` for > `retention_window` (default 24h) are deleted. Branches extend retention for their snapshots. ### Failure modes | Failure | Behavior | | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | | Writer crashes mid-WAL flush | WAL segment is partial; recovery treats the last malformed record as torn and discards it. Manifest unchanged → no data visible. 
| | Writer crashes after WAL flush, before manifest CAS | WAL segment is durable; on next startup, recovery replays WAL segments not referenced by manifest into a fresh memtable. | | Two writers race | Loser fails CAS, refreshes manifest, discovers higher epoch, fences itself, fails subsequent writes. | | Stale reader | Sees old manifest version `v`; queries are still consistent against `v` (snapshot isolation). New `current.json` reads pick up newer versions. | | Corrupted SST | CRC mismatch detected on read; query fails with `CorruptedSst { sst_id }`. SSTs can be re-derived from WAL replay if WAL retention permits. | ### Concurrency for readers Readers are **lock-free**: * Each query opens a snapshot of the manifest at query start. * All SSTs referenced by that manifest are guaranteed to exist until GC reclaims them (GC respects active queries via ref counting at the cache layer). * No locking against the writer; manifest swap is atomic via CAS. ### Why not just use SlateDB? SlateDB is a KV store. We need: * **Property graph schema** (typed nodes, typed edges, label-scoped indexes). * **CSR adjacency layout** in SSTs (not generic KV). * **Vector index integration** (Lance v2 format). * **Graph-shaped statistics** (degree distributions, label histograms) for the optimizer. * **Multi-SST commit atomicity** so a single graph mutation can update nodes + edges + vector index in one manifest. We borrow SlateDB’s protocol shape (WAL → memtable → SST → manifest CAS) and reimplement it with graph-aware SST format. SlateDB will be used in tests as a baseline. ## Alternatives considered ### A. Local disk + replication (Memgraph / Neo4j shape) Rejected: forces operators to run replicas, lose cloud economics, lose scale-to-zero. ### B. Postgres + extension (Apache AGE, pg-ivm) Rejected: inherits Postgres operational story; no path to S3-native; column store via Citus is awkward. ### C. 
ClickHouse-shaped MergeTree

Rejected: MergeTree is optimized for OLAP scans, not multi-hop graph traversals. CSR adjacency does not fit naturally.

### D. Raft / Paxos for coordination

Rejected: adds operational complexity and a separate failure domain. S3 conditional writes give us linearizable CAS for free.

## Drawbacks

1. **S3 PUT latency floor (\~30-100ms Standard).** Writes feel slower than local-disk databases. Mitigations: group commit (100ms/1MB), optional S3 Express One Zone tier (single-digit ms).
2. **Single-writer per namespace.** Genuinely multi-master writes are not supported in v1. For most workloads (especially analytical / KG / RAG) this is fine; for high-throughput OLTP it isn’t.
3. **Compaction write amplification on S3 costs money.** Tuning compaction policy + columnar compression is critical; we will benchmark continuously.
4. **CAS livelock under high writer contention** for the same namespace. Mitigation: only one writer per namespace by design; concurrent CAS losers fence themselves quickly.

## Open questions

* Bloom filter format inside the manifest vs as side-car files per SST.
* WAL segment size: 64MB or 16MB or 4MB? Smaller = lower commit latency, more PUTs (more $). Bench-driven.
* Compression level for WAL: `zstd -3` (default) vs uncompressed (latency win, $ loss).
* Whether to support multi-writer with merge semantics for CRDT-friendly use cases (agent memory). Probably v2.
* Manifest format: JSON now (simple) vs Arrow IPC (smaller, faster) — switch when manifest hits \~10MB.

## References

* Verbitski et al., **Amazon Aurora** (SIGMOD 2017).
* Dageville et al., **Snowflake** (SIGMOD 2016).
* Armbrust et al., **Delta Lake** (VLDB 2020).
* **SlateDB design overview**.
* **turbopuffer architecture**.
* **turbopuffer object-storage queue** blog.
* Jin et al., **Kùzu** (CIDR 2023).
* Leis et al., **Morsel-driven parallelism** (SIGMOD 2014).
* Neumann & Freitag, **Umbra** (CIDR 2020).
* AWS, **S3 conditional writes** (Aug + Nov 2024 launches).
* AWS, **Kafka KIP-1150 Diskless Topics** (Mar 2026).

# RFC 002: SST Format — Property Columnar + CSR Adjacency

> **Status:** draft **Author(s):** NamiDB founding team **Implements:** (links to PRs land here) **Supersedes:** — **Depends on:** [RFC-001](./001-storage-engine.md) (manifest CAS, namespace layout, write/read paths)

> *Mirrored from [`docs/rfc/002-sst-format.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/002-sst-format.md) in the engine repo. Source of truth lives there.*

## Summary

This RFC defines the **on-disk format of the SSTs** that the NamiDB LSM emits when it flushes a memtable. There are two physical kinds of SST in v1:

1. **Node SSTs** — Apache **Parquet** files, one per `(label, level)` bucket produced by a flush. They hold the property columns of nodes plus a tombstone column, an `lsn` column, the `node_id` key column, and a mandatory `__overflow_json: Utf8` column that captures any property the schema did not declare at flush time. They use Parquet’s standard page index + Zstd compression + dictionary encoding.
2. **Edge SSTs** — a NamiDB-native **CSR binary** format, one per `(edge_type, level)` bucket. Each flush emits two physical files per bucket: a **forward** SST (sorted by `src_id`) and an **inverse** partner SST (sorted by `dst_id`). Both share the same wire format, differentiated by a single header flag. They hold the adjacency for a fixed `(src_label, edge_type, dst_label)` triple, with bitpacked offsets, split-encoded neighbour lists, a fence-pointer index for large SSTs, and parallel Zstd-compressed property streams.
They are designed for `O(deg(v))` neighbour expansion in **two ranged GETs warm / four to six cold** against object storage (counts derived in §3.4).

Following the resolution of the RFC-001 open question on bloom filters and after a sizing review (revision 2):

* **Property stats, degree histograms, and key ranges** are embedded in the manifest’s `SstDescriptor` — they fit in JSON and gate scans without extra GETs.
* **Bloom filters** live as **side-car files** (`.bloom`, raw binary, fetched on first probe and cached by foyer). They are too large (≈1.25 MiB for 1 M keys at 10 bits/key) to inline in a JSON manifest at production scale.

The read path therefore needs **two GETs** to gate by min/max + property stats and **at most one extra GET per candidate SST** when a bloom probe is needed (and that GET is foyer-cacheable across queries).

This RFC defines:

* Path conventions for SST objects.
* Schema-to-Parquet mapping for node SSTs.
* Byte-level wire format for edge SSTs.
* The extended `SstDescriptor` struct, including the embedded statistics and the bloom side-car pointer.
* Forward-compatibility rules that bind reader and writer.
* The read-path access patterns and the GETs they imply per neighbour expansion.

It does **not** define the flush orchestration, recovery / WAL replay, or the read-side snapshot merge. Those are the subjects of follow-up RFCs (RFC-003 flush + recovery, RFC-004 read snapshot).

## Motivation

After RFC-001 the storage engine can take writes (WAL + memtable) and commit linearisable manifest versions. It still cannot **durably materialise data into queryable columnar files**. Until SSTs exist:

* The memtable is the only home of accepted writes; a flush of any meaningful size cannot complete.
* The manifest’s `ssts: []` is permanently empty, so the read path has no cold tier to fall back on.
* We cannot bench the **§14.1 thresholds** (cold <500 ms p50, warm <10 ms p50, ingest ≥10 k nodes/s) because there is nothing to read.
We need a format that satisfies six non-negotiables: 1. **Immutable, write-once.** Required by manifest CAS — an SST that is referenced by manifest *v* must not change underneath a reader of *v*. 2. **Random-access ranged reads.** Object storage charges per GET and per byte; we want to fetch only the column slices we need. 3. **CSR-shaped edges, both directions.** Multi-hop traversals must do `O(deg(v))` work per expansion regardless of whether the expansion is by `src` or by `dst`. Single-direction CSR forces linear scans for the missing direction, which kills the cold-query budget. 4. **Stats good enough for the optimizer.** Without min/max key, bloom filters and degree histograms the read path turns every “look up a neighbour” into a fan-out across every SST in scope. 5. **Sane wire stability story.** v1 of the file will outlive v0.1.0 of the binary. The format needs a version byte and explicit forward-compat rules. 6. **No data loss on schema drift.** Open-schema ingest at the SDK is the normal mode of operation for GraphRAG / agent-memory; properties the schema does not yet declare must round-trip through the SST intact, never be silently dropped. Parquet covers (1), (2), (4), (5), and (6) out of the box for nodes. For edges no off-the-shelf format gives us (3) without paying for orthogonal machinery we do not need (e.g. Parquet repetition levels for a CSR-shaped list-of-list adds metadata overhead and an indirection on every neighbour lookup). Hence: Parquet for nodes, custom CSR for edges. ## Design ### 1. Naming conventions and paths A single flush emits a set of SSTs and a parallel set of bloom side-cars. Each SST is identified by a UUIDv7; its on-disk path is determined by the namespace, level, kind, and scope: ```text //sst/level/--. 
//sst/level/--.bloom where: ∈ {0, 1, 2, …} ∈ {"nodes", "edges-fwd", "edges-inv"} is the label name (nodes) or edge type name (edges-fwd/-inv) is "parquet" for nodes, "csr" for edges is the writer's `Uuid::now_v7()` rendered hex (no dashes) ``` Examples: ```text acme/sst/level0/01959a3f7b...-nodes-Person.parquet acme/sst/level0/01959a3f7b...-nodes-Person.bloom acme/sst/level0/01959a3f7c...-edges-fwd-KNOWS.csr acme/sst/level0/01959a3f7c...-edges-fwd-KNOWS.bloom acme/sst/level0/01959a3f7d...-edges-inv-KNOWS.csr acme/sst/level0/01959a3f7d...-edges-inv-KNOWS.bloom ``` The UUIDv7 is monotonically time-ordered, so a `list` on `sst/level/` yields candidate SSTs in creation order — which lets the read-side merger apply “newest wins per key” without first sorting by some metadata field. Forward and inverse partners produced by the same flush carry **distinct UUIDv7s**: this preserves the rule that one object\_store path identifies exactly one immutable artefact, and lets compaction retire one partner without the other when necessary. `namidb-storage::paths::NamespacePaths` will grow two helpers: ```rust impl NamespacePaths { pub fn sst_file( &self, level: u32, id: Uuid, kind: SstKind, scope: &str, ) -> Path { /* …--. */ } pub fn sst_bloom_file( &self, level: u32, id: Uuid, kind: SstKind, scope: &str, ) -> Path { /* …--.bloom */ } } ``` Scope strings appear in object keys, so the writer **must** validate them against the same DNS-safe ruleset that `NamespaceId` already enforces. Schema declarations gate this upstream, but the writer asserts it again to defend against bypasses. ### 2. 
Node SST — Parquet layout

#### 2.1 Logical schema

For each node label `L` with declared properties `p_1: T_1, …, p_k: T_k` (see [`namidb-core::schema::LabelDef`](../../crates/namidb-core/src/schema.rs)), a node SST has the following Arrow schema:

| Column | Arrow type | Nullable | Notes |
| --- | --- | --- | --- |
| `node_id` | `FixedSizeBinary(16)` | no | UUIDv7 of the node, big-endian. |
| `tombstone` | `Boolean` | no | `true` = node deleted as of `lsn`. |
| `lsn` | `UInt64` | no | LSN at which this row was applied. |
| `prop_` | `.to_arrow()` | per def | One column per `PropertyDef` declared in the schema at write time. |
| `__overflow_json` | `Utf8` | **yes** | Mandatory column. Stores undeclared properties as a JSON object string; `null` when there are none. |
| `__schema_version` | `UInt64` | no | Snapshot of the manifest’s schema version at flush time. Lets the reader pin its decode rules. |

The `__overflow_json` column is **always present** in the Parquet schema (even when every row is null), so a reader can rely on it being there unconditionally without consulting the manifest first. This closes the revision-1 open question about open-schema ingest: undeclared properties flow through ingest → memtable → SST → reader without any silent drop, and the SDK layer (Python / TS) reconstructs them on read into the caller’s native map type.

`__schema_version` is the manifest version the writer used to map property names → columns. Two SSTs written under different schema versions can co-exist; the reader chooses how to reconcile them (typically: newer schema\_version wins; older SST’s columns are mapped back to their names via the manifest of the version that produced them — handled by RFC-004).

**Reserved column names.** Every declared property `p` is materialised as the Parquet column **`prop_<p>`** (i.e. the `prop_` prefix is part of the on-disk column name, not an editorial shorthand). Names that would collide with the engine-managed columns — `node_id`, `tombstone`, `lsn`, `__overflow_json`, `__schema_version` — are reserved. The writer **rejects** any `PropertyDef` whose `name`:

* starts with the prefix `prop_` (would double-prefix on disk),
* starts with the prefix `__` (engine-private namespace), or
* equals one of `{node_id, tombstone, lsn}`.

Enforcement happens at schema-declaration time in `namidb-core::schema::PropertyDef::new` *and* is asserted again at flush time with `Error::SchemaConflict`. A reader that observes a column whose name violates the namespace rules treats the SST as corrupted.

#### 2.2 Sort order

Rows inside a node SST are sorted by **`(node_id ascending)`**. The memtable already gives us this order — `MemKey::Node { label, id }` sorts lexicographically by `id` inside a single label scope. The Parquet writer asserts this invariant; out-of-order rows are a writer bug and abort the flush.

This sort order matters because:

* The read path’s merge needs an ordered stream per SST to do a k-way merge of `(SST_0, …, SST_n, frozen_memtable, live_memtable)` without buffering everything in memory.
* The Parquet page index gives O(log n) lookup for a target `node_id` using just min/max of each page, which is the basis of the warm point-lookup path.

Within an SST a `node_id` appears at most once: the memtable already collapses repeated upserts of the same key.
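The k-way merge that §2.2 relies on could look roughly like the sketch below. It is illustrative only, not the engine's reader: `Row` and `merge_newest_wins` are hypothetical names, the sources are plain in-memory slices rather than Arrow streams, and resolving duplicate keys by highest `lsn` (with a winning tombstone hiding the key) is an assumption, since the read-side merge itself is deferred to a follow-up RFC.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical row shape for illustration; property columns are omitted.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Row {
    pub node_id: [u8; 16],
    pub lsn: u64,
    pub tombstone: bool,
}

/// Merges sources that are each sorted by `node_id` ascending.
/// Assumption: for a repeated `node_id` the row with the highest `lsn` wins,
/// and a winning tombstone removes the key from the output.
pub fn merge_newest_wins(sources: &[Vec<Row>]) -> Vec<Row> {
    // Min-heap on (node_id, Reverse(lsn)): smallest key first and, within
    // one key, the newest row first. Only one cursor per source is held.
    let mut heap = BinaryHeap::new();
    for (s, rows) in sources.iter().enumerate() {
        if let Some(r) = rows.first() {
            heap.push(Reverse((r.node_id, Reverse(r.lsn), s, 0usize)));
        }
    }

    let mut out = Vec::new();
    let mut last_key: Option<[u8; 16]> = None;
    while let Some(Reverse((id, _, s, pos))) = heap.pop() {
        let row = &sources[s][pos];
        if last_key != Some(id) {
            // First (newest) row seen for this key decides its fate.
            last_key = Some(id);
            if !row.tombstone {
                out.push(row.clone());
            }
        }
        // Advance the cursor of the source we just consumed from.
        if let Some(next) = sources[s].get(pos + 1) {
            heap.push(Reverse((next.node_id, Reverse(next.lsn), s, pos + 1)));
        }
    }
    out
}
```

Because every source is already ordered by `node_id`, the heap never holds more than one entry per source, which is the property that lets a real reader stream instead of buffering whole SSTs.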
#### 2.3 Encodings and compression | Setting | Value | Rationale | | ------------------------ | ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Compression | `Zstd` level `6` | Sweet spot per Parquet benchmarks: \~25–35 % ratio on string-heavy graph data, \~250 MB/s decode. | | Dictionary encoding | enabled (per column) | Mandatory for low-cardinality `Utf8` (e.g. country, status). `parquet-rs` falls back automatically when the dictionary stops paying. | | Row group size | **128 K rows** (target) | Page-index granularity per §7.5 of the plan. **Assumed average row size 256 B → 32 MiB row groups warm.** Rotation algorithm: writer closes a row group as soon as **either** 128 K rows have been buffered **or** the in-memory uncompressed byte size of the buffered rows would exceed 256 MiB (whichever fires first). Short flushes that would produce a row group smaller than 16 MiB are merged with the previous group of the same SST when possible (policy in RFC-003). | | Data page size | `1 MiB` | Largest page object\_store will fetch with a single ranged GET. | | Write batch size | `8192` rows | Standard Arrow batch sentinel. | | Statistics | column min/max **on** | Per-page and per-row-group; used by reader for predicate pushdown. | | Page index | **on** | Required by `parquet-rs 55` to enable per-page min/max-driven row-group pruning. | | Bloom (Parquet built-in) | **disabled** | We bring our own side-car bloom over `node_id`. Parquet’s bloom adds bytes for no win here. 
| | Page checksums | enabled | Cheap, defends against torn S3 reads. | These are defaults; the writer accepts a `NodeSstWriterOptions` to override compression level (e.g. `Zstd-18` for cold archives), so a future compaction worker can re-compress L≥2 SSTs without changing the file format. **NaN / `±Inf` handling.** `f32` / `f64` columns may contain `NaN`, `+Inf`, and `-Inf` freely — the raw bytes always land in the page data, the row is **never rejected** by the writer. The contract is only at the stats layer: a column whose page contains any `NaN` or `±Inf` produces `min = None` and `max = None` in `PropertyColumnStats`. Predicate pushdown gracefully falls back to per-row evaluation when min/max is unavailable. The page’s `null_count` remains accurate. #### 2.4 Tombstones Deletes are stored as rows with `tombstone = true` and **null** property columns. The reader treats a tombstone as “this `node_id` does not exist at `>= lsn`” — overriding any older SST that contains the same `node_id`. Tombstones live until they are absorbed by a compaction that proves no older SST or WAL segment still references the key (full-tree compaction; will be defined in RFC-005 compaction policy). We **do not** use Parquet’s “delete files” (Iceberg-style positional deletes). They require a separate index and a second manifest hop; we get the same semantics by treating `tombstone` as a regular column. **Nullable invariant.** Tombstone rows carry `null` in every declared property column. To make that representable regardless of the `PropertyDef.nullable` flag, the SST-level Arrow field for every declared property is **always nullable** in the Parquet schema, even when the schema declares `nullable = false`. Non-null contracts on declared properties are an **ingest-time** invariant (enforced before a row reaches the memtable), not an SST-time invariant. 
The writer implementation rejects a non-tombstone row with a null value in a `nullable = false` column at the ingest API boundary; once the row is in the SST, only the `tombstone` flag determines whether the column is “validly null”.

#### 2.5 Footer + statistics extraction

When the writer closes a Parquet file it emits the standard Parquet footer plus a sidecar `NodeSstStats` struct that goes into the manifest’s `SstDescriptor` (see §4 below):

```rust
pub struct NodeSstStats {
    pub row_count: u64,
    pub tombstone_count: u64,
    pub min_node_id: [u8; 16],
    pub max_node_id: [u8; 16],
    pub min_lsn: u64,
    pub max_lsn: u64,
    pub property_stats: Vec<PropertyColumnStats>,
    pub schema_version_min: u64,
    pub schema_version_max: u64,
}
```

`PropertyColumnStats` carries `null_count`, `min`, `max` for ordered types, and `ndv_estimate` (HyperLogLog++ sketch, 1 KiB) for cardinality hints. These are read from the Parquet footer column stats — no extra scan required. The bloom filter is **not** part of this struct (it goes to the side-car file; see §4.2).

### 3. Edge SST — CSR binary format

#### 3.1 Why custom

A property graph SST for edges has structure that Parquet does not natively express well:

* The natural unit of read is “give me all neighbours of node `s` reached by `edge_type`”, which is a **variable-length list** keyed by `s`.
* We want **O(1)** access from `src_id` to the offset of its neighbour list; Parquet’s repetition-level / definition-level list encoding gives us O(row\_group) at best.
* Edge property values are co-located with the neighbour they describe. Splitting them into independent Parquet files would lose the joint ordering invariant we rely on to make a single ranged GET hot.

A custom format also lets us implement two specific NamiDB optimisations (open from PDF gap #9 / RFC-001 §“own contribution”):

* **Power-law-aware encoding.** High-degree source nodes get a separate block layout (large delta, dense bitmap of neighbours when degree exceeds a threshold).
Long-tail sources use the default split-top64/bottom64 encoding.
* **Edge-direction packing.** The same edge data is stored twice per edge type per flush: a **forward** SST sorted by `src_id`, and an **inverse** SST sorted by `dst_id`. Both use the same wire format, distinguished by `flags.INVERSE_PARTNER`. The inverse SST stores the *pair* `(dst, src)` in its neighbour positions, so reading “all in-edges of `v`” is the same code path as reading “all out-edges of `v`”.

The choice of “always write inverse partner at flush time” (vs “only at compaction”) trades **2× write amplification** for **bounded in-edge-query latency on freshly-flushed data**. Without inverse partners, a `MATCH (n)-[:KNOWS]->(:Person {name: 'Bob'})` query that hits L0 SSTs has to scan every neighbour list — for a 10 M-edge graph that is single-digit seconds, blowing the budget. Bench data may later motivate making inverse generation optional per edge type, but the default in v1 is “always”.

#### 3.2 File layout

All multi-byte integers are **little-endian**. All offsets are absolute byte offsets from the start of the file unless stated otherwise.
```text ┌─────────────────────────────────────┐ offset 0 │ File header (64 bytes, FROZEN) │ ├─────────────────────────────────────┤ │ Section: key_ids │ kind = 0x0001 │ sorted UUIDv7s (16 B each) │ │ "src_ids" in fwd / "dst_ids" inv │ ├─────────────────────────────────────┤ │ Section: offsets │ kind = 0x0002 │ one entry per key_id + 1 sentinel│ │ bitpacked u24/u32/u40/u48 │ ├─────────────────────────────────────┤ │ Section: partners │ kind = 0x0003 │ split or dense per-group blocks │ │ "neighbours (dst)" in fwd │ │ "neighbours (src)" in inv │ ├─────────────────────────────────────┤ │ Section: per_edge_lsn │ kind = 0x0004 │ u64 LE, one per edge in order │ ├─────────────────────────────────────┤ │ Section: per_edge_tombstones (opt) │ kind = 0x0005 │ bitmap, 1 bit per edge │ │ (omitted when HAS_TOMBSTONES = 0)│ ├─────────────────────────────────────┤ │ Section: fence_index (opt) │ kind = 0x0006 │ sparse index over key_ids │ │ (present when key_count > 65 536)│ ├─────────────────────────────────────┤ │ Section: property_stream × N (opt) │ kind = 0x0100 (each) │ one section per declared prop + │ │ `__overflow_json` if any present │ │ Zstd-compressed Arrow IPC chunk │ ├─────────────────────────────────────┤ │ Footer (variable size) │ │ section table + body fields + │ │ 20-byte trailer (xxhash + len + │ │ magic) │ └─────────────────────────────────────┘ offset = file_size ``` Section **order on disk is not normative** — the footer’s section table determines the true byte ranges. The diagram above shows the typical layout the writer produces in v1.0. Each section is **independently addressable**: the footer carries a `Section` table mapping `SectionKind → (offset, length, xxhash3, codec)`. A reader that needs only `(key_ids, offsets, partners)` for a label scan fetches exactly those three ranged GETs and ignores the property streams. ##### 3.2.1 File header (64 bytes, frozen) The 64-byte header is **frozen for the lifetime of `format_major`**. 
Adding any field here requires a major bump. Forward-compatible extension happens only through new footer sections (see §5.2). ```text offset size field value ─────── ──── ──────────────────────────── ──────────────────────────── 0 8 magic b"TGEDGE\0\0" 8 1 format_major u8, current = 1 9 1 format_minor u8, current = 0 10 2 header_size u16 = 64 (sanity check) 12 4 flags u32 bitfield (see below) 16 16 edge_type_id blake3(edge_type)[..16] 32 16 src_label_id blake3(src_label)[..16] 48 16 dst_label_id blake3(dst_label)[..16] ``` Flags: | Bit | Name | Meaning | | ---- | ----------------- | ------------------------------------------------------------------------------------ | | 0 | `HAS_PROPERTIES` | At least one property column present in §6+. | | 1 | `HAS_TOMBSTONES` | At least one tombstone bit is `1` (cheap shortcut). | | 2 | `SKEW_BUCKETS` | Section 3 contains at least one skew-bucket group. | | 3 | `INVERSE_PARTNER` | This file is the inverse-direction CSR; §1 path uses `edges-inv-`. | | 4-31 | reserved | Must be zero in v1.0; reserved bits a v1.x reader does not recognise abort the read. | A reader **must** reject the file if `format_major > 1` or `header_size ≠ 64`. It must treat `format_minor > 0` as forward-compatible per the rules in §5.2. It must reject any non-zero reserved bit it does not understand. ##### 3.2.2 Section 1: key\_ids `key_ids` is the strictly increasing, deduplicated array of key `node_id`s present in this SST. **Semantics depend on `flags.INVERSE_PARTNER`:** * `INVERSE_PARTNER = 0` (forward): keys are `src_id`s; partners in §3.2.4 are `dst_id`s. Reads “out-edges of `s`”. * `INVERSE_PARTNER = 1` (inverse): keys are `dst_id`s; partners are `src_id`s. Reads “in-edges of `d`”. Fixed-size 16-byte UUIDv7 records. The writer asserts: every `key_id` must be `>` the previous one (`Ord` on `[u8; 16]`), no duplicates. This array is the binary-searchable handle into the offsets / partners structure. 
Length is `key_count`; section size is `16 * key_count`.

##### 3.2.3 Section 2: offsets

`offsets[i]` is the byte offset (inside Section 3, relative to the start of Section 3) at which the partner group of `key_ids[i]` begins. `offsets[key_count]` is a sentinel equal to the size of Section 3.

Encoding is bitpacked with a width chosen by the writer at close time based on the maximum offset value:

| Section 3 size    | Bits per offset | Format    |
| ----------------- | --------------- | --------- |
| < 2²⁴ B (16 MiB)  | 24              | 3-byte LE |
| < 2³² B (4 GiB)   | 32              | `u32` LE  |
| < 2⁴⁰ B (1 TiB)   | 40              | 5-byte LE |
| < 2⁴⁸ B (256 TiB) | 48              | 6-byte LE |

A fixed-width layout is preferable to varint here because the read path needs random access (`offsets[i]`) without scanning. The chosen width is recorded in the footer (`offsets_bits` field).

##### 3.2.4 Section 3: partners (neighbours / sources)

Each per-key group is laid out as one of two block kinds. The writer picks the kind per group based on its degree and encoded size (see the selection rule below). Every block opens with a varint `deg` followed by a 1-byte tag. v1 defines two tags; future v1.x readers may support more.

```text
┌──────────────┬─────────┬────────────────────────────────────────────┐
│ deg: varint  │ tag: u8 │ payload                                    │
└──────────────┴─────────┴────────────────────────────────────────────┘
tag = 0x01   → split block (split-top64/bottom64 encoding)
tag = 0x10   → dense block (raw 16-byte partners)
tag = others → reject as Error::Corrupted in v1.0
```

The writer picks the block kind per key group based on the rule below; the picked kind is independent of `flags.SKEW_BUCKETS`, which is set in the header whenever **any** group of the file emitted a dense block (`tag = 0x10`).

###### Selection rule

For a group of degree `d` with partners sorted ascending by the full 128-bit id, the writer computes the encoded byte cost of the split block (deterministically, using the encoding below).
It then emits:

```plaintext
let split_cost = …;        // see "Split block — encoding" below
let dense_cost = 16 * d;   // always
if d > skew_threshold || split_cost >= dense_cost {
    emit dense block (tag = 0x10)
} else {
    emit split block (tag = 0x01)
}
```

`skew_threshold` is bench-driven; the v1 default is `max(1024, 4 * sqrt(key_count))`. The `split_cost >= dense_cost` clause is the “always-correct fallback”: for pathological partner distributions (spanning the full `u64` range with near-uniform deltas) the split encoding can balloon to 18 B per partner, so the dense block bounds the worst case at 16 B per partner regardless.

###### Split block — encoding (`tag = 0x01`)

UUIDv7 splits cleanly into a top 64 bits (48-bit ms timestamp + 4-bit version + 12-bit sub-ms entropy) that is nearly monotonic over time, and a bottom 64 bits that is uniformly random. We exploit the top half for compression and write the bottom half raw.

Payload, partners sorted ascending by the full 128-bit id:

```text
top64[0]:        varint   // absolute top64 of partner[0]
bot64[0]:        u64 LE   // raw bottom64 of partner[0]
top64_delta[j]:  varint   // = top64[j] - top64[j-1]   (j ∈ 1..deg)
bot64[j]:        u64 LE   // raw bottom64 of partner[j]
```

Encoded cost in bytes: `split_cost = len_varint(top64[0]) + 8 + Σ_{j=1..deg-1} (len_varint(top64_delta[j]) + 8)`.

* Typical cost (partners clustered within seconds of each other, so `top64_delta <= 127`): **9 B per partner**.
* Cost when partners span months but were created in the same year: **13–14 B per partner**.
* Absolute worst case (artificial, e.g. `u64::MAX` deltas): **18 B per partner**; the writer detects this via the selection rule above and emits a dense block instead.

`top64_delta[j]` may legally be `0` (two partners created in the same ms with the same 12-bit sub-ms entropy). The bottom-64 ordering breaks the tie; the writer asserts strictly increasing 128-bit partner id, so two partners with both halves equal are a writer bug.
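The cost formula and the selection rule can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's implementation: it assumes the varint is LEB128-style (7 payload bits per byte) and that the top 64 bits are the first eight bytes of the id read big-endian. The names `len_varint`, `top64`, `split_cost`, `choose_dense` and `make_id` are hypothetical.

```rust
/// Bytes needed by an unsigned LEB128-style varint (7 payload bits per byte).
/// Assumption: the RFC does not name its varint flavour; LEB128 is assumed.
fn len_varint(v: u64) -> usize {
    let bits = (64 - v.leading_zeros()) as usize; // 0 when v == 0
    std::cmp::max(1, (bits + 6) / 7)
}

/// Top 64 bits of a 16-byte id, read big-endian (assumed byte order).
fn top64(id: &[u8; 16]) -> u64 {
    let mut hi = [0u8; 8];
    hi.copy_from_slice(&id[..8]);
    u64::from_be_bytes(hi)
}

/// Payload cost in bytes of a split block; `partners` must be sorted ascending.
fn split_cost(partners: &[[u8; 16]]) -> usize {
    let mut cost = 0;
    let mut prev = 0u64;
    for (j, id) in partners.iter().enumerate() {
        let top = top64(id);
        // partner[0] stores the absolute top64, later partners store a delta.
        let stored = if j == 0 { top } else { top - prev };
        cost += len_varint(stored) + 8; // varint + raw bottom64
        prev = top;
    }
    cost
}

/// Selection rule: `true` means "emit a dense block (tag = 0x10)".
fn choose_dense(partners: &[[u8; 16]], skew_threshold: usize) -> bool {
    let d = partners.len();
    d > skew_threshold || split_cost(partners) >= 16 * d
}

/// Hypothetical helper: builds an id from its two 64-bit halves.
fn make_id(top: u64, bot: u64) -> [u8; 16] {
    let mut id = [0u8; 16];
    id[..8].copy_from_slice(&top.to_be_bytes());
    id[8..].copy_from_slice(&bot.to_be_bytes());
    id
}
```

Under these assumptions, partners that share a `top64` cost 9 bytes each after the first, while partners whose deltas need 9 varint bytes cost 17 bytes each and trip the `split_cost >= dense_cost` fallback.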
###### Dense block — raw partners (`tag = 0x10`) ```text ┌──────────────┬─────────┬────────────────────────────────────────────┐ │ deg: varint │ tag: u8 │ partners: [u8; 16 * deg] │ └──────────────┴─────────┴────────────────────────────────────────────┘ ``` Always-correct fallback. Used for super-nodes (`deg > skew_threshold`) and for any group where the split encoding would not be smaller. Future v1.x readers may support `tag = 0x11..` (e.g. Roaring on `hash(partner) mod 2³²`) and a writer that emits them; v1.0 readers reject any tag not in `{0x01, 0x10}` with `Error::Corrupted`. ##### 3.2.5 Section 4: per-edge LSN For every edge, in the same order as the partner enumeration of Section 3, one `u64` LE with the LSN at which that edge was applied. Length is `edge_count * 8`. Used for: * Conflict resolution at read time when the same `(key, partner)` pair shows up in older and newer SSTs (the newer LSN wins). * Compaction merge to filter shadowed edges. ##### 3.2.6 Section 5: per-edge tombstone bitmap `ceil(edge_count / 8)` bytes; bit *j* is the tombstone flag of edge *j* in the partner enumeration order of Section 3. A tombstone edge keeps its position in the partner array; the reader filters it out unless explicitly asked for history (branching / replay). **Section-omission rule.** If no edge in the SST is tombstoned the writer **omits** this section: it sets `flags.HAS_TOMBSTONES = 0` in the header, and the footer’s section table contains no entry of kind `per_edge_tombstones`. A v1.X reader that finds `HAS_TOMBSTONES = 0` must treat every edge as non-tombstoned without looking for the section; a reader that finds `HAS_TOMBSTONES = 1` but no entry in the section table treats the SST as corrupted. 
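The bitmap layout and the section-omission rule can be illustrated with a short Rust sketch. It assumes bit *j* lives in byte `j / 8` at position `j % 8`, least-significant bit first; the text above does not spell out the bit order, so treat that as an assumption. Both function names are hypothetical.

```rust
/// Builds the per-edge tombstone bitmap, or `None` when no edge is tombstoned
/// (the section is then omitted and HAS_TOMBSTONES stays 0).
/// Assumption: LSB-first bit order inside each byte.
fn build_tombstone_bitmap(tombstoned: &[bool]) -> Option<Vec<u8>> {
    if !tombstoned.iter().any(|&t| t) {
        return None;
    }
    // ceil(edge_count / 8) bytes, all bits initially clear.
    let mut bytes = vec![0u8; (tombstoned.len() + 7) / 8];
    for (j, &t) in tombstoned.iter().enumerate() {
        if t {
            bytes[j / 8] |= 1u8 << (j % 8);
        }
    }
    Some(bytes)
}

/// Reader side: a missing section means every edge is live.
fn is_tombstoned(bitmap: Option<&[u8]>, j: usize) -> bool {
    match bitmap {
        None => false,
        Some(bytes) => (bytes[j / 8] >> (j % 8)) & 1 == 1,
    }
}
```

Returning `None` from the writer side mirrors the omission rule: the reader never has to look for a section that `HAS_TOMBSTONES = 0` says is absent.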
**Forward / inverse consistency invariant.** When an edge `(s, d, lsn)` is tombstoned in the writer’s frozen memtable, its corresponding entry in **both** the forward partner (key = `s`, partner = `d`) and the inverse partner (key = `d`, partner = `s`) is tombstoned at the same LSN. The writer enforces this by reading the tombstone bit from a single canonical source (the frozen memtable’s `MemOp::Tombstone`) during the construction of each partner, never from independent computations over the two transpositions. Tests #5 and #6 (§7) lock this invariant down. ##### 3.2.7 Sections 6..N: property streams One section per declared property `q` on this edge type. Each section holds a Zstd-compressed Arrow IPC chunk with a single column whose row *j* corresponds to edge *j* in the partner enumeration order. We choose Arrow IPC (not Parquet) for property streams because: * Each section already lives inside the CSR file’s footer table, so we do not need Parquet’s column metadata. * Arrow IPC’s record batch layout maps 1:1 to a column; zero-copy decode with `arrow-ipc::reader::StreamReader`. * Reusing Arrow primitives means a property column for an edge looks identical to a property column for a node — same `DataType ↔ ArrowDataType` mapping as in `namidb-core::schema`. Schema-undeclared properties on edges land in a single `__overflow_json` property stream with `name = "__overflow_json"`. Unlike node SSTs (where the `__overflow_json` *column* is always present in the Parquet schema, possibly all-null), the edge SST `__overflow_json` **section** is **only emitted when at least one edge has overflow data**. When no overflow is present, the writer omits the section and `HAS_PROPERTIES` reflects only the declared properties (or is `0` if there are none). A reader that needs overflow data and finds no `__overflow_json` section reads every edge’s overflow as `null`. ##### 3.2.8 Footer The footer is the last bytes of the file. 
It has a fixed-length **trailer** (always 20 bytes at the very end) and a variable-length **body** that precedes the trailer. ```text ┌──────────────────────────────────────────────────┐ ← footer body start │ Section table: section_count × SectionEntry │ │ SectionEntry { │ │ kind: u16, // discriminator │ │ offset: u64, // from file byte 0 │ │ length: u64, // bytes │ │ codec: u8, // 0=none, 1=zstd │ │ reserved: u8, │ │ xxhash3_64: u64, // over the on-disk │ │ // bytes of the │ │ // section as stored │ │ name_len: u8, │ │ name: [u8; name_len], // utf8 │ │ } │ ├──────────────────────────────────────────────────┤ │ section_count: u32 │ │ key_count: u64 │ │ edge_count: u64 │ │ offsets_bits: u8 // 24 / 32 / 40 / 48 │ │ min_key_id: [u8; 16] │ │ max_key_id: [u8; 16] │ │ min_lsn: u64 │ │ max_lsn: u64 │ │ schema_version_min: u64 │ │ schema_version_max: u64 │ ├──────────────────────────────────────────────────┤ ← trailer start │ footer_xxhash3_64: u64 (covers footer body) │ │ footer_len: u32 (body + trailer length) │ │ magic: 8 bytes b"TGEDGE\xFE\xEF" │ └──────────────────────────────────────────────────┘ ← end of file ``` Precise definitions: * **`footer_xxhash3_64`** is computed over the **footer body** only: from the first byte of the section table up to and including `schema_version_max` (i.e. all bytes between the body-start marker and the trailer-start marker above). It does *not* cover any byte of the trailer itself. * **`footer_len`** is the total byte length of footer body + trailer (i.e. the offset from the trailer’s last byte to the body’s first byte, inclusive). Equivalently: `footer_len = file_size - body_start`. A reader uses this to find the body start once it has the trailer. Section `kind` discriminators (u16): | Value | Kind | Notes | | ------ | --------------------- | ----------------------------------------------------------------------------------------------------------------------------- | | 0x0001 | key\_ids | Mandatory. 
| | 0x0002 | offsets | Mandatory. | | 0x0003 | partners | Mandatory. | | 0x0004 | per\_edge\_lsn | Mandatory. | | 0x0005 | per\_edge\_tombstones | Optional (see §3.2.6). | | 0x0006 | fence\_index | Optional; required when `key_count > 65 536`. See §3.2.9. | | 0x0100 | property\_stream | Optional; **one entry per property**, distinguished by `name`. Reserved names: `__overflow_json` for schema-undeclared props. | | Others | reserved | A v1.0 reader skips unknown kinds outside the reserved ranges (forward-compat per §5.2). | All property streams share the same `kind = 0x0100`; the `name` field discriminates them. The writer rejects any property declaration whose `name` collides with a reserved column name (see §2.1) at SST creation time — `__overflow_json` is the only legal entry beginning with `__`. A reader locates the footer by: 1. Ranged GET for the last 4 KiB of the object (covers any footer up to \~4 KiB; for SSTs with few sections this is enough). 2. Read the trailing 8-byte magic at the end of the response. If absent, expand to the last 64 KiB and retry. If still absent the file is corrupt. 3. From the trailer read `footer_len`. If the prefetched window is too small, issue a second ranged GET for `[file_size - footer_len, file_size)`. 4. Verify `footer_xxhash3_64` against the body bytes. Mismatch → `Error::Corrupted`. 5. Validate that every `SectionEntry`’s `[offset, offset + length)` range lies strictly within `[64, file_size - footer_len)`. Any overflow → `Error::Corrupted`. The section table is sorted ascending by `offset`. A reader can linear-scan by `kind` when looking up a specific section. ##### 3.2.9 Fence-pointer index (optional) The naive lookup of “find `s` in `key_ids`” requires either fetching the entire `key_ids` section (16 B × `key_count`) or doing a remote binary search (≈`log2(key_count)` ranged GETs over 16-byte windows). For `key_count = 1 M` the first option costs a 16 MiB cold GET; the second costs \~20 round-trips. 
Neither is acceptable for the §14.1 cold-query budget when SSTs grow past a few hundred thousand keys.

The fence-pointer index solves this with a sparse local index over `key_ids`. The writer emits one fence entry **every `fence_stride` keys** (default `fence_stride = 256`). Each entry stores the key value and the byte offset of that key within the `key_ids` section.

```text
┌──────────────────────────────────────────────────┐
│ fence_stride: u32 (e.g. 256)                     │
│ entry_count: u32 (= ceil(key_count / stride))    │
│ entries: [ FenceEntry ; entry_count ]            │
│ FenceEntry {                                     │
│   key: [u8; 16], // = key_ids[i * fence_stride]  │
│   key_ids_offset: u64, // = i * fence_stride * 16│
│                        // (relative to byte 0    │
│                        // of section key_ids)    │
│ }                                                │
└──────────────────────────────────────────────────┘
```

Total size: `4 + 4 + entry_count * 24` bytes. For 1 M keys with stride 256 → `ceil(1 000 000 / 256)` = 3 907 entries → 93 776 bytes ≈ 92 KiB, cacheable by foyer on first probe.

**Writer rule.** A fence index is **emitted** when `key_count > 65 536`. Below this threshold the entire `key_ids` section is small enough (≤ 1 MiB) to fetch and binary-search in memory cheaply.

**Reader algorithm for “find offset of key `k` inside `key_ids`”:**

```text
if footer has no fence_index section:
    fetch the full key_ids section (≤ 1 MiB by construction)
    binary search in memory
else:
    fetch the fence_index section once (cached)
    binary search the fence entries to find the bracket
        [fence[i].key, fence[i+1].key) containing k
    issue one ranged GET for
        key_ids[fence[i].key_ids_offset .. fence[i+1].key_ids_offset]
    binary search that window in memory
```

Total cold cost: **2 GETs** (fence + key\_ids window) regardless of `key_count`. Warm cost: 1 GET (the window; fence is cached).

The fence index is a v1.0 optional artefact: an older reader that ignores the section still works correctly via the naive path.
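The reader algorithm above can be exercised entirely in memory. The sketch below is illustrative only: slicing a `key_ids` array stands in for the single ranged GET, the last bracket is assumed to run to the end of the `key_ids` section, and `build_fence`, `window_for`, `find_key` and `key` are hypothetical names rather than the crate's API.

```rust
/// One fence entry, mirroring the FenceEntry layout described above.
struct FenceEntry {
    key: [u8; 16],
    key_ids_offset: u64, // byte offset inside the key_ids section
}

/// Writer side: one entry every `stride` keys (`stride` must be > 0).
fn build_fence(key_ids: &[[u8; 16]], stride: usize) -> Vec<FenceEntry> {
    key_ids
        .iter()
        .enumerate()
        .step_by(stride)
        .map(|(i, k)| FenceEntry { key: *k, key_ids_offset: (i * 16) as u64 })
        .collect()
}

/// Byte range [start, end) of the key_ids window that may contain `k`,
/// or `None` when `k` sorts before the first key.
fn window_for(fence: &[FenceEntry], k: &[u8; 16], key_ids_bytes: u64) -> Option<(u64, u64)> {
    // Number of fence entries whose key is <= k.
    let after = fence.partition_point(|e| e.key <= *k);
    if after == 0 {
        return None;
    }
    let start = fence[after - 1].key_ids_offset;
    // Assumption: the last bracket ends at the end of the key_ids section.
    let end = fence.get(after).map_or(key_ids_bytes, |e| e.key_ids_offset);
    Some((start, end))
}

/// Full lookup: returns the index of `k` in `key_ids`, if present.
fn find_key(key_ids: &[[u8; 16]], fence: &[FenceEntry], k: &[u8; 16]) -> Option<usize> {
    let (start, end) = window_for(fence, k, (key_ids.len() * 16) as u64)?;
    let (lo, hi) = ((start / 16) as usize, (end / 16) as usize);
    // In a real reader this slice would come from one ranged GET.
    key_ids[lo..hi].binary_search(k).ok().map(|p| lo + p)
}

/// Hypothetical helper: a sortable 16-byte key built from an integer.
fn key(n: u64) -> [u8; 16] {
    let mut id = [0u8; 16];
    id[..8].copy_from_slice(&n.to_be_bytes());
    id
}
```

With the default stride of 256 the window never exceeds 256 keys (4 KiB), which is why the second GET stays small regardless of `key_count`.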
#### 3.3 Statistics extraction

When the writer closes either partner of an edge SST it emits:

```rust
pub enum EdgeDirection {
    /// Keys are src_id; partners are dst_id. File path uses `edges-fwd-`.
    Forward,
    /// Keys are dst_id; partners are src_id. File path uses `edges-inv-`.
    Inverse,
}

pub struct EdgeSstStats {
    pub direction: EdgeDirection,
    pub key_count: u64,
    pub edge_count: u64,
    pub tombstone_count: u64,
    pub min_key_id: [u8; 16],
    pub max_key_id: [u8; 16],
    pub min_lsn: u64,
    pub max_lsn: u64,
    pub degree_histogram: DegreeHistogram,
    pub property_stats: Vec<PropertyColumnStats>,
    pub schema_version_min: u64,
    pub schema_version_max: u64,
}

pub struct DegreeHistogram {
    /// 64 log2-spaced buckets:
    /// counts[i] = #keys with deg in [2^i, 2^(i+1))
    pub counts: [u32; 64],
    pub max_degree: u64,
    pub sum_degree: u64,
}
```

For a **forward** partner, `degree_histogram` describes out-degree. For an **inverse** partner, it describes in-degree. The cost-based optimizer reads the histogram of the partner it is about to traverse.

The bloom filter is **not** part of this struct (side-car; see §4.2).

#### 3.4 Read access patterns and ranged GETs

This section quantifies the GET count for the common access patterns of v1, to make the columnar layout’s cost explicit.

Notation:

* `D` = direct-cached descriptor reads (`current.json` + manifest body), amortised across queries.
* `B` = bloom side-car GET (1 GET per SST when min/max does not already exclude). Cached by foyer; second visit free.
* `F` = SST footer GET (last 4 KiB; cached per SST).
* `Khdr` = SST header GET (first 64 B + the section table prefix; can be coalesced with `F` in one ranged GET for SSTs ≤ \~16 MiB).

**Pattern A: point lookup `node_id = v`.**

```plaintext
D + (per candidate SST) [B + F + ranged GET into the matching page]
```

Cold per SST: \~3 GETs. With foyer warm, `B + F` are free; only the page GET remains (\~1 GET).
**Pattern B — out-edge expansion of a known src `s` (forward SST).** The reader resolves `s → index_in_key_ids → offset_in_partners → range of partners` using the fence index (§3.2.9) when present, or the full `key_ids` section otherwise. For SSTs **without** a fence index (`key_count ≤ 65 536`, so `key_ids ≤ 1 MiB`): ```plaintext D + B + F + GET key_ids (≤ 1 MiB) + GET offsets[i..i+1] + GET partners[off..off+len] ``` Cold per SST: **5 GETs**; the `key_ids` and `offsets` ranges coalesce into a single ranged GET when `key_count * 16 + offsets_bytes ≤ 1 MiB` (true for L0 SSTs after a single flush). Warm: `B + F + key_ids + offsets` are foyer-cached; only the `partners` GET remains (**1 GET warm**). For SSTs **with** a fence index (`key_count > 65 536`): ```plaintext D + B + F + GET fence_index (~100 KiB) + GET key_ids window (≤ fence_stride * 16 ≈ 4 KiB by default) + GET offsets[i..i+1] + GET partners[off..off+len] ``` Cold per SST: **6 GETs**, all independent and parallelisable. Warm: 1 GET (partners). **Pattern C — in-edge expansion of a known dst `d` (inverse SST).** Identical to Pattern B, just with the inverse partner SST. **Pattern D — edge expansion with property predicate (e.g. `where edge.since > date`).** Pattern B + 1 additional GET on the property stream’s range corresponding to the partners we touched. Cold per SST: \~6 GETs; property stream GET coalesces with `partners` when both ranges lie in the same MiB window. The “concurrent ranged GETs” feature of `object_store::aws` lets us fire patterns B/D’s GETs in parallel; cold p50 wall time is bounded by the slowest GET, not their sum. With `S3 Express One Zone` the per-GET floor drops from \~30 ms to \~5 ms — directly on the §14.1 budget. ### 4. Embedded statistics + bloom side-car in the manifest #### 4.1 Extended `SstDescriptor` This RFC promotes `SstDescriptor` from the minimal version in RFC-001 to the form below. 
Everything in this struct is JSON-cheap (a few hundred bytes per SST excluding `property_stats`, which scales with column count). For 100 K SSTs the manifest stays under \~10 MiB, the budget at which we switch JSON → Arrow IPC (recorded as an Open Question in RFC-001).

```rust
pub struct SstDescriptor {
    // ── identity ──
    pub id: Uuid,
    pub kind: SstKind,   // Nodes | EdgesFwd | EdgesInv
    pub scope: String,   // label or edge_type
    pub level: SstLevel,
    pub path: String,    // relative to namespace

    // ── physical ──
    pub size_bytes: u64,
    pub row_count: u64,  // node rows or edge rows
    pub created_at: DateTime,

    // ── key range (raw bytes; serialised as base64 in JSON) ──
    pub min_key: [u8; 16], // node_id (Nodes) or key_id (Edges)
    pub max_key: [u8; 16],
    pub min_lsn: u64,
    pub max_lsn: u64,
    pub schema_version_min: u64,
    pub schema_version_max: u64,

    // ── stats embedded ──
    pub property_stats: Vec<PropertyColumnStats>,
    pub kind_specific: KindSpecificStats,

    // ── bloom side-car pointer (None when the SST is small enough
    //    that scanning is cheaper than probing; see §4.2) ──
    pub bloom: Option<BloomDescriptor>,
}

pub enum SstKind {
    Nodes,
    EdgesFwd,
    EdgesInv,
    // Vectors lands in RFC-007; reserved here so reader code can match
    // exhaustively against the v1 set.
}

pub enum KindSpecificStats {
    Nodes { tombstone_count: u64 },
    Edges {
        // key_count == row_count for nodes; for edges key_count is
        // distinct src/dst count (depending on direction).
        key_count: u64,
        tombstone_count: u64,
        degree_histogram: DegreeHistogram,
    },
}
```

JSON serialisation: `min_key` / `max_key` are 16-byte arrays serialised as **base64** (`base64::STANDARD`). All other fields use their natural JSON encoding.

#### 4.2 BloomDescriptor (side-car pointer)

The bloom filter for an SST lives in its own object next to the SST body.
The manifest only carries a pointer to it plus the parameters needed to probe it without re-reading the body: ```rust pub struct BloomDescriptor { pub path: String, // object_store path of the side-car pub size_bytes: u32, // total side-car file size (header + blocks + trailer) pub key_count: u64, // number of keys inserted into the filter pub bits_per_key: u8, // default 10 → ~1 % FPR pub block_count: u32, // 256-bit (32-byte) blocks pub xxhash3_64: u64, // checksum over the side-car body (per §4.2 wire spec) } ``` We use the **split-block bloom filter (SBBF)** construction Parquet adopted — a single 64-bit hash per key drives a deterministic 8-bit mask inside one 256-bit block. There is no separate `k_hashes` parameter (the “k” is fixed at 8 by construction). The hash function is **xxHash3-64** (same library, same seed = 0 as elsewhere in this RFC). Block selection from a hash `h`: ```plaintext let block_index = ((h >> 32) * block_count as u64) >> 32; ``` The 8-bit mask inside the chosen block is the standard SBBF mask (see Putze et al., 2010; identical to Parquet’s `bloom_filter_algorithm = SPLIT_BLOCK`). Implementations crib the constants from `parquet-rs 55::bloom_filter`. The total side-car size is exactly `28 (header) + 32 * block_count + 8 (trailer xxhash)`. For 10 bits / key the writer rounds up `block_count = ceil(key_count * bits_per_key / 256)`; e.g. 1 M keys ⇒ `block_count = 39 063` ⇒ side-car = 1 250 052 bytes ≈ 1.19 MiB. 
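The sizing arithmetic and the block-selection formula above can be checked with a small Rust sketch. The constants restate the numbers in the text (28-byte header, 32-byte blocks, 8-byte trailer); the function names are hypothetical and the SBBF mask itself is deliberately left out.

```rust
const HEADER_BYTES: u64 = 28;
const BLOCK_BYTES: u64 = 32; // one 256-bit SBBF block
const TRAILER_BYTES: u64 = 8; // trailing xxhash3_64

/// block_count = ceil(key_count * bits_per_key / 256).
fn block_count(key_count: u64, bits_per_key: u8) -> u32 {
    let bits = key_count * bits_per_key as u64;
    ((bits + 255) / 256) as u32
}

/// Total side-car size in bytes: header + blocks + trailer.
fn sidecar_size(block_count: u32) -> u64 {
    HEADER_BYTES + BLOCK_BYTES * block_count as u64 + TRAILER_BYTES
}

/// Scales the top 32 bits of the hash `h` into [0, block_count)
/// with a multiply-shift instead of a modulo.
fn block_index(h: u64, block_count: u32) -> u32 {
    (((h >> 32) * block_count as u64) >> 32) as u32
}
```

The multiply-shift maps the largest possible hash to `block_count - 1`, so the index never runs past the last block.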
##### Side-car wire format

```text
┌──────────────────────────────────────┐ offset 0
│ magic: 8 bytes b"TGBLOOM\0"          │
│ format_major: u8 = 1                 │
│ format_minor: u8 = 0                 │
│ reserved: u16 = 0                    │
│ bits_per_key: u8                     │
│ reserved2: u8 = 0                    │ // was k_hashes pre-rev3; kept
│                                      │ // for alignment, value MUST be 0
│ reserved3: u16 = 0                   │
│ block_count: u32                     │
│ key_count: u64                       │
├──────────────────────────────────────┤
│ blocks: [SbbfBlock; block_count]     │
│   SbbfBlock = [u8; 32]               │
├──────────────────────────────────────┤
│ xxhash3_64 over the entire file      │
│   minus the trailing 8 bytes:        │
│   trailing: u64 LE                   │
└──────────────────────────────────────┘
```

Split-block bloom filters (SBBF) at 10 bits/key give \~1 % FPR, the same parameters Parquet uses internally. For 1 M keys the side-car is ≈1.25 MB (≈1.19 MiB); for a typical SST of 100 K–200 K keys it is ≈125–250 KB.

**A reader probes the bloom by**:

1. (Optional) `min_key`/`max_key` overlap test, which is manifest-only and needs no GET. If no overlap, skip the SST.
2. Resolve `bloom.path` to an absolute object\_store path.
3. Issue one ranged GET for the side-car body; foyer caches it after the first probe per process.
4. Verify `xxhash3_64`. Hash the key once with xxHash3-64, select its `SbbfBlock`, and test the 8 mask bits inside that block (there is no separate `k`). If any mask bit is clear the key is absent: skip the SST.

The bloom over `node_id` (for node SSTs) and over `key_id` (for edge SSTs of either direction) is therefore the gate between “manifest says maybe” and “let’s pay for the SST body GET”.

For very small SSTs (`size_bytes < 256 KiB`), the writer **omits the bloom side-car** entirely: `SstDescriptor.bloom` is set to `None` and no `.bloom` object exists on object storage. A 200-key SST is faster to scan than to probe. Readers seeing `bloom = None` skip the bloom step (and skip the corresponding ranged GET) but still respect the manifest’s `min_key`/`max_key` overlap test.
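The side-car wire format above is small enough that its structural checks fit in one function. The sketch below is a hedged illustration: it assumes all multi-byte header fields are little-endian (the text only states this for the trailing checksum), treats any `format_major` other than 1 as unsupported, and skips the xxHash3 verification, which would need an external crate. `BloomHeader`, `SidecarError` and `parse_sidecar` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
struct BloomHeader {
    format_minor: u8,
    bits_per_key: u8,
    block_count: u32,
    key_count: u64,
}

#[derive(Debug, PartialEq)]
enum SidecarError {
    TooShort,
    BadMagic,
    UnsupportedMajor(u8),
    ReservedNotZero,
    SizeMismatch,
}

const MAGIC: &[u8; 8] = b"TGBLOOM\0";
const HEADER_LEN: usize = 28;
const TRAILER_LEN: usize = 8;

fn le_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(b);
    u32::from_le_bytes(a)
}

fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(b);
    u64::from_le_bytes(a)
}

/// Structural validation only; the checksum is not verified here.
fn parse_sidecar(buf: &[u8]) -> Result<BloomHeader, SidecarError> {
    if buf.len() < HEADER_LEN + TRAILER_LEN {
        return Err(SidecarError::TooShort);
    }
    if buf[0..8] != MAGIC[..] {
        return Err(SidecarError::BadMagic);
    }
    if buf[8] != 1 {
        return Err(SidecarError::UnsupportedMajor(buf[8]));
    }
    // Byte 13 is reserved2 (formerly k_hashes) and MUST be 0.
    if buf[13] != 0 {
        return Err(SidecarError::ReservedNotZero);
    }
    let block_count = le_u32(&buf[16..20]); // assumed little-endian
    let key_count = le_u64(&buf[20..28]); // assumed little-endian
    let expected = HEADER_LEN + 32 * block_count as usize + TRAILER_LEN;
    if buf.len() != expected {
        return Err(SidecarError::SizeMismatch);
    }
    Ok(BloomHeader {
        format_minor: buf[9],
        bits_per_key: buf[12],
        block_count,
        key_count,
    })
}
```

Checking the total length against `block_count` before touching any block catches truncated or padded objects without reading the filter body.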
#### 4.3 PropertyColumnStats ```rust pub struct PropertyColumnStats { pub name: String, pub null_count: u64, pub min: Option, pub max: Option, pub ndv_estimate: Option, // 1 KiB HLL++; None for vectors/json } pub enum StatScalar { Bool(bool), Int32(i32), Int64(i64), Float32(f32), // NaN / Inf are stat-disqualifying; field is None Float64(f64), // idem Utf8(String), Binary(Bytes), Date32(i32), TimestampMicrosUtc(i64), } ``` Vector columns (`FloatVector { dim }`) and `Json` columns produce no `min`/`max` (they remain `None`) but still contribute a `null_count`. The `__overflow_json` column always produces `min`/`max = None`, `null_count` only — its `ndv_estimate` is also `None` (HLL over JSON documents has no operational use here). ### 5. Wire compatibility #### 5.1 Node SSTs Parquet itself carries its own version + magic (`PAR1`) and is forward-compatible across `parquet-rs` minor versions. We pin `parquet = "55"` workspace-wide. Reading SSTs written by future NamiDB builds works as long as we do not introduce new logical column conventions; if we do, we will bump a `node_sst_format` field in `SstDescriptor.kind_specific` so old readers can refuse. `__overflow_json` is required by **all** v1 node SSTs (even when every row is null). A reader that loads an SST missing this column refuses with `Error::Corrupted { detail: "node SST missing __overflow_json" }`. #### 5.2 Edge SSTs Edge SSTs are **owned** by NamiDB. The compatibility contract is: | Condition observed by reader v1.X (X ≥ 0) | Action | | --------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | | `format_major > 1` | Refuse: `Error::Corrupted`. | | `format_major < 1` | Refuse: `Error::Corrupted` (no v0 exists). | | `format_major = 1`, `header_size ≠ 64` | Refuse: `Error::Corrupted`. | | `format_major = 1`, `format_minor ≤ X` | Read normally. 
| | `format_major = 1`, `format_minor > X` | Read normally; skip any footer section whose `kind` is not in this reader’s v1.X table; refuse if any section table entry crosses the file end. | | Unknown reserved bit in `flags` | Refuse: an unknown flag implies an unknown invariant. | | Unknown `partners` block tag | Refuse: `Error::Corrupted`. | | Unknown footer section `kind` outside the reserved ranges | Skip the section (forward-compat). | A writer **must not** introduce a breaking change to the 64-byte header or to any existing section’s internal layout without bumping `format_major`. Adding a new footer section kind is a `format_minor` bump only. Removing a footer section kind is a major bump. #### 5.3 Side-cars The bloom side-car follows the same major / minor convention. v1.0 readers refuse any bloom side-car with `format_major > 1`. ### 6. Implementation plan (Rust crate layout) Inside `crates/namidb-storage/src/sst/`: ```text sst/ ├── mod.rs # re-exports + common types (SstId, BloomDescriptor, …) ├── stats.rs # PropertyColumnStats, DegreeHistogram, HLL sketch ├── bloom.rs # SBBF build + probe; side-car wire format ├── nodes.rs # NodeSstWriter, NodeSstReader (Parquet) └── edges/ ├── mod.rs # public API: EdgeSstWriter, EdgeSstReader, EdgeDirection ├── header.rs # 64-byte header struct + serde ├── footer.rs # section table + xxhash3 + magic (per §3.2.8) ├── writer.rs # streaming writer (forward + inverse in one pass) ├── reader.rs # ranged-GET reader; section cache; bloom integration ├── encoding.rs # bitpacked offsets, split-top64/bottom64 neighbours, │ # selection rule (split vs dense) ├── fence_index.rs # writer + reader for the optional fence-pointer index └── inverse.rs # in-memory transpose of a FrozenMemtable edge bucket ``` New types lift into `namidb-storage::lib.rs` as part of the public crate API for downstream crates (query engine). The `manifest` module is updated in lockstep to carry the extended `SstDescriptor`. 
Two new workspace dependencies are required: * `xxhash-rust` — feature `xxh3`, used for all SST + bloom checksums. * `base64` — used by the `min_key` / `max_key` JSON encoding in the manifest. ### 7. Test plan The following tests land alongside this RFC’s implementation. Test budget: bring the workspace from 36 → ≥ 70 passing tests. 1. **Round-trip property nodes.** Build a memtable of `Person` rows, freeze, write Parquet SST to `object_store::memory::InMemory`, read back, assert byte-for-byte equality of property values. 2. **Overflow round-trip.** Write a node with one declared and one undeclared property; assert the undeclared one round-trips through `__overflow_json` losslessly. 3. **Tombstone semantics.** Insert + delete + insert at increasing LSNs; read back; assert the latest LSN wins and that deleted-then-reinserted nodes are present. 4. **Edge CSR round-trip (forward).** Build a graph with 100 K edges across 10 K sources, write forward CSR, read back, assert neighbour lists equal. 5. **Edge CSR inverse partner.** Same graph; assert the inverse SST, when probed by `dst`, returns each src that originally pointed to that dst, in sorted order. 6. **Inverse partner == transposed forward.** Build a graph; write both partners; for every edge, assert it is present in both. 7. **Edge skew bucket.** Construct one super-node with degree 5 000, the rest degree ≤ 4; assert the writer emitted `tag = 0x10` for that group and `tag = 0x01` otherwise; reader returns the full list. 8. **Split-encoded compression win.** Generate 1 000 partners with all their top64 equal (same ms); assert the encoded size is `< 9 * 1000 + small_overhead` (vs 16 KiB raw). 8a. **Split-to-dense fallback.** Construct a 100-partner group whose partners are spaced so that every `top64_delta` would require ≥ 9 varint bytes; assert the writer emitted `tag = 0x10` (dense) for that group and that the bytes-on-disk for the group are exactly `1 (varint deg) + 1 (tag) + 100 * 16`. 8b. 
**Reserved column name rejected.** Build a `SchemaBuilder` with a `PropertyDef { name: "tombstone", … }`; assert `Error::SchemaConflict`.
8c. **Fence-pointer index round-trip.** Build an edge SST with `key_count = 200 000` (above the fence threshold); assert that the footer contains a `fence_index` section, that the reader cold-path issues exactly 2 GETs for a `src` lookup (fence + window), and that the result matches a naive linear scan.
8d. **Fence-pointer index absent below threshold.** Build an edge SST with `key_count = 1 000`; assert no `fence_index` section in the footer and that the reader takes the “fetch full key\_ids” branch.
8e. **Tombstone consistency fwd ↔ inv.** Flush a memtable with one tombstoned edge `(s, d, lsn)`; assert that the forward partner has `tombstone_bit[j] = 1` at the position corresponding to `(s → d)` and the inverse partner has `tombstone_bit[k] = 1` at the position corresponding to `(d → s)`, with both LSNs equal.
9. **Random-access edge lookup.** Open the reader, query `src=X`, assert only the expected ranged GETs hit the store (use `object_store::memory::InMemory` plus a counting wrapper). Validate pattern B’s GET count.
10. **Stats correctness.** After writing an SST, the returned stats match those independently computed from the source data (`row_count`, `min/max`, `tombstone_count`, `degree_histogram`).
11. **Bloom correctness.** The bloom reports every inserted key as present (no false negatives), and the FPR measured on a held-out set is ≤ 2 × theoretical.
12. **Bloom side-car wire.** Write a bloom side-car, corrupt one byte, assert `Error::Corrupted` at probe time.
13. **Small SST omits bloom.** Write an SST with `size_bytes < 256 KiB`; assert `bloom == None` in the descriptor (per §4.2), that no `.bloom` object exists, and that the reader uses the in-body scan path.
14. **Footer corruption.** Truncate the last 16 bytes of an edge SST, assert `Error::Corrupted`.
15. **xxHash3 mismatch detected.** Flip one byte inside a section’s body, assert the section’s checksum verification fails when read.
16.
**Forward-compat skip.** Write an SST with a synthetic `section_kind = 0x0FFF` of payload `"ignored"`; the v1 reader must ignore it and still return correct data. 17. **Major mismatch refused.** Manually flip `format_major = 2` in the header, assert the reader returns `Error::Corrupted`. 18. **Header size mismatch refused.** Flip `header_size = 80`, assert the reader refuses. 19. **Unknown reserved flag refused.** Set flag bit 5, assert `Error::Corrupted`. 20. **LocalStack integration.** A single end-to-end test (`#[ignore]`) that writes both a node SST and a forward+inverse edge pair through `object_store::aws` against LocalStack and reads them back, including pattern B GET-count assertions against the LocalStack request log. ## Alternatives considered ### A. Parquet for edges too Use Parquet `list>` keyed by `src_id`. Rejected: * A list-shaped column needs Parquet repetition levels, which add \~2 bytes per edge of metadata and a definition-level mask that has to be walked on read. * Random access to “neighbours of src = X” still costs O(row\_group), not O(1), because Parquet has no random access into a list cell. * We lose the ability to encode the skew optimisation cleanly (would require a sibling sparse column). ### B. Lance v2 for edges Lance v2 is excellent for vectors and blobs but is not optimised for the adjacency-list shape. Its strengths (zero-copy random access into blob columns; Vamana / IVF integration) do not map onto CSR. We will use Lance for the **vector** SST kind in RFC-007. ### C. SlateDB’s SST verbatim SlateDB is a KV store. Its SST format is well-tuned for `(key, value)` pairs but does not carry the columnar invariants we need for property columns or for CSR offsets. Reusing it would force every read to do a key-decode pass that we get for free in Parquet. ### D. Iceberg manifest + Parquet data files We considered structuring SSTs as an Iceberg table. 
Rejected for v1 because: * Iceberg’s manifest layout is heavier than ours and adds a level of indirection irrelevant to a single-writer LSM. * Iceberg’s snapshot semantics overlap with ours but with different retention semantics; we would not be able to express “branch with fork retention” without subverting Iceberg’s vacuum. * We will revisit in **RFC-014 Iceberg integration** as an *export* surface — write an Iceberg view *of* the SSTs, not store them as one. ### E. Embedded bloom (revision-1 plan) Originally we planned to inline the bloom inside the manifest `SstDescriptor`. Rejected during revision 2: a 1.25 MiB raw bloom becomes 1.65 MiB base64, and a 100 K-SST namespace would produce a \~165 GB manifest. Side-car keeps the manifest under the JSON budget while still allowing the bloom to be fetched lazily and cached by foyer. ### F. Single-direction edge SSTs Skipping the inverse partner halves write amplification at flush time. Rejected for v1: a `MATCH (n)-[:KNOWS]->(:Person {name: 'Bob'})` query against L0 SSTs degenerates to `O(|E|)` neighbour scans, which destroys the §14.1 cold-query budget. Single-direction may be reintroduced as a per-edge-type override (e.g. for write-heavy log-shaped edges) once we have bench data. ### G. Per-section CRC32 (revision-1 plan) Originally CRC32 IEEE, matching the WAL. Rejected during revision 2 in favour of xxHash3-64 for SSTs only. Rationale: S3 already provides strong integrity (HTTP MD5 / CRC32C) end-to-end, so SST checksums exist to defend against client-side / memory-side corruption. xxHash3 is \~3-5 × faster than CRC32 IEEE at the same defence quality for the fail-modes that matter at this layer. WAL keeps CRC32 because its fail-modes include torn 4 KiB writes where CRC’s burst-error guarantees are useful. ## Drawbacks 1. **Bloom side-car costs a GET.** A query that does not have min/max pruning available pays one extra ranged GET per candidate SST on the cold path. 
Mitigation: foyer caches every bloom side-car after first touch (typical size 125 KiB–1.5 MiB), and the “small-SST omit-bloom” rule means the cost only applies to SSTs large enough to benefit anyway. Bench-targeted.
2. **Inverse partner doubles write amp on the edges path.** Mitigation: per-edge-type override is a v1.1 follow-up. The expectation is that most graph workloads are write-once-read-many, so the asymmetry between flush and query cost is acceptable.
3. **Custom edge format is more code to maintain.** We are taking on a wire format that we now own forever. Mitigations: a small `format_major / minor` invariant, exhaustive round-trip tests, and a `namidb-storage` CLI subcommand (`inspect-sst`) to dump the header / footer for ops debugging (lands with the writer).
4. **Parquet’s per-row-group footer overhead** dominates for very small node SSTs (< 10 K rows). Mitigation: the writer aggregates short flushes to ≥ 128 K rows when possible; see flush path RFC-003 for the policy.
5. **Skew block ships only `tag = 0x10` dense.** Roaring integration (`tag = 0x11`) lands once bench data justifies it; until then a super-node with 1 M out-edges uses 16 MiB of dense storage per SST. Acceptable for prototype.
6. **`f32` / `f64` stats skip NaN / Inf.** This matches Parquet’s strict stats but means we silently drop min / max when a column contains them. Predicate pushdown gracefully falls back to per-row evaluation. Tracked.
7. **Manifest growth is bounded but not constant.** With 100 K SSTs the manifest is still \~10 MiB (dominated by `property_stats`, `degree_histogram`, key ranges). The JSON → Arrow IPC switch lands when bench data warrants it; until then we set a hard 10 MiB cap and the writer fails the commit if a new manifest would exceed it (with a clear error pointing at the migration RFC).

## Open questions

1.
**Bloom probe in pattern A vs page-index probe.** For point lookups on node SSTs the Parquet page index already gives row-group pruning at min/max granularity. The bloom helps when min/max intervals overlap. Bench may show that the bloom is unnecessary for node SSTs and only edge SSTs need it. Leaving the bloom on for both kinds in v1 keeps the read path uniform.
2. **HLL sketch byte budget.** 1 KiB per column per SST is the current default. Could be lowered to 256 bytes (less accuracy) if manifest growth becomes a bottleneck before the IPC migration. Bench-driven.
3. **`tag = 0x11` (Roaring) timing.** Promote when a workload has a super-node with degree ≥ 1 M *and* benches show ≥ 2 × savings.
4. **Edge property layout: per-edge vs per-key chunks.** Today property stream row *j* maps to edge *j* in partner enumeration order. An alternative is to chunk by key group so that all properties of a single src’s out-edges are contiguous. The current choice maximises columnar scan efficiency for `WHERE edge.prop ...` predicates; the alternative would maximise per-key locality. Defer until query engine benches.
5. **JSON → Arrow IPC manifest threshold.** Currently set at 10 MiB. This RFC’s structures keep manifests well under that for 100 K SSTs; the threshold will be re-evaluated when a namespace approaches it. Tracked as RFC-003 follow-up.
6. **Per-edge-type inverse opt-out.** Defer to bench data; a “log-shaped” edge type (e.g. immutable events) might never need in-edge expansion, and could opt out of inverse partner generation at schema declaration time.

## References

* Parquet specification.
* Lemire, Boytsov, **Decoding billions of integers per second through vectorisation** (Software: Practice & Experience, 2015). Varint / bitpacking implementation reference.
* Chambi et al., **Better bitmap performance with Roaring bitmaps** (SP\&E, 2016). Section 3.2.4 skew layout.
* Putze, Sanders, Singler, **Cache-, Hash- and Space-Efficient Bloom Filters** (J. Exp.
Algorithmics, 2010). SBBF foundations.
* Heule, Nunkesser, Hall, **HyperLogLog in practice** (EDBT 2013).
* Y. Collet, **xxHash3** specification.
* Jin et al., **Kùzu** (CIDR 2023). Property graph + CSR + factorised representation in a single binary; reference architecture.
* Hu et al., **EmptyHeaded** (SIGMOD 2017). WCOJ over factorised intermediate results, relevant for how SST stats inform planning.
* **DuckDB DataChunk + Parquet integration**. Reference for column-store integration over Parquet.
* **turbopuffer architecture**. Embedded-stats-in-manifest pattern + bloom side-car pattern.
* **SlateDB SST format**. We diverge by being column-oriented and CSR-aware.
* **Apache Iceberg manifest spec**. Reference for the design we did *not* adopt as the primary layout, but will use as an *export* surface in RFC-014.
* **UUIDv7 specification**, RFC 9562. Layout that underlies the split-top64 / bottom64 encoding in §3.2.4.

# RFC 003: Read-path ranged reads + Parquet page index

> **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** —
> *Mirrored from [`docs/rfc/003-read-path-ranged-reads.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/003-read-path-ranged-reads.md) in the engine repo. Source of truth lives there.*

## Summary

Replace the full-body `object_store::get()` that every cold `lookup_node`/`edge_lookup` issues today with a byte-ranged fetch driven by the Parquet **page index** and the existing per-row-group min/max stats. The goal is to bring cold `lookup_node` p50 on real S3 (50–100 MB/s, \~1 ms RTT from a co-located EC2 instance; far worse from a developer laptop) inside the `<500 ms p50` envelope at 10 M nodes. The in-process / LocalStack bench cannot exercise this because localhost bandwidth hides the issue.
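
As a rough illustration of the ranged-read idea, the sketch below shows how a reader that already knows the object size could compute the trailing byte range to request, then check whether that tail covers the whole Parquet footer. The helper names (`footer_tail_range`, `footer_len`, `tail_covers_footer`) and the 8 KiB guess are assumptions made for this example, not NamiDB's API. The only format fact it relies on is that a Parquet file ends with a 4-byte little-endian footer length followed by the magic `PAR1`.

```rust
use std::ops::Range;

/// Trailing magic of every Parquet file.
const PARQUET_MAGIC: &[u8] = b"PAR1";

/// Byte range covering the last `tail_guess` bytes of an object of `size` bytes.
/// Hypothetical helper: `size` would come from the manifest, so no HEAD is needed.
fn footer_tail_range(size: u64, tail_guess: u64) -> Range<u64> {
    size.saturating_sub(tail_guess)..size
}

/// Read the footer (metadata) length from the last 8 bytes of `tail`.
/// Returns `None` if the tail is too short or the magic does not match.
fn footer_len(tail: &[u8]) -> Option<u32> {
    if tail.len() < 8 {
        return None;
    }
    let (len_bytes, magic) = tail[tail.len() - 8..].split_at(4);
    if magic != PARQUET_MAGIC {
        return None;
    }
    Some(u32::from_le_bytes([
        len_bytes[0],
        len_bytes[1],
        len_bytes[2],
        len_bytes[3],
    ]))
}

/// True when the fetched tail already contains the whole footer,
/// i.e. no second ranged GET is required to read the metadata.
fn tail_covers_footer(tail: &[u8]) -> bool {
    match footer_len(tail) {
        Some(len) => (len as usize) + 8 <= tail.len(),
        None => false,
    }
}
```

If `tail_covers_footer` returns `false`, a real reader would issue one more ranged request for the missing prefix of the footer; that fallback is left out of the sketch.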
## Motivation

A previous iteration closed the bench gate against LocalStack: row-group pruning on the `node_id` column reduced per-lookup decode from O(rows\_in\_sst) to O(rows\_per\_row\_group), and the resulting numbers at 10 M nodes are:

| Metric | Target | Measured (LocalStack, MacBook) |
| --- | --- | --- |
| Cold `lookup_node` p50 | <500 ms | **381 ms** |
| Warm `lookup_node` p50 | <10 ms | **9.27 ms** |

Both gates pass, but the cold number is misleading because `Snapshot::get_sst_body` still fetches the **entire** Parquet body (currently 300–500 MB for a 10 M-node SST with zstd compression). LocalStack on a single host moves that in \~350 ms; real S3 moves it in 2–10 s depending on co-location. The same code path against `s3.us-east-1.amazonaws.com` from a developer laptop would consistently violate the gate, even though the test harness reports green.

The root mismatch is structural: a point lookup needs tens of KB of column data from a single row group, not the whole SST. The Parquet 2.0 page index gives us exactly that (per-page min/max plus offset and length for every column chunk), but we currently ignore it.

Cost of doing nothing: any production deploy against real S3 ships with a hidden \~10× regression on cold lookups vs the bench gate. That blocks the SaaS demo and the public launch.

## Design

### Surface change

`NodeSstReader::open` and `EdgeSstReader::open` today take `body: Bytes`. The new path takes an `Arc<dyn ObjectStore>` + `Path` + `ObjectMeta` (size known from the manifest descriptor) and uses `parquet::arrow::async_reader::ParquetObjectReader` under the hood. The existing `Bytes`-backed constructors stay for the in-process test path and for the eager `scan_label` use case (which already needs every row group). ```rust // New constructor (additive). impl NodeSstReader { pub async fn open_async( label: LabelDef, store: Arc<dyn ObjectStore>, path: Path, size_hint: u64, ) -> Result<Self> { /* ...
*/ } } ``` The async reader exposes a parallel `targeted_scan_async(&[u8; 16]) -> RecordBatch` that:

1. **Footer fetch.** `ParquetObjectReader` issues one `get_range` for the trailing \~8 KB of the SST to read the Parquet footer + column-chunk metadata. For a 500 MB SST this is one round trip and \~8 KB transferred.
2. **Row-group pruning.** Same min/max stats check we already have in `targeted_scan`. Pick the single row group that straddles the target key (the writer guarantees strictly ascending `node_id`, so there is at most one).
3. **Page index fetch.** If `with_page_index(true)`, the reader fetches the `OffsetIndex` + `ColumnIndex` for the chosen row group (a few KB). Combined with the per-page min/max from the column index, we identify the single data page in the `node_id` column that contains the target.
4. **Page fetch.** A single `get_range` of the chosen page’s bytes (\~1–8 KB depending on rows-per-page). Decode, find the row offset within the page, then project the same row offset across the other columns’ pages; each is one more `get_range`. For a `Person` label with \~6 declared properties + 2 system columns, that’s 8 ranged GETs of \~1–8 KB each, or one batched `get_ranges` call.

Total wire footprint per cold lookup: **\~50–100 KB** (vs \~500 MB today) and **3–4 round trips** (footer, page index, batched column pages). On S3 us-east-1 from EC2 (\~1 ms RTT), that’s \~5–20 ms. From a laptop (\~30 ms RTT), \~100–150 ms. Both are inside the 500 ms gate with comfortable margin.

### Cache integration

The current `SstCache` keys on the full path and stores the entire body. Under the new design we shift to **range-keyed caching**: keys become `(path, offset, length)` or a normalised `(path, kind)` for the three structurally-fixed regions:

* `:footer`: the trailing footer + column metadata block (size known after the first fetch).
* `:row_group_:column_`: per-column-chunk pages for the hot row group.
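
The key shapes listed above could be modelled roughly as follows. This is a minimal sketch under stated assumptions: `RegionKey` and `RangeCache` are hypothetical names, a plain `HashMap` stands in for foyer, and eviction is left out entirely.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: either a structurally fixed region of an SST
/// or an arbitrary `(path, offset, length)` byte range.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum RegionKey {
    Footer { path: String },
    ColumnChunk { path: String, row_group: u32, column: u32 },
    Range { path: String, offset: u64, length: u64 },
}

/// Toy range-keyed cache. A real implementation would sit on top of a
/// bounded cache with eviction; this one only demonstrates the keying.
struct RangeCache {
    entries: HashMap<RegionKey, Vec<u8>>,
}

impl RangeCache {
    fn new() -> Self {
        RangeCache { entries: HashMap::new() }
    }

    /// Store the bytes fetched for one region, replacing any previous copy.
    fn insert(&mut self, key: RegionKey, bytes: Vec<u8>) {
        self.entries.insert(key, bytes);
    }

    /// Return the cached bytes for a region, if present.
    fn get(&self, key: &RegionKey) -> Option<&[u8]> {
        self.entries.get(key).map(|v| v.as_slice())
    }

    /// Total number of value bytes currently held.
    fn bytes_cached(&self) -> usize {
        self.entries.values().map(|v| v.len()).sum()
    }
}
```

Keying on logical regions means two lookups that land in the same row group share cache entries even when they target different node ids.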
Warm lookups against the same SST and same row group hit memory without re-fetching. This is essentially a buffer pool keyed by Parquet’s logical units instead of by file. Foyer continues to back it; the `weighter` adds the key length plus the value length and the budget stays in real bytes.

### Edge SST counterpart

The edge SST format is custom (RFC-002 §3) and already ships a fence-pointer index for `key_count > 65 536`. The same idea applies: today `EdgeSstReader::open` reads the full body; the async variant reads the footer + fence index + the per-key partner block. Wire format is unchanged; only the reader navigates differently.

### Manifest descriptor extension

`SstDescriptor.size_bytes: u64` already exists in the manifest. No schema change needed; the reader passes that as the `size_hint` so `ParquetObjectReader` can position the trailing footer read without a HEAD request.

## Alternatives considered

### A. Persist a separate “index” side-car per SST

A `.idx` blob with `node_id → (row_group, offset)` mapping, written at flush time. Cold lookup = 1 GET of the side-car + 1 GET of the chosen row group. Rejected: the side-car would essentially duplicate the Parquet column index, and we’d carry two sources of truth that must stay in sync. The Parquet page index is already on disk inside the body, so re-using it is free.

### B. Maintain a sorted in-memory key→row-group map per SST

Build it on `open()` and cache. Cold lookup pays one full-SST decode the first time, warm is instant. Rejected: the first lookup is what we’re trying to fix. Building the map requires reading the footer + column index anyway, so we may as well consume that information directly instead of caching it in a parallel structure.

### C. Smaller row groups (e.g., 4 K rows)

Today’s row group is 128 K rows. Smaller groups would amortise less per-group overhead and let us decode less per pruned hit.
Rejected as a complete solution: ratio improvement is linear in the row-group shrink but at some point per-group metadata cost dominates the body. Real fix is page-level granularity, not finer row groups.

### D. Materialise hot keys into a separate SST per layer

LSM-style “block index” promoted to its own file. Rejected: adds a writer-side component (when to promote? what to evict?) and another manifest descriptor. The Parquet page index already gives us per-page granularity for free; promoting hot keys is premature.

## Drawbacks

1. **Two read paths to maintain.** The async ranged path coexists with the eager `Bytes`-backed path used by `scan_label` / `scan_edge_type` / compaction. We accept the surface area because compaction genuinely needs every row group and would issue worse access patterns if forced through the ranged reader.
2. **Foyer cache keying changes.** Existing tests that assert `SstCache.usage() > 0` after a warm cycle keep working (the cache holds page bytes instead of body bytes) but the bytes-per-entry distribution shifts dramatically: smaller entries, more of them. Eviction tuning may need a second pass.
3. **Round-trip count on S3.** A cold lookup goes from 1 wide GET to \~3 narrow GETs. For backends with HEAD+GET RTT penalties (some self-hosted gateways) this could regress wall-clock time despite the bandwidth win. Mitigation: support `object_store::get_ranges` (which coalesces) and benchmark explicitly against real S3 + LocalStack before declaring victory.
4. **Bench harness debt.** `benches/read_latency.rs` today exercises the cached `Bytes` path. The harness needs a new bench (`cold_ranged_from_s3`) that exercises the async reader and reports both the LocalStack and real-S3 numbers; otherwise we re-introduce the LocalStack-only blind spot this RFC was written to close.

## Open questions

1.
**Coalescing strategy for column pages.** `object_store::get_ranges` issues a single multi-range request when the backend supports it; for backends that don’t (some S3 gateways), it falls back to parallel single-range GETs. Need to measure which dominates for our typical 8-column projection.
2. **Page index always-on?** The writer can produce the page index unconditionally (negligible footer overhead) or only when row count exceeds a threshold. Cheap to always emit; recommend on by default and revisit only if footer size becomes a problem.
3. **Bloom filter probe ordering.** Today: manifest min/max → bloom → body GET. New flow: manifest min/max → bloom → footer GET (cheap) → row-group prune → page GET. Bloom still saves a footer round trip on a true miss, so keep it first. But if the bloom misses are rare in practice (well-tuned FPR), we may want to skip it and go straight to the footer fetch which is similarly small.
4. **Property-stream evolution interaction.** RFC-002 §3.2.7 (declared edge property streams) is a follow-up. When per-property streams ship, the ranged read pattern extends naturally: one extra `get_range` per requested property stream. No new design needed, just one more knob on the column projection.
5. **`scan_label` / `scan_edge_type` retention of the eager path.** Confirm that range scans always stay on the body-fetch path or whether they should also use ranged reads when the result set is small. Probably “always eager for now, revisit when the query engine surfaces predicate push-down.”

## References

* RFC-002 §4.1 (SstDescriptor format, `size_bytes` already in the manifest).
* Apache Parquet [Page Index spec](https://github.com/apache/parquet-format/blob/master/PageIndex.md).
* `object_store::ObjectStore::get_ranges` ([docs.rs](https://docs.rs/object_store/0.13.0/object_store/trait.ObjectStore.html#method.get_ranges)).
* `parquet::arrow::async_reader::ParquetObjectReader` ([docs.rs](https://docs.rs/parquet/55.2.0/parquet/arrow/async_reader/struct.ParquetObjectReader.html)).

# RFC 004: Cypher subset compatibility scope

> **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** —
> *Mirrored from [`docs/rfc/004-cypher-subset.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/004-cypher-subset.md) in the engine repo. Source of truth lives there.*

## Summary

This RFC declares the exact subset of Cypher 25 / openCypher / GQL ISO/IEC 39075:2024 that the `namidb-query` parser accepts in the first iteration of the query engine (v0). The goal is to **parse without error the 12 LDBC SNB Interactive Complex queries that do not depend on `shortestPath`/`allShortestPaths`**, leaving IC13 and IC14 explicitly out of scope until RFC-009 (WCOJ + recursive patterns), and to cover 100 % of the accepted surface with tests that live alongside the code.

The subset is **deliberately smaller** than that of Neo4j Community 5.x and of Kùzu: we favour strict compatibility over features, and we avoid APOC, `CALL` subqueries, and `FOREACH`. The commitment is: *“whatever the parser accepts either runs or returns a clear typed error, never a silent warning that changes the semantics”*.

## Motivation

Cypher is a large language. The Cypher 25 specification (May 2025) defines about 80 top-level clauses and expressions, and the openCypher TCK adds \~10 000 test cases. A complete implementation takes 18+ months (Memgraph took \~2 years to cover the useful 80 %; Kuzu never reached 100 % before it was archived). Without an explicit scope the parser becomes a time sink:

* Every new feature demands semantic decisions (e.g. `MERGE` with multi-label, `OPTIONAL MATCH` with left-anti-join, `WITH *`).
* Every new feature demands tests, error messages, and lowering to the logical plan IR.
* Without a gate for “what is in and what is out” we cannot honestly tell users what works.

**Scope reference:** LDBC SNB Interactive Complex Q1–Q14. Covering 12 of the 14 queries in the parser leaves an executable scope for the following stages (lowering, optimizer, executor) without opening new compatibility fronts.

## Design

### Declared version of the standard

* **Normative base:** GQL ISO/IEC 39075:2024 (published 11 April 2024) + openCypher 9 (the last version whose specification is patent-free).
* **Cypher 25 (Neo4j):** we treat it as a reference for naming and syntax but will **not** implement anything exclusive to Neo4j (e.g. `db.*` functions, APOC).
* **When GQL and openCypher conflict**, **GQL wins**. Rationale: avoid vendor-specific lock-in and align with the direction that Memgraph, RisingWave, and the academic community are taking.

### v0 subset (in-scope)

#### Clauses

| Clause | v0 | Notes |
| --- | --- | --- |
| `MATCH` | ✅ | Fixed pattern or variable-length `*n..m` with finite bounds. |
| `OPTIONAL MATCH` | ✅ | Left-outer-join semantics. |
| `WHERE` | ✅ | Arbitrary predicates over the visible scope. |
| `RETURN` | ✅ | Projection list with aliases (`AS`). `DISTINCT` supported. `*` not supported in v0 (an explicit projection is required). |
| `WITH` | ✅ | Pipe that resets the scope. Supports an inner `WHERE` and aliases. |
| `ORDER BY` | ✅ | Multi-key `ASC`/`DESC`. |
| `SKIP` / `LIMIT` | ✅ | Literals or `$param` only. No expressions. |
| `UNWIND` | ✅ | List → rows. |
| `CREATE` | ✅ | Nodes and edges with literal or `$param` properties. |
| `MERGE` | ✅ | `MERGE... ON CREATE SET... ON MATCH SET...`. |
| `SET` | ✅ | Property assign, label add. |
| `DELETE` / `DETACH DELETE` | ✅ | Single binding per delete. |
| `REMOVE` | ✅ | Property remove, label remove.
|
| `UNION` / `UNION ALL` | ✅ | Same arity and same aliases. |

#### Patterns

| Element | v0 | Notes |
| --- | --- | --- |
| Node pattern `(a:Label {prop: val})` | ✅ | Multi-label `(a:A:B)`. Inline map property filter. |
| Relationship pattern `-[r:TYPE]->` | ✅ | Direction `-->`, `<--`, `--`. |
| Relationship type alternation `-[r:TYPE_A\|TYPE_B]->` | ✅ | |
| Variable-length `-[r:KNOWS*1..3]->` | ✅ | Finite bounds required. A bare `*` or `*n..` (no upper bound) → explicit error. |
| Pattern chain `(a)-[]-(b)-[]-(c)` | ✅ | |
| Multi-part pattern `MATCH (a), (b)` | ✅ | |
| Anonymous variable `()`, `[]` | ✅ | |

#### Expressions

| Category | v0 |
| --- | --- |
| Literals: int, float, string, bool, null, list `[1,2,3]`, map `{k: v}` | ✅ |
| Parameters `$name` | ✅ |
| Variable reference `a`, property access `a.prop` | ✅ |
| Arithmetic operators `+ - * / % ^` | ✅ |
| String operators `+` (concat), `=~` (regex) | ✅ |
| Boolean operators `AND OR NOT XOR` | ✅ |
| Comparison `= <> < > <= >=` | ✅ |
| `IS NULL` / `IS NOT NULL` | ✅ |
| `IN` (list membership) | ✅ |
| `STARTS WITH`, `ENDS WITH`, `CONTAINS` | ✅ |
| Function call `length(x)`, `count(a)`, `collect(a.prop)` | ✅ (built-ins listed below) |
| `CASE WHEN... THEN... ELSE END` | ✅ (simple form and multi-branch form) |
| List comprehension `[x IN list WHERE pred \| expr]` | ✅ |
| Pattern comprehension `[(a)-[]->(b) \| b.name]` | ✅ |
| Pattern predicates `WHERE (a)-[]->(b)` | ✅ |

#### Built-in functions (minimum for Q1–Q12)

**Aggregations:** `count(*)`, `count(x)`, `count(DISTINCT x)`, `sum`, `avg`, `min`, `max`, `collect`, `collect(DISTINCT x)`.
**Scalar:** `id(n)`, `labels(n)`, `type(r)`, `keys(n)`, `properties(n)`, `length(p)`, `size(coll)`, `head(coll)`, `last(coll)`, `tail(coll)`, `coalesce(x, y,...)`.

**String:** `toLower`, `toUpper`, `trim`, `substring`, `replace`, `split`, `toString`, `toInteger`, `toFloat`.

**Numeric:** `abs`, `ceil`, `floor`, `round`, `rand`, `sign`.

**Temporal:** `date`, `datetime`, `duration` (constructor form only, with ISO 8601 strings; the full algebra is not supported yet).

**Pattern:** `exists(pattern)`, `nodes(path)`, `relationships(path)`.

#### Types

`INTEGER` (64-bit signed), `FLOAT` (64-bit), `STRING`, `BOOLEAN`, `NULL`, `LIST` (heterogeneous lists allowed; type-checked at runtime), `MAP`, `NODE`, `RELATIONSHIP`, `PATH`, `DATE`, `DATETIME` (no timezone), `DURATION`.

Out of scope for v0: `BYTES`, `POINT`, `LOCALDATETIME`, `ZONEDDATETIME`, `LOCALTIME`, `TIME`.

#### NULL semantics

Standard Cypher three-valued logic:

* `NULL = NULL` → `NULL` (not `true`).
* `NULL AND false` → `false`, `NULL AND true` → `NULL`.
* A `WHERE` filter rejects rows whose predicate evaluates to `NULL` (treated like `false`).
* `IS NULL` / `IS NOT NULL` are the only ways to test for NULL.

#### Error model

`ParseError { code: ErrorCode, message: String, span: SourceSpan, help: Option<String> }` where `ErrorCode` is an exhaustive enum (`E001_UnexpectedToken`, `E002_UnboundedVariableLength`, `E003_ReservedKeyword`,…). Messages follow the `ariadne` format with caret highlighting and an optional `help:`. Multiple errors are reported in the same pass via `chumsky::recovery`.

### Explicitly out of scope for v0

Exhaustive list: any feature that is listed neither here nor in the in-scope subset fails with a “feature not supported” error plus the number of the future RFC where it lands.
| Feature | Why it is out | Lands in |
| --- | --- | --- |
| `shortestPath(...)` | Recursive pattern matching. Requires WCOJ + a special planner. | RFC-009 |
| `allShortestPaths(...)` | Same as above. | RFC-009 |
| `CALL {... }` (subqueries) | Subquery scoping rules are subtle and not needed for LDBC SNB Interactive. | Future RFC |
| `CALL procedure.name(...)` | We have no procedure registry. APOC is explicitly out. | Future RFC |
| `FOREACH` | Imperative, rarely useful. | Future RFC |
| `USE database` | Cross-database queries. Single namespace per session. | RFC-010 (cloud) |
| `LOAD CSV` | The bulk ingest path is `WriterSession`. | Never; use the ingest API. |
| `CREATE INDEX` / `CREATE CONSTRAINT` | DDL lives outside Cypher; the schema API handles it directly. | Future RFC |
| `EXPLAIN` / `PROFILE` | Pending but already scoped: they arrive once a LogicalPlan exists. | Future RFC |
| Explicit transactions (Cypher-level `BEGIN`/`COMMIT`/`ROLLBACK`) | The client manages them externally via `WriterSession.commit_batch`. | Never via Cypher in v0. |
| Variable-length without upper bound (`*1..`) | Without an upper bound the optimizer cannot limit the blowup. | Possible relaxation with WCOJ. |
| Zero-length pattern (`*0..n`) | Trivial, but raises semantic questions (self-loops). | Future RFC. |
| `MATCH p = (a)-[*]->(b) RETURN p` (paths as first-class values) | Requires path materialisation; useful but not critical for Q1–Q12. | Future RFC. |
| Types `POINT`, `TIME`, `ZONEDDATETIME` | Unused in LDBC SNB Interactive. | Future RFC, once geo / time-series verticals land. |
| `db.*` / `apoc.*` namespaces | Neo4j vendor-specific; not portable. | Never.
|

### Mapping to LDBC SNB Interactive Complex Q1–Q14

Each query is evaluated against the subset and marked `IN` (parses in v0) or `OUT` (excluded until the indicated RFC).

| Query | Required features | v0 |
| --- | --- | --- |
| **IC1**: Friends by name (transitive) | `MATCH... *1..3... WHERE... ORDER BY... LIMIT` | ✅ IN |
| **IC2**: Recent messages by friends | `MATCH 2-hop... WHERE timestamp <... ORDER BY... LIMIT` | ✅ IN |
| **IC3**: Friends in two countries | `MATCH... WHERE country IN [...]` | ✅ IN |
| **IC4**: New topics on friend posts | `MATCH 2-hop + WITH + collect + UNWIND + WHERE NOT IN` | ✅ IN |
| **IC5**: New groups (membership count) | `MATCH... WITH... count + ORDER BY` | ✅ IN |
| **IC6**: Tag co-occurrence | `MATCH 2-hop... WITH tag, count... ORDER BY` | ✅ IN |
| **IC7**: Recent likers | `MATCH... ORDER BY... LIMIT` | ✅ IN |
| **IC8**: Recent replies | `MATCH... ORDER BY... LIMIT` | ✅ IN |
| **IC9**: Recent messages by friends-of-friends | `MATCH *2..2... WHERE... ORDER BY... LIMIT` | ✅ IN |
| **IC10**: Friend recommendation | `MATCH 2-hop... WITH common_count... ORDER BY` | ✅ IN |
| **IC11**: Job referral | `MATCH... WHERE... ORDER BY` | ✅ IN |
| **IC12**: Expert search by tag class | `MATCH 2-hop + tag class hierarchy + count + ORDER BY` | ✅ IN |
| **IC13**: Single shortest path | `shortestPath((a)-[*]-(b))` | ❌ OUT (RFC-009) |
| **IC14**: All shortest paths weighted | `allShortestPaths` + weight calc | ❌ OUT (RFC-009) |

**v0 coverage:** 12/14 (85.7 %). IC13 and IC14 are the only exclusions, and both require recursive pattern matching, which the WCOJ planner unlocks.
### Structure of the `namidb-query` crate

```plaintext
crates/namidb-query/src/
├── lib.rs            # public re-exports
├── parser/
│   ├── mod.rs        # entry point: parse(&str) -> Result<Query, Vec<ParseError>>
│   ├── lexer.rs      # &str → Vec<(Token, SourceSpan)>
│   ├── ast.rs        # AST types (Query, Clause, Pattern, Expression,...)
│   ├── grammar.rs    # chumsky combinators
│   ├── display.rs    # canonical Display impl (round-trip)
│   └── error.rs      # ParseError, ErrorCode, SourceSpan
└── tests/            # parser integration tests
```

LogicalPlan, optimizer, and executor live in sibling modules covered by sibling RFCs; they are outside the scope of RFC-004.

### Added dependencies

| Dep | Version | Why |
| --- | --- | --- |
| `chumsky` | 0.10 | Parser combinators with error recovery, AST-friendly. Justified in §Alternatives. |
| `ariadne` | 0.5 | Pretty error messages (caret, span highlight, multi-error). |

We do not add `nom`, `pest`, `lalrpop`, or `antlr-rs`. Justification in §Alternatives.

## Alternatives considered

### A. Hand-written recursive descent parser

**Pro:** maximum parsing speed, full control over error messages, no dependency tree.

**Con:** \~3 000–5 000 LoC to cover the declared subset, \~30–50 % of the time goes into precedence + error-recovery boilerplate, and refactors get expensive as we add features.

**Verdict:** Rejected. This is the “Postgres” option, valid when the parser is the main product. For us the product is the storage + executor; the parser is overhead.

### B. `nom` parser combinators

**Pro:** mature (\~10 yrs), fast, large Rust community.

**Con:** error messages need a lot of manual plumbing (`VerboseError` helps but falls well short of `ariadne`), there is no built-in recovery, and the combinators are byte-stream-first (not token-stream-first), which creates natural friction with a separate tokenising lexer.

**Verdict:** Rejected.
It is the best option if the parser were the only priority, but the developer experience around errors is inferior to chumsky's.

### C. `chumsky` 0.10+

**Pro:** parser combinators with first-class error recovery (`recovery::skip_then_retry_until`, `nested_delimiters`), AST-friendly (returns `Result<T, Vec<E>>` with every error, not only the first), good integration with `ariadne` for pretty errors, version 1.0 is near.

**Con:** version 0.10 changed the API significantly vs 0.9, so another broad breaking change is likely. Slower than `nom` in micro-benchmarks (\~2×).

**Verdict:** **Accepted**. Parsing speed is irrelevant in our workload (the query string comes from the user once, gets parsed, and is cached). Error quality is what matters.

### D. ANTLR4 + Rust parser generator (antlr-rust)

**Pro:** openCypher ships an official ANTLR grammar; reusing it avoids re-litigating precedence and syntax edge cases, and gives standard coverage “for free”.

**Con:** `antlr-rust` is not well maintained (last release 2022), the openCypher grammar covers features that are out of scope (`shortestPath`, `CALL`, `FOREACH`,…), and filtering them post-parse costs more than parsing the subset directly. ANTLR generates code that is hard to read, and debugging when something goes wrong is difficult.

**Verdict:** Rejected. Reusing the openCypher grammar as an informal reference: yes. Generating Rust from it: no.

### E. LALRPOP (LR(1) generator)

**Pro:** mature, fast, deterministic parser.

**Con:** Cypher is not cleanly LR(1) (pattern vs expression ambiguity inside `WHERE` clauses), and forcing the grammar into LALR leads to hacks. Error recovery in LR(1) is notoriously hard.

**Verdict:** Rejected. LR produces rigid grammars, and we will want to evolve quickly.

### F. Separate lexer vs inline lexer in chumsky

chumsky supports parsing directly from `&str` without a lexer (this is the idiomatic style in many examples). Decision: **separate lexer**.
**Reasons:**

* Comments (`//`, `/* */`) and whitespace are cleaner to handle in a lexer.
* Keyword vs identifier is a lexical ambiguity (`COUNT` can be a function or an identifier in some contexts); resolving it at the token level simplifies the grammar.
* More precise spans: every token carries its span; the parser only connects tokens and does not recompute offsets.
* Independent testing: the lexer can be tested without touching the parser, and vice versa.

Cost: \~150 extra LoC for the lexer. Acceptable.

## Drawbacks

1. **Very small subset** compared with Neo4j (5 % of the surface). Early adopters coming from Neo4j will hit “feature not supported” on every advanced feature. Mitigation: the error message states which future RFC covers the feature and links to the public roadmap.
2. **Cypher 25 is evolving**: GQL ISO/IEC 39075 may gain amendments. Mitigation: rebase the subset on every release; RFC-004 is treated as a living document (Status can move to `superseded` when RFC-004.1 or RFC-004 v1 appears).
3. **`chumsky` 0.10 → 1.0** breaking change expected in the coming months. Mitigation: encapsulate usage behind a private `parser::grammar::*`, so the refactor stays confined to one module.
4. **`MERGE` with multi-label patterns** has ambiguous semantics (Neo4j and Memgraph differ). Decision: in v0 `MERGE` requires exactly one label per node pattern. `MERGE (a:A:B)` returns a parser-level error. Documented under error code `E007_MergeMultiLabel`.
5. **`OPTIONAL MATCH` with variable-length** is not well defined in the standard (what happens with OPTIONAL over `*0..n`?). v0 decision: reject the combination in the parser with `E008_OptionalVariableLength`. Lands with RFC-009.

## Open questions

* **Q1: `RETURN *`**: not supported in v0. Do we add it later, once the binding scope resolver arrives? Likely yes; it is a high-value, low-cost feature.
* **Q2: `WITH *`**: same as Q1. Decision deferred.
* **Q3: User-defined functions**: plan §13.2 mentions a future RFC but it is not numbered. A `namidb.fn.*` namespace? A WASM sandbox?
Out of scope for v0; to be decided.
* **Q4: `LOAD CSV`.** Explicitly out of scope, but users coming from Neo4j will look for it. Do we document an equivalent “`namidb-cli ingest --csv...`” or leave it to the Python SDK? A decision separate from this RFC.
* **Q5: Identifiers with backticks.** ``MATCH (a:`Foo Bar`)``. openCypher allows them; GQL requires them for identifiers with spaces or reserved words. Decision: **always support them** (the better superset of both standards).

## References

* GQL ISO/IEC 39075:2024
* openCypher 9 specification
* Cypher 25 (Neo4j)
* LDBC SNB Interactive Workload, v0.4: Erling et al., SIGMOD 2015.
* Memgraph Cypher subset
* Kuzu Cypher compatibility (pre-archive snapshot, Oct 2025).
* chumsky 0.10 documentation
* ariadne
* `recursive-descent` vs `combinators` discussion in the Rust DBMS community: Niko Matsakis, “Why I built lalrpop” (2017); Geal's blog posts on nom.

# RFC 008: Logical Plan IR

> **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** none
> *Mirrored from [`docs/rfc/008-logical-plan-ir.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/008-logical-plan-ir.md) in the engine repo. Source of truth lives there.*

## Summary

Defines the intermediate representation (IR) that the query engine uses between the Cypher AST (RFC-004) and the executor. The IR is a tree of relational operators extended for graphs, which is the standard shape in modern DBMSs (DuckDB, DataFusion, Materialize, Kùzu), adapted to the property-graph model. This RFC fixes the operators, their semantics, the lowering rules from Cypher, the runtime type `RuntimeValue`, and the API of the initial naïve executor (tree-walking, eager `Vec<Row>`).

Explicitly out of scope in the initial version: streaming/morsel-driven execution, cost-based optimizer, WCOJ planner, parallelism, multi-namespace distribution, query result caching.
## Motivation

Without a stable IR, the lowering, the optimizer, and the executor end up coupled to the shape of the AST. That hurts along three dimensions:

1. **The optimizer cannot be grafted on.** To rewrite for predicate pushdown, join order, or projection elimination we need a tree that can talk about `Filter(input, pred)` and `Project(input, items)` as independent operators, not as nested clauses.
2. **A morsel-driven executor cannot share code.** The vectorized executor will operate on the same operator tree as the naïve one; only the row representation (Arrow `RecordBatch` vs `Vec<Row>`) and the scheduling strategy change. If the IR is stable, the vectorized version replaces only the implementation of `Operator::execute`.
3. **EXPLAIN/PROFILE need something to print.** Without an IR, `EXPLAIN` would have to walk the AST and translate it on the fly every time. With an IR we print the tree directly.

The cost of doing it up front (vs deferring it) is \~700–1 000 LoC and one design iteration. The cost of deferring it is a mandatory refactor when the optimizer and the vectorized executor arrive, which is worse.

## Design

### Runtime type: `RuntimeValue`

`namidb-core::Value` covers the scalars (`Null/Bool/I64/F64/Str/Bytes/Vec`) but lacks the composites that Cypher needs: `LIST`, `MAP`, `NODE`, `RELATIONSHIP`. We define `RuntimeValue` standalone in `namidb-query` to keep `core` agnostic of the query layer:

```rust
pub enum RuntimeValue {
    Null,
    Bool(bool),
    Integer(i64),
    Float(f64),
    String(String),
    List(Vec<RuntimeValue>),
    Map(BTreeMap<String, RuntimeValue>),
    Node(Box<NodeValue>),
    Rel(Box<RelValue>),
    // Date / DateTime / Duration: initial stubs; full semantics come later.
    Date(i32),     // days since 1970-01-01
    DateTime(i64), // microseconds since 1970-01-01T00:00:00Z
}
```

`NodeValue` and `RelValue` wrap the storage layer's `NodeView` / `EdgeView`:

```rust
pub struct NodeValue {
    pub id: NodeId,
    pub label: String,
    pub properties: BTreeMap<String, RuntimeValue>,
}

pub struct RelValue {
    pub edge_type: String,
    pub src: NodeId,
    pub dst: NodeId,
    pub properties: BTreeMap<String, RuntimeValue>,
}
```

The two `From` conversions map `core::Value → RuntimeValue` row-wise. This introduces a copy, which is acceptable in the initial version (the future vectorized version will operate directly on Arrow batches without this conversion).

### Runtime type: `Row`

```rust
pub struct Row {
    pub bindings: BTreeMap<String, RuntimeValue>,
}
```

A `Row` is the complete state of the bindings in the current scope. `MATCH (a)-[r]->(b) RETURN a.name, r.weight, b.id` produces rows with three live bindings (`a`, `r`, `b`) up to the `RETURN`, which projects to a new row containing only `a.name`, `r.weight`, `b.id`.

Decision: `BTreeMap` (not `HashMap`), for deterministic iteration order in tests and `EXPLAIN` output. Lookup is `O(log k)` with `k = #bindings`, which is immaterial compared with the cost of IO.

### IR operators

Each operator is a variant of `LogicalPlan`. The tree is child-pointer and single-input, except for `Union` (two inputs). Edges are implicit: each operator “produces rows” for its parent.

```rust
pub enum LogicalPlan {
    /// Row producer: full scan of all nodes with `label`.
    /// `alias` is the binding each NodeValue occupies in the output row.
    NodeScan {
        label: String,
        alias: String,
    },

    /// O(1) variant: scan of a single node by id. Used when the AST
    /// arrives with `(p:Person {id: $personId})`: lowering detects the
    /// trivial filter and turns it into `NodeById` instead of
    /// `Filter(NodeScan, ...)`.
    NodeById {
        label: String,
        alias: String,
        id: Expression, // typically Parameter("personId") or Literal(NodeId)
    },

    /// Takes rows from `input`, expands the `source` binding along its
    /// `direction`/`edge_type` edges, materializes the destination under
    /// `target_alias`, and optionally binds the rel under `rel_alias`.
    Expand {
        input: Box<LogicalPlan>,
        source: String,
        edge_type: Option<String>,
        direction: RelationshipDirection,
        rel_alias: Option<String>,
        target_alias: String,
        /// When the AST carries variable-length `*min..max`, this field
        /// stores the bounds; lowering decides whether to emit a single
        /// `Expand` with length or (in the future) a recursive sub-plan.
        length: Option,
    },

    /// Selects the rows that satisfy `predicate`.
    Filter {
        input: Box<LogicalPlan>,
        predicate: Expression,
    },

    /// Replaces the row with a new projection. Keeps the scope open via
    /// the item list (each item is an expression + optional alias).
    /// If `discard_input_bindings = true`, the bindings that are not
    /// projected are dropped (RETURN-style). If `false`, they are kept
    /// (WITH-style).
    Project {
        input: Box<LogicalPlan>,
        items: Vec<ProjectionItem>,
        distinct: bool,
        discard_input_bindings: bool,
    },

    /// Groups by `group_by` and applies the aggregate functions.
    Aggregate {
        input: Box<LogicalPlan>,
        group_by: Vec<(Expression, String)>,         // (key expression, output alias)
        aggregations: Vec<(String, AggregateExpr)>,  // (output alias, agg)
    },

    /// Sort + skip + limit fused. If there is only a sort, `skip = 0`
    /// and `limit = u64::MAX`. If there is only a limit, `keys` is empty.
    TopN {
        input: Box<LogicalPlan>,
        keys: Vec<OrderKey>,
        skip: u64,
        limit: u64,
    },

    /// Distinct over the complete set of visible columns.
    Distinct {
        input: Box<LogicalPlan>,
    },

    /// UNION or UNION ALL.
    Union {
        left: Box<LogicalPlan>,
        right: Box<LogicalPlan>,
        all: bool,
    },

    /// Expands an expression list into multiple rows, one per element.
    Unwind {
        input: Box<LogicalPlan>,
        list: Expression,
        alias: String,
    },

    /// Initial driver without rows: produces exactly one empty row.
    /// Needed for queries that open with UNWIND or a literal WITH, and
    /// for subqueries that start independently.
    Empty,
}

pub struct ProjectionItem {
    pub expression: Expression,
    pub alias: String,
}

pub struct OrderKey {
    pub expression: Expression,
    pub direction: OrderDirection,
}

pub enum AggregateExpr {
    Count { arg: Option<Expression>, distinct: bool },
    Sum { arg: Expression, distinct: bool },
    Avg { arg: Expression, distinct: bool },
    Min { arg: Expression },
    Max { arg: Expression },
    Collect { arg: Expression, distinct: bool },
}
```

### NULL semantics (three-valued logic)

Same as Cypher 25 / GQL:

* `NULL OP NULL = NULL` for every `OP ∈ {=, <>, <, >, ...}`.
* `NULL AND false = false`, `NULL AND true = NULL`, `NULL AND NULL = NULL`.
* `NULL OR true = true`, `NULL OR false = NULL`, `NULL OR NULL = NULL`.
* `Filter(predicate)` discards rows whose predicate evaluates to `NULL` (same as `false`).
* `IS NULL` / `IS NOT NULL` are the **only** operators that return `Bool` for a `NULL` input.
* Aggregate functions (except `count(*)`) **ignore NULL** in their inputs.
* Comparison between incompatible types (e.g. `1 = "x"`) → `NULL` (not an error).
* Division by zero between integers → runtime error. Between floats → `NaN` (following IEEE 754; a downstream `<` against `NaN` returns `NULL`).

### Scope semantics

Each `MATCH`/`OPTIONAL MATCH`/`UNWIND`/`WITH`/`CREATE`/`MERGE` clause extends the scope with new bindings.

* `WITH` **closes** the scope: bindings that are not projected are discarded. It is the only clean restart point. Cypher forces a `WITH` between two `MATCH` clauses that share bindings; this is enforced in the AST, not in the IR.
* `OPTIONAL MATCH` propagates `NULL` to all of its bindings when the match has no result. In the future it will be implemented as `Filter` + outer-join semantics; initially it is lowered to `Expand` with an `optional` flag that produces rows with `NULL` bindings when it finds no targets.
* The bindings visible to an `ORDER BY` clause that follows `RETURN` (or `WITH`) are those of the projection, not the pre-projection ones. That forces `RETURN ... ORDER BY` to be lowered as `Project + TopN`, not `TopN + Project`.
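The three-valued rules listed under the NULL semantics heading can be checked with a small sketch. It is illustrative only and is not the engine's evaluator: the `Tri` alias and the helpers `and3`, `or3`, `eq3`, and `filter_keeps` are hypothetical names, and `None` stands in for Cypher `NULL`.

```rust
/// Hypothetical three-valued truth value: `None` models Cypher `NULL`.
type Tri = Option<bool>;

/// AND: a definite `false` on either side wins; otherwise NULL is contagious.
fn and3(a: Tri, b: Tri) -> Tri {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// OR: a definite `true` on either side wins; otherwise NULL is contagious.
fn or3(a: Tri, b: Tri) -> Tri {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// Equality on nullable integers: any NULL operand yields NULL.
fn eq3(a: Option<i64>, b: Option<i64>) -> Tri {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// `Filter` keeps a row only when the predicate is definitely true,
/// so a NULL predicate is dropped just like `false`.
fn filter_keeps(predicate: Tri) -> bool {
    predicate == Some(true)
}
```

Modelling `NULL` as `None` keeps the truth tables explicit in the `match` arms, which makes each bullet above directly testable.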
### Guaranteed evaluation order

The executor runs the tree bottom-up, depth-first. If an operator has two inputs (`Union`), it runs `left` before `right`. Side effects in the executor are forbidden initially (there is no `SET` / `CREATE` / `DELETE` yet); when they arrive they go into dedicated operators (`SetProperty`, `CreateNode`, `DeleteNode`) that run strictly after all reads of the query (or lazily, per a future RFC).

### Lowering rules

For each Cypher clause in the RFC-004 subset:

| Cypher | LogicalPlan |
| --- | --- |
| `MATCH (a:L)` (no further patterns) | `NodeScan { label: "L", alias: "a" }` |
| `MATCH (a:L {id: $x})` (equality on id) | `NodeById { label: "L", alias: "a", id: Parameter("x") }` |
| `MATCH (a:L {prop: $x})` (equality on another property) | `Filter(NodeScan, a.prop = $x)` |
| `MATCH (a)-[r:R]->(b)` | `Expand { input: <input>, source: a, edge_type: R, dir: Right, rel_alias: r, target_alias: b }` |
| `MATCH (a) WHERE p` | `Filter(<input>, p)` |
| `RETURN x, y AS z` | `Project { items: [x, z=y], discard_input=true }` |
| `RETURN DISTINCT x` | `Project { distinct: true, ... }` |
| `WITH x, y AS z` | `Project { items: [x, z=y], discard_input=true }` (same as RETURN; the only difference is whether further clauses follow) |
| `WITH x WHERE p` | `Filter(Project(...), p)` |
| `ORDER BY k1, k2 SKIP s LIMIT l` (after Project) | `TopN { keys: [k1, k2], skip: s, limit: l }` |
| `UNION ALL` | `Union { all: true }` |
| `UNION` | `Distinct(Union { all: false })` |
| `UNWIND list AS x` | `Unwind { input: <input>, list, alias: x }` |
| `MATCH a, b` (multiple patterns, same `MATCH`) | Cross product: lower `b` with `input = lowered(a)` and no shared bindings. |

The specific rule for `OPTIONAL MATCH`:

* `OPTIONAL MATCH (a)-[r]->(b)` with `a` already bound in the previous scope: lower as `Expand { ..., optional: true }`.
If there is no match, it produces a row with `r = NULL` and `b = NULL` (the input row is preserved).
* Variable-length is not allowed (the parser already rejects it; see `RFC-004 §Drawbacks 5`).

### EXPLAIN format

```plaintext
Project [name=a.firstName, age=a.age]
  TopN keys=[a.age DESC] skip=0 limit=10
    Filter (a.age > 18)
      Expand source=a edge_type=KNOWS dir=-> target=b
        NodeScan label=Person alias=a
```

Each operator is printed on one line with the operator name and its parameters as `[...]` or `name=value`; children are indented by two spaces. `EXPLAIN` produces this; `PROFILE` (in the future) decorates it with runtime stats (`rows_out`, `time_ms`, `bytes_read`).

### Executor API

```rust
pub async fn execute(
    plan: &LogicalPlan,
    snapshot: &Snapshot<'_>,
    params: &BTreeMap<String, RuntimeValue>,
) -> Result<Vec<Row>, ExecError>;
```

Brings everything into memory. Eager. Single-threaded (tokio current\_thread). `ExecError` covers: binding not found, type error, parameter not provided, storage error.

## Alternatives considered

### A. AST → executor directly (no IR)

**Pro:** less code, less boilerplate.

**Con:** couples the executor to the AST. The optimizer would require a massive refactor. EXPLAIN would have to reconstruct the plan at string-formatting time.

**Verdict:** rejected. The IR-first investment is \~300 extra LoC that saves >1000 LoC later.

### B. Push-based dataflow (Materialize-style)

**Pro:** native dataflow model; fits streaming and continuous queries.

**Con:** much more complex. Each operator is an actor with state + input/output channels. High overhead for one-shot queries. The benefit only shows up in multi-query / streaming scenarios.

**Verdict:** rejected; a potential future RFC if we move into streaming/CDC.

### C. Volcano-style iterator (`trait Operator { fn next(); }`)

**Pro:** standard in classic DBMSs (Postgres, pre-pipelined MySQL). Lazy, low memory per operator. Naturally streaming.

**Con:** no parallelism. Function-call overhead per row. The modern industry (DuckDB, Velox) has abandoned it.

**Verdict:** rejected.
Initially, an eager `Vec<Row>` is simpler and sufficient; in the future we go straight to morsel-driven execution, not Volcano.

### D. DataFusion as the IR

**Pro:** mature, optimizer “for free”, Arrow compatibility.

**Con:** DataFusion is relational, not graph-shaped. Adapting `Expand`, multi-hop, and WCOJ to DataFusion is a lot of work and never feels natural.

**Verdict:** rejected as the **sole IR**. In the future we wire it in as a **bridge for a parallel SQL surface** (graph queries in our IR, SQL surface in DataFusion, same executor).

### E. Single-input vs multi-input operators

Decision: single-input except `Union`. `Join` (Hash, NL, LFTJ) is explicitly multi-input but **does not appear initially** (lowering produces an `Expand` chain, not joins). Joins come in when the optimizer starts reordering.

## Drawbacks

1. **`RuntimeValue` introduces a row-by-row conversion vs Arrow.** Acceptable initially (correctness first); the vectorized version eliminates the conversion by operating directly on `RecordBatch`. In the meantime, hot loops convert one property `BTreeMap` into another for every NodeView accessed.
2. **The `Empty` operator and `NodeById` are corner cases.** They could live as special cases of `NodeScan`, but declaring them explicitly in the IR makes them inspectable in `EXPLAIN` and trivial to optimize later.
3. **OPTIONAL MATCH as a flag on `Expand`** mixes semantics (left outer join) with syntax (a Cypher-specific clause). In the future we will probably refactor it into `LeftOuterExpand` or an explicit `LeftJoin` operator when the optimizer needs it.
4. **`Distinct` over the whole row** does not allow optimizing `DISTINCT col`, where only the uniqueness of one column is needed. Optimization deferred.
5. **Lowering errors are not recoverable**: a single `BindingNotFound` aborts the plan. In contrast, the parser has multi-error recovery. Acceptable: semantic errors are less frequent than syntactic typos, and we want to fail fast.
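The EXPLAIN layout described in the Design section (one operator per line, children indented by two spaces) can be sketched with a small recursive printer. This is an assumption-laden illustration, not the engine's code: `PlanNode`, `leaf`, `unary`, and `explain` are hypothetical names, and the operator text is stored as a pre-rendered string instead of being derived from `LogicalPlan`.

```rust
/// Hypothetical simplified plan node: the already-rendered operator
/// line plus its children (not the real `LogicalPlan` enum).
struct PlanNode {
    line: String,
    children: Vec<PlanNode>,
}

/// Builds a node without children.
fn leaf(line: &str) -> PlanNode {
    PlanNode { line: line.to_string(), children: Vec::new() }
}

/// Builds a single-input node, the common case in the IR.
fn unary(line: &str, child: PlanNode) -> PlanNode {
    PlanNode { line: line.to_string(), children: vec![child] }
}

/// Appends `node` to `out`, indenting two spaces per tree depth,
/// then recurses into the children in order.
fn explain(node: &PlanNode, depth: usize, out: &mut String) {
    out.push_str(&"  ".repeat(depth));
    out.push_str(&node.line);
    out.push('\n');
    for child in &node.children {
        explain(child, depth + 1, out);
    }
}
```

Calling `explain(&plan, 0, &mut out)` on a `Filter` over a `NodeScan` yields two lines, the second indented by two spaces. A `Union` node would simply carry two children, printed in `left`, `right` order.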
## Addendum: `SemiApply`, `Argument`, `PatternList`

Three additional IR operators support pattern predicates, pattern comprehensions, and back-references to the outer scope:

* **`Argument { bindings: Vec<String> }`**: a single-row placeholder whose bindings are loaded from the outer scope. It appears as the leaf of subplans inside `SemiApply` or `PatternList`. The executor materializes `vec![row]`, where `row` copies the named bindings from the outer row.
* **`SemiApply { input, subplan, negated }`**: an existential semi-join. For each row produced by `input`, it runs `subplan` parameterized by that row (via `outer_row`) and keeps the row iff the subplan emitted ≥1 rows (positive) or exactly 0 rows (negated). It replaces the `Filter(Exists(...))` semantics with a dedicated operator. Pending: convert the nested-loop semi-apply into a hash semi-join when there are >N rows.
* **`PatternList { input, subplan, projection, alias }`**: materializes one `RuntimeValue::List` per outer row. For each row, it runs `subplan` parameterized by the row, evaluates `projection` over each inner row, collects the results into a list, and binds it to `alias` in the outer row. It is the lowering of `[(pattern) WHERE p | proj]` when that appears as a top-level projection item.

### Additional lowering rules

* **WHERE with EXISTS**: decompose the AND-tree of the predicate. Each term that is `Exists(pattern)` or `NOT Exists(pattern)` is extracted into a `SemiApply` chained on top of the input plan; the residual terms are rebuilt as a `Filter` above the chain. Cases not supported in v0 (`Exists` inside `OR`, `CASE`, double negation, etc.) → `UnsupportedFeature`.
* **Top-level pattern comprehension**: hoist it into a `PatternList` with a synthetic alias `__pcN`, and substitute the comprehension expression with `Variable(__pcN)` in the projection item.
* **Aggregate nesting** (e.g.
`head(collect(x))`): the lowering recursively walks each item expression, hoists each aggregate function call into a synthetic alias `__aggN` with the corresponding `AggregateExpr`, and substitutes the call with `Variable(__aggN)`. Group keys = the items that contain no `__aggN`. Items with aggregate nesting are evaluated over the post-Aggregate row.
* **`RETURN *` / `WITH *`**: expand `ExpressionKind::Star` into one projection item per named binding visible in `LowerCtx` (skipping `__anon*`). Closes RFC-004 Q1.
* **Back-reference from the head pattern**: when `(a)` reuses a binding already in scope and there is no input plan, emit `Argument { bindings: [a] }` instead of `Empty`. This lets a `SemiApply` / `PatternList` subplan receive the outer binding when it runs.

### Still out of scope (pending for future versions)

* Pattern comprehensions nested inside scalar functions (`size([(a)-[]->(b)|b.name])`).
* `EXISTS` outside the AND-root of the WHERE (inside OR/CASE/etc.).
* Path bindings (`p = (a)-[*]->(b)`) + path materialization.
* Write clauses (CREATE/MERGE/SET/REMOVE/DELETE).

## Open questions

* **Q1: ~~Pattern predicates as sub-plans.~~** ✅ Closed via `SemiApply` + `Argument`. The hash semi-join optimization remains pending.
* **Q2: Variable-length patterns without a variable-length operator.** Initially we can pass `length: Option` to `Expand` and let the executor run `length.min..=length.max` iterations. That works but does not scale. Should variable-length become an explicit, separate operator (`Traverse`) in the future, with WCOJ? Probably yes.
* **Q3: Path materialization.** `MATCH p = (a)-[*]->(b)` requires `p` to be materializable as a List. Deferred.
* **Q4: ~~`WITH *` and `RETURN *`.~~** ✅ Closed via `expand_star_items` in the lowering.
* **Q5: Hoisting nested pattern comprehensions.** Today only top-level comprehensions in projection items are hoisted. Nested hoisting requires planning the evaluation order and bookkeeping of intermediate scopes. Deferred.
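The nested-loop `SemiApply` semantics from the addendum (keep the outer row iff the subplan emits at least one row, or exactly zero when negated) can be sketched generically. This is a hedged illustration under stated assumptions: `semi_apply` is a hypothetical free function, the row type is left generic, and a closure that returns the number of rows emitted stands in for running the subplan with the outer row.

```rust
/// Hypothetical nested-loop semi-apply.
/// `subplan` stands in for executing the subplan parameterized by the
/// outer row; it returns how many rows the subplan emitted.
fn semi_apply<R, F>(input: Vec<R>, mut subplan: F, negated: bool) -> Vec<R>
where
    F: FnMut(&R) -> usize,
{
    input
        .into_iter()
        .filter(|row| {
            // The subplan runs once per outer row (no caching).
            let matched = subplan(row) >= 1;
            // Positive form keeps matches; negated form keeps non-matches.
            matched != negated
        })
        .collect()
}
```

With integer rows and a subplan that "matches" even numbers, the positive form keeps the even rows and the negated form keeps the odd ones. The per-row re-execution visible in the closure call is the cost a hash semi-join would remove.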
## References

* DuckDB logical/physical plans (architecture notes in the DuckDB repo).
* Kuzu morsel-driven execution: Boncz et al., CIDR 2024 paper.
* Materialize/Differential Dataflow operators: McSherry et al., 2013.
* Volcano model: Goetz Graefe, “Volcano—An Extensible and Parallel Query Evaluation System”, IEEE TKDE 1994.
* Cypher openCypher 9 §Section 3 (Linear queries semantics).
* GQL ISO/IEC 39075:2024 §17 (Linear queries) and §18 (Composite queries).

# RFC 009: Write clauses + execution model

> **Status:** accepted **Author(s):** Matías Fonseca **Supersedes:** none
> *Mirrored from [`docs/rfc/009-write-clauses.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/009-write-clauses.md) in the engine repo. Source of truth lives there.*

## Summary

The read path is complete: `parse → lower → execute` against `Snapshot` (read-only) runs MATCH / Expand / Filter / Project / TopN / Aggregate / Distinct / Union / Unwind / SemiApply / PatternList / list & pattern comprehensions / `RETURN *` / `EXPLAIN`, and the 4 representative ICs of LDBC SNB Interactive (IC2/7/8/9) produce correct results on a mini-graph.

This RFC extends the query engine to the **write path**: the clauses `CREATE`, `MERGE`, `SET`, `REMOVE`, `DELETE` (and `DETACH DELETE`) already parse and produce a valid AST, but lowering reports `UnsupportedFeature`. This RFC closes that gap.

## Motivation

Without a write path, namidb's Cypher subset is not complete: any user who wants to load data must do so directly through the Rust API `WriterSession::upsert_node/upsert_edge`. That breaks the “developer-first universal with embed + Cypher” pitch and outright blocks:

* LDBC SNB **Update queries** (IU1 insertPerson, IU2 addPostLike, IU3 addCommentLike, IU4 addForum, IU5 addForumMembership, IU6 addPost, IU7 addComment, IU8 addFriendship). Without these there is no end-to-end LDBC pipeline (load → run → measure).
* Quickstart docs (“create a node, add an edge, read it back”), which today require a `Cargo.toml` + `tokio::main` boilerplate.
* Loading from Cypher scripts (.cypher files with CREATE chains that Neo4j / Kuzu accept out of the box).

The cost of not doing it now: every LDBC IU query stays a dead letter; the benchmarks still need ad-hoc harnesses that write through the Rust API instead of the idiomatic Cypher abstraction; no external consumer can try namidb without writing Rust.

## Design

### New operators in `LogicalPlan`

```rust
pub enum LogicalPlan {
    // ... read operators ...
    Create {
        input: Box<LogicalPlan>,
        elements: Vec<CreateElement>,
    },
    Merge {
        input: Box<LogicalPlan>,
        pattern: CreateElement,
        on_match_sets: Vec<SetOp>,
        on_create_sets: Vec<SetOp>,
    },
    Set {
        input: Box<LogicalPlan>,
        items: Vec<SetOp>,
    },
    Remove {
        input: Box<LogicalPlan>,
        items: Vec<RemoveOp>,
    },
    Delete {
        input: Box<LogicalPlan>,
        targets: Vec<Expression>,
        detach: bool,
    },
}
```

Helpers:

```rust
pub enum CreateElement {
    Node {
        alias: String,
        label: String,
        properties: Vec<(String, Expression)>,
    },
    Rel {
        alias: Option<String>,
        edge_type: String,
        source_alias: String,
        target_alias: String,
        direction: RelationshipDirection,
        properties: Vec<(String, Expression)>,
    },
}

pub enum SetOp {
    Property { target_alias: String, key: String, value: Expression },
    Replace { target_alias: String, value: Expression },      // a = {...}
    Merge { target_alias: String, value: Expression },        // a += {...}
    Labels { target_alias: String, labels: Vec<String> },     // a:Label[:Label]
}

pub enum RemoveOp {
    Property { target_alias: String, key: String },
    Labels { target_alias: String, labels: Vec<String> },
}
```

`children()` returns `[input]` for the 5 new operators. `operator_name()` returns `"Create"`, `"Merge"`, `"Set"`, `"Remove"`, `"Delete"` (prefixed with `Detach` when applicable).

### Lowering rules

* **CREATE** clause without a preceding MATCH: `Empty → Create`. When there is a preceding MATCH, `Create` is chained: `... → Create { input, elements }`. The new bindings (node aliases + rel aliases) are introduced into `LowerCtx` before the next clause.
* **MERGE** clause: only one pattern part in v0. It is lowered to `Merge { input, pattern, on_match_sets, on_create_sets }`. The bindings of the pattern are introduced into `LowerCtx`.
* **SET**: each item is translated into a `SetOp`; the `Set` operator reads the binding from the row and mutates it.
* **REMOVE**: similar to SET; each `RemoveOp` is applied.
* **DELETE / DETACH DELETE**: the `targets` expressions are evaluated per row to produce a Node/Rel/Path; the operator tombstones it.
* A write-only query (no MATCH) starts from `LogicalPlan::Empty` to provide a “single driver row”. This reuses the pattern already used by UNWIND.

Output bindings: at the end of the query, the bindings of the last write clause plus those of the last read clause remain visible if a RETURN follows (Cypher 25 allows `CREATE (a:Person {name: 'Ada'}) RETURN a`).

### Executor split: read vs write

We keep two distinct entry points:

```rust
// Read-only path
pub async fn execute(
    plan: &LogicalPlan,
    snapshot: &Snapshot<'_>,
    params: &Params,
) -> Result<Vec<Row>, ExecError>;

// Write-aware path
pub async fn execute_write(
    plan: &LogicalPlan,
    writer: &mut WriterSession,
    params: &Params,
) -> Result<WriteOutcome, ExecError>;

pub struct WriteOutcome {
    pub rows: Vec<Row>,
    pub nodes_created: u64,
    pub edges_created: u64,
    pub nodes_deleted: u64,
    pub edges_deleted: u64,
    pub properties_set: u64,
}
```

`execute_write`:

1. Walks down the plan. Read operators (NodeScan/Expand/Filter/…) use the internal `writer.snapshot()` (re-pinned per clause).
2. Write operators (Create/Merge/Set/Remove/Delete) call `writer.upsert_node/upsert_edge/tombstone_node/tombstone_edge` per row.
3. At the end, **auto-commit**: `writer.commit_batch().await` runs before `WriteOutcome` is returned. This guarantees durability of the whole query as a unit.

`execute_write` stays separate from `execute` for two reasons:

* Type safety: `&mut WriterSession` and `&Snapshot<'_>` are not interchangeable.
* It lets `execute` keep running against persisted snapshots (read replicas) in SaaS without coupling it to the writer side.

### Read-your-own-writes: NOT in v0

A query such as:

```cypher
CREATE (a:Person {name: 'Ada'})
MATCH (p:Person)
RETURN p.name
```

will see rows = whatever existed before the CREATE. The new Ada is **not** visible to the MATCH. Reasons:

* Implementing intra-query visibility requires an overlay on top of Snapshot (memtable + SST + pending\_payloads). The current WriterSession already has `pending_payloads`, but they are only applied to the memtable after `commit_batch`.
* The complexity of read-your-own-writes clashes with the cluster-distributed eventual-consistency semantics that we will want in SaaS.
* The vast majority of write-then-read queries are separated by commits (interactive sessions). LDBC IU queries are monolithic but write-only.

Mitigation: once real transactional consistency is introduced, overlaying the memtable + pending payloads makes read-your-own-writes “just work”. Until then, we emit an explicit advisory warning (not a hard failure) if we detect write-then-read in the same plan tree.

### MERGE semantics

```plaintext
MERGE (n:Label {key: value})
ON MATCH SET n.lastSeen = $now
ON CREATE SET n.firstSeen = $now, n.lastSeen = $now
```

Execution:

1. Try to match the pattern (same as MATCH). If ≥1 row is found:
   * For each matched row, apply `on_match_sets`.
   * Output rows reflect the matches.
2. If 0 rows are found:
   * Create the pattern (same as CREATE).
   * Apply `on_create_sets` to the row from the CREATE.
   * Output rows reflect the creation.

Limitations in v0:

* Only one pattern part per MERGE (no multi-element). RFC-004 already rejected multi-label in the parser.
* No locks/serializability. A MERGE that runs concurrently with another writer can create duplicates; this is left for a future RFC.
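The two MERGE execution steps above (match and apply `ON MATCH`, otherwise create and apply `ON CREATE`) can be sketched against an in-memory store. This is a minimal illustration, not the engine's executor: `Node`, `MergeResult`, and `merge` are hypothetical, property values are plain strings, and the pattern is restricted to one label plus one key/value pair.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory node: one label plus string properties.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    label: String,
    props: BTreeMap<String, String>,
}

/// Outcome of one MERGE: how many nodes matched, or that one was created.
#[derive(Debug, PartialEq)]
enum MergeResult {
    Matched(usize),
    Created,
}

/// MERGE (n:label {key: value}) with ON MATCH / ON CREATE property sets.
fn merge(
    store: &mut Vec<Node>,
    label: &str,
    key: &str,
    value: &str,
    on_match: &[(&str, &str)],
    on_create: &[(&str, &str)],
) -> MergeResult {
    // Step 1: try to match; apply the ON MATCH sets to every matched node.
    let mut matched = 0;
    for node in store.iter_mut() {
        if node.label == label && node.props.get(key).map(String::as_str) == Some(value) {
            for (k, v) in on_match {
                node.props.insert(k.to_string(), v.to_string());
            }
            matched += 1;
        }
    }
    if matched > 0 {
        return MergeResult::Matched(matched);
    }
    // Step 2: nothing matched, so create the pattern and apply ON CREATE.
    let mut props = BTreeMap::new();
    props.insert(key.to_string(), value.to_string());
    for (k, v) in on_create {
        props.insert(k.to_string(), v.to_string());
    }
    store.push(Node { label: label.to_string(), props });
    MergeResult::Created
}
```

Running the same MERGE twice against an empty store creates the node on the first call and only updates `lastSeen` on the second. The sketch has no locking, which mirrors the v0 limitation: two interleaved callers could both take the create branch.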
### DETACH DELETE semantics

```plaintext
MATCH (a:Person {id: $id})
DETACH DELETE a
```

For each matched `a`, before tombstoning the node, enumerate all incident edges via `out_edges(*, a.id) + in_edges(*, a.id)` for EACH edge\_type declared in the manifest schema, and tombstone them first. Then tombstone the node.

DELETE without DETACH fails with `ExecError::Mutation` if the node has edges (with an explicit message suggesting DETACH).

### Path binding (simple case)

```rust
pub enum RuntimeValue {
    // ...
    Path(Vec<RuntimeValue>), // alternating Node, Rel, Node, Rel, ..., Node
}
```

For `MATCH p = (a)-[r]->(b) RETURN p`:

* `PatternPart.binding = Some(p)` is lowered to `Expand { ..., path_binding: Some("p") }`.
* The executor, when producing each row, materializes `[a_value, r_value, b_value]` and binds it to `p`.
* For longer chains such as `p = (a)-[r1]->(b)-[r2]->(c)`, the executor accumulates across the Expand chain.

Variable-length paths (`p = (a)-[*1..3]->(b)`) require materializing variable-length lists and are deferred.

`fingerprint_value` is extended with a `Path(items)` case so that Distinct and collect-distinct work on paths.

## Alternatives considered

**A. A single executor entry point that always takes `&mut WriterSession`.** Rejected: the Snapshot read path is clearly different from the write path (no mutation, shorter lifetime, possible read-only replica). Forcing a WriterSession on ALL reads couples the SaaS paths.

**B. Lazy commit (the caller decides when to flush).** Rejected for v0: it makes `execute_write` return a handle to a “pending transaction” and requires a formal transaction API. The rule “one query is one transaction” is predictable and sufficient for LDBC IU + quickstart.

**C. Read-your-own-writes via overlay.** Considered but deferred: the overlay on top of Snapshot requires maintaining a temporary view of “memtable + pending\_payloads + the plan's write effects accumulated so far”. It is \~300 LoC and complicates reasoning about snapshot lifetimes.
It comes back later with the transactional model.

**D. MERGE with locks.** Considered and rejected for v0: it requires coordination at the WriterSession level (single-writer per namespace already gives us serialization at the namespace level, but MERGE needs local serialization between clauses). It lives fine with LWW but introduces flakiness in tests if two writers race. As long as single-writer-per-namespace holds (which it does), MERGE is safe.

**E. Keep Create/Merge/Set/Remove/Delete as UnsupportedFeature.** Rejected: it blocks LDBC IU and the quickstart developer experience indefinitely. The opportunity cost of not having them is higher than the complexity of implementing them now.

**F. Support variable-length path bindings from the start.** Rejected: materializing a variable-length list and interacting with multi-hop `Expand` is \~150 more LoC and a considerable test surface. The simple case covers most quickstart docs; var-len stays deferred.

## Drawbacks

1. **No read-your-own-writes** breaks the expectations of users coming from Neo4j / Kuzu. Mitigation: document it explicitly in the README and return a warning in `WriteOutcome` if the pattern was detected; close the gap in the future.
2. **Auto-commit per query** does not allow multi-statement transactions. For LDBC IU it is enough (each IU is atomic by design); for more complex ETL workloads it is not. Mitigation: explicit `BEGIN TRANSACTION ... COMMIT` clauses with a session API are introduced in the future.
3. **MERGE without locks** depends on the single-writer-per-namespace invariant. If multi-tenant SaaS moves to multi-writer sharded namespaces, MERGE needs to be revisited. Documented.
4. **DETACH DELETE enumeration is O(edge\_types × incident\_edges).** For high-degree nodes (super-nodes) it can be expensive. Acceptable for v0; the optimization lives alongside the catalog of active edge\_types.
5. **`WriteOutcome` counters are approximate.** Counters are incremented for each executor operation, not for each real state change (e.g.
a `SET` of the same property to the same value counts as 1 in `properties_set` even though it is a no-op). Documented.

## Open questions

* **Q1: WriteOutcome.rows.** Does a write-only query (CREATE without RETURN) return an empty `Vec`? Cypher says yes. And with RETURN? `RETURN a` after CREATE returns the row with `a` bound. Implement it the same way as a Project on top of the Create.
* **Q2: Schema discovery via CREATE.** If CREATE introduces a new label or edge\_type, is the schema in the manifest auto-populated? RFC-002 allows an implicit schema via property names. Yes: the executor introspects the label + edge\_type and adds them if they do not exist. This requires `WriterSession` to expose a schema extension API; today commit\_batch does not touch the schema. An additional piece of work.
* **Q3: Multi-statement Cypher.** `CREATE (a) ; CREATE (b)` (with a semicolon). Today the parser accepts it as a query terminator but not as a separator between statements. Is a statement separator necessary for Cypher scripts? Deferred.

## References

* openCypher 9 §6 (Write clauses), §7 (Reading + writing clauses).
* GQL ISO/IEC 39075:2024 §19 (Linear data modifications).
* Neo4j MERGE semantics.
* Kuzu storage write path: kuzudb/kuzu README §“Bulk loading + transactions”.
* DuckDB inserts as plans.
* RFC-008 (Logical Plan IR + addendum).
* RFC-002 (SST format): schema introspection at the storage layer.

# RFC 010: Cost-Based Optimizer — Foundation

> **Status:** draft **Author(s):** Matías Fonseca **Supersedes:** none
> *Mirrored from [`docs/rfc/010-cost-based-optimizer.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/010-cost-based-optimizer.md) in the engine repo. Source of truth lives there.*

## Summary

Establishes the foundation of the cost-based optimizer (CBO) that closes the gate of LDBC SNB Interactive within 2× of Kuzu.
The scope of **this** RFC is only the **foundation**: a statistics catalog derived from the `Manifest`, a per-operator cardinality estimation routine, a predicate selectivity estimation routine, and `EXPLAIN VERBOSE` with numbers. Structural rewrites (predicate pushdown, join reorder, hash conversion of `SemiApply`/`CrossProduct`) are explicitly out of scope for this RFC; they are chained in follow-up RFCs built on this base.

This RFC is published as a **pre-implementation draft** to align on shape and formulas before locking in rewrite decisions. The concrete figures come from the LDBC SNB micro-graph fixture (6 Person / 8 Message / 4 Comment) and from the `PropertyColumnStats` / `DegreeHistogram` structures that already live in every `SstDescriptor`.

Explicitly out of scope for this RFC:

* Predicate pushdown / filter merging.
* Join-order DP/greedy over `Expand` chains and `CrossProduct`.
* Conversion of `SemiApply` → `HashSemiJoin` and of `CrossProduct` with shared bindings → `HashJoin` (see RFC-011).
* Equi-depth histograms or quantiles for precise range selectivity.
* HyperLogLog sketches actually populated by the writer (today `ndv_estimate` is always `None`: the plumbing exists, the computation does not yet).
* Adaptive / runtime cost feedback.
* PROFILE with observed post-execution cardinality.
* Multi-namespace / partition-aware cost model.

## Motivation

The initial naive executor runs any valid `LogicalPlan` correctly and deterministically. The visible problems are:

1. **Naive multi-pattern MATCH.** `MATCH (a:Person {id: $x}), (b:Message {id: $y})` lowers to `CrossProduct { NodeById, NodeById }`. Without reordering, the outer side may be the heavy one; with `SemiApply`, the outer nested loop re-executes the subplan |outer| times with no cache.
2. **EXPLAIN without numbers.** The plan tree is indented but does not say how many rows each operator expects to process. Without numbers, no rewrite has a basis for deciding “this Expand blows up to 10 K rows, so the Filter should be pushed below it”.
3.
   **`PropertyColumnStats` + `DegreeHistogram` have no consumer.** Both structures live in every `SstDescriptor`. The writer populates them with `min`/`max`/`null_count` (no HLL yet), but nothing reads them: they are dormant data.

The cost of not doing this now is:

* Later rewrites would have to invent their own inline cost model, polluting every step with stats lookups.
* Pushdown/reorder decisions would be made blind (heuristics without numbers), reproducing Neo4j's “Cypher.runtime=slotted” problem: optimizations that look reasonable but lose on real queries.
* LDBC SNB SF1 (the gate) cannot be prepared without a numeric baseline that says *where* the current plan spends its time.

Doing it now costs \~1 500 LoC (the `cost::` module, EXPLAIN VERBOSE, smoke tests) and opens the door to rewrites without a refactor.

## Design

### 1. Statistics catalog (`StatsCatalog`)

crates/namidb-query/src/cost/stats.rs

```rust
pub struct StatsCatalog {
    labels: BTreeMap,
    edge_types: BTreeMap,
    /// Total nodes across all labels, used as the denominator when
    /// estimating anonymous patterns (unknown label).
    total_nodes: u64,
    /// Total edges across all edge types, the analogue for anonymous
    /// edges.
    total_edges: u64,
}

pub struct LabelStats {
    pub name: String,
    /// Σ row_count - tombstone_count over the label's SSTs (the memtable
    /// is not included: the catalog is built from the committed Manifest).
    pub node_count: u64,
    /// Property → per-column statistics. Merged per name across
    /// all SSTs of the label.
    pub properties: BTreeMap,
}

pub struct PropStats {
    pub null_count: u64,
    pub non_null_count: u64,
    pub min: Option, // reused from storage::sst::stats
    pub max: Option,
    /// NDV decoded from the fused HLL; `None` when the writer did not
    /// populate the sketch (the default case in v0).
    pub ndv: Option,
}

pub struct EdgeTypeStats {
    pub name: String,
    /// Σ row_count - tombstone_count over the type's `EdgesFwd` SSTs.
    pub edge_count: u64,
    /// avg_degree for src → dst, derived from the fused degree_histogram.
    /// 0 when there is no `EdgesFwd` SST.
    pub avg_out_degree: f64,
    pub max_out_degree: u64,
    /// Same for EdgesInv (dst → src).
    pub avg_in_degree: f64,
    pub max_in_degree: u64,
    /// Schema-declared endpoints. `None` when there is no explicit schema
    /// (the typical case today: queries inferred the edge_type from the pattern).
    pub src_label: Option,
    pub dst_label: Option,
}
```

**Construction:**

```rust
impl StatsCatalog {
    pub fn from_manifest(m: &Manifest) -> Self;
    pub fn empty() -> Self; // fallback when the query runs without a Snapshot
    pub fn label(&self, name: &str) -> Option<&LabelStats>;
    pub fn edge_type(&self, name: &str) -> Option<&EdgeTypeStats>;
    pub fn total_nodes(&self) -> u64;
    pub fn total_edges(&self) -> u64;
}
```

**Per-label stats merge**: iterate `m.ssts`, filtering by `kind == SstKind::Nodes && scope == label`. For each `SstDescriptor`:

* `node_count += row_count - tombstone_count` (`KindSpecificStats::Nodes`).
* For each `PropertyColumnStats`:
  * `null_count += sst.null_count`.
  * `non_null_count += (row_count - tombstone_count - null_count)`.
  * `min = stat_min(self.min, sst.min)` (ordering depends on the type).
  * `max = stat_max(self.max, sst.max)`.
  * `ndv`: when SSTs carry an HLL (v1 follow-up), fuse them; in v0 → `None`.

**Per-edge\_type stats merge**: iterate `m.ssts`, filtering by `(EdgesFwd, edge_type)` and `(EdgesInv, edge_type)`:

* `edge_count = Σ row_count(EdgesFwd) - Σ tombstone_count(EdgesFwd)`.
* `avg_out_degree = sum_degree(EdgesFwd) / key_count(EdgesFwd)` (numerator and denominator are each summed across SSTs).
* `max_out_degree = max(max_degree(EdgesFwd))` across SSTs.
* Same for `EdgesInv` → `avg_in_degree`, `max_in_degree`.
* `src_label` / `dst_label`: look up `m.schema.edge_type(name)`; `None` when there is no declaration.

**Construction cost**: O(|ssts|). A typical real manifest (1 M nodes / 1 M edges on R2) has \~10² SSTs, so this takes microseconds.
For LDBC SF1 (\~3 M nodes / 17 M edges) there will be \~10³ SSTs, which is still sub-millisecond. The catalog is built **once per `Snapshot`** and reused for every optimization of the plan; it is not on the hot path.

**Edge case: empty schema + zero SSTs (ephemeral CLI `namidb run`):** the catalog returns `LabelStats::empty()` for any requested label. Cardinality falls back to the default (see §3.4) and EXPLAIN VERBOSE marks the node with `(no stats)`. This lets `namidb explain --verbose` work with no data loaded, which is useful for debugging the plan shape.

### 2. Predicate selectivity (`cost::selectivity`)

A pure function: given an `Expression`, a `LabelStats` (or a table of `LabelStats` per alias) and an optional type map, it returns the expected fraction of rows that satisfy the predicate.

```rust
pub fn selectivity(
    expr: &Expression,
    bindings: &BindingStats,
) -> f64;

pub struct BindingStats<'a> {
    /// alias → LabelStats. None when the alias is not bound to a
    /// known label (Argument / synthetic Project / etc).
    pub by_alias: BTreeMap,
}
```

**Rules (v0):**

| Predicate | Estimate |
| --- | --- |
| `prop = literal` | `1 / ndv(prop)` when an HLL is available; fallback `0.1` (10 %). |
| `prop <> literal` | `1 - eq_sel(prop, literal)`. |
| `prop < literal` | range over `[min, max]` when min/max exist and the type is numeric; fallback `0.33`. |
| `prop <= literal` / `prop > literal` / `prop >= literal` | same treatment as `<`. |
| `prop BETWEEN low AND high` | two-sided range; fallback `0.25`. |
| `prop IN [list]` | `min(1, len(list) / ndv)`; fallback `min(1, len(list) * 0.1)`. |
| `prop IS NULL` | `null_count / (null_count + non_null_count)`; fallback `0.05`. |
| `prop IS NOT NULL` | `1 - is_null_sel`. |
| `prop STARTS WITH 'p'` | `0.1` (no tries, no index). |
| `prop CONTAINS 'p'` / `ENDS WITH 'p'` | `0.1`. |
| `prop LIKE 'pattern'` (not supported) | n/a.
| `__label_eq(alias, L)` | folded before the Filter; always `1.0` (the operator already guarantees it). |
| `AND` | product: `sel(left) * sel(right)`. Assumes independence. |
| `OR` | union: `sel(left) + sel(right) - sel(left)*sel(right)`. |
| `NOT (pred)` | `1 - sel(pred)`. |
| `XOR` | `sel(left) + sel(right) - 2*sel(left)*sel(right)`. |
| Any other case | `0.5` (unknown). |

**Independence**: columns are assumed independent, the classic Selinger ‘79 assumption. It is a fallback, not a theorem; correlated selectivities are left for later (multi-column histograms).

**Ranges**: for `prop < lit` and a numeric `PropStats { min, max }`, `sel = clamp01((lit - min) / (max - min))`. If `min == max`, it returns `1.0` when `min < lit` and `0.0` otherwise (degenerate column).

**Non-comparable types** (e.g. `min: Utf8`, `lit: Int64`): the selector falls back to `0.33`. Selectivity never propagates errors: robustness over accuracy.

**Table rationale**: the defaults in the right-hand column follow the PostgreSQL folklore (`default_statistics_target=100`) calibrated for OLTP queries, not because they are “true” but because they are the lesser evil in the absence of real stats. In particular, the `0.1` for `eq` is the “aggressive selectivity” that prefers index-friendly plans when in doubt. They are documented here so they can be audited later.

### 3. Per-operator cardinality estimation

A pure function over the tree:

```rust
pub fn estimate(plan: &LogicalPlan, catalog: &StatsCatalog) -> Cardinality;

pub struct Cardinality {
    /// Estimated rows emitted by this node.
    pub rows: f64,
    /// Cardinality of the inputs, in the same order as `plan.children()`.
    pub children: Vec,
    /// Bindings the operator leaves "alive" downstream, together with the
    /// associated `LabelStats` when known. Inherited by the parent.
    pub bindings: BTreeMap,
}

pub struct BindingMeta {
    /// When the binding is bound to a node of a known label,
    /// we reference that LabelStats by name. (We do not nest the
    /// borrow because `Cardinality` is owned.)
    pub label: Option,
    /// When the binding is an edge.
    pub edge_type: Option,
}
```

#### 3.1 Leaf operators

| Operator | Cardinality |
| --- | --- |
| `Empty` | `1.0` (single driver row; consistent with `RETURN 1+1` returning 1). |
| `Argument { bindings }` | `1.0` (placeholder for the outer side; always exactly one row). |
| `NodeScan { label }` | `catalog.label(label).node_count` (0 when there are no stats). |

#### 3.2 Single-input operators

| Operator | Cardinality |
| --- | --- |
| `NodeById { input, .. }` | `min(input.rows, 1.0)` when `input.rows >= 1`. If `input` is `Empty`, `1.0`. Point lookup; assumes a typical hit. |
| `Expand { input, edge_type, direction, optional }` | `input.rows * branch_factor(edge_type, direction)` + `if optional && branch == 0 { input.rows }`. |
| `Filter { input, predicate }` | `input.rows * selectivity(predicate, bindings)`. |
| `Project { input, distinct: false }` | `input.rows` (projection does not change cardinality). |
| `Project { input, distinct: true }` | `dedup_estimate(input)`: `min(input.rows, Π ndv(item))` when the items are props with NDV; fallback `input.rows^0.7`. |
| `Distinct { input }` | `dedup_estimate(input)`. |
| `Aggregate { input, group_by, .. }` | if `group_by.is_empty()`: `1.0`. Otherwise: `Π ndv(group_by_i)` capped at `input.rows`; fallback `input.rows ^ 0.5`. |
| `TopN { input, skip, limit, .. }` | `min(input.rows - skip, limit)` clamped to `[0, input.rows]`. |
| `Unwind { input, list }` | `input.rows * avg_list_length(list)`: for `list = Literal::List(xs)` use `xs.len()`; for `Parameter` or `Variable` use the default `5.0`. |
| `PatternList { input, subplan, .. }` | `input.rows` (emits one row per outer row; the list is a value, not rows).
| `Argument`-like wrappers | identity. |

#### 3.3 Two-input operators

| Operator | Cardinality |
| --- | --- |
| `CrossProduct { left, right }` | `left.rows * right.rows`. If the sides share a binding, a later rewrite converts it to `HashJoin` and the formula changes. |
| `Union { left, right, all: true }` | `left.rows + right.rows`. |
| `Union { left, right, all: false }` | `dedup_estimate(left + right)`, approximated as `max(left.rows, right.rows) + 0.5 * min(...)`. |
| `SemiApply { input, subplan, negated: false }` | `input.rows * min(1.0, subplan.rows)`: naive match probability. |
| `SemiApply { input, subplan, negated: true }` | `input.rows * max(0.0, 1.0 - subplan.rows)`. |

#### 3.4 `branch_factor(edge_type, direction)` for `Expand`

When `edge_type` is declared:

```plaintext
if direction == Right (out):
    branch = catalog.edge_type(et).avg_out_degree
elif direction == Left (in):
    branch = catalog.edge_type(et).avg_in_degree
elif direction == Both:
    branch = avg_out_degree + avg_in_degree
```

When `edge_type` is `None` (anonymous `-[]-`): sum over all edge\_types, `Σ avg_*_degree`. The default fallback is `2.0` when there are no stats.

For `Expand { length: Some(l) }` (variable-length): `branch ^ l.max`, capped at `MAX_VARLEN_BRANCH = 10_000` (so that `*1..6` does not blow the estimate up to infinity on dense graphs). This is the naive DuckDB-graph formula; a Markov / random-walk improvement is future work alongside WCOJ.

#### 3.5 Write operators

`Create/Merge/Set/Remove/Delete` return `0.0` rows (the executor does not emit tuples; it emits a `WriteOutcome`). Their `input` keeps its cardinality for EXPLAIN VERBOSE, but the write operator itself is a “sink”.

#### 3.6 Bindings and inheritance

* `NodeScan { label, alias }` introduces `alias → BindingMeta { label, .. }`.
* `NodeById` does the same.
* `Expand { target_alias, target_label, .. }` introduces `target_alias` with `label = target_label` when the lowering declared it.
* `Project { distinct, items, discard_input_bindings: true }` replaces the binding set with the aliases of the projection (with no associated LabelStats, unless the item is `Variable(x)` with `x` a pre-existing alias).
* `Project { discard_input_bindings: false }` (WITH) merges: it adds the item aliases on top of the inherited ones.
* The remaining operators (`Filter/TopN/Distinct/Unwind/...`) inherit bindings unchanged.

### 4. `EXPLAIN VERBOSE`

A new function `explain_verbose(plan, catalog) -> String` extends `explain(plan)` with the estimated cardinality per node and the total cost.

**Format:**

```plaintext
TopN keys=[m.creationDate DESC, m.id ASC] limit=20 (est=20)
  Project [...] (est=20)
    Expand source=p edge_type=KNOWS dir=-> target=friend (est=180)
      NodeById label=Person id=$personId (est=1)
        Empty (est=1)
```

**Conventions:**

* `(est=N)` rounds the `f64` to a positive integer (ceil when `0 < x < 1`, so that an operator that does emit rows is never shown as “est=0”).
* For nodes whose `LabelStats` does not exist in the catalog, `(no stats)` is appended after `(est=...)`.
* The root header includes the total: `# Estimated rows: N` before the tree.

**Total cost (informational):** Σ rows over all nodes. It is not a cost in the strong sense (it does not separate CPU from IO); it is a baseline for comparing plans before and after a rewrite. Future iterations will refine it if needed.

**Parser**: `EXPLAIN VERBOSE `. Syntax:

* `EXPLAIN` without VERBOSE: current behavior (no numbers).
* `EXPLAIN VERBOSE`: adds a `Query.explain_verbose: bool` flag (in addition to the existing `explain: bool`). `Display for Query` round-trips.
* `EXPLAIN VERBOSE` wants stats; when invoked without a Snapshot (ephemeral CLI), it uses `StatsCatalog::empty()` and every node is marked `(no stats)`. This is not an error.

### 5. CLI integration

`namidb explain --verbose ` enables the flag.
The query string does not need the `VERBOSE` prefix either (`--verbose` injects it). `namidb run ` remains read/write as it is today; the cost model does **not** affect execution in this version (there are no rewrites yet).

### 6. Public crate API

crates/namidb-query/src/cost/mod.rs

```rust
pub mod stats;
pub mod selectivity;
pub mod cardinality;

pub use stats::{StatsCatalog, LabelStats, EdgeTypeStats, PropStats};
pub use selectivity::selectivity;
pub use cardinality::{estimate, Cardinality, BindingMeta};

// Re-exports from lib.rs
pub use crate::cost::{StatsCatalog, estimate};
pub use crate::plan::explain_verbose;
```

## Alternatives considered

### A. Infer stats from the first `scan_label` (lazy)

Build the catalog every time the optimizer touches an operator with an unknown label. Rejected: it pays the IO cost several times over when two branches of the plan refer to the same label, and it breaks the invariant “the whole plan is optimized before execution starts” (needed for pushdown correctness).

### B. Cost in BigDecimal / fixed-point

`f64` can accumulate rounding error in plans with 10+ operators. Rejected: the relative error of an f64 over 10 operations is around \~10^-13, several orders of magnitude below the model error (assuming independence alone introduces 10–50 %). The PostgreSQL / DuckDB folklore uses f64; there is no need to invent a problem that does not exist.

### C. Sketch-only (no min/max), HLL-everywhere

Makes ranges impossible. Rejected: numeric ranges over `creationDate` appear in 7/14 LDBC IC queries; without min/max the filter estimate collapses to the 0.33 fallback and the optimizer becomes useless on date-bounded queries.

### D. Byte-based cost model (DuckDB-style “rows × width”)

Multiply `rows` × `avg_row_bytes` to get something close to IO. Rejected for now: the naive executor keeps everything in memory; there is no disk spill or vectorization where width matters. With morsels and Arrow vectorization it will matter, and the model gets refined then.

### E. Manifest side reports a prebuilt `StatsCatalog`

Move `from_manifest` into the `namidb-storage` crate. Rejected: the catalog is consumed by the query layer; keeping it in `namidb-query` preserves separation of concerns and lets the storage lib stay agnostic of PropStats with NDV (which is a query concept). Storage exposes `Manifest`, `SstDescriptor`, `PropertyColumnStats`, `DegreeHistogram`: primitives, not aggregates.

### F. Pre-build the catalog when the manifest is loaded

`Snapshot::new` could build the `StatsCatalog` and expose it via `Snapshot::stats()`. Considered and **deferred**: it would require exporting the type across crates. For now the caller (executor or CLI) builds the catalog from `snapshot.manifest().manifest`. The `from_manifest(&Manifest)` API stays pure.

## Drawbacks

1. **HLL not populated → eq selectivity is always 0.1.** Today the writer does not emit sketches, so for `prop = literal` the optimizer uses the fallback even when min/max exist. Acceptable for v0; real HLL goes in a follow-up (writer side \~200 LoC, cost side zero).
2. **`avg_degree` is a mean, not a median.** Power-law distributions (typical of social graphs: LDBC SNB has an exponent of \~2.3) make the mean misleading: a fan-out of 100 K on one super-node raises the average even though most nodes are nowhere near it. `degree_histogram` is available today but is not used in the formula (the log₂ buckets are there for future percentile-based join ordering). Documented.
3. **Selectivity assumes independence between columns.** In LDBC SNB, `Person.firstName` and `Person.lastName` are highly correlated with `id`; a `WHERE firstName='Alice' AND lastName='Smith'` can be far more selective than the product. Multi-column stats will be introduced later.
4. **No sample-based cardinality.** PostgreSQL and CockroachDB sample columns to build histograms. NamiDB does not: the writer does not sample and the cost path does not invoke sampling. It arrives later with the morsel executor, where sampling is \~free.
5.
   **Stats live in the committed manifest → the memtable is not included.** Queries that run against a `Snapshot` with an active memtable (the normal single-writer case) use manifest estimates that do not count unflushed rows. When the writer is quiet this is \~OK; under active ingest the catalog underestimates. Acceptable for v0: the typical writer flush cadence is ≤1 GB of memtable, so the underestimate is bounded. In the future the vectorized executor will add live `memtable_stats`.
6. **`Cardinality` mirrors the plan tree.** Instead of mutating `LogicalPlan` with inline annotations, a parallel `Cardinality` tree is returned. It costs \~2× the plan's memory but keeps `LogicalPlan` immutable (other consumers, such as EXPLAIN, the executor and a future PROFILE, do not have to filter out the annotations). An explicit trade-off.

## Open questions

* **OQ1.** Should selectivity be an `f64` or a `Probability` (a wrapper type clamped to \[0,1])? Today it is `f64`; v1 will consider a wrapper if an overflow bug shows up.
* **OQ2.** Should `StatsCatalog::from_manifest` be `async` (in case it later reads HLL sketches from a side-car)? For now it stays synchronous: everything it needs is inline in the manifest. If an HLL side-car lands, the API breaks and gets reworked.
* **OQ3.** Should total cost be `Σ rows` (estimate-based) or `Σ rows × per-operator-weight` (CPU model)? For now the first is used. The second arrives once real per-operator cost is measured in the morsel executor.
* **OQ4.** How to express “shared bindings between the sides of a CrossProduct” in the model. Today `CrossProduct` cardinality is `L × R`. Once `HashJoin` is introduced, the target is something like `(L × R) / max(ndv(shared_key, L), ndv(shared_key, R))`. `BindingMeta` already carries the alias; what is missing is access to the binding's PropStats from Cardinality.

## References

* Selinger et al., *Access Path Selection in a Relational Database Management System* (SIGMOD ‘79): origin of the cost-based optimizer and of the 0.1 fallback.
* Heimel et al., *Hardware-Oblivious Parallelism for In-Memory Column-Stores*: modern defaults for selectivity without an index.
* PostgreSQL `default_statistics_target` documentation: source of the numeric fallbacks.
* Kuzu paper (Mhedhbi & Salihoglu, SIGMOD ‘23): cardinality estimation for graph join enumeration via WCOJ; reference for future work.
* DuckDB CBO blog series (Mark Raasveldt 2023): use of stats inline in Parquet for skipping and join ordering; the same pattern as here.
* HyperLogLog++ paper (Heule et al., EDBT ‘13): sketch format for when the writer side lands.
* `docs/rfc/008-logical-plan-ir.md`: operators that this RFC annotates.
* `docs/rfc/009-write-clauses.md`: write ops that return 0 rows.

## Implementation plan

1. Crate `namidb-query`:
   * `src/cost/mod.rs`: re-exports.
   * `src/cost/stats.rs`: `StatsCatalog`, `LabelStats`, `EdgeTypeStats`, `PropStats` + `from_manifest`. \~300 LoC + 6-8 unit tests.
   * `src/cost/selectivity.rs`: `selectivity(expr, bindings) -> f64`. \~250 LoC + 10-12 unit tests (eq/range/IN/AND/OR/NOT/IS NULL/STARTS WITH/fallback).
   * `src/cost/cardinality.rs`: `estimate(plan, catalog) -> Cardinality`. \~350 LoC + 8-10 unit tests covering every operator.
2. `src/plan/explain.rs`:
   * `explain_verbose(plan, catalog) -> String`. \~80 LoC + 5 tests.
3. `src/parser/grammar.rs`:
   * Recognize `VERBOSE` as a soft keyword after `EXPLAIN`.
   * `Query.explain_verbose: bool`. Display round-trips.
   * \~30 LoC + 3 tests.
4. CLI:
   * `namidb explain --verbose `. \~15 LoC.
5. Integration tests:
   * `tests/cost_smoke.rs`: micro-graph → `StatsCatalog::from_manifest`, `estimate(plan)` vs `execute(plan).len()`. Document the gap. 6-8 tests covering IC2/IC7/IC8/IC9 + a filter selectivity sweep.

Expected snapshot:

* `cargo test --workspace --exclude namidb-py`: 348 → \~390 passed.
* `cargo clippy --workspace --all-targets -- -D warnings`: clean.
* `cargo fmt --all -- --check`: clean.
* New LoC: \~1 500 src + \~500 tests.
* No changes in `namidb-storage` (consumer-only).

# RFC 011: Predicate Pushdown + Filter Normalization

> **Status:** draft **Author(s):** Matías Fonseca AND split + literal folding + `__label_eq` elimination) **Builds on:** RFC-010 (cost model foundation) **Supersedes:** —

> *Mirrored from [`docs/rfc/011-predicate-pushdown.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/011-predicate-pushdown.md) in the engine repo. Source of truth lives there.*

## Summary

The first structural rewrite of the optimizer: it pushes every predicate of the `LogicalPlan` as close as possible to the leaf operators (NodeScan, NodeById, Argument, Empty), reducing the cardinality that the expensive operators (`Expand`, `CrossProduct`, `SemiApply`, `PatternList`) have to process.

The scope is **only predicate pushdown at the `LogicalPlan` level**: the rewriter manipulates the tree structure; it does **not** push predicates into the storage layer (Parquet predicate pushdown is deferred), does **not** reorder joins, and does **not** convert `SemiApply`/`CrossProduct` into hash joins.

Alongside the pushdown, it includes four normalizations of the Filter tree that the lowering leaves sub-optimal and that the pushdown needs in order to work:

1. **AND-split**: `Filter(a AND b AND c)` is decomposed into three conjuncts that can be pushed independently.
2. **Adjacent merge**: two consecutive `Filter` nodes left after pushdown are fused into one with AND, to minimize nodes in EXPLAIN.
3. **Literal fold**: `Filter(true)` is removed (replaced by its `input`); `Filter(false)` is preserved (it is not replaced by `Empty` because the plan may still require bindings with no rows; the executor handles 3VL).
4.
   **`__label_eq` cleanup**: the `Filter(__label_eq(target, L))` that the lowering defensively injects above an `Expand` with `target_label=Some(L)` is removed (the operator already guarantees the label at the storage layer).

The public contract changes: `lower(query)` stays pure (unchanged), but **`execute` / `execute_write` now apply `optimize` by default**. EXPLAIN VERBOSE shows the optimized plan; EXPLAIN RAW (new syntax) shows the literal plan produced by the lowering.

Explicitly out of scope:

* Parquet predicate pushdown into the storage layer.
* Join-order DP/greedy over `Expand` chains and `CrossProduct`.
* Conversion of `SemiApply`/`CrossProduct` with shared bindings → HashJoin (see RFC-012).
* Projection pushdown / column pruning.
* Boolean simplification beyond literal `true`/`false` (De Morgan, Karnaugh, common-subexpression).
* HLL populated by the writer (RFC-010 §“Drawbacks 1”).

## Motivation

The **foundation** of the optimizer (real stats catalog, selectivity, cardinality, EXPLAIN VERBOSE) is already in place. The visible gap is that the plan the lowering produces today is structurally naive and leaves a lot of work on the table.

A concrete example, an LDBC SNB IC-shaped query:

```cypher
MATCH (a:Person)-[:KNOWS]->(b:Person)
WHERE a.age > 30 AND b.firstName = 'Alice'
RETURN b.id
```

The lowering produces:

```plaintext
Project [b.id]
  Filter (a.age > 30 AND b.firstName = 'Alice')
    Filter (__label_eq(b, "Person"))
      Expand source=a edge_type=KNOWS dir=-> target=b
        NodeScan label=Person alias=a
```

On the LDBC micro-graph (6 Person, avg\_degree=1, age:\[25,40]), this expands 6 Person × 1 = 6 pairs before filtering. That is small, but the shape is the same on SF1 (3 M Person, avg\_degree≈30): 90 M pairs materialized before filtering down to \~6.7 % (`age > 30 ⇒ sel≈0.67`, `firstName='Alice' ⇒ sel≈0.10`, so `0.67 × 0.10 ≈ 0.067` in total).
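The selectivity figures in this example can be reproduced with the v0 rules from RFC-010 §2. What follows is only a minimal sketch: `gt_selectivity`, `and_selectivity`, `surviving_rows` and `EQ_FALLBACK` are hypothetical names used for illustration, and treating `>` as the complement of the `<` range formula is an assumption (RFC-010 only says `>` gets "the same treatment" as `<`).

```rust
/// Fallback selectivity for `prop = literal` when no NDV sketch exists
/// (the RFC-010 v0 default).
const EQ_FALLBACK: f64 = 0.1;

/// Range selectivity for `prop > lit` on a numeric column with known
/// `[min, max]`. Hypothetical helper: assumes `>` is the complement of
/// the `clamp01((lit - min) / (max - min))` rule given for `<`.
fn gt_selectivity(lit: f64, min: f64, max: f64) -> f64 {
    if max == min {
        // Degenerate column: every row holds the same value.
        return if min > lit { 1.0 } else { 0.0 };
    }
    ((max - lit) / (max - min)).clamp(0.0, 1.0)
}

/// AND combines selectivities by product (independence assumption).
fn and_selectivity(left: f64, right: f64) -> f64 {
    left * right
}

/// Estimated rows that survive a filter over `rows` input rows.
fn surviving_rows(rows: f64, selectivity: f64) -> f64 {
    rows * selectivity
}
```

With `age` in `[25, 40]`, `gt_selectivity(30.0, 25.0, 40.0)` is 10/15 ≈ 0.67; combined with `EQ_FALLBACK` through `and_selectivity` it gives ≈ 0.067, so of the 3 M × 30 = 90 M expanded pairs roughly 6 M are expected to pass the filter.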
With pushdown:

```plaintext
Project [b.id]
  Expand source=a edge_type=KNOWS dir=-> target=b
    Filter (a.age > 30)
      NodeScan label=Person alias=a
(Filter b.firstName = 'Alice' stays on top: it references target_alias)
```

The Filter on `a` moves below the Expand (3 M Person → \~2 M); the filter on `b` stays where it is by structure (it references the alias introduced by the Expand). The Expand processes \~2 M × 30 ≈ 60 M pairs instead of 90 M: **about 1.5× less work processed**, without touching storage.

The cost of not doing this:

* Every LDBC query with a compound WHERE pays for a structurally sub-optimal plan. The SF1 gate becomes unreachable.
* Join reorder and hash conversion operate on the filter-optimized plan; without prior pushdown, the reorder algorithms see `CrossProduct + combined Filter` instead of `Filter ⇒ Subtree`, which obscures the join graph.
* EXPLAIN VERBOSE today shows an `(est=N)` that is mathematically correct but does not reflect the plan the engine could run: the number is misleading because the plan is badly structured.

Doing it now costs \~1 500 LoC src + \~700 LoC tests and unblocks join reorder and hash conversion.

## Design

### 1. Public API

crates/namidb-query/src/optimize/mod.rs

```rust
/// Apply the full optimizer pipeline to `plan`. Idempotent — calling
/// `optimize(optimize(p, c), c)` returns a structurally identical plan.
///
/// Today the pipeline consists of `predicate_pushdown` followed by
/// `normalize_filters` (AND-split, adjacent-merge, literal fold,
/// `__label_eq` elimination) iterated to fixpoint (cap 8 rounds).
pub fn optimize(plan: LogicalPlan, catalog: &StatsCatalog) -> LogicalPlan;

/// Push every `Filter` predicate as close to the leaves as possible.
/// Splits AND-conjunctions and dispatches each conjunct independently
/// based on the aliases it references.
pub fn predicate_pushdown(plan: LogicalPlan) -> LogicalPlan;

/// Tidy the Filter tree: merge adjacent Filters into a single AND,
/// fold `Filter(true)` away, drop the `__label_eq` defensive filter
/// when the immediate child is an Expand already constraining the
/// target label.
pub fn normalize_filters(plan: LogicalPlan) -> LogicalPlan;
```

`StatsCatalog` is accepted so that the pipeline can use estimates for future decisions (the current `predicate_pushdown` does not need it, since it is purely structural, but it is already in the signature to avoid breaking the contract when a future rewrite needs it).

```rust
// crates/namidb-query/src/lib.rs (new)
pub use optimize::{optimize, predicate_pushdown, normalize_filters};

/// Convenience: lower + optimize. Used by the executor and EXPLAIN
/// VERBOSE by default. Tests that want the raw lowering should call
/// `lower(query)` directly.
pub fn plan(query: &Query, catalog: &StatsCatalog) -> Result {
    Ok(optimize(lower(query)?, catalog))
}
```

`execute(plan, snapshot, params)` and `execute_write(plan, writer, params)` do not change: they still accept a ready `LogicalPlan`. The change is at the **call sites** (CLI, walker bench, integration tests): where they used to do `let p = lower(&query)?;`, they now do `let p = plan(&query, &catalog)?;`. Internal tests that exercise specific operators (lowering tests, executor unit tests) keep using `lower(query)` directly.

### 2. The `predicate_pushdown` algorithm

Single-pass, top-down, with an accumulator. Each recursive call passes a `Vec` of pending predicates that the caller wants to push down. Each plan node decides which ones it can absorb and which ones it hands back to its parent through a `Filter` materialized on top.

```rust
fn pushdown_at(plan: LogicalPlan, pending: Vec) -> LogicalPlan {
    match plan {
        // Leaf nodes: apply pending on top and stop.
        LogicalPlan::Empty
        | LogicalPlan::Argument { .. }
        | LogicalPlan::NodeScan { ..
        } => apply_filters(plan, pending),

        // Filter node: decompose and propagate.
        LogicalPlan::Filter { input, predicate } => {
            let mut acc = pending;
            for term in split_and_terms(&predicate) {
                acc.push(term);
            }
            pushdown_at(*input, acc)
        }

        // Operators that introduce aliases: partition pending by
        // alias set and propagate what is pushable; the rest stays on top.
        LogicalPlan::Expand { /* ... */ } => { /* §2.1 */ }
        LogicalPlan::NodeById { /* ... */ } => { /* §2.2 */ }
        LogicalPlan::CrossProduct { /* ... */ } => { /* §2.3 */ }
        LogicalPlan::Project { /* ... */ } => { /* §2.4 */ }
        LogicalPlan::Aggregate { /* ... */ } => { /* §2.5 */ }
        LogicalPlan::Union { /* ... */ } => { /* §2.6 */ }
        LogicalPlan::Unwind { /* ... */ } => { /* §2.7 */ }
        LogicalPlan::SemiApply { /* ... */ } => { /* §2.8 */ }
        LogicalPlan::PatternList { /* ... */ } => { /* §2.8 */ }

        // Barriers: Distinct / TopN are not crossed because they change
        // cardinality in a way that makes the pre/post filter not
        // semantically equivalent.
        LogicalPlan::TopN { /* ... */ }
        | LogicalPlan::Distinct { /* ... */ } => { /* §2.9 */ }

        // Writes are barriers: pending stays on top, recurse only into
        // their input with an empty pending list.
        LogicalPlan::Create { /* ... */ }
        | LogicalPlan::Merge { /* ... */ }
        | LogicalPlan::Set { /* ... */ }
        | LogicalPlan::Remove { /* ... */ }
        | LogicalPlan::Delete { /* ... */ } => { /* §2.10 */ }
    }
}
```

#### 2.1 `Expand { source, target_alias, rel_alias, target_label, optional, .. }`

The `Expand` introduces `target_alias` and optionally `rel_alias`. A predicate can be pushed to the input if and only if it **does not reference any alias introduced by the Expand**.

The distinction between `optional` and non-optional **does not affect pushability**: the rule is structural. What does change under `optional` is the **shape of the Filter that stays on top**: a predicate on `target_alias` after an OPTIONAL match already evaluates 3VL against `NULL` correctly, so its semantics do not need to be inverted.
(The lowering, moreover, folds the property/label filters INSIDE the Expand when `optional=true`, so the classic situation "`Filter(b.x > 0) ⇒ OptionalExpand(target=b)`" only occurs with an explicit user WHERE, which is the correct 3VL case.)

```rust
let introduced: BTreeSet = {
    let mut s = BTreeSet::new();
    s.insert(target_alias.clone());
    if let Some(r) = &rel_alias {
        s.insert(r.clone());
    }
    s
};
let (pushable, stay) = pending.into_iter()
    .partition(|e| expression_aliases(e).is_disjoint(&introduced));
let new_input = pushdown_at(*input, pushable);
let new_expand = LogicalPlan::Expand {
    input: Box::new(new_input),
    source, edge_type, direction, rel_alias, target_alias,
    target_label, length, optional,
};
apply_filters(new_expand, stay)
```

#### 2.2 `NodeById { input, alias, .. }`

Introduces `alias`. Identical to `Expand` but with a single-element set.

#### 2.3 `CrossProduct { left, right }`

Each conjunct can go to `left`, go to `right`, or stay on top:

```rust
let left_aliases = produced_aliases(&left);
let right_aliases = produced_aliases(&right);
let mut to_left = Vec::new();
let mut to_right = Vec::new();
let mut keep_top = Vec::new();
for term in pending {
    let refs = expression_aliases(&term);
    let hits_left = !refs.is_disjoint(&left_aliases);
    let hits_right = !refs.is_disjoint(&right_aliases);
    match (hits_left, hits_right) {
        (true, false) => to_left.push(term),
        (false, true) => to_right.push(term),
        (true, true) => keep_top.push(term),
        (false, false) => keep_top.push(term), // constant: safe to keep on top
    }
}
```

**Mixed-side equality** (e.g. `a.x = b.y` with `a∈left, b∈right`) stays in `keep_top`. Inspecting `keep_top` to detect a **join candidate** remains a visual hint in EXPLAIN VERBOSE: we do not modify the IR or introduce a `HashJoin` (deferred).
The detection is:

```rust
fn is_join_candidate(expr: &Expression, left: &BTreeSet, right: &BTreeSet) -> bool {
    if let ExpressionKind::Binary { op: BinaryOp::Eq, left: l, right: r } = &expr.kind {
        let la = expression_aliases(l);
        let ra = expression_aliases(r);
        let l_side = la.is_subset(left) && ra.is_subset(right);
        let r_side = la.is_subset(right) && ra.is_subset(left);
        return l_side || r_side;
    }
    false
}
```

EXPLAIN VERBOSE annotates each Filter immediately above a CrossProduct with `[join candidate]` when `is_join_candidate` returns true.

#### 2.4 `Project { items, distinct, discard_input_bindings }`

An input alias survives above the Project **iff** some `items[i]` has the form `Variable(x)` with `items[i].alias == x` (identity projection without renaming). If the predicate refers only to identity-projected aliases, we can push it down. If it refers to an alias introduced by the projection (e.g. `expr AS y`), it stays on top: below the Project the alias `y` does not exist.

```rust
let preserved: BTreeSet = items.iter().filter_map(|it| {
    if let ExpressionKind::Variable(id) = &it.expression.kind {
        if id.name == it.alias {
            return Some(id.name.clone());
        }
    }
    None
}).collect();
let (pushable, stay) = pending.into_iter()
    .partition(|e| expression_aliases(e).is_subset(&preserved));
```

`WITH *` may relax this in the future. Today the rule is conservative.

#### 2.5 `Aggregate { group_by, aggregations }`

Analogous to Project, with one distinction: predicates that refer to **aggregation aliases** are semantically HAVING and never go down. For group\_by keys that are identity (`Variable(x)` with alias `x`), pushdown is fine as a pre-aggregate filter.
```rust
let preserved: BTreeSet = group_by.iter().filter_map(|(e, alias)| {
    if let ExpressionKind::Variable(id) = &e.kind {
        if id.name == *alias {
            return Some(id.name.clone());
        }
    }
    None
}).collect();
let agg_aliases: BTreeSet = aggregations.iter().map(|(a, _)| a.clone()).collect();
let (pushable, stay) = pending.into_iter().partition(|e| {
    let refs = expression_aliases(e);
    refs.is_subset(&preserved) && refs.is_disjoint(&agg_aliases)
});
```

#### 2.6 `Union { left, right, all }`

Pushable to both sides iff **all referenced aliases exist on both**. Typical case: after a Union both sides project the same schema, so a Filter over the projection goes down to both without ambiguity. If an alias is missing on one side, the predicate stays on top.

```rust
let l_aliases = produced_aliases(&left);
let r_aliases = produced_aliases(&right);
let (pushable, stay) = pending.into_iter().partition(|e| {
    let refs = expression_aliases(e);
    refs.is_subset(&l_aliases) && refs.is_subset(&r_aliases)
});
let new_left = pushdown_at(*left, pushable.clone());
let new_right = pushdown_at(*right, pushable);
```

(Cloning `pushable` is fine: predicates are usually small.)

#### 2.7 `Unwind { list, alias }`

Introduces `alias`. A predicate over `alias` stays on top; the others go down to the input.

#### 2.8 `SemiApply` / `PatternList`

Both take an `input` (outer) and a `subplan` (inner, parametrized by the outer row). The **subplan never receives pushdown** from the rewriter: these are nested scopes, and cross-scope pushdown requires correlation analysis (decorrelation), which is out of scope.

* `SemiApply`: introduces no new aliases visible above (it is a semi-join, it does not project). Pending flows entirely to `input`. The subplan is left intact.
* `PatternList`: introduces `alias` (the list value). Predicates over `alias` stay on top; the others go down to `input`.

#### 2.9 `TopN` / `Distinct` (cardinality barriers)

They are NOT crossed.
Reasons:

* `TopN limit=L`: `Filter(p) ⇒ TopN(L)` takes the top L rows and then filters them, returning ≤ L rows; `TopN(L) ⇒ Filter(p)` filters first and then takes the top L of the survivors. Different cardinalities, different rows.
* `Distinct`: for pure predicates (deterministic, no side effects) the resulting **set** is the same, but allowing the crossing forces us to verify the purity of every subexpression. It is safer to keep it as a barrier in v0.

```rust
LogicalPlan::TopN { input, keys, skip, limit } => {
    let new_input = pushdown_at(*input, vec![]);
    let new = LogicalPlan::TopN { input: Box::new(new_input), keys, skip, limit };
    apply_filters(new, pending)
}
```

#### 2.10 Write ops (`Create / Merge / Set / Remove / Delete`)

Barriers. Pending stays on top. In practice the lowering never emits a `Filter` above a write: the pattern `MATCH ... WHERE ... SET` produces `Set { input: Filter { input: ... } }`, not `Filter { input: Set { ... } }`, so the barrier is defensive.

### 3. `normalize_filters` algorithm

Bottom-up. Four rules, applied in order:

1. **Recurse into children first** (post-order).
2. **`Filter { input: Filter { input: x, predicate: p1 }, predicate: p2 }`** → `Filter { input: x, predicate: p1 AND p2 }`.
3. **`Filter { input, predicate: Literal::Boolean(true) }`** → `input`.
4. **`Filter { input: Expand { ..., target_alias=A, target_label=Some(L) }, predicate: __label_eq(A, L) }`** → `Expand { ... }` (the filter is removed).

Rule 4 also applies recursively: if, after removing the filter, another `__label_eq` is stacked below, it is removed too. Rule 2 merges the clauses that the split in pushdown left separate.

`Filter(false)` stays as is: the executor evaluates the literal predicate and discards every row. The optimizer does not convert it to `Empty` because that requires reasoning about the bindings the plan needs to introduce (e.g. for a downstream Aggregate where count(\*) = 0).

### 4. Helpers

```rust
/// Set of aliases (Variable identifiers) referenced anywhere in `expr`.
/// Property accesses contribute their target alias. Pattern subqueries
/// (`Exists`, `PatternComprehension`) and list comprehensions are
/// treated as opaque: we return ALL bindings they could possibly
/// reference, by collecting free variables in the inner expression
/// without descending into nested patterns. Conservative: when in
/// doubt, the alias set is wider, so the predicate stays higher up.
fn expression_aliases(expr: &Expression) -> BTreeSet;

/// Set of aliases that `plan` makes visible to its parent.
fn produced_aliases(plan: &LogicalPlan) -> BTreeSet;

/// AND-flatten: `a AND b AND c` → vec![a, b, c]. Used by pushdown to
/// split a compound predicate.
fn split_and_terms(expr: &Expression) -> Vec<Expression>;

/// Concatenate `terms` with binary AND, preserving source order.
/// Returns None if `terms` is empty.
fn and_chain(terms: Vec<Expression>) -> Option<Expression>;

/// If `terms` non-empty, wrap `plan` in a `Filter(AND(terms))`.
/// Otherwise return `plan` unchanged.
fn apply_filters(plan: LogicalPlan, terms: Vec<Expression>) -> LogicalPlan;
```

`produced_aliases` enumerates the aliases per operator type:

| Operator            | Produces                                          |
| ------------------- | ------------------------------------------------- |
| `NodeScan/NodeById` | `{alias}`                                         |
| `Argument`          | literal bindings                                  |
| `Expand`            | `produced(input) ∪ {target_alias, rel_alias?}`    |
| `Filter`            | `produced(input)`                                 |
| `Project`           | the `alias` of each entry in `items`              |
| `Aggregate`         | `group_by.aliases ∪ aggregations.aliases`         |
| `TopN`/`Distinct`   | `produced(input)`                                 |
| `Union`             | `produced(left) ∩ produced(right)` (schema-aware) |
| `Unwind`            | `produced(input) ∪ {alias}`                       |
| `Empty`             | `∅`                                               |
| `CrossProduct`      | `produced(left) ∪ produced(right)`                |
| `SemiApply`         | `produced(input)`                                 |
| `PatternList`       | `produced(input) ∪ {alias}`                       |
| Writes              | `produced(input) ∪ alias(elements)`               |

### 5. Fixpoint

`optimize` runs `predicate_pushdown` + `normalize_filters` in a loop until two consecutive iterations produce identical trees (`PartialEq` is already derived on `LogicalPlan`). The loop is capped at 8 rounds to prevent infinite loops in case of a bug (each round should either strictly reduce the height of the Filter tree or be idempotent, so more than 2 rounds would indicate an error). Hitting the cap is logged but does not panic.

```rust
pub fn optimize(plan: LogicalPlan, _catalog: &StatsCatalog) -> LogicalPlan {
    let mut current = plan;
    for _ in 0..8 {
        let next = normalize_filters(predicate_pushdown(current.clone()));
        if next == current {
            return next;
        }
        current = next;
    }
    current
}
```

### 6. EXPLAIN integration

#### 6.1 EXPLAIN VERBOSE shows the optimized plan

`explain_query_verbose(query, catalog)` now calls `plan(query, catalog)` internally and renders the post-optimize tree. The total estimate (header `# Estimated rows`) and the per-node `(est=…)` reflect the plan the engine would actually run. This changes the previous contract, where EXPLAIN VERBOSE showed the raw lowering; existing tests that depended on that specific shape are updated.

#### 6.2 EXPLAIN RAW (new syntax)

`EXPLAIN RAW <query>` and `EXPLAIN RAW VERBOSE <query>` show the unoptimized plan. Useful for debugging the lowering and for verifying that the optimizer did something:

```plaintext
> EXPLAIN VERBOSE MATCH (a:Person) WHERE a.age > 30 RETURN a
# Estimated rows: 2
Project [a=a] (est=2)
  Filter (a.age > 30) (est=2)
    NodeScan label=Person alias=a (est=6)

> EXPLAIN RAW VERBOSE MATCH (a:Person) WHERE a.age > 30 RETURN a
# Estimated rows: 2
Project [a=a] (est=6)
  Filter (a.age > 30) (est=2)
    NodeScan label=Person alias=a (est=6)
```

In RAW (the raw lowering) the Filter sits under the Project, and the Project estimate assumes the Filter has already filtered; but the Project operator iterates 6 rows with the Filter above being evaluated afterwards, which is exactly what the tree shows.
In the optimized plan the Filter is below the Project, so the Project iterates 2.

#### 6.3 Join-candidate annotation

When a Filter immediately above a CrossProduct contains a cross-side equality, EXPLAIN VERBOSE appends `[join candidate]` to the end of the Filter line:

```plaintext
Filter (a.name = b.name) [join candidate] (est=...)
  CrossProduct (est=...)
    Filter (a.age > 30) (est=...)
      NodeScan label=Person alias=a (est=...)
    Filter (b.age < 50) (est=...)
      NodeScan label=Person alias=b (est=...)
```

A later rewrite detects the flag and converts the shape to a HashJoin.

### 7. Parser

```text
EXPLAIN [RAW] [VERBOSE] <query>
```

* `EXPLAIN <query>`: raw lowering, without estimates.
* `EXPLAIN VERBOSE <query>`: **optimized**, with estimates.
* `EXPLAIN RAW <query>`: raw lowering, without estimates (explicit alias of the legacy behavior).
* `EXPLAIN RAW VERBOSE <query>`: raw lowering, with estimates.

`RAW` is a soft keyword recognized only between `EXPLAIN` and `VERBOSE` (or between `EXPLAIN` and the start of the query). It is not a reserved token and does not break queries with a variable named `raw`.

`Query.explain_raw: bool` is added to the AST next to the existing `explain_verbose: bool`.

### 8. CLI

```bash
namidb explain [--verbose] [--raw] 
```

* `--raw`: alias of `EXPLAIN RAW` (skips optimize).
* `--verbose`: already exists, adds VERBOSE.

If the query string already contains the prefixes, the combination is respected (flag and prefix are OR'ed).

## Alternatives considered

### A. Exhaustive Selinger-style cost-based placement

Enumerate every position where the Filter can go and pick the cheapest one. Rejected: for a plan with N operators the space is O(N) positions per predicate; with K predicates that is O(N×K). For LDBC IC with N≈10, K≈5 that is \~50 positions, which is tractable, but the best position is always "as low as possible" for pure predicates (a well-known property: predicate pushdown commutes with cardinality reduction).
Cost-based enumeration only helps when predicates have side effects (not the case in SQL/Cypher) or when correlations could favor NOT pushing down (cross-column correlation, out of scope until multi-column histograms land).

### B. Generic rewrite-rule engine (egg / datalog)

Encode the rules as declarative rewrites and let an engine apply them to fixpoint. Rejected for v0: the initial rule catalog is 4 normalizations + 1 algorithm (pushdown). A generic engine costs \~2 000 LoC of infrastructure for 4 rules; a manual rewrite is \~500 LoC. When we reach 20+ rules, we will evaluate egg.

### C. Pushdown integrated into `lower`

Have the lowering produce the optimized plan directly. Rejected: the lowering has one clear responsibility (AST → correct LogicalPlan). Mixing in the optimizer breaks unit testability and hides lowering bugs behind pushdown bugs.

### D. Filter pushdown only in `WhereClause`

Process the WHERE in `attach_where` and push down right there, before generating the Filter. Rejected: it only covers the explicit WHERE. Property filters (`{key: value}` inline in patterns) also produce Filters **above** the Expand, and pushdown needs to see them all uniformly. In addition, join reorder operates on the post-pushdown plan and needs the normalized tree.

### E. Push down to the storage layer now (Parquet predicate pushdown)

`scan_label(label, predicates: Vec<...>)` reads the Parquet row groups with min/max stats + Bloom + Bitmap pushdown. Rejected for now: it requires extending the `Snapshot` API with a neutral `ScanPredicate` type, doing the match in `parquet_loader.rs`, and adding storage-side tests. That is \~800 additional LoC that double the cost and are orthogonal to the structural rewrite. It is deferred, with this RFC as a prerequisite (the predicates are already in their ideal position when a future rewrite translates them to Parquet).

## Drawbacks

1. **Silent contract change for existing callers.** Integration tests that compare `lower(query)` against an expected tree keep working. Tests that compare `execute(plan, snapshot)` do too: the optimized plan produces the same set of rows. But tests that compare EXPLAIN VERBOSE output change (the optimized plan has a different shape). Mitigation: the existing snapshot tests in `explain.rs` are updated.
2. **`expression_aliases` is conservative with subqueries.** For `Filter(EXISTS((a)-[]-(b)) AND x > 0)`, the `EXISTS(...)` predicate contributes to the alias set ALL the aliases the pattern could reference. If the predicate has a sub-EXISTS over `a` and an `x > 0` over `a.x` (with a and x distinct), the pushdown could be finer; today it is conservative, and this remains a future improvement.
3. **We do not push through `TopN` / `Distinct`.** This is a conservative decision; the "filter over Distinct" case with a pure predicate is pushable (see §2.9), but we first want to verify purity. Trivial work, deferred.
4. **Adjacent merge fuses Filters with inconsistent spans.** The new `Filter` with the AND chain has a `span` whose extent covers the original spans (as `rebuild_and_chain` in `lower.rs` already does). For downstream error messages (e.g. an error in the executor), the span could point to a wider region than the specific conjunct that failed. Mitigation: the span of each sub-Expression inside the AND tree is preserved; the reporter uses that span, not the one on the Filter root.
5. **The optimizer runs on ALL queries, including writes.** Write ops are barriers (predicates do not go down through them), but the rewriter still visits them to process their `input`. Cost: \~O(operators). For large queries (\~100 plan nodes), that is <1 ms, negligible against query execution time.
6. **There is no way to skip the optimizer in `execute`.** If a caller needs to avoid the optimizer (e.g. to reproduce a lowering bug), it must call `lower(query)?` directly and then `execute(plan, ...)`. The function `plan(query, catalog)` is the convenient shortcut, not the only path.

## Open questions

* **OQ1.** Should `optimize` accept an `OptimizerSettings` with individual flags (`enable_pushdown`, `enable_normalize`, `enable_label_eq_cleanup`)? Not today: there is a single all-or-nothing toggle, namely whether the caller invokes `optimize` or `lower` directly. When we add more rewrites, we will evaluate a settings struct.
* **OQ2.** Should the `__label_eq` cleanup also remove the filter when the predicate's target is bound by a `NodeScan` with a declared label? Today, yes: the operator already guarantees the label via `scan_label(L)`. The extended rule covers both cases without risk.
* **OQ3.** Pure-predicate detection to open up the TopN/Distinct barriers: Cypher predicates are always side-effect-free in v0 (no external functions), so for Distinct we could push them down without verifying. Decision: stay conservative until external functions land (future RFC).

## References

* Mumick & Pirahesh, *Implementation of Magic-sets in a Relational Database System* (1994): origin of the pushdown techniques.
* Selinger et al., *Access Path Selection in a Relational Database Management System* (SIGMOD '79): the foundational cost-based optimizer.
* DuckDB *Predicate Pushdown Through Joins* (Mark Raasveldt, 2022): a modern case of pushdown over vectorized join trees.
* CockroachDB optimizer notes (Andy Kimball, 2018): pushdown through Cypher-shaped query trees.
* `docs/rfc/010-cost-based-optimizer.md`: the foundation this RFC uses.
* `docs/rfc/008-logical-plan-ir.md`: the operators this RFC rewrites.

## Implementation plan

1. **Crate `namidb-query`**:
   * `src/optimize/mod.rs`: re-exports + `optimize(plan, catalog)`.
   * `src/optimize/pushdown.rs`: `predicate_pushdown` + helpers (`expression_aliases`, `produced_aliases`, `split_and_terms`, `and_chain`, `apply_filters`). \~800 LoC + 20-25 unit tests.
   * `src/optimize/normalize.rs`: `normalize_filters` with the 4 rules. \~250 LoC + 8-10 unit tests.
   * `src/lib.rs`: re-export `plan(query, catalog)`.
2. **`src/parser/grammar.rs`**:
   * Recognize `RAW` as a soft keyword between `EXPLAIN` and `VERBOSE`/query body.
   * `Query.explain_raw: bool`. Display round-trips.
   * \~25 LoC + 3 tests.
3. **`src/plan/explain.rs`**:
   * `explain_query_verbose(query, catalog)` applies `optimize` before rendering.
   * New function `explain_query_raw(query)` for `EXPLAIN RAW`.
   * Helper `is_join_candidate` and inline annotation.
   * \~80 LoC + 5 tests.
4. **CLI** (`namidb-cli/src/main.rs`):
   * `--raw` flag; the flag embedded in the query string is passed through.
   * \~15 LoC.
5. **Executor wiring** (`src/exec/walker.rs`, `src/exec/writer.rs`):
   * Any external call site that used to call `lower(&query)?` before `execute(...)` now calls `plan(&query, &catalog)?`. The CLI and the integration tests are the main call sites.
6. **Integration tests** (`tests/cost_smoke.rs` + new `tests/optimize_smoke.rs`):
   * A Filter over the source goes below the Expand (LDBC IC2 micro).
   * A Filter over the target does NOT go below the Expand.
   * A Filter over an OPTIONAL target does NOT go down.
   * CrossProduct: predicates split into left / right / top.
   * `__label_eq` cleanup is verifiable in EXPLAIN output.
   * The optimized plan and the raw plan produce the same result set.
   * The optimized plan has `estimate(...) ≤ estimate(raw)`.

Expected snapshot:

* `cargo test --workspace --exclude namidb-py`: 413 → \~445 passed.
* `cargo clippy --workspace --all-targets -- -D warnings`: clean.
* `cargo fmt --all -- --check`: clean.
* New LoC: \~1 100 src + \~600 tests.
* No changes in `namidb-storage`.

# RFC 012: HashJoin

> **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-011 (predicate pushdown) **Supersedes:** none
>
> *Mirrored from [`docs/rfc/012-hash-join.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/012-hash-join.md) in the engine repo. Source of truth lives there.*

## Summary

Harvests the `[join candidate]` annotation that predicate pushdown (RFC-011) left planted in EXPLAIN VERBOSE: it converts the shape `Filter(a.x = b.y) ⇒ CrossProduct(L, R)` into a `HashJoin { build, probe, on: [(a.x, b.y)] }` that the executor materializes with build/probe phases. The operation goes from O(N×M) to O(N+M).

Scope:

* New operator `LogicalPlan::HashJoin { build, probe, on, residual }`.
* Rewriter `convert_cross_to_hash` in the `optimize` pipeline, after pushdown.
* Cardinality estimator for `HashJoin`.
* Executor that materializes the hash table on the build side and streams the probe side.
* EXPLAIN VERBOSE renders `HashJoin on=[(a.x, b.y)]` with build/probe as children.

Explicitly out of scope:

* **HashSemiJoin via decorrelation**: converting `SemiApply` when the subplan is correlated with bindings of the outer. It requires correlation analysis of the subplan (separate the correlating equality from the rest, execute the uncorrelated subplan, hash on the correlation output column, probe the outer). Independent iteration.
* **Sort-merge join**: an alternative when the inputs are already sorted. The executor does not propagate ordering, so v0 is HashJoin only.
* **Broadcast / partitioned join**: distributed execution.
* **Hash table spilling to disk**: for when the build side does not fit in memory. v0 assumes a single-node build that fits in memory. Documented as a drawback.
* **Equality detection without an AND root**: we only extract cross-side equalities from the top-level AND tree of the Filter. `Filter(a.x = b.y OR ...)` is left unconverted.
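The last out-of-scope item is easier to see on a toy model. The sketch below is only an illustration under stated assumptions: `Expr`, `alias_set` and `extract_join_keys` are hypothetical stand-ins, not NamiDB's `Expression`, `expression_aliases` or the real rewriter. It flattens just the top-level AND tree and keeps anything under an OR as a single residual term. The non-empty guard on the alias sets is also an assumption of this sketch, added so that a comparison against a literal is not mistaken for a join key.

```rust
use std::collections::BTreeSet;

// Hypothetical toy expression type; the engine's real `Expression` is richer.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Prop(&'static str, &'static str), // alias.property
    Lit(i64),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

// Collect every alias referenced anywhere inside `e`.
fn collect_aliases(e: &Expr, out: &mut BTreeSet<&'static str>) {
    match e {
        Expr::Prop(alias, _) => {
            out.insert(*alias);
        }
        Expr::Lit(_) => {}
        Expr::Eq(l, r) | Expr::And(l, r) | Expr::Or(l, r) => {
            collect_aliases(l, out);
            collect_aliases(r, out);
        }
    }
}

fn alias_set(e: &Expr) -> BTreeSet<&'static str> {
    let mut s = BTreeSet::new();
    collect_aliases(e, &mut s);
    s
}

// Flatten only the top-level AND tree; an OR stays one opaque term.
fn split_and_terms(e: &Expr, out: &mut Vec<Expr>) {
    match e {
        Expr::And(l, r) => {
            split_and_terms(l, out);
            split_and_terms(r, out);
        }
        other => out.push(other.clone()),
    }
}

// Returns (join keys as (left_expr, right_expr), residual terms).
fn extract_join_keys(
    pred: &Expr,
    left: &BTreeSet<&'static str>,
    right: &BTreeSet<&'static str>,
) -> (Vec<(Expr, Expr)>, Vec<Expr>) {
    let mut terms = Vec::new();
    split_and_terms(pred, &mut terms);
    let mut on = Vec::new();
    let mut residual = Vec::new();
    for term in terms {
        if let Expr::Eq(l, r) = &term {
            let (la, ra) = (alias_set(l), alias_set(r));
            // Assumption of this sketch: both sides must reference an alias.
            if !la.is_empty() && !ra.is_empty() {
                if la.is_subset(left) && ra.is_subset(right) {
                    on.push(((**l).clone(), (**r).clone()));
                    continue;
                }
                if la.is_subset(right) && ra.is_subset(left) {
                    // Mirror case: canonicalize so the left-side expression comes first.
                    on.push(((**r).clone(), (**l).clone()));
                    continue;
                }
            }
        }
        residual.push(term);
    }
    (on, residual)
}
```

With `left = {a}` and `right = {b}`, the predicate `b.y = a.x AND a.z = 5` yields one canonicalized key `(a.x, b.y)` and one residual term, while `b.y = a.x OR 1` yields no keys at all, which is the unconverted case described above.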
## Motivation

After predicate pushdown, the plan for `MATCH (a:Person), (b:Person) WHERE a.firstName = b.firstName` is:

```plaintext
Filter (a.firstName = b.firstName) [join candidate]
  CrossProduct (est=1000000) // 1000 * 1000
    NodeScan(Person, a) (est=1000)
    NodeScan(Person, b) (est=1000)
```

EXPLAIN VERBOSE flags the shape as `[join candidate]` but the plan still executes as a nested loop. On LDBC SF1 (3M Person), that is 9×10¹² pairs before filtering, so the query never finishes. With HashJoin:

```plaintext
HashJoin on=[(a.firstName, b.firstName)] (est=10000)
  build: NodeScan(Person, a) (est=1000)
  probe: NodeScan(Person, b) (est=1000)
```

Taking a round \~1M Person rows per side: the build phase inserts \~1M rows into the hash table (\~80MB in RAM with ndv(firstName)≈1k). The probe phase streams 1M Person rows, each doing one hash lookup that returns \~1k matches (assuming a uniform distribution over 1k buckets), so the join emits \~10⁹ rows in total. The work is **O(N+M) plus the output size instead of O(N×M)**: \~10⁹ emitted rows against 10¹² compared pairs, a saving of \~3 orders of magnitude.

Without HashJoin, the multi-pattern queries of LDBC SNB Interactive (IC2 with EXISTS, IC9 with a sub-pattern) take seconds to minutes on micro-graphs and never finish on SF1.

## Design

### 1. IR: `LogicalPlan::HashJoin`

```rust
/// Inner hash join (RFC-012). Equivalent to
/// `Filter(on AND residual) ⇒ CrossProduct(build, probe)` but
/// executed in two phases: build a hash table over `build`'s rows
/// keyed by each `JoinKey::build_side` expression; then stream
/// `probe`, evaluating `JoinKey::probe_side` and looking up matches.
///
/// The optimizer picks the side with smaller estimated cardinality
/// as `build` so the hash table stays compact.
///
/// `residual` is any non-equi predicate that survived from the
/// pre-conversion Filter (e.g. `a.z >= b.w` in
/// `WHERE a.x = b.y AND a.z >= b.w`).
/// It is evaluated on the *joined* row in 3VL.
HashJoin {
    build: Box<LogicalPlan>,
    probe: Box<LogicalPlan>,
    on: Vec<JoinKey>,
    residual: Option<Expression>,
},
```

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct JoinKey {
    /// Expression evaluated on each `build`-side row to compute the
    /// hash-table key. References only aliases produced by `build`.
    pub build_side: Expression,
    /// Expression evaluated on each `probe`-side row. References only
    /// aliases produced by `probe`.
    pub probe_side: Expression,
}
```

`children()` returns `vec![build, probe]` in that order, which keeps EXPLAIN rendering predictable. `operator_name()` returns `"HashJoin"`. `contains_write()` returns `false` (joins are read-side).

### 2. Conversion rewriter (`optimize::join_conversion`)

```rust
pub fn convert_cross_to_hash(
    plan: LogicalPlan,
    catalog: &StatsCatalog,
) -> LogicalPlan;
```

Bottom-up rewrite. The trigger shape in the post-pushdown plan is:

```plaintext
Filter { input: CrossProduct { left, right }, predicate }
```

Algorithm:

1. **Recurse** into children first (so we convert any inner joins before considering the current node).
2. **Match the trigger**. If the current plan is a `Filter` whose immediate child is a `CrossProduct`:
   a. **AND-split** the predicate into conjuncts.
   b. Compute `produced_aliases(left)` and `produced_aliases(right)`.
   c. For each conjunct `c`:
      * If `c` is `Binary { op: Eq, left: lhs, right: rhs }` and (`expression_aliases(lhs) ⊆ left_aliases ∧ expression_aliases(rhs) ⊆ right_aliases`) → push `(lhs, rhs)` to `on`.
      * Mirror case (`lhs ⊆ right ∧ rhs ⊆ left`) → push `(rhs, lhs)` so that build\_side always lines up with the `build` operand. We canonicalize.
      * Otherwise → push to `residual_terms`.
   d. If `on.is_empty()` → no conversion is possible; emit the original `Filter ⇒ CrossProduct` unchanged.
   e. **Build vs probe decision**: compute `estimate(left, catalog).rows` and `estimate(right, catalog).rows`. Whichever has fewer rows becomes `build`. If equal, prefer left as build (deterministic). Swap the `on` keys if we picked right as build.
   f. **Coalesce residual**: `residual = and_chain(residual_terms)` → `Option<Expression>`.
   g. Emit `HashJoin { build, probe, on, residual }`.
3. Otherwise (or if step 2 doesn't apply): preserve the operator and recurse over its children.

#### Edge cases

* **Eq with a literal on one side** (`a.x = 5`): not cross-side, falls through to residual / pushdown. Already handled by RFC-011.
* **Eq with both sides referencing only one alias** (`a.x = a.y`): same-side, stays in residual. Already pushable to that side's subtree.
* **Eq with parameter** (`a.x = $param`): `expression_aliases($param) = ∅`. Falls through to residual; the selectivity machinery accounts for it.
* **AND-only predicate**: `a.x = b.y AND a.z > b.w` → `on = [(a.x, b.y)]`, `residual = Some(a.z > b.w)`.
* **Multiple cross-side eqs**: `a.x = b.x AND a.y = b.y` → `on = [(a.x, b.x), (a.y, b.y)]`. Coalesced into a multi-key hash join.
* **No eq at all** (`Filter(a.x > b.x)`): `on.is_empty()`, no conversion. The plan remains a nested loop.

The rewriter must NOT trigger on Filters whose immediate child is not a CrossProduct: those are pushdown leftovers and irrelevant here.

### 3. Conversion entry point (`optimize::optimize`)

Integrates into the existing pipeline:

```rust
pub fn optimize(plan: LogicalPlan, catalog: &StatsCatalog) -> LogicalPlan {
    let mut current = plan;
    for _ in 0..MAX_FIXPOINT_ROUNDS {
        let next = normalize_filters(predicate_pushdown(current.clone()));
        let next = convert_cross_to_hash(next, catalog); // NEW
        if next == current {
            return next;
        }
        current = next;
    }
    current
}
```

Order matters: pushdown runs first so any pushable filter has been moved out of the way; the only Filters remaining above a CrossProduct are by definition cross-side mixers. The rewriter then has the cleanest possible signal.

### 4. Cardinality estimate

Add an arm to `cost::cardinality::estimate_inner` for `HashJoin`:

```rust
LogicalPlan::HashJoin { build, probe, on, residual } => {
    let b = estimate_inner(build, catalog);
    let p = estimate_inner(probe, catalog);
    // Selinger '79: inner equi-join cardinality.
    // rows = (|build| * |probe|) / max(ndv(build_key), ndv(probe_key))
    // For multi-key, assume independence: divide by product.
    let mut divisor = 1.0_f64;
    for key in on {
        let build_ndv = ndv_for_expr_opt(&key.build_side, catalog, &b.bindings).unwrap_or(1.0);
        let probe_ndv = ndv_for_expr_opt(&key.probe_side, catalog, &p.bindings).unwrap_or(1.0);
        divisor *= build_ndv.max(probe_ndv).max(1.0);
    }
    let mut rows = (b.rows * p.rows / divisor).max(0.0);
    // Residual reduces further. Use the existing selectivity machinery.
    if let Some(res) = residual {
        let mut combined = b.bindings.clone();
        for (k, v) in &p.bindings {
            combined.insert(k.clone(), v.clone());
        }
        let bs = make_binding_stats(catalog, &combined);
        rows *= selectivity(res, &bs);
    }
    let mut bindings = b.bindings.clone();
    for (k, v) in &p.bindings {
        bindings.insert(k.clone(), v.clone());
    }
    Cardinality {
        rows,
        children: vec![b, p],
        bindings,
        operator: "HashJoin",
    }
}
```

#### Why Selinger and not "min(|L|,|R|)"

A common shortcut estimate is `min(|L|, |R|)` for foreign-key joins. That's correct when the join key is unique on one side and present in every row of the other. For graph joins on arbitrary properties the distribution is wider, and Selinger captures both extremes:

* Unique on both sides → `min(|L|, |R|)` (since the join key has ndv = |L| ≈ |R|).
* Replicated key → much larger output.

We fall back to `divisor = 1` (= no reduction → CrossProduct cardinality) when ndv is `None`. That keeps the estimate sound (never under-estimates) at the cost of being pessimistic for queries the catalog doesn't know about.

### 5. Executor

Two-phase implementation in `exec::walker`:

```rust
async fn execute_hash_join(
    build: &LogicalPlan,
    probe: &LogicalPlan,
    on: &[JoinKey],
    residual: &Option<Expression>,
    snapshot: &Snapshot<'_>,
    params: &Params,
) -> Result, ExecError> {
    // Build phase.
    let build_rows = execute_inner(build, snapshot, params, /*outer=*/ None).await?;
    let mut table: HashMap, Vec> = HashMap::new();
    for row in build_rows {
        let mut key = Vec::with_capacity(on.len());
        let mut has_null = false;
        for jk in on {
            let v = evaluate(&jk.build_side, &row, params)?;
            if matches!(v, RuntimeValue::Null) {
                has_null = true;
                break;
            }
            key.push(v);
        }
        if has_null {
            continue; // NULL keys never match (3VL).
        }
        table.entry(key).or_default().push(row);
    }

    // Probe phase.
    let probe_rows = execute_inner(probe, snapshot, params, None).await?;
    let mut out = Vec::new();
    for prow in probe_rows {
        let mut key = Vec::with_capacity(on.len());
        let mut has_null = false;
        for jk in on {
            let v = evaluate(&jk.probe_side, &prow, params)?;
            if matches!(v, RuntimeValue::Null) {
                has_null = true;
                break;
            }
            key.push(v);
        }
        if has_null {
            continue;
        }
        if let Some(matches) = table.get(&key) {
            for brow in matches {
                let mut combined = brow.clone();
                for (k, v) in &prow.bindings {
                    combined.bindings.insert(k.clone(), v.clone());
                }
                if let Some(res) = residual {
                    match evaluate(res, &combined, params)? {
                        RuntimeValue::Bool(true) => out.push(combined),
                        _ => {} // False or NULL drops.
                    }
                } else {
                    out.push(combined);
                }
            }
        }
    }
    Ok(out)
}
```

#### NULL semantics

`a.x = b.y` is `NULL` when either side is `NULL` (Cypher 3VL). `Filter` drops rows where the predicate evaluates to NULL. Our hash join replicates that: any NULL component in the join key skips both the build insert and the probe lookup. Test coverage explicitly exercises this.

#### Hash key representation

`Vec<RuntimeValue>` is the HashMap key. `RuntimeValue` implements `Hash + Eq` through derive (numeric, string, bool variants are straightforward).
`RuntimeValue::Float` needs a bit-level canonical form so that every NaN lands in one bucket; whether the existing `Hash` impl already guarantees this is raised under the open questions.

#### Memory footprint

Build hash table size: roughly `|build| * (avg_key_size + avg_row_size)`. For an SF1-scale build of 3M Person × (200B/row + 50B key) that is \~750MB, which fits comfortably on an 8GB machine. **Drawback**: there is no spill, so jobs that pick the wrong build side can OOM. The catalog-based build-vs-probe decision guards against this. A future RFC adds spill.

#### Bindings combine

When we emit a joined row, we take the build row, then `.extend()` its bindings with the probe row's bindings. If the two sides share a binding name (shouldn't happen in well-formed plans, but defensive), probe wins. This matches the lowering invariant: two pattern parts share no fresh aliases (lowering uses `CrossProduct` precisely when they don't).

### 6. EXPLAIN rendering

`write_header` for HashJoin:

```plaintext
HashJoin on=[(a.firstName, b.firstName)] residual=(a.id < b.id)
```

`residual` is omitted when None. Multi-key:

```plaintext
HashJoin on=[(a.x, b.x), (a.y, b.y)]
```

The `[join candidate]` annotation that the predicate pushdown emitted on the original `Filter ⇒ CrossProduct` disappears post-conversion: the operator IS the join now. This is verifiable with a test.

`EXPLAIN VERBOSE` cardinality numbers show the improvement: the HashJoin estimate is much smaller than the pre-conversion CrossProduct estimate.

### 7. Interaction with subsequent rewrites

* **Predicate pushdown above HashJoin**: predicates that reference only build or only probe aliases can be pushed below the HashJoin into the respective subtree. The pushdown rule for HashJoin is identical to CrossProduct (split by side; mixed-eq is now in `on` or `residual` and doesn't reach pushdown). We add an arm to `predicate_pushdown` to support this.
* **Join reorder**: when reorder triggers, it can swap build and probe by re-evaluating `estimate`; the on-keys remain valid (just the symbol order in each pair swaps).
* **HashSemiJoin**: a `SemiApply` with a correlated subplan is decorrelated by extracting the correlation key, executing the subplan unparametrised, hashing on the correlation key, and probing the outer. The IR delta is small (`HashSemiJoin` is just `HashJoin` restricted to emitting outer rows, plus an optional negation flag).

## Alternatives considered

### A. Nested-loop with Bloom filter probe

DuckDB-style: build a Bloom filter on `build`’s join keys, probe each `probe` row by Bloom-checking, and fall through to nested-loop on the positives.

**Rejected**: a Bloom-filter false-positive rate of \~1% means 99% of the work is short-circuited, but the remaining 1% is still O(N×M): for N, M = 10⁶ that’s 10¹⁰ comparisons. HashJoin is strictly better when memory fits.

### B. Sort-merge join

If both sides are already sorted on the join key, a sort-merge join avoids materialising a hash table.

**Rejected for v0**: the executor does not propagate ordering. Adding ordering metadata to the IR is a separate, larger change (a morsel-driven, vectorised executor often carries sort ordering as a metadata column).

### C. Stream both sides and use partition-hash-join

Modern approach (DuckDB, ClickHouse, MapReduce-style). Both inputs partition by hash; each partition joins independently.

**Rejected**: parallelism over partitions is a morsel feature, out of scope here. Single-threaded HashJoin v0 is the simplest correct approach.

### D. Rely on graph-native joins (WCOJ / Worst-Case Optimal Join)

For cyclic / multi-way joins, WCOJ (RFC-009-eve) outperforms binary hash joins.

**Rejected here**: WCOJ is RFC-009’s concern. Binary hash joins cover the LDBC SNB Interactive queries this RFC targets (IC2, IC4, IC10 with cross-pattern equi-joins); WCOJ gains kick in on truly cyclic queries (IC9 path patterns).

### E. Defer join conversion until query execution time

Adaptive: run a tiny sample of both sides, pick the join algorithm at runtime.

**Rejected**: adds runtime branching to the executor and defeats the EXPLAIN/PROFILE story. Adaptive execution can be revisited in the future.

## Drawbacks

1. **Unbounded hash table memory**. That the build side fits in RAM is an assumption. For LDBC SF1 we’re fine (build of 3M rows × 250B = \~750MB), but pathological queries (joins on rare keys with replicated rows) could blow memory. Mitigated by the build-vs-probe decision; not eliminated. Spill to disk remains a follow-up.
2. **No correlated subquery conversion**. SemiApply with a correlated subplan (the typical `EXISTS` shape) still nested-loops. A separate iteration addresses this.
3. **No multi-pattern join graph**. With 3+ pattern parts the optimizer sees a tree of CrossProducts; converting bottom-up means we lose the chance to pick a globally-optimal join order. Join reorder (RFC-016) addresses this by enumerating join trees BEFORE the conversion rewriter.
4. **The build side is materialised in full**. For a very large build side, this prevents streaming output. Modern hash joins emit matched rows as the probe streams; we do too, but only after the full build is in memory.
5. **Conversion is conservative on residual**: when AND-split leaves non-eq conjuncts, we keep them as `residual` on the HashJoin. This means the residual evaluates on every joined row, which can be expensive. Future rewrites (`predicate_pushdown` over HashJoin) can further push residual conjuncts below the join if they reference only one side. This is already supported by the standard pushdown rules once HashJoin is a recognised plan node.

## Open questions

* **OQ1**. `RuntimeValue::Float` as hash key: NaN canonical hashing must be enforced. Today’s `Hash` impl uses raw bits, which makes distinct NaN bit patterns hash differently. Do we normalise during `evaluate` for the join key, or in the `Hash` impl?
Decided pragmatically: normalise on insert/lookup via a helper.
* **OQ2**. Should HashJoin emit a “join-key” column in the output so downstream operators can dedup cheaply? Today the executor does not: the build row’s aliases survive verbatim and a downstream Distinct re-hashes. Defer.
* **OQ3**. The cost model assumes independence between multi-key components. For LDBC IC9 with `WHERE a.firstName = b.firstName AND a.lastName = b.lastName`, the two equalities are strongly correlated, so the estimate over-reduces. Future multi-column histograms would fix this.

## References

* Selinger et al., *Access Path Selection in a Relational Database Management System* (SIGMOD ‘79): origin of the cost-based join order and cardinality estimates we reuse here.
* DuckDB’s *Push-Based Execution* (Mark Raasveldt 2022): modern reference implementation for HashJoin.
* *Worst-Case Optimal Joins* (Ngo, Porat, Ré, Rudra; PODS ‘14): WCOJ baseline for future work. Mentioned to contrast our scope.
* `docs/rfc/008-logical-plan-ir.md`: IR this RFC extends.
* `docs/rfc/010-cost-based-optimizer.md`: cost model this RFC consumes via `StatsCatalog`.
* `docs/rfc/011-predicate-pushdown.md`: the `[join candidate]` annotation that this RFC harvests.

## Implementation plan

1. **`crates/namidb-query/src/plan/logical.rs`** (\~80 LoC + 3 tests):
    * Add the `LogicalPlan::HashJoin` variant + `JoinKey` struct.
    * Update `children()`, `operator_name()`, `contains_write()`, and the IR test.
2. **`crates/namidb-query/src/optimize/join_conversion.rs`** (\~400 LoC + 12-15 tests):
    * Recursive `convert_cross_to_hash(plan, catalog)`.
    * Helper `extract_cross_side_equalities(predicate, left_aliases, right_aliases) -> (Vec, Vec)`.
    * Helper `pick_build_side(left, right, catalog) -> Side`.
3. **`crates/namidb-query/src/optimize/pushdown.rs`** (\~50 LoC + 4 tests):
    * Add a HashJoin arm in `pushdown_at`: split by side, push pushable conjuncts to build or probe. Same shape as CrossProduct.
4. **`crates/namidb-query/src/optimize/mod.rs`** (\~10 LoC):
    * Call `convert_cross_to_hash` post-pushdown in the `optimize` pipeline.
5. **`crates/namidb-query/src/cost/cardinality.rs`** (\~60 LoC + 4 tests):
    * New arm for `HashJoin` with the formula from §4.
6. **`crates/namidb-query/src/exec/walker.rs`** (\~150 LoC + 5 tests):
    * `execute_hash_join` with build/probe phases.
7. **`crates/namidb-query/src/plan/explain.rs`** (\~30 LoC + 3 tests):
    * `write_header` arm for HashJoin.
    * `plan_has_stats` arm.
8. **`crates/namidb-query/tests/cost_smoke.rs`** (+6 integration tests).

Expected snapshot:

* `cargo test --workspace --exclude namidb-py`: 509 → \~555 passed.
* `cargo clippy --workspace --all-targets -- -D warnings`: clean.
* `cargo fmt --all -- --check`: clean.
* New LoC: \~800 src + \~400 tests.
* No changes in `namidb-storage`.

# RFC 013: Parquet predicate pushdown

> **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-011 (predicate pushdown), RFC-002 §4 (SST stats) **Supersedes:** none
>
> *Mirrored from [`docs/rfc/013-parquet-predicate-pushdown.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/013-parquet-predicate-pushdown.md) in the engine repo. Source of truth lives there.*

## Summary

Harvests the `min/max` stats that the writer already emits per row-group (RFC-002 §4, plus read-back from the Parquet footer) to avoid decoding row-groups that cannot contain rows satisfying the WHERE. Structural predicate pushdown (RFC-011) pushed each Filter as close to the leaf (NodeScan) as it could; this next step converts the `Filter` immediately above a `NodeScan` into `predicates` *on the NodeScan*, which the storage layer consumes during the scan to discard entire row-groups without decoding them.
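As a rough illustration of the row-group pruning this summary describes, the sketch below keeps only the row-groups whose `max` could satisfy a `value > threshold` predicate. `RowGroupStats` and `row_groups_to_decode` are hypothetical names invented for this example, and `i64` stands in for `StatScalar`; none of this is the engine's actual API.

```rust
/// Hypothetical, simplified stand-in for per-row-group column statistics.
/// `None` models a row-group whose min/max were not recorded.
#[derive(Clone, Copy, Debug)]
pub struct RowGroupStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Returns the indices of the row-groups that may contain a row with
/// `value > threshold`. A row-group is skipped only when its `max`
/// proves that no row can match; missing stats always keep the row-group.
pub fn row_groups_to_decode(stats: &[RowGroupStats], threshold: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| match s.max {
            Some(max) => max > threshold, // stats overlap the predicate: decode
            None => true,                 // no stats: stay conservative, decode
        })
        .map(|(i, _)| i)
        .collect()
}
```

The executor's `Filter` would still run on the decoded rows, so a conservative "keep" never changes query results; it only costs extra decoding.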
Scope:

* A `ScanPredicate` type in `namidb-storage::sst::predicates` (Eq / Lt / LtEq / Gt / GtEq / Between / IsNull / IsNotNull / In) against a single column and a canonical literal (`StatScalar`).
* A conservative `eval_row_group(predicate, &PropertyColumnStats) -> RowGroupVerdict`: `Absent` only if the stats prove impossibility, `MaybePresent` when the stats are missing or inconclusive.
* `NodeSstReader::scan_with_predicates(&[ScanPredicate])` reads the Parquet metadata, evaluates each predicate against the per-row-group stats, and skips any row-group with an `Absent` verdict for ANY predicate.
* `Snapshot::scan_label_with_predicates(label, &[ScanPredicate])`, with `Snapshot::scan_label(label)` kept as a wrapper of `scan_label_with_predicates(label, &[])` (compat).
* `LogicalPlan::NodeScan` gains a `predicates: Vec<ScanPredicate>` field. Empty by default for existing callers.
* Rewriter in `optimize::pushdown`: when it reaches a `NodeScan` with non-empty `pending`, it tries to convert each conjunct to a `ScanPredicate` over `alias.property`. Pushable conjuncts go to NodeScan.predicates; non-pushable ones remain as `Filter`.
* The executor passes the predicates to storage at the `walker::execute_node_scan` callsite.
* Cardinality evaluates the selectivity of the already-pushed predicates over the catalog’s `node_count` (consistent with RFC-010 §3.1).
* EXPLAIN VERBOSE: `NodeScan label=Person alias=a predicates=[a.age > 30, a.firstName = "Alice"]`.

Explicitly out of scope:

* **Parquet row-level filtering** (Arrow `filter` operator inside the reader). The executor already has `Filter`, and applying it twice would duplicate work. We only do *row-group* pruning in storage.
* **Predicates over Expand / edge SST**. RFC-002 does define stats per edge SST (`DegreeHistogram`), but edges have no stats per arbitrary property. Kept out of v0.
* **OR predicates**. Each `ScanPredicate` is a single-column AND conjunct. `WHERE a.x = 1 OR a.y = 2` is not pushed in v0; it would require a union of row-groups with per-row verdict bookkeeping. Conjunctive-only.
* **Cross-alias predicates** (`WHERE a.age = b.age`). Storage does not know `b`; the cross-alias Filter stays above the NodeScan, and since the HashJoin rewrite already converts it into a HashJoin, this pushdown takes nothing away from it.
* **Predicates derived from parameters**. Parameters are resolved at runtime; the storage layer does not see them. If lowering knows the value (a constant), it is pushed down as a literal; if it is an open parameter, the Filter stays above. A possible post-v0 extension: resolve parameters in `optimize::optimize` once `&Params` is passed to the pipeline.
* **Page-index pruning** (Parquet 2.0 column index + offset index). Worthwhile for large SSTs with large row-groups, but it requires another layer of stats. Skipped in v0; the writer already emits chunk-level stats (`EnabledStatistics::Chunk`), which is sufficient.
* **Bloom filter check on eq predicates**. The writer emits bloom filters for `node_id` but not for arbitrary properties. Extending the bloom to properties is a space trade-off (bloom bytes are \~7×rows); deferred until min/max stats prove insufficient.
* **Cardinality estimate with dependence between predicates**. Each predicate applies an independent selectivity, as in RFC-010 §3.2. The catalog HLL would give correlation, but v0 keeps the independence assumption.

## Motivation

On LDBC SNB SF1 (3M Person nodes in \~30 SSTs of 100k rows each, with row-groups of 8192 rows = \~366 row-groups per label), a query such as `MATCH (a:Person) WHERE a.creationDate > '2020-01-01' RETURN a` today:

* Reads ALL the SSTs (each \~10–40 MB on S3).
* Decodes ALL the row-groups (\~366 Parquet decompressions).
* Filters at row level in the executor, discarding \~99% of the rows.

With predicate pushdown:

* Reads the **footer + page index** of ALL the SSTs (\~64 KiB each, no body).
* For each SST, consults `min/max(creationDate)` per row-group.
* Skips the row-groups whose `max(creationDate) <= '2020-01-01'`.
* Decodes only the row-groups that may contain matches (\~10–30 row-groups instead of 366).

**IO savings on S3 (the dominant cost in cloud):** typically 10× for selective queries. CPU savings in decoding (Parquet deserialization): a factor of \~30×.

This is the last piece of end-to-end pushdown from the query layer to storage. Without it, the `min/max` values the writer emits go unused at scan time: they are only in the catalog for the cost model.

## Design

### 1. `namidb-storage::sst::predicates`

```rust
/// A single-column conjunctive predicate that the SST reader can
/// evaluate against per-row-group stats to skip entire row-groups
/// without decoding them.
///
/// Each variant references a column **by its declared property name**
/// (not by Parquet leaf path). The reader resolves to leaf index at
/// scan time.
#[derive(Clone, Debug, PartialEq)]
pub enum ScanPredicate {
    Eq { column: String, value: StatScalar },
    Lt { column: String, value: StatScalar },
    LtEq { column: String, value: StatScalar },
    Gt { column: String, value: StatScalar },
    GtEq { column: String, value: StatScalar },
    /// `Between { low, high }` is INCLUSIVE both sides. Equivalent to
    /// `GtEq(low) AND LtEq(high)`.
    Between { column: String, low: StatScalar, high: StatScalar },
    IsNull { column: String },
    IsNotNull { column: String },
    In { column: String, values: Vec<StatScalar> },
}
```

The literal type, `StatScalar`, is the same one the writer emits into `PropertyColumnStats`. This guarantees that predicates and stats are compared under a single ordering.

#### Verdict

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RowGroupVerdict {
    /// Stats prove no row in this row-group can satisfy the predicate.
    Absent,
    /// Stats are insufficient (missing min/max, type mismatch) OR
    /// stats overlap — rows may or may not match. Decode the
    /// row-group.
    MaybePresent,
}
```

Conservatism: we only return `Absent` when the stats *prove* the row-group contains no match.
Missing min/max ⇒ `MaybePresent`. Type mismatch (predicate is `Utf8` but stats are `Int32`) ⇒ `MaybePresent` (defensive; treats the predicate as inapplicable rather than asserting). #### Evaluation ```rust pub fn eval_row_group( predicate: &ScanPredicate, stats: &PropertyColumnStats, ) -> RowGroupVerdict; ``` Algorithm per predicate (column known to match `stats.name`): * `Eq(v)`: * If `min ≤ v ≤ max` → MaybePresent. Else Absent. * If min/max missing → MaybePresent. * `Lt(v)`: * If `min < v` → MaybePresent (some row may be < v). Else Absent (all rows ≥ v). * `LtEq(v)`: `min ≤ v` → MaybePresent. Else Absent. * `Gt(v)`: `max > v` → MaybePresent. Else Absent. * `GtEq(v)`: `max ≥ v` → MaybePresent. Else Absent. * `Between { low, high }`: equivalent to `GtEq(low) AND LtEq(high)`. Apply both; AND of verdicts = Absent if either is Absent. * `IsNull`: `null_count > 0` → MaybePresent. Else Absent. * `IsNotNull`: row-group has at least one non-null value (`null_count < row_count`); we don’t have `row_count` per stats here but we DO have `null_count` and if `null_count == 0` we’re sure non-nulls exist (defensive: always MaybePresent in v0 when stats exist; falls back to row-level Filter for the per-row check). * `In { values }`: build the closed interval `[min(values), max(values)]`; apply same logic as `Between`. False-positives are accepted (some intermediate value may not be in the `In` list — caught by the Filter operator above the NodeScan that the rewriter leaves intact as residual when In is partial). Comparison between `StatScalar` variants follows the obvious type matching: `Int32 vs Int32`, `Float64 vs Float64`, `Utf8 vs Utf8`, etc. Cross-type comparison (e.g. `Int32 vs Float64`) returns `MaybePresent` in v0; the optimizer doesn’t generate cross-type predicates because the property type is declared in the schema. NULL handling: `min` and `max` in `PropertyColumnStats` are computed over non-null values only (writer convention; see RFC-002 §4.1). 
So a column where every value is NULL has `min=None, max=None, null_count=N`. We evaluate `IsNull` from null\_count alone, and any ordered predicate (`Eq/Lt/Gt/...`) on a min/max=None column returns MaybePresent (defensive — the row-group has no non-null rows that could satisfy the predicate, but evaluating the predicate at row-level will correctly drop NULLs via 3VL). ### 2. `NodeSstReader::scan_with_predicates` ```rust impl NodeSstReader { pub fn scan_with_predicates( &self, predicates: &[ScanPredicate], ) -> Result>; } ``` Algorithm: 1. Build the Parquet reader once: `ParquetRecordBatchReaderBuilder::try_new(body)`. 2. Read the metadata; collect the leaf index for every property column referenced by any predicate. 3. For each row-group: * For each predicate, locate the column leaf in the row-group’s ColumnChunkMetaData → `cc.statistics()`. Map to a per-row-group `PropertyColumnStats` synthesizing only the fields evaluation needs (`null_count`, `min`, `max` — same coding as `compute_property_stats`). * Evaluate each predicate with `eval_row_group`. If ANY returns `Absent`, skip the row-group. 4. Collect surviving row-group indices into `keep`. 5. If `keep` is empty, return `Vec::new()` (no decode). 6. Otherwise build the reader `with_row_groups(keep)` and decode as in `scan()`. Cost: \~few µs per row-group of metadata inspection (the metadata is already in memory from `try_new`). When all row-groups survive we fall through to the same path as `scan()` and pay no extra IO. ### 3. `Snapshot::scan_label_with_predicates` ```rust impl Snapshot<'_> { pub async fn scan_label(&self, label: &str) -> Result> { self.scan_label_with_predicates(label, &[]).await } pub async fn scan_label_with_predicates( &self, label: &str, predicates: &[ScanPredicate], ) -> Result>; } ``` The new variant: * Iterates the memtable as `scan_label` does today, but additionally evaluates each predicate against the materialised NodeView’s property map. 
Memtable values are decoded already, so this is cheap (no IO). * For each SST scoped to `label`, calls `reader.scan_with_predicates(predicates)` instead of `reader.scan()`. * Returns the same `BTreeMap` semantics: tombstones win; last-write-wins by LSN. Memtable predicate evaluation is row-by-row in v0 and uses the same 3VL semantics as the executor’s Filter (`Bool(true)` → keep, `Bool(false)` / `Null` → drop). ### 4. `LogicalPlan::NodeScan` change ```rust NodeScan { label: String, alias: String, /// Predicates that have been pushed into the scan from a Filter /// directly above it. Empty for the lowering output; populated /// by `optimize::pushdown` when conjuncts qualify (see §5). /// The executor passes them verbatim to /// `Snapshot::scan_label_with_predicates`. predicates: Vec, } ``` `PartialEq`, `Clone`, `Debug` derive over the new field. All existing constructions of `NodeScan` (lowering, tests) now use `predicates: Vec::new()`. `operator_name()` remains `"NodeScan"`. EXPLAIN VERBOSE renders the predicates inline (see §6). ### 5. Rewriter in `optimize::pushdown` The existing leaf arm: ```rust LogicalPlan::Empty | LogicalPlan::Argument { .. } | LogicalPlan::NodeScan { .. } => { apply_filters(plan, pending) } ``` becomes: ```rust LogicalPlan::NodeScan { label, alias, predicates } => { let (pushable, residual) = classify_pending_for_scan(pending, &alias, &label_def); let mut merged = predicates; merged.extend(pushable); apply_filters( LogicalPlan::NodeScan { label, alias, predicates: merged }, residual, ) } ``` `classify_pending_for_scan(pending, alias, label_def)` returns: * `pushable: Vec` — conjuncts that are single-column comparisons on `alias.` with a literal/parameter (only literals in v0; parameters deferred) and reference NO other alias. The property must be declared in `label_def`. * `residual: Vec` — everything else; stays as `Filter` above the NodeScan. 
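The classification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's implementation: `Expr`, `Literal`, `CmpOp` and `ScanPred` are hypothetical, simplified stand-ins for `Expression`, `StatScalar` and `ScanPredicate`, and a slice of declared property names stands in for `label_def`. Only comparison operators are covered.

```rust
// Hypothetical, simplified types used only for this illustration.
#[derive(Clone, Debug, PartialEq)]
pub enum Literal {
    Int(i64),
    Str(String),
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CmpOp { Eq, Lt, LtEq, Gt, GtEq }

#[derive(Clone, Debug, PartialEq)]
pub enum Expr {
    Property { alias: String, name: String },
    Literal(Literal),
    Param(String),
    Cmp { op: CmpOp, left: Box<Expr>, right: Box<Expr> },
}

#[derive(Clone, Debug, PartialEq)]
pub struct ScanPred {
    pub column: String,
    pub op: CmpOp,
    pub value: Literal,
}

/// Mirrors an operator so that `lit OP prop` can be rewritten as `prop OP' lit`.
fn flip(op: CmpOp) -> CmpOp {
    match op {
        CmpOp::Eq => CmpOp::Eq,
        CmpOp::Lt => CmpOp::Gt,
        CmpOp::LtEq => CmpOp::GtEq,
        CmpOp::Gt => CmpOp::Lt,
        CmpOp::GtEq => CmpOp::LtEq,
    }
}

/// Converts one conjunct when it compares `alias.<declared property>` with a
/// literal; parameters, other aliases and undeclared properties yield `None`.
fn try_into_scan_pred(expr: &Expr, alias: &str, declared: &[&str]) -> Option<ScanPred> {
    let Expr::Cmp { op, left, right } = expr else { return None };
    let (a, name, lit, op) = match (&**left, &**right) {
        (Expr::Property { alias: a, name }, Expr::Literal(l)) => (a, name, l, *op),
        (Expr::Literal(l), Expr::Property { alias: a, name }) => (a, name, l, flip(*op)),
        _ => return None,
    };
    if a.as_str() != alias || !declared.contains(&name.as_str()) {
        return None;
    }
    Some(ScanPred { column: name.clone(), op, value: lit.clone() })
}

/// Splits the pending conjuncts into pushable scan predicates and a residual
/// that stays as a `Filter` above the scan.
pub fn classify_pending_for_scan(
    pending: Vec<Expr>,
    alias: &str,
    declared: &[&str],
) -> (Vec<ScanPred>, Vec<Expr>) {
    let mut pushable = Vec::new();
    let mut residual = Vec::new();
    for conjunct in pending {
        match try_into_scan_pred(&conjunct, alias, declared) {
            Some(pred) => pushable.push(pred),
            None => residual.push(conjunct),
        }
    }
    (pushable, residual)
}
```

In this sketch every rejected conjunct is returned unchanged in the residual, so a conjunct that fails to qualify costs nothing beyond the check itself.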
The classification function lives in `optimize::parquet_pushdown::classify` (new module) and is unit tested independently. The integration into `pushdown_at` is a single arm change. Why fold the parquet pushdown into the same `pushdown_at` pass instead of a separate post-pass: the `pending` accumulator already carries every conjunct that the existing pushdown was about to materialise as `Filter` over `NodeScan`. Classifying them at the leaf is the natural place — it costs O(|pending|) per leaf and avoids a second tree walk. ### 6. EXPLAIN VERBOSE ```plaintext Project [a] (est=1500) NodeScan label=Person alias=a predicates=[a.age > 30, a.firstName = "Alice"] (est=1500) ``` Plain `EXPLAIN` (no VERBOSE) also renders the predicates — they are part of the operator shape, not annotations. `EXPLAIN RAW` shows the pre-optimize lowering, where `NodeScan` has `predicates: vec![]` and the conjuncts live in a `Filter` above. `predicates=[...]` rendering uses each predicate’s `Display`: * `Eq { column, value }` → `a.col = ` * `Lt { column, value }` → `a.col < ` * `Between { column, low, high }` → `a.col BETWEEN AND ` * `IsNull { column }` → `a.col IS NULL` * `IsNotNull { column }` → `a.col IS NOT NULL` * `In { column, values }` → `a.col IN [, , ...]` The `alias.col` prefix comes from the NodeScan’s `alias`. Literals render via their `StatScalar` Display. ### 7. Cardinality The `NodeScan` arm in `cost::cardinality::estimate_inner` becomes: ```rust LogicalPlan::NodeScan { label, alias, predicates } => { let base = catalog.label(label).map(|l| l.node_count as f64).unwrap_or(0.0); let sel = predicates_selectivity(predicates, catalog, label); let rows = base * sel; // bindings, leaf as before } ``` `predicates_selectivity` reuses the existing `selectivity` machinery from `cost::selectivity` by translating each `ScanPredicate` to its `Expression` analogue (`Eq → BinaryOp::Eq`, etc.) 
and calling `selectivity(&expr, &binding_stats)` where `binding_stats` is seeded from the property stats in the catalog. Multi-predicate combines under the independence assumption (RFC-010 §3.2). Trade-off: we double-evaluate selectivity (once at NodeScan level for the pushed predicates, once at any residual Filter above). This is correct — Filter applies on top of the already-reduced NodeScan estimate. ### 8. Edge SSTs Out of scope (see §“Out-of-scope”). `EdgesFwd/Inv` SSTs ship `DegreeHistogram` but no per-property stats. Adding edge-property stats requires the writer to track them per edge\_type — a separate RFC if/when needed. ## Alternatives considered ### A. Filter pushdown using DataFusion’s Expr DataFusion has a full Expr language with a `PhysicalPlanner` that converts pushable Expr to Parquet `RowFilter`. **Rejected**: bringing DataFusion as a dependency for the pushdown ergonomics alone is a mismatch — we’d still need a translator from our `Expression` to their `Expr`, and the rest of our executor wouldn’t share the path. A future morsel/vectorized iteration may revisit, but pushdown alone doesn’t pay it. ### B. Runtime adaptive sampling Detect that a query is selective by sampling the first N rows and deciding pushdown on the fly. **Rejected**: defeats EXPLAIN/PROFILE story (plan changes at runtime), and our static stats are good enough for the v0 regime. ### C. Encode predicates in a server-side filter pushdown to S3 Select S3 Select supports SQL filters server-side but requires the body to be in CSV/JSON. Parquet Select is not GA. **Rejected**: incompatible with our storage format. Future feasibility check goes with edge storage RFCs. ### D. Build a custom bloom filter per property column for eq pushdown The writer would emit a bloom over each property’s hashed values. Eq predicates probe the bloom before reading the row-group. **Deferred**: RFC-002 explicitly limits blooms to `node_id` (for point lookups). 
A property bloom is \~7×rows bytes — for 100k row SSTs that’s \~700 KiB per property column. The space cost only pays off on cardinalities that min/max-based pruning misses (which is rare — eq on high-NDV columns is already covered by min/max for the values inside a row-group and a NodeId-bloom alike). Track for follow-up if real workloads show it. ## Drawbacks 1. **No row-level filter in storage**. Surviving row-groups still decode in full; the executor’s Filter then drops non-matching rows. For row-groups with \~50% selectivity this double-touches values. Mitigated by the executor’s Filter living in the same process — it’s cheap. Row-level pushdown in storage would couple Arrow’s `filter` operator to the reader (morsel direction). 2. **Single-column predicates only**. Multi-column predicates (`a.x + a.y > 100`) stay as `Filter`. v0 accepted. 3. **No parameter substitution in v0**. `WHERE a.age > $minAge` keeps the Filter above the NodeScan since we don’t resolve `$minAge` until execution. A later optimization passes `&Params` into `optimize::optimize` to constant-fold them; deferred. 4. **OR predicates are not pushed**. `WHERE a.age > 30 OR a.firstName = "Alice"` is one conjunct in the AND-split (the OR root), and pushability requires single-column ⇒ rejected. The Filter survives. Could be added by extending `ScanPredicate` to a tree, but row-group verdict combination for OR gets messier (union of MaybePresent verdicts). 5. **`IsNotNull` is conservative**. We don’t have per-row-group row\_count to verify `null_count < row_count`. Always returns MaybePresent when stats exist; the executor’s Filter drops nulls at row level. Negligible cost. 6. **Cross-type predicate comparison returns MaybePresent**. By design (defensive). The optimizer doesn’t construct cross-type predicates because schemas declare property types — but if a future path introduces them, this is the safety net. 7. **Memtable predicate eval is row-by-row** (not vectorised). 
The memtable is typically small (<10k rows before flush), so this is negligible vs SST decoding. Morsel-driven execution can revisit. ## Open questions * **OQ1**. `predicates_selectivity` reuse path: should it translate `ScanPredicate` → `Expression` and call `selectivity`, or have its own simpler arm? Decided: translate (single source of truth for selectivity heuristics). * **OQ2**. Should `NodeById` also accept predicates? Today `NodeById` is a point-lookup. Predicates on the same alias COULD be applied during the lookup. v0: no — the lookup decodes one row group with at most one row anyway; Filter on top is fine. Track as follow-up if benchmarks show a hot path. * **OQ3**. Should the writer emit per-row-group HLL sketches (not just per-SST)? This would enable approx-NDV reasoning per row-group for eq pushdown. Deferred; the current per-SST HLL is sufficient for query-level cardinality estimates. ## References * *Parquet 2.0 column index + offset index* — Apache Parquet specification §6.2 (column index for page-level pruning). * DuckDB’s *predicate pushdown into Parquet readers* (Raasveldt 2022) — modern reference implementation. * `docs/rfc/002-sst-format.md` §4 — stats embedded in SST. * `docs/rfc/008-logical-plan-ir.md` — IR this RFC extends. * `docs/rfc/010-cost-based-optimizer.md` — cost model this RFC reuses. * `docs/rfc/011-predicate-pushdown.md` — the structural pushdown this RFC builds on. ## Plan de implementación 1. **`crates/namidb-storage/src/sst/predicates.rs`** (\~250 LoC + 18 tests): * `ScanPredicate` enum + `RowGroupVerdict` + `eval_row_group` evaluator. Helpers `scalar_cmp(a, b) -> Ordering` and `scalar_eq(a, b) -> bool` (delegating to PartialOrd / PartialEq of `StatScalar`). * Module `pub` in `sst/mod.rs`. * Unit tests cubren: Eq in/out range, Lt boundary, GtEq with NULL min, IsNull positive/negative, In with single/multi values, Between, missing min/max → MaybePresent, type mismatch → MaybePresent. 2. 
**`crates/namidb-storage/src/sst/nodes.rs`** (\~150 LoC + 5 tests): * `NodeSstReader::scan_with_predicates(&[ScanPredicate])` implementing §2 algorithm. `scan()` becomes a wrapper of `scan_with_predicates(&[])`. * Helper `row_group_stats_for_column(rg, col_name, prop_def) -> Option` reusing the mapping from `compute_property_stats`. * Unit tests: predicate skips all row-groups, predicate skips some, no predicates fall through to full scan, `IsNull` with NULL row-group survives, multi-predicate AND, no-stats fallback keeps row-group. 3. **`crates/namidb-storage/src/read.rs`** (\~80 LoC + 3 tests): * `Snapshot::scan_label_with_predicates(label, &[ScanPredicate])` implementing §3 algorithm. `scan_label(label)` wraps it with `&[]`. * Memtable predicate eval helper using `node_view_matches_predicates` (NULL-safe 3VL). * Unit tests: memtable filtering, SST filtering, predicate over tombstoned row, ND ndv (just kidding — verifies catalog isn’t used in scan path). 4. **`crates/namidb-storage/src/sst/mod.rs` + `lib.rs`**: * `pub mod predicates` + re-export `ScanPredicate`, `RowGroupVerdict`, `eval_row_group`. 5. **`crates/namidb-query/src/plan/logical.rs`** (\~30 LoC + 1 test): * `LogicalPlan::NodeScan` adds `predicates: Vec`. * Type alias `pub use namidb_storage::sst::predicates::ScanPredicate` at module root so the rest of the query crate doesn’t need to know the storage path. * Updates to `children()` (no children added — predicates are flat), `operator_name()` (still “NodeScan”), `contains_write()` (still false). * Test ensures NodeScan with predicates equals NodeScan with same predicates and not equal when predicates differ. 6. **`crates/namidb-query/src/optimize/parquet_pushdown.rs`** (\~250 LoC + 14 tests): * `classify_pending_for_scan(pending: Vec, alias: &str, label_def: &LabelDef) -> (Vec, Vec)`. * Conversion `try_into_scan_predicate(expr, alias) -> Option` case-analysing each `Expression::kind`. 
Supports: BinaryOp {Eq/Lt/LtEq/Gt/GtEq} with `PropertyAccess(alias, prop)` on one side and `Literal(lit)` on the other; the literal converts to `StatScalar` via a helper. `IS NULL / IS NOT NULL`. `IN [list]` when every element is a literal. `BETWEEN` decomposes to Gte+Lte at lowering time so the AND-split already gives us two conjuncts. * Tests: eq pushable, eq with non-matching alias rejected, eq with literal-on-left, range pushable, IS NULL pushable, IS NOT NULL pushable, IN with all literals, IN with non-literal rejected, cross-alias rejected, non-declared property rejected, parameter rejected, complex arithmetic rejected, idempotency. 7. **`crates/namidb-query/src/optimize/pushdown.rs`** (\~30 LoC + 4 tests): * NodeScan arm in `pushdown_at` now consults `parquet_pushdown::classify_pending_for_scan`. The non-pushable conjuncts materialise as `Filter` above; the pushable accumulate into `predicates`. * Tests verify: filter eq on declared prop ends up in NodeScan predicates; filter on parameter stays as Filter; filter on undeclared prop stays as Filter; filter on different alias stays as Filter (and would have been pushed elsewhere by the CrossProduct arm). 8. **`crates/namidb-query/src/optimize/mod.rs`** (\~10 LoC): * `pub mod parquet_pushdown` + re-export `classify_pending_for_scan` so tests can reach it. * The `optimize` pipeline doesn’t add a new pass; the NodeScan arm change in `pushdown_at` covers it. 9. **`crates/namidb-query/src/optimize/normalize.rs`** (\~5 LoC): * `recurse_children` arm for NodeScan recurses on… nothing (NodeScan is a leaf). The change is to preserve `predicates` when the arm clones the variant — trivial. 10. **`crates/namidb-query/src/exec/walker.rs`** (\~10 LoC): * `execute_node_scan` callsite (line \~140) passes `predicates` to `snapshot.scan_label_with_predicates(label, predicates).await?`. 11. 
**`crates/namidb-query/src/cost/cardinality.rs`** (\~40 LoC + 3 tests): * NodeScan arm applies multiplicative selectivity over predicates using the existing `selectivity::selectivity` and `BindingStats` machinery. * Tests: NodeScan with eq predicate estimate drops below base; NodeScan with range predicate estimate proportional to range; NodeScan with empty predicates equals base. 12. **`crates/namidb-query/src/plan/explain.rs`** (\~50 LoC + 2 tests): * `write_header` arm for NodeScan with predicates renders as §6. Predicate Display uses `format_scan_predicate(p, alias)` helper, also unit-tested. 13. **`crates/namidb-query/src/plan/lower.rs`** (\~10 LoC): * Lowering creates `NodeScan { predicates: vec![] }`. Mecánica; sin test new. 14. **`crates/namidb-query/tests/cost_smoke.rs`** (+8 integration tests): * `parquet_pushdown_moves_eq_to_scan` * `parquet_pushdown_moves_range_to_scan` * `parquet_pushdown_keeps_cross_alias_in_filter` * `parquet_pushdown_keeps_undeclared_property_in_filter` * `parquet_pushdown_renders_in_explain` * `parquet_pushdown_estimate_drops_below_full_scan` * `parquet_pushdown_executes_with_parity_to_raw` * `parquet_pushdown_skips_all_row_groups_when_out_of_range` Snapshot esperado: * `cargo test --workspace --exclude namidb-py`: 528 → \~580 passed. * `cargo clippy --workspace --all-targets --exclude namidb-py -- -D warnings`: clean. * `cargo fmt --all -- --check`: clean. * LoC nuevo: \~900 src + \~500 tests + \~650 RFC. # RFC 014: HashSemiJoin via decorrelation > **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-011 (predicate pushdown), RFC-012 (HashJoin) **Supersedes:** — > *Mirrored from [`docs/rfc/014-hash-semi-join.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/014-hash-semi-join.md) in the engine repo. 
Source of truth lives there.*

## Summary

Eliminates the LAST nested-loop operator on the read path: `SemiApply`. Today `SemiApply { input, subplan, negated }` executes `subplan` ONCE per row of `input`, which is O(N·M) for `EXISTS` subqueries. On LDBC SF1 (3M outer rows × 100 avg degree per subplan = 3×10⁸ ops) that is infeasible. This RFC decorrelates the subplan (it replaces the `Argument` leaf with an independent `NodeScan`), builds a hash table once, and filters the outer side by probing it: O(N+M).

Scope:

* New operator `LogicalPlan::HashSemiJoin { outer, inner, on, negated, residual }`. Shaped EXACTLY like `HashJoin` except that:
    * `outer` has probe semantics (inner rows are NOT duplicated; at most 1 output row per matching outer row);
    * a `negated` flag covers `AntiSemiJoin` (NOT EXISTS).
* Rewriter `convert_semi_apply_to_hash_semi_join(plan, &catalog)`, bottom-up. It detects a `SemiApply` whose subplan:
    1. has EXACTLY one `Argument` leaf,
    2. whose `bindings` are a subset of a single alias `X`,
    3. whose label can be inferred from the outer scope (NodeScan or Expand with `target_label`).

    It replaces `Argument(X)` with `NodeScan { label: , alias: X, predicates: vec![] }` and emits `HashSemiJoin` with `on=[JoinKey{ build_side: Property(X,"id"), probe_side: Property(X,"id") }]`.
* Executor `execute_hash_semi_join`: the build phase materializes a `BTreeSet` (no full row buffering; we only need "any match"); the probe phase emits the outer row if the lookup hits (semi) or if it misses (anti).
* Cardinality: `HashSemiJoin` estimates rows as `outer.rows · min(1.0, inner.rows / outer_X_distinct)` (a semi-join retains at most all outer rows).
* EXPLAIN VERBOSE: `HashSemiJoin on=[(a.id, a.id)] negated=false`, or `AntiHashSemiJoin` when `negated=true`.

Out-of-scope:

* **SemiApply whose subplan has an Argument with MULTIPLE bindings**.
Requires a multi-column hash key, which is a different shape. Future iteration.
* **SemiApply whose subplan contains no Argument** (subplan independent of the outer). It is trivially "execute once", but it needs a different rewrite path (cache+broadcast). Future iteration.
* **PatternList decorrelation**. Same shape, but it materializes a list instead of a boolean. Future iteration.
* **Pushdown over HashSemiJoin**. Inherited from the existing pushdown (`hash_semi_join` arm in `optimize::pushdown::pushdown_at`): outer conjuncts can move down to the outer side; inner-only conjuncts do not apply (the inner contributes no bindings to the output).
* **Multi-pattern EXISTS** (`EXISTS { (a)-[]->(b)-[]->(c) }`). The subplan has a single Argument leaf; the body is a chain of Expands. This works automatically: the rewriter only replaces the Argument, and the rest of the subplan is kept verbatim and executed as the inner in the build phase.

## Motivation

Without decorrelation, a query such as:

```cypher
MATCH (a:Person)
WHERE EXISTS { (a)-[:KNOWS]->(b:Person) }
RETURN a.firstName
```

produces the plan:

```plaintext
Project [a.firstName]
  SemiApply { negated: false }
    NodeScan { Person, a }                          (outer)
    Expand { source=a, edge_type=KNOWS, target=b }  (subplan)
      Argument { bindings: [a] }
```

On a micro-graph with 6 Persons / 6 KNOWS, this already costs 6 × scan\_label = 6 evaluations of the subplan (which iterates every Person + edges per outer row). On LDBC SF1 (3M Persons / 100 avg degree) it costs 3M × scan\_label, which is effectively unbounded.

With decorrelation the optimized plan is:

```plaintext
Project [a.firstName]
  HashSemiJoin on=[(a.id, a.id)]
    NodeScan { Person, a }                          (outer)
    Expand { source=a, edge_type=KNOWS, target=b }  (inner)
      NodeScan { Person, a }                        (decorrelated leaf)
```

The build phase executes the inner ONCE: 6M edges. Probe phase: 3M outer rows × O(1) lookup = 3M ops. Total: 6M build + 3M probe = 9M ops. **An improvement of 3M / 30 = 100 000× for this typical case**.

## Design

### 1.
IR: `LogicalPlan::HashSemiJoin`

```rust
HashSemiJoin {
    /// The "probe" side. Bindings from `outer` are the ones that
    /// survive into the output.
    outer: Box<LogicalPlan>,
    /// The "build" side. Bindings introduced by `inner` are
    /// DROPPED: only used to decide whether each outer row matches.
    inner: Box<LogicalPlan>,
    /// Equi-join keys. `build_side` is evaluated on each `inner`
    /// row at build time, `probe_side` on each `outer` row at
    /// probe time. Single-key in v0 (multi-key is OK if needed).
    on: Vec<JoinKey>,
    /// `false`: keep outer rows that have at least one inner match
    /// (`EXISTS`). `true`: keep outer rows with NO inner match
    /// (`NOT EXISTS`).
    negated: bool,
    /// Residual predicate evaluated on the joined row (outer
    /// bindings + inner bindings, 3VL). Optional: most simple
    /// EXISTS lower to no residual.
    residual: Option<Expression>,
}
```

Semantics:

* `outer.bindings` ⊆ output bindings. Inner bindings are dropped (the semi-join semantics).
* Build phase: for each `inner` row, evaluate `JoinKey::build_side` expressions; if any key component is NULL, skip the row (3VL). Build a `BTreeSet` of canonical key fingerprints (reusing `dedup_rows`'s helper).
* Probe phase: for each `outer` row, evaluate `JoinKey::probe_side` expressions; if any is NULL, skip (3VL). Lookup in the set. Emit the outer row iff `(matched, negated)` matches the desired truth table:
    * `(true, false)` → keep (EXISTS).
    * `(false, true)` → keep (NOT EXISTS).
    * else → drop.
* Residual: when present, evaluate on the JOINED row (outer ∪ build's full row recovered from a secondary `Vec` map). For v0 we default `residual = None` since the lowering of bare EXISTS doesn't generate one.

### 2. Rewriter `optimize::decorrelation::convert_semi_apply_to_hash_semi_join`

Pre-walk the plan to populate `outer_labels`, a `BTreeMap` from alias to the declared label for every NodeScan and labeled Expand target in scope. Then walk top-down. For each `SemiApply { input, subplan, negated }`:

1. Recurse into `input` and `subplan` (independent decorrelation).
2.
Detect `subplan` shape:
    * Has exactly ONE `Argument { bindings: [X] }` leaf (depth-first descent through Expand/Filter/NodeById; reject if multiple Arguments or any operator that's not in the decorrelation-safe list).
    * The Argument's `bindings` is exactly `[X]` (a single alias).
    * `outer_labels[X] == Some(L)` for some label `L`.
3. Build `inner` by walking the subplan, replacing the unique `Argument` with `NodeScan { label: L, alias: X, predicates: vec![] }`.
4. Emit `HashSemiJoin { outer: input, inner: new_subplan, on: vec![JoinKey { build_side: Property(X, "id"), probe_side: Property(X, "id") }], negated, residual: None }`.

Decorrelation-safe operators (descend into them to find the Argument):

* `Expand`
* `Filter` (residual conjuncts stay on the inner; the semantics is "the subplan still filters its rows; HashSemiJoin probes whether any filtered row matches the outer key")
* `NodeById` (only if its `input` is the Argument leaf)
* `Project` with `discard_input_bindings: false` (rare in subplans but possible)
* `NodeScan`, `Empty`: never contain an Argument; the Argument has to be at the leaf.

If any other operator appears (`Aggregate`, `TopN`, `Distinct`, `Union`, `CrossProduct`, `HashJoin`, write ops, `SemiApply` itself), the rewriter bails and the original SemiApply is kept. v0 keeps the detection conservative: false negatives leave performance on the table but never produce incorrect plans.

Idempotency: `HashSemiJoin` is not a `SemiApply`, so the second pass of the fixpoint won't re-trigger. Verified in unit tests.

### 3. Executor

```rust
async fn execute_hash_semi_join(
    outer: &LogicalPlan,
    inner: &LogicalPlan,
    on: &[JoinKey],
    negated: bool,
    residual: &Option<Expression>,
    snapshot: &Snapshot<'_>,
    params: &Params,
    outer_bindings: Option<&Row>,
) -> Result, ExecError>
```

Phase 1 (build): execute `inner` once (no outer context). For each inner row, evaluate every `JoinKey::build_side` expression. If ANY is NULL, skip the row (3VL). Otherwise, push the fingerprint into a `BTreeSet`.
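
As a rough illustration of the build phase above and the probe phase described next, the sketch below uses a single-column key and an in-memory slice of rows. It is not the engine's executor: `Row`, `build_key_set`, and `probe` are hypothetical names introduced only for this example, and the fingerprint is simplified to a plain `u64`.

```rust
use std::collections::BTreeSet;

/// Hypothetical row type for this sketch: one nullable join key plus a payload.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    key: Option<u64>,
    name: &'static str,
}

/// Build phase: collect the set of non-NULL inner keys.
/// Rows whose key is NULL are skipped (3VL); duplicates collapse in the set.
fn build_key_set(inner: &[Row]) -> BTreeSet<u64> {
    inner.iter().filter_map(|r| r.key).collect()
}

/// Probe phase: keep an outer row iff `(matched, negated)` is
/// `(true, false)` or `(false, true)`. Outer rows with a NULL key are skipped.
fn probe(outer: &[Row], keys: &BTreeSet<u64>, negated: bool) -> Vec<Row> {
    outer
        .iter()
        .filter(|r| match r.key {
            None => false,
            // `matched != negated` is exactly the two "keep" cases.
            Some(k) => keys.contains(&k) != negated,
        })
        .cloned()
        .collect()
}
```

Because the build side is a set, an outer row is emitted at most once no matter how many inner rows share its key, and the output preserves outer row order.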
Phase 2 (probe): execute `outer`. For each outer row, evaluate every `JoinKey::probe_side`. Compute matched = `set.contains(fingerprint)`. Keep iff `(matched, negated)` is `(true, false)` or `(false, true)`.

Residual: when `residual.is_some()`, the build phase additionally stores the full inner row alongside the fingerprint, and the probe phase iterates matching inner rows to evaluate the residual on the joined binding map. v0 ships `residual = None` from the rewriter, so this path is exercised only by future RFC iterations.

### 4. Cardinality

```rust
LogicalPlan::HashSemiJoin { outer, inner, on, negated, residual: _ } => {
    let o = estimate_inner(outer, catalog);
    let i = estimate_inner(inner, catalog);
    // P(at least one inner match for an outer row) ≈
    //   1 - (1 - i.rows/distinct(inner_key))^(o.rows/distinct(outer_key))
    // Simplification: i.rows / max(distinct_outer, 1.0) treated as the
    // probability a random outer row matches.
    let frac_match = (i.rows / o.rows.max(1.0)).min(1.0);
    let rows = if negated { o.rows * (1.0 - frac_match) } else { o.rows * frac_match };
    ...
}
```

The estimate is folklore for now; multi-key correlation and inner NDV will be revisited in the future. The output is clamped to `[0, o.rows]`.

### 5. EXPLAIN VERBOSE

```plaintext
HashSemiJoin on=[(a.id, a.id)] (est=4)
  NodeScan label=Person alias=a (est=6)
  Expand source=a edge_type=KNOWS target=b (est=12)
    NodeScan label=Person alias=a (est=6)
```

When `negated=true`, the operator name is `AntiHashSemiJoin` (mirrors the existing `AntiSemiApply` rendering).

### 6. Integration with the pipeline

* `optimize::optimize` (in `optimize::mod`) runs `convert_semi_apply_to_hash_semi_join` AFTER `convert_cross_to_hash` in the same fixpoint round. Order: pushdown → normalize → cross-to-hash → semi-to-hashsemi.
* `optimize::pushdown::pushdown_at` gets a `HashSemiJoin` arm: conjuncts that reference only `outer` bindings push to the outer side; conjuncts that reference inner-only bindings are nonsensical (inner bindings are dropped), so they stay above the HashSemiJoin defensively.
* `optimize::normalize` gets a `HashSemiJoin` arm: recurse on outer and inner.

### 7. Bindings analysis

A subtle point: `Argument { bindings: [X] }` may carry multiple names when the subplan needs more than one outer variable. v0 rejects those (`arg_bindings.len() != 1`). They will be common in deeper LDBC queries with chained patterns; a future iteration lifts the restriction via multi-key joins.

When the subplan introduces NEW bindings (e.g. `Expand` introduces `target_alias`), those are local to the subplan and DO NOT leak to the outer output, same as the existing SemiApply semantics.

## Alternatives considered

### A. Hash table over outer + iterate inner

Build over outer (keyed by `X.id`), iterate inner and emit `outer_row` once when matched. **Rejected**: requires deduplication on the inner side (the same outer row might match multiple inner rows, emitting duplicates). Cheaper to build over inner.

So we go the other way: **build over INNER** (because the inner is typically the small EXISTS side: friends-of-friend, etc.) and probe outer. The build side stores the SET of key values; each outer row is checked once. The result preserves outer row order.

### B. Adaptive at runtime

Decide per-query whether to decorrelate based on the actual sizes of outer/inner. **Rejected**: defeats the EXPLAIN/PROFILE story. The cost model picks the side (build vs probe) statically.

### C. Apply pushdown into the subplan

A more aggressive rewrite: pushing outer predicates INTO the inner subplan so the inner produces only the relevant subset. **Rejected for v0**: requires correlation analysis beyond the equality on `X.id` (parameter propagation). A future iteration may revisit.

## Drawbacks

1.
**Restricted to single-Argument-binding subplans**. Multi-binding correlation is common in real LDBC queries, e.g. `EXISTS { (a)-[]->(b) ... b.x = a.y }`. v0 keeps these as nested-loop SemiApply.
2. **Inner duplication**. The decorrelated inner enumerates every X in the corpus (not just those referenced by the outer). When the outer is much smaller than the inner's universe (e.g. the outer is a single row), nested-loop SemiApply is cheaper, because `inner.rows / outer.rows` becomes lopsided. v1 could compare estimates and choose accordingly; v0 always decorrelates when the shape matches.
3. **NULL on join key drops outer / inner rows silently**. Same as `HashJoin` 3VL semantics. Documented inline. For a typical `EXISTS { (a)-[]->(b) }` this never triggers because node ids are never NULL.
4. **Cost-model estimate is folklore**. The independence and uniform-distribution assumptions over-simplify. RFC-010 §"Drawbacks 1" tracks the broader observation; it can be refined in the future.

## Open questions

* **OQ1**. Should the rewriter try to lift `Filter` arms from inside the subplan to the outer side when the conjunct references only outer bindings? Today the subplan is rewritten verbatim. Defer.
* **OQ2**. How does `HashSemiJoin` interact with `PatternList`? `PatternList` is semantically a multi-row apply that materialises a list. Decorrelation produces a `HashJoin` (NOT a semi-join) with array aggregation per outer key. Separate RFC.

## References

* Selinger et al., *Access Path Selection in a Relational Database Management System* (SIGMOD '79): semi-join cardinality.
* Galindo-Legaria & Joshi, *Orthogonal Optimization of Subqueries and Aggregation* (SIGMOD '01): formal decorrelation rewrites.
* `docs/rfc/008-logical-plan-ir.md`: IR this RFC extends.
* `docs/rfc/012-hash-join.md`: HashJoin executor this RFC mirrors.

## Implementation plan

1. **`crates/namidb-query/src/plan/logical.rs`** (\~50 LoC + 2 tests):
    * `LogicalPlan::HashSemiJoin` variant.
    * `operator_name` returns `"HashSemiJoin"` / `"AntiHashSemiJoin"`.
    * `children` → `[outer, inner]`. `contains_write` → false (rewriter never touches subtrees with writes).
2. **`crates/namidb-query/src/optimize/decorrelation.rs`** (\~250 LoC + 8 tests):
    * `convert_semi_apply_to_hash_semi_join(plan, &catalog)`.
    * `outer_label_map(plan)` walks NodeScan/Expand collecting alias → label.
    * `find_unique_argument(plan)` returns `Option<(&Argument bindings, parent path)>`.
    * `replace_argument(plan, x, label)` substitutes the unique Argument with a fresh `NodeScan { label, alias: X, predicates: vec![] }`.
    * Tests cover: simple EXISTS → decorrelates, NOT EXISTS → negated, no-Argument subplan → kept as SemiApply, multi-binding Argument → kept, label unknown → kept, EXISTS with extra Filter → still decorrelates (filter remains in inner), nested SemiApply → outer SemiApply NOT touched if its subplan has SemiApply, idempotency.
3. **`crates/namidb-query/src/optimize/mod.rs`** (\~10 LoC):
    * `pub mod decorrelation`.
    * `optimize` pipeline runs `convert_semi_apply_to_hash_semi_join(plan, catalog)` AFTER `convert_cross_to_hash`.
4. **`crates/namidb-query/src/optimize/pushdown.rs`** (\~30 LoC + 3 tests):
    * HashSemiJoin arm (split pending by outer/inner-aliases).
5. **`crates/namidb-query/src/optimize/normalize.rs`** (\~5 LoC):
    * HashSemiJoin arm in `recurse_children`.
6. **`crates/namidb-query/src/cost/cardinality.rs`** (\~40 LoC + 3 tests):
    * HashSemiJoin arm with §4 formula.
7. **`crates/namidb-query/src/exec/walker.rs`** (\~80 LoC + 2 tests):
    * `execute_hash_semi_join` build/probe phases.
8. **`crates/namidb-query/src/exec/writer.rs`** (\~10 LoC):
    * HashSemiJoin arm (defensive, never produced by writes).
9. **`crates/namidb-query/src/plan/explain.rs`** (\~30 LoC + 2 tests):
    * `HashSemiJoin` rendering. `negated` flag selects `AntiHashSemiJoin`.
10.
**`crates/namidb-query/tests/cost_smoke.rs`** (+5 integration tests):
    * `decorrelation_converts_simple_exists`,
    * `decorrelation_preserves_results`,
    * `decorrelation_handles_not_exists`,
    * `decorrelation_keeps_multi_binding_subplan_as_semi_apply`,
    * `decorrelation_renders_hash_semi_join_in_explain`.

Expected snapshot:

* `cargo test --workspace --exclude namidb-py`: 596 → \~625 passed.
* `cargo clippy --workspace --all-targets -- -D warnings`: clean.
* `cargo fmt --all -- --check`: clean.
* New LoC: \~500 src + \~250 tests + \~400 RFC.

# RFC 015: Projection pushdown / column pruning

> **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (Logical Plan IR), RFC-010 (cost model), RFC-013 (Parquet predicate pushdown) **Supersedes:** —

> *Mirrored from [`docs/rfc/015-projection-pushdown.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/015-projection-pushdown.md) in the engine repo. Source of truth lives there.*

## Summary

Today `NodeSstReader::scan()` decodes ALL Parquet columns declared in the `LabelDef`, even when the query references only a fraction of them. On Person in LDBC SF1 (\~12 columns, \~3M rows), a `RETURN a.firstName` decodes 12× more data than necessary. This RFC closes the pushdown end-to-end:

1. **Analyze**: walk the plan top-down collecting, per alias, the set of properties that expressions reference (RETURN, residual WHERE, ORDER BY, intermediate filter predicates).
2. **Annotate**: `LogicalPlan::NodeScan` gains `projection: Option<Vec<String>>` (None = all columns, the default for back-compat). The rewriter populates it with the set inferred by the analyze step.
3. **Storage**: `NodeSstReader::scan_with_predicates_and_projection` builds a Parquet `ProjectionMask` that reads only the required column leaves.
The engine columns (`node_id`, `tombstone`, `lsn`, `__schema_version`, `__overflow_json`) are always included.
4. **Reader**: the rest of the path (`Snapshot::scan_label_*`) is adapted to pass the projection through.
5. **EXPLAIN VERBOSE**: `NodeScan label=Person alias=a projection=[firstName]` when the projection is non-trivial.

On LDBC SF1 with `RETURN a.firstName` we expect 12× fewer bytes read from S3 plus 12× less decoding CPU.

Scope:

* Property-column pruning for NodeScan. Edge SSTs and NodeById are out of scope for v0.
* Conservative analysis: when an expression uses `Variable(a)` without a PropertyAccess (e.g. `RETURN a`), the projection is `None` (reads every column, including `__overflow_json`).
* Analysis of SemiApply/PatternList/HashSemiJoin subplans recurses within each scope (each inner emits its own projection).
* Predicates already pushed into `NodeScan.predicates` (RFC-013) count as references to their columns, because storage needs those columns to evaluate the filters.

Out-of-scope:

* **EdgesFwd/Inv property streams**. Edges do not yet emit per-property streams (RFC-002 §3.2.7 follow-up). Without separate streams there is no granularity to project on.
* **NodeById**. It decodes one row group with at most 1 row; the IO saving is marginal and the overhead of a projection mask for a point lookup is not justified in v0.
* **Overflow column elision**. When the query does NOT reference undeclared properties, we could omit `__overflow_json`. v0 always keeps it (defensive).
* **Projection pushdown inside Project**. When a `Project` leaves bindings alive (`discard_input_bindings: false`), every downstream column is potentially referenced. v0 treats a non-discarding Project as a barrier.
* **Pruning of schema-version / lsn columns**. Only declared properties are pruned. The engine columns are cheap (RLE-compressed UInt64 chunks) and removing them would break tombstone/winner semantics.
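
The conservative analysis in the scope list above (property accesses narrow the projection, a bare variable forces every column) can be sketched as follows. This is a minimal illustration over a toy AST, not the engine's implementation: `Expr`, `Required`, `collect_refs`, and `projection_for` are hypothetical names made up for this example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy expression AST (hypothetical; stands in for the engine's expression type).
#[allow(dead_code)]
enum Expr {
    Variable(String),
    PropertyAccess(String, String),
    BinaryOp(Box<Expr>, Box<Expr>),
    Literal(i64),
}

/// What an alias must provide: a specific set of properties, or everything.
#[derive(Debug, Clone, PartialEq)]
enum Required {
    Set(BTreeSet<String>),
    All,
}

/// Walk an expression and record, per alias, which properties it references.
fn collect_refs(expr: &Expr, out: &mut BTreeMap<String, Required>) {
    match expr {
        Expr::PropertyAccess(alias, key) => {
            let entry = out
                .entry(alias.clone())
                .or_insert_with(|| Required::Set(BTreeSet::new()));
            // Once an alias is `All`, further property accesses change nothing.
            if let Required::Set(set) = entry {
                set.insert(key.clone());
            }
        }
        // A bare variable (e.g. `RETURN a`) needs every column of that alias.
        Expr::Variable(alias) => {
            out.insert(alias.clone(), Required::All);
        }
        Expr::BinaryOp(lhs, rhs) => {
            collect_refs(lhs, out);
            collect_refs(rhs, out);
        }
        Expr::Literal(_) => {}
    }
}

/// Projection for one alias: `Some(sorted columns)` or `None` (read everything).
fn projection_for(alias: &str, required: &BTreeMap<String, Required>) -> Option<Vec<String>> {
    match required.get(alias) {
        Some(Required::Set(cols)) => Some(cols.iter().cloned().collect()),
        _ => None,
    }
}
```

Using a `BTreeSet` keeps the resulting column list sorted, which gives a deterministic projection for plan comparison and EXPLAIN output.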
## Motivation

Example plan before the rewrite:

```plaintext
Project [a.firstName] (est=3000000)
  NodeScan label=Person alias=a predicates=[] (est=3000000)
```

`NodeScan` decodes `prop_firstName`, `prop_lastName`, `prop_birthday`, `prop_creationDate`, `prop_locationIP`, `prop_browserUsed`, `prop_gender`, `prop_email`, `prop_speaks`, … 12+ Parquet columns. The executor then accesses only `row[a].get("firstName")`.

With projection pushdown:

```plaintext
Project [a.firstName] (est=3000000)
  NodeScan label=Person alias=a projection=[firstName] predicates=[] (est=3000000)
```

`NodeSstReader::scan_with_predicates_and_projection` builds a `ProjectionMask::leaves(schema, &[firstName_leaf])` and Parquet reads only the relevant column pages. **Reduction in bytes read: \~10× on Person SF1**. The gain compounds with the Parquet predicate pushdown (which already discarded row groups; now we discard columns inside the row groups that survive).

## Design

### 1. IR change

```rust
NodeScan {
    label: String,
    alias: String,
    predicates: Vec<ScanPredicate>,
    /// Optional projection: only these property columns are
    /// materialised. `None` = include every declared property
    /// (back-compat). The rewriter populates this from analysis;
    /// lowering emits `None`.
    projection: Option<Vec<String>>,
}
```

`PartialEq`, `Clone`, `Debug` derive over the new field. All existing constructions of `NodeScan` upgrade to `projection: None`. Two NodeScans with different projections are considered different plans (this matters for the `optimize` fixpoint termination check).

### 2. Analysis

Walk the plan TOP-DOWN with a `RequiredSet`:

```rust
#[derive(Default, Clone)]
struct RequiredSet {
    /// Properties accessed for each alias still in scope.
    by_alias: BTreeMap<String, RequiredProps>,
}

#[derive(Default, Clone)]
enum RequiredProps {
    /// A specific set of properties.
    Set(BTreeSet<String>),
    /// At least one expression accessed the binding as a whole
    /// (`Variable(alias)`): we don't know which properties it
    /// references, so all of them must survive.
    All,
}
```

Algorithm (`compute_required(plan: &LogicalPlan)`):

* Start with `RequiredSet::default()` at the root (no projections referenced yet).
* For each operator visited top-down, the operator's *output* may be referenced by the parent. Compute the required set *of the operator's output*, then determine what the operator's inputs must produce:
    * **Project**: items contribute references; outputs are the project's aliases. If `discard_input_bindings: true`, only the items' references survive; else inherit the parent's set + the items'.
    * **Filter / TopN / Distinct**: predicate / keys contribute references on top of the parent's.
    * **Expand**: introduces target\_alias / rel\_alias. Their requirements are sourced by reading the target NodeView / EdgeView. Removed downstream when the Expand's input is computed.
    * **NodeScan**: leaf. Its `alias`'s required set IS the projection we set on the NodeScan.
* Each expression contributes via `collect_property_refs(expr)`, which walks the AST and emits `(alias, key)` pairs from PropertyAccess nodes, plus `(alias, ALL)` from a bare `Variable(alias)` (e.g. `RETURN a` requires all columns).
* Predicates already pushed into `NodeScan.predicates` MUST also contribute: they reference column names via `ScanPredicate.column()`.

### 3. Rewriter

`apply_projection_pushdown(plan: LogicalPlan) -> LogicalPlan`:

1. Compute the required set once (a single top-down pass).
2. Walk the plan bottom-up. For each NodeScan, look up its alias in the required set:
    * `Some(RequiredProps::Set(cols))` → set `projection = Some(cols.into_iter().collect())`. Sort alphabetically for determinism.
    * `Some(RequiredProps::All)` or `None` → leave `projection = None` (read everything).

Idempotent: re-running on a plan that already has projections is a no-op because the analysis discovers exactly the same set.

The rewriter is integrated into `optimize::optimize` as the LAST step of each fixpoint round (after predicate pushdown, normalize, HashJoin conversion, decorrelation).
Putting it last ensures it sees the FINAL plan shape, including any predicates absorbed into NodeScan and any nodes the rewriters introduced.

### 4. Storage

```rust
impl NodeSstReader {
    pub fn scan_with_predicates_and_projection(
        &self,
        predicates: &[ScanPredicate],
        projection: Option<&[String]>,
    ) -> Result>;
}
```

Algorithm:

1. If `projection.is_none()` → fall through to `scan_with_predicates(predicates)` (no extra projection mask).
2. Build a `ProjectionMask::leaves(schema_descr, &leaf_indices)` where `leaf_indices` includes:
    * Engine columns: `node_id`, `tombstone`, `lsn`, `__schema_version`, `__overflow_json`. Always.
    * For each property name in `projection`, locate `prop_` in the Parquet schema and add its leaf index. Defensive: if a property is not in the schema (label evolution edge case), skip it.
3. Apply both row-group pruning AND the projection mask via `ParquetRecordBatchReaderBuilder::with_projection`.
4. The decoded `RecordBatch`es have a SCHEMA that includes only the selected columns. The reader returns them as-is; the caller (`Snapshot::scan_label_*`) is already defensive when looking up columns by name, so missing columns map to `None` properties.

Optimization note: Parquet's `ProjectionMask` avoids decoding the column pages NOT in the projection. Combined with row-group skipping (Parquet predicate pushdown), the cold read on S3 goes from `O(R * C)` page reads to `O(R_kept * C_proj)`, where `R_kept ≪ R` and `C_proj ≪ C` (with projection pushdown).

### 5. Snapshot reader adaptation

```rust
impl Snapshot<'_> {
    pub async fn scan_label_with_predicates_and_projection(
        &self,
        label: &str,
        predicates: &[ScanPredicate],
        projection: Option<&[String]>,
    ) -> Result>;
}
```

Memtable handling: the `properties` BTreeMap is *constructed* in memory anyway (no IO to save). For consistency we filter the in-memory properties to only include the projected names, which keeps the `NodeView` shape uniform between memtable-sourced and SST-sourced rows. Cheap.
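
The memtable filtering step can be pictured with a small sketch. It assumes a property map keyed by name; `project_properties` is a hypothetical helper introduced for this example, and `String` values stand in for the engine's property value type, which is not modelled here.

```rust
use std::collections::BTreeMap;

/// Filter an in-memory property map down to the projected names.
/// `None` means "no projection": every property is kept.
/// Projected names that are absent from the map are silently skipped.
fn project_properties(
    properties: &BTreeMap<String, String>,
    projection: Option<&[String]>,
) -> BTreeMap<String, String> {
    match projection {
        None => properties.clone(),
        Some(names) => names
            .iter()
            .filter_map(|name| properties.get_key_value(name))
            .map(|(k, v)| (k.clone(), v.clone()))
            .collect(),
    }
}
```

Iterating over the projection rather than the map keeps the cost proportional to the number of projected names, which is usually the smaller side.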
`scan_label_with_predicates(label, predicates)` becomes a wrapper around `scan_label_with_predicates_and_projection(label, predicates, None)`. `scan_label(label)` continues to wrap `(label, &[], None)`.

### 6. EXPLAIN VERBOSE

```plaintext
NodeScan label=Person alias=a projection=[firstName] predicates=[a.age > 30]
```

When `projection.is_none()` we omit the field (default behaviour). When `projection.is_some()`, sort alphabetically for stable output.

### 7. Cardinality

No change: the row count out of a projected NodeScan is identical to the un-projected one (same rows, fewer columns). The cost model doesn't track byte-level costs in v0; that lives behind a follow-up (once we add a CPU-weighted cost).

## Alternatives considered

### A. Project pushdown inside the executor

Skip the storage layer adaptation; let the executor build the `NodeView` with all columns and discard unused ones. **Rejected**: defeats the entire purpose. The win is in the IO path (S3 reads fewer column pages).

### B. Per-property bloom filter

Already deferred from Parquet predicate pushdown. Not relevant to projection.

### C. Schema-aware col pruning at the manifest level

The manifest already records `PropertyColumnStats` per column. We could elide columns whose stats are all-null (the column is missing in every SST). **Rejected**: dynamic, since the schema may evolve; a defensive read of a "missing column" returns None and is cheap.

### D. Just rely on Parquet's RLE for unused columns

Parquet's run-length encoding makes unused-column reads cheap if the column is mostly null or constant. **Rejected**: still pays the metadata cost (column index fetches) plus the page-header round-trip per column. The projection mask is strictly better.

## Drawbacks

1. **Missing projection ⇒ no win**. When the query references the bare alias (`RETURN a`), the analysis falls back to `RequiredProps::All` and we read every column. Same as today.
2. **PropertyAccess inside subqueries**.
The analysis descends into subplans of SemiApply / HashSemiJoin / PatternList: IF the subplan introduces a NodeScan, the NodeScan inside the subplan gets its own projection. The decorrelated inner reads `a.id` plus whatever the subplan body references. Often just `id` ⇒ massive win.
3. **Schema evolution**. If a writer landed an SST with column X that the current schema doesn't declare (extra prop), the projection mask might exclude it and accidentally drop a usable column. v0 only projects from declared properties (`label_def.properties`), so extra columns are simply not requested, the same as the no-projection behaviour for those columns.
4. **Composite types**. `FloatVector` and `Json` columns are projected like any other property. `Json` may carry overflow properties we don't want to drop, but `__overflow_json` is always in the engine-columns list, separate from `prop_*` json columns. Documented inline.
5. **Cardinality doesn't reflect IO savings**. Two NodeScans with identical row counts but different projections cost differently in bytes. v0 EXPLAIN VERBOSE shows `est=N` rows for both; a future PROFILE may surface bytes.

## Open questions

* **OQ1**. Should NodeById get projection too? Each NodeById decodes exactly one row group (≤ 1 row); the metadata overhead of building a projection mask may dominate the win. Defer.
* **OQ2**. How does projection interact with `compact.rs` (LSM compactions)? Compactions read full rows to merge winners. They don't go through `scan_label`. Unaffected.

## References

* DuckDB's *push-based execution* projection rewrites (Raasveldt 2022).
* Parquet's `ProjectionMask::leaves` API.
* `docs/rfc/008-logical-plan-ir.md`: IR this RFC extends.
* `docs/rfc/013-parquet-predicate-pushdown.md`: IO-pushdown this RFC composes with.

## Implementation plan

1. **`crates/namidb-storage/src/sst/nodes.rs`** (\~60 LoC + 4 tests):
    * `NodeSstReader::scan_with_predicates_and_projection` extending `scan_with_predicates` with a `ProjectionMask`.
Engine columns always included.
    * Unit tests: projection includes engine + named columns; missing property is silently skipped; projection=None falls through.
2. **`crates/namidb-storage/src/read.rs`** (\~50 LoC + 2 tests):
    * `Snapshot::scan_label_with_predicates_and_projection`.
    * Memtable view filtering: only declared+projected properties survive.
3. **`crates/namidb-query/src/plan/logical.rs`** (\~10 LoC + 1 test):
    * `NodeScan` adds `projection: Option<Vec<String>>`. Update all constructions.
4. **`crates/namidb-query/src/optimize/projection_pushdown.rs`** (\~250 LoC + 8 tests):
    * `apply_projection_pushdown(plan)`.
    * `compute_required(plan) -> RequiredSet`.
    * `collect_property_refs(expr, out)` AST walker.
5. **`crates/namidb-query/src/optimize/mod.rs`** (\~5 LoC):
    * `pub mod projection_pushdown` + run as last step of pipeline.
6. **`crates/namidb-query/src/exec/walker.rs`** (\~5 LoC):
    * NodeScan callsite passes `projection.as_deref()` to snapshot.
7. **`crates/namidb-query/src/plan/explain.rs`** (\~10 LoC):
    * Render `projection=[col1, col2]` when present.
8. **`crates/namidb-query/tests/cost_smoke.rs`** (+5 integration tests):
    * `projection_pushdown_extracts_referenced_columns`,
    * `projection_pushdown_handles_bare_variable_as_all`,
    * `projection_pushdown_includes_predicate_columns`,
    * `projection_pushdown_executes_with_parity`,
    * `projection_pushdown_renders_in_explain`.

Expected snapshot:

* `cargo test --workspace --exclude namidb-py`: 612 → \~640 passed.
* `cargo clippy --workspace --all-targets -- -D warnings`: clean.
* `cargo fmt --all -- --check`: clean.
* New LoC: \~400 src + \~250 tests + \~600 RFC.

# RFC 016: Join reorder DP/greedy

> **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-010 (cost model), RFC-012 (HashJoin), RFC-013/015 (storage pushdowns) **Supersedes:** —

> *Mirrored from [`docs/rfc/016-join-reorder.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/016-join-reorder.md) in the engine repo.
Source of truth lives there.*

## Summary

HashJoin (RFC-012) decides locally: it picks build/probe for the current pair only. When there is a *chain* `HashJoin { HashJoin { ... }, R3 }` (3+ joined relations), the order matters: the pair that is built first produces the hash table whose size dominates memory, and the intermediate output whose size propagates to the next join. Today the optimizer keeps the literal order from lowering, which follows the textual order of the WHERE clause and is frequently sub-optimal. This RFC enumerates all left-deep orders for chains of HashJoins, uses `estimate()` (cost model RFC-010 + real ndv) to pick the one with the lowest total intermediate cost, and rebuilds the tree in that order. Selinger '79 DP O(N²·2^N), capped at N≤8 (LDBC SF1 IC8/IC9 have 4-5 patterns → feasible).

Scope v0:

* **HashJoin chains only**. Detect a subtree where ALL internal operators are `HashJoin` and the leaves are arbitrary sub-trees (NodeScan, Expand, Filter, NodeById, etc.).
* **Equi-join keys preserved**. Each `HashJoin.on` remains a predicate on the pair produced at that step. The rewriter re-distributes the equalities to the correct pairs.
* **Left-deep DP** (Selinger '79). At each DP step it picks the pair (S, R) that minimizes the accumulated cost, where S is a subset and R a relation. Bushy plans could gain a further 10-30% on particular queries; deferred.
* **Cap N≤8 relations**. For N>8 (rare in LDBC), we skip the reorder (keep the literal order). 2^N=256 subsets are manageable.

Out-of-scope:

* **Expand chain reordering**. Re-anchoring a `(a)→(b)→(c)` chain to start from `c` requires a reverse-direction Expand and knowing the label of each alias. Complicated for v0; deferred.
* **Cross-product reorder**. A CrossProduct without equi-keys stays as a nested loop, so there is no useful ordering decision.
* **HashSemiJoin reorder**.
SemiJoins already have a fixed order (outer probe, inner build); reorder does not apply directly.
* **Cost-based reorder of CrossProducts inside a chain**. v0 only reorders the HashJoin-only subtree.

## Motivation

IC8-like plan, pre-rewrite:

```plaintext
HashJoin on=[(b.id, c.knows_id)]
  HashJoin on=[(a.id, b.knows_id)]
    NodeScan(Person, a) predicates=[a.id=$personId] (est=1)
    NodeScan(Person, b) (est=1000000)
  NodeScan(Person, c) (est=1000000)
```

Intermediate sizes:

* Inner HashJoin (a × b): 1 × 1M / ndv(KNOWS\_id≈100) = 10000
* Outer HashJoin (ab × c): 10000 × 1M / 100 = 100000000

If the optimizer reordered to `(a × c) × b` (assuming the `a-b` and `a-c` joins are both valid):

* Inner HashJoin (a × c): 1 × 1M / 100 = 10000 (similar)
* Outer HashJoin (ac × b): 10000 × 1M / 100 = 100000000 (similar)

In this example the order changes little because the output cardinalities are similar. When predicate selectivities differ, however, the optimal order makes the difference. The canonical form is:

```plaintext
Total cost = Σ |intermediate_i|
```

We minimize that sum. For 3 relations the optimum is always to build on the relation with the lowest cardinality first.

## Design

### 1. Subtree detection

```rust
struct JoinSubtree {
    /// Each leaf is an "atomic relation": a plan subtree that does
    /// NOT contain a HashJoin at the root. (It may contain nested
    /// joins below; those were chosen by earlier passes.)
    leaves: Vec<LogicalPlan>,
    /// All the equi-join predicates pooled from every HashJoin in
    /// the subtree. Each is a pair of expressions; v0 supports only
    /// `(Property(alias_l, key_l), Property(alias_r, key_r))`.
    equalities: Vec<JoinEdge>,
    /// Residual expressions (non-equi) pooled from every HashJoin's
    /// `residual` field. Will be re-attached to whatever pair
    /// produces both halves of the binding map.
    residuals: Vec<Expression>,
}

struct JoinEdge {
    left_leaf_idx: usize,
    right_leaf_idx: usize,
    build_expr: Expression,
    probe_expr: Expression,
}
```

Pre-walk: recursive descent.
When a HashJoin is hit, decompose into the `build`'s recursion + the `probe`'s recursion, and add the join edges to the pool. Any non-HashJoin descendant becomes a leaf, with the aliases it produces tracked.

### 2. Left-deep DP (Selinger '79)

```rust
struct DpState {
    /// Bitset of leaves included in this subplan.
    leaves_mask: u32,
    /// Best plan covering exactly `leaves_mask`.
    best_plan: LogicalPlan,
    /// Estimated cost (sum of intermediate sizes).
    cost: f64,
    /// Estimated rows of this subplan's output.
    rows: f64,
}
```

Selinger '79:

1. Base case (single leaf): `cost=0, rows=estimate(leaf)`.
2. Build: for size = 2..=N:
   * For each subset S of leaves with |S|=size:
     * For each (a) sub-subset T ⊂ S with |T|=size-1, (b) the remaining single leaf r = S \ T:
       * If there's an equi-key between T's aliases and r's aliases: candidate plan = HashJoin(build=DP\[T], probe=r) with cost = DP\[T].cost + cost\_of\_hash\_join(DP\[T].rows, r.rows, keys).
     * Pick the candidate with the lowest cost; record it in DP\[S].
3. Pick DP\[full\_set] as the final reorder.

`cost_of_hash_join` v0: `build.rows + probe.rows + estimated_output_rows`. Estimated output rows = Selinger '79 formula (reusing `cost::cardinality::estimate_hash_join`).

### 3. Cap & fallback

N>8 → skip reorder (keep literal). The cap is also a guard against catastrophic blow-up when the user writes pathological queries.

### 4. Pipeline integration

`optimize::optimize` runs the join\_reorder AFTER `convert_cross_to_hash` and `convert_semi_apply_to_hash_semi_join` (so it sees the full HashJoin shape) but BEFORE `apply_projection_pushdown` (so the projection can prune the FINAL shape).

Idempotency: re-running on a plan that was already optimal produces the same plan (the DP picks the same min-cost order deterministically).

### 5. Edge cases

* **No equalities between subsets**. If the only way to bridge two subsets is a CrossProduct (no equi-key), we leave them as CrossProduct (HashJoin's pre-condition still applies).
The DP just picks the lower-cost arm available.
* **Residuals**. Ideally, after the DP picks the final tree, each residual is re-attached to the LOWEST HashJoin whose bindings include all the aliases the residual references. v0 instead attaches all residuals to the ROOT of the reordered tree (conservative).
* **Multi-key joins**. When two relations share multiple equi-keys, the DP picks them all (the `on` list grows).

## Alternatives considered

### A. Greedy bottom-up (no DP)

Pick the cheapest pair, merge, repeat. O(N²) instead of O(N²·2^N). **Rejected**: known to produce sub-optimal plans on chains with varying selectivities. DP at N≤8 is cheap enough.

### B. IKKBZ (Krishnamurthy-Boral-Zaniolo)

Optimal left-deep enumeration in polynomial time using ranking functions. **Rejected v0**: complex to implement; DP at N≤8 is fast enough (\~1ms even for N=8 = 256 subsets).

### C. Bushy DP

Try every binary partition, not just left-deep. **Deferred**. Gains \~10-30% on specific cyclic patterns; doesn't apply to most LDBC SNB queries.

### D. Hyper-graph reorder

DSDP (Moerkotte) or similar. **Out of scope**. To be revisited if benchmarks show v0 left-deep DP loses to bushy.

## Drawbacks

1. **Capped at N=8**. Queries with 9+ pattern parts (rare in LDBC) keep the literal order. Mitigated by `> 8` being uncommon.
2. **Residual placement is conservative**. v0 always attaches the union of residuals at the root. If a residual references only 2 leaves, attaching it lower would prune earlier. Deferred.
3. **Doesn't reorder Expand chains**. The biggest wins on LDBC IC2 would come from re-anchoring an Expand chain; that is a structural rewrite this RFC explicitly punts.
4. **Cost model assumptions**. Selinger '79 assumes uniform distribution and independent keys. RFC-010 §"Drawbacks" tracks the broader issue. This can be revisited in the future.

## References

* Selinger et al., *Access Path Selection in a Relational Database Management System* (SIGMOD '79).
* Krishnamurthy, Boral, Zaniolo, *Optimization of Nonrecursive Queries* (1986): IKKBZ.
* Moerkotte, *Building Query Compilers*: bushy enumeration.
* `docs/rfc/012-hash-join.md`: the HashJoin this RFC reorders.

## Implementation plan

1. **`crates/namidb-query/src/optimize/join_reorder.rs`** (\~500 LoC + 10 unit tests):
   * `reorder_joins(plan, &catalog) -> LogicalPlan`.
   * Collect/decompose helpers.
   * Selinger DP with bitmask state.
   * Rebuild HashJoin tree from DP result.
2. **`crates/namidb-query/src/optimize/mod.rs`** (\~5 LoC):
   * Pipeline step `reorder_joins(...)` after decorrelation, before projection pushdown.
3. **`crates/namidb-query/tests/cost_smoke.rs`** (+4 tests):
   * `join_reorder_prefers_smaller_build_side`,
   * `join_reorder_keeps_plan_when_no_alternatives`,
   * `join_reorder_executes_with_parity`,
   * `join_reorder_caps_at_8_relations`.

Expected snapshot:

* `cargo test --workspace --exclude namidb-py`: 627 → \~641 passed.
* `cargo clippy --workspace --all-targets -- -D warnings`: clean.
* `cargo fmt --all -- --check`: clean.
* New LoC: \~500 src + \~250 tests + \~420 RFC.

# RFC 017: Factorized intermediate results

> **Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-008 (LogicalPlan IR), RFC-012 (HashJoin), RFC-015 (projection pushdown), RFC-016 (join reorder) **Supersedes:** —

> *Mirrored from [`docs/rfc/017-factorization.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/017-factorization.md) in the engine repo. Source of truth lives there.*

## Summary

The current executor is Volcano-eager with a single intermediate result type, `Vec<Row>` where `Row = BTreeMap` (`exec/row.rs:11`).
Each operator fully materializes its output before passing it to the next, and `Expand` (`exec/walker.rs:484`) clones the `BTreeMap` for every expanded edge (`new_row.clone()` × 2 at walker.rs:544/554). This produces an explicit Cartesian blow-up in multi-hop patterns: for `(p)-[:KNOWS]->(f)-[:KNOWS]->(fof)<-[:HAS_CREATOR]-(msg)` with a fan-out of \~10 at each hop, the executor materializes \~1500 `Row`s with 3-4 bindings each before the `LIMIT 20`.

This RFC introduces **factorized intermediate results**, a representation in which the outputs of Expand / HashJoin / CrossProduct are chains of `FactorNode { parent: Option<FactorIdx>, binding: Slot }` backed by a `FactorArena`. Each new binding adds a single node to the arena instead of cloning the whole BTreeMap. Materialization to `Vec<Row>` is deferred until the operator that requires it (TopN / Aggregate / final Project), and when it happens, only the chains reachable through the `LIMIT` / projection are flattened.

For IC09 this reduces the footprint from O(fanout³) `Row`s to O(fanout³) `FactorNode`s of \~16 bytes each, and the final materialization drops to O(LIMIT × depth), which is exactly the theoretical bound of Olteanu (2015) and the f-rep representation that Kùzu (CIDR 2023) uses internally.

### v0 scope

* **Pointer-based, arena-allocated factorization**. Trie-based (Olteanu) remains the theoretical reference; the concrete shape is a DAG of `FactorNode`s with `usize` indices into the arena (see Design §2).
* **Rewritten operators**: `Expand`, `CrossProduct`, `HashJoin` (build + probe in F-rep), `Filter`, intermediate `Project`.
* **Materialization in sinks**: `TopN`, `Aggregate`, `Distinct`, final `Project` (RETURN), `Unwind`, `Union`, `PatternList`. Sinks consume `FactorRowSet` and emit `Vec<Row>`.
* **Backwards compatibility via feature flag**: the environment variable `NAMIDB_FACTORIZE=0` (default = on once stabilized) restores the `Vec<Row>` path for semantic regression checks. The SemanticParity test suite compares outputs (row-set equality) between both paths.
* **Row parity in `exec_ldbc_snb.rs`**: 100% maintained. The Cypher e2e tests validate equivalence, not internal representation.

### Out-of-scope v0

* **WCOJ (Worst-Case Optimal Joins)**. RFC-009 (in draft) introduces leapfrog triejoin for cyclic queries. WCOJ composes with factorization (both operate on f-rep), but the operator implementation is deferred.
* **Columnar operators (Arrow-native)**. We keep a `RuntimeValue` per binding (an individual value). Arrow-vectorized batches are left for a future iteration (morsel-driven).
* **Spilling to disk**. The FactorArena lives in memory. If the dataset exceeds RAM, the fallback is the flat path with stream spill, which is outside v0.
* **DAG-level reuse** (CSE). If two branches of the plan share a prefix, we neither detect nor share sub-arenas. Selinger already chooses a global order; CSE-on-F-rep is a follow-up.

## Motivation

**The bench (smoke gate scale=0.1) reveals the cost of the current path:**

| Query                             | NamiDB p50 | Kùzu p50 | Ratio    |
| --------------------------------- | ---------- | -------- | -------- |
| IC02 (KNOWS + HAS\_CREATOR)       | 62 ms      | 1.04 ms  | **60×**  |
| IC07 (HAS\_CREATOR + LIKES)       | 7 ms       | 0.97 ms  | 7×       |
| IC08 (HAS\_CREATOR + REPLY\_OF)   | 7 ms       | 1.10 ms  | 6×       |
| IC09 (KNOWS·KNOWS + HAS\_CREATOR) | **624 ms** | 1.64 ms  | **382×** |

Row parity is 100% (compare.py confirms an identical count and the same rows); the divergence is purely in the engine. Kùzu keeps factorized intermediates (Jin et al., CIDR 2023 §4.2) and emits plans that defer materialization to the `LIMIT`.
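To make the cost of eager materialization concrete, the following is a back-of-envelope sketch, not engine code. It counts how many binding entries each representation stores for a multi-hop expansion, assuming fan-outs of 10, 10 and 15 (the KNOWS·KNOWS + HAS\_CREATOR figures this RFC uses for IC09), one new binding per hop, and no relationship alias. The function names are hypothetical.

```rust
/// Number of outputs at each level of a multi-hop expansion.
/// Level 0 is the single anchor row; each hop multiplies by its fan-out.
fn level_sizes(fanouts: &[u64]) -> Vec<u64> {
    let mut sizes = vec![1u64];
    for f in fanouts {
        let last = *sizes.last().unwrap();
        sizes.push(last * f);
    }
    sizes
}

/// Flat path: every row at depth d owns its own copy of d + 1 bindings.
fn flat_binding_entries(fanouts: &[u64]) -> u64 {
    level_sizes(fanouts)
        .iter()
        .enumerate()
        .map(|(depth, rows)| rows * (depth as u64 + 1))
        .sum()
}

/// Factorized path: one node (one new binding) per output; ancestors are
/// reached through the parent pointer instead of being copied.
fn factor_nodes(fanouts: &[u64]) -> u64 {
    level_sizes(fanouts).iter().sum()
}
```

Under these assumptions the flat path stores 6321 binding entries across all levels, while the factorized path stores 1611 nodes. The gap widens with depth because the flat path re-copies every inherited binding at each hop.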
**Current IC09 plan:**

```plaintext
TopN(20, msg.creationDate DESC)
└─ Project [fof.firstName, fof.lastName, msg.content, msg.creationDate]
   └─ Expand HAS_CREATOR (msg ← post.has_creator)
      └─ Expand KNOWS (fof, hop 1..1)
         └─ Expand KNOWS (friend, hop 1..1)
            └─ NodeById Person p {id: $personId}
```

**Footprint at each level (fanout ≈10 for KNOWS, ≈15 for HAS\_CREATOR):**

| Operator      | Rows | Bindings × Row     | Bytes (BTreeMap + Node clones) |
| ------------- | ---- | ------------------ | ------------------------------ |
| NodeById      | 1    | 1 (p)              | \~200 B                        |
| Expand friend | 10   | 2 (p, f)           | \~4 KB                         |
| Expand fof    | 100  | 3 (p, f, fof)      | \~60 KB                        |
| Expand msg    | 1500 | 4 (p, f, fof, msg) | \~1.2 MB                       |
| TopN(20)      | 20   | 4                  | (discards 1480)                |

The "Bytes" column counts `Box` + `BTreeMap` allocs + the shared `Arc` of the binding name. The 1.2 MB at Expand msg is \~80% allocator + \~20% clone CPU. **`new_row.clone()` at walker.rs:544 is invoked 1500 times on this path**, each clone copying the 3 previous BTreeMap entries.

**IC09 plan with factorization:**

```plaintext
TopN(20)                       ← materialize() here, only the 20 final chains
└─ Project                     ← factorized pass-through (allocates no rows)
   └─ ExpandF HAS_CREATOR      → FactorArena nodes for {msg}
      └─ ExpandF KNOWS (fof)   → FactorArena nodes for {fof}, parent=friend_node
         └─ ExpandF KNOWS      → FactorArena nodes for {friend}, parent=p_node
            └─ NodeById        → 1 FactorNode root with {p}
```

| Operator               | FactorNodes                           | Bytes/node         | Total  |
| ---------------------- | ------------------------------------- | ------------------ | ------ |
| NodeById               | 1                                     | 24 (parent + Slot) | 24 B   |
| ExpandF friend         | 10                                    | 24                 | 240 B  |
| ExpandF fof            | 100                                   | 24                 | 2.4 KB |
| ExpandF msg            | 1500                                  | 24                 | 36 KB  |
| TopN(20) materialize() | 20 × 4 bindings = 80 BTreeMap entries | flat               | \~6 KB |

**\~36 KB vs \~1.2 MB = 33× less intermediate memory.** The CPU saving is similar (no more BTreeMap clones; an arena push is \~10 ns vs \~500 ns for a clone).

## Design

### 1. Data types

New module `crates/namidb-query/src/exec/factor.rs`:

```rust
use std::cell::RefCell;
use std::sync::Arc;

use smallvec::SmallVec;

/// Index into FactorArena. `u32` keeps the index dense (4 bytes vs 8).
pub type FactorIdx = u32;
pub const FACTOR_ROOT: FactorIdx = 0;

/// Single binding introduced by an operator: (name, value). Names are
/// `Arc<str>` so siblings share without allocating.
#[derive(Debug, Clone)]
pub struct Slot {
    pub name: Arc<str>,
    pub value: RuntimeValue,
}

/// One factorized output node. `parent` chains upward to inherited
/// bindings; `slots` is what THIS operator added. The root node
/// (FACTOR_ROOT) has parent=None and an empty Slot vec.
#[derive(Debug)]
pub struct FactorNode {
    pub parent: Option<FactorIdx>,
    /// Bindings added at this level. Usually 1 (Expand adds {target_alias},
    /// HashJoin adds the probe-side bindings) but can be N for CrossProduct
    /// or HashJoin output that emits multiple bindings at once.
    pub slots: SmallVec<[Slot; 2]>,
}

/// Arena of all factor nodes for one query execution. Grows monotonically;
/// no reuse, no GC. Dropped at end of execute().
#[derive(Debug, Default)]
pub struct FactorArena {
    nodes: Vec<FactorNode>,
}

impl FactorArena {
    pub fn new() -> Self {
        let mut a = Self::default();
        a.nodes.push(FactorNode { parent: None, slots: SmallVec::new() });
        debug_assert_eq!(a.nodes.len(), 1, "root is at FACTOR_ROOT");
        a
    }

    pub fn push(&mut self, parent: FactorIdx, slots: SmallVec<[Slot; 2]>) -> FactorIdx {
        let idx = self.nodes.len() as FactorIdx;
        self.nodes.push(FactorNode { parent: Some(parent), slots });
        idx
    }

    /// Walk parent chain and accumulate bindings into a flat Row. Used
    /// only at materialization points.
    pub fn materialize(&self, leaf: FactorIdx, projection: Option<&[&str]>) -> Row {
        let mut row = Row::new();
        let mut cur = Some(leaf);
        while let Some(idx) = cur {
            let node = &self.nodes[idx as usize];
            for slot in node.slots.iter().rev() {
                if let Some(p) = projection {
                    if !p.iter().any(|w| **w == *slot.name) { continue; }
                }
                // First occurrence wins (shadowing: child overrides parent).
                row.bindings.entry(slot.name.to_string())
                    .or_insert_with(|| slot.value.clone());
            }
            cur = node.parent;
        }
        row
    }
}

/// What each operator passes to its parent. Replaces `Vec<Row>` as
/// the intermediate type once factorization is enabled.
pub struct FactorRowSet {
    pub arena: Arc<RefCell<FactorArena>>,
    pub leaves: Vec<FactorIdx>,
}
```

**Decision `usize` vs `u32`:** `u32` to keep `FactorIdx` dense (4 bytes vs 8). The cap is 4G nodes per query, which is more than enough.

**Decision `Arc<str>` for `Slot.name`:** Binding names average \~10 chars and repeat at EVERY level of the DAG. An inline string would cost \~16 B/binding × millions of bindings = wasted MBs. A shared `Arc<str>` = \~10 B/string + 8 B/Arc clone (atomic ref count).

**Decision `SmallVec<[Slot; 2]>`:** Most Expands add 1 binding (target). HashJoin adds the probe-side bindings (3-5 typical). A `SmallVec` with inline capacity 2 avoids the heap allocation in the \~80% of cases that fit inline.

### 2. Rewritten operators

#### 2.1 `execute_expand` (walker.rs:484)

**Before:**

```rust
async fn execute_expand(rows: Vec<Row>, ...) -> Result<Vec<Row>> {
    let mut out = Vec::new();
    for row in rows {
        let mut frontier = vec![Step { tail, row: row.clone() }];
        for hop in 1..=max {
            for step in frontier.drain(..) {
                for edge in neighbours {
                    let mut new_row = step.row.clone(); // ← clone #1
                    new_row.set(target_alias, value);
                    next_frontier.push(Step { row: new_row.clone() }); // ← clone #2
                    if hop >= min { out.push(new_row); }
                }
            }
        }
    }
    Ok(out)
}
```

**After:**

```rust
async fn execute_expand_factor(
    input: FactorRowSet,
    target_alias: Arc<str>,
    rel_alias: Option<Arc<str>>,
    ...
) -> Result<FactorRowSet> {
    let arena = input.arena.clone();
    let mut out_leaves = Vec::new();
    for leaf in input.leaves {
        // Find tail node id by walking up to the binding `source`.
        let tail = arena.borrow().lookup_binding(leaf, source)?;
        let mut frontier = vec![(leaf, tail)];
        for hop in 1..=max {
            let mut next_frontier = Vec::new();
            for (parent_idx, tail_id) in frontier.drain(..) {
                for edge in neighbours_of(snapshot, edge_type, dir, tail_id).await? {
                    let target_id = partner_id(&edge, dir, tail_id);
                    let target_view = lookup(...).await?;
                    let mut slots = SmallVec::new();
                    if let Some(name) = &rel_alias {
                        slots.push(Slot { name: name.clone(), value: RuntimeValue::Rel(...) });
                    }
                    slots.push(Slot {
                        name: target_alias.clone(),
                        value: RuntimeValue::Node(Box::new(NodeValue::from(target_view))),
                    });
                    let new_idx = arena.borrow_mut().push(parent_idx, slots);
                    next_frontier.push((new_idx, target_id));
                    if hop >= min { out_leaves.push(new_idx); }
                }
            }
            frontier = next_frontier;
        }
    }
    Ok(FactorRowSet { arena, leaves: out_leaves })
}
```

**Key point:** no Row is cloned. The `parent_idx` already inherits all ancestral bindings; only a `FactorNode` with the new binding is pushed.

#### 2.2 `cross_product` (walker.rs:693)

**Before:**

```rust
fn cross_product(left: Vec<Row>, right: Vec<Row>) -> Vec<Row> {
    let mut out = Vec::with_capacity(left.len() * right.len());
    for l in &left {
        for r in &right {
            let mut merged = l.clone(); // ← clone left
            for (k, v) in &r.bindings { merged.set(...); } // ← copy entries
            out.push(merged);
        }
    }
    out
}
```

**After:**

```rust
fn cross_product_factor(left: FactorRowSet, right: FactorRowSet) -> FactorRowSet {
    // Splice right's chains onto left's leaves. The arena must be merged
    // (offset right's indices). For v0 we copy right's nodes into left's
    // arena (O(|right.nodes|), one-time).
    let arena = left.arena;
    let right_offset = arena.borrow().nodes.len() as FactorIdx;
    arena.borrow_mut().splice_from(&right.arena.borrow());
    let mut out_leaves = Vec::with_capacity(left.leaves.len() * right.leaves.len());
    for &l in &left.leaves {
        for &r in &right.leaves {
            // Reparent right's chain from FACTOR_ROOT to l.
            let r_offset = r + right_offset;
            let bridge = arena.borrow_mut().splice_under(l, r_offset);
            out_leaves.push(bridge);
        }
    }
    FactorRowSet { arena, leaves: out_leaves }
}
```

**`splice_under(parent, foreign_idx)`** reroutes the foreign node's chain so that its root points to `parent`. It is O(height(foreign\_idx)) in the worst case, but the height is typically ≤ 5 in LDBC.

**Trade-off:** v0 uses `splice_from` (it copies the right side's nodes into the left arena). Alternative: two separate arenas + a `MergedArenaView` that presents them as one. That is more efficient for large outputs but complicates the `materialize` API. Deferred.

#### 2.3 `HashJoin` (walker.rs::execute\_hash\_join)

**Build side** (the "build" branch of a HashJoin): it is materialized to `Vec<Row>` today because it needs to be indexable by the keys. We keep that. The build side is already flattened, so that part does not change.

**Probe side:** stays as a `FactorRowSet`. For each `probe.leaf`:

1. Look up `probe.lookup_binding(leaf, probe_key) → val`.
2. Hash table lookup → `Vec<&BuildRow>` (the build-side rows that match).
3. For each `BuildRow`, push a `FactorNode` with the build bindings as slots and parent=`probe.leaf`, producing a new leaf in the arena.

The output is a `FactorRowSet` whose leaves are the probe×build products. **No reorder semantics**: HashSemiJoin still does not swap (RFC-016).

#### 2.4 Sinks (materialization)

`TopN`, `Aggregate`, `Distinct`, final `Project`, `Union`, `PatternList`, `Unwind` consume `FactorRowSet` and emit `Vec<Row>`:

```rust
fn materialize_for_topn(set: FactorRowSet, n: usize, order_key: &str) -> Result<Vec<Row>> {
    // 1. Top-N by order_key value WITHOUT materializing: we only need
    //    arena.lookup_binding(leaf, order_key) for the heap key.
    let mut heap = BinaryHeap::with_capacity(n + 1);
    for leaf in &set.leaves {
        let key = set.arena.borrow().lookup_binding(*leaf, order_key)?;
        heap.push((Reverse(key), *leaf));
        if heap.len() > n { heap.pop(); }
    }
    // 2. Materialize only the N survivors.
    Ok(heap.into_iter()
        .map(|(_, leaf)| set.arena.borrow().materialize(leaf, None))
        .collect())
}
```

For the final `Project` (RETURN columns): materialize with projection `&[col_names]` to avoid copying bindings that are not returned. This combines with RFC-015 (projection pushdown already emits the columns the RETURN needs).

### 3. Wiring in the optimizer and executor

#### 3.1 No changes to LogicalPlan

`LogicalPlan` stays the same (RFC-008). Factorization is an executor detail; the plan is still `Expand`, `HashJoin`, etc.

#### 3.2 `execute()` decides at the top

`execute(plan, snapshot, params)` chooses between two paths:

```rust
pub async fn execute(
    plan: &LogicalPlan,
    snapshot: &Snapshot,
    params: &Params,
) -> Result<Vec<Row>, ExecError> {
    if factorize_enabled() {
        let set = execute_factor(plan, snapshot, params).await?;
        Ok(materialize_top(set)) // root materialization
    } else {
        execute_flat(plan, snapshot, params).await
    }
}
```

`factorize_enabled()` reads `NAMIDB_FACTORIZE` (default `1` once stabilized, `0` during development). `execute_factor` and `execute_flat` are parallel functions. `execute_flat` is the current path (renamed). `execute_factor` is the new path.

**No partial sharing:** we try to keep them as two independent paths to avoid regressions. Once the factorized path stabilizes, deprecate `execute_flat` with `#[deprecated]` and remove it in a later iteration (not v0).

#### 3.3 Write operators

`CREATE`, `MERGE`, `SET`, `REMOVE`, `DELETE` consume the output of read clauses. v0: they materialize the F-rep at the input of each write, because writes are already row-oriented and the chain does not benefit.

### 4. Tests

#### 4.1 Unit tests

`exec/factor.rs::tests`:

* `arena_root_is_empty`: `materialize(FACTOR_ROOT)` returns an empty Row.
* `single_push_then_materialize`: push 1 slot, materialize == single binding.
* `chain_inherits_parent`: push A then B, materialize(B) has both A and B.
* `materialize_with_projection`: projection filter hides slots.
* `child_shadows_parent`: same name, child value wins.
* `splice_under_reparent`: splice respects topology.

#### 4.2 Operator parity tests

Every operator that touches factorization has a test that runs the SAME plan with `NAMIDB_FACTORIZE=0` and `=1` and compares the outputs by `HashSet` equality (order is not guaranteed on either path):

```rust
#[tokio::test]
async fn expand_factor_matches_flat() {
    let (flat, fact) = run_both_paths(plan, snapshot, params).await;
    assert_eq!(row_set(&flat), row_set(&fact), "Expand parity failed");
}
```

`row_set(rows) -> BTreeSet` is used to ignore order and keep multiplicity.

#### 4.3 Integration tests

`crates/namidb-query/tests/exec_ldbc_snb.rs` runs twice (build matrix with the feature flag); all existing tests must pass on both paths.

#### 4.4 Bench

Re-run the gate harness (`bench/README.md`). Compare ratios before and after factorization. v0 success threshold:

* IC09: < 50× Kùzu (was 382×), an \~8× absolute improvement.
* IC02: < 10× Kùzu (was 60×).
* IC07/IC08: < 5× Kùzu (were 6-7×).

If IC09 < 2× (smoke gate), advance to real LDBC SF1. Otherwise, evaluate morsel-driven execution and/or WCOJ as next steps.

### 5. Implementation plan

| Phase      | Deliverable                                |
| ---------- | ------------------------------------------ |
| Design     | This document                              |
| Base types | `factor.rs` + 6 unit tests                 |
| Expand     | `execute_expand_factor` + parity test      |
| Joins      | `cross_product_factor`, `hash_join_factor` |
| Sinks      | Sinks + workspace integration tests green  |
| Validation | Re-bench gate                              |

The broad scope justifies an explicit RFC before touching walker.rs.

## Alternatives considered

### A1. Trie-based factorization (Olteanu 2015)

F-trie nodes with shape `{level: usize, children: HashMap}`. Closer to the paper, with superior expressiveness for WCOJ. **Rejected for v0** because (a) it requires hash-keyed children, which costs HashMap allocs per level; (b) NamiDB's traversal pattern (walker.rs) is naturally pointer-up (each step inherits its parent), not key-down.
Olteanu's trade-off (asymptotically minimal memory) does not pay off on datasets < 1B nodes where RAM is not the bound.

### A2. Columnar vector batches (Arrow-native, à la DuckDB)

Pass `RecordBatch` between operators, not `Vec<Row>`. This combines factorization + vectorization in a single step. **Rejected for v0** because (a) it requires rewriting the WHOLE executor to work on Arrow batches instead of a RuntimeValue per binding; (b) the morsel-driven route already goes that way. A pre-condition for Arrow batches is to settle the factorization shape first: if the intermediate outputs are flat, Cartesian-blown-up tuples, batches do not help. We do factorization first, vectorization later.

### A3. Just batch `Vec<Row>` reuse + clone-on-write

Replace `Row { bindings: BTreeMap }` with `Row { bindings: Arc<BTreeMap> }` and mutate via `Arc::make_mut`. This reduces clone cost but does not eliminate Cartesian materialization in operators. **Rejected for v0** because it attacks the symptom (clone cost), not the cause (N×M rows allocated). For IC09 the problem is the 1500 rows, not the cost of cloning each one. Arc-on-Row would help \~3-5× but not the \~382× required.

### A4. No feature flag, direct full migration

Replace `Vec<Row>` with `FactorRowSet` in all operators at once. **Rejected for v0** because (a) it makes the SemanticParity test suite impossible (there is no reference path); (b) the bug-fix workflow becomes riskier; (c) reverting is difficult if the IC\* row counts differ after materialize() in some edge case.

## Drawbacks

1. **Executor complexity.** Two parallel paths (`execute_flat` / `execute_factor`) during the feature-flag period. Mitigated with strict parity tests.
2. **Materializing to flat can regress performance on queries that are ALREADY flat-friendly.** E.g. `MATCH (a) RETURN a` with 1M nodes: F-rep allocates 1M `FactorNode`s and then flattens them to 1M `Row`s, which is worse than the current path. Mitigation: the sink recognizes "single-binding sets" and short-circuits. If the set has a single Slot per leaf, no intermediate F-rep is allocated.
3.
**RefCell in FactorArena.** Sharing between operators implies interior mutability. Tokio + RefCell requires `!Send` discipline or a `Mutex`. v0 chooses `RefCell` (single-threaded executor); when morsel parallelism lands, swap to `Arc<Mutex<FactorArena>>` or partitioned arenas.
4. **`lookup_binding(leaf, name)`** is O(depth). For depth ≤ 5 (the typical LDBC pattern) that is \~100 ns per lookup. If a query needs the same binding many times, it is better to cache it in the operator or to expand the slot at the arena root.
5. **Different memory profile.** A cumulative spike in the arena up to the sink, vs a continuous spike in flat. For long queries (no LIMIT) the arena can grow large. Follow-up mitigation: stream sinks that drain the arena progressively.

## Open questions

* **`Arc<RefCell<FactorArena>>` or pass-by-value?** v0 proposes `Arc<RefCell<FactorArena>>` so that `cross_product` can share the arena. Alternative: each operator builds its own arena and `splice_from`s the input ones. More allocations, less contention. To be decided during implementation once the real pattern is visible.
* **How to handle `OPTIONAL MATCH`?** When an optional Expand finds no neighbours, today it emits the row without the binding (NULL semantics). In F-rep: push a FactorNode with a Slot `{name, RuntimeValue::Null}`? Or let the sink detect "missing binding" → null? To be decided during implementation.
* **`Distinct` post-F-rep.** Hashing `FactorIdx` does not work: Distinct compares by value, not by arena identity. Materialize-then-distinct, or introduce a hash over the row's materialization? Probably materialize-first for v0.
* **Threshold for the feature flag default.** Turn it on when all parity tests pass, or wait for bench results? Proposal: turn it on with a flag override available and flip the default after bench validation.

## References

* Olteanu, Závodný (2015). **Size Bounds for Factorised Representations of Query Results.** ACM TODS 40(1).
* Jin, Mhedhbi, Lu, Sequoda (2023). **Kùzu Graph Database Management System.** CIDR 2023.
§4.2 describes pointer-based factorization.
* Bakibayev, Olteanu (2012). **FDB: A Query Engine for Factorised Relational Databases.** PVLDB 5(11).
* Aberger et al. (2017). **EmptyHeaded: A Relational Engine for Graph Processing.** SIGMOD. §3.1 motivates factorization in a graph context.
* Leis et al. (2014). **Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age.** SIGMOD. Composing factorization with morsel-driven execution is a future follow-up.
* `crates/namidb-query/src/exec/walker.rs:484`: current `execute_expand` blow-up point.
* `crates/namidb-query/src/exec/walker.rs:693`: current `cross_product` blow-up point.
* `crates/namidb-query/src/exec/row.rs:11`: current `Row` type.
* RFC-008 (LogicalPlan IR), RFC-012 (HashJoin), RFC-015 (projection pushdown), RFC-016 (join reorder): the operators and plan shape that this RFC rewrites to F-rep.

# RFC 018: CSR-style adjacency materialised in-snapshot

> **Status:** draft **Author(s):** Matías Fonseca queries — IC09 et al.) **Builds on:** RFC-002 (SST format §3 edges binary CSR), RFC-003 (ranged reads + SstCache), RFC-017 (factorization composes orthogonally with this) **Supersedes:** —

> *Mirrored from [`docs/rfc/018-csr-adjacency.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/018-csr-adjacency.md) in the engine repo. Source of truth lives there.*

## Summary

The current path of `Snapshot::out_edges` / `in_edges` (`read.rs:358`, `read.rs:650` → `edge_lookup` at `read.rs:655`) performs, **on every call**, the following steps against each candidate SST in the manifest:

1. Bloom side-car probe (`bloom_admits`): load the body from `SstCache` or `S3`, parse magic + size + xxhash, `BloomFilter::contains(key)`.
2.
SST body GET (`get_sst_body`): cached in `SstCache::inner`, but still an `Arc::clone`.
3. `EdgeSstReader::open(body)`: parses header + footer + fence index, and builds `cumulative_edges: Vec` with a full scan over `partners` (\~O(K) work per open).
4. `EdgeSstReader::lookup(key)`: fence bracket → `position_of` (binary search in `key_ids`) → offset read → partner block decode → per-edge LSNs + tombstone bitmap slice.
5. For edges sourced from the SST, `read_overflow_strings` + `load_declared_streams` decode all of the SST's property streams even if the query ignores them.

Each step is O(deg + K) or O(K), and most of the work **is per-SST, not per-key**. In IC09 at scale=0.1 (a total fanout of \~110 hops via `KNOWS·KNOWS + HAS_CREATOR`), that is \~110 invocations of `edge_lookup` × \~3-5 SSTs per edge\_type → \~400-550 full cycles of the pipeline (1)-(5) per query.

This RFC introduces **`EdgeAdjacency`**: a slim in-RAM CSR materialized **once per `(manifest_version, edge_type, direction)`** by an Arc-shared, cross-snapshot `AdjacencyCache`. After the rewrite, each `edge_lookup` is:

* Cache probe (DashMap-like) → `Arc<EdgeAdjacency>`.
* `binary_search` on `keys: Vec` → idx (O(log K)).
* Slice `partners[offsets[idx]..offsets[idx+1]]` + `lsns[...]` + `tombstones[...]` (O(deg)).
* Memtable overlay for recent writes (O(memtable\_size\_for\_type)).

For IC09: the build cost is paid once (on the first query of the bench warm-up), and the remaining 49 hit the cache. Each `edge_lookup` drops from "\~10-30 µs async pipeline" to "a few µs sync slice + bool merge".

This is **the architectural fix** that the earlier optimizations (NodeView cache, edge cache) aimed at without resolving. The NodeView cache harvested the intra-query reuse on the node side; the edge cache failed because edges are not reused intra-query. But they **are reused cross-query** and, more importantly, **the cost is not decode redundancy, it is the per-call SST scan**. The CSR eliminates both in one stroke.
This is exactly what Kùzu does internally (Jin et al., CIDR 2023 §3.1 “rel tables, CSR-indexed by src and dst”).

### v0 scope

* **Slim CSR**: `keys: Vec<NodeId>`, `offsets: Vec<u32>`, `partners: Vec<NodeId>`, `lsns: Vec<u64>`, `tombstones: Vec<bool>`. It does NOT load edge properties (decided with the user; see Design §4).
* **`Arc`-shared cache**: `AdjacencyCache`, cross-snapshot, keyed by `(manifest_version, edge_type, direction)`. LRU with a configurable memory budget (default 512 MiB).
* **Build** once on miss: heap-merge over all SSTs of the `(kind, scope)` group via the existing `LoadedManifestIndex`.
* **Memtable overlay** per call: an O(memtable\_entries\_for\_type) sweep plus a per-partner last-LSN-wins merge against the CSR slice.
* **Reroute in `Snapshot::edge_lookup`** behind the feature flag `NAMIDB_ADJACENCY=0|1` (default `0` initially; flip to `1` after bench validation).
* **Properties fallback**: if the call site has declared properties and needs them, return an `EdgeView` with empty properties and let the caller do the secondary lookup via the SST path. This is an **explicit caveat**: with the flag ON, `Snapshot::out_edges` returns `EdgeView.properties = BTreeMap::new()` for SST-sourced edges. Memtable edges retain their properties (they come from the decoded payload). Documented below in §4.
* **Parity tests** compare topology (src, dst, lsn, tombstone), NOT properties. Storage unit tests that verify properties explicitly run with the flag OFF.

### Out of scope for v0 (remain as follow-ups)

* **Property-aware routing**. A future iteration detects at plan time whether the query accesses `r.something` and chooses topology-only vs full-edge lookup per call site. Once it lands, the v0 caveat disappears.
* **Disk-tier `AdjacencyCache`**. v0 is memory-only. When the dataset exceeds the budget, entries are evicted via LRU. Spill-to-disk (foyer hybrid) arrives when memory becomes a real constraint.
* **CSR for vector / hybrid indexes**. RFC-007 keeps its own shape.
* **Incremental refresh post-flush**. The CSR is invalidated and rebuilt when `manifest_version` changes. v0 pays the full rebuild on the first query after a flush. An incremental merge layered on top is deferred unless the bench warrants it.

## Motivation

**Current bench (scale=0.1):**

| Query                           | NamiDB p50 | Kùzu p50 | Ratio    | Dominant bottleneck                  |
| ------------------------------- | ---------- | -------- | -------- | ------------------------------------ |
| IC02 (KNOWS·HAS\_CREATOR)       | 67 ms      | 1.04 ms  | **64×**  | mixed: \~30% storage, \~30% expr     |
| IC07 (HAS\_CREATOR·LIKES)       | 7 ms       | 0.97 ms  | 7×       | mostly query planner overhead        |
| IC08 (HAS\_CREATOR·REPLY\_OF)   | 7 ms       | 1.10 ms  | 6×       | similar to IC07                      |
| IC09 (KNOWS·KNOWS·HAS\_CREATOR) | **578 ms** | 1.64 ms  | **353×** | **storage I/O of the Expand chain**  |

**IC\* queries never access properties through a Rel binding**: `r` is anonymous in IC09 (`(p)-[:KNOWS]->(f)-[:KNOWS]->(fof)<-[:HAS_CREATOR]-(msg)`). Each hop only needs `(src, dst)` to emit the next row of the Expand. Yet the current path decodes everything: bloom probe + body get + reader open with the `cumulative_edges` scan + `position_of` bsearch + partner block decode + per-edge LSN read + per-edge tombstone read + overflow JSON parse + declared streams IPC decode. Every call pays the full cost.
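To make the "each hop only needs `(src, dst)`" point concrete, here is a minimal, self-contained sketch of a two-hop expand over a toy CSR. It is not the engine's code: `ToyCsr`, `from_edges`, `neighbours`, and `two_hop` are hypothetical names, and `u64` ids stand in for the 16-byte `NodeId`.

```rust
/// Toy CSR adjacency. Hypothetical illustration only: `u64` ids stand in
/// for NamiDB's `NodeId`, and there are no LSNs, tombstones, or properties.
struct ToyCsr {
    keys: Vec<u64>,     // sorted, distinct source ids
    offsets: Vec<u32>,  // len = keys.len() + 1
    partners: Vec<u64>, // neighbours, grouped by key
}

impl ToyCsr {
    /// Build from an edge list by sorting and deduplicating (src, dst) pairs.
    fn from_edges(mut edges: Vec<(u64, u64)>) -> Self {
        edges.sort_unstable();
        edges.dedup();
        let mut keys = Vec::new();
        let mut offsets = Vec::new();
        let mut partners = Vec::with_capacity(edges.len());
        for (src, dst) in edges {
            if keys.last() != Some(&src) {
                // A new key starts here: record where its partner run begins.
                offsets.push(partners.len() as u32);
                keys.push(src);
            }
            partners.push(dst);
        }
        offsets.push(partners.len() as u32); // trailing sentinel
        ToyCsr { keys, offsets, partners }
    }

    /// One hop: binary search on `keys`, then a slice of `partners`.
    fn neighbours(&self, key: u64) -> &[u64] {
        match self.keys.binary_search(&key) {
            Ok(i) => {
                let lo = self.offsets[i] as usize;
                let hi = self.offsets[i + 1] as usize;
                &self.partners[lo..hi]
            }
            Err(_) => &[],
        }
    }
}

/// Two-hop expand in the shape of (p)-[:KNOWS]->(f)-[:KNOWS]->(fof).
/// Only topology is touched; no per-edge payload is decoded.
fn two_hop(csr: &ToyCsr, start: u64) -> Vec<(u64, u64)> {
    let mut rows = Vec::new();
    for &f in csr.neighbours(start) {
        for &fof in csr.neighbours(f) {
            rows.push((f, fof));
        }
    }
    rows
}
```

Each hop costs one binary search plus a slice, which is the shape of work the rewrite is aiming for on the SST-derived side of `edge_lookup`.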
**Estimated profile (no flamegraph yet; inferred from reading the code):**

| Stage per `edge_lookup`                         | Approximate µs (warm cache)           |
| ----------------------------------------------- | ------------------------------------- |
| `bloom_admits` (cached side-car)                | \~5 µs (xxhash + bit probe)           |
| `get_sst_body` (cached)                         | \~1 µs (Arc clone)                    |
| `EdgeSstReader::open` (build cumulative\_edges) | \~50-150 µs (depends on K)            |
| `position_of` (fence + bsearch)                 | \~2-5 µs                              |
| `lookup` (partner decode + LSN/tomb reads)      | \~5-15 µs                             |
| `read_overflow_strings` (full SST decode)       | \~100-500 µs (depends on edge\_count) |
| `load_declared_streams` (per-property IPC)      | \~50-200 µs                           |
| **Total per call**                              | **\~200-900 µs**                      |

Multiplied by \~110 hops × 3-5 candidate SSTs per hop = \~330-550 invocations of the pipeline. Lower bound: 330 × 200 µs = **66 ms**. Upper bound: 550 × 900 µs = **495 ms**. The actual measurement (578 ms for IC09) lands slightly above the upper bound of this estimate. **This supports the hypothesis that storage I/O is the bottleneck.**

**Kùzu comparison (someone else's rationale, but valid):** Kùzu keeps “rel tables” CSR-indexed by src and dst in RAM (post-load). Each hop is a binary search + slice, \~1-2 µs. For the same \~110 hops: **\~110-220 µs total**, i.e. sub-millisecond. That is compatible with Kùzu's 1.64 ms p50 on IC09 (the remainder goes to expression evaluation and materialising output rows).

**If NamiDB drops to \~1-5 µs per edge\_lookup after the rewrite:**

| Reduction                          | Estimated IC09 p50 | Ratio vs Kùzu |
| ---------------------------------- | ------------------ | ------------- |
| Conservative (5 µs × 110 = 550 µs) | \~30-50 ms         | 20-30×        |
| Optimistic (1 µs × 110 = 110 µs)   | \~10-20 ms         | 6-12×         |
| Stretch (incl. node cache assist)  | \~5-10 ms          | 3-6×          |

**This does not reach the gate (2× = 3.3 ms), but it approaches “demo-friendly” territory.** The pieces that cover the remaining gap are plan-aware property deferral and node materialization batching (future iterations). This RFC opens the way.

## Design

### 1.
Data types

New module `crates/namidb-storage/src/adjacency.rs`:

```rust
use std::sync::Arc;

use lru::LruCache; // existing crate; reuse it or hand-roll an LRU
use parking_lot::Mutex; // if unavailable, use std::sync::Mutex

use crate::manifest::SstKind;
use crate::sst::edges::EdgeDirection;
use namidb_core::NodeId;

/// Slim in-RAM CSR adjacency for one (edge_type, direction) at a given
/// manifest_version.
///
/// Memory layout (10M edges + 1M distinct keys):
/// - `keys`: 16 B × 1M = 16 MB
/// - `offsets`: 4 B × 1M = 4 MB
/// - `partners`: 16 B × 10M = 160 MB
/// - `lsns`: 8 B × 10M = 80 MB
/// - `tombstones`: 1 B × 10M = 10 MB
/// - Total: ~270 MB for 10M edges, ~27 MB for 1M edges, ~270 KB for 10K.
///
/// For scale=0.1 LDBC (50K edges per type, ~10K distinct srcs):
/// - 50K × 24 B + 10K × 20 B = ~1.4 MB per (edge_type, direction).
/// - For 3 edge_types × 2 directions = ~8 MB total. Fits in any cache.
#[derive(Debug)]
pub struct EdgeAdjacency {
    pub edge_type: String,
    pub direction: EdgeDirection,
    pub manifest_version: u64,
    /// Sorted by NodeId. binary_search returns idx for offsets/partners.
    pub(crate) keys: Vec<NodeId>,
    /// Len = keys.len() + 1. partners[offsets[i]..offsets[i+1]] for keys[i].
    pub(crate) offsets: Vec<u32>,
    pub(crate) partners: Vec<NodeId>,
    pub(crate) lsns: Vec<u64>,
    pub(crate) tombstones: Vec<bool>,
}

impl EdgeAdjacency {
    /// Slim per-key projection. None when `key` is not present in the SSTs
    /// (caller will still consult the memtable overlay).
    pub fn lookup(&self, key: NodeId) -> Option<EdgeSlice<'_>> {
        let idx = self.keys.binary_search(&key).ok()?;
        let lo = self.offsets[idx] as usize;
        let hi = self.offsets[idx + 1] as usize;
        Some(EdgeSlice {
            partners: &self.partners[lo..hi],
            lsns: &self.lsns[lo..hi],
            tombstones: &self.tombstones[lo..hi],
        })
    }

    /// Approximate memory footprint in bytes, used for LRU weighting.
    pub fn approx_bytes(&self) -> usize {
        self.keys.len() * 16
            + self.offsets.len() * 4
            + self.partners.len() * 16
            + self.lsns.len() * 8
            + self.tombstones.len()
            + self.edge_type.len()
            + 64 // overhead allowance
    }
}

#[derive(Debug, Clone, Copy)]
pub struct EdgeSlice<'a> {
    pub partners: &'a [NodeId],
    pub lsns: &'a [u64],
    pub tombstones: &'a [bool],
}

/// Cache key. Hash by all three components.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct AdjacencyKey {
    pub manifest_version: u64,
    pub edge_type: String,
    pub direction: EdgeDirection,
}

/// Process-wide LRU cache of materialised adjacencies. Arc-shared between
/// `WriterSession` and every `Snapshot` it emits.
pub struct AdjacencyCache {
    inner: Mutex<LruCache<AdjacencyKey, Arc<EdgeAdjacency>>>,
    /// Byte budget: the sum of `approx_bytes()` over all entries. When
    /// exceeded, the least-recently-used entry is evicted.
    capacity_bytes: usize,
    /// Bytes in use. Tracked on insert / evict.
    used_bytes: Mutex<usize>,
    // optional counters (hits / misses / builds) → debug/observability.
}
```

### 2. Build process

```rust
use std::collections::btree_map::{BTreeMap, Entry};

use uuid::Uuid;

async fn build_adjacency(
    snapshot_manifest: &LoadedManifest,
    store: &dyn ObjectStore,
    paths: &NamespacePaths,
    cache: &SstCache,
    edge_type: &str,
    direction: EdgeDirection,
) -> Result<EdgeAdjacency> {
    let want_kind = match direction {
        EdgeDirection::Forward => SstKind::EdgesFwd,
        EdgeDirection::Inverse => SstKind::EdgesInv,
    };

    // 1. Enumerate SSTs from the manifest index.
    let sst_idxs: Vec<usize> = snapshot_manifest
        .index
        .scope_descriptors(want_kind, edge_type)
        .iter()
        .copied()
        .collect();
    if sst_idxs.is_empty() {
        return Ok(EdgeAdjacency {
            edge_type: edge_type.to_string(),
            direction,
            manifest_version: snapshot_manifest.manifest.version,
            keys: Vec::new(),
            offsets: vec![0],
            partners: Vec::new(),
            lsns: Vec::new(),
            tombstones: Vec::new(),
        });
    }

    // 2. Per-SST: fetch body (cached) + open reader + scan_all_edges.
    // Not parallelised in v0 (SSTs are typically small, the body is cached,
    // and the build is one-time per manifest version).
    let mut per_partner: BTreeMap<(NodeId, NodeId), (u64, bool)> = BTreeMap::new();
    for idx in sst_idxs {
        let desc = &snapshot_manifest.manifest.ssts[idx];
        let absolute = format!("{}/{}", paths.namespace_prefix().as_ref(), desc.path);
        let body = fetch_with_cache(store, cache, &absolute).await?;
        let reader = EdgeSstReader::open(body)?;
        for row in reader.scan_all_edges()? {
            let key_id = NodeId::from_uuid(Uuid::from_bytes(row.key_id));
            let partner_id = NodeId::from_uuid(Uuid::from_bytes(row.partner_id));
            // last-LSN-wins across SSTs (compaction usually leaves at most one
            // SST per (key, partner) but we cannot assume).
            match per_partner.entry((key_id, partner_id)) {
                Entry::Vacant(v) => {
                    v.insert((row.lsn, row.tombstone));
                }
                Entry::Occupied(mut o) => {
                    if row.lsn > o.get().0 {
                        o.insert((row.lsn, row.tombstone));
                    }
                }
            }
        }
    }

    // 3. Group by key (BTreeMap iter already sorts by (key, partner)) and
    // materialise into the parallel arrays.
    let mut keys: Vec<NodeId> = Vec::new();
    let mut offsets: Vec<u32> = vec![0];
    let mut partners: Vec<NodeId> = Vec::new();
    let mut lsns: Vec<u64> = Vec::new();
    let mut tombstones: Vec<bool> = Vec::new();
    let mut cur_key: Option<NodeId> = None;
    for ((k, p), (lsn, tomb)) in per_partner {
        match cur_key {
            Some(prev) if prev == k => { /* same key, just append */ }
            _ => {
                // Close the previous key's run before starting a new one.
                if cur_key.is_some() {
                    offsets.push(partners.len() as u32);
                }
                keys.push(k);
                cur_key = Some(k);
            }
        }
        partners.push(p);
        lsns.push(lsn);
        tombstones.push(tomb);
    }
    offsets.push(partners.len() as u32); // sentinel
    debug_assert_eq!(offsets.len(), keys.len() + 1);

    Ok(EdgeAdjacency {
        edge_type: edge_type.to_string(),
        direction,
        manifest_version: snapshot_manifest.manifest.version,
        keys,
        offsets,
        partners,
        lsns,
        tombstones,
    })
}
```

Build complexity: O(total\_edges · log(total\_edges)) because of the BTreeMap. For scale=0.1 with \~50K edges: \~50K × 20 = \~1 M comparisons, <50 ms estimated. For 10M edges: \~10M × 23 = \~230 M comparisons, \~2-5 s.
**Aceptable como cold-start cost porque es one-time per manifest version.** Alternative considerada (rejected v0): heap-merge cursors stream-style, evita BTreeMap. Implementación más compleja, similar perf en este rango. Voy con BTreeMap por claridad. Si bench muestra problema en namespaces grandes, switch. ### 3. Cache integration ```rust impl AdjacencyCache { pub fn new(capacity_bytes: usize) -> Self { /* ... */ } /// Resolve (or build) the EdgeAdjacency for the given key. Builds happen /// at most once per (manifest_version, edge_type, direction) — concurrent /// callers race on the cache slot; whoever wins inserts. pub async fn get_or_build( &self, key: AdjacencyKey, build: F, ) -> Result> where F: FnOnce() -> Fut, Fut: std::future::Future>, { // 1. Probe under lock. { let mut lru = self.inner.lock(); if let Some(arc) = lru.get(&key) { return Ok(arc.clone()); } } // 2. Miss → build outside lock to avoid serialising builds. let built = build().await?; let weight = built.approx_bytes(); let arc = Arc::new(built); // 3. Insert + evict to budget. { let mut lru = self.inner.lock(); // Recheck — another caller may have inserted concurrently. if let Some(existing) = lru.get(&key) { return Ok(existing.clone()); } lru.put(key.clone(), arc.clone()); *self.used_bytes.lock() += weight; self.evict_to_capacity(&mut lru); } Ok(arc) } fn evict_to_capacity(&self, lru: &mut LruCache>) { let mut used = self.used_bytes.lock(); while *used > self.capacity_bytes && lru.len() > 1 { if let Some((_, evicted)) = lru.pop_lru() { *used = used.saturating_sub(evicted.approx_bytes()); } else { break; } } } } ``` ### 4. `EdgeView.properties` contract con flag ON Decisión explícita (ver Summary §“Properties fallback”): ```rust // Snapshot::edge_lookup con flag ON: async fn edge_lookup_via_csr(...) -> Result { let adj = self .adjacency_cache .as_ref() .ok_or_else(|| Error::invariant("CSR enabled but cache absent"))? 
.get_or_build( AdjacencyKey { manifest_version, edge_type, direction }, || build_adjacency(&self.manifest, ...), ) .await?; // SST-sourced edges: NO properties. let mut sst_edges: BTreeMap)> = BTreeMap::new(); if let Some(slice) = adj.lookup(key) { for i in 0..slice.partners.len() { let partner = slice.partners[i]; let lsn = slice.lsns[i]; let view = if slice.tombstones[i] { None } else { let (src_id, dst_id) = match direction { EdgeDirection::Forward => (key, partner), EdgeDirection::Inverse => (partner, key), }; Some(EdgeView { edge_type: edge_type.to_string(), src: src_id, dst: dst_id, properties: BTreeMap::new(), // ← caveat documented lsn, }) }; sst_edges.insert(partner, (lsn, view)); } } // Memtable overlay: retain full properties (decoded from MemOp::Upsert payload). for (mk, entry) in self.memtable.iter() { // ... como en el path actual; properties full porque vienen del payload. } Ok(EdgeListView { edges: /* sort by partner, drop tombstones */ }) } ``` **Caveat con caller:** una query que accede `r.weight` (donde `r` es un Rel binding) verá `BTreeMap::new()` cuando la edge viene de un SST y el flag está ON. **Mitigación v0**: storage unit tests que verifican properties via `out_edges` quedan con flag OFF. LDBC IC\* no acceden edge properties → flag ON OK para el bench gate. **Mitigación v0.5 si el caveat duele en tests:** chequear schema antes del reroute. Si `manifest.schema.edge_type(edge_type).properties.is_empty()` → CSR; si tiene properties declaradas → fallback al SST path. Eso preserva todos los tests existentes, sub-óptimo pero seguro. **Mitigación v1 — IMPLEMENTADA:** plan-aware routing en `namidb_query::exec::walker`. Una pasada al root del `LogicalPlan` recolecta las variables referenciadas por toda expresión del plan (Filter/Project/TopN/ Aggregate/Join/Unwind/etc., reusando `collect_referenced_variables`). 
Para cada `Expand`, si su `rel_alias` aparece en ese set, el executor llama `Snapshot::out_edges_via_sst` / `in_edges_via_sst` (forzando full-property SST path) en vez del dispatch default `out_edges`. Cuando el alias está ausente o es bound pero nunca leído, se mantiene la ruta CSR. El default de `adjacency_enabled()` quedó en ON (set `NAMIDB_ADJACENCY=0` para desactivar). El caveat queda invisible para query callers — storage callers que necesitan properties full deben llamar a `edge_lookup_via_sst` directamente. 7 tests integration en `crates/namidb-query/tests/exec_plan_aware_routing.rs` cubren: alias ausente (CSR), alias unused (CSR), `RETURN r.prop` (SST), `RETURN r` whole (SST), `WHERE r.prop` filter (SST), `ORDER BY r.prop` (SST), y dos Expands con routing mixto en una misma query. ### 5. Memtable overlay El memtable contiene puts/deletes recientes que NO están en SSTs aún (no flushed). La CSR solo representa SSTs. Para correctness: ```rust // Sweep del memtable filtra por edge_type. Pequeño O(memtable_entries) — el // memtable está bounded ~64 MiB de payload, típicamente <100K entries en run. for (mk, entry) in self.memtable.iter() { let MemKey::Edge { edge_type: et, src, dst } = mk else { continue }; if et != edge_type { continue; } let (my_key, partner) = match direction { EdgeDirection::Forward => (*src.as_bytes(), *dst.as_bytes()), EdgeDirection::Inverse => (*dst.as_bytes(), *src.as_bytes()), }; if my_key != key.as_bytes() { continue; } // ... merge into latest with last-LSN-wins } ``` La búsqueda lineal sobre el memtable por edge\_type es estable y existing — no la podemos optimizar sin más cambios. Para v0, ese cost es el mismo que hoy. Si el memtable es grande, el read path ya pagaba ese cost; CSR no empeora. ### 6. Invalidation `manifest_version` es parte del cache key. Cuando el writer commits (post- flush, post-compaction, post-ingest), el manifest\_version incrementa. 
Snapshots viejos siguen viendo la entry vieja (Arc clone) hasta que se droppeen. Snapshots nuevos miss y rebuild. LRU eventualmente evicts entries viejas. Eso es **invalidation por construcción**. No hay race entre el writer y readers — el manifest CAS protocol garantiza linearizability en el path manifest. La CSR refleja una version atomica del manifest. ### 7. Wiring crates/namidb-storage/src/ingest.rs ```rust pub struct WriterSession { // ... existing fields adjacency_cache: Option>, } impl WriterSession { pub async fn open(...) -> Result { // ... existing logic let adjacency_cache = adjacency_enabled().then(|| { Arc::new(AdjacencyCache::new(adjacency_budget_bytes())) }); Ok(Self { /* ..., */ adjacency_cache }) } pub fn snapshot(&self) -> Snapshot<'_> { Snapshot::new_with_caches( self.current.clone(), &self.memtable, self.manifest_store.store().clone(), self.manifest_store.paths().clone(), self.sst_cache.clone(), // o None self.adjacency_cache.clone(), ) } } // crates/namidb-storage/src/read.rs pub struct Snapshot<'mt> { // ... existing fields adjacency_cache: Option>, } impl<'mt> Snapshot<'mt> { pub fn new_with_caches( manifest: LoadedManifest, memtable: &'mt Memtable, store: Arc, paths: NamespacePaths, sst_cache: Option, adjacency_cache: Option>, ) -> Self { /* ... */ } async fn edge_lookup(...) -> Result { if let Some(adj_cache) = &self.adjacency_cache { return self.edge_lookup_via_csr(adj_cache.clone(), ...).await; } self.edge_lookup_via_sst(...).await // path actual renombrado } } ``` `adjacency_enabled()` reads `NAMIDB_ADJACENCY`: * `"0"` / unset → `None` → SST path (status quo). * `"1"` → `Some(Arc)` → CSR path. `adjacency_budget_bytes()` reads `NAMIDB_ADJACENCY_BUDGET_MIB` (default 512 MiB): * Big enough para LDBC SF1 (\~100M edges → \~2.7 GB CSR — would exceed budget, evicts on demand, fine). * Small enough para no comer toda la RAM en machines compartidos. ### 8. 
Tests #### 8.1 Unit (`adjacency.rs::tests`) * `cache_get_or_build_builds_once`: dos concurrent `get_or_build` con misma key → build closure invoked **una vez**. * `cache_evicts_lru_on_capacity`: insert N entries excediendo budget → oldest evicted, `used_bytes` decreciendo. * `edge_adjacency_lookup_returns_slice`: build manual + `lookup(key)` retorna `Some(slice)` con partners esperados. * `edge_adjacency_lookup_absent_key`: returns `None`. * `build_adjacency_merges_two_ssts`: setup memtable + 2 SSTs flush distintos → build returns merged CSR. #### 8.2 Integration (`tests/csr_adjacency.rs`) * `csr_serves_out_edges_topology_correctly`: writer + 2 SSTs + memtable layered overlay → snapshot.out\_edges retorna expected topology con flag ON. * `csr_invalidates_on_new_manifest_version`: snapshot1 → flush → snapshot2 hit cache key diferente. * `csr_tombstones_preserved_for_correctness`: SST con tombstone @ LSN=N + memtable upsert @ LSN=N+1 → edge surfaced. Caso reverso → edge hidden. #### 8.3 Parity (`tests/csr_adjacency_parity.rs`) ```rust #[tokio::test] async fn parity_topology_ic09_shape() { let result_off = with_flag(false, || run_ic09()).await; let result_on = with_flag(true, || run_ic09()).await; // Compare topology only: ignore EdgeView.properties. let topo_off: BTreeSet<(NodeId, NodeId, u64)> = result_off.iter() .map(|e| (e.src, e.dst, e.lsn)) .collect(); let topo_on: BTreeSet<(NodeId, NodeId, u64)> = result_on.iter() .map(|e| (e.src, e.dst, e.lsn)) .collect(); assert_eq!(topo_off, topo_on); } ``` ### 9. 
Bench plan ```bash # Pre-rewrite baseline (already measured): # IC02 67ms | IC07 7ms | IC08 7ms | IC09 578ms # Post-rewrite (NAMIDB_ADJACENCY=1): NAMIDB_ADJACENCY=1 cargo run --release -p namidb-bench -- run \ --queries IC02,IC07,IC08,IC09 --scale 0.1 --warm-runs 50 # Comparativa: # Expected IC09: 30-80 ms (10-20× mejora local; ~20-50× vs Kùzu) # Expected IC02: 25-50 ms (~30% mejora local; ~25-50× vs Kùzu) # IC07/IC08 no esperan mejora notable (Expand chain corto + node-dominant). ``` ## Alternatives considered ### A. Per-call CSR build (rebuilt per query) Sin Arc-shared cache. Cada Snapshot.new() rebuilds. Pros: zero shared mutable state, simplicidad. Cons: no aprovecha el reuse cross-query — REPL, benchmark de N runs, workshop interactivo, todos pagan el build cost repetidamente. **Rechazada:** la motivación de bench es que las queries son repetidas; el cache cross-snapshot es la clave del speedup amortizado. ### B. CSR fat (con properties) Cargar `Vec>` paralelo a partners. Pros: coverage 100%, sin caveat de §4. Cons: memoria 5-10× (KNOWS.creationDate single u64 property → \~30 B extra per edge), Vec cache-unfriendly. Para 10M edges fat = 1-3 GB; budget excedido rápido. **Rechazada por memoria.** Alternativa: lazy-on-demand property fetch contra el SST cuando caller invoca `e.properties` — funciona pero el API se vuelve incomodo (synchronous fn returning future). Pospuesto a plan-aware routing que es más limpio. ### C. Kùzu-style dense u64 IDs Kùzu renumera node IDs a u64 densos contiguos al load, así `offsets[u64 node_id]` es directo (sin keys vec ni binary search). Pros: O(1) lookup puro. Cons: requiere ID translation table NodeId → u64 (16 → 8 bytes; \~half saved on partners too via translation). Pero NamiDB usa UUID v7 — el ID space es público (clientes pasan UUIDs en query parameters). Translation table en RAM agrega complexity: build cost, invalidation, query API change. 
**Rejected for v0**: the log K binary search on a sorted Vec is cache-friendly and sufficient. v1 can explore this if the bench shows it is the new bottleneck.

### D. SstCache extension (in-place)

Reuse `SstCache` by adding a new `metadata: HashMap` keyed by `(manifest_version, scope, direction)`. Pros: one cache, one config. Cons: SstCache is path-keyed; the CSR is semantic-keyed. Mixing the two abstractions creates type-system friction (Bytes vs Arc). **Rejected for separation of concerns.** AdjacencyCache is a sibling, not a sub-cache.

### E. No feature flag (direct swap)

Replace `edge_lookup` wholesale and trust the test suite. Cons: if the implementation has a bug, it is a silent regression across all tests. The flag's parity strategy gives a clear comparative bench plus a trivial rollback. **Rejected: the flag costs <50 LoC; the benefits are diagnostics and a safety net.**

## Drawbacks

1. **Properties caveat with the flag ON** (§4). v0 does not cover queries that access `r.something` from SST-sourced edges. Mitigated by documentation and by keeping the flag default OFF initially. Eliminated completely with plan-aware routing.
2. **Cold-start cost per edge\_type**. The first query that touches each (edge\_type, direction) pays \~50-500 ms of build cost depending on scale. For benchmarking with warm runs the cost is amortised from run 2 onward. For production “snap-to-cold-query” there is a latency spike. Mitigation: an optional eager build in `WriterSession::open` for declared edge\_types, deferred until the bench shows it is necessary.
3. **The memory budget becomes an operational knob**. Too low: thrashing (build → evict → build). Too high: OOM on shared machines. The 512 MiB default covers LDBC SF1 comfortably (\~50 MB needed). For giant namespaces (>100 GB of edges) the operator must tune it.
4. **The build is not parallelised**. Multiple SSTs are processed sequentially. For namespaces with hundreds of SSTs per edge\_type, build cost grows linearly. Compaction keeps this low in practice, but a cold, L0-heavy namespace before compaction would pay more. v0 accepts this; v1 can parallelise if it hurts in practice.
5. **BTreeMap allocation during build**. For 10M edges: 10M × \~80 B per node = 800 MB of temporary memory before materialisation. Paid once per manifest\_version, which is acceptable. If it becomes a problem, switch to heap-merge cursors.
6. **`scan_all_edges` decodes the whole SST**. It already has that complexity today when used for compaction; the read path now does the same work once per (manifest\_version, scope, direction). Net change: the work is amortised over many lookups instead of being done per call.

## Open questions

* **Q1: Eager build in `WriterSession::open` for the edge\_types declared in the schema?** Reduces cold-start latency. Cost: linear in total edges. A win for interactive production use, a loss for batch workloads. **Decide after the bench.**
* **Q2: When is the `NAMIDB_ADJACENCY` feature flag removed?** Proposal: default OFF initially (validation phase), default ON once property-aware routing lands, flag removed in a later iteration.
* **Q3: Default memory budget.** We start with 512 MiB. If LDBC SF10 (10× scale) needs more, we adjust. Open until SF1/SF10 are measured.

## References

* **Jin et al., CIDR 2023**, “Kùzu: An Embeddable Graph DBMS” §3.1 (“Rel Tables, CSR-indexed by src and dst”, “Adjacency information cached in-memory per direction”).
* **RFC-002** §3 (“CSR binario on-disk”): the on-disk format is already CSR; v0 materialises the same structure in RAM without the parse overhead.
* **RFC-003** §“Ranged reads + page index”: RFC-003 reduced the per-call cost of the body GET; the bottleneck that remains after RFC-003 is the per-call decode + scan. This RFC closes that gap.
* **RFC-017** §“Out-of-scope WCOJ”: factorization and CSR are orthogonal. The CSR serves per edge\_type, while the factor nodes stack multiple bindings without copying BTreeMaps. They compose naturally.
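The memtable overlay in §5 leaves the last-LSN-wins merge step elided. The sketch below shows one way that merge could look, under stated assumptions: `Versioned` and `overlay` are hypothetical names, `u64` partner ids stand in for `NodeId`, and the memtable is modelled as a plain slice already filtered to the looked-up key. It is an illustration of the rule, not the engine's implementation.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-(key, partner) record: the LSN of the latest write and
/// whether that write was a delete.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Versioned {
    lsn: u64,
    tombstone: bool,
}

/// Merge the SST-derived CSR slice for one key with memtable records for
/// the same key. The highest LSN wins per partner; partners whose winning
/// record is a tombstone are dropped. Returns (partner, lsn) sorted by
/// partner.
fn overlay(
    partners: &[u64],
    lsns: &[u64],
    tombstones: &[bool],
    memtable: &[(u64, Versioned)],
) -> Vec<(u64, u64)> {
    let mut latest: BTreeMap<u64, Versioned> = BTreeMap::new();

    // Seed with the CSR slice (assumed to hold one entry per partner).
    for i in 0..partners.len() {
        latest.insert(
            partners[i],
            Versioned { lsn: lsns[i], tombstone: tombstones[i] },
        );
    }

    // Apply memtable records only when they are strictly newer.
    for &(partner, v) in memtable {
        let newer = latest.get(&partner).map_or(true, |cur| v.lsn > cur.lsn);
        if newer {
            latest.insert(partner, v);
        }
    }

    latest
        .into_iter()
        .filter(|(_, v)| !v.tombstone)
        .map(|(p, v)| (p, v.lsn))
        .collect()
}
```

This covers both directions of the `csr_tombstones_preserved_for_correctness` scenario from §8.2: a newer memtable upsert resurrects an SST tombstone, and a newer memtable tombstone hides an SST edge.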
# RFC 019: Cross-snapshot NodeView cache

> *Mirrored from [`docs/rfc/019-node-view-cache-shared.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/019-node-view-cache-shared.md) in the engine repo. Source of truth lives there.*

**Status:** draft **Author(s):** Matías Fonseca **Builds on:** RFC-018 (CSR adjacency confirmed `lookup_node` as the dominant remaining cost), NodeView intra-snapshot cache (preserved as the L1 of the 3-tier lookup) **Supersedes:** none

## Summary

A previous step introduced a per-`Snapshot` `Mutex<HashMap<(String, NodeId), Option<NodeView>>>` and harvested \~12% in IC09 by capturing intra-query reuse (\~10× repeated access for friends-of-friends, joins probing the same node from both sides). The profile run (`NAMIDB_PROFILE_DUMP=1 NAMIDB_ADJACENCY=1`, IC09 scale=0.1, 3 params × 51 runs = 153 query executions) made the next lever obvious:

```plaintext
stage                              count     total_ms   avg_us
Snapshot::lookup_node            253,317   87,030.871  343.565
Snapshot::lookup_node_uncached   229,908   86,938.611  378.145
Snapshot::lookup_node.cache_hit   23,409        0.000    0.000
```

`lookup_node` is **99.4% of the IC09 wall-clock** and the intra-snapshot cache hits only about **9% of calls** (23.4K hits / 253K total). The remaining 91% pays the full SST candidate walk + bloom probe + parquet decode (\~378 µs each on average).

The reason for the low hit rate is structural, not algorithmic: the bench runner builds a **fresh `Snapshot` per query execution** (`runner.rs:107`, `writer.snapshot()` inside the warm-run loop). The intra-snapshot cache fills during one query and is dropped at the next.
**Cross-snapshot the LDBC fixture has fewer than 5K unique nodes**; if the cache survived across snapshots tied to the same `manifest_version`, the post-warmup hit rate would be \~99% and lookup\_node calls would collapse from 253K to \~5K. Esta RFC introduces **`NodeViewCache`** — an `Arc`-shared, cross-snapshot cache keyed by `(manifest_version, label, NodeId)` and storing `Option` (yes, including cached “not found” so subsequent lookups for a tombstoned or missing key skip the SST walk too). The shape is intentionally a near-clone of `AdjacencyCache` (RFC-018 §3): same eviction policy, same invalidation contract, same wiring pattern, same env-var-gated feature flag. The two caches compose orthogonally — together they cover \~99.5% of the IC09 wall-clock that the CSR adjacency plus this node cache can reach. ### Alcance v0 * **`NodeViewCache`** — `HashMap>` guarded by `Mutex`. `Option` so misses (tombstones, absent rows) are cached too (negative-cache; same correctness contract because the key includes `manifest_version`). * **3-tier `Snapshot::lookup_node`**: 1. **L1 (intra-snap)**: existing `node_cache: Mutex` — fast short-circuit when the same `(label, NodeId)` was hit earlier in **this** query. 2. **L2 (cross-snap)**: `NodeViewCache` shared `Arc`. Promotes the answer into L1 on hit so subsequent intra-snap calls bypass L2. 3. **L3 (cold path)**: `lookup_node_uncached` (the existing SST walk). Inserts result into both L2 and L1. * **Memory budget** configurable via `NAMIDB_NODE_CACHE_BUDGET_MIB` (default 256 MiB). For LDBC scale=0.1 (\~5K nodes × \~1 KiB / NodeView) \~5 MiB; for SF1 (\~500K nodes) \~500 MiB — operator knob. * **Routing**: `NAMIDB_NODE_CACHE=1` enables the L2; default OFF preserves the previous L1-only behaviour exactly. Same env-var pattern as `NAMIDB_ADJACENCY` and `NAMIDB_FACTORIZE`. 
* **Parity tests** — `tests/node_cache_parity.rs` mirror the CSR adjacency pattern: same Snapshot, two public APIs (`lookup_node_via_uncached` vs the default 3-tier path) compared. ### Out-of-scope v0 * **Negative-cache TTL or invalidation beyond manifest\_version**. v0 treats `None` results identically to `Some` — both cached, both invalidated when the manifest advances. If a writer commits then another reader queries the same key under the new version, the new version forces a fresh L2 entry (new key, new lookup). Edge case for LDBC: nonexistent. * **Pre-warming on `WriterSession::open`**. The cache fills lazily. Production interactive workloads pay a single cold-query penalty per `(manifest_version, label, node_id)` triple — same as AdjacencyCache. * **Disk-tier overflow**. Memory-only. When budget bites, evict by oldest `manifest_version` first (same FIFO-by-version as `AdjacencyCache`). * **Per-property invalidation**. v0 caches the full `NodeView` (every declared + ad-hoc property). When the writer commits an upsert changing a single property, the new `manifest_version` invalidates the whole entry — coarse but correct. ## Motivation Already covered by the profile data in the Summary. Repeating the expected impact: **Pre-rewrite (only CSR adjacency, NAMIDB\_ADJACENCY=1):** * IC09 p50 = **520 ms**. * `lookup_node` = 99.4% of wall-clock. * L1 hit rate = 9% (23K / 253K). **Post-rewrite (NAMIDB\_ADJACENCY=1 + NAMIDB\_NODE\_CACHE=1):** * Estimated L2 hit rate post-warmup = **\~98-99%** because LDBC IC\* has \~5K unique person/post/comment nodes and 153 runs touch the same \~hundreds repeatedly. * L3 (cold path) calls collapse from 230K to \~5K. Wall-clock saved: \~85 seconds across 153 runs = \~555 ms per run = **IC09 \~50-80 ms p50**. * Gate vs Kùzu IC09 estimated **30-50×** (was 317× post-CSR). * Other queries (IC02/07/08) reap similar relative wins since they’re also lookup\_node-dominated. 
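The expected-impact figures above follow from the profile dump by simple arithmetic. The sketch below reproduces them; the constants are copied from the profile table in the Summary, and the one assumption (taken from the text, not measured) is that roughly 5K cold lookups remain after warm-up. Function names are illustrative.

```rust
/// Figures copied from the IC09 profile dump (153 query executions).
const TOTAL_CALLS: f64 = 253_317.0;
const L1_HITS: f64 = 23_409.0;
const UNCACHED_CALLS: f64 = 229_908.0;
const UNCACHED_TOTAL_MS: f64 = 86_938.611;
const RUNS: f64 = 153.0;

/// Assumption from the text: about 5K cold (L3) lookups remain once the
/// cross-snapshot cache is warm.
const COLD_CALLS_AFTER: f64 = 5_000.0;

/// Fraction of `lookup_node` calls served by the intra-snapshot cache today.
fn l1_hit_rate() -> f64 {
    L1_HITS / TOTAL_CALLS
}

/// Fraction of calls that would avoid the cold path after warm-up.
fn expected_hit_rate() -> f64 {
    1.0 - COLD_CALLS_AFTER / TOTAL_CALLS
}

/// Wall-clock saved across all runs if only the remaining cold calls pay
/// the average uncached cost.
fn saved_ms_total() -> f64 {
    let avg_ms = UNCACHED_TOTAL_MS / UNCACHED_CALLS;
    UNCACHED_TOTAL_MS - COLD_CALLS_AFTER * avg_ms
}

/// Same saving, averaged per query execution.
fn saved_ms_per_run() -> f64 {
    saved_ms_total() / RUNS
}
```

Under these inputs the saving comes out at roughly 85 s in total, or about 555 ms per run, which is where the figures in the list come from. Note that this is a mean over all runs, so it is only loosely comparable with the p50 numbers quoted alongside it.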
**If the bench delivers <60 ms IC09**, this would be the **first iteration to cross the order-of-magnitude line** vs Kùzu in NamiDB history.

## Design

### 1. Data types

crates/namidb-storage/src/node\_cache.rs

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

use namidb_core::NodeId;

use crate::read::NodeView;

/// Compound key. (manifest_version, label, node_id). Two snapshots that
/// share the manifest version share the cache slot.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct NodeCacheKey {
    pub manifest_version: u64,
    pub label: String,
    pub node_id: NodeId,
}

/// Cached NodeView outcome. `None` means the cold path resolved to
/// "absent / tombstoned" — we cache the negative answer to avoid
/// repeating the SST walk.
pub type CachedNodeView = Option<NodeView>;

#[derive(Debug, Default)]
struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
    inserts: AtomicU64,
    evictions: AtomicU64,
}

pub struct NodeViewCache {
    inner: Mutex<HashMap<NodeCacheKey, CachedNodeView>>,
    capacity_bytes: usize,
    used_bytes: Mutex<usize>,
    stats: Arc<CacheStats>,
}
```

### 2. API

```rust
impl NodeViewCache {
    pub fn new(capacity_bytes: usize) -> Self;
    pub fn get(&self, key: &NodeCacheKey) -> Option<CachedNodeView>;
    pub fn insert(&self, key: NodeCacheKey, view: CachedNodeView);
    pub fn hits(&self) -> u64;
    pub fn misses(&self) -> u64;
    pub fn inserts(&self) -> u64;
    pub fn evictions(&self) -> u64;
    pub fn entries(&self) -> usize;
    pub fn used_bytes(&self) -> usize;
}
```

`get` returns `Option<CachedNodeView>`, which is `Option<Option<NodeView>>`:

* `None` → cache miss, caller goes to L3.
* `Some(Some(view))` → cached hit, view available.
* `Some(None)` → cached negative result, key was absent at this manifest version.

### 3. 3-tier `Snapshot::lookup_node`

```rust
pub async fn lookup_node(&self, label: &str, id: NodeId) -> Result<Option<NodeView>> {
    namidb_core::profile_scope!("Snapshot::lookup_node");

    // L1: intra-snapshot cache.
    let intra_key = (label.to_string(), id);
    if let Some(cached) = self.node_cache.lock().unwrap().get(&intra_key).cloned() {
        namidb_core::profile::record("Snapshot::lookup_node.l1_hit", 0);
        return Ok(cached);
    }

    // L2: cross-snapshot cache. Optional — controlled by
    // NAMIDB_NODE_CACHE + WriterSession-supplied Arc<NodeViewCache>.
    if let Some(shared) = &self.shared_node_cache {
        let shared_key = NodeCacheKey {
            manifest_version: self.manifest.manifest.version,
            label: label.to_string(),
            node_id: id,
        };
        if let Some(cached) = shared.get(&shared_key) {
            namidb_core::profile::record("Snapshot::lookup_node.l2_hit", 0);
            // Promote into L1 for the rest of this snapshot's life.
            self.node_cache.lock().unwrap().insert(intra_key, cached.clone());
            return Ok(cached);
        }
    }

    // L3: cold SST walk.
    let result = self.lookup_node_uncached(label, id).await?;

    // Insert into L1.
    self.node_cache.lock().unwrap().insert(intra_key, result.clone());

    // Insert into L2 (if attached).
    if let Some(shared) = &self.shared_node_cache {
        let shared_key = NodeCacheKey {
            manifest_version: self.manifest.manifest.version,
            label: label.to_string(),
            node_id: id,
        };
        shared.insert(shared_key, result.clone());
    }

    Ok(result)
}
```

### 4. `WriterSession` wiring

```rust
pub struct WriterSession {
    // ... existing
    adjacency_cache: Option<Arc<AdjacencyCache>>,
    node_cache: Option<Arc<NodeViewCache>>, // ← NEW
}

impl WriterSession {
    pub async fn open(...) -> Result<Self> {
        // ... existing
        let adjacency_cache = adjacency_enabled().then(...);
        let node_cache = node_cache_enabled().then(|| {
            Arc::new(NodeViewCache::new(node_cache_budget_bytes()))
        });
        Ok(Self { ..., adjacency_cache, node_cache })
    }

    pub fn snapshot(&self) -> Snapshot<'_> {
        let mut snap = Snapshot::new(...);
        if let Some(c) = &self.adjacency_cache {
            snap = snap.with_adjacency_cache(c.clone());
        }
        if let Some(c) = &self.node_cache {
            snap = snap.with_shared_node_cache(c.clone());
        }
        snap
    }
}
```

### 5.
Memory accounting

```rust
fn approx_size(view: &CachedNodeView) -> usize {
    match view {
        None => 32, // overhead allowance
        Some(v) => {
            v.label.capacity()
                + v.properties.iter().map(|(k, _)| k.capacity() + 64).sum::<usize>()
                + 128 // NodeId + lsn + schema_version + Box/Map overhead
        }
    }
}
```

The estimate is conservative. For the LDBC fixture, \~1 KiB per cached NodeView × \~5K unique nodes × 3 labels (Person, Post, Comment, all small) = a few MiB. Far below the 256 MiB default budget.

When `used_bytes + new_entry_size > capacity_bytes`, evict by oldest `manifest_version` first (same FIFO-by-version as AdjacencyCache). For LDBC the eviction path is rare: it only triggers when the manifest advances rapidly.

### 6. Tests

#### 6.1 Unit (`node_cache.rs::tests`, \~4 tests)

* `cache_get_miss_returns_none`.
* `cache_insert_then_get_returns_view`.
* `negative_cache_returns_inner_none_on_hit` — insert `None`, `get` returns `Some(None)` (not the `None` of the outer Option).
* `cache_evicts_oldest_version_when_over_budget`.

#### 6.2 Integration (`tests/node_cache_parity.rs`, \~3 tests)

Same shape as `tests/csr_adjacency_parity.rs`. Helpers:

```rust
async fn lookup_via_uncached(snap, label, id) -> Result<Option<NodeView>>;
async fn lookup_via_tiered  (snap, label, id) -> Result<Option<NodeView>>;
```

* `node_cache_parity_pure_sst` — flush some nodes, lookup via both paths, assert equal.
* `node_cache_parity_with_tombstone_overlay` — memtable tombstone hides SST upsert; cache promotes the negative answer; subsequent calls hit L2.
* `node_cache_reuses_across_snapshots` — snapshot1 misses + inserts; snapshot2 (same manifest\_version) hits.

### 7. Bench plan

**Triple run, scale=0.1, 50 warm runs, 3 params:**

1. **Baseline** (no flags) — intra-snapshot L1 only.
2. **CSR only** (`NAMIDB_ADJACENCY=1`) — RFC-018 path.
3. **CSR + NodeCache** (`NAMIDB_ADJACENCY=1 NAMIDB_NODE_CACHE=1`) — this RFC.

Expected delta vs baseline:

* IC02: 64 → \~25 ms.
* IC07: 7 → \~3 ms (already near Kùzu).
* IC08: 7 → \~3 ms.
* IC09: 596 → \~50-80 ms (\~8-12× improvement). **Gate vs Kùzu 30-50×**.

The profile dump (`NAMIDB_PROFILE_DUMP=1`) confirms:

* `lookup_node.l2_hit` count >> `lookup_node_uncached` count
* post-warmup hit rate \~98-99%.

## Alternatives considered

### A. Make NodeView cache `'static` on the namespace, not per-WriterSession

Pro: any tool building snapshots against the same namespace shares the cache (CLI, future REST API).

Con: cross-process invalidation requires real coordination (manifest version is monotonic but the cache lives in RAM; restart a process, lose the cache).

v0 stays per-WriterSession. Adequate for the bench harness; upgrade .

### B. Negative-cache-as-policy (don’t cache misses)

Rejected. Misses are the EXPENSIVE path. Caching them is the entire point of the L2: a Snapshot probing for a deleted node should NOT redo the SST walk after the first time. Same correctness contract as positive caching because the key includes `manifest_version`.

### C. Per-label sharding to reduce mutex contention

Premature. Single mutex contention at 1500 concurrent `lookup_node` calls × 343µs avg = \~500 µs of held-mutex time per second per snapshot. For a single tokio runtime executor that’s fine. If contention shows in multi-core production, switch to a `DashMap` or per-label `parking_lot` shards.

### D. Lift the cache into `SstCache::metadata`-style HashMap

Rejected for same reasons as RFC-018 §“Alternative D”: SstCache is path-keyed (`String`), NodeViewCache is semantic-keyed (`(manifest_version, label, NodeId)`). Mixing types in one cache muddles the abstraction.

## Drawbacks

1. **First-query latency unchanged** — cold start is still `lookup_node_uncached` cost (\~378 µs each, \~5K calls = \~2 s warmup for the full LDBC fixture). Mitigation: eager `pre_warm` helper on `WriterSession::open` that walks the manifest and pre-loads `NodeView`s. Deferred to v1.
2. **Cache memory pressure under schema churn**.
   If the writer flushes N times during one read burst, N×current\_entries get retained until eviction kicks in. Budget guard limits the damage but each `(manifest_version, label, node_id)` slot is a separate entry. Mitigation: aggressive FIFO-by-version eviction (same as AdjacencyCache).
3. **Negative-cache amplification on schema explorations**. A query that probes lots of “does this exist?” gets every miss pinned. Long-running interactive sessions may grow the cache to dataset size faster than the read working set. Memory budget keeps this bounded.
4. **3-tier path adds branches in the hot path**. Two extra `if let Some(...)` checks per lookup\_node call. Negligible (\~5-10 ns) but present. Profile shows current `lookup_node` at \~343 µs avg; the additional branches are \~0.002% overhead.
5. **Tests now have THREE pathways** to maintain parity over: `_uncached` (cold), `intra-snap` (L1 only), `tiered` (L1+L2+L3). Mitigated by exposing the path-forcing public APIs and writing targeted parity tests.

## Open questions

* **Q1: Default flag — when to flip?** My proposal: default OFF initially (validation phase), then flip the default to ON once property-aware routing and additional bench validation are complete.
* **Q2: Should AdjacencyCache and NodeViewCache share a global memory budget?** v0 keeps them separate (512 MiB + 256 MiB). If operator complaints come in, unify under a single configurable pool.
* **Q3: Drop the per-Snapshot L1?** Once L2 is enabled, L1 is mostly redundant (after the first L2 promotion). But L1 hit is free (no hash + mutex), so keeping it as a fast-path is cheap.

## References

* **Profile data** — `/tmp/bench-profile-ic09.stderr` reproducible via `NAMIDB_PROFILE_DUMP=1 NAMIDB_ADJACENCY=1 cargo run --release -p namidb-bench -- run --only ic09 --scale 0.1 --warm-runs 50 --param-count 3`.
* **RFC-018** — same Arc-shared cache pattern for adjacency.
* **Intra-snapshot `node_cache`** — original per-snapshot design retained as L1.
* **Kùzu CIDR 2023 §3.2** — “Node tables, materialized in an in-memory page-cached buffer; lookups are direct array index on internal NodeOffset”. NamiDB’s equivalent at slice-of-Snapshot vs Kùzu’s lifetime-of-database is the analogous trade-off; we keep manifest-version-bounded freshness instead of paying for invalidation protocols.

# RFC 020: Cross-snapshot edge SST caches

> **Status:** accepted **Author(s):** Matías Fonseca **Supersedes:** none

> *Mirrored from [`docs/rfc/020-edge-sst-caches.md`](https://github.com/namidb/namidb/blob/main/docs/rfc/020-edge-sst-caches.md) in the engine repo. Source of truth lives there.*

## Summary

Two cross-snapshot caches in `SstCache` that close the gate at SF10. Both work the same way: an `Arc`-shared `HashMap` guarded by a `Mutex`, populated lazily on first read of each SST, shared across every `Snapshot` the namespace emits. SSTs are immutable per UUIDv7-keyed path so cached entries never go stale.

* **`edge_streams: HashMap<String, Arc<EdgeStreamBundle>>`** — decoded `__overflow_json` + every declared property column (one decoded vector per column). Eliminates the `O(edge_count)` zstd-decompress + JSON-parse done on every `edge_lookup_via_sst` call.
* **`edge_readers: HashMap<String, Arc<EdgeSstReader>>`** — parsed header + footer + fence index + precomputed `cumulative_edges` prefix sum. Eliminates the `O(edge_count)` partner-block walk done by `EdgeSstReader::open` on every call.

Together they take `edge_lookup_via_sst` from `O(edge_count)` to `O(deg + log edge_count)` in the warm path, which is what shipping plan-aware routing and the gate at SF10 demanded.

## Motivation

After plan-aware routing closed the property caveat (queries that read `r.prop` route through the SST path, queries that only need topology route through CSR), the SST path became a hot path for any query whose plan reads a relationship’s properties.
Profile data from `NAMIDB_PROFILE_DUMP=1` on IC07 at SF1 + SF10 showed every `edge_lookup_via_sst` call doing two pieces of `O(edge_count)` work: 1. **`read_overflow_strings` + `load_declared_streams`** — pulled every property stream off the SST, zstd-decoded each one, parsed JSON for every row. For LIKES at SF1 (100K edges) this was \~1.4 ms/call. The original comment on the code already named the fix: *“the foyer-rs follow-up will cache the parsed vector per SST id”*. 2. **`EdgeSstReader::open`** — walked every partner block to build the `cumulative_edges: Vec` prefix sum that the binary search in `EdgeSstReader::lookup` indexes into. For LIKES at SF10 (1M edges) this dominated the call. The edge-stream cache added (1). The edge-reader cache added (2). Together they cut IC07 from 9942 µs to 2262 µs at SF10 (4.4× warm-path speedup, gate ratio 7.70× → 1.75×). ## Design ### Data shape crates/namidb-storage/src/cache.rs ```rust pub struct EdgeStreamBundle { pub overflow: Option>>, // RFC-002 __overflow_json pub declared: Vec<(String, Vec>)>, // RFC-002 §3.2.7 } pub struct SstCache { inner: Arc>, // body cache metadata: Arc>>>, // RFC-003 edge_streams: Arc>>>, edge_readers: Arc>>>, stats: Arc, } ``` ### Lookup flow ```rust async fn edge_lookup_via_sst( &self, edge_type: &str, key: NodeId, direction: EdgeDirection, ) -> Result { for idx in candidates { let desc = &self.manifest.manifest.ssts[idx]; if !self.bloom_admits(desc, &key_bytes).await? { continue; } let absolute = format!("{}/{}", self.paths.namespace_prefix(), desc.path); // Arc from cache (build only on miss). let reader = self.fetch_edge_reader(&absolute).await?; let Some(lookup) = reader.lookup(&key_bytes)? else { continue; }; // Arc from cache (decode only on miss). let streams = self.fetch_edge_streams(&absolute, edge_type, &reader)?; // ... O(deg) loop over lookup.partners, decoding from streams.* ... } Ok(EdgeListView { edges: /* ... */ }) } ``` ### Helper functions Both helpers live on `Snapshot`. 
They short-circuit on cache hit and do the expensive work + insertion on miss. Multiple callers can race on miss (no per-key locking) — the slow second writer’s insert simply overwrites the same `Arc`, which is harmless because the data is content-addressable by SST path.

```rust
async fn fetch_edge_reader(&self, absolute: &str) -> Result<Arc<EdgeSstReader>> {
    namidb_core::profile_scope!("Snapshot::fetch_edge_reader");
    if let Some(cache) = self.cache.as_ref() {
        if let Some(reader) = cache.get_edge_reader(absolute) {
            return Ok(reader);
        }
    }
    let body = self.fetch_bytes(absolute).await?;
    let reader = Arc::new(EdgeSstReader::open(body)?);
    if let Some(cache) = self.cache.as_ref() {
        cache.insert_edge_reader(absolute.to_string(), reader.clone());
    }
    Ok(reader)
}

fn fetch_edge_streams(
    &self,
    absolute: &str,
    edge_type: &str,
    reader: &EdgeSstReader,
) -> Result<Arc<EdgeStreamBundle>> {
    namidb_core::profile_scope!("Snapshot::fetch_edge_streams");
    if let Some(cache) = self.cache.as_ref() {
        if let Some(bundle) = cache.get_edge_streams(absolute) {
            return Ok(bundle);
        }
    }
    let declared_property_names = /* read from manifest schema */;
    let bundle = Arc::new(EdgeStreamBundle {
        overflow: reader.read_overflow_strings()?,
        declared: load_declared_streams(reader, &declared_property_names)?,
    });
    if let Some(cache) = self.cache.as_ref() {
        cache.insert_edge_streams(absolute.to_string(), bundle.clone());
    }
    Ok(bundle)
}
```

### Memory footprint

* **`EdgeStreamBundle`**: roughly `n_edges × n_declared_columns × avg_json_size`. For LIKES at SF10 with one declared `creationDate` (Int64): 1M × 1 × \~10 B JSON = \~10 MB per SST. The overflow column is empty when all props are declared.
* **`EdgeSstReader`**: \~8 B per edge for `cumulative_edges` plus the SST body’s `Bytes` refcount. For SF10 LIKES that is \~8 MB per SST.

The total across the LDBC SF10 dataset (7 edge types × \~1 SST each post-bulk-load) is below 100 MB — comfortably below the default `NAMIDB_SST_CACHE_BUDGET_MIB=256`.

### Invalidation

None needed.
SST paths are UUIDv7-derived and never overwritten; compaction emits new SSTs and atomically swaps the manifest. A cached entry whose backing SST was compacted away is harmless dead weight in the HashMap. Eviction-by-LRU is a TODO when the cache size becomes a real concern. ## Alternatives considered ### A. Foyer hybrid cache for everything The `SstCache.inner` already uses `foyer::Cache` for raw bodies. Putting `Arc` into the same `foyer` cache would give automatic eviction-by-LRU and weight accounting. **Rejected for v0**: `foyer::Cache` requires `Send + Sync + 'static` values with `Weighter` traits. `Arc` is `Send + Sync` but the weighter needs to read its `cumulative_edges.len()` — doable but introduces a tighter coupling between the cache and reader internals. The plain `Mutex` is 30 lines of code and matches the lifetime story (immutable-per-path). ### B. Cache the decoded streams inside `EdgeSstReader` via `OnceCell` Make `read_overflow_strings` and `read_declared_property_strings` memoize their results inside the reader itself. Combined with B’s reader cache, the streams come along for the ride. **Rejected**: would require changing the `EdgeSstReader` public API from `read_*` returning `Result>>` to either `Result<&Option>>` (lifetime ties results to reader borrow) or owning + `Arc>`. The two-cache approach keeps the reader side stateless and the cache responsibility scoped to one module. ### C. Compaction-side baked layout (RFC-005 follow-up) If the SST writer pre-built the `cumulative_edges` prefix sum and serialised it into the footer as a separate section, `open` would be `O(section_read)` instead of `O(edge_count)`. **Deferred** but not rejected. The cache makes the warm path free today; the on-disk layout change is the right v1 once the per-SST size grows beyond what fits comfortably in RAM. It is a write-time * format-version change. ## Drawbacks * **Memory cost**: unbounded HashMap maps grow until the namespace is closed. 
For long-running multi-tenant servers we will need LRU eviction tied to `SstCache.inner`’s weighter. Tracked as a follow-up. * **Race-on-miss**: two concurrent readers that both miss the cache will both decode + insert. The work is duplicated but the result is identical and the cache state stays consistent (the second writer overwrites the first’s `Arc` with content-identical data). No correctness issue, minor wasted CPU. * **Cold first query is unchanged**: the first SST scan still pays the `O(edge_count)` decode + reader build. Pre-warming on `WriterSession::open` would amortise this, but it complicates the open path and is best left as an optional follow-up. ## Open questions * **LRU eviction tied to the SST body cache budget.** When the body cache (foyer) evicts an SST body, should `edge_streams[k]` and `edge_readers[k]` evict in lockstep? Probably yes, but it requires foyer eviction callbacks we have not wired yet. * **`NodeSstReader` analogue.** Nodes go through Parquet which has its own metadata cache (RFC-003). Is there a `lookup_node_via_sst` hot path that would benefit from a third cache? Profiling so far says no — the L1/L2 `NodeViewCache` covers the node side. ## Bench impact Before S17.3 + S18.B (SF10, IC07, p50 of 3 params): ```plaintext Query NamiDB p50 Kùzu p50 Ratio IC07 9942 µs 1292 µs 7.70x ← FAIL gate (2x) ``` After both caches (default ON, no other changes): ```plaintext Query NamiDB p50 Kùzu p50 Ratio IC07 2262 µs 1292 µs 1.75x ← PASS gate (2x) ``` Other queries (IC02 / IC08 / IC09) are unaffected because their plans either avoid the SST path (CSR routing) or read nodes more than edges. Test count: +0 (caches are perf, no semantics change; the existing LDBC + storage unit tests cover correctness). # Backup & restore > `aws s3 sync`. There is no separate metadata to capture. NamiDB stores **everything** in the bucket — manifest, WAL, SSTs, schema. 
There is no external lock table, no separate metadata service, no per-tenant state living anywhere else. **Backup is `aws s3 sync`. Restore is `aws s3 sync` in the other direction.** ## Per-namespace backup ```bash aws s3 sync \ s3://my-bucket/data/tenant-acme/ \ ./backups/tenant-acme-2026-05-19/ ``` ## Cross-region / cross-bucket replication ```bash aws s3 sync \ s3://my-bucket/data/ \ s3://my-backup-bucket-dr/data/ ``` You can do this **online** — readers and writers can be active during the sync. The worst case is that the backup captures a slightly older manifest version than what’s currently live. NamiDB’s snapshot semantics + epoch fencing guarantee the captured state is internally consistent. ## Restore To restore a snapshot to a fresh bucket: ```bash aws s3 sync ./backups/tenant-acme-2026-05-19/ s3://my-new-bucket/data/tenant-acme/ ``` Then open the namespace at the new URI. NamiDB reads the manifest and boots normally. ## Migrating between backends Same idea — sync between any two URIs: ```bash # file:// → s3:// aws s3 sync /var/lib/namidb/prod/ s3://my-bucket/data/prod/ # s3:// → file:// aws s3 sync s3://my-bucket/data/prod/ /var/lib/namidb/prod/ # s3:// → gs:// (use gcloud rsync) gsutil -m rsync -r s3://my-bucket/data/prod/ gs://my-bucket/data/prod/ ``` After the sync, point your client at the new URI. The graph is the same. ## What about in-flight writes during a restore? Restoring while writers are active will fence them via epoch CAS — the manifest the writer is racing against gets overwritten, the writer’s next commit attempt returns `412 Precondition Failed`, and the writer re-bootstraps. **No corruption is possible** because the manifest is the only authoritative root. For a clean restore, stop writers, do the sync, then bring writers back up. 
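For scripted backups, the per-namespace sync shown earlier on this page can be wrapped in a small helper. This is a minimal sketch, assuming the `data/<namespace>/` bucket layout used in the examples here and that the `aws` CLI is on `PATH`; `backup_command` and its parameters are hypothetical names, not part of NamiDB.

```python
import datetime


def backup_command(bucket: str, namespace: str, dest_root: str,
                   day: datetime.date) -> list[str]:
    """Build the `aws s3 sync` argv for one namespace, date-stamped."""
    source = f"s3://{bucket}/data/{namespace}/"
    dest = f"{dest_root}/{namespace}-{day.isoformat()}/"
    return ["aws", "s3", "sync", source, dest]


cmd = backup_command("my-bucket", "tenant-acme", "./backups",
                     datetime.date(2026, 5, 19))
print(" ".join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Building the argument list separately from running it keeps the helper easy to test and makes it straightforward to swap the source and destination for a restore.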
## See also * [Snapshots & epoch fencing](/en/concepts/snapshots-and-epoch-fencing) * [The bucket is the database](/en/concepts/bucket-is-the-database) # Configuration > Every environment variable that tunes the NamiDB engine and the namidb-server daemon. NamiDB defaults are sane for most workloads. Reach for these env vars when you’re debugging performance, memory, or correctness. ## Engine | Env var | Default | What it does | | --------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------- | | `NAMIDB_ADJACENCY` | `on` | Process-wide CSR adjacency cache shared across snapshots ([RFC-018](/en/internals/rfcs/018-csr-adjacency)). Set to `off` to disable. | | `NAMIDB_NODE_CACHE` | `on` | Cross-snapshot `NodeView` lookup cache ([RFC-019](/en/internals/rfcs/019-node-view-cache-shared)). | | `NAMIDB_SST_CACHE` | `on` | SST body + decoded edge property streams + parsed `EdgeSstReader` ([RFC-020](/en/internals/rfcs/020-edge-sst-caches)). | | `NAMIDB_FACTORIZE` | `off` | Factorized intermediate representation in the executor ([RFC-017](/en/internals/rfcs/017-factorization)). Turn on for path-heavy queries. | | `NAMIDB_PROFILE_DUMP` | `off` | Dump per-stage profile counters to stderr after each query. | ### Cache budgets | Env var | Default | Notes | | ----------------------------- | ------- | ------------------------------------- | | `NAMIDB_ADJACENCY_BUDGET_MB` | `512` | RAM ceiling for the CSR cache. | | `NAMIDB_NODE_CACHE_BUDGET_MB` | `512` | RAM ceiling for the `NodeView` cache. | | `NAMIDB_SST_CACHE_BUDGET_MB` | `512` | RAM ceiling for the SST cache. | For server workloads, bump these to 2–8 GiB per cache. For embedded use inside a Lambda or a small container, halve them. 
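The sizing guidance above can be turned into a small helper that emits the three budget variables. This is a minimal sketch, assuming the 512 MB defaults from the table, 2 GiB per cache for servers (the low end of the suggested range), and halved defaults for embedded use; `cache_budget_env` and the profile names are hypothetical.

```python
DEFAULT_MB = 512  # default taken from the cache-budget table

BUDGET_VARS = (
    "NAMIDB_ADJACENCY_BUDGET_MB",
    "NAMIDB_NODE_CACHE_BUDGET_MB",
    "NAMIDB_SST_CACHE_BUDGET_MB",
)


def cache_budget_env(profile: str) -> dict[str, str]:
    """Return env-var settings for a deployment profile (illustrative)."""
    per_cache_mb = {
        "default": DEFAULT_MB,
        "server": 2 * 1024,           # 2 GiB, low end of the 2-8 GiB range
        "embedded": DEFAULT_MB // 2,  # halved for Lambda or small containers
    }[profile]
    return {name: str(per_cache_mb) for name in BUDGET_VARS}


for name, value in cache_budget_env("embedded").items():
    print(f"export {name}={value}")
```

The three caches are sized independently, so a real deployment may want different values per cache once `client.cache_stats()` shows which one is under pressure.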
## `namidb-server`

| Env var | Default | What it does |
| --- | --- | --- |
| `NAMIDB_STORE` | — (required) | Storage URI (e.g. `s3://bucket?ns=prod&region=us-east-1`). |
| `NAMIDB_LISTEN` | `0.0.0.0:8080` | TCP bind address. |
| `NAMIDB_AUTH_TOKEN` | unset (open) | Bearer token. **When unset the server warns and accepts all requests** — do not expose unauthenticated to the public internet. |
| `NAMIDB_FLUSH_INTERVAL` | `30s` | Background memtable → L0 flush cadence. `0s` disables the loop. |

CLI flags mirror the env vars: `--store`, `--listen`, `--auth-token`, `--flush-interval`.

## Cloud / storage credentials

NamiDB reads the **standard env vars** for each cloud:

| Backend | Env vars |
| --- | --- |
| AWS S3 | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_DEFAULT_REGION` |
| Cloudflare R2 | Same as S3 — set R2 token as `AWS_*` |
| Google Cloud Storage | `GOOGLE_APPLICATION_CREDENTIALS` (path to JSON key) |
| Azure Blob | `AZURE_STORAGE_ACCOUNT_NAME`, `AZURE_STORAGE_ACCESS_KEY` |
| MinIO / LocalStack | Same as S3 — point at `endpoint=…` in the URI |

IAM roles on EC2 / EKS / Lambda / ECS work transparently — no NamiDB-specific auth to wire.

## Tracing & logs

NamiDB uses [`tracing`](https://crates.io/crates/tracing). Standard env-filter applies:

```bash
export RUST_LOG=namidb=info,namidb_storage=debug
```

JSON logs:

```bash
export NAMIDB_LOG_FORMAT=json
```

## See also

* [Tuning](/en/operations/tuning)
* [URI grammar](/en/operations/uri-grammar)
* [Observability](/en/operations/observability)

# Observability

> Tracing, per-stage profiling, cache stats. Where to plug NamiDB into Prometheus, Grafana, OpenTelemetry.
NamiDB is instrumented with [`tracing`](https://crates.io/crates/tracing) on every `pub` async function in the workspace. OpenTelemetry export is on the roadmap; today you have: ## Tracing ```bash export RUST_LOG=namidb=info,namidb_storage=debug export NAMIDB_LOG_FORMAT=json # optional, structured ``` Every Cypher call emits a `tracing` span with the parsed plan, the chosen physical operators, and per-stage timings. ## Per-stage profile dump ```bash export NAMIDB_PROFILE_DUMP=1 ``` After every query, NamiDB prints a per-stage counter block to stderr: parse, lower, optimise, execute, with row counts and µs / stage. ## Cache stats (Python) ```python print(client.cache_stats()) # { # "adjacency": {"hits": ..., "misses": ..., "bytes": ...}, # "node_view": {...}, # "sst": {...} # } ``` Hook this into your dashboards to spot working-set vs budget mismatches. ## Server health endpoints ```bash curl http://your-host:8080/v0/health | jq . curl http://your-host:8080/v0/version | jq . ``` `/v0/health` returns the manifest version, the epoch, and last commit timestamp. ## Roadmap * **`/v0/metrics`** — Prometheus exposition (counters, latency histogram, cache hit rates). * **OpenTelemetry export** — spans + metrics over OTLP. * **Structured `EXPLAIN ANALYZE`** — runtime row counts per operator. Track progress in [github.com/namidb/namidb/issues](https://github.com/namidb/namidb/issues). ## See also * [Configuration](/en/operations/configuration) * [Tuning](/en/operations/tuning) # Self-host with Docker Compose > A complete, self-contained graph database in one file — MinIO + namidb-server, authenticated REST on :8080. A copy-paste **self-hosted NamiDB stack** that runs anywhere Docker Compose runs. MinIO holds the bucket, `namidb-server` serves the namespace over an authenticated REST API. 
## docker-compose.yml

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - minio-data:/data
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 3s
      retries: 30

  bucket-init:
    image: minio/mc
    depends_on:
      minio:
        condition: service_healthy
    entrypoint: >
      sh -c "
      mc alias set local http://minio:9000 minioadmin minioadmin &&
      mc mb --ignore-existing local/namidb
      "

  namidb-server:
    image: namidb-server:0.3 # built from crates/namidb-server/Dockerfile
    depends_on:
      bucket-init:
        condition: service_completed_successfully
    environment:
      NAMIDB_STORE: "s3://namidb?ns=prod&endpoint=http://minio:9000&region=us-east-1&allow_http=true"
      NAMIDB_LISTEN: "0.0.0.0:8080"
      NAMIDB_AUTH_TOKEN: "${NAMIDB_AUTH_TOKEN:?set NAMIDB_AUTH_TOKEN in your env}"
      NAMIDB_FLUSH_INTERVAL: "30s"
      AWS_ACCESS_KEY_ID: "minioadmin"
      AWS_SECRET_ACCESS_KEY: "minioadmin"
    ports:
      - "8080:8080"

volumes:
  minio-data: {}
```

## Bring it up

1. **Build the server image** (one-time, from the engine repo root):

   ```bash
   docker build -t namidb-server:0.3 \
     -f crates/namidb-server/Dockerfile .
   ```

2. **Generate an auth token**:

   ```bash
   export NAMIDB_AUTH_TOKEN=$(openssl rand -hex 32)
   ```

3. **Start the stack**:

   ```bash
   docker compose up -d
   ```

4. **Smoke-test**:

   ```bash
   curl -s http://localhost:8080/v0/health | jq .
   curl -s -X POST http://localhost:8080/v0/cypher \
     -H "Authorization: Bearer $NAMIDB_AUTH_TOKEN" \
     -H 'Content-Type: application/json' \
     -d '{"query": "CREATE (a:Person {name: \"Alice\"}) RETURN a.name AS name"}' \
     | jq .
   ```

That’s it. **A graph database, your data on disk in MinIO, an authenticated REST API on `:8080`.**

## Move it to a real cloud

Swap one env var.
Everything else stays:

```yaml
environment:
  NAMIDB_STORE: "s3://my-bucket?ns=prod&region=us-east-1"
  # NAMIDB_STORE: "s3://my-bucket?ns=prod&endpoint=https://.r2.cloudflarestorage.com&region=auto"
  # NAMIDB_STORE: "gs://my-bucket?ns=prod"
  # NAMIDB_STORE: "az://acct/container?ns=prod"
```

Same engine, same Docker image. **Object storage is the source of truth.**

## See also

* [HTTP API reference](/en/sdk/http)
* [Configuration](/en/operations/configuration)
* [MinIO / Tigris / LocalStack](/en/operations/storage/minio-tigris-localstack)

# AWS S3

> NamiDB's primary, headline backend. Credentials, IAM permissions, multi-region considerations.

**AWS S3 is the primary path** for NamiDB. The engine was designed against S3’s conditional-write semantics; every other S3-compatible backend (R2, MinIO, LocalStack, Tigris) is validated against the same test suite.

## Open a namespace

```python
import namidb as tg

client = tg.Client("s3://my-bucket/data?ns=prod&region=us-east-1")
```

```rust
let (store, paths) = parse_uri("s3://my-bucket/data?ns=prod&region=us-east-1")?;
```

## Credentials

Credentials read from the standard AWS env vars:

```bash
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=... # if using temporary creds
export AWS_DEFAULT_REGION=us-east-1
```

**IAM roles on EC2 / EKS / Lambda / ECS work transparently** — no NamiDB-specific auth to wire. The Rust `object_store` crate uses the standard provider chain.

Query-string `region=…` overrides `AWS_DEFAULT_REGION`.

## IAM permissions

The minimum IAM permissions NamiDB needs on the bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket"
    }
  ]
}
```

That’s it. **No DynamoDB lock table.
No separate metadata service.** ## Cross-region considerations * Pick the region closest to your readers. Cross-region GET latency is the dominant factor in cold-read time. * NamiDB caches ([RFC-018](/en/internals/rfcs/018-csr-adjacency), [RFC-019](/en/internals/rfcs/019-node-view-cache-shared), [RFC-020](/en/internals/rfcs/020-edge-sst-caches)) hide most of the cost for warm working sets. * For multi-region read replicas, run a `namidb-server` per region pointed at the same bucket — only one will be allowed to commit writes; the rest serve reads. ## Pricing knobs * **Storage**: standard S3 storage class. Cold tenants can be moved to S3 Intelligent-Tiering or Glacier IR (re-warmup latency applies). * **Egress**: NamiDB issues only as many GETs as the working set requires. Cache hits cost nothing. * **Requests**: writes do 1 WAL PUT + 1 manifest PUT per commit batch. Compaction adds background PUTs. ## See also * [Cloudflare R2 (zero-egress alternative)](/en/operations/storage/cloudflare-r2) * [URI grammar](/en/operations/uri-grammar) · [Configuration](/en/operations/configuration) * [Backup & restore](/en/operations/backup-restore) # Azure Blob Storage > az:// scheme. Storage-account key, managed identity, or Azurite emulator. ## Open a namespace ```python import os os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "myacct" os.environ["AZURE_STORAGE_ACCESS_KEY"] = "..." import namidb as tg client = tg.Client("az://myacct/mycontainer?ns=prod") ``` URI form: `az:///[/]?ns=`. ## Managed Identity (AKS / VMSS / App Service) When NamiDB runs inside Azure, prefer Managed Identity over a long-lived access key. The Rust `object_store` crate picks up the managed identity through the standard credential chain — no NamiDB-specific config needed. 
## Azurite (local emulator)

For development and CI:

```bash
docker run -p 10000:10000 mcr.microsoft.com/azure-storage/azurite
```

```python
client = tg.Client(
    "az://devstoreaccount1/mycontainer?ns=test&use_emulator=true"
)
```

The `use_emulator=true` flag rewrites the endpoint to `http://127.0.0.1:10000/devstoreaccount1`.

## Conditional writes

Azure Blob supports `If-Match` on `PutBlob`, which `object_store` maps to NamiDB’s CAS protocol. Single-writer-per-namespace + epoch-CAS invariants apply identically.

## See also

* [URI grammar](/en/operations/uri-grammar)
* [Configuration](/en/operations/configuration)

# Cloudflare R2

> Zero-egress alternative to S3 with full conditional-write support. Same URI scheme, R2 endpoint.

**R2 is the zero-egress alternative.** It speaks the S3 API, supports the conditional writes NamiDB depends on, and charges **no egress fees**. If you’re running NamiDB outside AWS — on Cloudflare Workers, Fly.io, your own VPS, your laptop — **R2 is almost always the right call**.

## Open a namespace

```python
import os

import namidb as tg

os.environ["AWS_ACCESS_KEY_ID"] = ""
os.environ["AWS_SECRET_ACCESS_KEY"] = ""

client = tg.Client(
    "s3://my-bucket?ns=prod"
    "&endpoint=https://.r2.cloudflarestorage.com"
    "&region=auto"
)
```

The R2 endpoint lives at `https://.r2.cloudflarestorage.com`. Region must be `auto`.

## Creating a bucket + token

1. In the [Cloudflare dashboard](https://dash.cloudflare.com), go to **R2** → **Create bucket**.
2. **R2** → **Manage R2 API Tokens** → **Create API Token** with “Object Read & Write” on your bucket.
3. The token download includes an `Access Key ID` and a `Secret Access Key`. Set them as `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` and you’re done.
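With the bucket and token in place, the connection URI from "Open a namespace" can be assembled programmatically. This is a minimal sketch, assuming the R2 endpoint host is prefixed with your Cloudflare account ID; `r2_uri`, `account_id`, and the `abc123` value are hypothetical placeholders, not part of the NamiDB API.

```python
def r2_uri(bucket: str, namespace: str, account_id: str) -> str:
    """Assemble an s3:// URI pointing at an R2 endpoint (illustrative)."""
    # Assumption: the endpoint host starts with the Cloudflare account ID.
    endpoint = f"https://{account_id}.r2.cloudflarestorage.com"
    # R2 expects the region to be "auto".
    return f"s3://{bucket}?ns={namespace}&endpoint={endpoint}&region=auto"


uri = r2_uri("my-bucket", "prod", "abc123")
print(uri)
```

The resulting string can be passed to `tg.Client(...)` in the same way as the literal URI in the example above.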
## What works | Capability | Status | | ------------------------------------------------- | --------------------- | | Conditional writes (`If-Match` / `If-None-Match`) | ✅ | | Multi-region replication | ✅ (R2 jurisdictions) | | Per-bucket lifecycle policies | ✅ | | Custom endpoints / DNS | ✅ (R2 public buckets) | ## Cost considerations * **Storage**: $0.015/GB-month (Standard). * **Class A operations** (PUT, COPY, LIST): $4.50 per million. * **Class B operations** (GET, HEAD): $0.36 per million. * **Egress**: $0. For agent-memory workloads with read-heavy patterns, R2 is usually \~30% cheaper than equivalent S3, dominated by zero egress. ## See also * [AWS S3](/en/operations/storage/aws-s3) — the primary backend * [URI grammar](/en/operations/uri-grammar) # Google Cloud Storage > GCS backend via the gs:// URI scheme. Service-account auth. ## Open a namespace ```python import os os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/etc/gcs-key.json" import namidb as tg client = tg.Client("gs://my-bucket/data?ns=prod") ``` Or pin the service-account path per URI: ```python client = tg.Client( "gs://my-bucket?ns=prod&service_account=/etc/gcs-key.json" ) ``` ## Permissions The service account needs: * `storage.objects.get` * `storage.objects.create` * `storage.objects.delete` * `storage.objects.list` The pre-built `Storage Object User` role covers all four. ## Workload Identity (GKE / Cloud Run) When NamiDB runs inside GCP, prefer [Workload Identity Federation](https://cloud.google.com/iam/docs/workload-identity-federation) over a long-lived JSON key. NamiDB picks up the federated identity through the standard `GOOGLE_APPLICATION_CREDENTIALS` chain — no NamiDB-specific config needed. ## Conditional writes GCS supports the `x-goog-if-generation-match` header, which `object_store` maps to the same conditional-write semantics NamiDB uses on S3. The same single-writer-per-namespace + epoch-CAS invariants apply. 
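To make the generation-match idea concrete, here is a toy in-memory model of a conditional write and the way it fences a second writer. It is a sketch under stated assumptions, not NamiDB's or `object_store`'s implementation: `FakeBucket`, `put_if_generation_match`, and `PreconditionFailed` are hypothetical names, and generation `0` is used to mean "object does not exist yet".

```python
class PreconditionFailed(Exception):
    """Raised when the expected generation does not match the stored one."""


class FakeBucket:
    """In-memory stand-in for an object store with generation-match writes."""

    def __init__(self) -> None:
        self._objects: dict[str, tuple[int, bytes]] = {}

    def get(self, key: str) -> tuple[int, bytes]:
        # Generation 0 means the object has never been written.
        return self._objects.get(key, (0, b""))

    def put_if_generation_match(self, key: str, body: bytes, expected: int) -> int:
        current, _ = self.get(key)
        if current != expected:
            raise PreconditionFailed(f"expected {expected}, found {current}")
        self._objects[key] = (current + 1, body)
        return current + 1


bucket = FakeBucket()
gen_a, _ = bucket.get("manifest")  # writer A observes generation 0
gen_b, _ = bucket.get("manifest")  # writer B observes generation 0
bucket.put_if_generation_match("manifest", b"v1 from A", gen_a)  # A commits
try:
    bucket.put_if_generation_match("manifest", b"v1 from B", gen_b)
    b_committed = True
except PreconditionFailed:
    b_committed = False  # B lost the race and must re-read before retrying
print(b_committed, bucket.get("manifest"))
```

Only one of the two racing writers can advance the manifest; the loser sees a precondition failure instead of silently overwriting, which is the property the single-writer invariant relies on.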
## See also

* [URI grammar](/en/operations/uri-grammar)
* [Configuration](/en/operations/configuration)

# Local filesystem (file://)

> Durable single-machine storage via flock-based CAS + atomic rename. Same correctness story as S3, no network.

For development, single-machine deployments, and CI fixtures.

```python
import namidb as tg

client = tg.Client("file:///var/lib/namidb?ns=prod")

# Relative paths work too:
client = tg.Client("file://./data?ns=dev")
```

## How CAS works on a filesystem

NamiDB implements **manifest CAS** on a local filesystem via:

* Per-namespace `flock(2)` for write-side serialisation
* Atomic `rename(2)` for the manifest swap

This passes the same concurrency test suite as `s3://`. **Single-writer-per-namespace + epoch fencing applies identically.**

## When to use it

* **CI fixtures** — durable, no network, fast.
* **Single-machine production** — when “all my data fits on one disk and one process” is the right shape.
* **Local development** — when you want to test the full LSM cycle (flush, compaction, snapshot lifetime) without spinning up MinIO.

## Layout on disk

```plaintext
/var/lib/namidb/{namespace}/
├── manifest.json
├── manifest.lock   ← flock target
├── wal/
└── sst/
```

You can `aws s3 sync` between `file://` and `s3://` paths to migrate.

## What you give up

* **Durability beyond the disk.** No replication, no multi-AZ, no versioning. Whatever your filesystem and your backup story give you, that’s what NamiDB gives you.
* **Multi-machine reads.** A `file://` namespace is owned by one host. Two hosts mounting the same NFS / EFS path are **not supported** (NFS `flock` semantics aren’t strict enough).

## See also

* [URI grammar](/en/operations/uri-grammar)
* [Backup & restore](/en/operations/backup-restore)

# MinIO / Tigris / LocalStack

> Any S3-compatible endpoint. Self-hosted (MinIO), edge-native (Tigris), or local-only (LocalStack).

NamiDB’s `s3://` scheme works against **any S3-compatible endpoint**.
The only thing that changes is `endpoint=…` in the URI.

## MinIO (self-hosted)

The canonical “S3 on my own metal” backend.

```bash
docker run -d --rm -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  --name minio minio/minio server /data --console-address ":9001"

docker exec minio mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
docker exec minio mc mb local/namidb
```

```python
import os

import namidb as tg

os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"

client = tg.Client(
    "s3://namidb?ns=dev"
    "&endpoint=http://127.0.0.1:9000"
    "&region=us-east-1"
    "&allow_http=true"
)
```

The `allow_http=true` flag is required because MinIO does not serve TLS by default.

For a production-style **MinIO + `namidb-server` + docker-compose** stack, see [Self-host with Docker Compose](/en/operations/self-host-docker-compose).

## Tigris

[Tigris](https://www.tigrisdata.com/) is an edge-native S3-compatible storage service.

```python
client = tg.Client(
    "s3://my-bucket?ns=prod"
    "&endpoint=https://fly.storage.tigris.dev"
    "&region=auto"
)
```

Set `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` to your Tigris token.

## LocalStack

For tests that need a local S3 mock:

```bash
docker run -p 4566:4566 -e SERVICES=s3 localstack/localstack

aws --endpoint-url=http://localhost:4566 s3 mb s3://namidb-dev

export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test
```

```python
client = tg.Client(
    "s3://namidb-dev?ns=local"
    "&endpoint=http://localhost:4566"
    "&allow_http=true"
    "&region=us-east-1"
)
```

**Not for production.** LocalStack’s free tier doesn’t implement every S3 nuance. Treat it as a development/test convenience. **Use MinIO for self-hosted production.**

## See also

* [Self-host with Docker Compose](/en/operations/self-host-docker-compose)
* [URI grammar](/en/operations/uri-grammar)

# Tuning

> Knobs for cache budgets, factorization, flush cadence, and the cost-based optimizer.


Most workloads run well on defaults. Reach for these knobs when you’ve measured a specific bottleneck.

## When latency is the problem

1. **Profile per-stage**:

   ```bash
   export NAMIDB_PROFILE_DUMP=1
   ```

   This pinpoints whether time is in parse, lower, optimise, or execute.
2. **`EXPLAIN VERBOSE`** to see the plan and selectivity estimates.
3. If execute dominates and the plan looks reasonable, **try factorization**:

   ```bash
   export NAMIDB_FACTORIZE=1
   ```

   Big win on path-heavy queries (IC09, IC11) where intermediate results explode under non-factorized evaluation.

## When memory is the problem

Cap the caches:

```bash
export NAMIDB_ADJACENCY_BUDGET_MB=256
export NAMIDB_NODE_CACHE_BUDGET_MB=256
export NAMIDB_SST_CACHE_BUDGET_MB=512
```

All three are eviction-bounded — never exceed the budget.

Or disable one entirely:

```bash
export NAMIDB_ADJACENCY=off
export NAMIDB_NODE_CACHE=off
```

For the embedded use case inside small containers / Lambdas, a common profile is “small NodeCache, no SstCache, full AdjacencyCache”.

## When write throughput is the problem

* **Bulk-stage** instead of per-row `CREATE` — use `merge_nodes` / `merge_edges` from Python, or `upsert_node` / `upsert_edge` from Rust. These amortise a single `commit_batch()` over thousands of rows.
* **Increase flush interval** on `namidb-server` if you’re doing burst-write workloads:

  ```bash
  --flush-interval 5m
  ```

  Larger memtables → fewer L0 SSTs → less compaction work.
* **Shard by namespace.** Each namespace has one writer. Two unrelated workloads on two namespaces double your write throughput.

## When cold-read latency is the problem

This is dominated by SST fetch from the bucket. Options:

* **Move the daemon closer to the bucket.** Same region, same VPC.
* **Pre-warm** with a one-time query that touches the working set — subsequent queries hit the SST cache.
* **Bump `NAMIDB_SST_CACHE_BUDGET_MB`** so the working set fits.
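
The budget advice above can be illustrated with a toy byte-budgeted cache. This is a sketch under stated assumptions, not NamiDB's cache implementation: `ByteBudgetCache`, `fetch_block`, and `scan_twice` are hypothetical names, and least-recently-used eviction is assumed purely for illustration (the text only says the caches are eviction-bounded). It shows that when the working set exceeds the budget, a repeated scan keeps missing, while a budget that fits turns the second pass into hits.

```python
from collections import OrderedDict


class ByteBudgetCache:
    """Toy cache that never holds more than `budget_bytes` of values."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # key -> bytes, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, key, fetch):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = fetch(key)  # stands in for an SST fetch from the bucket
        if len(value) <= self.budget_bytes:
            self.entries[key] = value
            self.used_bytes += len(value)
            # Evict least recently used entries until we are within budget.
            while self.used_bytes > self.budget_bytes:
                _, evicted = self.entries.popitem(last=False)
                self.used_bytes -= len(evicted)
        return value


def fetch_block(key):
    return bytes(100)  # every fake block is 100 bytes


def scan_twice(budget_bytes, block_count=5):
    """Scan the same blocks twice and return (hits, misses)."""
    cache = ByteBudgetCache(budget_bytes)
    for _ in range(2):
        for block in range(block_count):
            cache.get(block, fetch_block)
    return cache.hits, cache.misses


print(scan_twice(budget_bytes=300))  # (0, 10): the 500-byte working set does not fit
print(scan_twice(budget_bytes=500))  # (5, 5): the second pass is served from cache
```

The first case is the worst one for a sequential scan under LRU: each block is evicted just before it is needed again, so a pre-warm query buys nothing until the budget covers the working set.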
## See also

* [Configuration](/en/operations/configuration)
* [Caches](/en/concepts/caches)
* [RFC-017 — Factorization](/en/internals/rfcs/017-factorization)
* [RFC-010 — Cost-based optimizer](/en/internals/rfcs/010-cost-based-optimizer)

# URI grammar

> Every storage URI scheme NamiDB understands, with query-string flags.

The URI is how every NamiDB client — Python, Rust, CLI, `namidb-server` — addresses a namespace on a backend. **The URI carries both the bucket/path and the namespace.**

## Schemes

| Scheme | Backend |
| --- | --- |
| `memory://` | In-process, ephemeral — testing only |
| `file:///abs/dir?ns=` | Local filesystem with CAS via `flock` + atomic rename |
| `s3://[/]?ns=` | AWS S3, Cloudflare R2, MinIO, Tigris, LocalStack |
| `gs://[/]?ns=` | Google Cloud Storage |
| `az:///[/]?ns=` | Azure Blob Storage |

## Query-string flags

Common across S3-compatible backends:

| Flag | Purpose |
| --- | --- |
| `?ns=` | **Required.** Names the namespace inside the bucket/prefix. |
| `&region=` | Region. For R2 use `auto`. |
| `&endpoint=` | Override the endpoint (R2, MinIO, LocalStack, Tigris, S3-compatible). |
| `&allow_http=true` | Permit `http://` endpoints (LocalStack, local MinIO). |

GCS-specific:

| Flag | Purpose |
| --- | --- |
| `&service_account=` | Override `GOOGLE_APPLICATION_CREDENTIALS`. |

Azure-specific:

| Flag | Purpose |
| --- | --- |
| `&use_emulator=true` | Talk to Azurite (the local emulator). |

## Examples

### AWS S3

```text
s3://my-bucket/data?ns=prod&region=us-west-2
```

### Cloudflare R2

```text
s3://my-bucket?ns=prod
  &endpoint=https://.r2.cloudflarestorage.com
  &region=auto
```

### Google Cloud Storage

```text
gs://my-bucket/data?ns=prod
```

```text
gs://my-bucket?ns=prod&service_account=/etc/gcs-key.json
```

### Azure Blob

```text
az://myacct/mycontainer?ns=prod
```

```text
az://devstoreaccount1/mycontainer?ns=test&use_emulator=true
```

### MinIO (local)

```text
s3://namidb?ns=dev
  &endpoint=http://127.0.0.1:9000
  &region=us-east-1
  &allow_http=true
```

### LocalStack (S3 mock for tests)

```text
s3://namidb-dev?ns=local
  &endpoint=http://localhost:4566
  &allow_http=true
  &region=us-east-1
```

### Local filesystem

```text
file:///var/lib/namidb?ns=prod
file://./data?ns=dev
```

### Ephemeral memory

```text
memory://acme
```

## See also

* [Configuration](/en/operations/configuration)
* [AWS S3](/en/operations/storage/aws-s3) · [Cloudflare R2](/en/operations/storage/cloudflare-r2) · [GCS](/en/operations/storage/gcs) · [Azure](/en/operations/storage/azure)
* [MinIO / Tigris / LocalStack](/en/operations/storage/minio-tigris-localstack)
* [Local filesystem](/en/operations/storage/local-filesystem)

# CLI

> namidb parse / explain / run — ad-hoc query work against any backend.

The `namidb` command-line tool wraps the engine for **ad-hoc query work**: parse, explain, run — against any supported storage backend.

## Install

From source:

```bash
git clone https://github.com/namidb/namidb.git
cd namidb
cargo install --path crates/namidb-cli
```

The resulting `namidb` binary needs no daemon. Without `--store`, it spins up an ephemeral in-memory namespace for one-shot work. With `--store `, it opens a durable namespace on any supported backend.

## Subcommands

### `namidb run`

Run a Cypher query.
```bash
# Ephemeral, in-memory
namidb run "CREATE (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})"
namidb run "MATCH (p:Person) RETURN p.name"

# Persistent — any URI scheme is accepted
namidb run --store "file:///var/lib/namidb?ns=prod" \
  "CREATE (a:Person {name: 'Alice'})"

namidb run --store "s3://my-bucket?ns=prod&region=us-west-2" \
  "MATCH (p:Person) RETURN count(*) AS n"

namidb run --store "gs://my-bucket?ns=prod" \
  "MATCH (p:Person) RETURN count(*) AS n"

namidb run --store "az://acct/container?ns=prod" \
  "MATCH (p:Person) RETURN count(*) AS n"
```

### `namidb explain`

Show the optimised logical plan with cost / selectivity annotations.

```bash
namidb explain \
  "MATCH (a:Person)-[:KNOWS]->(b) RETURN b LIMIT 20"

# Full physical-operator tree
namidb explain --verbose \
  "MATCH (a:Person)-[:KNOWS]->(b) RETURN b ORDER BY b.id LIMIT 20"
```

### `namidb parse`

Show the canonical form of a query (lexer + parser round-trip). Useful for debugging grammar surprises.

```bash
namidb parse \
  "MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN b.name LIMIT 5"
```

## URI grammar

The full URI grammar — including endpoint overrides for R2 / MinIO / LocalStack, GCS service-account paths, and Azure emulator mode — is documented at [Operations / URI grammar](/en/operations/uri-grammar).

## Output

Default output is a human-readable table. JSON mode is on the roadmap.

```text
$ namidb run "MATCH (p:Person) RETURN p.name AS name, p.age AS age LIMIT 3"
┌────────┬─────┐
│ name   │ age │
├────────┼─────┤
│ Alice  │ 30  │
│ Bob    │ 25  │
│ Carol  │ 41  │
└────────┴─────┘
3 rows · 12 ms
```

## See also

* [Cypher reference](/en/cypher/supported-subset)
* [URI grammar](/en/operations/uri-grammar)
* [Quickstart](/en/get-started/quickstart)

# HTTP API (namidb-server)

> REST endpoints exposed by namidb-server, with auth, JSON ↔ Cypher mapping, and curl examples.

`namidb-server` is a single Rust binary (also published as a Docker image) that opens a namespace and exposes it over HTTP.
Same engine as the embedded library — the only thing this binary adds is the HTTP boundary, bearer-token auth, and a periodic flush loop.

## Install

* Cargo

  ```bash
  cargo install --path crates/namidb-server
  ```

* Docker

  ```bash
  docker build -t namidb-server:0.3 \
    -f crates/namidb-server/Dockerfile .
  ```

## Run

```bash
namidb-server \
  --store 's3://my-bucket?ns=prod&region=us-east-1' \
  --listen 0.0.0.0:8080 \
  --auth-token "$NAMIDB_AUTH_TOKEN" \
  --flush-interval 30s
```

Every flag is also an env var: `NAMIDB_STORE`, `NAMIDB_LISTEN`, `NAMIDB_AUTH_TOKEN`, `NAMIDB_FLUSH_INTERVAL`.

**No token, no production.** If `--auth-token` is unset, the server boots in **unauthenticated** mode and prints a loud warning. Do not expose that port to the public internet.

## Endpoints (v0)

| Method | Path | Auth | Description |
| --- | --- | --- | --- |
| `GET` | `/v0/health` | public | Liveness + manifest version + epoch |
| `GET` | `/v0/version` | public | Server build version |
| `POST` | `/v0/cypher` | bearer | Run a Cypher query (read or write) |
| `POST` | `/v0/admin/flush` | bearer | Force memtable → L0 SST flush |

### `POST /v0/cypher`

**Request**

```json
{
  "query": "MATCH (p:Person) WHERE p.age >= $min RETURN p.name AS name",
  "params": {"min": 18}
}
```

**Response — read**

```json
{
  "columns": ["name"],
  "rows": [{"name": "Alice"}, {"name": "Bob"}]
}
```

**Response — write**

```json
{
  "columns": ["a"],
  "rows": [{"a": {"_kind": "node", "id": "…", "label": "Person", "properties": {}}}],
  "write_outcome": {
    "nodes_created": 1,
    "edges_created": 0,
    "nodes_deleted": 0,
    "edges_deleted": 0,
    "properties_set": 0
  }
}
```

### curl round-trip

```bash
TOKEN=$(openssl rand -hex 32)
namidb-server --store memory://demo --listen 127.0.0.1:8080 --auth-token "$TOKEN" &

curl -s http://127.0.0.1:8080/v0/health | jq .

curl -s -X POST http://127.0.0.1:8080/v0/cypher \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"query": "CREATE (a:Person {name: \"Alice\", age: 30}) RETURN a.name AS name"}' \
  | jq .

curl -s -X POST http://127.0.0.1:8080/v0/cypher \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"query": "MATCH (p:Person) RETURN p.name AS name, p.age AS age"}' \
  | jq .
```

## Type mapping (JSON ↔ Cypher)

| Cypher `RuntimeValue` | JSON |
| --- | --- |
| `Null` | `null` |
| `Bool` | `true` / `false` |
| `Integer` | number (i64) |
| `Float` | number (f64) |
| `String` | string |
| `Bytes` | base64 string |
| `Vector(f32)` | array of numbers |
| `List` | array |
| `Map` | object |
| `Date` | ISO-8601 date string |
| `DateTime` (UTC µs) | RFC-3339 timestamp string |
| `Node` | `{"_kind": "node", "id", "label", "properties"}` |
| `Rel` | `{"_kind": "rel", "edge_type", "src", "dst", "properties"}` |
| `Path` | array of alternating node/rel objects |

## Concurrency model

`namidb-server` opens one `WriterSession` per process and serialises every request behind a tokio `Mutex`. This is the single-writer-per-namespace invariant from [RFC-001](/en/internals/rfcs/001-storage-engine) lifted up to the request layer: at most one Cypher statement is in flight against the namespace at a time. Read latency stays predictable; throughput is bounded by the slowest mutator.

If you need horizontal read scale today, point multiple `namidb-server` processes at the same `--store` URI: each can serve reads off the same manifest version. Only one will be allowed to commit writes (the rest get fenced via epoch CAS). Concurrent read fan-out without holding the writer mutex is [RFC-021](https://github.com/namidb/namidb/blob/main/docs/rfc/) work.

## Periodic flush

`--flush-interval` (default `30s`) controls how often the background task converts the memtable to L0 SSTs.
Set it to `0s` to disable the loop and call `POST /v0/admin/flush` from cron / a sidecar instead.

## Roadmap

* `/v0/cypher/stream` — NDJSON streaming response for large result sets.
* `/v0/cypher/arrow` — Arrow IPC body for zero-copy DataFrame ingestion.
* `/v0/metrics` — Prometheus exposition.
* Bolt protocol compatibility — drivers in every language that already speaks Neo4j today.

## See also

* [Self-host with Docker Compose](/en/operations/self-host-docker-compose)
* [Configuration](/en/operations/configuration)

# Python SDK

> pip install namidb. Sync + async. Arrow / pandas / polars output. Backed by Rust via pyo3.

`namidb` is the Python wrapper around the NamiDB storage + query engine. Backed by Rust via [`pyo3`](https://pyo3.rs/) and built with [`maturin`](https://www.maturin.rs/).

## Install

```bash
pip install namidb              # core
pip install 'namidb[pandas]'    # + DataFrame interop
pip install 'namidb[polars]'    # + Polars interop
```

Pre-built abi3 wheels for Python ≥ 3.9 on Linux (x86\_64 + aarch64), macOS (arm64), and Windows (x86\_64). Intel macOS falls back to sdist. `pyarrow >= 14` is a hard transitive dependency.

## Open a namespace

```python
import namidb as tg

client = tg.Client("s3://my-bucket?ns=prod&region=us-east-1")
```

All five URI schemes are supported: `memory://`, `file://`, `s3://`, `gs://`, `az://`.
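
For a sense of how such a URI decomposes, here is a small standard-library sketch. It is not NamiDB's parser: `split_store_uri` is a hypothetical helper, and it only separates the pieces (scheme, bucket or account, path, and query-string flags) without validating them.

```python
from urllib.parse import parse_qs, urlsplit


def split_store_uri(uri):
    """Split a storage URI into scheme, authority, path, and flags."""
    parts = urlsplit(uri)
    # parse_qs returns a list per flag; keep the last value for each one.
    flags = {name: values[-1] for name, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # bucket for s3/gs, account for az
        "path": parts.path,
        "namespace": flags.get("ns"),  # None when the URI has no ?ns=
        "flags": flags,
    }


info = split_store_uri("s3://my-bucket?ns=prod&region=us-east-1")
print(info["scheme"], info["authority"], info["namespace"], info["flags"]["region"])
# s3 my-bucket prod us-east-1
```

The same helper returns an empty authority and a path of `/var/lib/namidb` for `file:///var/lib/namidb?ns=prod`, which is why the `file://` examples use three slashes for absolute paths.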
## Cypher

`Client.cypher(query, params=None)` runs a query and returns a `QueryResult`:

```python
client.cypher("CREATE (a:Person {name: 'Alice', age: 30})")
client.cypher("CREATE (a:Person {name: 'Bob', age: 25})")

result = client.cypher(
    "MATCH (p:Person) WHERE p.age > $min RETURN p.name AS name, p.age AS age",
    params={"min": 26},
)

print(result.columns)   # ['name', 'age']
print(len(result))      # 1
print(result.first())   # {'name': 'Alice', 'age': 30}

for row in result.rows():
    print(row)
```

## Async API

The same surface is available as a coroutine via `Client.acypher`:

```python
import asyncio

import namidb as tg


async def main() -> None:
    client = tg.Client("memory://acme")
    await client.acypher("CREATE (p:Person {name: 'Alice'})")
    result = await client.acypher(
        "MATCH (p:Person {name: $name}) RETURN p.name AS name",
        params={"name": "Alice"},
    )
    print(result.rows())


asyncio.run(main())
```

Driven by the `pyo3-async-runtimes` tokio bridge — every call runs on the same multi-threaded tokio runtime that backs the synchronous API. **Mixing sync + async from the same `Client` is safe.**

## Type mapping (Cypher ↔ Python)

| Cypher `RuntimeValue` | Python type |
| --- | --- |
| `Null` | `None` |
| `Bool` | `bool` |
| `Integer` | `int` |
| `Float` | `float` |
| `String` | `str` |
| `Bytes` | `bytes` |
| `Vector(Vec)` | `list[float]` |
| `List` | `list` |
| `Map` | `dict[str, ...]` |
| `Date` | `datetime.date` |
| `DateTime` (UTC µs) | `datetime.datetime` UTC |
| `Node` | `{"_kind": "node", "id", "label", "properties"}` |
| `Rel` | `{"_kind": "rel", "edge_type", "src", "dst", "properties"}` |
| `Path` | `list[Node\|Rel]` alternating |

`bool` is intentionally checked before `int` so `True` / `False` don’t round-trip as `Integer(1)` / `Integer(0)`.
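
The ordering matters because `bool` is a subclass of `int` in Python, so an `isinstance(value, int)` test also matches `True` and `False`. A minimal sketch of the idea, using a hypothetical `classify` helper rather than the binding's real conversion code:

```python
def classify(value):
    """Return the name of the runtime type a Python scalar would map to."""
    if value is None:
        return "Null"
    # bool must be tested before int: isinstance(True, int) is True in Python.
    if isinstance(value, bool):
        return "Bool"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Float"
    if isinstance(value, str):
        return "String"
    if isinstance(value, bytes):
        return "Bytes"
    raise TypeError(f"unsupported type: {type(value).__name__}")


print(isinstance(True, int))        # True
print(classify(True), classify(1))  # Bool Integer
```

Swapping the two `isinstance` checks would make `classify(True)` return `"Integer"`, which is exactly the round-trip surprise the note describes.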
## Bulk inserts

For thousands of rows, prefer `merge_nodes` / `merge_edges`:

```python
import uuid

import namidb as tg

client = tg.Client("memory://acme")

client.merge_nodes(
    "Person",
    [{"id": str(uuid.uuid4()), "name": f"p{i}", "age": 20 + i} for i in range(10_000)],
)

client.merge_edges(
    "KNOWS",
    [
        {"src": "uuid-a", "dst": "uuid-b", "since": 2020},
        {"src": "uuid-b", "dst": "uuid-c", "since": 2021},
    ],
)

client.commit()   # WAL + manifest CAS
client.flush()    # memtable -> L0 SSTs
```

## Arrow / pandas / polars output

```python
result = client.cypher(
    "MATCH (p:Person) RETURN p.name AS name, p.age AS age ORDER BY p.age DESC"
)

table = result.to_arrow()     # pyarrow.Table
df = result.to_pandas()       # pandas.DataFrame (needs pandas extra)
pl_df = result.to_polars()    # polars.DataFrame (needs polars extra)
```

Column order follows the `RETURN` projection from the parsed plan, so `RETURN p.name AS name, p.age AS age` always yields columns `["name", "age"]` even when zero rows match.

Label-wide scans without the Cypher round-trip:

```python
table = client.scan_label_arrow("Person")
```

## Cache stats

```python
print(client.cache_stats())
# {"adjacency": {"hits": ..., "misses": ..., "bytes": ...}, ...}
```

## Storage backends

See [Operations / Storage backends](/en/operations/storage/aws-s3) for the full credential matrix.

## Build from source

```bash
pip install maturin
git clone https://github.com/namidb/namidb.git
cd namidb/crates/namidb-py
maturin develop --release --extras test
```

# Rust SDK (embedded)

> Use NamiDB as a Rust library — the same engine as the server, with zero network overhead.

The `namidb` crate is the **public façade** of the engine: it re-exports the stable surface of `namidb-core`, `namidb-storage`, `namidb-graph`, and `namidb-query` so downstream code only needs one line.

## Add it

`Cargo.toml`:

```toml
[dependencies]
namidb = "0.3"
tokio = { version = "1", features = ["full"] }
anyhow = "1"
```

MSRV is **Rust 1.85**.
## Open a namespace

```rust
use namidb::storage::{parse_uri, WriterSession};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let (store, paths) = parse_uri("s3://my-bucket?ns=prod&region=us-east-1")?;
    let mut writer = WriterSession::open(store, paths).await?;
    // ... upserts, commit_batch, snapshot reads ...
    Ok(())
}
```

Any URI scheme works: `memory://`, `file://`, `s3://`, `gs://`, `az://`.

## Run Cypher

```rust
use namidb_query::{execute, lower, parse, Params};

let query = parse(
    "MATCH (a:Person) WHERE a.age > $min RETURN a.name AS name"
)?;
let plan = lower(&query)?;

let snap = writer.snapshot();
let params = Params::from_iter([("min", 18.into())]);
let rows = execute(&plan, &snap, &params).await?;
println!("{rows:?}");
```

## Direct write API

For the highest-throughput ingestion path (bypassing the parser):

```rust
use namidb_core::id::{NodeId, EdgeType, Label};
use namidb_core::value::Value;

let alice = NodeId::generate_v7();
let bob = NodeId::generate_v7();

writer.upsert_node(Label::new("Person"), alice, [
    ("name", Value::String("Alice".into())),
    ("age", Value::Integer(30)),
].into())?;

writer.upsert_node(Label::new("Person"), bob, [
    ("name", Value::String("Bob".into())),
].into())?;

writer.upsert_edge(
    EdgeType::new("KNOWS"),
    alice,
    bob,
    [("since", Value::Integer(2020))].into(),
)?;

writer.commit_batch().await?;   // WAL append + manifest CAS
writer.flush().await?;          // memtable -> L0 SSTs
```

## Snapshots

```rust
let snap_a = writer.snapshot();
// ... writes happen, manifest advances ...
let snap_b = writer.snapshot();
// snap_a still references the older manifest; both are valid.
```

Snapshots are cheap: they pin a manifest version and the SSTs it references. The GC won’t reclaim SSTs while any snapshot holds them.
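
The pinning rule can be sketched with plain sets (shown in Python for brevity). This is an illustration under assumptions, not the engine's GC: `Manifest`, `reclaimable_ssts`, and the SST names are hypothetical, and the sketch assumes an SST is reclaimable only when neither the current manifest nor any live snapshot references it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    version: int
    ssts: frozenset  # names of the SST objects this version references


def reclaimable_ssts(all_ssts, current, snapshots):
    """Return the SSTs that no live manifest version still references."""
    pinned = set(current.ssts)
    for snapshot in snapshots:
        pinned |= snapshot.ssts
    return set(all_ssts) - pinned


v1 = Manifest(1, frozenset({"a.sst", "b.sst"}))
v2 = Manifest(2, frozenset({"c.sst"}))  # e.g. compaction rewrote a + b into c
all_ssts = {"a.sst", "b.sst", "c.sst"}

snap_a = v1  # an older snapshot still pins version 1
print(sorted(reclaimable_ssts(all_ssts, current=v2, snapshots=[snap_a])))  # []
print(sorted(reclaimable_ssts(all_ssts, current=v2, snapshots=[])))  # ['a.sst', 'b.sst']
```

Once the old snapshot is dropped, the two superseded SSTs become eligible for reclamation, which is why long-lived snapshots can hold storage that compaction has already replaced.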
## Cache configuration

| Env var | Default | Cache |
| --- | --- | --- |
| `NAMIDB_ADJACENCY` | ON | CSR adjacency ([RFC-018](/en/internals/rfcs/018-csr-adjacency)) |
| `NAMIDB_NODE_CACHE` | ON | `NodeView` lookups ([RFC-019](/en/internals/rfcs/019-node-view-cache-shared)) |
| `NAMIDB_SST_CACHE` | ON | SST body + edge property streams ([RFC-020](/en/internals/rfcs/020-edge-sst-caches)) |
| `NAMIDB_FACTORIZE` | OFF | Factorized executor ([RFC-017](/en/internals/rfcs/017-factorization)) |

## Crate layout

| Crate | Surface |
| --- | --- |
| `namidb` | Public façade (re-exports the stable bits of the others) |
| `namidb-core` | `NamespaceId`, `NodeId`, `EdgeId`, `Lsn`, `Value`, `Schema`, errors |
| `namidb-storage` | LSM, WAL, manifest, SST, memtable, URI parser, `file://` CAS |
| `namidb-graph` | Property columns + CSR adjacency |
| `namidb-query` | Cypher / GQL parser, optimizer, executor |

For day-to-day use, depend on `namidb` only. Reach for the sub-crates when you’re embedding deep (e.g. writing custom optimizer rules).

## See also

* [Quickstart](/en/get-started/quickstart) · [Your graph in S3](/en/get-started/your-graph-in-s3)
* [Cypher reference](/en/cypher/supported-subset)
* [Operations / Configuration](/en/operations/configuration)