Loxation Team
Porting 200K Lines of C++ to Rust with Claude, Part 2: The Query Optimization Loop
This is Part 2 of a two-part series. Part 1 described the systematic workflow for porting kuzu's C++ graph database to Rust, the pluggable backend architecture, and the initial benchmark results. This part follows the optimization process in real time.
Where We Left Off
In Part 1, we established the baseline: rukuzu (the Rust reimplementation) was 105× faster than kuzu (the C++ original) on database writes, but 5.7× slower on reads. The diagnosis was clear — rukuzu's query engine was young and unoptimized, while kuzu had years of query planning refinements. We set out to close that gap using the same Claude Code custom command workflow that guided the initial port.
This article documents what happened next. Not the polished retrospective you'd write after the fact, but the live process — benchmarks feeding back into optimization decisions, with Claude as an active development partner following the porting workflow at each step.
First Optimization Pass: Establishing the Feedback Loop
The first thing we did was re-run the full benchmark suite after an initial round of rukuzu updates. The results:
Parse: 191 µs (rukuzu) vs 194 µs (kuzu) — identical, as expected
Classify: 383 µs vs 383 µs — identical
Export: ~135 µs (rukuzu) vs ~14.6 ms (kuzu) — rukuzu ~108× faster
Query: 1.65 ms (rukuzu) vs 331 µs (kuzu) — kuzu ~5× faster
The numbers are essentially unchanged from the Part 1 baseline. The export ratio held steady at ~108× (previously ~105×). The query gap held at ~5× (previously ~5.7×). Whatever changes had been made to rukuzu in the interim hadn't regressed or meaningfully improved the benchmarked paths.
This is actually the right outcome for the first pass. It confirms two things: the benchmark is stable and reproducible (critical for trusting future comparisons), and the low-hanging fruit isn't in the code paths we've touched so far. The query bottleneck is deeper — in the query planner, the storage access patterns, or the result materialization. Finding it requires the disciplined approach the custom command enforces.
Feeding Data Back Into the Loop
Here's where the workflow earns its keep. A developer without structure would see "query is 5× slower" and start guessing: maybe it's the hash joins, maybe it's the string comparisons, maybe it's allocation overhead. You'd pick one, optimize it, re-benchmark, and if the numbers don't move, pick another. This is the trial-and-error approach, and it works eventually, but it's slow and frustrating.
The custom command's Step 2 — "Research Both Implementations" — forces a different approach. Before touching any code, Claude reads kuzu's C++ query execution path for the specific queries DEALER runs: taxonomy lookups like MATCH (a:Concept)-[:DirectSubClassOf]->(b:Concept) WHERE a.iri = $iri RETURN b.iri. Then it reads rukuzu's implementation of the same query path. The comparison reveals not just what's different, but why kuzu is faster.
The benchmark data itself is part of the prompt. Claude sees:
Export: 135 µs (rukuzu) vs 14.6 ms (kuzu). Explanation: no FFI overhead. This is structural and won't change.
Query: 1.65 ms (rukuzu) vs 331 µs (kuzu). Explanation: TBD — this is what we're investigating.
Parse and classify: identical. Confirmation that the trait abstraction has zero overhead.
The 1.65 ms vs 331 µs gap is the signal. 1.65 ms is 5× slower, but it's also only 1.3 ms of absolute difference. On a query that touches a small taxonomy (the pizza ontology has 16 nodes), this suggests per-query fixed costs rather than scaling problems. If the gap were in traversal itself, it would be proportional to graph size. A fixed 1.3 ms overhead points to query parsing, plan compilation, or result materialization.
The Anatomy of a 5× Gap
To understand where 1.3 ms goes, consider what happens when DEALER issues a query against each backend.
With kuzu (C++), the query follows a mature pipeline: the Cypher string is parsed into an AST, the planner generates an optimized physical plan using pre-computed statistics, the executor runs the plan against columnar storage with vectorized scans, and the results come back as typed tuples. Years of optimization have shaved microseconds off each stage.
With rukuzu (Rust), the same Cypher string goes through a younger pipeline. But "younger" doesn't mean "uniformly slower" — it means specific code paths have specific problems. The custom command says: "Go back to the C++ reference implementation. Read how it handles the case." So we handed Claude the benchmark data and the specific query pattern (a taxonomy lookup on the pizza ontology), and let the workflow play out.
What Claude found wasn't what we expected.
The Diagnosis: Two Compounding Bugs
We had hypothesized that the 5× gap would come from classic query optimization deficits — missing plan caching, poor index exploitation, or un-batched result materialization. The kind of thing you'd find in any young database engine.
Instead, Claude's investigation of rukuzu's storage layer revealed two concrete, compounding root causes that accounted for nearly all the overhead.
Root cause 1: Full table scans per edge. rukuzu's from_rel_tables() function, which resolves edge endpoints to node data, calls scan_all() for each edge. On the pizza ontology with ~200 edges and ~100 nodes, this means each edge resolution materializes ALL ~100 node rows as a Vec<Vec<Value>>, then indexes into that vector to find the one row it needs. That's 200 full table scans instead of 200 single-row reads — a 100× amplification of work. And it contained a latent correctness bug: scan_all() skips deleted rows, so get(offset) returns the wrong row when deletions exist. A performance problem hiding a data integrity problem.
Root cause 2: O(n) linear search for strings. rukuzu's string_data storage uses HashMap<usize, Vec<(u64, String)>> — a map from column index to a vector of (offset, value) pairs. Every string read does a linear scan of the vector to find the matching offset. For 100 concepts, each with an IRI string, every string read scans ~100 entries. Since the taxonomy query returns string IRIs for every result, this compounds with the per-edge scan to produce the observed overhead.
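In simplified form, the two anti-patterns compound like this (hypothetical, stripped-down types — not rukuzu's actual code, but the same shapes):

```rust
use std::collections::HashMap;

// Simplified stand-ins for rukuzu's storage types (names illustrative).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Int(i64),
}

struct NodeTable {
    rows: Vec<Vec<Value>>,
    // Anti-pattern 2: per-column Vec of (offset, value) pairs,
    // so every string read is an O(n) linear scan.
    string_data: HashMap<usize, Vec<(u64, String)>>,
}

impl NodeTable {
    // Anti-pattern 1: materialize the whole table just to read one row.
    fn scan_all(&self) -> Vec<Vec<Value>> {
        self.rows.clone() // ~100 rows cloned per call
    }

    fn read_string(&self, col: usize, offset: u64) -> Option<&String> {
        self.string_data
            .get(&col)?
            .iter() // linear walk over every (offset, value) pair
            .find(|(o, _)| *o == offset)
            .map(|(_, s)| s)
    }
}

// Resolving ~200 edges this way touches 200 x 100 = 20,000 rows.
fn resolve_edges(table: &NodeTable, edge_offsets: &[u64]) -> Vec<Vec<Value>> {
    edge_offsets
        .iter()
        .map(|&off| table.scan_all()[off as usize].clone()) // full scan per edge
        .collect()
}
```

The point of the sketch is the shape of the cost: `resolve_edges` does a full clone of the table for every single edge, and `read_string` pays a scan proportional to table size on every lookup.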
Neither of these would have been found by studying kuzu's C++ query planner. They're not query optimization problems — they're storage access pattern problems in a layer below the query engine. But the custom command's "research both implementations" discipline found them anyway, because Claude reads the full call chain, not just the layer you'd expect the problem to live in.
The Fixes
Following the command's phase template, each fix was implemented as an independent, testable step.
Step 1: Change string_data from Vec to HashMap. Replace HashMap<usize, Vec<(u64, String)>> with HashMap<usize, HashMap<u64, String>>. This turns every string read from O(n) linear scan to O(1) hash lookup. One file changed (crates/rukuzu-storage/src/node_table.rs), one data structure replaced.
Step 2: Replace scan_all() with direct row access. Instead of materializing the entire node table for each edge, implement a get_row(offset) method that reads a single row by offset. This eliminates the 100× amplification and fixes the latent deletion bug simultaneously — direct offset access doesn't skip deleted rows, so the indexing is correct regardless of deletion state.
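The fixed access pattern, in the same simplified terms (again hypothetical types, sketching the two changes rather than reproducing rukuzu's code):

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Value {
    Int(i64),
}

struct NodeTable {
    rows: Vec<Option<Vec<Value>>>, // None marks a deleted row
    // Fix 1: inner HashMap keyed by offset -> O(1) string reads.
    string_data: HashMap<usize, HashMap<u64, String>>,
}

impl NodeTable {
    // Fix 2: read a single row by offset instead of scanning the table.
    // Offsets stay stable across deletions, so this is also correct
    // where scan_all().get(offset) silently returned the wrong row.
    fn get_row(&self, offset: u64) -> Option<&Vec<Value>> {
        self.rows.get(offset as usize)?.as_ref()
    }

    fn read_string(&self, col: usize, offset: u64) -> Option<&String> {
        self.string_data.get(&col)?.get(&offset) // O(1) hash lookup
    }
}
```

Note how the deletion bug disappears as a side effect: a deleted row is a `None` slot at a stable offset, so indexing never shifts.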
After each step: cargo test --workspace (all tests pass), cargo clippy --workspace (zero warnings), benchmark suite (measure improvement). The numbers either move or they don't.
The Baseline: Confirming Stability
Before the fixes landed, we re-ran the benchmark suite to confirm stability. The numbers held: export at ~135 µs (108× faster), query at 1.65 ms (5× slower). Within noise of the Part 1 baseline. This matters because it means any future change that moves the numbers is real, not an artifact of measurement variance.
For anyone building a benchmark harness for optimization work: invest in stability before speed. Run enough iterations (our export benchmark runs 40,000 iterations), use criterion.rs or equivalent for statistical rigor, and verify that back-to-back runs produce numbers within 10% of each other. Without this, you're optimizing against noise.
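A minimal sketch of the back-to-back stability check (illustrative, not the project's actual harness, which uses criterion.rs):

```rust
use std::time::Instant;

/// Time `iters` runs of `f` and return the mean in nanoseconds.
fn mean_ns<F: FnMut()>(iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

/// Two back-to-back runs should agree within `tolerance`
/// (e.g. 0.10 for the 10% threshold mentioned above).
fn is_stable<F: FnMut() + Copy>(iters: u32, f: F, tolerance: f64) -> bool {
    let a = mean_ns(iters, f);
    let b = mean_ns(iters, f);
    (a - b).abs() / a.max(b) <= tolerance
}
```

criterion.rs does this properly — warmup, outlier rejection, confidence intervals — but even a crude check like this catches a harness that is too noisy to optimize against.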
The Result: 29× Speedup, Clean Sweep
Then the fixes landed. We ran the benchmark, and the query row transformed: rukuzu db_query went from 1.65 ms to 56 µs — a 29× speedup. It didn't just close the gap with kuzu. It blew past it. rukuzu is now 6× faster than kuzu's mature C++ query engine on the same workload.
The write/read tradeoff story from Part 1 is gone. There is no tradeoff. rukuzu now wins on both reads and writes. A clean sweep.
What Just Happened
Let's trace the full arc of these numbers:
The export numbers barely moved — they were already dominated by the FFI-vs-native advantage, and nothing we changed touches the export path. This is expected and reassuring: the fixes were surgical, targeting only the query path.
The query numbers tell the story. From 1.65 ms to 56 µs is a 29× improvement. The two root causes Claude identified — the full table scan per edge and the O(n) linear string search — were not just contributing to the 5× gap. They were almost the entire gap. Fixing them didn't just bring rukuzu to parity; it revealed that rukuzu's query architecture, once the storage access patterns were correct, is fundamentally faster than kuzu's for this workload.
Why? The same reason rukuzu wins on writes: no FFI boundary. kuzu's query result still crosses from C++ to Rust through the foreign function interface. rukuzu's results are native Rust objects. For a query that returns a small number of string IRIs, the FFI marshaling overhead is a significant fraction of the total query time. At 337 µs, kuzu is spending meaningful time converting C++ strings to Rust strings across the language boundary. At 56 µs, rukuzu isn't spending any time on that at all.
What This Teaches About AI-Assisted Optimization
The diagnosis deserves scrutiny, because it illustrates something important about how AI assistants work in this context.
We had predicted the fixes would close the gap to "possibly under 2×." The actual result was 6× in the other direction. Our prediction was wrong — not because the diagnosis was wrong (it was right), but because we underestimated the compounding effect. The full table scan per edge wasn't just slow; it was doing 100× the work. The linear string search wasn't just suboptimal; it was O(n) where O(1) was trivial. Fixing both simultaneously didn't produce an additive improvement — it produced a multiplicative one, because both were in the same hot path.
A human developer looking at the 5× gap would likely have started with the query engine — query planner, execution strategy, result batching. The textbook optimization targets. You might spend days profiling the query pipeline before realizing the bottleneck was two layers down, in the storage access pattern.
Claude, following the custom command's "research both implementations" discipline, read the entire call chain from Cypher query to disk access. It didn't have preconceptions about where the bottleneck "should" be. It read from_rel_tables(), noticed it called scan_all() in a loop, computed the amplification factor (200 edges × 100 rows = 20,000 row reads), and flagged it. Then it read string_data, noticed the Vec<(u64, String)> structure, recognized the linear search pattern, and flagged that too.
This is the "research both implementations" step earning its keep. Not because Claude is a better optimizer than a human — a senior database engineer would have found these eventually — but because Claude follows the protocol without shortcuts. It doesn't skip the storage layer because "the problem is probably in the query engine." It reads everything the protocol tells it to read, and reports what it finds.
And critically: both fixes were simple. Changing Vec<(u64, String)> to HashMap<u64, String> is a one-line type change with straightforward call-site updates. Replacing scan_all() with direct row access is a small method addition. The hard part wasn't writing the code — it was finding the right code to write. The custom command's discipline of "research first, plan second, code third" meant the actual implementation was almost an afterthought.
The command's pitfalls section now has two new entries: "storage access patterns can dominate query performance even on small datasets" and "Vec-as-lookup-table is an O(n) trap — always use HashMap when access is by key." These join the existing pitfalls as hard-won knowledge encoded for future sessions.
The Bigger Picture: What the Small Ontology Predicted
The pizza ontology has 33 axioms and 16 taxonomy nodes. Both root causes — full table scans per edge and linear string search — were scaling costs, not fixed costs. They get worse as the graph grows.
This has a critical implication that the fix confirms. On SNOMED CT (350,000+ concepts, millions of edges), the pre-fix scan_all() pattern would have materialized 350,000 rows per edge resolution. The linear string search would have scanned 350,000 entries per read. The system wouldn't have been 5× slower — it would have been unusable.
The pizza benchmark was the canary. The 5× gap on 100 nodes was the warning that something algorithmically wrong existed in the storage access pattern. Fixing it on the small ontology means the large ontology will work. Skipping it would have hit a wall at scale that's much harder to diagnose when you're also dealing with memory pressure and cache effects.
With the O(n) patterns eliminated, we can now scale up with confidence. The remaining performance characteristics — 56 µs per query, 139 µs per export — are dominated by constant factors and should scale linearly with result size, not quadratically with table size.
Scaling Up: The GALEN Ontology at 2,700 Concepts
The pizza ontology proved the optimization was real. But 16 taxonomy nodes doesn't tell you how a system behaves at scale. The next step was GALEN — a widely-used medical ontology expressed in OWL 2 EL, with roughly 2,700 concepts covering clinical anatomy, pathology, and procedures. GALEN is one of the standard benchmarks for EL reasoners precisely because it exercises the patterns that matter in healthcare: deep role hierarchies, dense existential restrictions, and the kind of cascading inferences that stress-test saturation algorithms. It sits squarely between a toy example and the full SNOMED CT behemoth.
We ran the full benchmark suite against GALEN on both backends. The headline numbers: parse and classify effectively tied (classify at ~13.5 seconds on both), export with kuzu now ~3.6× faster, and query at 596 ms (rukuzu) versus 528 µs (kuzu).
These numbers tell a very different story than pizza.
Parse and classify are tied, as expected — these stages don't touch the database backend. The classify time of ~13.5 seconds is the cost of saturating 2,700 concepts under GALEN's dense axiom structure. It's worth noting that DEALER's classifier now supports fuzzy OWL 2 EL reasoning — degree-annotated subsumption with t-norm semantics (Zadeh min, Lukasiewicz sum) — and this fuzzy classifier has not yet been optimized. The 13.5-second classify time is running the new fuzzy saturation algorithm, which carries additional per-rule overhead for degree propagation and epsilon comparisons. There's headroom here that we haven't touched yet.
But look at the database stages. On pizza, rukuzu was 108× faster on export and 6× faster on query. On GALEN, both advantages have reversed. kuzu is now 3.6× faster on export and over 1,000× faster on query.
Read that last number again. 596 ms versus 528 µs. A three-orders-of-magnitude gap, in the opposite direction from what the pizza benchmark showed. The 29× speedup we celebrated on pizza doesn't just fail to scale — it inverts at scale.
This is the most important result in this article, and it's the result that makes the pluggable architecture indispensable. If we only had rukuzu, the GALEN query time of 596 ms would be a data point without context — slow, but how slow? The kuzu comparison at 528 µs tells us exactly how much performance is on the table. Three orders of magnitude.
What the GALEN Reversal Reveals
The reversal has a structural explanation. On pizza (16 nodes, ~200 edges), per-query fixed costs dominate. kuzu pays FFI overhead on every query — marshaling Cypher strings across the language boundary, converting C++ result tuples to Rust types. rukuzu pays none of that. At small scale, the FFI tax is the biggest cost, and rukuzu wins.
On GALEN (2,700 concepts, tens of thousands of edges), the fixed costs are irrelevant. What matters is how the query engine traverses a large graph, and kuzu has years of optimization for exactly this: columnar storage, vectorized scans, pre-computed statistics, and a mature query planner that chooses execution strategies based on cardinality estimates. rukuzu's query engine is young. It works — 596 ms is a correct answer — but it's doing far too much work to get there.
The export reversal from 108× faster (pizza) to 3.6× slower (GALEN) is similarly revealing. rukuzu's in-process export was fast on pizza because the overhead of small-batch FFI calls dominated kuzu's export path. At GALEN scale, the actual I/O work dominates, and kuzu's bulk loading path — optimized for exactly this pattern over years of C++ development — pulls ahead.
This is precisely the scenario the pluggable architecture was designed for. We don't have to guess whether rukuzu or kuzu will be faster on a given workload. We benchmark both and use the data. For interactive mobile queries on small local ontologies, rukuzu wins. For loading and querying clinical-scale ontologies, kuzu wins — decisively.
The Feedback Loop Continues: Diagnosing 1,128×
We fed the GALEN numbers back into the loop the same way we fed the pizza numbers: benchmark data as part of the prompt, then a pointed question. Not "optimize the query engine" — that's too vague. Instead: "the same memory parameters are used for both backends in the test harness? Are they using memory and storage differently? Because unless we can do much better on larger graphs, it's game over for rukuzu at clinical scale."
Claude's investigation of the memory architecture revealed the root cause in under three minutes. The answer wasn't a bug this time. It was an architectural gap.
kuzu's "in-memory mode" isn't what it sounds like. When kuzu initializes with SystemConfig::default(), it pre-allocates a buffer manager that reserves 80% of physical RAM as an mmap region — the benchmark showed a memory spike from 500 MB to 1.7 GB, then settling to 900 MB. That's not waste; it's a virtual memory reservation that lets kuzu map storage pages on demand, use CSR (Compressed Sparse Row) storage for graph edges, maintain hash indexes for primary key lookups, and process tuples in vectorized batches of 2,048.
rukuzu's "in-memory mode" means something entirely different: plain HashMap storage, row-at-a-time access, and eager materialization. When a scan operator initializes, it calls table.scan_all_with_txn(txn_id).collect() — materializing the entire table into a Vec — in its constructor. Before any query processing begins, every row is already in memory as a Rust object.
On pizza (16 nodes), this is fine. Collecting 16 rows into a Vec is free. On GALEN (2,700 concepts, tens of thousands of edges), this is the 1,128× gap. Every scan operator eagerly materializes every row of every table it touches, then the query logic filters down to the handful of rows it actually needs. kuzu's lazy, page-mapped scanning touches only the rows the query demands.
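The eager pattern, reduced to its essentials (hypothetical types — the real operator carries transactions and schema, but the `.collect()` in the constructor is the part that matters):

```rust
// Simplified sketch of the eager scan operator: the constructor
// materializes every row before a single tuple is processed.
struct EagerScan {
    rows: Vec<Vec<u64>>, // the entire table, cloned up front
    pos: usize,
}

impl EagerScan {
    fn new(table: &[Vec<u64>]) -> Self {
        EagerScan {
            // The eager collect: cheap for 16 rows, ruinous for thousands.
            rows: table.iter().cloned().collect(),
            pos: 0,
        }
    }
}

impl Iterator for EagerScan {
    type Item = Vec<u64>;
    fn next(&mut self) -> Option<Vec<u64>> {
        let r = self.rows.get(self.pos).cloned();
        self.pos += 1;
        r
    }
}
```

Every downstream filter then runs against data that has already been fully copied, so even a query that needs one row pays for all of them.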
The critical insight: the batch processing infrastructure already exists in rukuzu. Phase 4 of the port had already built DataChunk, SelectionVector, ScanSharedState, evaluate_batch(), filter_chunk(), and project_chunk() — the morsel-driven pipeline components that kuzu uses for vectorized execution. They're just not wired into the actual scan operators. The operators still do the eager .collect() that was the quick-and-correct implementation during the initial port.
This is a textbook example of what the custom command's "research both implementations" step catches. The batch infrastructure was built following the porting workflow — Claude read kuzu's morsel-driven execution model, implemented the Rust equivalent, and it passed all tests. But the scan operators that feed data into that pipeline were implemented with the simplest correct approach: materialize everything, then process. On pizza, "simplest correct" and "fast" were the same thing. On GALEN, they're three orders of magnitude apart.
The fix has three phases: lazy scanning (replace eager .collect() with iterator-based page-at-a-time access), primary key index usage (use the existing hash indexes to skip directly to matching rows instead of scanning), and batch wiring (connect the lazy scanners to the existing DataChunk pipeline for vectorized processing). Claude is implementing this now, following the same phase-by-phase approach that worked on the pizza optimization: each phase is independently testable, each gets a benchmark run, and the numbers either move or they don't.
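Phase one, sketched in the same simplified terms (a hypothetical illustration of the direction, not the committed rukuzu code):

```rust
// A lazy scan borrows the table and yields rows on demand,
// instead of collecting everything in the constructor.
struct LazyScan<'a> {
    table: &'a [Vec<u64>],
    pos: usize,
}

impl<'a> Iterator for LazyScan<'a> {
    type Item = &'a Vec<u64>;
    fn next(&mut self) -> Option<&'a Vec<u64>> {
        let r = self.table.get(self.pos);
        self.pos += 1;
        r
    }
}

// A selective query can now short-circuit after touching only the
// rows it needs, rather than cloning the whole table first.
fn find_first(table: &[Vec<u64>], key: u64) -> Option<&Vec<u64>> {
    LazyScan { table, pos: 0 }.find(|row| row.first() == Some(&key))
}
```

Phases two and three build on this: a primary key index lets `find_first` become a direct hash lookup, and the lazy iterator feeds rows into the existing DataChunk pipeline in batches instead of one at a time.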
The Full Performance Picture
The story is now more nuanced — and more honest — than a simple "Rust wins" narrative. Here's the complete picture across both ontology scales:
At small scale, rukuzu wins on everything — no FFI overhead, no C++ runtime, sub-millisecond pipeline. At clinical scale, kuzu's mature query engine dominates by three orders of magnitude. The pluggable architecture isn't a nice-to-have; it's the architecture. Different workloads demand different backends, and the --features flag lets you choose without touching a line of application code.
For mobile deployment, the picture is clear but nuanced: rukuzu is the right choice for small, local ontologies where startup time and binary size matter. For clinical-scale ontologies like GALEN (and eventually SNOMED CT), kuzu's query engine is currently irreplaceable. The trait abstraction means DEALER doesn't care — it runs the same code against either backend.
What Comes Next
As of this writing, the lazy morsel fix is being implemented. The three-phase plan — lazy scanning, PK index usage, batch wiring — targets the same architectural layer as the pizza optimization but at a deeper level. The pizza fix replaced scan_all() with direct row access. The GALEN fix replaces eager materialization with lazy, index-driven, batched execution. If the pizza optimization is any guide, the numbers will move dramatically — the question is whether they move enough to close a three-order-of-magnitude gap.
There's also a separate optimization frontier: the fuzzy classifier itself. DEALER's new fuzzy OWL 2 EL reasoning — degree-annotated subsumption with t-norm semantics (Zadeh min, Lukasiewicz sum) — hasn't been through the optimization loop yet. The 13.5-second GALEN classify time is running the unoptimized fuzzy saturation algorithm, which carries additional per-rule overhead for degree propagation and epsilon comparisons. Profiling the fuzzy-specific code paths against GALEN is a natural next step, and the same custom command workflow applies.
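For readers unfamiliar with the t-norms mentioned above, they are small functions — the cost is not in computing them but in propagating degrees through every rule application. A minimal sketch (the epsilon value is illustrative, not DEALER's actual constant):

```rust
/// Zadeh t-norm: the conjunction degree is the minimum.
fn zadeh(a: f64, b: f64) -> f64 {
    a.min(b)
}

/// Lukasiewicz t-norm: max(0, a + b - 1).
fn lukasiewicz(a: f64, b: f64) -> f64 {
    (a + b - 1.0).max(0.0)
}

/// Epsilon comparison of the kind a fuzzy saturation loop must do to
/// decide whether a newly derived degree improves on a stored one.
const EPS: f64 = 1e-9; // illustrative threshold
fn improves(new: f64, old: f64) -> bool {
    new > old + EPS
}
```

The per-rule overhead comes from calls like these running inside the saturation hot loop, plus the bookkeeping to re-queue axioms whose degrees improved.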
Beyond GALEN, the frontier continues: SNOMED CT subsets (~50,000 concepts), and eventually full SNOMED CT (~350,000 concepts). Each scale will reveal new bottlenecks, and each bottleneck gets encoded as a new pitfall in the custom command for future sessions.
The custom command workflow will continue: benchmark, diagnose, fix, re-benchmark. The fundamental question from Part 1 — "can a Rust reimplementation of a mature C++ database engine compete on performance?" — now has a more complete answer. At small scale, it doesn't just compete; it wins. At clinical scale, it has a long way to go — but the diagnosis is clear, the batch infrastructure is already built, and the fix is underway. The pluggable architecture means we can ship today with kuzu for clinical workloads while rukuzu catches up. That's the real lesson: the right architecture lets you ship today with the best available backend while optimizing toward the one you want.
---
This is Part 2 of an ongoing series. Part 1 covered the porting workflow and initial benchmarks. Future installments will follow the optimization work on GALEN and larger ontologies as the feedback loop continues.