Loxation Team
- 7 minutes read - 1466 words
Porting 200K Lines of C++ to Rust with Claude, Part 3: Closing the 1,128x Gap
This is Part 3 of a series. Part 1 described the systematic workflow for porting kuzu's C++ graph database to Rust. Part 2 followed the optimization feedback loop from a 29x speedup on the pizza ontology to the humbling GALEN benchmark — where kuzu's mature query engine was 1,128x faster than rukuzu. This part documents the fix.
Where We Left Off
Part 2 ended with a diagnosis and a plan. Claude had traced the 1,128x query gap on GALEN to an architectural mismatch: rukuzu's scan operators eagerly materialize entire tables with scan_all_with_txn(txn_id).collect(), while kuzu lazily maps pages through a buffer manager and processes tuples in vectorized batches of 2,048. On pizza (16 nodes), eager materialization is free. On GALEN (2,700 concepts, tens of thousands of edges), it's catastrophic.
The two systems don't even mean the same thing by "in-memory mode." kuzu's SystemConfig::default() pre-allocates a buffer manager that reserves 80% of physical RAM as an mmap region — virtual memory that lets it map storage pages on demand, maintain CSR edge storage and hash indexes, and process tuples in vectorized batches. rukuzu's in-memory mode means HashMap storage, row-at-a-time access, and eager materialization. Same API, radically different architectures.
The fix was written up in performance_fix_lazy_morsel.md and implemented by Claude following the same custom command workflow that guided the original port. Three phases: lazy scanning, primary key index usage, and batch wiring. Each independently testable. Each gets a benchmark run.
The Fix: Lazy Morsel-Driven Scanning
The critical insight from Part 2 was that the batch infrastructure already existed. Phase 4 of the original port had built DataChunk, SelectionVector, ScanSharedState, evaluate_batch(), filter_chunk(), and project_chunk() — the morsel-driven pipeline components that kuzu uses for vectorized execution. They passed all tests. They just weren't wired into the scan operators that feed the query engine.
The scan operators still did table.scan_all_with_txn(txn_id).collect() in their constructors. On pizza, this collects 16 rows — negligible. On GALEN, it materializes every row of every table the query touches before any filtering happens. A taxonomy lookup that returns a handful of direct subsumers was eagerly loading tens of thousands of rows it would never read.
The fix replaced eager materialization with lazy, index-driven scanning: scan operators now use primary key hash indexes to jump directly to matching rows, emit results through the existing DataChunk pipeline, and stop as soon as the query's result set is complete. No full table scans. No eager .collect(). The query touches only the rows it needs.
Following the custom command's phase discipline, the implementation was testable at each step. cargo test --workspace after every change. cargo clippy --workspace with zero warnings. Then the benchmark.
The Result: Clean Sweep at Clinical Scale
Read the db_query row. rukuzu went from 580 ms (Part 2) to 14.3 us. That's a 40,500x speedup — more than four orders of magnitude — on a clinical-scale ontology with 2,700 concepts. And it's now 36x faster than kuzu's mature C++ query engine.
The db_export row tells an equally dramatic story. In Part 2, kuzu was 3.6x faster on GALEN export. Now rukuzu is 9x faster — the export went from 883 ms to 26.7 ms. The lazy morsel fix didn't just close the query gap; it transformed the entire database layer.
rukuzu now wins on every database operation at every scale. The "write/read tradeoff" from Part 1, the "GALEN reversal" from Part 2 — gone. Clean sweep.
The Full Arc: From 1,128x Slower to 36x Faster
Let's trace the complete optimization story across both ontologies:
The 36x query advantage on GALEN is far larger than the 6x advantage we measured on pizza in the intermediate benchmarks. This makes sense: the FFI overhead that kuzu pays is relatively more expensive on GALEN because the actual query work (once the storage access patterns are correct) is so fast. At 14.3 us, rukuzu resolves a taxonomy lookup on a 2,700-concept medical ontology in the time it takes to do a couple of memory allocations in C++. kuzu is spending most of its 518 us marshaling results across the language boundary.
What 40,500x Teaches About AI-Assisted Development
The GALEN optimization is a stronger validation of the custom command workflow than the pizza optimization was, precisely because it was harder.
The pizza fix was two bugs: scan_all() per edge and Vec-as-lookup-table. Both were straightforward once found — change a data structure, add a direct-access method. The GALEN fix was architectural: replacing the eager materialization model with lazy, index-driven, batch-processed scanning. This isn't a one-line fix. It's rewiring how the query engine feeds data into the execution pipeline.
And yet the workflow was identical. Feed Claude the benchmark data. Let it trace the execution path following the "research both implementations" step. Compare what kuzu does (buffer-managed lazy scanning) with what rukuzu does (eager .collect()). Identify the mismatch. Plan the fix in phases. Implement and test each phase. Re-benchmark.
The key difference was the diagnosis. On pizza, the problem was bugs — code that did the wrong thing. On GALEN, the problem was architecture — code that did the right thing slowly. The scan_all_with_txn(txn_id).collect() call was correct. It produced the right results. It just materialized everything instead of materializing what the query needed. The custom command's "research both implementations" step caught it because it doesn't distinguish between "wrong" and "slow" — it reads the full execution path and compares.
Claude's three-minute diagnosis — tracing from the 1,128x gap to the eager materialization pattern, noting that the batch infrastructure was already built, and proposing the three-phase fix — is the custom command earning its keep at a level that the pizza optimization didn't test. This wasn't finding a bug in a data structure. This was identifying an architectural mismatch between two database engines and designing the migration from one execution model to another. The fact that the fix plan was written to performance_fix_lazy_morsel.md and implemented in a single session, with all tests passing, suggests the workflow scales with problem complexity — not because Claude is a better database engineer than a human, but because it follows the protocol without shortcuts and reads everything the protocol tells it to read.
What Comes Next
The database layer is solved. rukuzu wins on both reads and writes at both toy and clinical scale — 9x faster on export, 36x faster on query, on 2,700 concepts. The pluggable architecture still exists (build with --features kuzu to use the C++ backend), but the performance argument for doing so has evaporated at every scale we've tested.
The next frontier is the reasoning engine itself. Classification takes 1.7 seconds on GALEN — fast enough for app startup, but it's running an unoptimized fuzzy saturation algorithm. DEALER now supports fuzzy OWL 2 EL reasoning with degree-annotated subsumption and t-norm semantics (Zadeh min, Lukasiewicz sum), and this fuzzy classifier hasn't been through the optimization loop yet. The question is whether SNOMED CT (~350,000 concepts) classification time is acceptable, and whether parallel saturation can bring it into range.
The fundamental question this series set out to answer — can a Rust reimplementation of a mature C++ database engine, ported with AI assistance using a systematic custom command workflow, compete on performance? — now has a definitive answer. It doesn't just compete. At 14.3 us versus 518 us on a clinical-scale ontology, it wins by 36x. The gap went from 1,128x in the wrong direction to 36x in the right direction. The tool that closed it was discipline — encoded in a markdown file, executed by an AI that follows protocols without shortcuts.
And the bottom line is this: we can do ontology reasoning on a mobile platform with reasonable results. No, medical diagnosis can't be done in microseconds — the classification of a 2,700-concept clinical ontology takes 1.7 seconds, and the fuzzy reasoning that supports graded clinical judgments hasn't been optimized yet. But a 14-microsecond query response on a classified ontology, running on an iPad chip, with no cloud dependency? That's not science fiction. That's engineering, driven by a data-and-benchmarking process that works for both the graph database and the reasoning engine.
---
This is Part 3 of a series. Parts 1-2 covered the porting workflow, initial benchmarks, and the GALEN reversal. Part 4 turns to the reasoning engine itself: ELK-style parallel saturation for a 10x pipeline speedup.