DEALER: Parallelizing the fuzzy EL++ classifier for the win! Medical diagnosis on the mobile device
Loxation Team
This is Part 4 of a series. Parts 1–3 covered porting kuzu’s C++ graph database to Rust and optimizing the query engine. This part turns to the reasoning engine itself: implementing ELK-style parallel saturation in DEALER’s OWL 2 EL++ classifier, with real-world results on the GALEN medical ontology.
A Different Kind of Optimization
Parts 2 and 3 optimized the database layer — how DEALER stores and queries its classification results. The numbers were dramatic (6,580× query speedup on GALEN), but they addressed a downstream stage. The ontology was already classified; we were just making it faster to ask questions about the results.
This article targets the upstream bottleneck: classification itself. On GALEN — a medical ontology with roughly 30,000 concepts and a 5.6 MB source file — the single-threaded saturation algorithm took 9.7 seconds for classification and 12.4 seconds for taxonomy extraction. The total pipeline (parse, normalize, classify, extract taxonomy) ran ~22 seconds.
After implementing ELK-style parallel saturation and an optimized taxonomy extraction: ~1.85 seconds total. A 12× pipeline speedup.
Why Classification Is the Bottleneck
DEALER implements an EL++ saturation-based classifier following the ELK architecture (Kazakov et al., ISWC 2011). The algorithm is conceptually simple: for each concept in the ontology, maintain a label set S(A) initialized to {A, owl:Thing}. Apply nine completion rules (CR1–CR9) that propagate subsumption inferences across concepts until no new inferences can be derived — the fixpoint. The taxonomy is then extracted from the final label sets.
The work is enormous. On GALEN, saturation processes 28,507 contexts and performs 11.96 million inference steps. Each rule application can trigger further rule applications across related concepts, creating cascading chains of inference. A single existential restriction axiom can propagate through dozens of concepts before reaching fixpoint.
The single-threaded implementation processes concepts sequentially: pick a concept from the work queue, apply all applicable rules, add newly triggered concepts back to the queue, repeat. This is correct and simple, but it leaves 7 of 8 cores idle on a modern machine. The ELK paper showed that saturation is embarrassingly parallel — different concepts can be processed simultaneously as long as cross-concept writes are handled carefully.
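The sequential loop can be sketched in a few lines. This is a minimal, hypothetical stand-in: a single told-subsumption rule ("if B ∈ S(A) and B ⊑ C is told, add C to S(A)") replaces the full CR1–CR9 rule set, and S(A) is initialized to {A} (owl:Thing omitted for brevity); names and types are illustrative, not DEALER's actual API.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Minimal sketch of the sequential saturation loop: pick a concept,
/// apply every applicable rule, requeue on change, repeat until no
/// rule fires anywhere — the fixpoint.
fn saturate(told: &HashMap<u32, Vec<u32>>, concepts: &[u32]) -> HashMap<u32, HashSet<u32>> {
    let mut labels: HashMap<u32, HashSet<u32>> = HashMap::new();
    let mut queue: VecDeque<u32> = concepts.iter().copied().collect();
    for &c in concepts {
        labels.entry(c).or_default().insert(c); // S(A) starts as {A}
    }
    while let Some(a) = queue.pop_front() {
        let current: Vec<u32> = labels[&a].iter().copied().collect();
        let mut changed = false;
        for b in current {
            // The single stand-in rule: B ∈ S(A) and told B ⊑ C ⟹ C ∈ S(A).
            if let Some(supers) = told.get(&b) {
                for &c in supers {
                    changed |= labels.get_mut(&a).unwrap().insert(c);
                }
            }
        }
        if changed {
            queue.push_back(a); // new labels may enable further rules
        }
    }
    labels
}
```

The fixpoint is order-independent: requeueing a concept after any change guarantees every rule eventually fires, no matter which concept the queue yields first.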
The Architecture: What Makes Parallel Saturation Hard
The challenge isn’t parallelism per se — it’s the data access pattern. Each concept’s context (its label set, predecessor links, and existential edges) is primarily written by rules processing that concept, but read by rules processing other concepts. A rule firing on concept A may need to read concept B’s label set to check a subsumption, or write to concept C’s todo queue to trigger further processing.
The ELK paper’s solution is elegant: each context has a single owner (the worker currently processing it), writes to a context are only performed by its owner, and cross-context communication happens through message passing (adding items to another context’s todo queue). This eliminates data races by construction.
Here’s how we implemented it in Rust:
┌─────────────────────────────────────────────────────────────────────┐
│ Component │ Design │
├─────────────────────────────────────────────────────────────────────┤
│ Context access │ UnsafeCell + AtomicU8 CAS │
│ │ Single-writer from scheduling; zero overhead│
├─────────────────────────────────────────────────────────────────────┤
│ Todo queues │ Per-context Mutex<VecDeque> + per-context │
│ │ high-water mark for natural dedup │
├─────────────────────────────────────────────────────────────────────┤
│ Active dispatch │ Lock-free SegQueue<ConceptId> │
│ │ High-throughput MPMC for worker scheduling │
├─────────────────────────────────────────────────────────────────────┤
│ Cross-context reads │ Unsafe stale reads │
│ │ Labels/edges grow monotonically; stale │
│ │ reads are conservative but correct │
├─────────────────────────────────────────────────────────────────────┤
│ Predecessor writes │ Deferred via PredecessorAdded inference │
│ │ Eliminates all cross-context mutation │
├─────────────────────────────────────────────────────────────────────┤
│ Nominal merging │ Deferred to sequential phase │
│ │ Too complex for concurrent; GALEN has 0 │
├─────────────────────────────────────────────────────────────────────┤
│ Thread pool │ rayon::scope with custom inner loop │
│ │ Manages threads/panics; saturation-specific │
│ │ scheduling inside │
├─────────────────────────────────────────────────────────────────────┤
│ Quiescence │ active_count + idle spin (64 iterations) │
│ │ Catches late activations without busy-wait │
└─────────────────────────────────────────────────────────────────────┘
Three design decisions deserve explanation.
UnsafeCell + AtomicU8 CAS for context access. Each context slot has an atomic state byte. A worker claims a context with a compare-and-swap from IDLE to ACTIVE. If the CAS succeeds, the worker has exclusive write access — no locks, no synchronization overhead on the hot path. If it fails, another worker owns the context and the current worker moves on. This gives us the single-writer guarantee the ELK model requires at zero runtime cost beyond the CAS.
Unsafe stale reads for cross-context label sets. When a rule on concept A needs to check whether concept B’s label set contains a particular concept, it reads B’s labels without synchronization. This is safe because label sets grow monotonically — once a concept is added, it’s never removed. A stale read might miss a recently added label, but that’s conservative: it means we might re-derive an inference later rather than missing it. The saturation algorithm is designed for exactly this — redundant inferences are idempotent, and the fixpoint is the same regardless of processing order.
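The two mechanisms can be sketched together: the CAS claim on the writer side and the unsynchronized containment check on the reader side. The names (ContextSlot, try_claim) and the Vec-of-labels representation are illustrative assumptions, not DEALER's exact types.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU8, Ordering};

const IDLE: u8 = 0;
const ACTIVE: u8 = 1;

/// Illustrative context slot: an atomic state byte guards exclusive
/// write access to the label set; reads bypass synchronization.
struct ContextSlot {
    state: AtomicU8,
    labels: UnsafeCell<Vec<u32>>,
}

// SAFETY: writes happen only after a successful IDLE -> ACTIVE CAS
// (single-writer discipline), and the label set grows monotonically,
// so unsynchronized readers see at worst a stale prefix.
unsafe impl Sync for ContextSlot {}

impl ContextSlot {
    fn new() -> Self {
        ContextSlot { state: AtomicU8::new(IDLE), labels: UnsafeCell::new(Vec::new()) }
    }

    /// Writer side: claim exclusive ownership via compare-and-swap.
    fn try_claim(&self) -> bool {
        self.state
            .compare_exchange(IDLE, ACTIVE, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Writer side: append a label. Caller must hold the claim.
    fn add_label(&self, c: u32) {
        unsafe { (*self.labels.get()).push(c) }
    }

    /// Release ownership so another worker can claim the slot.
    fn release(&self) {
        self.state.store(IDLE, Ordering::Release);
    }

    /// Reader side: stale cross-context check. It may miss the newest
    /// labels, which only delays an inference, never invalidates one.
    fn contains_stale(&self, c: u32) -> bool {
        unsafe { (*self.labels.get()).contains(&c) }
    }
}
```

A failed try_claim means another worker owns the context; the caller simply moves on to other work rather than blocking.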
Deferred predecessor writes via PredecessorAdded inference. The trickiest cross-context operation is adding predecessor links: when concept A gains a new existential edge to concept B, B needs to record A as a predecessor. Instead of writing to B’s data structure directly (which would require synchronization), the worker processing A enqueues a PredecessorAdded inference on B’s todo queue. When B is next processed, its owner handles the predecessor update. This eliminates all cross-context mutation — the only cross-context operation is appending to a Mutex<VecDeque>, which is a single lock acquisition on a queue that’s only contended when multiple concepts simultaneously trigger inferences on the same target.
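A sketch of the deferral, under assumed names: a LocalInference message type carrying the PredecessorAdded payload, pushed onto the target context's mutex-guarded todo queue and drained later by that context's owner.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Illustrative message type for cross-context communication:
/// "concept `pred` reaches this context's concept via role `role`".
#[derive(Debug, PartialEq)]
enum LocalInference {
    PredecessorAdded { pred: u32, role: u32 },
}

/// Per-context todo queue: the only cross-context write in the whole
/// saturation is the push in `defer_predecessor`.
struct TodoQueue {
    inner: Mutex<VecDeque<LocalInference>>,
}

impl TodoQueue {
    fn new() -> Self {
        TodoQueue { inner: Mutex::new(VecDeque::new()) }
    }

    /// Called by the worker processing concept A when A gains an
    /// existential edge to this queue's concept B.
    fn defer_predecessor(&self, pred: u32, role: u32) {
        self.inner
            .lock()
            .unwrap()
            .push_back(LocalInference::PredecessorAdded { pred, role });
    }

    /// Drained by B's owner the next time B is processed; the owner
    /// applies the predecessor updates with exclusive write access.
    fn drain(&self) -> Vec<LocalInference> {
        self.inner.lock().unwrap().drain(..).collect()
    }
}
```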
The Classification Results
┌─────────────────────────────────────────┬─────────┬─────────┬──────────────────────┐
│ Optimization │ Time │ Speedup │ Notes │
├─────────────────────────────────────────┼─────────┼─────────┼──────────────────────┤
│ Baseline (single-threaded) │ 9.70 s │ 1× │ │
├─────────────────────────────────────────┼─────────┼─────────┼──────────────────────┤
│ + Scratch buffers + crisp skip │ 8.77 s │ 1.1× │ Reusable allocations,│
│ │ │ │ skip stale checks │
├─────────────────────────────────────────┼─────────┼─────────┼──────────────────────┤
│ + Parallel saturation │ 1.70 s │ 5.7× │ ELK-style, 8 cores │
│ │ │ │ (Apple M1) │
└─────────────────────────────────────────┴─────────┴─────────┴──────────────────────┘
The path to 1.7 seconds came in two steps.
Step 1: Single-threaded cleanup. Before parallelizing, we eliminated unnecessary allocation overhead. Scratch buffers for rule application are now reused across iterations instead of allocated per-rule. The “crisp skip” optimization detects when a concept’s fuzzy degree is 1.0 (fully certain) and bypasses the t-norm computation and epsilon comparison — on GALEN, where most inferences are crisp, this avoids millions of floating-point operations. Combined effect: 9.70s → 8.77s, a modest 1.1× improvement. But this step matters because it reduces the per-inference cost, which amplifies the parallel speedup.
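The crisp skip amounts to a fast path in degree combination. A minimal sketch, assuming a product t-norm for illustration (the actual t-norm and function names in DEALER may differ):

```rust
/// Combine two subsumption degrees. When both inputs are crisp (1.0),
/// the t-norm result is 1.0 by definition, so the float work and the
/// downstream epsilon comparison can be skipped entirely.
#[inline]
fn combine_degrees(d1: f64, d2: f64) -> f64 {
    if d1 == 1.0 && d2 == 1.0 {
        return 1.0; // crisp fast path: no multiply, no epsilon compare
    }
    d1 * d2 // fuzzy path: product t-norm (illustrative choice)
}
```

On a mostly-crisp ontology like GALEN, nearly every call takes the fast path, which is where the saved floating-point work comes from.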
Step 2: Parallel saturation. The ELK-style concurrent architecture described above, running on 8 cores of an Apple M1. Effect: 8.77s → 1.70s, a 5.2× speedup from parallelism alone (5.7× from baseline). On a saturation workload that processes 11.96 million inferences across 28,507 contexts, this is strong scaling — not perfect (5.2× on 8 cores), but well within expectations for a workload with cross-context dependencies and irregular task sizes.
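The quiescence scheme from the architecture table (active_count plus a 64-iteration idle spin) can be sketched per worker as follows. This is a simplified single-worker model with assumed names; the real implementation runs many such loops inside rayon::scope.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Illustrative quiescence loop: `active` counts workers currently
/// holding a context. A worker only concludes fixpoint after the queue
/// is empty, no worker is mid-flight, and a bounded spin has passed to
/// catch activations enqueued between the two checks.
fn worker_loop(active: Arc<AtomicUsize>, mut next: impl FnMut() -> Option<u32>) -> usize {
    let mut processed = 0;
    let mut idle_spins = 0;
    loop {
        if let Some(_ctx) = next() {
            active.fetch_add(1, Ordering::SeqCst);
            // ... apply CR1-CR9 to _ctx's context here ...
            processed += 1;
            active.fetch_sub(1, Ordering::SeqCst);
            idle_spins = 0; // fresh work resets the quiescence countdown
        } else if active.load(Ordering::SeqCst) == 0 {
            idle_spins += 1;
            if idle_spins >= 64 {
                return processed; // quiescent: saturation complete
            }
            std::hint::spin_loop();
        } else {
            std::hint::spin_loop(); // a busy worker may enqueue more work
        }
    }
}
```

The bounded spin is the compromise the table describes: it catches late activations without degenerating into an unbounded busy-wait.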
The Taxonomy Extraction Breakthrough
The even bigger win was taxonomy extraction — the step after saturation that converts raw label sets into a directed acyclic graph with equivalence classes and direct parent/child edges.
┌──────────────────────────────────────────────┬──────────┬─────────┐
│ Optimization │ Time │ Speedup │
├──────────────────────────────────────────────┼──────────┼─────────┤
│ Baseline (O(n²) equiv + O(nk²) reduction) │ 12.37 s │ 1× │
├──────────────────────────────────────────────┼──────────┼─────────┤
│ O(nk) equiv + subsumer marking │ 0.13 s │ 95× │
└──────────────────────────────────────────────┴──────────┴─────────┘
From 12.37 seconds to 130 milliseconds. A 95× speedup.
The baseline algorithm was textbook but naive: O(n²) equivalence detection (comparing every pair of concepts) and O(nk²) transitive reduction (for each concept, checking every pair of its k parents to see if one subsumes the other). On GALEN’s 30,000 concepts, n² means 900 million comparisons for equivalence alone.
The optimized algorithm exploits the structure of saturation output. Equivalence detection is now O(nk): for each concept A with k labels, check whether each label B also has A in its label set. This uses the label sets that saturation already computed — no additional comparisons needed. Transitive reduction uses subsumer marking: instead of comparing parents pairwise, mark all ancestors of each parent in a bitset, then any parent that’s an ancestor of another parent is indirect and gets removed. This is O(nk) instead of O(nk²).
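Both pieces are small enough to sketch. The names are illustrative, and a HashSet stands in for the bitset the article describes; the asymptotics are the same.

```rust
use std::collections::{HashMap, HashSet};

/// O(1) equivalence test on saturated label sets (applied O(nk) times
/// overall): A ≡ B iff each concept appears in the other's label set.
fn equivalent(a: u32, b: u32, labels: &HashMap<u32, HashSet<u32>>) -> bool {
    labels[&a].contains(&b) && labels[&b].contains(&a)
}

/// Transitive reduction by subsumer marking: mark every strict
/// ancestor of every parent; a parent that ends up marked is reachable
/// through another parent and is therefore indirect.
fn direct_parents(parents: &[u32], ancestors: &HashMap<u32, HashSet<u32>>) -> Vec<u32> {
    let mut marked: HashSet<u32> = HashSet::new();
    for &p in parents {
        if let Some(anc) = ancestors.get(&p) {
            // Exclude p itself: a parent must not mark itself indirect.
            marked.extend(anc.iter().copied().filter(|&x| x != p));
        }
    }
    parents.iter().copied().filter(|p| !marked.contains(p)).collect()
}
```

Each parent is visited once and each ancestor set is scanned once, so the cost per concept is linear in the number of (parent, ancestor) pairs rather than quadratic in the number of parents.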
The 95× speedup was larger than the classification speedup because the baseline had worse algorithmic complexity. Classification went from O(lots) to O(lots/8) — constant factor improvement from parallelism. Taxonomy extraction went from O(n²) to O(nk) — algorithmic improvement that changes the scaling class.
The Full Pipeline: 22 Seconds to 1.85 Seconds
┌──────────────────┬──────────┬──────────┬─────────┐
│ Stage │ Before │ After │ Speedup │
├──────────────────┼──────────┼──────────┼─────────┤
│ Parse │ ~65 ms │ ~67 ms │ — │
├──────────────────┼──────────┼──────────┼─────────┤
│ Normalize │ ~10 ms │ ~10 ms │ — │
├──────────────────┼──────────┼──────────┼─────────┤
│ Classify │ 9.70 s │ 1.70 s │ 5.7× │
├──────────────────┼──────────┼──────────┼─────────┤
│ Extract taxonomy │ 12.37 s │ 0.13 s │ 95× │
├──────────────────┼──────────┼──────────┼─────────┤
│ Total pipeline │ ~22 s │ ~1.85 s │ 12× │
└──────────────────┴──────────┴──────────┴─────────┘
The pipeline is now dominated by classification at 1.7 seconds. Taxonomy extraction, which was the majority of the previous pipeline time at 12.37 seconds, is now 130 milliseconds — negligible. Parse and normalize were never the bottleneck and remain unchanged.
For mobile deployment, 1.85 seconds to fully classify a 30,000-concept medical ontology is a qualitative threshold. It’s fast enough to run at app startup without a loading spinner feeling painful. It’s fast enough to re-classify when an ontology update arrives over the network. And it runs on 8 cores of a mobile SoC — the M1 is an Apple Silicon chip that ships in iPads and MacBooks, the exact deployment target for DEALER. Add the 14-microsecond query time and the 27-millisecond export from Part 3’s database optimizations, and the full pipeline from OWL source to answered query completes in under 2 seconds.
Correctness: The Non-Negotiable
Parallel saturation introduces subtle correctness risks. Stale reads, reordered inferences, and concurrent rule applications can all produce results that differ from the sequential path. For a medical ontology classifier, “mostly correct” isn’t acceptable — a missing subsumption could mean a drug interaction goes undetected.
The parallel and sequential paths produce bit-exact identical label sets for every context. This was verified on two ontologies covering very different complexity profiles: the pizza ontology (small, simple axiom structure, fast saturation) and the GALEN ontology (28,507 contexts, 11.96 million inference steps, dense role hierarchies).
Bit-exact equivalence means the same concepts appear in the same label sets regardless of whether saturation runs on 1 core or 8. The monotonic growth property of label sets guarantees this: stale reads can only miss recently added labels, which means rules might fire in a different order, but the fixpoint — the final state where no new inferences are possible — is the same regardless of processing order. This is a provable property of the ELK algorithm, not an empirical observation.
What This Means for the Fuzzy Classifier
Parts 2 and 3 noted that DEALER’s fuzzy OWL 2 EL classifier — degree-annotated subsumption with t-norm semantics — hadn’t been optimized. The “crisp skip” optimization in this article is the first step: when a concept’s fuzzy degree is 1.0, bypass the t-norm computation entirely. On GALEN, where the vast majority of inferences are crisp, this avoids millions of unnecessary floating-point operations.
But the parallel saturation architecture itself is fuzzy-aware. The degree propagation and epsilon comparisons run inside each rule’s concurrent implementation. The monotonic growth property holds for fuzzy degrees too — a concept’s degree can only increase (become more certain), never decrease. This means the same stale-read argument applies: a worker might read a slightly-too-low degree for a cross-context concept, compute a conservative inference, and later re-derive a tighter one. The fixpoint is identical.
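The monotonic-degree argument can be made concrete with a small sketch. The helper below is hypothetical, not DEALER's API: it models a degree-annotated label set where an update only succeeds when it strictly tightens the degree.

```rust
use std::collections::HashMap;

/// Illustrative monotonic degree update: a label's degree only ever
/// moves upward toward 1.0. Returns true when a strictly higher degree
/// was derived, i.e. the context needs reprocessing; weaker or
/// duplicate derivations are idempotent and safely dropped.
fn record_degree(labels: &mut HashMap<u32, f64>, concept: u32, degree: f64) -> bool {
    let entry = labels.entry(concept).or_insert(0.0);
    if degree > *entry {
        *entry = degree; // tighten
        true
    } else {
        false
    }
}
```

Because updates are monotone and redundant derivations are no-ops, a stale reader that observes a too-low degree can only trigger a re-derivation, never a wrong final degree — the fuzzy analogue of the stale-read argument for crisp label sets.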
The 1.7-second classify time on GALEN is running the full fuzzy saturation algorithm, not a crisp-only shortcut. For ontologies with meaningful fuzzy annotations — where degrees differ from 1.0 — the same architecture applies without modification.
The Implementation: 976 Lines of Concurrency
The parallel saturation implementation spans two files:
concurrent.rs (376 lines) defines the concurrent data structures: ConcurrentContextTable (the shared context storage with UnsafeCell slots), ContextSlot (individual concept contexts with atomic state management), LocalInference (the message type for cross-context communication), and WorkerLocalState (per-worker scratch buffers and statistics).
reasoner.rs (+600 lines) implements classify_parallel(), the parallel saturation entry point, and parallel_phase(), which manages the rayon::scope thread pool and quiescence detection. It also contains the 9 concurrent rule implementations (CR1–CR9), each adapted to use the UnsafeCell access pattern and deferred predecessor writes.
Two new dependencies: rayon for the thread pool and crossbeam-queue for the lock-free SegQueue used in active concept dispatch.
The total implementation is under 1,000 lines. For context, kuzu’s parallel query execution infrastructure spans tens of thousands of lines of C++. The difference is scope — we’re parallelizing a specific algorithm (EL saturation) rather than a general-purpose query engine — but it illustrates how Rust’s ownership model and the rayon/crossbeam ecosystem compress concurrent code. The UnsafeCell + AtomicU8 pattern gives us the same zero-overhead single-writer semantics that would require a custom lock-free data structure in C++, in about 20 lines of Rust.
What Comes Next
The reasoning engine classifies GALEN in 1.7 seconds. The database query takes 14 microseconds. The database export takes 27 milliseconds. The total pipeline from OWL source file to answered query is under 2 seconds on a mobile-class processor. And rukuzu now wins on every database operation at every scale — 9× faster on export, 36× faster on query at GALEN’s 2,700 concepts (see Part 3 for the full benchmark data).
The next frontier is SNOMED CT. At ~350,000 concepts — roughly 12× larger than GALEN — the classification time will test whether the parallel saturation scales. The 5.2× parallel speedup on 8 cores suggests we’re not yet hitting Amdahl’s-law limits, but SNOMED CT’s axiom density and the sheer volume of cross-context inferences may introduce new contention patterns. The feedback loop continues: benchmark on SNOMED CT, identify the new bottleneck, fix it.
The series started with a question: can you port 200,000 lines of C++ to Rust with AI assistance and get competitive performance? Three parts later, the answer is yes — rukuzu wins on both reads and writes at clinical scale. This part asks the same question of the reasoning engine itself: can you implement a research-paper concurrent algorithm (Kazakov et al., ISWC 2011) with AI assistance and get near-theoretical speedups? On GALEN, the answer is 5.7× on 8 cores, with bit-exact correctness, in under 1,000 lines. That’s not a toy result. That’s a production-grade parallel reasoner.
And the bottom line across all four parts: ontology reasoning on a mobile platform is practical. Classify a 30,000-concept medical ontology in 1.7 seconds at app startup. Answer subsumption queries in 14 microseconds. Support fuzzy degree-annotated reasoning for graded clinical judgments. All on an iPad chip, all in pure Rust, no cloud round-trip required. Medical diagnosis can’t be done in microseconds — the domain is too complex, the reasoning too nuanced, the stakes too high for snap answers. But the computational infrastructure to support clinical decision-making at the point of care, running on hardware that fits in a coat pocket? That’s not science fiction. That’s what a data-driven benchmarking process, applied systematically to both the graph database and the fuzzy EL++ reasoning engine, delivers.
This is Part 4 of an ongoing series. Parts 1–3 covered the database porting and optimization. Future installments will follow the path to SNOMED CT scale.
About the Author
Jonathan Borden, M.D. served as an invited expert on the W3C OWL Working Group and is the founder and architect at Loxation. The DEALER project is part of an initiative to bring ontology reasoning to mobile and edge devices. Reach him at jonathan@oloxation.com.