Neuro-Logical: Ontology-Grounded Differential Diagnosis with LLM Integration
Loxation Team
Jonathan / Loxation / March 2026
If you’ve spent time in clinical informatics, you’ve watched two waves of clinical decision support arrive with great promise and underdeliver for the same underlying reason.
The first wave — rule-based CDS — encoded expert knowledge as if-then logic over coded data. It worked when the data was clean, the rules were narrow, and the clinical context was constrained. It broke on free text, couldn’t scale to broad differential diagnosis, and required manual maintenance that no team could sustain across a SNOMED-scale terminology.
The second wave — LLM-based CDS — solved the free text problem spectacularly. GPT-5 and Claude can parse a clinical note, generate a plausible differential, and explain the reasoning in fluent prose. But they introduced failure modes that are arguably worse than the ones they fixed: hallucinated diagnoses that don’t follow from the evidence, uncalibrated confidence that can’t distinguish a 90% posterior from a 30% posterior, and incomplete differentials with no guarantee of exhaustive evaluation against the knowledge base.
We’ve been building a third approach. It uses an LLM for what LLMs are good at (natural language understanding and generation) and a formal reasoner for what formal reasoners are good at (deductive closure, calibrated inference, audit trails). The LLM’s reasoning is an input to the reasoner. The reasoner never parses free text. They work together.
Architecture
The system has three layers. The LLM handles perception (clinical text to structured observations) and expression (structured results to clinical prose). A description logic reasoner handles the inference. A verification layer checks the LLM’s output against the reasoner’s conclusions.
The critical architectural decision is that the LLM never selects a concept identifier (they are often hallucinated). It extracts clinical terms in natural language — “butterfly rash,” “ANA positive,” “fatigue” — with fuzzy degrees of clinical certainty and source spans from the original text. A separate concept linking pipeline, built on biomedical sentence embeddings (SapBERT) and nearest-neighbor search against a SNOMED CT concept vocabulary index, maps each term to an IRI deterministically. “Butterfly rash” resolves to SNOMED 200938002 (Malar rash) at 0.91 cosine similarity. The IRI physically exists in the pre-built index. Fabrication is impossible.
This two-stage extraction contract — LLM produces natural language, embedding model produces vectors, concept index produces IRIs — is the primary hallucination guardrail. We call it the hallucination firewall. An LLM given a vocabulary of IRIs in its context window will occasionally select an adjacent code, fabricate a plausible-looking identifier, or silently substitute a related concept. None of these failure modes exist when the generative model never sees an identifier.
The Reasoning Engine
The reasoner is DEALER, an OWL 2 EL++ saturation-based classifier with fuzzy degree propagation. It implements the ELK algorithm — the same algorithm that classifies SNOMED CT — extended with t-norm fuzzy conjunction (Zadeh, Łukasiewicz, or crisp, configurable per deployment). For those unfamiliar with ELK-style reasoning: given a set of stated axioms and a set of patient observations (ABox assertions with fuzzy degrees), saturation applies completion rules exhaustively until fixpoint. The result is the deductive closure — every concept in the ontology that is consistent with the observations, with a graded membership degree.
This is the exhaustiveness guarantee that no LLM can provide. If aortic dissection is in the ontology and the evidence is even weakly consistent with it, it appears in the label set. An LLM might not mention it because it pattern-matched to a more common diagnosis. A saturation-based reasoner evaluates every concept by construction.
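A toy version of the saturation loop makes the mechanism concrete. The sketch below propagates fuzzy degrees through SubClassOf axioms under the Zadeh t-norm only; a real EL++ classifier applies the full completion rule set, but the run-to-fixpoint structure and the monotone growth of the label set are the same.

```python
# Minimal sketch of degree-propagating saturation, assuming only one
# completion rule (transitive SubClassOf) and the Zadeh t-norm (min).
def zadeh(a: float, b: float) -> float:
    return min(a, b)

# Stated axioms: (subclass, superclass) -> degree of the axiom itself
axioms = {("MalarRash", "Rash"): 1.0, ("Rash", "SkinFinding"): 1.0}
# Patient ABox: concept -> asserted fuzzy membership degree
labels = {"MalarRash": 0.9}

# Apply rules until nothing changes (fixpoint). Labels only ever grow,
# which is what makes incremental ABox reasoning possible later.
changed = True
while changed:
    changed = False
    for (sub, sup), deg in axioms.items():
        if sub in labels:
            d = zadeh(labels[sub], deg)
            if labels.get(sup, 0.0) < d:
                labels[sup] = d
                changed = True

print(labels)  # every entailed concept, each with a graded degree
```

Swapping `zadeh` for the Łukasiewicz t-norm, `max(0.0, a + b - 1.0)`, changes how degrees combine without changing the fixpoint machinery, which is why the t-norm can be a per-deployment configuration.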
DEALER runs on-device. It is native Rust, packaged as an XCFramework for iOS. Saturation of a clinical ontology completes in under 15 milliseconds on mobile hardware, and it works offline.
Bayesian Layer: From Fuzzy Degrees to Calibrated Posteriors
Fuzzy membership degrees are not diagnostic probabilities. “This patient matches lupus to degree 0.7” is not the same as “there is a 70% probability this patient has lupus.” Three forms of uncertainty require a probabilistic layer on top of the deductive core:
Explaining-away. Confirming lupus should reduce the posterior for fibromyalgia when both explain the same symptoms. EL++ subsumption is monotonic — it cannot express inter-diagnosis competition. A Bayesian network with shared observable nodes handles this naturally via Noisy-OR conditional probability tables.
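A two-disease toy network shows the effect. All parameters below are illustrative, not clinical; the point is that under a Noisy-OR CPT, confirming one cause of a shared symptom lowers the posterior of the competing cause.

```python
from itertools import product

# Illustrative priors and Noisy-OR causal strengths for one shared symptom.
prior = {"lupus": 0.05, "fibro": 0.10}
w = {"lupus": 0.8, "fibro": 0.7}
leak = 0.01  # probability the symptom appears with no modeled cause

def p_symptom(d: dict) -> float:
    """Noisy-OR: P(symptom present | disease states d)."""
    q = 1.0 - leak
    for disease, present in d.items():
        if present:
            q *= 1.0 - w[disease]
    return 1.0 - q

def posterior(target: str, evidence: dict) -> float:
    """Exact enumeration of P(target = 1 | symptom observed, evidence)."""
    num = den = 0.0
    for states in product([0, 1], repeat=2):
        d = dict(zip(["lupus", "fibro"], states))
        if any(d[k] != v for k, v in evidence.items()):
            continue
        joint = p_symptom(d)
        for disease, present in d.items():
            joint *= prior[disease] if present else 1.0 - prior[disease]
        den += joint
        if d[target]:
            num += joint
    return num / den

before = posterior("fibro", {})            # symptom observed, lupus unknown
after = posterior("fibro", {"lupus": 1})   # lupus confirmed
print(f"P(fibro | symptom) = {before:.3f} -> {after:.3f} once lupus is confirmed")
```

With these toy numbers the fibromyalgia posterior falls from roughly 0.62 to roughly 0.12 once lupus is confirmed, an inference that no monotonic subsumption calculus can express.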
Epistemic uncertainty. A test not yet ordered has no representation in fuzzy DL. The system must distinguish “anti-dsDNA was negative” (strong evidence against lupus) from “anti-dsDNA was not ordered” (no information). This is the three-valued evidence model: positive findings carry a fuzzy degree, explicitly negative findings carry strong negative likelihood ratios, and unobserved findings are marginalized. The open-world default is a hard constraint — absence of evidence requires an explicit “rule out” API call. Silence means unknown, not absent.
If this sounds familiar, it should. The closed-world assumption — treating “not documented” as “not present” — is one of the most well-characterized sources of diagnostic error in the CDS literature. Our system makes the opposite assumption by default, at the API level, so that building a closed-world system requires deliberate, per-finding effort.
Prior prevalence. A condition might match the symptoms well (high fuzzy degree) but be vanishingly rare. Population priors should pull the posterior down. The Bayesian layer supports three tiers of prior estimation: Tier 1 from ontology structure (uniform, adequate for demos), Tier 2 from epidemiological data (CMS claims, public health datasets), and Tier 3 from institutional encounter data (the gold standard — your patient population, your disease mix, your test characteristics). One warning we’ve learned to repeat: bibliometric data (PubMed citation counts, UMLS concept frequency) is not valid for priors. Rare diseases are systematically overrepresented in the literature.
The Bayesian layer also computes value of information — the expected entropy reduction for each unobserved test. This produces a ranked “order this test next” recommendation based on information-theoretic optimality rather than clinical heuristics. For systems managing laboratory utilization, this is operationally significant.
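The value-of-information quantity can be stated in a few lines. This is the single-diagnosis, single-test special case with made-up sensitivities and specificities; the production ranking runs over the full network, but the quantity being maximized, expected entropy reduction in bits, is the same.

```python
from math import log2

def entropy(p: float) -> float:
    """Binary entropy of a diagnosis posterior, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def expected_entropy_reduction(p: float, sens: float, spec: float) -> float:
    """Expected information gain from one test about one diagnosis.

    p: current posterior; sens/spec: test sensitivity and specificity.
    """
    p_pos = sens * p + (1 - spec) * (1 - p)    # P(test comes back positive)
    post_pos = sens * p / p_pos                # posterior if positive
    post_neg = (1 - sens) * p / (1 - p_pos)    # posterior if negative
    return entropy(p) - (p_pos * entropy(post_pos)
                         + (1 - p_pos) * entropy(post_neg))

# Rank hypothetical unordered tests (sens, spec) at a 0.4 posterior:
tests = {"anti-dsDNA": (0.6, 0.97), "ESR": (0.8, 0.5)}
ranked = sorted(tests, reverse=True,
                key=lambda t: expected_entropy_reduction(0.4, *tests[t]))
print(ranked)  # most informative test first
```

With these toy numbers the specific test outranks the sensitive-but-nonspecific one, which matches the clinical intuition that a nearly uninformative test should not be ordered first.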
SNOMED CT Integration
DEALER consumes OWL 2 Functional Syntax. SNOMED CT’s official OWL Toolkit converts RF2 distributions to .ofn files with stated axioms. The reasoner classifies these natively — it implements the same algorithm (ELK) that SNOMED International uses for their published inferred relationships.
For deployment, we don’t re-classify the full SNOMED TBox on every app launch. The stated TBox is classified once at build time (a CI step per SNOMED edition), and the saturated reasoner state is serialized to a binary snapshot (~100MB for full SNOMED). On device, the snapshot loads in under a second via memory-mapped I/O, and the reasoner is immediately ready for incremental ABox reasoning — adding patient observations without re-running saturation. The incremental path exploits monotonicity: labels and edges only grow, so existing contexts remain valid.
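Why a memory-mapped load is near-instant is worth a small illustration. The record layout below is a toy assumption, not the actual DEALER snapshot format; the point is that fixed-width records admit random access with no parsing pass, so startup cost is independent of snapshot size.

```python
import mmap
import struct

# Toy "snapshot": fixed-width (concept_id: u32, degree: f32) records.
records = [(200938002, 0.9), (84229001, 0.7)]
with open("snapshot.bin", "wb") as f:
    for cid, deg in records:
        f.write(struct.pack("<If", cid, deg))

# Reopen via mmap: pages fault in lazily as they are touched, rather than
# the whole file being read and deserialized up front.
with open("snapshot.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

cid, deg = struct.unpack_from("<If", mm, 8)  # jump straight to record 1
print(cid, round(deg, 1))
```

The same property is what makes the incremental ABox path cheap: new patient observations extend the saturated state without touching, or re-reading, the serialized TBox closure.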
For Phase 1 deployments, we recommend a domain-scoped SNOMED subset (the transitive closure of concepts relevant to the target specialty — typically 5,000 to 15,000 concepts for a rheumatology or emergency medicine pilot). This classifies on-device in milliseconds. Full SNOMED support ships in Phase 2 via the serialization path.
The concept vocabulary index for the hallucination firewall is also built from SNOMED. Each concept’s canonical label (and optionally rdfs:label/skos:altLabel synonyms) is embedded with SapBERT — a model specifically trained on UMLS synonym pairs — and stored in a vector index keyed by SNOMED IRI. This index ships as static memory-mapped files alongside the ontology. At query time, the LLM’s extracted clinical term is embedded with the same model, and nearest-neighbor search returns the closest concept. SapBERT was designed for exactly this task: placing “butterfly rash” near “malar rash” in embedding space.
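The build-time half of the firewall looks roughly like this. The `embed` function below is a deterministic toy stand-in for SapBERT inference (in production, a quantized ONNX model), and the file names are illustrative; the structural point is one index row per label or synonym, every row keyed back to its concept IRI.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy deterministic embedding standing in for SapBERT inference."""
    rng = np.random.default_rng(zlib.crc32(text.lower().encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Each concept contributes one row per canonical label and synonym.
vocabulary = {
    "http://snomed.info/id/200938002": ["Malar rash", "Butterfly rash"],
    "http://snomed.info/id/84229001": ["Fatigue"],
}

iris: list[str] = []
rows: list[np.ndarray] = []
for iri, labels in vocabulary.items():
    for label in labels:
        iris.append(iri)   # synonyms share the concept's IRI
        rows.append(embed(label))

# Persist as static files; np.load(..., mmap_mode="r") memory-maps the
# vectors at query time so the index ships alongside the ontology.
np.save("concept_vecs.npy", np.array(rows))
with open("concept_iris.txt", "w") as f:
    f.write("\n".join(iris))
```

Because "Malar rash" and "Butterfly rash" are separate rows pointing at one IRI, a nearest-neighbor hit on either surface form resolves to the same concept.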
Verification and Auditability
The verification layer runs three checks against the LLM’s unconstrained output.
Consistency. Does the LLM’s top diagnosis appear in the reasoner’s posterior ranking? Disagreement means either the LLM hallucinated or the ontology is missing a concept — both are actionable signals.
Completeness. Did the reasoner surface a diagnosis with non-negligible posterior that the LLM didn’t mention? If so, the system injects it into the output with supporting evidence from the reasoning trace. This is the catch-the-miss mechanism. In our internal testing, this is where the system adds the most clinical value — surfacing diagnoses that a pattern-matching approach would overlook because they’re atypical presentations of uncommon conditions.
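The consistency and completeness checks reduce to a few lines once both sides speak in IRIs. The function name, the prefix notation, and the 0.05 posterior floor below are illustrative assumptions, not the production interface.

```python
def verify(llm_diagnoses: list[str], reasoner_ranking: dict[str, float],
           completeness_floor: float = 0.05):
    """Check the LLM's unconstrained differential against the reasoner.

    Returns (flags, injected): consistency flags, plus reasoner diagnoses
    with non-negligible posterior that the LLM failed to mention.
    """
    flags = []
    # Consistency: the LLM's top diagnosis must appear in the ranking.
    if llm_diagnoses and llm_diagnoses[0] not in reasoner_ranking:
        flags.append(("consistency", llm_diagnoses[0]))
    # Completeness: the catch-the-miss mechanism.
    injected = [iri for iri, p in reasoner_ranking.items()
                if p >= completeness_floor and iri not in llm_diagnoses]
    return flags, injected

flags, injected = verify(
    llm_diagnoses=["sct:Lupus"],
    reasoner_ranking={"sct:Lupus": 0.62, "sct:AorticDissection": 0.08},
)
print(flags, injected)  # no flags; aortic dissection gets injected
```

Either outcome is actionable: a consistency flag points at a hallucination or an ontology gap, and an injection is exactly the missed-diagnosis case the architecture exists to catch.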
Grounding. Every claim in the final output is traceable to either the reasoning trace (for diagnostic claims and probabilities) or the source text (for extracted observations). The audit trail records the complete chain: which substring the LLM read, which clinical term it extracted, which SNOMED concept the linker matched (with similarity score), which axioms fired during saturation, which CPT weights the Bayesian network used, and which posterior probability resulted.
This trace is designed with the regulatory requirement in mind. For a Class II SaMD submission under 21 CFR 892.2050, the deterministic core (DEALER) is substantially easier to validate than a stochastic model. The audit trail supports the “valid clinical association” requirement. And the clear separation between the LLM layer (perception/expression) and the reasoning layer (DEALER/dealer-bayes) creates a clean liability boundary — extraction errors (LLM) are distinct from reasoning errors (ontology incomplete) and are identifiable in the trace.
Performance
The full pipeline, running entirely on-device with a small language model (SLM) for extraction and expression:
| Stage | Component | Latency |
|---|---|---|
| Extraction | On-device SLM (2-3B, quantized) | ~500ms |
| Concept embedding | SapBERT (quantized ONNX, ~30MB) | ~40ms (8 terms) |
| Concept linking | dealer-vector nearest-neighbor | <10ms |
| Reasoning + Bayesian | DEALER + dealer-bayes | <15ms |
| Expression | On-device SLM | ~500ms |
| Total | | ~1.1s |
When a network is available, a cloud LLM (Anthropic, OpenAI, Google, NVIDIA, etc.) replaces the on-device model for extraction and expression, improving quality at the cost of 2–5 seconds of latency. The reasoning layer — DEALER, dealer-bayes, SapBERT, and the concept index — always runs on-device. It is never cloud-dependent.
Institutional Integration
The system is designed for deployment within an existing health system infrastructure. What’s needed from the institution:
Encounter data for calibration. De-identified encounters with documented findings, ordered tests, and confirmed diagnoses. A calibration pipeline processes these into institution-specific Bayesian network parameters — priors tuned to your patient population, CPT weights reflecting your disease prevalence and test characteristics. Minimum recommended volume: 50,000 encounters across target clinical domains.
Clinical domain selection. The ontology subset, Bayesian network, and extraction prompts are domain-specific. A rheumatology pilot is a different configuration than an emergency medicine pilot. We recommend starting with one specialty where the differential diagnosis problem is well-characterized and the encounter data is available.
EHR integration surface. The system consumes clinical text and produces structured output (ranked differential with posteriors, VOI recommendations, reasoning trace). The integration point depends on the EHR: a CDS Hooks endpoint, a SMART on FHIR app, or a direct API call from a custom clinical workflow.
What We’re Measuring
Extraction quality: concept precision (>95% target), concept recall (>90%), negative extraction accuracy (>98% — false negatives are the most dangerous error), and degree calibration against clinician ratings.
System-level outcomes: completeness injection rate (how often the reasoner surfaces a diagnosis the LLM missed), posterior calibration via Brier score on confirmed cases, and diagnostic error reduction compared to an LLM-only baseline.
Operational metrics: unnecessary test reduction via VOI-guided workup, time-to-correct-diagnosis, and clinician trust ratings on explanation quality.
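For readers unfamiliar with it, the calibration metric named under system-level outcomes is just a mean squared error between predicted posteriors and confirmed outcomes; the numbers below are made up for illustration.

```python
def brier(predictions: list[tuple[float, int]]) -> float:
    """Brier score: mean squared error of probabilistic predictions.

    Each pair is (predicted posterior, confirmed outcome in {0, 1}).
    Lower is better; a constant, uninformative 0.5 scores 0.25.
    """
    return sum((p - y) ** 2 for p, y in predictions) / len(predictions)

# Three hypothetical confirmed cases:
print(round(brier([(0.9, 1), (0.2, 0), (0.7, 1)]), 4))
```

Its appeal here is that it penalizes overconfidence and underconfidence symmetrically, which is exactly the failure mode an uncalibrated LLM baseline exhibits.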
The Core Claim
An LLM is an extraordinary perception and communication system. It is not a reasoning engine. It cannot guarantee exhaustive evaluation, calibrated probabilities, explaining-away, or epistemic uncertainty handling. It cannot produce a deterministic audit trail. A description logic reasoner with a Bayesian inference layer can do all of these things, but can’t read free text or explain itself in natural language.
The neuro-logical architecture gives each component the right job. The result is a system where every failure mode of the neural component has a corresponding check in the logical component, and the logical component is deterministic, auditable, and regulatable.
We think this is what clinical decision support should look like. If you’re working on similar problems, we’d like to hear from you.
DEALER and Loxation are built by Loxation LLC.
Contact: jonathan@openhealth.org