Construe: extracting medical codes without building on quicksand

Construe extracts medical codes. When I explain it to someone new, the response is usually some version of "Ok so for billing, right?" Billing is just one use case; medical codes quietly run almost all of healthcare.

We originally built construe because we observed LLMs hallucinating medical codes in generated FHIR resources for our first API, lang2FHIR. LLMs are trained on next-token prediction, where the tokens are words or sub-parts of words. Medical codes (for the most part) do not behave like words or human language. They are alphanumeric combinations whose format is unique to each terminology system. They can convey different levels of hierarchy, they change from version to version, and some code systems are restricted by licensure.

While some developers assume that Claude or ChatGPT should have access to the entire internet, the reality is that medical code systems are often gated behind licensure portals, can change frequently (new drugs get released every week), and use character combinations that often do not follow a predictable pattern an LLM can learn.

But why, the reader asks, do I get pretty reasonable codes when I ask Claude? LLMs will do a decent job at predicting codes that occur very frequently in the datasets they've been trained on, namely the broader internet. This covers codes like E11 (ICD-10 for type 2 diabetes) or 2339-0 (LOINC for blood glucose). It doesn't cover HP:0008947 (the HPO code for floppy infant, a phenotype common in NICU infants) or NAACCR site-specific codes for cancer registry submission. Whether these codes are right or wrong can be the difference between materially helping a clinician with a decision and stalling them, potentially affecting a life-saving clinical choice.

V0: RAG as an API

We first built construe as a relatively straightforward RAG-as-an-API system. We embedded the text descriptions of medical codes and stored the embeddings in a SQLite database on our backend service. Construe consisted of the following configurable processes: chunking, embedding lookup, and validation. The pipeline looked like this:

Construe V0 pipeline — clinical text is chunked, embedded, vector-searched against code descriptions, then LLM-validated to produce extracted codes.

When a user made an API request to extract codes, the request text was chunked, or split into smaller parts, according to their selected chunking method. A long clinical note could be chunked by sentences, paragraphs, or detected topics; a short note, like a message from a patient, might not be chunked at all. The chunked text snippets were then embedded using an LLM embedding service. Once the embeddings were generated, we identified the closest description embeddings in the code system named in the user's API request. This search could be narrowed or tuned based on how similar or dissimilar the detected codes were. Some users prefer highly specific codes for their use case; others tell us that their end-user clinicians prefer higher-level, more generic codes that are more familiar.
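To make the chunk, embed, and search steps concrete, here is a minimal sketch. It is illustrative only and not construe's implementation: the three-row code table, the function names, and especially the bag-of-words `embed` function (a toy stand-in for a real embedding service) are all assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical code table: (code, description). Real terminologies are far larger.
CODES = [
    ("E11", "type 2 diabetes mellitus"),
    ("I10", "essential primary hypertension"),
    ("J45", "asthma"),
]

def chunk_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding service: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(chunk: str, top_k: int = 1) -> list[tuple[str, float]]:
    """Return the top_k codes whose descriptions are closest to the chunk."""
    query = embed(chunk)
    scored = [(code, cosine(query, embed(desc))) for code, desc in CODES]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

note = "Patient has type 2 diabetes. Asthma is well controlled."
results = {chunk: search(chunk) for chunk in chunk_sentences(note)}
```

With a real embedding model the scores would reflect meaning rather than shared tokens, but the shape of the loop (one query per chunk, nearest descriptions per query) stays the same.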

We also exposed a similarity threshold so customers could de-dupe near-duplicate codes - useful when, for example, an end-user clinician prefers one parent code instead of three closely related children. Customers could define the level of de-duplication they preferred. After this step, validation was performed on the detected codes against the source text. Validation was essentially LLM-as-a-judge: deciding whether or not a code was truly supported by the source text.
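A threshold-based de-dupe can be sketched as a greedy pass over candidates ordered by score. This is an assumption-laden illustration rather than our production logic: `dedupe`, the candidate tuples, the scores, and the token-overlap `similarity` function (standing in for embedding similarity) are all hypothetical, and this version keeps the highest-scoring code rather than preferring a parent.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a: str, b: str) -> float:
    """Token overlap (Jaccard) as a stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def dedupe(candidates, threshold: float):
    """Greedy de-dupe over (code, description, score) tuples.

    Walk candidates from highest to lowest score and keep one only if it is
    less similar than `threshold` to everything already kept.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if all(similarity(cand[1], k[1]) < threshold for k in kept):
            kept.append(cand)
    return kept

# Scores are made up for illustration.
candidates = [
    ("E11.9", "type 2 diabetes mellitus without complications", 0.91),
    ("E11.8", "type 2 diabetes mellitus with unspecified complications", 0.88),
    ("I10", "essential primary hypertension", 0.80),
]

strict = [c[0] for c in dedupe(candidates, threshold=0.5)]
loose = [c[0] for c in dedupe(candidates, threshold=0.9)]
```

A lower threshold collapses more siblings into one code; a higher threshold keeps them apart.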

Notably, validation needed to differ depending on the code system of interest. For example, medication codes are highly specific to dosage and formulation. Drug A at a 10 mg dosage is clinically very different from Drug A at a 20 mg dosage. A simple LLM validation might assume that, because the clinical text is not written specifically enough, it's ok to choose one over the other; in clinical AI that's just not true. So validation is configurable at this last stage.
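One way to picture dosage-sensitive validation is a deterministic guard that refuses a medication code unless the source text states the same strength. The sketch below is an assumption, not a description of construe's validators: the regex, the unit list, and the `dose_supported` helper are hypothetical, and "Drug A" is a placeholder rather than a real terminology entry.

```python
import re

# Matches strengths such as "10 mg", "0.5mg", "250 mcg".
STRENGTH = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml)\b", re.IGNORECASE)

def strengths(text: str) -> set[tuple[float, str]]:
    """Extract (amount, unit) pairs mentioned in the text."""
    return {(float(amount), unit.lower()) for amount, unit in STRENGTH.findall(text)}

def dose_supported(source_text: str, code_description: str) -> bool:
    """True only if every strength in the code description appears in the source.

    A description with no strength has nothing to check and passes.
    """
    return strengths(code_description) <= strengths(source_text)

ok = dose_supported("Continue Drug A 10 mg daily.", "Drug A 10 mg oral tablet")
wrong_dose = dose_supported("Continue Drug A 10 mg daily.", "Drug A 20 mg oral tablet")
unspecified = dose_supported("Continue Drug A daily.", "Drug A 10 mg oral tablet")
```

The third case is the important one: when the note does not state a strength, the guard rejects the dose-specific code instead of guessing.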

We also support customers uploading their own custom coding system, so a customer who has their own medical terminology or set of codes doesn't need to start from scratch and can use construe to immediately start extracting codes.

What V0 taught us

We learned a lot from our first release of construe. Customers found the API extremely helpful for a variety of use cases, from powering clinical decision support to detecting medical concepts in patient chat messages. But we heard feedback that latency was too high and that the outputs were non-deterministic. Clinical end-users told us that the codes would occasionally change run-to-run, which was confusing. We added citations so that every extracted code now carries the exact sentence and its byte offsets back into the source text. Customers could audit not just which codes came back but why. We also moved off the embedded SQLite store we'd shipped V0 on and onto a hosted vector database with native vector search, which let us run searches concurrently per chunk and cut latency.
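As an illustration of what a sentence-level citation with byte offsets can look like, here is a minimal sketch. The field names and the `sentence_citations` helper are hypothetical, UTF-8 is an assumed encoding, and the naive splitter would break on decimals such as "38.5", so treat it as a picture of the offset arithmetic only.

```python
import re

def sentence_citations(text: str) -> list[dict]:
    """Return each sentence with its UTF-8 byte offsets into `text`."""
    citations = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        sentence = raw.strip()
        if not sentence:
            continue
        # Skip leading whitespace so the offsets point at the sentence itself.
        leading = len(raw) - len(raw.lstrip())
        char_start = match.start() + leading
        # Byte offsets differ from character offsets once non-ASCII text appears.
        byte_start = len(text[:char_start].encode("utf-8"))
        byte_end = byte_start + len(sentence.encode("utf-8"))
        citations.append(
            {"sentence": sentence, "byte_start": byte_start, "byte_end": byte_end}
        )
    return citations

note = "Fever of 39°C today. Glucose elevated."
citations = sentence_citations(note)

# Each citation can be checked by slicing the encoded source text.
encoded = note.encode("utf-8")
round_trips = [
    encoded[c["byte_start"]:c["byte_end"]].decode("utf-8") == c["sentence"]
    for c in citations
]
```

The degree sign takes two bytes in UTF-8, which is why the second sentence starts at byte 22 rather than character 21.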

V0 taught us that latency and determinism were governed almost entirely by the foundation models. We'd built on Gemini for the function-calling quality, latency, and cost. However, as customers pushed construe deeper into clinician-facing workflows, model variance became our dominant problem.

V1: hybrid ML/LLM

We began R&D efforts to explore alternative machine-learning-based approaches to concept detection, melding traditional ML and modern LLM techniques into an updated pipeline. The majority of our team has implemented traditional ML in high-stakes and regulated industries from education to healthcare, so making this shift was easier than any of us expected.

I'll admit, as an ML practitioner for over 10 years, LLMs had gotten so good I'd started to wonder whether all those years of ML experience were for naught. Turns out: quite the contrary. Traditional ML methods can be extremely complementary to LLMs. We use LLMs as the orchestrator and for embeddings, and ML for fast, deterministic concept detection.

We updated our construe pipeline to incorporate traditional clinical NLP and ML approaches as complementary configurations. Customers today can mix and match to achieve their desired outcomes. The current pipeline looks like this:

Construe V1 pipeline — chunks run embed/search/filter in parallel, merge with global dedupe and ancestor enrichment, then pass N validation rounds whose consensus vote returns the extracted codes; a fast phenotype recognizer can short-circuit the whole pipeline.
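The fan-out and merge stages in that diagram can be sketched as follows. Everything here is an assumption for illustration: `process_chunk` is a hypothetical stand-in for embed/search/filter, the two-entry lookup is made up, and `icd10_ancestors` uses simple prefix truncation, which suits ICD-10-style codes but not terminologies whose hierarchy is not encoded in the code string.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk: str) -> list[str]:
    """Hypothetical stand-in for embed/search/filter on a single chunk."""
    lookup = {"diabetes": "E11.65", "hypertension": "I10"}
    return [code for term, code in lookup.items() if term in chunk.lower()]

def icd10_ancestors(code: str) -> list[str]:
    """Ancestors by prefix truncation, e.g. E11.65 -> E11.6 -> E11."""
    ancestors = []
    while len(code) > 3:
        code = code[:-1].rstrip(".")
        ancestors.append(code)
    return ancestors

def run(chunks: list[str]) -> list[dict]:
    # Chunks are independent, so they can be processed concurrently;
    # pool.map returns results in input order, which keeps output stable.
    with ThreadPoolExecutor() as pool:
        per_chunk = list(pool.map(process_chunk, chunks))
    # Global dedupe across chunks, preserving first-seen order.
    merged = list(dict.fromkeys(code for codes in per_chunk for code in codes))
    return [{"code": c, "ancestors": icd10_ancestors(c)} for c in merged]

extracted = run([
    "Type 2 diabetes with hyperglycemia.",
    "History of hypertension and diabetes.",
])
```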

A key lever in that pipeline is chunking. We now expose seven methods across four families:

Construe chunking methods organized into four families — rule-based, LLM-based, ML-based, and full extraction — with the specific strategies offered under each.

Three validation strategies pair with these: skip, deterministic Jaccard re-ranking against the source span, and LLM judgment with optional N-round consensus voting that only returns codes passing every round unanimously.
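Two of these strategies are easy to picture in code. The sketch below is hedged throughout: the whitespace-insensitive tokenization, the tie-break on code, and the `judge` callable (a stand-in for an LLM call) are assumptions, not construe's actual interfaces.

```python
import re
from typing import Callable

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the union."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def rerank(span: str, candidates: list[tuple[str, str]], min_score: float = 0.0):
    """Score (code, description) pairs against the source span.

    Sorting by score, then by code, gives the same order on every run.
    """
    scored = [(code, jaccard(span, desc)) for code, desc in candidates]
    scored = [pair for pair in scored if pair[1] >= min_score]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

def consensus(codes: list[str], judge: Callable[[str, int], bool], rounds: int):
    """Keep only codes the judge accepts in every one of `rounds` rounds."""
    return [c for c in codes if all(judge(c, r) for r in range(rounds))]

ranked = rerank(
    "type 2 diabetes",
    [("E10", "type 1 diabetes mellitus"), ("E11", "type 2 diabetes mellitus")],
)

def flaky_judge(code: str, round_index: int) -> bool:
    """Simulated judge: always accepts E11, accepts I10 only on even rounds."""
    return code == "E11" or round_index % 2 == 0

stable = consensus(["E11", "I10"], flaky_judge, rounds=3)
```

The consensus step shows why unanimity helps: a borderline code that flips between rounds never reaches the output, so what remains is stable even though each individual judgment is not.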

For high-stakes deployments, customers can compose a fully deterministic configuration - ML-based chunking plus Jaccard validation. In their realistic clinical evals, customers who use these updated configurations have seen a 10x improvement in latency (with some configurations achieving sub-100 ms latency) and 100% deterministic results.

Switching to a hybrid ML/LLM architecture does come with new challenges. We're managing more of the direct ML infrastructure now, optimizing GPU performance ourselves rather than offloading to the foundation models. But with token prices increasing, this is becoming more and more economically worth it versus relying on foundation models, which is increasingly like building on quicksand.

"Couldn't I just build this myself?"

A question that comes up often: "couldn't I just build this myself?" Honestly, yes - our customers are talented engineers, and the majority of them could build the original version of construe themselves. Many of them with ML experience could likely build our updated version too.

What we hear from them is that they just don't want to maintain it or tune it themselves. It's not the best use of their product or eng resources.

In other cases, customers did build it themselves and then came to us to switch off their self-rolled implementation because the maintenance burden had become untenable.

That's our strategy as healthcare AI infrastructure: to be the experts on this stuff so our customers don't have to. Handling the nuances of code system licensure, version drift, hierarchy resolution, dosage-sensitive validation, deterministic ensembling, clinical NLP methods, and citation byte offsets is specialist work, and it's a lot of context to contend with along with, you know, the actual clinical product you're trying to ship.

Our customers are building decision support, intake tools, registry submissions, and a dozen other things on top of construe. They shouldn't also have to become experts in clinical ML, ICD-10 ancestor enrichment, or LOINC version handling to get there.

We've built construe with a high degree of configurability, enabling developers to tune exactly the configuration they want for their use case. A developer building an application for billing code extraction might use one configuration; a developer building clinical decision support for rare disease would use another. Some customers have even ensembled multiple construe configurations to fit their specific use case. It's a tough balancing act for us at PhenoML to ensure that construe can meet a variety of use cases without becoming cumbersome, so we work hard to maximize developer ergonomics.

What's next

In the near term: streaming, and ready-to-use high-quality clinical note evals. Longer term: easier tuning to identify your ideal configuration, out-of-the-box configurations for common use cases, and continued investment in developer ergonomics.

More broadly, the hybrid ML/LLM approach we ended up with for construe is informing how we build across the rest of the platform. We treat non-determinism as a first-class engineering problem instead of a model-quality problem to be solved by the next foundation model release. Some companies building in clinical AI just wait for the models to get better. We'd rather build the parts that don't depend on that.

Configurability at a glance

Chunking method	Citations?	LLM in chunking?	Notes
sentences	yes	no	rule-based; preserves source byte offsets for citations
none	no	no	whole text as one chunk
paragraphs / topics / SOAP	no	yes	LLM segments by semantic structure
clinical NER	yes	no	ML model tags PROBLEM / TEST / TREATMENT entities
fast phenotype recognizer	yes	no	resolves codes inline; bypasses embed/search/validate

Validation method	Deterministic?	LLM in validation?	Notes
none	yes	no	return all candidate codes
Jaccard similarity	yes	no	re-rank against the source span
LLM judge	no	yes	model decides relevance per code
LLM + consensus	yes*	yes	N rounds, keep only codes that pass every round unanimously

*Deterministic to the threshold - borderline codes that flip between rounds are filtered out.