Phenoml

Mapping a data model to SNOMED is not a lookup problem

Binding a structured data model to standard terminology isn't an afternoon of glue code against a terminology API. It's about trust. The fastest path through it runs through a tool built to do the opposite job.

Alex Goel

A lot of healthcare data work comes down to one unglamorous task: taking a data model that already exists and making every field in it point at standard terminology. The model might be a client's internal schema, a reporting form, or a registry's submission format. The requirement is the same. Before the data behind it can be aggregated, exchanged, or compared across institutions, each field has to be bound to a concept in SNOMED CT, or LOINC, or whatever vocabulary the use case demands.

It is tempting to treat this as a lookup problem, an afternoon of glue code against a terminology API. It is not, and the reason it is not is worth understanding even if you never write a line of healthcare code, because it is a clean example of a problem whose difficulty lives entirely in the requirement to be trusted rather than in the requirement to be found.

A concrete instance makes this easier to see. Here is one field from the College of American Pathologists' electronic Cancer Protocol for invasive breast carcinoma, the kind of structured reporting form a pathologist fills out on a resection specimen:

Procedure (Note A)
___ Excision (less than total mastectomy)
___ Total mastectomy (including nipple-sparing and skin-sparing mastectomy)
___ Other (specify): _________________
___ Not specified

Two enumerated choices, an "Other" with a free-text escape, and a null option: together they form a closed pick-list with a defined meaning for each line. The field sits inside a protocol that has dozens more like it: histologic type, margin status, lymph node counts, pTNM categories. CAP publishes dozens of these protocols, and the same shape shows up far beyond pathology, in any form-driven clinical data model.
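As a data structure, a field like this is small. The sketch below is one way to hold it, with an empty slot on each option for the concept it will eventually be bound to. The class and attribute names are illustrative assumptions, not part of the CAP protocol or of any tool mentioned in this post.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Option:
    label: str                        # text exactly as it appears on the form
    free_text: bool = False           # True for "Other (specify)" style escapes
    concept_id: Optional[str] = None  # filled in only once a reviewer accepts a binding


@dataclass
class PickListField:
    name: str
    options: list[Option] = field(default_factory=list)

    def unbound(self) -> list[Option]:
        """Options that still have no accepted concept."""
        return [o for o in self.options if o.concept_id is None]


procedure = PickListField(
    name="Procedure",
    options=[
        Option("Excision (less than total mastectomy)"),
        Option("Total mastectomy (including nipple-sparing and skin-sparing mastectomy)"),
        Option("Other (specify)", free_text=True),
        Option("Not specified"),
    ],
)

# Nothing is bound yet, so every option is still outstanding.
print(len(procedure.unbound()))
```

Keeping the binding as an explicit, nullable slot makes "not yet reviewed" a visible state rather than something a default value can paper over.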

Every one of those options has to be bound to a standard concept before the data is computable. "Excision (less than total mastectomy)" has to point at a SNOMED CT concept that means that procedure and not a near neighbor, and so does every other line, across every field, across every protocol. Done by hand, that is weeks of work per model. We did a lot of it at Topology, and the part that surprised us was not how to make it faster. It was that the fastest path ran through a tool built to solve the opposite problem.

The task is the inverse of what the tool is for

The tool was Construe, which extracts medical codes from clinical text. You hand it a discharge summary or a progress note, the messy natural language clinicians actually write, and it returns the ICD-10, SNOMED, CPT, LOINC, or RxNorm codes the text implies. The problem it solves is one of recall. The relevant concepts are buried somewhere in a wall of free text, and the job is to find all of them without inventing any. Binding starts from the other end. The input is a curated list of enumerated labels that someone wrote on purpose, a committee in CAP's case, a client's architect in others.

What makes binding hard is the other end of the pipe. Every mapping has to be trusted. A human, usually a coder or an informaticist, has to be able to look at a label mapped to some SNOMED concept and sign off that the mapping is correct, because that binding is going to sit underneath the data model and stay there. Get it wrong and you have silently corrupted every downstream query that touches that field. There is no wall of text to provide cover, no sense in which a near-miss is acceptable coverage. The mapping is either defensible or it is a latent bug.

So the two problems pull on a tool in opposite directions:

| | Narrative extraction | Model binding |
| Input | Unbounded free text | Finite, known label set |
| Hard part | Finding them all | Defending each one |
| Failure that hurts | Missing a concept | Binding the wrong concept, invisibly |
| What the human needs | Coverage | A reason to believe each mapping |

That last row is the whole post. For narrative extraction, the reviewer cares about coverage and can spot-check a sample. For binding, the reviewer has to interrogate every mapping, which means the tool's real job isn't to produce a code at all. Its job is to produce something a human can argue with.

How we used to do it, and why it was slow

Before any of this, the workflow was a terminology browser and a lot of patience. You take a label off the protocol, open the SNOMED CT Browser, search, read the candidate concepts, check the hierarchy to make sure you're at the right level of specificity, record the concept ID, and move to the next label. Then you repeat that a few hundred times. It is exactly as tedious as it sounds, and it is the kind of tedious that produces errors precisely because it is tedious.

The better version was to hit the SNOMED terminology server's search programmatically and get candidate concepts back without the clicking. That helped with volume, but it did nothing for the labels that needed help the most.
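For readers who have not done this, here is a minimal sketch of what a programmatic candidate search can look like. It assumes a FHIR terminology server and uses the standard ValueSet $expand operation with a text filter over the implicit "all of SNOMED CT" value set. This post does not say which server or API we actually used, and the base URL below is a placeholder. The sketch only builds the request URL and reads candidates out of a response body, so it makes no network call.

```python
from urllib.parse import urlencode

# Implicit FHIR value set that covers all of SNOMED CT.
SNOMED_ALL = "http://snomed.info/sct?fhir_vs"


def expand_url(base_url: str, label: str, count: int = 10) -> str:
    """Build a ValueSet/$expand request that text-filters SNOMED CT by a form label."""
    params = {"url": SNOMED_ALL, "filter": label, "count": count}
    return f"{base_url.rstrip('/')}/ValueSet/$expand?{urlencode(params)}"


def candidates(expansion: dict) -> list[tuple[str, str]]:
    """Pull (code, display) pairs out of a ValueSet expansion response body."""
    contains = expansion.get("expansion", {}).get("contains", [])
    return [(c["code"], c["display"]) for c in contains]


# Placeholder server address; substitute your own terminology server.
url = expand_url("https://terminology.example.org/fhir", "Total mastectomy")
print(url)
```

The result is a list of candidates and nothing else: no indication of why one candidate fits the label better than its neighbors, which is the part a reviewer needs.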

The labels that broke it were the terse ones and the shorthand. Clinical reporting forms are not written in full sentences. They are written in the compressed, abbreviated, acronym-heavy register that clinical documents actually use, and semantic search on a terminology server is tuned to match concept descriptions. A four-word label or a bare acronym often doesn't give the search enough to land on the right concept, or it lands confidently on a plausible wrong one. None of that is a knock on the terminology server, which was built to answer "what concept matches this description." A form label is simply a different kind of input than the descriptions it was optimized against, and we kept falling into that gap on exactly the fields that were hardest to bind by hand anyway.

Pointing Construe at the same labels closed that gap. Initially we were worried that Construe would not perform well, given that its purpose was to read full sentences or reports. However, the terse forms and acronyms that the terminology search misread were exactly the kind of input Construe was built to read, since a form label is just a very short instance of the compressed clinical text it handles all day. Where the search needed a fuller description to land on the right concept, Construe resolved the abbreviation from the little context the label carried.

The output has to be reviewable

The thing binding needs from a tool is output a human can check. Construe's response is shaped for exactly that. Each code comes back with a citation and a rationale: the citation is the span of input text that drove the binding, and the rationale is the reason for choosing this concept over another. The API also returns ancestor codes, so you can see where a binding sits in the SNOMED hierarchy without a second round trip.
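To make that shape concrete, here is a sketch of a record that carries a binding together with the evidence a reviewer needs. The key names (`code`, `citation`, `rationale`, `ancestors`) are assumptions chosen for illustration and are not the actual field names of Construe's response; the code values are placeholders, not SNOMED CT concept IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewableBinding:
    label: str                  # the form label that was sent
    code: str                   # the concept the tool chose
    citation: str               # span of the label that drove the choice
    rationale: str              # why this concept rather than a neighbor
    ancestors: tuple[str, ...]  # where the concept sits in the hierarchy


def from_response_item(label: str, item: dict) -> ReviewableBinding:
    """Build a record from one response item (key names are hypothetical)."""
    return ReviewableBinding(
        label=label,
        code=item["code"],
        citation=item["citation"],
        rationale=item["rationale"],
        ancestors=tuple(item.get("ancestors", [])),
    )


def citation_is_grounded(b: ReviewableBinding) -> bool:
    """A citation is a checkable claim: it should literally appear in the label."""
    return b.citation in b.label


def render_for_review(b: ReviewableBinding) -> str:
    """Lay out the binding as an argument a coder can accept or reject."""
    return (
        f"Label:     {b.label}\n"
        f"Bound to:  {b.code}\n"
        f"Because:   {b.rationale}\n"
        f"Cited:     {b.citation!r}\n"
        f"Ancestors: {' > '.join(b.ancestors)}"
    )


example = from_response_item(
    "Total mastectomy (including nipple-sparing and skin-sparing mastectomy)",
    {
        "code": "PLACEHOLDER-CHILD",
        "citation": "Total mastectomy",
        "rationale": "The listed variants are subtypes of total mastectomy.",
        "ancestors": ["PLACEHOLDER-PARENT", "PLACEHOLDER-ROOT"],
    },
)
print(render_for_review(example))
```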

Several end-users asked us why the tool did not return confidence scores. For narrative extraction a score is a reasonable thing to want, but for binding it is the wrong primitive. A confidence score is unfalsifiable to a reviewer. If a tool tells you "this is procedure X, confidence 0.87," there is nothing for a human to do with the 0.87. They can't check it, they can't argue with it, and they can only set a threshold and hope the threshold is calibrated, which it never quite is. The number offloads the decision onto a statistic the reviewer has no way to interrogate.

A rationale and a citation are claims, and claims can be checked. "I bound this to total mastectomy because the label specified nipple-sparing and skin-sparing variants, which are subtypes of total mastectomy" is something a coder can read and either accept or reject on the merits. The ancestor chain lets them confirm the specificity is right, that you bound to the concept you meant and not its parent or its child. The reviewer is no longer trusting a number. They are auditing an argument.
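The specificity check in particular is mechanical once the ancestor chains are in hand. A minimal sketch, assuming each concept comes with a list of its ancestor codes; the identifiers below are readable placeholders, not SNOMED CT concept IDs, and the function is an illustration rather than anything Construe provides.

```python
from enum import Enum


class Specificity(Enum):
    EXACT = "bound concept is the intended one"
    TOO_SPECIFIC = "bound concept is a descendant of the intended one"
    TOO_GENERAL = "bound concept is an ancestor of the intended one"
    UNRELATED = "neither concept is an ancestor of the other"


def check_specificity(
    bound: str,
    bound_ancestors: list[str],
    intended: str,
    intended_ancestors: list[str],
) -> Specificity:
    """Compare the concept that was bound against the concept the reviewer meant."""
    if bound == intended:
        return Specificity.EXACT
    if intended in bound_ancestors:
        return Specificity.TOO_SPECIFIC
    if bound in intended_ancestors:
        return Specificity.TOO_GENERAL
    return Specificity.UNRELATED


# Placeholder hierarchy mirroring the example in the text.
ANCESTORS = {
    "mastectomy": ["procedure"],
    "total-mastectomy": ["mastectomy", "procedure"],
    "nipple-sparing-mastectomy": ["total-mastectomy", "mastectomy", "procedure"],
}

result = check_specificity(
    "nipple-sparing-mastectomy",
    ANCESTORS["nipple-sparing-mastectomy"],
    "total-mastectomy",
    ANCESTORS["total-mastectomy"],
)
print(result.value)
```

The check says where a binding sits relative to the concept the reviewer had in mind. Deciding which concept that should be remains the reviewer's call.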

What a reviewer can do with each kind of output. A confidence score terminates in a threshold the reviewer can't interrogate; a citation and rationale terminate in a claim the reviewer can accept or reject on the merits.

That is the difference between a tool that produces output and a tool that produces reviewable output, and for a task where every mapping has to clear a human, it is critical.

There is a config surface that makes this work in practice. The extract endpoint takes a config object, and the two knobs we reached for most were context relevance and consistency effort. Binding is a low-cardinality problem. A given model has a finite set of labels and tends to share a register throughout, so there are very few configurations to tune and tuning them is cheap. Narrative extraction is high-cardinality and high-recall, so per-input tuning doesn't scale and you live with general settings. Binding is low-cardinality and high-trust, so per-model tuning is both affordable and worth doing. The same tool rewards opposite operating strategies depending on which problem you point it at, and most of the leverage we got came from recognizing which problem we actually had.
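Per-model tuning can be as simple as a small table of overrides keyed by data model. The sketch below shows that pattern only. The key names and values are hypothetical stand-ins for the two knobs named above; they are not the real parameter names or accepted values of the extract endpoint's config object, so check the docs before copying anything.

```python
# Hypothetical defaults: key names and values are placeholders, not the real API.
DEFAULT_CONFIG = {
    "context_relevance": "default",
    "consistency_effort": "default",
}

# One small override per data model. Binding has few models and each shares a
# register throughout, so this table stays short and is cheap to maintain.
MODEL_OVERRIDES = {
    "cap-breast-invasive": {"consistency_effort": "high"},
}


def config_for(model_name: str) -> dict:
    """Merge a model's overrides on top of the defaults."""
    return {**DEFAULT_CONFIG, **MODEL_OVERRIDES.get(model_name, {})}


print(config_for("cap-breast-invasive"))
print(config_for("some-untuned-model"))
```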

What it did not replace

A coder still reviews the output. That has not changed, and I don't expect it to, because the trust requirement that defines this problem is the one thing you cannot delegate to a model.

What changed is the shape of the coder's day. We put an LLM-as-judge pass between Construe and the human, a triage layer that reads the citation and rationale and flags the bindings that warrant a closer look, so the reviewer's attention lands where it's needed instead of spreading evenly across hundreds of mappings that are mostly fine. The judge does not make the call. It sorts the queue. The human still makes every call that matters. We were even able to build small apps to help the coders with their review, or export the bindings to the most common mapping tool of all: Microsoft Excel!
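The triage step is a sort, not a filter, and a sketch makes that distinction visible. The judge below is a trivial rule-based stand-in so the example runs on its own; in our pipeline that role is played by an LLM, whose prompt and interface are not shown here. The row keys are illustrative assumptions. The CSV export uses Python's standard csv module and produces a file Excel can open.

```python
import csv
import io
from typing import Callable

FIELDS = ["label", "code", "citation", "rationale", "flagged"]


def triage(bindings: list[dict], judge: Callable[[dict], bool]) -> list[dict]:
    """Annotate every binding with the judge's flag and put flagged ones first.

    Nothing is dropped: the judge orders the queue, the human reviews all of it.
    """
    annotated = [{**b, "flagged": judge(b)} for b in bindings]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(annotated, key=lambda b: not b["flagged"])


def stand_in_judge(b: dict) -> bool:
    """Placeholder for the LLM judge: flag empty rationales or ungrounded citations."""
    return not b["rationale"].strip() or b["citation"] not in b["label"]


def to_csv(rows: list[dict]) -> str:
    """Serialize the review queue as CSV text."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


queue = triage(
    [
        {"label": "Not specified", "code": "PLACEHOLDER-A",
         "citation": "Not specified", "rationale": "Explicit null option."},
        {"label": "Total mastectomy", "code": "PLACEHOLDER-B",
         "citation": "partial", "rationale": ""},
    ],
    stand_in_judge,
)
print(to_csv(queue))
```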

The honest framing is that Construe turned terminology binding from a task a person does into a task a person reviews. That is a real change in kind, not just in speed, but it is a smaller claim than "we automated it," and the smaller claim is the true one. A reviewer with a good assistant is faster and less error-prone than a reviewer working alone, but the reviewer is not optional.

Try it on your own data model

You don't need our pipeline to feel the distinction in this post. The fastest way to see it is to send Construe the labels off a model you already work with and read what comes back, not the codes but the citations and rationales attached to them. You can do that directly from the console: sign up, grab credentials, and point extract at your own form labels. You can also follow our guide on building on Construe and configure it to your needs.

The inputs that teach the most are the ones that used to break a terminology search. A few worth trying:

Each of those turns the argument above from a claim into something you can check on your own data. The docs cover the full config surface for Construe, including the upload endpoint for the organization-specific vocabularies most healthcare companies turn out to have. If you want to see the input side of the example in this post, the CAP cancer protocols are public, and reading one will tell you more about why structured clinical reporting is hard than any architecture diagram could.

The question I keep coming back to is whether binding deserves its own tool at all. Everything above is an account of using a narrative extractor off-label and getting real leverage out of it. The citation, the rationale, the ancestor navigation, and the cheap per-model config all turned out to be the right primitives for binding, but nobody designed them for binding. They happen to fit, and "happens to fit" is a foundation to build on. In the future, I can see us building on it to enhance our data mapping capabilities.

For now the extractor does the job, and does it well enough that the weeks-long version of this work is hard to imagine going back to. But the fact that the inverse problem rides so comfortably on a tool built for the forward one is telling us something about how thin the line between extraction and binding really is, and whether it is a line worth drawing at all.