
Definition of Done

HTS MCP Team · April 8, 2026 · 3 min read

retrieval-quality · classification · semantic-graph · llm-agents · completeness

Most LLM systems stop searching before the search is actually finished.


The problem with "looks right"

LLMs are very good at producing answers that feel complete before the underlying work actually is.

A model retrieves a few chunks. One of them matches closely. It reasons from that match, produces a confident answer, and moves on. For summarization or lightweight Q&A, that's usually fine.

Classification is different.

In classification, a relevant result is not the same thing as a sufficient result. The top semantic match can be strong and still miss the actual competing alternative. A hybrid search stack can feel sophisticated and still leave a critical blind spot. The model sounds certain. The retrieval process behind it never was.

The gap between "found something plausible" and "actually finished looking" is where expensive mistakes live.

Why classification is harder than it looks

Classification is not open-ended generation. It is constrained decision-making inside a bounded space.

The question is not "can you find something relevant?" It is "can you justify this choice against the alternatives that almost fit?"

That means retrieval has to support exclusion, not just inclusion. The system has to find not only what matches, but what could be confused with what matches. And it has to do this reliably, not just when the embedding space happens to cluster the right candidates together.

Vector search is optimized to find what looks most similar to the query. Classification requires something stricter: confidence that the system also examined the candidates that could be confused with the top match. Even a very good retrieval stack doesn't naturally answer the question: what set of alternatives did we fail to examine?

The definition of done

This is the part most systems skip entirely.

For classification tasks like HTS tariff coding, "done" doesn't mean the model returned a code. It means the system has:

  • Identified the likely path through the classification hierarchy
  • Expanded to nearby and competing branches
  • Surfaced adjacent or commonly confused codes
  • Inspected exclusions, sibling relationships, and bridges across categories
  • Compared candidates before committing to one

That is a much harder standard than "top-k looked good." It's also a much more honest one.

A plausible answer is not the same thing as a complete search. A system that can't show you what it considered and what it ruled out hasn't finished its job. It just stopped early.
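
One way to make the checklist above enforceable is to track each step explicitly and refuse to call a search finished while any step is missing. The following is a minimal Python sketch under stated assumptions, not a description of how HTS MCP is implemented; `SearchState`, its field names, `missing_steps`, and `is_done` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class SearchState:
    """Hypothetical record of which checklist steps a search has completed."""
    path_identified: bool = False
    competing_branches_expanded: bool = False
    confusable_codes_surfaced: bool = False
    exclusions_inspected: bool = False
    candidates_compared: bool = False


def missing_steps(state: SearchState) -> List[str]:
    """Return the names of checklist steps not yet completed, in field order."""
    return [f.name for f in fields(state) if not getattr(state, f.name)]


def is_done(state: SearchState) -> bool:
    """'Done' means every step is complete, not merely that a code was returned."""
    return not missing_steps(state)


# A search that found a plausible path and then stopped is not done.
early_stop = SearchState(path_identified=True)
print(missing_steps(early_stop))
print(is_done(early_stop))
```

The design choice worth noting is that the stopping condition is a property of the search record, not of the model's confidence in its answer.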

Why we built a semantic graph

We didn't build a graph just to add another search method. We built it because completeness requires structure.

Embeddings give you proximity. A graph gives you the neighborhood.

With a graph, the system can walk the hierarchy. It can find siblings. It can cross from one branch to another where the classification logic overlaps. It can expand around a candidate in a controlled way, not just retrieve whatever the vector space happened to surface.

The graph turns completeness from a vague aspiration into something the system can actually operationalize. You can define what "nearby" means. You can ensure the agent inspects the alternatives that matter, not just the ones that happened to embed closely.
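
To illustrate what "defining nearby" can look like, here is a small Python sketch of a controlled expansion around one candidate. It assumes a toy hierarchy stored as a child-to-parent map plus a separate map of cross-branch links. The labels (`A.1`, `B.2`, and so on) are placeholders rather than real HTS codes, and `PARENT`, `BRIDGES`, `ancestors`, `siblings`, and `neighborhood` are hypothetical names, not part of any published API.

```python
from typing import Dict, List, Optional

# Toy hierarchy with placeholder labels (not real HTS codes): child -> parent.
PARENT: Dict[str, Optional[str]] = {
    "A": None, "B": None,
    "A.1": "A", "A.2": "A", "A.3": "A",
    "B.1": "B", "B.2": "B",
}

# Hypothetical cross-branch links where classification logic overlaps.
BRIDGES: Dict[str, List[str]] = {"A.1": ["B.2"], "B.2": ["A.1"]}


def ancestors(code: str) -> List[str]:
    """Walk up the hierarchy from a code to its root."""
    chain: List[str] = []
    parent = PARENT.get(code)
    while parent is not None:
        chain.append(parent)
        parent = PARENT.get(parent)
    return chain


def siblings(code: str) -> List[str]:
    """Codes that share a parent with the candidate (roots have none here)."""
    parent = PARENT.get(code)
    if parent is None:
        return []
    return sorted(c for c, p in PARENT.items() if p == parent and c != code)


def neighborhood(code: str) -> Dict[str, List[str]]:
    """Controlled expansion: hierarchy path, siblings, and cross-branch bridges."""
    return {
        "ancestors": ancestors(code),
        "siblings": siblings(code),
        "bridges": sorted(BRIDGES.get(code, [])),
    }


print(neighborhood("A.1"))
```

Unlike a top-k similarity query, the result here is determined by the structure around the candidate, so the set of alternatives to inspect is the same no matter how the input happened to embed.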

That changes what the agent can show you when it's finished.

What agents should be able to show

An agent that classifies something should be able to explain more than its answer. It should be able to show:

  • Where it searched
  • What alternatives it considered
  • Why those alternatives were ruled out
  • Why it stopped searching

That's a research trail. And it's the difference between an agent that gives you a code and an agent that gives you a defensible classification.

In HTS classification, that might mean showing the selected code, the sibling codes that were reviewed, the adjacent branch that was explored, and the exclusion logic that ruled those candidates out.
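
A research trail of this kind can be represented as plain data and checked mechanically. The Python sketch below is one possible shape, offered as an assumption rather than as HTS MCP's actual format: `ResearchTrail`, `RuledOut`, `unexplained`, and `is_defensible` are hypothetical names, and the codes and reasons are placeholders, not real HTS entries.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RuledOut:
    code: str
    reason: str  # the exclusion logic, in words


@dataclass
class ResearchTrail:
    selected: str
    searched: List[str]        # where the agent looked
    ruled_out: List[RuledOut]  # alternatives considered, with reasons
    stop_reason: str           # why the search ended

    def unexplained(self, required: Iterable[str]) -> List[str]:
        """Required alternatives neither selected nor ruled out with a reason."""
        explained = {r.code for r in self.ruled_out if r.reason.strip()}
        explained.add(self.selected)
        return sorted(set(required) - explained)

    def is_defensible(self, required: Iterable[str]) -> bool:
        """Defensible only if a stop reason exists and nothing is unexplained."""
        return bool(self.stop_reason.strip()) and not self.unexplained(required)


# Placeholder example: the trail claims to be complete, but the check disagrees.
trail = ResearchTrail(
    selected="A.1",
    searched=["A", "B"],
    ruled_out=[RuledOut("A.2", "placeholder: excluded by a note on branch A")],
    stop_reason="all siblings and bridges reviewed",
)
print(trail.unexplained(["A.2", "A.3", "B.2"]))
print(trail.is_defensible(["A.2", "A.3", "B.2"]))
```

The stated stop reason is checked against the recorded exclusions instead of being trusted on its own, which is what makes the trail auditable.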

We're not trying to guarantee omniscience. No retrieval system examines everything. But there's a wide gap between "we retrieved five chunks and picked the best one" and "we systematically explored the decision neighborhood and ruled out the meaningful alternatives."

The first one is a guess with citations. The second one is a process you can audit.

The question to ask any classification system

Next time you evaluate an AI classification tool, don't just ask whether it got the right answer. Ask it to show you what else it considered.

Ask which alternatives it evaluated. Ask why it ruled them out. Ask what would have had to be different about the input for the answer to change.

If the system can't answer those questions, it didn't finish the work. It just stopped early and sounded confident.

That's the standard we built HTS MCP to meet. Not just plausible retrieval. Defensible completion.


The real question for an LLM system isn't "did it find something?" It's "how do we know it looked enough?"