AI Passed the Customs Broker Exam. The Important Part Wasn't the AI.
HTS MCP Team · May 20, 2026 · 10 min read
I had two Claude models take the April 2025 Customs Broker License Examination, a notoriously difficult professional exam in U.S. trade compliance.
Same exam. Same MCP tools. Same structured knowledge layer. Different model tiers.
Both passed comfortably. The cheaper model scored slightly higher.
| Model | Score | Cost Tier |
|---|---|---|
| Claude Sonnet 4.6 | 76/80 (95.0%) | $$ |
| Claude Opus 4.6 | 75/80 (93.8%) | $$$ |
That result says something important about domain AI: in systems built on authoritative reference material, most of the capability comes from the knowledge layer and tool interface, not the model tier. A structured knowledge layer took two frontier models that I'd expect to fail on their own and made them high scorers on a real professional licensing exam.
This post walks through what I tested, what the models had access to, what the results were, and what I think it means for how we should be building domain-specific AI systems. It also covers what the results don't prove, because that matters too.
The exam.
The Customs Broker License Examination is administered by U.S. Customs and Border Protection twice a year. It's 80 questions across six categories: Broker Compliance, Classification, Entry/Entry Summary, Valuation and Duty Assessment, Modernized Drawback, and a Practical Exercise involving real duty calculations. You need 75% to pass. Candidates get 4.5 hours. The exam is open-book, and CBP specifies the appropriate reference materials for each sitting.
Recent pass rates, published by CBP prior to appeal decisions:
- October 2024: 24% passed
- April 2025: 30% passed
- October 2025: 12% passed
These aren't people walking in cold. These are candidates who've studied for months, many of whom work in the trade compliance industry. The October 2025 sitting had an 88% failure rate.
The exam isn't trivia. It's a research-and-application test. The questions give you a fact pattern and ask you to identify the correct regulatory provision, the correct tariff classification, or the correct duty calculation. The answers live in Title 19 of the Code of Federal Regulations and the Harmonized Tariff Schedule. Together, these two bodies of law span thousands of pages of interconnected rules, exceptions, and cross-references.
The exam tests whether you can navigate that structure under pressure. That's exactly what I built the tooling to do.
What the models had access to.
Neither model had customs law "in its weights" in any useful way. An LLM might know about 19 CFR, but it doesn't have the specific duty rate for barley seeds (HTS 1003.10.00: 0.15 cents/kg) or the exact number of business days to respond to a counterfeit goods detention notice (7, per 19 CFR 133.21(b)(2)(i)). That's reference material, not general knowledge.
What they did have access to was a structured knowledge layer exposed through MCP (Model Context Protocol) tools:
- ~19,000 HTS tariff codes with duty rates, special program indicators, and chapter/section hierarchy
- Full text of 19 CFR (Title 19, Code of Federal Regulations) with hybrid vector + full-text search
- CBP guidance documents, including directives, ACE entry summary instructions, and business process rules
- Cross-reference and relationship data between regulatory provisions
- Semantic search that understands queries like "CBERA sugar exclusions" and returns the relevant sections of 19 CFR 10.191
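The post does not say how the vector and full-text results are combined. One common approach is reciprocal rank fusion (RRF), and the sketch below assumes it purely for illustration. The function name, the `k=60` constant, and the two example rankings are all assumptions, not a description of the actual search backend.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default for RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up result lists for the query "CBERA sugar exclusions".
vector_hits = ["19 CFR 10.191", "19 CFR 10.195", "19 CFR 10.171"]
fulltext_hits = ["19 CFR 10.191", "19 CFR 10.198", "19 CFR 10.195"]

fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
print(fused)
```

A section that ranks well in both lists rises to the top, while a section found by only one retriever still survives in the merged list. That is the property a hybrid search needs for queries that mix exact citations with natural-language phrasing.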
The interface between model and knowledge was a set of MCP tools: search_regulations, get_regulation_section, search_tariffs, get_tariff, and about a dozen more specialized endpoints. The model calls a tool, gets structured data back, reasons about it, calls another tool if needed, and eventually commits to an answer with citations.
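As a rough sketch of that loop: the tool names `search_regulations` and `get_regulation_section` come from the post, but their signatures, the return shapes, the placeholder section text, and the scripted stand-in for the model are all assumptions made so the example runs on its own. It does not use a real MCP client.

```python
# Stub knowledge layer: one section with placeholder text (not the real CFR wording).
SECTIONS = {
    "19 CFR 111.24": "Placeholder summary: broker records are confidential.",
}


def search_regulations(query: str) -> list[str]:
    """Toy search: return citations whose text contains any query word."""
    words = query.lower().split()
    return [cite for cite, text in SECTIONS.items()
            if any(w in text.lower() for w in words)]


def get_regulation_section(citation: str) -> str:
    """Return the stored text for one citation."""
    return SECTIONS[citation]


TOOLS = {
    "search_regulations": search_regulations,
    "get_regulation_section": get_regulation_section,
}


def scripted_model(history: list[dict]) -> dict:
    """Stand-in for the LLM: search, read the top hit, then commit to an answer."""
    if not history:
        return {"tool": "search_regulations",
                "args": {"query": "confidential broker records"}}
    last = history[-1]
    if last["tool"] == "search_regulations":
        return {"tool": "get_regulation_section",
                "args": {"citation": last["result"][0]}}
    return {"answer": "A", "citations": [last["args"]["citation"]]}


def run(model, tools, max_steps=5):
    """Call tools until the model returns an answer or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        step = model(history)
        if "answer" in step:
            return step, history
        result = tools[step["tool"]](**step["args"])
        history.append({**step, "result": result})
    raise RuntimeError("no answer within the step budget")


answer, trace = run(scripted_model, TOOLS)
print(answer, [t["tool"] for t in trace])
```

The `history` list doubles as a trace: every tool call and its result is kept, so the final answer can be audited against what the model actually read.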
Three layers: structured knowledge at the bottom, clean tool interfaces in the middle, and a reasoning model on top.
Here's what that looks like in practice. For the first question on broker recordkeeping, Sonnet called one tool, read one regulation, and committed to the answer:
Question: A customs broker has a client's confidential records as defined in CBP regulations pertaining to customs brokers. The records contain information that is not available from a source that is open to the public. The broker's client has specified in writing that the records should be kept confidential. Of those persons listed below, to whom may the broker disclose the confidential records?
Sonnet's reasoning: Per 19 CFR 111.24, broker records are confidential and may only be disclosed to: (1) the clients themselves, (2) their surety on a particular entry, (3) DHS representatives, (4) other duly accredited U.S. officers/agents, (5) on subpoena/court order, or (6) when authorized in writing by the client. The client's surety on a particular entry (Choice A) is explicitly listed. Freight forwarders (B), other clients (C), and the client's exporter (D) are not authorized recipients.
That trace is live. The reasoning text, the tool call, the duration, the chunk metadata — that's the actual record of how the model arrived at the answer, served straight from the exam database.
The results.
Here's the category breakdown:
| Category | Opus | Sonnet |
|---|---|---|
| Broker Compliance (15) | 13 (86.7%) | 15 (100%) |
| Modernized Drawback (5) | 5 (100%) | 5 (100%) |
| Practical Exercise (5) | 4 (80%) | 4 (80%) |
| Classification (15) | 15 (100%) | 13 (86.7%) |
| Valuation (15) | 13 (86.7%) | 14 (93.3%) |
| Entry / Entry Summary (25) | 25 (100%) | 25 (100%) |
| Total | 75/80 (93.8%) | 76/80 (95.0%) |
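The totals can be re-derived from the category rows. The short check below copies the numbers from the table and compares each total with the 75% pass mark; the variable names are illustrative.

```python
# (category, questions, Opus correct, Sonnet correct), copied from the table.
ROWS = [
    ("Broker Compliance", 15, 13, 15),
    ("Modernized Drawback", 5, 5, 5),
    ("Practical Exercise", 5, 4, 4),
    ("Classification", 15, 15, 13),
    ("Valuation", 15, 13, 14),
    ("Entry / Entry Summary", 25, 25, 25),
]
PASS_FRACTION = 0.75  # passing threshold stated for the exam

total_questions = sum(n for _, n, _, _ in ROWS)
opus_total = sum(o for _, _, o, _ in ROWS)
sonnet_total = sum(s for _, _, _, s in ROWS)
pass_mark = total_questions * PASS_FRACTION  # correct answers needed to pass

for name, total in (("Opus", opus_total), ("Sonnet", sonnet_total)):
    pct = 100 * total / total_questions
    print(f"{name}: {total}/{total_questions} ({pct:.1f}%), "
          f"{total - pass_mark:.0f} above the pass mark")
```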
A few observations.
The knowledge layer carried both models through the lookup-heavy categories. Broker Compliance, Modernized Drawback, and Entry/Entry Summary are categories where the answers are unambiguous once you find the right regulation. "What's the maximum penalty for X?" "How many days to file Y?" "What form is required for Z?" Find the right CFR section, read it carefully, and the answer is there. Entry/Entry Summary, the largest category at 25 questions, was a clean sweep for both models: 25/25.
They made different mistakes in different places. Opus went 15/15 on Classification where Sonnet got 13/15. Sonnet went 15/15 on Broker Compliance where Opus got 13/15. Sonnet edged Opus on Valuation (14/15 vs 13/15). They tied on everything else. These don't look like systematic differences so much as ordinary variance: the kind you'd expect from two strong test-takers on the same exam. Different reasoning chains, different moments of imprecision, different questions where one model's search surfaced a more relevant result.
The more expensive model didn't win. Sonnet costs roughly a third of what Opus costs per token. On the vast majority of questions, they arrived at the same answers through the same tools. The differences showed up in the five or six questions where one model's reasoning chain happened to land correctly and the other's didn't.
What this means.
In domain-specific systems, the infrastructure determines the ceiling. Model choice mostly determines the margin.
The structured knowledge layer (the indexed regulations, the tariff database, the semantic search, the cross-references) is what made both models capable of passing. The model contributed reasoning at the edges: synthesis across multiple regulatory provisions, judgment calls on ambiguous questions, the ability to formulate effective search queries. That reasoning mattered. It's the difference between 87% and 100% on a given category. But the foundation that got both models into the 90s was the same knowledge layer underneath.
Without the tools, I'd expect both models to fail badly, likely well below the 75% threshold. Most of the exam depends on precise regulatory lookup, tariff-line specificity, and exact duty calculations rather than general world knowledge.
Consider what one question required: calculating duty on sugar syrup from the British Virgin Islands classified under HTS 2106.90.4400. Getting the right answer ($162.38) meant looking up the tariff code, finding the MFN specific rate ($0.036606/kg), noticing that the "E*" CBERA indicator has an asterisk that excludes this product from duty-free treatment, knowing to apply the rate to the raw sugar weight (4,436 kg) rather than the net weight (4,770 kg), and doing the arithmetic. An LLM working from memory would get maybe one of those five steps right.
Question: Your client is importing a shipment of blended syrups containing sugars derived from cane sugar with added yellow color but no added flavoring. The blended syrup (sirup) is capable of further mixing and is packaged consistent with the needs of commercial carbonated beverage manufacturers in 30-gallon drums and is not packaged consistent with marketing to the ultimate consumer. The syrup was manufactured in the British Virgin Islands and imported directly from the British Virgin Islands. The merchandise will be properly classified under subheading 2106.90.4400, HTSUS. The quota is open. There are 50 drums total. The net weight of the 50 drums of syrup is 4,770 kilograms. The raw (total) sugar weight is 4,436 kilograms. How much will the estimated duty be?
Sonnet's reasoning: British Virgin Islands (BVI) is NOT a US insular possession (A is wrong). The CBI indicator shows "E*" meaning BVI is EXCLUDED from CBERA benefits for this HTS code (B is wrong). The MFN rate for 2106.90.4400 is $0.036606/kg. Per Additional U.S. Note 5 to Chapter 17, the rate applies to the total sugar content weight (4,436 kg raw/total sugar weight), not the total net weight. Duty = 4,436 kg × $0.036606/kg = $162.38.
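The arithmetic in the final step can be checked directly. The sketch below uses the rate and weights quoted in the question and trace; rounding half-up to the cent is an assumption made for illustration, and the second figure shows what the duty would be if the rate were applied to the net weight by mistake.

```python
from decimal import Decimal, ROUND_HALF_UP

rate_per_kg = Decimal("0.036606")  # MFN specific rate from the trace, USD per kg
sugar_weight_kg = Decimal("4436")  # raw (total) sugar weight from the question
net_weight_kg = Decimal("4770")    # net weight of the syrup (the wrong base)

CENT = Decimal("0.01")  # rounding to the cent, half-up, is an assumption

duty = (sugar_weight_kg * rate_per_kg).quantize(CENT, rounding=ROUND_HALF_UP)
duty_on_net = (net_weight_kg * rate_per_kg).quantize(CENT, rounding=ROUND_HALF_UP)

print(f"Duty on sugar weight: ${duty}")       # $162.38
print(f"Duty on net weight:   ${duty_on_net}")  # $174.61
```

Picking the wrong weight changes the result by about twelve dollars. On a multiple-choice exam, that is enough to land on a different answer.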
A frontier model can probably handle the conceptual questions from general knowledge ("what are the elements of computed value?" or "what's the highest-rate principle for commingled goods?"). But those are maybe 20 of the 80 questions. The other 60 require specific regulatory data that lives in the knowledge layer.
This has practical implications:
Invest in the knowledge layer. If you're building a system for tax compliance, insurance underwriting, pharmaceutical regulation, or any other domain where the answers live in a corpus of authoritative text, spend your engineering budget on structured ingestion, semantic indexing, and clean tool interfaces. That's what determines whether your system passes or fails.
Benchmark on real professional exams. This is the most underrated evaluation method in AI. Forget MMLU and HumanEval. Find the professional licensing exam for your domain and run your system against it. The Customs Broker License Exam has a known pass rate, a known difficulty level, and a clear scoring rubric. If your system passes it, you have a meaningful claim about capability. If it doesn't, you know exactly where the gaps are.
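A grading harness for this kind of benchmark can be very small. The sketch below is a minimal version, assuming the answer key and the system's answers are dictionaries keyed by question number; the `grade` function and the four-question example data are hypothetical, not the exam engine described in this post.

```python
from collections import Counter


def grade(answers, key, categories, pass_fraction=0.75):
    """Score answers against a key and break the result down by category."""
    correct = Counter()
    totals = Counter()
    for qid, right in key.items():
        cat = categories[qid]
        totals[cat] += 1
        if answers.get(qid) == right:  # a missing answer counts as wrong
            correct[cat] += 1
    score = sum(correct.values()) / len(key)
    return {
        "score": score,
        "passed": score >= pass_fraction,
        "by_category": {c: (correct[c], totals[c]) for c in totals},
    }


# Hypothetical four-question exam.
key = {1: "A", 2: "C", 3: "B", 4: "D"}
categories = {1: "Broker Compliance", 2: "Classification",
              3: "Classification", 4: "Valuation"}
answers = {1: "A", 2: "C", 3: "D", 4: "D"}

report = grade(answers, key, categories)
print(report)
```

The per-category breakdown is what tells you where the gaps are, not just whether the system passed.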
The appeal.
After the exam, I asked Sonnet if it wanted to appeal any of its incorrect answers. It appealed two and declined to appeal two others, noting they were "incorrect on the merits."
The first appeal was procedural: one question required comparing visual images of partially completed CBP Form 7501 entry summaries. The visual exhibits weren't accessible through the exam interface. Sonnet argued the question was unanswerable as presented. That's a clean procedural claim.
The second appeal was substantive. Question 74 involved an Italian sports car reimported after being leased to a client in China. The answer key cited a provision (19 CFR 141.2(h)) that exempts goods "exported under lease to a foreign manufacturer." Here's the question and Sonnet's answer:
Question: Company A, based in the United States, sells and rents high end sports cars. In 2015, Company A imported an Italian sports car from Italy and paid all applicable duties and taxes. In 2020, Company A leased the car to a client in China and exported it to him. The car did not meet the specifications agreed upon in the contract; however, the client continued to drive it for the remainder of the lease. The sports car was not subjected to any repairs or alterations while in China. In 2024, Company A then reimported the car into the United States. What statement is correct upon reimportation into the United States?
Sonnet's answer: The Italian sports car was previously imported in 2015 with all duties paid, exported to China in 2020, and reimported in 2024 without any repairs, alterations, or advancement in value. Under 9801.00.10, articles previously imported with duties paid and exported from the US can be reimported duty-free when returned without advancement in value or improvement in condition. Option B (9802.00.50) is wrong because that provision requires the goods to have been exported FOR the purpose of repair/alteration, which wasn't the case here. Option C is wrong because the client actually used the car for the full lease period despite non-conformity. Option D has no direct legal basis — leasing per se does not create a duty exemption. The car qualifies for duty-free entry under 9801.00.10.
Sonnet's appeal:
The operative words are "foreign manufacturer." The Chinese lessee in this question is described throughout only as "a client" who "continued to drive" the car for personal use. Nothing in the question identifies this individual as a manufacturer.
It then argued that its own answer, while imprecise in subheading, was grounded in a defensible legal principle, and that the question was ambiguous because no offered answer perfectly fit the facts.
This is worth pausing on. The model identified which questions had defensible appeals and which didn't. It constructed a structured legal argument citing specific regulatory language. It distinguished between procedural and substantive grounds. And it knew which fights to pick. That goes beyond retrieval. It starts to look like legal judgment, and it's the kind of capability that emerges when a reasoning model has access to a deep, well-structured knowledge layer.
If both appeals were granted, Sonnet's score would be 78/80 (97.5%).
What this does not prove.
This does not prove that models are interchangeable in every setting, or that reasoning quality doesn't matter. It shows something narrower and more useful: in domains where answers live in authoritative, structured reference material, the knowledge layer and tool interface account for most of the system's capability, while model differences show up mainly at the edges.
A few specific limits:
The April 2025 exam is plausibly in the training corpus. CBP publishes the exam and the answer key. The obvious objection — "the model just memorized it" — is answered by the traces above. When Sonnet committed to the sugar syrup answer, it called get_tariff for the code, looked up the CBERA exclusion, and read specific chunks back before reasoning. A memorizing model wouldn't need the trail, and wouldn't produce one. Every embedded card in this post is a record of how the model arrived at the answer, not just that it did.
This is one exam, two models, one run each. The scores would likely vary by a few points on a rerun. The fact that both models landed in the 93–95% range is meaningful; the fact that Sonnet beat Opus by exactly one question is probably not.
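A back-of-the-envelope check supports that reading. If each question is treated as an independent pass/fail trial with the same success probability (a crude assumption, since exam questions vary in difficulty), the binomial standard error gives a rough size for run-to-run noise. The function name is illustrative.

```python
import math


def score_standard_error(correct, total):
    """Binomial standard error of the observed fraction correct."""
    p = correct / total
    return math.sqrt(p * (1 - p) / total)


TOTAL = 80
opus_se_questions = score_standard_error(75, TOTAL) * TOTAL
sonnet_se_questions = score_standard_error(76, TOTAL) * TOTAL

print(f"Opus:   75/80, standard error about {opus_se_questions:.1f} questions")
print(f"Sonnet: 76/80, standard error about {sonnet_se_questions:.1f} questions")
```

Under that assumption the noise on each score is around two questions, which is larger than the one-question gap between the models.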
The exam is structured. Multiple choice with four options, clear fact patterns, specific regulatory answers. Real-world customs brokerage involves ambiguous fact patterns, client communication, judgment about which questions to even ask, and regulatory interpretation that goes beyond what any exam can test. Passing the exam is necessary but not sufficient for being a good customs broker.
The knowledge layer was built for this domain. The results reflect months of engineering: structured ingestion, semantic indexing, entity relationships, cross-references. You don't get this by dumping documents into a vector database and calling it RAG. The quality of the knowledge layer is the whole game, and building a good one takes real work.
The bigger picture.
I started this project to make trade law accessible through AI tooling. Along the way, it became an experiment in what happens when you deeply model a domain, expose it through clean interfaces, and let a reasoning model operate over the full structure.
What happens is that the system passes a professional licensing exam that most human candidates fail. In this test, it did so on two different model tiers.
I don't think this is limited to trade law. The architecture is domain-agnostic:
- A structured knowledge layer that models the domain as a connected system. Not documents in a vector database, but entities, relationships, and rules with enough fidelity that a reasoning system can traverse them.
- A model capable of multi-step reasoning over that structure.
- Clean interfaces between the two. MCP tools with well-defined inputs and outputs.
Swap out 19 CFR for tax code. Medical device regulations. Securities law. Building codes. Any domain where the answers live in a corpus of authoritative text and the job is "find the right provision and reason about it correctly."
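One way to picture the domain-agnostic part is as a small interface that any corpus can implement. The sketch below is an illustration only: `Provision`, `KnowledgeLayer`, `InMemoryLayer`, and `follow_cross_references` are hypothetical names, and the two sample provisions are placeholders rather than real regulatory text.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Provision:
    citation: str                          # e.g. a CFR or tax-code section
    text: str
    cross_references: tuple[str, ...] = ()  # citations this provision points to


class KnowledgeLayer(Protocol):
    """The interface a reasoning model sees, whatever the domain."""

    def search(self, query: str) -> list[Provision]: ...
    def get(self, citation: str) -> Provision: ...


class InMemoryLayer:
    """Toy implementation backed by a dict, with substring search."""

    def __init__(self, provisions):
        self._by_citation = {p.citation: p for p in provisions}

    def search(self, query):
        q = query.lower()
        return [p for p in self._by_citation.values() if q in p.text.lower()]

    def get(self, citation):
        return self._by_citation[citation]


def follow_cross_references(layer: KnowledgeLayer, citation: str) -> list[str]:
    """Return the provision plus everything it cites, one hop out."""
    root = layer.get(citation)
    return [root.citation] + [layer.get(c).citation for c in root.cross_references]


# Placeholder provisions; swap in any corpus that fits the interface.
layer = InMemoryLayer([
    Provision("SEC-1", "General rule on filing deadlines.", ("SEC-2",)),
    Provision("SEC-2", "Exception to the general rule."),
])
print(follow_cross_references(layer, "SEC-1"))
```

The point of the interface is that the traversal code never mentions customs. Changing domains means changing the data behind `search` and `get`, not the reasoning loop on top.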
And if you're a domain expert (a working customs broker, a tax attorney, a compliance officer), this doesn't replace you. It makes you dramatically faster. Your expertise tells the system what to ask. The system gives you the answer with citations in seconds instead of minutes. Your judgment gets more valuable, not less, because you're spending all your time on judgment instead of page-turning.
The professional licensing exam is the benchmark. The structured knowledge layer is the product. The model tier is a variable. And the results speak for themselves.
See the full results — all 80 questions, both models, every tool call: /exam-results.
The HTS-API MCP server, the exam engine, and the full regulatory knowledge layer described in this post were built by Fahad Baig at HTS MCP. The exam results are reproducible: same exam, same tools, same workflow. The models were Claude Opus 4.6 and Claude Sonnet 4.6, accessed through Anthropic's API via the Model Context Protocol.
CBP pass rate data is from officially published pre-appeal results for the April 2025, October 2024, and October 2025 examination sittings.