March 10, 2026 · Krunal Sabnis
When an SLM Routes Every Request, PII Recall Drops to Zero — Why Layered Architecture Wins for Enterprise AI
A 1.5B model classified credit card numbers as “not sensitive.” The same model, used only for ambiguous cases behind a deterministic layer, improved routing accuracy to 95%. Right-sizing isn't just about model size — it's about knowing where each layer belongs.
The Question This Post Answers
In Part 1, we pushed PII recall from 76% to 98% using statistical NER and pattern recognizers — no LLM in the loop. But we left a gap: 4 out of 60 prompts were misrouted because the rules engine couldn’t classify them. It said “I don’t know.”
The obvious next step: add a small language model (SLM) to handle the ambiguous cases. But how much does it actually help? And what happens if you skip rules entirely and let the SLM decide everything?
We ran the same 60 prompts through three routing modes and measured accuracy, latency, PII recall, and cost. The results make a clear case for layered architecture — and an equally clear case against “just use an LLM.”
The Three Modes
Rules-only — deterministic regex, keyword matching, length heuristics. No model calls. Sub-millisecond decisions for every prompt.
Rules+SLM (hybrid) — rules handle everything they can. Only when confidence is below threshold does the prompt get forwarded to a 1.5B parameter SLM (Qwen 2.5, running locally via Ollama) for structured JSON classification.
SLM-only — every prompt goes through the SLM. No rules, no patterns, no heuristics. The model decides everything: routing, complexity assessment, sensitivity classification.
Same 60 prompts. Same three domains (healthcare, telecom, finance). Same ground truth labels.
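Concretely, the hybrid flow is a confidence-gated fallback: rules answer when they're sure, and only low-confidence prompts touch the model. The sketch below is illustrative, not the benchmark's actual implementation — the thresholds, patterns, keywords, and the stubbed SLM call are all assumptions standing in for the real components.

```python
import re
from dataclasses import dataclass

# Illustrative hybrid router: deterministic layer first, SLM fallback
# only when rule confidence drops below a threshold.
CONFIDENCE_THRESHOLD = 0.7

# A couple of example PII patterns (the real system uses statistical
# NER plus a larger pattern library, per Part 1).
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-style
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # card-number-like digits
]
LOCAL_KEYWORDS = {"data usage", "account balance"}

@dataclass
class Decision:
    route: str          # "pii_pipeline", "local", or "cloud"
    confidence: float
    used_slm: bool = False

def slm_classify(prompt: str) -> Decision:
    """Stub for the 1.5B SLM call (e.g. Qwen 2.5 via Ollama).
    The real system asks for structured JSON; here we just return
    a fixed answer so the control flow is visible."""
    return Decision(route="cloud", confidence=0.5, used_slm=True)

def route(prompt: str) -> Decision:
    # Deterministic layer: PII patterns win outright.
    if any(p.search(prompt) for p in PII_PATTERNS):
        return Decision(route="pii_pipeline", confidence=1.0)
    # Keyword heuristics: a confident local match.
    if any(kw in prompt.lower() for kw in LOCAL_KEYWORDS):
        return Decision(route="local", confidence=0.9)
    # Length heuristic: short prompts default to local.
    conf = 0.8 if len(prompt) < 80 else 0.4
    if conf >= CONFIDENCE_THRESHOLD:
        return Decision(route="local", confidence=conf)
    # Only the ambiguous remainder reaches the SLM.
    return slm_classify(prompt)
```

In this shape, the SLM is structurally incapable of overriding the PII check — the fallback branch is only reachable after the deterministic checks have passed.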
The Results
| Metric | Rules-only | Rules+SLM | SLM-only |
|---|---|---|---|
| Routing accuracy | 93.3% | 95.0% | 43.3% |
| SLM calls | 0 | 3 (5%) | 60 (100%) |
| Avg latency | 25.8ms | 146.4ms | 2,297ms |
| P50 latency | 0.0ms | 0.0ms | 2,288ms |
| P95 latency | 8.9ms | 2,315ms | 2,567ms |
| PII recall | 98.2% | 98.2% | 0% |
Read that last row again. SLM-only: 0% PII recall. The model routed zero prompts to the PII detection pipeline. It classified credit card numbers as “not sensitive.” It looked at social security numbers and said “can be handled locally.”
What the SLM Actually Said
This isn’t a theoretical failure. Here’s what the 1.5B model returned when asked to classify prompts containing explicit PII:
A prompt containing a name, SSN, and phone number:
“The user prompt is a simple request for account opening, which can be handled by a local service model without requiring any sensitive data.”
A prompt with a credit card number and IBAN:
“The user is asking to pay a bill with their credit card number and account details, which are not sensitive personal information.”
A prompt asking for a differential diagnosis with patient records:
“The user is asking for a basic investment strategy recommendation without any sensitive information.”
The model doesn’t understand what PII is. It doesn’t understand regulatory sensitivity. It doesn’t understand that “simple” and “safe” are different axes. And this is exactly what happens when you delegate policy decisions to a model that was trained to be helpful, not compliant.
Why Hybrid Wins
The hybrid mode made exactly 3 SLM calls out of 60 prompts — the 3 cases where the rules engine’s confidence dropped below threshold. Of those 3, the SLM correctly routed 1 (a complex medical question about comorbid depression and chronic pain management); it got the other 2 wrong.
That’s a gain of 1.7 percentage points in accuracy for 3 model calls. Here’s why the economics matter:
Rules-only cost per 1,000 prompts: ~0. Sub-millisecond CPU time.
Hybrid cost per 1,000 prompts: ~50 SLM calls (5% fallback rate). On CPU-only inference with a 1.5B model, that’s roughly 100 seconds of compute. On a GPU, under 5 seconds.
SLM-only cost per 1,000 prompts: 1,000 SLM calls. On CPU, that’s ~38 minutes of compute. On a GPU, ~2 minutes. For 43% accuracy.
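The arithmetic behind those estimates is straightforward. Per-call timings here are illustrative assumptions (~2.0–2.3s per SLM call on CPU, ~0.12s on GPU), not benchmark measurements:

```python
# Back-of-the-envelope SLM compute cost for a batch of prompts.
def slm_seconds(prompts: int, fallback_rate: float, seconds_per_call: float) -> float:
    """Total SLM inference time: only the fallback fraction hits the model."""
    return prompts * fallback_rate * seconds_per_call

hybrid_cpu = slm_seconds(1000, 0.05, 2.0)     # 5% fallback  -> 100 s
slm_only_cpu = slm_seconds(1000, 1.00, 2.3)   # every prompt -> ~38 min
slm_only_gpu = slm_seconds(1000, 1.00, 0.12)  # every prompt -> ~2 min

print(f"hybrid CPU: {hybrid_cpu:.0f}s, "
      f"SLM-only CPU: {slm_only_cpu / 60:.0f}min, "
      f"SLM-only GPU: {slm_only_gpu / 60:.0f}min")
```

The fallback rate is the lever: at 5%, a 20× slower per-prompt path costs almost nothing in aggregate.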
The hybrid approach is not a compromise. It’s the correct architecture: deterministic where you can be, probabilistic where you must be.
Per-Domain Breakdown
| Domain | Rules-only | Rules+SLM | SLM-only |
|---|---|---|---|
| Finance | 100% | 100% | 35% |
| Healthcare | 90% | 95% | 45% |
| Telecom | 90% | 90% | 50% |
Finance hits 100% in both rules-only and hybrid — the keyword and PII patterns in financial prompts are unambiguous. Healthcare benefits most from the SLM layer (90% → 95%) because medical prompts have nuanced complexity that keywords can’t always capture. Telecom stays at 90% in both — the remaining errors are a keyword false positive and a domain expertise gap in the 1.5B model.
The Three Remaining Failures
Even hybrid mode misclassifies 3 prompts. Understanding why tells you where to invest next:
1. Keyword false positive (tc-001): Prompt mentions “data usage” which matches a telecom local keyword, but the prompt also contains a person’s name. The rules engine is confident (keyword match), so the SLM never gets called. Fix: improve rule priority — PII patterns should override keyword matches.
2. Domain expertise gap (hc-016): A complex cardiac vs pulmonary differential diagnosis. The 1.5B model classified it as “simple” with 0.95 confidence. It doesn’t have enough medical knowledge to distinguish clinical complexity. Fix: either a larger model or domain-specific fine-tuning — but only for the 5% of prompts that reach the SLM.
3. Domain expertise gap (tc-019): Telecom capacity planning and total cost of ownership analysis. The model classified it as a basic comparison. Same root cause — the 1.5B model lacks domain depth. Same fix profile.
Two of three failures are model capability limits. One is a rules ordering bug. None are architectural failures.
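The tc-001 fix is an ordering change, not a new capability: run PII recognizers before keyword heuristics, so a confident keyword hit can never mask a person's name. A minimal sketch — the patterns below are deliberately naive stand-ins for the statistical NER and pattern library from Part 1:

```python
import re

# Illustrative recognizers only; the real system uses statistical NER.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # SSN-style
    re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),   # naive full-name match
]
TELECOM_LOCAL_KEYWORDS = ["data usage", "roaming"]

def route_with_priority(prompt: str) -> str:
    # Fix for tc-001: PII checks run BEFORE keyword matching, so
    # "data usage" can no longer shortcut past a name in the prompt.
    if any(p.search(prompt) for p in PII_PATTERNS):
        return "pii_pipeline"
    if any(kw in prompt.lower() for kw in TELECOM_LOCAL_KEYWORDS):
        return "local"
    return "escalate"
```

With this ordering, the tc-001-style prompt (telecom keyword plus a person's name) routes to the PII pipeline instead of local handling.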
What This Means for Production Systems
1. Never let an LLM make policy decisions alone
The SLM-only results aren’t a commentary on Qwen 2.5 specifically. Any general-purpose language model — small or large — will struggle with domain-specific compliance decisions. PII classification is a policy question, not a language understanding question. “Is a credit card number sensitive?” isn’t ambiguous to a human. It shouldn’t be delegated to a model.
2. The SLM’s job is classification, not policy
In the hybrid architecture, the SLM doesn’t decide what’s sensitive. It classifies complexity and recommends a route. The rules engine and PII detector make the compliance decisions. The SLM fills one specific gap: “is this prompt simple enough for local handling, or complex enough for premium cloud?” That’s a judgment call, not a policy call.
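That division of labor can be made explicit in code: policy checks are hard-wired, and the SLM's structured output is only consulted after they pass. The JSON field names below are hypothetical, not the benchmark's actual schema:

```python
import json

def parse_slm_output(raw: str) -> dict:
    """The SLM is prompted to return JSON like
    {"complexity": "simple"|"complex", "route": "local"|"cloud"}.
    It classifies; it never labels anything as sensitive."""
    out = json.loads(raw)
    assert out["complexity"] in {"simple", "complex"}
    return out

def final_route(pii_detected: bool, slm_json: str) -> str:
    # Policy first: the deterministic PII detector's verdict is final,
    # regardless of what the model said.
    if pii_detected:
        return "pii_pipeline"
    # Judgment call second: the SLM only picks local vs cloud.
    return parse_slm_output(slm_json)["route"]
```

Even if the model insists a prompt is "simple" and "not sensitive," that opinion never reaches the compliance decision — it's consumed only on the branch where the detector has already cleared the prompt.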
3. Measure the marginal value of each layer
+1.7% accuracy for 3 SLM calls. That’s the marginal value of the SLM layer in this benchmark. Is it worth it? Depends on your domain. In healthcare, that 1.7% was the difference between routing a comorbid pain management question to a local model vs a specialist cloud endpoint. That matters clinically. In finance, the SLM added nothing — rules were already at 100%.
4. CPU-only inference is viable for routing
The SLM ran on CPU (i7-12700K, no GPU). Average inference time: ~2.3 seconds per call. In the hybrid architecture, that’s 2.3 seconds for 5% of prompts. The other 95% resolve in under 1ms. For a routing decision (not a generation task), this latency is acceptable. You don’t need a GPU cluster for classification.
The Bigger Picture
This benchmark answered a specific question — where does a language model belong in a routing pipeline? The answer: behind deterministic layers, handling only what rules can’t.
But prompt routing is one boundary in a larger problem. Enterprise AI systems have multiple boundaries that need governance:
- Data boundary — what leaves your perimeter, and in what state? (This is what the router handles.)
- Tool boundary — which systems can an AI agent access, with what permissions, under whose authority?
- Model boundary — which model handles which task, at what cost, with what auditability?
Get any of these boundaries wrong and you have either a compliance incident or an architecture that burns budget without delivering value.
The question isn’t which model to use. It’s which boundary you’re solving for — and whether the layer you’re adding actually governs it, or just adds latency.
We solved the data boundary. The tool boundary is next.
This is Part 2 of a series on building governed AI architecture for the enterprise. Part 1 covers how we pushed PII recall from 76% to 98% with statistical NER and pattern recognizers — no LLM in the loop.