The Brain Behind the Bot: Subject Classification with Small Language Models
How we route every question to the right subject in under 200ms using a cheap classifier — and the prompt design tricks that made it 96% accurate.
Why classification is the first step
Before we can give a learner a good answer, we need to know what they're actually asking about. "Explain pointers" lives in a different universe of context than "Explain pointers in React" — and treating both the same way wastes tokens and confuses the model.
Choosing a model
We benchmarked four candidates against 2,000 hand-labeled CS questions:
| Model | Accuracy | p50 latency | Cost / 1k |
|---|---|---|---|
| GPT-4o | 98.1% | 1.4s | $5.00 |
| Claude Haiku | 96.4% | 380ms | $0.25 |
| GPT-4o-mini | 95.9% | 410ms | $0.15 |
| Llama 3 8B | 92.1% | 220ms | $0.04 |
Haiku won. The 1.7-point accuracy gap to GPT-4o vanished entirely once we added a confidence threshold that triggers a fallback to the bigger model.
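For readers who want to reproduce this kind of comparison, here is a minimal sketch of a benchmark loop that reports accuracy and p50 latency for one classifier. It is not our actual harness: `evaluate`, `keyword_stub`, and the two-item `sample` set are hypothetical names and data, and a real run would swap the stub for a call to the model under test.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, Tuple


def evaluate(classify: Callable[[str], str],
             labeled: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return accuracy and median (p50) latency in ms for one classifier."""
    correct = 0
    total = 0
    latencies_ms = []
    for question, expected in labeled:
        start = time.perf_counter()
        predicted = classify(question)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        correct += int(predicted == expected)
        total += 1
    if total == 0:
        raise ValueError("labeled set is empty")
    return {
        "accuracy": correct / total,
        "p50_latency_ms": statistics.median(latencies_ms),
        "n": total,
    }


# Hypothetical stand-in for a real model call, used only to make the sketch runnable.
def keyword_stub(question: str) -> str:
    return "DBMS" if "SQL" in question else "Other"


# Illustrative labels, not taken from the real eval set.
sample = [
    ("What is a SQL join?", "DBMS"),
    ("Explain the TCP handshake", "Networks"),
]
report = evaluate(keyword_stub, sample)
```

Running the same labeled set through each candidate keeps the accuracy and latency columns comparable across models.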
Prompt design
```text
You are a subject classifier for a CS tutoring platform.
Allowed subjects: DSA, OS, Networks, DBMS, OOP, SystemDesign, Other.
Reply with strict JSON: { "subject": ..., "confidence": 0.0-1.0 }
```
No chain-of-thought. No examples. Just a strict JSON shape and a closed enum.
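A strict shape only helps if the caller enforces it. The sketch below shows one way to validate a reply against the closed enum and the confidence range from the prompt. `parse_classification` and `ALLOWED_SUBJECTS` are hypothetical names, and rejecting malformed replies with `ValueError` is an assumption about how a caller might want to handle them.

```python
import json
from typing import Tuple

# The closed enum from the prompt above.
ALLOWED_SUBJECTS = {"DSA", "OS", "Networks", "DBMS", "OOP", "SystemDesign", "Other"}


def parse_classification(raw: str) -> Tuple[str, float]:
    """Parse a model reply; raise ValueError if it breaks the JSON contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    subject = data.get("subject")
    if not isinstance(subject, str) or subject not in ALLOWED_SUBJECTS:
        raise ValueError(f"subject outside the allowed set: {subject!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")

    return subject, float(confidence)


subject, confidence = parse_classification('{"subject": "OS", "confidence": 0.91}')
```

Because the enum is closed, any label the model invents fails validation instead of leaking into downstream routing.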
Confidence routing
- ≥ 0.85 → use Haiku's answer
- 0.55 to < 0.85 → re-classify with GPT-4o-mini
- < 0.55 → tag as Other, log for offline review
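The three bands above can be sketched as a small pure function. The thresholds come from the list, while the function name `route` and the action labels `"accept"`, `"reclassify"`, and `"review"` are hypothetical. The sketch only decides which path a result takes and leaves the actual model calls and logging to the caller.

```python
from typing import Optional, Tuple

HIGH_CONFIDENCE = 0.85  # at or above: trust the first-pass answer
LOW_CONFIDENCE = 0.55   # below: give up and tag as Other


def route(subject: str, confidence: float) -> Tuple[str, Optional[str]]:
    """Map a first-pass (subject, confidence) pair to an action and a subject."""
    if confidence >= HIGH_CONFIDENCE:
        return ("accept", subject)
    if confidence >= LOW_CONFIDENCE:
        # Subject is left undecided until the second model has answered.
        return ("reclassify", None)
    # Caller is expected to log these for offline review.
    return ("review", "Other")


decision = route("DSA", 0.92)
```

Keeping the thresholds as named constants makes them easy to tune against the eval set without touching the routing logic.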
The interesting engineering is not in picking the model — it's in the confidence routing, the eval set, and the offline review loop.