Accuracy Metrics Mask AI Shortcuts in NL-to-SQL Generalization
A new arXiv position paper proposes symbolic-mechanistic evaluation to expose when AI models rely on memorization rather than true generalization.
Accuracy-based evaluation cannot reliably distinguish genuine generalization from shortcuts such as memorization, data leakage, or brittle heuristics, especially in small-data regimes, according to a new arXiv position paper (https://arxiv.org/abs/2603.23517). The paper argues for mechanism-aware evaluation that combines task-relevant symbolic rules with mechanistic interpretability to produce algorithmic pass/fail scores, showing where models generalize versus where they exploit surface patterns. In an NL-to-SQL demonstration, two models with identical architectures were trained under different conditions; the model trained without schema information reached 94% field-name accuracy on unseen data under standard metrics, yet violated core schema-generalization rules under symbolic-mechanistic evaluation.
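To make the idea concrete, here is a minimal sketch of what one symbolic pass/fail rule for NL-to-SQL could look like. This is a hypothetical illustration, not the paper's actual evaluation code: the rule checks that every identifier a generated query references actually exists in the target schema, yielding a binary verdict per example rather than a string-match accuracy score. A model that memorized field names from training data would pass accuracy metrics but fail this rule on unseen schemas.

```python
import re

def schema_rule_passes(generated_sql: str, schema: dict[str, set[str]]) -> bool:
    """Symbolic rule (hypothetical): every identifier in the generated SQL
    must be a table or column of the target schema. Returns pass/fail."""
    known_columns = {col for cols in schema.values() for col in cols}
    tables = set(schema)
    # Naive identifier extraction for illustration; a real evaluator would
    # use a proper SQL parser (e.g. sqlglot) instead of a regex.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", generated_sql)
    sql_keywords = {"select", "from", "where", "and", "or", "order", "by", "limit"}
    identifiers = [t for t in tokens if t.lower() not in sql_keywords]
    return all(t in known_columns or t in tables for t in identifiers)

# Example schema the model has never seen during training
schema = {"users": {"id", "email"}, "orders": {"id", "user_id", "total"}}
print(schema_rule_passes("SELECT email FROM users WHERE id = 1", schema))  # True
print(schema_rule_passes("SELECT username FROM users", schema))            # False: memorized field
```

The point of such a rule is that it is algorithmic and schema-aware: a high field-name accuracy score says nothing about whether those names came from the schema or from memorized training examples, while the rule fails the moment the model emits an identifier the schema does not contain.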
AXIOM: This means the AI apps and assistants we use every day might look accurate but could be memorizing instead of understanding, so future systems will need these deeper checks to avoid quietly failing in new situations.
Sources (1)
- [1] Beyond Accuracy: Introducing a Symbolic-Mechanistic Approach to Interpretable Evaluation (https://arxiv.org/abs/2603.23517)