Raw accuracy is not understanding. The gap between them is measurable.
For any benchmark and model, we define three numbers: raw accuracy on the original questions, adjusted accuracy on meaning-preserving transformed variants, and the symmetry gap, the difference between the two.
A small gap means robust performance — structural understanding that survives changes in phrasing, formatting, variable names, and domain framing. A large gap means brittleness — accuracy that depends on surface features of how questions happen to be written.
But stability alone isn't enough. A model that gives the same answer no matter what — ignoring negation, dropped premises, changed quantities — is not robust. It's oblivious. So we also test structural sensitivity: does the model change its answer when the problem genuinely changes?
This dual test — stable where it should be, sensitive where it should be — is what makes SymGap more than a robustness check.
Separating reasoning from pattern matching.
The deepest question isn't whether a model is stable under paraphrase. It's whether the model is solving the problem or recognizing the problem's shape. We independently vary two dimensions: surface form and underlying logical structure.
| | Familiar structure | Altered structure |
|---|---|---|
| Familiar surface | Standard | Trap |
| Novel surface | Transfer | Control |
A model that reasons about structure handles all four cells. A model that pattern-matches aces Standard but fails Trap and Transfer. The gap between them is a direct measurement of understanding depth.
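To make that measurement concrete, here is a minimal sketch of how per-cell accuracies might be turned into a verdict. The threshold, function name, and numbers are illustrative assumptions, not part of SymGap:

```python
# Hypothetical per-cell accuracies for one model on one task family.
cells = {"standard": 0.95, "trap": 0.40, "transfer": 0.55, "control": 0.50}

def diagnose(cells: dict, threshold: float = 0.3) -> str:
    """Label a model's understanding-matrix profile.

    A large drop from Standard to Trap/Transfer suggests the model
    recognizes the problem's shape rather than solving the problem.
    """
    drop = cells["standard"] - min(cells["trap"], cells["transfer"])
    return "pattern matching" if drop > threshold else "structural reasoning"

print(diagnose(cells))  # -> pattern matching
```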
What this looks like in practice.
Consider a simple rate problem tested across all four matrix cells:
Standard
"A train travels 180 miles at 60 mph. How long does the trip take?"
Most models: correct (3h)

Transfer
"A pipeline processes 180 records at 60 records/sec. How long does it take?"
Some models: confused by domain

Trap
"A train goes 60 mph for half the distance and 90 mph for the other half. 180 miles. How long?"
Many models: 2.4h (wrong)

Control
"A pipeline processes the first 90 records at 60/sec and the next 90 at 90/sec. How long?"
Baseline comparison

The Standard version is trivial. The Transfer version has identical math but unfamiliar framing. The Trap version looks like the Standard version but requires a different calculation. A model that gets Standard right and Trap wrong has matched the pattern without understanding the math.
What a symmetry-adjusted leaderboard looks like.
Standard leaderboards rank by raw accuracy. A symmetry-adjusted leaderboard shows the full picture — and the ranking can change.
| Model | Raw | Adjusted | Gap | Surface stability | Structural sensitivity |
|---|---|---|---|---|---|
| Model A | 0.92 | 0.71 | 0.21 | 0.77 | 0.83 |
| Model B | 0.90 | 0.79 | 0.11 | 0.85 | 0.82 |
| Model C | 0.91 | 0.68 | 0.23 | 0.72 | 0.86 |
| Model D | 0.84 | 0.76 | 0.08 | 0.81 | 0.78 |
Illustrative data. Model A leads on raw accuracy but has the second-largest symmetry gap. Model D scores lowest but is the most robust. Model B emerges as the leader when adjusted scores are used.
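The ranking flip can be verified directly from the table's (illustrative) numbers; a small sketch:

```python
# Illustrative (raw, adjusted) accuracies from the leaderboard table above.
models = {"A": (0.92, 0.71), "B": (0.90, 0.79), "C": (0.91, 0.68), "D": (0.84, 0.76)}

by_raw = sorted(models, key=lambda m: models[m][0], reverse=True)
by_adjusted = sorted(models, key=lambda m: models[m][1], reverse=True)

print(by_raw)       # ['A', 'C', 'B', 'D'] -- A leads on raw accuracy
print(by_adjusted)  # ['B', 'D', 'A', 'C'] -- B leads once adjusted
```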
What we change, and what we hold constant.
Every transformation is classified as either surface (should not change the answer) or structural (should change the answer). The dual classification is what makes SymGap a test of understanding, not just a stress test.
Surface transforms
Paraphrase, formatting changes, variable renaming, premise reordering, domain redressing. These change the words. They don't change the problem.
Structural transforms
Negation, quantifier changes, premise removal, causal reversal, structural swaps. These change the problem itself. The model should notice.
The diagnostic
Drift under surface transforms = brittleness. Stability under structural transforms = obliviousness. Both are failures of understanding.
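A minimal sketch of how the two scores might be computed, assuming answers can be compared for equality. Function names and the toy answers are illustrative, not the toolkit's API:

```python
def surface_stability(answers_orig, answers_surface):
    """Fraction of surface-transformed variants where the answer is
    unchanged. Low values indicate brittleness."""
    same = sum(a == b for a, b in zip(answers_orig, answers_surface))
    return same / len(answers_orig)

def structural_sensitivity(answers_orig, answers_structural):
    """Fraction of structural variants where the answer changes.
    Low values indicate obliviousness."""
    changed = sum(a != b for a, b in zip(answers_orig, answers_structural))
    return changed / len(answers_orig)

orig = ["3h", "12", "yes", "42"]
after_paraphrase = ["3h", "12", "no", "42"]  # one answer drifted
after_negation = ["2h", "12", "yes", "17"]   # two answers failed to change

print(surface_stability(orig, after_paraphrase))     # 0.75
print(structural_sensitivity(orig, after_negation))  # 0.5
```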
Three problems this addresses.
Benchmark gaming
Models increasingly optimize for benchmark scores through training data overlap, format memorization, and surface-pattern exploitation. The symmetry gap directly measures how much of a score is real. A model that truly understands a domain will have a small gap. A model that has memorized the test will have a large one.
Deployment trust
In production, prompts arrive in unpredictable forms. Users don't phrase things like benchmark authors. Evidence arrives in arbitrary order. Variables have unfamiliar names. A model's symmetry profile tells you whether its benchmark performance will survive contact with the real world.
The reasoning question
The field's central open question is whether language models genuinely reason or merely pattern-match at scale. The understanding matrix provides a direct empirical test. If a model aces Standard problems but fails Trap and Transfer variants, the answer — for those tasks, at that scale — is pattern matching.
How it works.
SymGap audits any benchmark in four steps:
Transform
Apply meaning-preserving and meaning-altering transformations to each benchmark question. Transforms are generated once, human-validated, and frozen as reproducible fixtures.
Evaluate
Run the original and all transformed variants through the model. Cache every response. Use deterministic generation settings for reproducibility.
Judge
Compare outputs using exact match (for numeric and multiple-choice), embedding similarity (continuous signal), and LLM-as-judge (for freeform responses).
Score
Compute the symmetry gap, surface stability, structural sensitivity, per-family drift profile, and understanding matrix scores. Generate a human-readable report card.
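The four steps above can be sketched as a single loop. This is a hypothetical outline under stated assumptions (a deterministic model callable and an equivalence judge), not the released implementation:

```python
def audit(questions, transform_fns, model, judge):
    """Transform -> Evaluate -> Judge -> Score, per the steps above.

    transform_fns: {"surface": [...], "structural": [...]}
    model: callable, question -> answer (deterministic settings assumed)
    judge: callable, (answer_a, answer_b) -> True if equivalent
    """
    stable = changed = n_surface = n_structural = 0
    for q in questions:
        baseline = model(q)                      # evaluate the original
        for fn in transform_fns["surface"]:      # meaning-preserving
            n_surface += 1
            stable += judge(baseline, model(fn(q)))
        for fn in transform_fns["structural"]:   # meaning-altering
            n_structural += 1
            changed += not judge(baseline, model(fn(q)))
    return {
        "surface_stability": stable / n_surface,
        "structural_sensitivity": changed / n_structural,
    }

# Toy usage: one question, one surface transform, one structural transform.
questions = ["2+2"]
transforms = {
    "surface": [lambda q: q.replace("2+2", "2 + 2")],   # formatting change
    "structural": [lambda q: q.replace("2+2", "2+3")],  # quantity change
}
toy_model = lambda q: str(eval(q))  # stand-in for a cached model call
exact_match = lambda a, b: a == b

report = audit(questions, transforms, toy_model, exact_match)
print(report)  # {'surface_stability': 1.0, 'structural_sensitivity': 1.0}
```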
An open-source toolkit and benchmark.
SymGap will be released as a Python toolkit and a curated evaluation suite. The first release targets GSM8K, MMLU, and a custom evidence-aggregation benchmark, with results across frontier models from OpenAI, Anthropic, Google, and Meta.
The companion paper introduces the symmetry gap metric and the understanding matrix, presents empirical findings, and positions SymGap as a complement to standard accuracy-based evaluation.
Accuracy tells you what a model gets right. Symmetry tells you whether it knows why.
A benchmark score is a snapshot. The symmetry gap measures the depth behind it. When the phrasing changes, the variables change, and the familiar framing disappears — what's left is understanding. SymGap measures what's left.