SymGap
An evaluation framework for language models

How much of benchmark performance is real?

A model scores 92% on a math benchmark. But how much of that score survives when you rephrase the questions, rename the variables, or dress the same problem in an unfamiliar domain? The difference between the raw score and the robust score is the symmetry gap — it measures how much a model actually understands versus how much it pattern-matches.

0.92
Raw benchmark score
0.71
After transformation
0.21
The symmetry gap
The problem

Benchmarks measure what a model gets right — not why.

A model that recognizes the shape of a familiar problem can score well without understanding the underlying logic. Standard accuracy cannot tell the difference.

Our approach

Measure what survives when the surface changes.

Apply meaning-preserving transformations to benchmark questions. The drop — the symmetry gap — quantifies what depends on surface cues rather than genuine reasoning.

01 — The symmetry gap

Raw accuracy is not understanding. The gap between them is measurable.

For any benchmark and model, we define three numbers:

raw_score       = standard benchmark accuracy
adjusted_score  = accuracy after meaning-preserving transforms
symmetry_gap    = raw_score − adjusted_score

A small gap means robust performance — structural understanding that survives changes in phrasing, formatting, variable names, and domain framing. A large gap means brittleness — accuracy that depends on surface features of how questions happen to be written.

But stability alone isn't enough. A model that gives the same answer no matter what — ignoring negation, dropped premises, changed quantities — is not robust. It's oblivious. So we also test structural sensitivity: does the model change its answer when the problem genuinely changes?

surface_stability       = consistency under irrelevant changes
structural_sensitivity  = responsiveness to meaningful changes
symmetry_score          = f(stability, sensitivity)

This dual test — stable where it should be, sensitive where it should be — is what makes SymGap more than a robustness check.
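The definitions above can be sketched in a few lines. Note that the text leaves f unspecified; the harmonic mean used below is one plausible choice (an assumption, not the SymGap definition), chosen because it punishes a model that scores high on one axis and low on the other.

```python
def symmetry_gap(raw_score: float, adjusted_score: float) -> float:
    """Accuracy lost when surface features change."""
    return raw_score - adjusted_score

def symmetry_score(surface_stability: float, structural_sensitivity: float) -> float:
    """Combine the two axes. The harmonic mean is an assumed choice of f:
    it stays low unless a model is BOTH stable under surface changes AND
    sensitive to structural ones."""
    if surface_stability + structural_sensitivity == 0:
        return 0.0
    return (2 * surface_stability * structural_sensitivity
            / (surface_stability + structural_sensitivity))

# Using the headline numbers from the page:
gap = symmetry_gap(0.92, 0.71)            # ≈ 0.21
score = symmetry_score(0.77, 0.83)        # ≈ 0.80
```

With this choice of f, an oblivious model (stability 1.0, sensitivity 0.0) scores 0, not 0.5 as a plain average would give.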

02 — Understanding matrix

Separating reasoning from pattern matching.

The deepest question isn't whether a model is stable under paraphrase. It's whether the model is solving the problem or recognizing the problem's shape. We independently vary two dimensions: surface form and underlying logical structure.

The 2×2 understanding test

Standard (familiar surface, same structure)

The baseline. Familiar framing, standard problem. Most models ace this.

Trap (familiar surface, different structure)

Looks like a familiar problem but isn't. Tests whether the model actually reads or template-matches.

Transfer (unfamiliar surface, same structure)

Same logic, different domain dressing. Tests whether understanding is abstract or anchored to specific contexts.

Control (unfamiliar surface, different structure)

Different surface, different structure. Baseline for comparison.

A model that reasons about structure handles all four cells. A model that pattern-matches aces Standard but fails Trap and Transfer. The gap between them is a direct measurement of understanding depth.

understanding_score     = (transfer_accuracy + trap_accuracy) / 2
pattern_matching_score  = standard_accuracy − understanding_score

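The two matrix scores follow directly from per-cell accuracies. A minimal sketch, with hypothetical accuracies standing in for measured ones:

```python
def matrix_scores(standard: float, trap: float, transfer: float) -> tuple[float, float]:
    """Understanding vs. pattern-matching, per the SymGap definitions.
    Inputs are per-cell accuracies from the 2x2 matrix."""
    understanding = (transfer + trap) / 2
    pattern_matching = standard - understanding
    return understanding, pattern_matching

# Hypothetical numbers: a model that aces Standard but struggles elsewhere
understanding, pattern = matrix_scores(standard=0.95, trap=0.40, transfer=0.60)
# understanding ≈ 0.50, pattern ≈ 0.45: nearly half the headline score
# comes from recognizing the problem's shape rather than solving it
```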
03 — Example

What this looks like in practice.

Consider a simple rate problem tested across all four matrix cells:

Standard

"A train travels 180 miles at 60 mph. How long does the trip take?"

Most models: correct (3h)

Transfer

"A pipeline processes 180 records at 60 records/sec. How long does it take?"

Some models: confused by domain

Trap

"A train goes 60 mph for half the distance and 90 mph for the other half. 180 miles. How long?"

Many models: 2.4h (wrong; the correct answer is 2.5h)

Control

"A pipeline processes the first 90 records at 60/sec and the next 90 at 90/sec. How long?"

Baseline comparison

The Standard version is trivial. The Transfer version has identical math but unfamiliar framing. The Trap version looks like the Standard version but requires a different calculation. A model that gets Standard right and Trap wrong has matched the pattern without understanding the math.
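The arithmetic behind the Trap failure is worth making explicit. The wrong answer comes from averaging the speeds; the right answer comes from adding the per-leg times:

```python
# Standard: uniform speed, so time = distance / speed
standard = 180 / 60                  # 3.0 hours

# Trap: two legs at different speeds -- add the per-leg times
trap_correct = 90 / 60 + 90 / 90     # 1.5 + 1.0 = 2.5 hours

# The pattern-matched answer averages the speeds first, then divides.
# That shortcut only works when both legs take equal TIME, not equal distance.
trap_naive = 180 / ((60 + 90) / 2)   # 180 / 75 = 2.4 hours
```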

04 — Leaderboard

What a symmetry-adjusted leaderboard looks like.

Standard leaderboards rank by raw accuracy. A symmetry-adjusted leaderboard shows the full picture — and the ranking can change.

Model     Raw    Adjusted   Gap    Surface stability   Structural sensitivity
Model A   0.92   0.71       0.21   0.77                0.83
Model B   0.90   0.79       0.11   0.85                0.82
Model C   0.91   0.68       0.23   0.72                0.86
Model D   0.84   0.76       0.08   0.81                0.78

Illustrative data. Model A leads on raw accuracy but has the second-largest symmetry gap. Model D scores lowest but is the most robust. Model B emerges as the leader when adjusted scores are used.
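The re-ranking can be reproduced from the illustrative numbers above:

```python
# Illustrative leaderboard data from the table above
models = {
    "Model A": {"raw": 0.92, "adjusted": 0.71},
    "Model B": {"raw": 0.90, "adjusted": 0.79},
    "Model C": {"raw": 0.91, "adjusted": 0.68},
    "Model D": {"raw": 0.84, "adjusted": 0.76},
}

by_raw = sorted(models, key=lambda m: models[m]["raw"], reverse=True)
by_adjusted = sorted(models, key=lambda m: models[m]["adjusted"], reverse=True)

# by_raw starts with Model A; by_adjusted starts with Model B,
# and the most robust low-scorer, Model D, jumps to second place
```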

05 — Transforms

What we change, and what we hold constant.

Every transformation is classified as either surface (should not change the answer) or structural (should change the answer). The dual classification is what makes SymGap a test of understanding, not just a stress test.

Surface transforms

Paraphrase, formatting changes, variable renaming, premise reordering, domain redressing. These change the words. They don't change the problem.

Structural transforms

Negation, quantifier changes, premise removal, causal reversal, structural swaps. These change the problem itself. The model should notice.

The diagnostic

Drift under surface transforms = brittleness. Stability under structural transforms = obliviousness. Both are failures of understanding.
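The diagnostic above reduces to a small decision table. A sketch, with transform names mirroring the two lists (the names themselves are illustrative, not the toolkit's identifiers):

```python
# Transform families, mirroring the surface/structural lists above
SURFACE = {"paraphrase", "formatting", "variable_rename",
           "premise_reorder", "domain_redress"}
STRUCTURAL = {"negation", "quantifier_change", "premise_removal",
              "causal_reversal", "structural_swap"}

def diagnose(transform: str, answer_changed: bool) -> str:
    """Map one (transform, outcome) pair to the diagnostic:
    drift under a surface transform is brittleness; stability under
    a structural transform is obliviousness."""
    if transform in SURFACE:
        return "brittle" if answer_changed else "ok"
    if transform in STRUCTURAL:
        return "ok" if answer_changed else "oblivious"
    raise ValueError(f"unknown transform: {transform}")
```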

06 — Why this matters

Three problems this addresses.

Benchmark gaming

Models increasingly optimize for benchmark scores through training data overlap, format memorization, and surface-pattern exploitation. The symmetry gap directly measures how much of a score is real. A model that truly understands a domain will have a small gap. A model that has memorized the test will have a large one.

Deployment trust

In production, prompts arrive in unpredictable forms. Users don't phrase things like benchmark authors. Evidence arrives in arbitrary order. Variables have unfamiliar names. A model's symmetry profile tells you whether its benchmark performance will survive contact with the real world.

The reasoning question

The field's central open question is whether language models genuinely reason or merely pattern-match at scale. The understanding matrix provides a direct empirical test. If a model aces Standard problems but fails Trap and Transfer variants, the answer — for those tasks, at that scale — is pattern matching.

07 — Methodology

How it works.

SymGap audits any benchmark in four steps:

Transform

Apply meaning-preserving and meaning-altering transformations to each benchmark question. Transforms are generated once, human-validated, and frozen as reproducible fixtures.

Evaluate

Run the original and all transformed variants through the model. Cache every response. Use deterministic generation settings for reproducibility.

Judge

Compare outputs using exact match (for numeric and multiple-choice), embedding similarity (continuous signal), and LLM-as-judge (for freeform responses).

Score

Compute the symmetry gap, surface stability, structural sensitivity, per-family drift profile, and understanding matrix scores. Generate a human-readable report card.
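The four steps can be sketched as a single audit loop. This is a schematic with a hypothetical API (the released toolkit may look different): `model` maps a question to an answer, `judge` decides whether two answers agree, and the demo "model" below is a toy that only reads the word "not".

```python
def audit(questions, surface_transforms, structural_transforms, model, judge):
    """Schematic SymGap audit: transform, evaluate, judge, score."""
    surface_agree, structural_agree = [], []
    for q in questions:
        base = model(q)
        for t in surface_transforms:      # answer should NOT change
            surface_agree.append(judge(base, model(t(q))))
        for t in structural_transforms:   # answer SHOULD change
            structural_agree.append(judge(base, model(t(q))))
    stability = sum(surface_agree) / len(surface_agree)
    sensitivity = 1 - sum(structural_agree) / len(structural_agree)
    return {"surface_stability": stability,
            "structural_sensitivity": sensitivity}

# Toy demo: a "model" whose answer depends only on the word "not"
model = lambda q: "no" if "not" in q.lower() else "yes"
result = audit(
    questions=["all servers that pass the check are deployed"],
    surface_transforms=[str.upper],                # stand-in for paraphrase
    structural_transforms=[lambda q: "not " + q],  # negation
    model=model,
    judge=lambda a, b: a == b,
)
# This toy model is perfectly stable and perfectly sensitive by construction
```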

08 — What's next

An open-source toolkit and benchmark.

SymGap will be released as a Python toolkit and a curated evaluation suite. The first release targets GSM8K, MMLU, and a custom evidence-aggregation benchmark, with results across frontier models from OpenAI, Anthropic, Google, and Meta.

The companion paper introduces the symmetry gap metric and the understanding matrix, presents empirical findings, and positions SymGap as a complement to standard accuracy-based evaluation.

Accuracy tells you what a model gets right. Symmetry tells you whether it knows why.

A benchmark score is a snapshot. The symmetry gap measures the depth behind it. When the phrasing changes, the variables change, and the familiar framing disappears — what's left is understanding. SymGap measures what's left.