The Semantic Equivalence Problem
When we evaluate AI systems for software engineering, we compare model predictions against reference answers. But an exact textual match is not always necessary for correctness. A summary saying that a Java method "reads the contents of this source as a string" is semantically equivalent to one saying that it "gets the textual information from this source and represents it as a string", yet token-overlap metrics score the pair as a poor match.
This module explores why overlap-based metrics can fail and what alternatives exist.
The Core Problem
Consider these two summaries of the same code:
- Prediction: "Reads the contents of this source as a string."
- Ground Truth: "Get the textual information from this source and represent it as a string."
Only six tokens overlap (the, this, source, as, a, string), and most of them are function words, yet the summaries are semantically equivalent. An n-gram overlap metric such as BLEU would score this 0.21, suggesting a near-total failure when in fact the prediction is correct. This is the fundamental challenge of evaluation: surface-level similarity is not the same as semantic correctness.
Classification Metrics: Foundation
Many SE tasks are classification problems rather than generation problems, so before turning to generative models we start with the standard classification metrics. Several of the generation metrics that follow reuse the same ideas of precision and recall.
- Precision
- Of all items predicted positive, how many truly are? TP / (TP + FP)
- Recall
- Of all actual positives, how many did we find? TP / (TP + FN)
- F1 Score
- Harmonic mean of precision and recall. 2 · P · R / (P + R)
- Accuracy
- Overall fraction of correct predictions. (TP + TN) / (TP + FP + TN + FN). Can be misleading with imbalanced data.
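The four definitions above can be sketched directly from confusion-matrix counts. The function name `classification_metrics` and the defect-prediction numbers below are hypothetical, chosen only to show how accuracy can look excellent on imbalanced data while precision and recall stay low.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical imbalanced run: only 10 of 1000 files are actually buggy.
metrics = classification_metrics(tp=2, fp=3, fn=8, tn=987)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up run the classifier finds only 2 of the 10 buggy files, yet accuracy is 0.989 because the negative class dominates.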
BLEU: The Workhorse Metric
BLEU (Bilingual Evaluation Understudy) measures how much the generated text's n-grams overlap with the reference. It is simple, fast, and widely used — but it has significant limitations for code evaluation. BLEU computes the precision of n-grams (sequences of 1, 2, 3, or 4 consecutive tokens) in the candidate that appear in the reference, then applies a brevity penalty to discourage short outputs.
BLEU = BP × exp(Σ w_n · log p_n)
where:
p_n = modified n-gram precision
w_n = 1/N (uniform weight, typically N=4)
BP = min(1, e^(1 − r/c)) = brevity penalty
r = reference length, c = candidate length
Worked Example
Reference: public int getMax(int[] arr)
Candidate: public int findMax(int[] array)
- 1-grams: {public, int} match → precision = 2/4 = 0.50
- 2-grams: {public int} matches → precision = 1/3 = 0.33
- Brevity penalty: both have 4 tokens → BP = 1.00
- BLEU-2 = 1.00 × exp(0.5 × ln(0.50) + 0.5 × ln(0.33)) = 0.41
Renaming the method and its parameter dropped the score to 0.41, even though the code structure is identical.
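The worked example can be reproduced with a minimal single-reference BLEU sketch. It assumes whitespace tokenization (so `getMax(int[]` counts as one token, matching the four-token count above) and applies no smoothing. The helper names are illustrative and not taken from any particular library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def bleu(candidate, reference, max_n=4):
    """BLEU with uniform weights w_n = 1/max_n and the standard brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "public int getMax(int[] arr)".split()
candidate = "public int findMax(int[] array)".split()
print(round(bleu(candidate, reference, max_n=2), 2))  # 0.41
```

With `max_n=4` the same pair scores 0.0, because no 3-gram or 4-gram matches. This is one reason unsmoothed BLEU is fragile on short snippets.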
BLEU Limitations and Failure Modes
BLEU is widely used but has well-documented failure modes, especially for code. Understanding these limitations is essential for interpreting BLEU scores correctly and knowing when to use alternatives.
Variable Renaming
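Renaming sum to total drops BLEU significantly despite functional equivalence. BLEU treats every token equally: it cannot distinguish variable names from keywords. A single rename can drop BLEU by 10-30%.
No Semantic Understanding
Reordering is a second failure mode: x = a + b and x = b + a are mathematically identical, but BLEU scores them differently because their bigrams differ. Commutativity is invisible to BLEU.
Sentence-Level Unreliability
BLEU was designed as a corpus-level metric. On a single short snippet the score is noisy, and without smoothing it collapses to zero whenever any n-gram order has no match.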
ROUGE, METEOR, and Recall-Oriented Metrics
BLEU measures precision: how much of the candidate appears in the reference. But what about recall? Different tasks need different emphasis. ROUGE is recall-oriented: it asks how much of the reference appears in the candidate. METEOR combines precision and recall in a recall-weighted score.
- ROUGE-1
- Unigram recall: what fraction of reference unigrams appear in the candidate?
- ROUGE-2
- Bigram recall: captures some word-order information.
- ROUGE-L
- Longest common subsequence: rewards in-order overlap without requiring contiguity.
- METEOR
- Combines unigram matches with stemming, synonyms, and a word-order penalty. Recall-weighted F-mean.
ROUGE-L Worked Example
Reference: get the textual information from this source and represent it as a string (13 tokens)
Candidate: reads the contents of this source as a string (9 tokens)
- LCS: "the", "this", "source", "as", "a", "string" = 6 tokens
- ROUGE-L Recall: 6 / 13 = 0.462
- ROUGE-L Precision: 6 / 9 = 0.667
- ROUGE-L F1: 2 × 6 / (9 + 13) = 0.545
This is more generous than BLEU (0.21) because LCS recognizes the shared sequential structure despite different word choices.
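The LCS computation behind this example can be sketched with the classic dynamic program. This is a minimal illustration on lowercased, whitespace-split tokens, not the reference ROUGE implementation, and the function names are assumptions. Swapping which text is treated as the reference swaps precision and recall but leaves the LCS length and F1 unchanged.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate, reference):
    """Return (precision, recall, F1) based on the LCS of the two token lists."""
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate) if candidate else 0.0
    recall = lcs / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


text_a = "reads the contents of this source as a string".split()
text_b = "get the textual information from this source and represent it as a string".split()

print(lcs_length(text_a, text_b))  # 6
precision, recall, f1 = rouge_l(candidate=text_a, reference=text_b)
print(round(f1, 3))  # 0.545
```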
Embedding-Based Metrics and Vector Similarity
Instead of comparing surface tokens, we can encode each text into a vector and measure geometric closeness. Embedding-based metrics transform the evaluation problem from string matching to distance in a learned semantic space.
Subword splitting: getMaxValue becomes ["get", "Max", "Value"].
cos(A, B) = (A · B) / (||A|| × ||B||)
Range: −1 (opposite) to +1 (identical direction)
0 = orthogonal (unrelated)
Key insight: Cosine similarity measures DIRECTION, not LENGTH.
Two vectors pointing the same way score 1.0 regardless of length.
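The cosine formula translates directly into a few lines of code. The sketch below works on plain Python lists. In practice the vectors would come from a learned encoder, which is outside the scope of this example, and the toy vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity is (numerically) 1.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# Orthogonal vectors: similarity is 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
# Opposite directions: similarity is -1.
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```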
CodeBLEU: Structure and Semantic Awareness
CodeBLEU extends BLEU by recognizing that code should be evaluated not only as text, but also as structure and behavior. It combines four components — each weighted 0.25: n-gram match, weighted n-gram match, AST match, and data-flow match.
AST match: compares syntactic structure. Parses both snippets into Abstract Syntax Trees, normalizes variable names, and measures the fraction of reference subtrees that appear in the candidate.
Example: Two for-loops with different loop variables score 1.0 on AST match because their structure is identical after normalization.
Data-flow match: compares value dependencies. Tracks how values are defined, used, and propagated. Two snippets with identical data-flow graphs score highly even if variable names differ.
Example: int x = a + b; int y = x * 2; and int temp = a + b; int result = temp * 2; have identical data-flow despite renaming.
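The idea that renaming should not affect a structural comparison can be illustrated with Python's standard `ast` module. This is not how CodeBLEU itself is implemented, since CodeBLEU scores partial subtree and data-flow overlap. The sketch only checks whether two snippets become identical once every identifier is replaced by a placeholder numbered by first appearance. The snippets are Python analogues of the examples above, and `same_structure` is a hypothetical helper.

```python
import ast


class _RenameIdentifiers(ast.NodeTransformer):
    """Replace each distinct identifier with v0, v1, ... in order of first use."""

    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        placeholder = self.names.setdefault(node.id, f"v{len(self.names)}")
        return ast.copy_location(ast.Name(id=placeholder, ctx=node.ctx), node)


def normalized_dump(source):
    """Parse source and return a textual AST dump with identifiers normalized."""
    tree = _RenameIdentifiers().visit(ast.parse(source))
    return ast.dump(tree)  # line and column attributes are excluded by default


def same_structure(source_a, source_b):
    return normalized_dump(source_a) == normalized_dump(source_b)


loop_a = "for i in range(n):\n    total += i"
loop_b = "for idx in range(n):\n    acc += idx"
print(same_structure(loop_a, loop_b))  # True

flow_a = "x = a + b\ny = x * 2"
flow_b = "temp = a + b\nresult = temp * 2"
print(same_structure(flow_a, flow_b))  # True

print(same_structure("x = a + b", "x = a * b"))  # False
```

Because placeholders are assigned consistently, the check also requires that each variable is reused in the same positions. That loosely mirrors the data-flow intuition, though it is far stricter than a graded match score.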
pass@k: Functional Correctness and Test-Based Evaluation
For code generation, we don't need EVERY output to be correct, just ONE. pass@k measures the probability that at least one of k generated solutions passes all test cases. It is widely treated as the gold standard for code generation evaluation, because it tests behavior rather than surface form.
pass@k = 1 − C(n−c, k) / C(n, k)
where:
n = total samples generated
c = number that pass all tests
k = number of attempts allowed
With n=100, c=23: pass@1 = 0.23, pass@10 ≈ 0.94, pass@100 = 1.00. This captures a fundamental property of code generation: users often generate multiple candidates and pick the best one. A model producing one correct solution out of ten is still useful.
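The pass@k formula can be evaluated exactly with integer binomial coefficients. The sketch below uses `math.comb` from the standard library. The function name `pass_at_k` is an assumption, and the inputs are the n=100, c=23 case from the text.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c: every k-subset then contains a passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


for k in (1, 10, 100):
    print(k, round(pass_at_k(n=100, c=23, k=k), 3))
# 1 0.23
# 10 0.937
# 100 1.0
```

The estimator is the fraction of size-k subsets of the n samples that contain at least one passing sample, which is why it reaches 1.0 as soon as k exceeds n − c.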
| Benchmark | Coverage | Key property |
|---|---|---|
| HumanEval | 164 hand-crafted Python problems | Standard: n=200, report pass@1, pass@10, pass@100 |
| MBPP | 974 crowd-sourced Python tasks | Simpler than HumanEval, broader coverage |
| LiveCodeBench | Continuously updated from competitive programming | Resistant to contamination: problems can be chosen to post-date a model's training cutoff |
Contrastive Learning: Shaping the Embedding Space
If lexical metrics are not enough, how else might we learn what "similar meaning" looks like? The answer: shape the embedding space itself through contrastive learning. The idea is simple: train a model so that similar pairs stay close in embedding space and dissimilar pairs are pushed far apart.
- Contrastive Loss
- Pull similar pairs together, push dissimilar pairs apart up to a margin.
- Triplet Loss
- Anchor should be closer to positive than to negative by margin m.
- N-pair Loss
- Generalize to multiple negatives — distinguish the correct match from many wrong candidates.
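Two of the losses above can be written out on toy vectors. This is one common formulation of each: squared-distance contrastive loss and margin-based triplet loss over Euclidean distance. Variants with different constants or distance functions exist, so it should be read as an illustration rather than the definition used by any specific paper. N-pair loss is omitted for brevity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs apart up to the margin."""
    d = euclidean(a, b)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2  # no penalty once the pair is at least `margin` apart


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the anchor is closer to the positive than to the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

print(triplet_loss(anchor, positive=near, negative=far))  # 0.0
print(triplet_loss(anchor, positive=far, negative=near))  # 5.0
print(contrastive_loss(anchor, near, similar=False, margin=2.0))  # 1.0
```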
SIDE: Summary-to-Code Semantic Alignment
Traditional metrics compare prediction vs. reference summary. But a summary can sound fluent and still be wrong with respect to the code. A stronger metric should measure alignment with code semantics. SIDE (Summary alIgnment to coDe sEmantics) learns a metric using contrastive learning on ~180K method-summary pairs, measuring whether the summary aligns with the meaning of the code itself, not just with a reference sentence.
Traditional metrics: prediction vs. reference summary. A fluent-sounding but wrong summary might score well on overlap metrics.
SIDE: summary vs. code semantics. Encodes both code and summary through MPNet and compares semantic alignment directly.
Common Pitfalls and How to Avoid Them
Even with the right metrics, evaluation can go wrong. Contamination — when test data leaks into pre-training corpora — makes evaluation results meaningless. A model trained on web data may have memorized solutions from HumanEval or MBPP. High scores then measure memorization, not generalization. Detection: check for long n-gram overlap (8-13 tokens) between test and training. Defense: use time-stamped benchmarks (LiveCodeBench).
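The long n-gram overlap check can be sketched as a set intersection over token windows. This is a minimal illustration that assumes whitespace tokenization and an in-memory corpus. A real contamination audit would need normalization and an index that scales to the training data. The function names and the example strings are hypothetical.

```python
def ngram_set(tokens, n):
    """Set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shares_long_ngram(test_text, training_text, n=8):
    """True if the two texts share at least one sequence of n consecutive tokens."""
    test_grams = ngram_set(test_text.split(), n)
    train_grams = ngram_set(training_text.split(), n)
    return not test_grams.isdisjoint(train_grams)


# Made-up strings standing in for a benchmark item and two training documents.
benchmark_item = "write a function def add ( a , b ) : return a + b"
leaked_doc = "def add ( a , b ) : return a + b # simple helper"
clean_doc = "def multiply ( x , y ) : return x * y"

print(shares_long_ngram(benchmark_item, leaked_doc))  # True
print(shares_long_ngram(benchmark_item, clean_doc))   # False
```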
Statistical rigor: a 2% BLEU improvement means nothing if it falls within the noise margin. Always report confidence intervals using bootstrap resampling. If Model A's BLEU is 32.4 [30.8, 34.1] and Model B's is 34.1 [32.3, 35.9], the confidence intervals overlap, so these intervals alone do not establish that the improvement is statistically significant. A paired bootstrap on the per-example score differences is the more direct test.
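A percentile bootstrap interval can be sketched with the standard library alone. The sketch assumes one score per test example, resamples examples with replacement, and reports the interval on the mean. The function name, the resample count, and the per-example scores are all hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = int(n_resamples * alpha / 2)
    upper_index = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return means[lower_index], means[upper_index]


# Made-up per-example scores for two models on the same ten test items.
model_a = [0.31, 0.28, 0.40, 0.35, 0.22, 0.37, 0.30, 0.33, 0.29, 0.36]
model_b = [0.33, 0.27, 0.43, 0.36, 0.25, 0.38, 0.31, 0.36, 0.28, 0.39]

print("A:", bootstrap_ci(model_a))
print("B:", bootstrap_ci(model_b))

# Paired comparison: bootstrap the per-example differences instead.
differences = [b - a for a, b in zip(model_a, model_b)]
print("B - A:", bootstrap_ci(differences))
```

If the interval on the paired differences excludes zero, the improvement is unlikely to be resampling noise. That is a sharper check than looking at whether the two per-model intervals overlap.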
Other pitfalls to watch for: cherry-picking, wrong baseline, overfitting to benchmarks, and metric gaming.
Metric Selection Guide by Task
Different tasks demand different metrics. Using the wrong metric can lead to misleading conclusions about model quality. Here is a decision framework for common SE tasks.
| Task | Primary metric | Secondary metrics | When to use human eval |
|---|---|---|---|
| Code Generation | pass@k | CodeBLEU + BLEU | Always for final paper conclusions |
| Code Summarization | ROUGE + METEOR | BLEU + SIDE | To assess summary quality and alignment |
| Code Translation | CodeBLEU | Exact Match + AST similarity | When structure preservation matters |
| Bug Fixing | Exact Match + Test Pass | BLEU as secondary | Always — correctness is binary |
| Code Review | Human Evaluation | SIDE + Embedding Sim | Primary evaluation method |
| Documentation Generation | METEOR | ROUGE + Embedding Sim | For fluency and completeness |
Course materials
Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.
Lab B · Evaluating a code model
Fine-tune CodeT5 for code translation, then evaluate with BLEU and CodeBLEU. Two notebooks: one runs the model, one computes the metrics.