
Module 03 · Metrics

Evaluating Rigorously

Master the metrics for evaluating AI systems that write and understand code—from token overlap to semantic alignment and functional correctness testing.

Reading time: ~50 min · Instructor: Dr. Mastropaolo · Cohort: Spring 2026

The Semantic Equivalence Problem

When we evaluate AI systems for software engineering, we compare model predictions against reference answers. But an exact textual match is not always necessary for correctness. A summary saying that a Java method "reads the contents of this source as a string" is semantically equivalent to one saying that it "gets the textual information from this source and represents it as a string", yet token-overlap metrics will penalize the mismatch heavily.

This module explores why overlap-based metrics can fail and what alternatives exist.

The Core Problem

Consider these two summaries of the same code:

  • Prediction: "Reads the contents of this source as a string."
  • Ground Truth: "Get the textual information from this source and represent it as a string."

Only 6 of the 13 ground-truth tokens appear in the prediction, yet the summaries are semantically equivalent. An n-gram overlap metric such as BLEU would score this pair 0.21, suggesting a near-total failure when in fact the prediction is perfectly correct. This is the fundamental challenge of evaluation: surface-level similarity is not the same as semantic correctness.

At a glance: 12 metrics covered · 3 loss functions · SIDE as the final framework · 7 live demos

Classification Metrics: Foundation

Many SE tasks are classification problems rather than generation problems, so we start there before turning to generative models. These tasks use standard classification metrics, and the same notions of precision and recall reappear in the overlap metrics that follow.

Precision
Of all items predicted positive, how many truly are? TP / (TP + FP)
Recall
Of all actual positives, how many did we find? TP / (TP + FN)
F1 Score
Harmonic mean of precision and recall. 2 · P · R / (P + R)
Accuracy
Overall correct predictions. Can be misleading with imbalanced data.
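
The four definitions above can be computed directly from confusion-matrix counts. The sketch below is a minimal illustration, not a library API: the function name `classification_metrics` and the example counts are made up for this module. The second call shows why accuracy misleads on imbalanced data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical classifier with a reasonable balance of errors.
balanced = classification_metrics(tp=8, fp=2, fn=4, tn=86)

# Hypothetical imbalanced data set: 10 positives out of 1000 samples,
# and a classifier that predicts "negative" for everything.
all_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)

print(balanced)
print(all_negative)
```

In the second case accuracy is 0.99 while recall is 0: the classifier never finds a single positive, which is exactly the situation the accuracy caveat warns about.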

BLEU: The Workhorse Metric

BLEU (Bilingual Evaluation Understudy) measures how much the generated text's n-grams overlap with the reference. It is simple, fast, and widely used — but it has significant limitations for code evaluation. BLEU computes the precision of n-grams (sequences of 1, 2, 3, or 4 consecutive tokens) in the candidate that appear in the reference, then applies a brevity penalty to discourage short outputs.

BLEU = BP × exp(Σ w_n · log p_n)

where:
  p_n = modified n-gram precision
  w_n = 1/N (uniform weight, typically N=4)
  BP = min(1, e^(1 − r/c)) = brevity penalty
  r = reference length, c = candidate length

Worked Example

Reference: public int getMax(int[] arr)

Candidate: public int findMax(int[] array)

  • 1-grams: {public, int} match → precision = 2/4 = 0.50
  • 2-grams: {public int} matches → precision = 1/3 = 0.33
  • Brevity penalty: both have 4 whitespace-delimited tokens → BP = 1.00
  • BLEU-2 = 1.00 × exp(0.5 × ln(0.50) + 0.5 × ln(0.33)) = 0.41

Renaming the method and its parameter alone dropped the score to 0.41, even though the code structure is identical.
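
The worked example can be reproduced with a few lines of code. This is a minimal single-reference sketch that assumes whitespace tokenization and no smoothing; the helper names (`ngrams`, `modified_precision`, `bleu`) are illustrative and not taken from any particular BLEU library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU with uniform weights w_n = 1/max_n."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU collapses to 0
    c, r = len(candidate), len(reference)
    brevity_penalty = min(1.0, math.exp(1 - r / c))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "public int getMax(int[] arr)".split()
candidate = "public int findMax(int[] array)".split()

bleu_2 = bleu(candidate, reference, max_n=2)
bleu_4 = bleu(candidate, reference, max_n=4)
print(f"BLEU-2 = {bleu_2:.2f}, BLEU-4 = {bleu_4:.2f}")
```

With `max_n=4` the same pair scores 0, because no trigram matches and the geometric mean collapses. Real BLEU implementations offer smoothing options to soften this behavior.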

BLEU Limitations and Failure Modes

BLEU is widely used but has well-documented failure modes, especially for code. Understanding these limitations is essential for interpreting BLEU scores correctly and knowing when to use alternatives.

Variable Renaming

Renaming sum to total drops BLEU significantly despite functional equivalence. BLEU treats every token equally — it cannot distinguish variable names from keywords. A single rename can drop BLEU by 10-30%.

Reordering

Swapping the order of independent statements breaks higher-order n-gram matches even though execution order may not matter. BLEU-4 can halve despite identical semantics.

No Semantic Understanding

x = a + b and x = b + a are mathematically identical. BLEU scores them differently because bigrams differ. Commutativity is invisible to BLEU.
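
A quick check makes the effect concrete. The sketch below assumes whitespace tokenization and uses a hypothetical helper, `ngram_precision`, that computes clipped n-gram precision only (no brevity penalty).

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate tokens against reference tokens."""
    cand = Counter(zip(*(candidate[i:] for i in range(n))))
    ref = Counter(zip(*(reference[i:] for i in range(n))))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / max(sum(cand.values()), 1)


reference = "x = a + b".split()
candidate = "x = b + a".split()

unigram_p = ngram_precision(candidate, reference, 1)
bigram_p = ngram_precision(candidate, reference, 2)
print(f"unigram precision = {unigram_p:.2f}, bigram precision = {bigram_p:.2f}")
```

Every unigram matches, but only the bigram `x =` survives the swap, so bigram precision falls to 1/4 even though the two statements compute the same value.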

Sentence-Level Unreliability

BLEU was designed for corpus-level evaluation (aggregating over thousands of examples). Sentence-level BLEU is unreliable and often produces scores of 0 for short sequences, because a single n-gram order with no matches zeroes the geometric mean unless smoothing is applied. Never draw conclusions from a single BLEU score.

ROUGE, METEOR, and Recall-Oriented Metrics

BLEU measures precision: how much of the candidate appears in the reference. But what about recall? Different tasks need different emphasis. ROUGE is recall-oriented: it asks what fraction of the reference's n-grams appear in the candidate. ROUGE-L uses the Longest Common Subsequence, rewarding in-order overlap without requiring contiguity.

ROUGE-1
Unigram recall: what fraction of reference unigrams appear in the candidate?
ROUGE-2
Bigram recall: captures some word-order information.
ROUGE-L
Longest common subsequence: rewards in-order overlap without requiring contiguity.
METEOR
Combines unigram matches with stemming, synonyms, and word order penalty. Recall-weighted F-mean.

ROUGE-L Worked Example

Reference: get the textual information from this source and represent it as a string (13 tokens)

Candidate: reads the contents of this source as a string (9 tokens)

  • LCS: "the", "this", "source", "as", "a", "string" = 6 tokens
  • ROUGE-L Recall: 6 / 13 = 0.462
  • ROUGE-L Precision: 6 / 9 = 0.667
  • ROUGE-L F1: 2 · P · R / (P + R) = 0.545

This is more generous than BLEU (0.21) because LCS recognizes the shared sequential structure despite different word choices.
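
The LCS computation behind this example is a standard dynamic program. The sketch below assumes lowercased, whitespace-tokenized text and uses illustrative function names (`lcs_length`, `rouge_l`). The F1 value is symmetric in the two texts, so it does not depend on which one is treated as the reference.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, F1) based on the LCS of the two token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)


prediction = "reads the contents of this source as a string".split()
ground_truth = ("get the textual information from this source "
                "and represent it as a string").split()

shared = lcs_length(prediction, ground_truth)
precision, recall, f1 = rouge_l(prediction, ground_truth)
print(f"LCS = {shared}, F1 = {f1:.3f}")
```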

Embedding-Based Metrics and Vector Similarity

Instead of comparing surface tokens, we can encode each text into a vector and measure geometric closeness. Embedding-based metrics transform the evaluation problem from string matching to distance in a learned semantic space.

Tokenization
Code is split into sub-word tokens using BPE. getMaxValue becomes ["get", "Max", "Value"].
Token Embedding
Each token ID maps to a learned 768-dim vector via a lookup table.
Transformer Layers
12 self-attention layers refine each token's vector using context from all other tokens.
Pooling
The [CLS] token's final hidden state (or mean of all tokens) serves as the fixed-length representation of the entire snippet.
cos(A, B) = (A · B) / (||A|| × ||B||)

Range: −1 (opposite) to +1 (identical direction)
0 = orthogonal (unrelated)

Key insight: Cosine similarity measures DIRECTION, not LENGTH.
Two vectors pointing the same way score 1.0 regardless of length.
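
The pooling and cosine steps are small enough to write out by hand. The sketch below uses plain Python lists and toy 2- and 3-dimensional vectors instead of real 768-dimensional model outputs; `mean_pool` and `cosine_similarity` are illustrative names, not functions of any embedding library.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-length snippet vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(a, b):
    """cos(A, B) = (A · B) / (||A|| × ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])

print(same_direction, orthogonal, opposite, pooled)
```

The first pair differs only in length (one vector is twice the other), so the similarity is 1 up to floating-point rounding, which is the "direction, not length" property stated above.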

CodeBLEU: Structure and Semantic Awareness

CodeBLEU extends BLEU by recognizing that code should be evaluated not only as text, but also as structure and behavior. It combines four components — each weighted 0.25: n-gram match, weighted n-gram match, AST match, and data-flow match.
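
As a rough sketch of how the four components combine, the function below takes already-computed component scores and applies the equal 0.25 weights described above. The component values in the example are hypothetical and chosen only to show how structural agreement can lift a low n-gram score.

```python
def codebleu(ngram, weighted_ngram, ast_match, dataflow_match,
             weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four CodeBLEU components (each score in [0, 1])."""
    components = (ngram, weighted_ngram, ast_match, dataflow_match)
    return sum(w * score for w, score in zip(weights, components))


# Hypothetical scores: poor token overlap after renaming,
# but identical syntax tree and data flow.
score = codebleu(ngram=0.41, weighted_ngram=0.45, ast_match=1.0, dataflow_match=1.0)
print(f"CodeBLEU = {score:.3f}")
```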

AST Match

Compares syntactic structure. Parses both snippets into Abstract Syntax Trees, normalizes variable names, and measures the fraction of reference subtrees that appear in the candidate.

Example: Two for-loops with different loop variables score 1.0 on AST match because their structure is identical after normalization.
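
The idea can be approximated with Python's built-in `ast` module. This is a simplified stand-in under stated assumptions: CodeBLEU implementations parse the target language with a dedicated parser, whereas the sketch below handles Python snippets only and "normalizes" names by keeping node types and discarding identifiers and literal values. The function names are illustrative.

```python
import ast
from collections import Counter


def subtree_signatures(source):
    """Count a structural signature for every subtree in the parsed source.
    Signatures keep node types only, so identifiers and literals are ignored."""
    def signature(node):
        children = ",".join(signature(child) for child in ast.iter_child_nodes(node))
        return f"{type(node).__name__}({children})"

    return Counter(signature(node) for node in ast.walk(ast.parse(source)))


def ast_match(candidate_src, reference_src):
    """Fraction of reference subtrees that also appear in the candidate."""
    cand = subtree_signatures(candidate_src)
    ref = subtree_signatures(reference_src)
    matched = sum(min(count, cand[sig]) for sig, count in ref.items())
    return matched / sum(ref.values())


reference_loop = "for i in range(n):\n    total += values[i]\n"
renamed_loop = "for j in range(m):\n    acc += items[j]\n"
while_loop = "while n > 0:\n    n -= 1\n"

same_structure = ast_match(renamed_loop, reference_loop)
different_structure = ast_match(while_loop, reference_loop)
print(same_structure, different_structure)
```

The renamed loop matches every reference subtree, while the `while` loop matches only small shared fragments such as name loads and stores.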

Data-flow Match

Compares value dependencies. Tracks how values are defined, used, and propagated. Two snippets with identical data-flow graphs score highly even if variable names differ.

Example: int x = a + b; int y = x * 2; and int temp = a + b; int result = temp * 2; have identical data-flow despite renaming.

pass@k: Functional Correctness and Test-Based Evaluation

For code generation, we don't need EVERY output to be correct — just ONE. pass@k measures the probability that at least one of k generated solutions passes all test cases. It is the gold standard for code generation evaluation.

pass@k = 1 − C(n−c, k) / C(n, k)

where:
  n = total samples generated
  c = number that pass all tests
  k = number of attempts allowed

With n=100, c=23: pass@1 = 0.23, pass@10 ≈ 0.94, pass@100 = 1.00. This captures a fundamental property of code generation: users often generate multiple candidates and pick the best one. A model producing one correct solution out of ten is still useful.
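
The estimator is a one-liner once binomial coefficients are available. The sketch below uses exact integer binomials from Python's `math.comb`; evaluation harnesses often use a numerically stable product form instead, so treat this as an illustration of the formula rather than a reference implementation. The function name `pass_at_k` is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c pass all tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(n=100, c=23, k=k):.4f}")
```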

| Benchmark | Coverage | Key property |
| --- | --- | --- |
| HumanEval | 164 hand-crafted Python problems | Standard: n=200, report pass@1, pass@10, pass@100 |
| MBPP | 974 crowd-sourced Python tasks | Simpler than HumanEval, broader coverage |
| LiveCodeBench | Continuously updated from competitive programming | Resists contamination: problems can be chosen to post-date a model's training data |

Contrastive Learning: Shaping the Embedding Space

If lexical metrics are not enough, how else might we learn what "similar meaning" looks like? The answer: shape the embedding space itself through contrastive learning. The idea is simple: train a model so that similar pairs stay close in embedding space and dissimilar pairs are pushed far apart.

Contrastive Loss
Pull similar pairs together, push dissimilar pairs apart up to a margin.
Triplet Loss
Anchor should be closer to positive than to negative by margin m.
N-pair Loss
Generalize to multiple negatives — distinguish the correct match from many wrong candidates.
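
The three losses can be sketched in their common textbook forms. The code below is a minimal illustration on plain Python lists and assumes Euclidean distance for the contrastive and triplet losses and dot-product similarity for the N-pair loss; actual training code would use a tensor library and batched inputs. All vectors in the example are toy values.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(u, v, is_similar, margin=1.0):
    """Similar pairs are pulled together; dissimilar pairs are pushed
    apart until their distance reaches the margin."""
    d = euclidean(u, v)
    return d ** 2 if is_similar else max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def n_pair_loss(anchor, positive, negatives):
    """log(1 + sum_i exp(a·n_i - a·p)): small when the positive
    outscores every negative."""
    pos = dot(anchor, positive)
    return math.log(1.0 + sum(math.exp(dot(anchor, neg) - pos) for neg in negatives))


anchor, positive = [1.0, 0.0], [0.9, 0.1]
easy_negative, hard_negative = [-1.0, 0.0], [0.8, 0.2]

print(contrastive_loss(anchor, positive, is_similar=True))
print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
print(n_pair_loss(anchor, positive, [easy_negative, hard_negative]))
```

The easy negative is already far from the anchor, so it contributes no triplet loss; the hard negative sits close to the positive and produces a large loss, which is where most of the learning signal comes from.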

SIDE: Summary-to-Code Semantic Alignment

Traditional metrics compare a prediction against a reference summary. But a summary can sound fluent and still be wrong with respect to the code. A stronger metric should measure alignment with code semantics. SIDE (Summary alIgnment to coDe sEmantics) learns a metric using contrastive learning on ~180K method-summary pairs, measuring whether the summary aligns with the meaning of the code itself, not just with a reference sentence.

Traditional Approach

Prediction vs. Reference Summary. A fluent-sounding but wrong summary might score well on overlap metrics.

SIDE Approach

Summary vs. Code Semantics. Encodes both code and summary through MPNet, compares semantic alignment directly.

Good Summary

"Create a connection to the consumer." aligns with the method's actual behavior. SIDE score: 0.81

Bad Summary

"Connect to the server and return the status." misrepresents the code. SIDE score: 0.23

Training Data

Contrastively trained on CodeXGLUE's 180K method-summary pairs to recognize semantic alignment.

Result

SIDE distinguishes good summaries from bad ones better than any surface metric, approaching human judgment.

Common Pitfalls and How to Avoid Them

Even with the right metrics, evaluation can go wrong. Contamination — when test data leaks into pre-training corpora — makes evaluation results meaningless. A model trained on web data may have memorized solutions from HumanEval or MBPP. High scores then measure memorization, not generalization. Detection: check for long n-gram overlap (8-13 tokens) between test and training. Defense: use time-stamped benchmarks (LiveCodeBench).
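
The n-gram overlap check can be sketched as a set intersection. The code below assumes whitespace tokenization and compares one test example against one training document; the function names and the example strings are hypothetical, and a real check would need to scale to a full corpus (for instance with hashing or an index).

```python
def ngram_set(text, n):
    """All contiguous n-grams of whitespace tokens in the text, as a set of tuples."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(test_example, training_document, n=13):
    """N-grams that occur in both texts; any hit is a contamination signal."""
    return ngram_set(test_example, n) & ngram_set(training_document, n)


# Hypothetical benchmark item and two hypothetical training documents.
test_item = "def add(a, b): return a + b # adds two numbers together quickly"
leaked_page = "Solutions thread: " + test_item + " (copied from the benchmark)"
clean_page = "def multiply(x, y): return x * y # multiplies two numbers and returns the product"

print(len(shared_ngrams(test_item, leaked_page, n=8)))
print(len(shared_ngrams(test_item, clean_page, n=8)))
```

A non-empty intersection at n = 8 or higher is unlikely to arise by chance, which is why long n-grams are used as the detection signal.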

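Statistical rigor: a 2-point BLEU improvement means nothing if it falls within the noise margin. Always report confidence intervals using bootstrap resampling. If Model A's BLEU is 32.4 [30.8, 34.1] and Model B's is 34.1 [32.3, 35.9], the confidence intervals overlap, so the improvement cannot be claimed as statistically significant on this evidence alone; a paired test over the same test examples (for instance, paired bootstrap resampling) is needed to settle it.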
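
Both ideas fit in a short sketch. The code below assumes the metric is an average of per-example scores, which holds for metrics such as pass@1 or sentence-level ROUGE but not for corpus-level BLEU, where the metric would have to be recomputed on each resample. The function names and the toy scores are illustrative.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resampled test sets on which model B beats model A.
    The same resampled indices are used for both models (paired design)."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in indices) > sum(scores_a[i] for i in indices):
            wins += 1
    return wins / n_resamples


# Toy per-example scores for two hypothetical models on the same 8 test items.
model_a = [0.20, 0.35, 0.50, 0.10, 0.40, 0.30, 0.25, 0.45]
model_b = [score + 0.10 for score in model_a]

print(bootstrap_ci(model_a))
print(bootstrap_ci(model_b))
print(paired_bootstrap_win_rate(model_a, model_b))
```

In this toy setup model B is better on every single example, so it wins on every paired resample even if the two unpaired intervals overlap. This is the reason a paired comparison is more informative than eyeballing interval overlap.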

Cherry-Picking

Showing only the best outputs creates misleading impressions. Fix: Report aggregate metrics over the full test set.

Wrong Baseline

Comparing to weak or outdated models inflates relative improvement. Fix: Compare against current state-of-the-art.

Overfitting to Benchmarks

Optimizing for metric scores (Goodhart's Law) rather than quality. Fix: Validate with held-out data and human evaluation.

Metric Gaming

Generating outputs of specific length or padding with safe tokens to exploit brevity penalties. Fix: Use multiple metrics; check quality manually.

Metric Selection Guide by Task

Different tasks demand different metrics. Using the wrong metric can lead to misleading conclusions about model quality. Here is a decision framework for common SE tasks.

| Task | Primary metric | Secondary metrics | When to use human eval |
| --- | --- | --- | --- |
| Code Generation | pass@k | CodeBLEU + BLEU | Always for final paper conclusions |
| Code Summarization | ROUGE + METEOR | BLEU + SIDE | To assess summary quality and alignment |
| Code Translation | CodeBLEU | Exact Match + AST similarity | When structure preservation matters |
| Bug Fixing | Exact Match + Test Pass | BLEU as secondary | Always; correctness is binary |
| Code Review | Human Evaluation | SIDE + Embedding Sim | Primary evaluation method |
| Documentation Generation | METEOR | ROUGE + Embedding Sim | For fluency and completeness |

Course materials

Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.

Lab B · Evaluating a code model

Fine-tune CodeT5 for code translation, then evaluate with BLEU and CodeBLEU. Two notebooks: one runs the model, one computes the metrics.

  • CodeT5_CTransl.ipynb
  • CodeBLEU.ipynb