Probability Refresher
A quick recap of the probability concepts you will need throughout this module.
The sample space (S) is the set of all possible outcomes. For example, rolling a die gives S = {1, 2, 3, 4, 5, 6}. An event A = “roll an even number” = {2, 4, 6}, so P(A) = 3/6 = 0.5.
Conditional Probability
P(A | B) = P(A ∩ B) / P(B)
Example: A bag has 3 red and 2 blue marbles. You draw one and it is red (event B). What is the probability the next draw is also red (event A)? Since one red was removed, P(A|B) = 2/4 = 0.5.
The Chain Rule
To estimate the probability of an entire sentence, we use the joint probability P(A, B) = P(A) · P(B|A). For a sentence s = (w₁, w₂, …, wₙ), the chain rule expands this to P(w₁, …, wₙ) = ∏ⁱ P(wⁱ | w₁, …, wⁱ₋₁).
Walkthrough: “its water is so transparent”
Applying the chain rule step by step gives a product of P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so). The problem: finding exact long sequences in any corpus is extremely unlikely — even huge corpora will never contain most long word sequences.
Maximum Likelihood Estimation
When we estimate probabilities by counting, we are using Maximum Likelihood Estimation (MLE) — the simplest and most intuitive way to learn a language model. Count occurrences, divide by total observations, and that is your probability estimate.
P_MLE(w | context) = Count(context, w) / Count(context)
Worked Example
In a Java corpus, suppose Count(“(”, “for”) = 100 and Count(“(”) = 500. Then PMLE(“for” | “(”) = 100 / 500 = 0.20.
The Data Sparsity Problem
To compute exact sequence probabilities, we would need enough data to observe every possible word combination. But the number of combinations is astronomical. With just 1,000 unique words, the number of possible 5-word combinations is C(1000, 5) = 8.25 × 10¹². No corpus is large enough to observe all of these.
The solution: instead of conditioning on the entire history, approximate by conditioning on only the last few words. This is the Markov assumption, and it gives us n-gram models.
N-gram Models & the Markov Assumption
The key idea: approximate the full conditional probability by looking at only the last N−1 tokens.
| Model | Approximation |
|---|---|
| Unigram | P(w₁…wₙ) ≈ Π P(wⁱ) — each word independent |
| Bigram | P(wⁱ | w₁…wⁱ₋₁) ≈ P(wⁱ | wⁱ₋₁) — previous word only |
| Trigram | P(wⁱ | w₁…wⁱ₋₁) ≈ P(wⁱ | wⁱ₋₂, wⁱ₋₁) |
| N-gram | Condition on the last N−1 tokens. Higher N = more context but exponentially more data needed. |
Start & End Tokens
Language models need to know where sequences begin and end. Special boundary tokens <s> and </s> make this possible. Without <s>, the model cannot learn which tokens start sequences. Without </s>, the model cannot learn when to stop generating.
Zipf's Law & the Naturalness Hypothesis
Word frequencies in natural language (and code) follow a power-law distribution known as Zipf's Law: a small number of words occur very frequently, while a large number occur rarely. About 20% of words account for 80% of all occurrences. In code, tokens like ;, =, (, ), return, if dominate, while most custom identifiers fall in the long tail.
The Naturalness Hypothesis
Hindle et al. (ICSE 2012) showed that despite the theoretical expressiveness of programming languages, real programs are mostly simple and rather repetitive, with predictable statistical properties that can be captured in statistical language models — in many cases more so than English prose.
Training a Bigram Model
Training an n-gram model is simply counting. Split your corpus into training (80%) and test (20%) sets, count all bigram occurrences, then normalize each row to get probabilities.
def train(self, methods, vocab):
self.vocab = vocab
self.vocab_size = len(vocab)
for tokens in methods:
p = pad(tokens, self.n)
for i in range(self.n - 1, len(p)):
for k in range(1, self.n + 1):
ctx = tuple(p[i - k + 1 : i])
self.counts[ctx][p[i]] += 1
self.totals[ctx] += 1
In practice, a 3-way split is preferred: training (70%) for learning the counts, a held-out/validation set (10%) for tuning hyperparameters like interpolation weights λ, and a test set (20%) for final evaluation. To prevent data leakage, split by repository — all methods from one repo go into the same pool.
Evaluation: Perplexity & Cross-Entropy
Extrinsic evaluation tests a model on a downstream task and measures accuracy. It is the gold standard but expensive and time-consuming. Intrinsic evaluation uses perplexity — a measure of model quality independent of any application. Lower perplexity = better model.
PP(W) = ( Πⁱ 1/P(wⁱ | wⁱ₋₁) )^(1/N)
def perplexity(self, methods):
log_prob, n_tok = 0.0, 0
for tokens in methods:
p = pad(tokens, self.n)
for i in range(self.n - 1, len(p)):
ctx = tuple(p[i - self.n + 1 : i])
log_prob += math.log2(self.prob(p[i], ctx))
n_tok += 1
return 2.0 ** (-(log_prob / n_tok)) if n_tok else float("inf")
Smoothing: Handling Zero Counts
If count(wⁱ₋₁, wⁱ) = 0, then P(wⁱ|wⁱ₋₁) = 0. A single zero probability makes the entire sequence probability zero and perplexity infinite.
Laplace (Add-1) Smoothing
P(wⁱ | wⁱ₋₁) = (count(wⁱ₋₁, wⁱ) + 1) / (count(wⁱ₋₁) + V)
where V = vocabulary size. Simple but crude — steals too much mass from observed events.
| Method | P(newList | return) | P(null | return) | Key trade-off |
|---|---|---|---|
| MLE (no smoothing) | 0/200 = 0.0000 | 80/200 = 0.4000 | Unseen events get zero — breaks perplexity |
| Laplace (add-1) | (0+1)/(200+1000) = 0.000833 | (80+1)/(200+1000) = 0.0675 | Steals too much mass |
| Add-k (k=0.01) | (0+0.01)/(200+10) = 0.0000476 | (80+0.01)/(200+10) = 0.3810 | Better balance |
| Kneser-Ney | Continuation prob: ~0.0003 | Count minus discount d: ~0.3950 | Best overall — uses context diversity |
Interpolation & Backoff
Instead of relying on a single n-gram order, blend multiple models together for more robust probability estimates. Linear interpolation blends P_uni, P_bi, and P_tri with weights λ₁ + λ₂ + λ₃ = 1. Stupid backoff tries the n-gram first; if count is zero, backs off to (n−1)-gram with penalty λ = 0.4.
def prob_interpolation(self, token, ctx):
lam = 1.0 / self.n
p = 0.0
for k in range(1, self.n + 1):
sub_ctx = ctx[-(k-1):] if k > 1 else ()
c = self.counts.get(sub_ctx, {}).get(token, 0)
t = self.totals.get(sub_ctx, 0)
p += lam * (c + self.alpha) / (t + self.alpha * self.vocab_size)
return p
Temperature & Sampling
Always picking the most likely next token (greedy decoding) produces repetitive, boring output. Sampling introduces variety by choosing randomly according to the probability distribution. Temperature (T) controls the sharpness of the distribution.
- T = 0 (greedy) — always pick the highest-probability token
- T = 1 — sample from the actual learned distribution
- T > 1 — flatten the distribution (more random/creative)
- T < 1 — sharpen the distribution (more focused/deterministic)
| Token | Raw logit | T = 0.3 (focused) | T = 1.0 (neutral) | T = 2.0 (creative) |
|---|---|---|---|---|
ArrayList | 5.0 | 0.876 | 0.437 | 0.288 |
HashMap | 3.5 | 0.101 | 0.263 | 0.241 |
String | 2.0 | 0.012 | 0.142 | 0.197 |
Object | 1.5 | 0.006 | 0.105 | 0.175 |
int | 1.0 | 0.003 | 0.053 | 0.099 |
Out-of-Vocabulary & the <UNK> Token
New code uses identifiers not in training data: myCustomHandler, calcNetRevenue. These have zero probability. Solution: replace rare tokens (count < threshold) with a special <UNK> token before training. The model learns “rare token behavior” — what typically follows or precedes an uncommon identifier. Modern solution: this is why modern LLMs use BPE (Byte Pair Encoding) instead of word-level tokenization. BPE breaks words into subword units, so even unseen identifiers can be represented as sequences of known subwords.
From N-grams to Neural Language Models
N-gram models have two fundamental limitations that neural language models overcome: parameter explosion (the number of parameters grows exponentially with the n-gram order) and no generalization (n-gram models cannot transfer knowledge between similar contexts).
- Count-based statistics
- Fixed context window
- No semantic understanding
- Cannot generalize
- Fast, interpretable
- Learned representations
- Long-range context
- Semantic embeddings
- Generalizes to new data
- GPU-intensive, less interpretable
Course materials
Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.
Pre-lab · ML warmup — spam classifier
Before n-grams over code, build the simplest probabilistic classifier on text: Naive Bayes vs Random Forest on a spam corpus. Establishes the MLE / smoothing / evaluation muscle you'll re-use on code tokens in this module.