mastropaolo.dev playbook Back to playbook

Module 02 · Language Models

Probabilistic Source Code Modeling

How do we model the likelihood of code tokens in a sequence? n-gram language models, probability estimation, perplexity, smoothing, and why code is surprisingly predictable.

Reading time~45 min read InstructorDr. Mastropaolo CohortSpring 2026

Probability Refresher

A quick recap of the probability concepts you will need throughout this module.

The sample space (S) is the set of all possible outcomes. For example, rolling a die gives S = {1, 2, 3, 4, 5, 6}. An event A = “roll an even number” = {2, 4, 6}, so P(A) = 3/6 = 0.5.

Conditional Probability

P(A | B) = P(A ∩ B) / P(B)

Example: A bag has 3 red and 2 blue marbles. You draw one and it is red (event B). What is the probability the next draw is also red (event A)? Since one red was removed, P(A|B) = 2/4 = 0.5.

The Chain Rule

To estimate the probability of an entire sentence, we use the joint probability P(A, B) = P(A) · P(B|A). For a sentence s = (w₁, w₂, …, wₙ), the chain rule expands this to P(w₁, …, wₙ) = ∏ⁱ P(wⁱ | w₁, …, wⁱ₋₁).

Walkthrough: “its water is so transparent”

Applying the chain rule step by step gives a product of P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so). The problem: finding exact long sequences in any corpus is extremely unlikely — even huge corpora will never contain most long word sequences.

Maximum Likelihood Estimation

When we estimate probabilities by counting, we are using Maximum Likelihood Estimation (MLE) — the simplest and most intuitive way to learn a language model. Count occurrences, divide by total observations, and that is your probability estimate.

P_MLE(w | context) = Count(context, w) / Count(context)

Worked Example

In a Java corpus, suppose Count(“(”, “for”) = 100 and Count(“(”) = 500. Then PMLE(“for” | “(”) = 100 / 500 = 0.20.

The Data Sparsity Problem

To compute exact sequence probabilities, we would need enough data to observe every possible word combination. But the number of combinations is astronomical. With just 1,000 unique words, the number of possible 5-word combinations is C(1000, 5) = 8.25 × 10¹². No corpus is large enough to observe all of these.

The solution: instead of conditioning on the entire history, approximate by conditioning on only the last few words. This is the Markov assumption, and it gives us n-gram models.

N-gram Models & the Markov Assumption

The key idea: approximate the full conditional probability by looking at only the last N−1 tokens.

ModelApproximation
UnigramP(w₁…wₙ) ≈ Π P(wⁱ) — each word independent
BigramP(wⁱ | w₁…wⁱ₋₁) ≈ P(wⁱ | wⁱ₋₁) — previous word only
TrigramP(wⁱ | w₁…wⁱ₋₁) ≈ P(wⁱ | wⁱ₋₂, wⁱ₋₁)
N-gramCondition on the last N−1 tokens. Higher N = more context but exponentially more data needed.

Start & End Tokens

Language models need to know where sequences begin and end. Special boundary tokens <s> and </s> make this possible. Without <s>, the model cannot learn which tokens start sequences. Without </s>, the model cannot learn when to stop generating.

Zipf's Law & the Naturalness Hypothesis

Word frequencies in natural language (and code) follow a power-law distribution known as Zipf's Law: a small number of words occur very frequently, while a large number occur rarely. About 20% of words account for 80% of all occurrences. In code, tokens like ;, =, (, ), return, if dominate, while most custom identifiers fall in the long tail.

The Naturalness Hypothesis

Hindle et al. (ICSE 2012) showed that despite the theoretical expressiveness of programming languages, real programs are mostly simple and rather repetitive, with predictable statistical properties that can be captured in statistical language models — in many cases more so than English prose.

~7.5
English cross-entropy (bits)
~4.4
Java cross-entropy (bits)
~41%
Lower entropy for code

Training a Bigram Model

Training an n-gram model is simply counting. Split your corpus into training (80%) and test (20%) sets, count all bigram occurrences, then normalize each row to get probabilities.

python
def train(self, methods, vocab):
    self.vocab = vocab
    self.vocab_size = len(vocab)
    for tokens in methods:
        p = pad(tokens, self.n)
        for i in range(self.n - 1, len(p)):
            for k in range(1, self.n + 1):
                ctx = tuple(p[i - k + 1 : i])
                self.counts[ctx][p[i]] += 1
                self.totals[ctx] += 1

In practice, a 3-way split is preferred: training (70%) for learning the counts, a held-out/validation set (10%) for tuning hyperparameters like interpolation weights λ, and a test set (20%) for final evaluation. To prevent data leakage, split by repository — all methods from one repo go into the same pool.

Evaluation: Perplexity & Cross-Entropy

Extrinsic evaluation tests a model on a downstream task and measures accuracy. It is the gold standard but expensive and time-consuming. Intrinsic evaluation uses perplexity — a measure of model quality independent of any application. Lower perplexity = better model.

PP(W) = ( Πⁱ 1/P(wⁱ | wⁱ₋₁) )^(1/N)

python
def perplexity(self, methods):
    log_prob, n_tok = 0.0, 0
    for tokens in methods:
        p = pad(tokens, self.n)
        for i in range(self.n - 1, len(p)):
            ctx = tuple(p[i - self.n + 1 : i])
            log_prob += math.log2(self.prob(p[i], ctx))
            n_tok += 1
    return 2.0 ** (-(log_prob / n_tok)) if n_tok else float("inf")

Smoothing: Handling Zero Counts

If count(wⁱ₋₁, wⁱ) = 0, then P(wⁱ|wⁱ₋₁) = 0. A single zero probability makes the entire sequence probability zero and perplexity infinite.

Laplace (Add-1) Smoothing

P(wⁱ | wⁱ₋₁) = (count(wⁱ₋₁, wⁱ) + 1) / (count(wⁱ₋₁) + V)

where V = vocabulary size. Simple but crude — steals too much mass from observed events.

MethodP(newList | return)P(null | return)Key trade-off
MLE (no smoothing)0/200 = 0.000080/200 = 0.4000Unseen events get zero — breaks perplexity
Laplace (add-1)(0+1)/(200+1000) = 0.000833(80+1)/(200+1000) = 0.0675Steals too much mass
Add-k (k=0.01)(0+0.01)/(200+10) = 0.0000476(80+0.01)/(200+10) = 0.3810Better balance
Kneser-NeyContinuation prob: ~0.0003Count minus discount d: ~0.3950Best overall — uses context diversity

Interpolation & Backoff

Instead of relying on a single n-gram order, blend multiple models together for more robust probability estimates. Linear interpolation blends P_uni, P_bi, and P_tri with weights λ₁ + λ₂ + λ₃ = 1. Stupid backoff tries the n-gram first; if count is zero, backs off to (n−1)-gram with penalty λ = 0.4.

python
def prob_interpolation(self, token, ctx):
    lam = 1.0 / self.n
    p = 0.0
    for k in range(1, self.n + 1):
        sub_ctx = ctx[-(k-1):] if k > 1 else ()
        c = self.counts.get(sub_ctx, {}).get(token, 0)
        t = self.totals.get(sub_ctx, 0)
        p += lam * (c + self.alpha) / (t + self.alpha * self.vocab_size)
    return p

Temperature & Sampling

Always picking the most likely next token (greedy decoding) produces repetitive, boring output. Sampling introduces variety by choosing randomly according to the probability distribution. Temperature (T) controls the sharpness of the distribution.

  • T = 0 (greedy) — always pick the highest-probability token
  • T = 1 — sample from the actual learned distribution
  • T > 1 — flatten the distribution (more random/creative)
  • T < 1 — sharpen the distribution (more focused/deterministic)
TokenRaw logitT = 0.3 (focused)T = 1.0 (neutral)T = 2.0 (creative)
ArrayList5.00.8760.4370.288
HashMap3.50.1010.2630.241
String2.00.0120.1420.197
Object1.50.0060.1050.175
int1.00.0030.0530.099

Out-of-Vocabulary & the <UNK> Token

New code uses identifiers not in training data: myCustomHandler, calcNetRevenue. These have zero probability. Solution: replace rare tokens (count < threshold) with a special <UNK> token before training. The model learns “rare token behavior” — what typically follows or precedes an uncommon identifier. Modern solution: this is why modern LLMs use BPE (Byte Pair Encoding) instead of word-level tokenization. BPE breaks words into subword units, so even unseen identifiers can be represented as sequences of known subwords.

From N-grams to Neural Language Models

N-gram models have two fundamental limitations that neural language models overcome: parameter explosion (the number of parameters grows exponentially with the n-gram order) and no generalization (n-gram models cannot transfer knowledge between similar contexts).

N-gram LMs
  • Count-based statistics
  • Fixed context window
  • No semantic understanding
  • Cannot generalize
  • Fast, interpretable
Neural LMs
  • Learned representations
  • Long-range context
  • Semantic embeddings
  • Generalizes to new data
  • GPU-intensive, less interpretable
N-grams
Count-based
FFNNs
Bengio 2003 — embeddings + dense network
RNNs
Hidden state accumulates context
Transformers
Self-attention, parallel
LLMs
Scale to billions of parameters

Course materials

Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.

Pre-lab · ML warmup — spam classifier

Before n-grams over code, build the simplest probabilistic classifier on text: Naive Bayes vs Random Forest on a spam corpus. Establishes the MLE / smoothing / evaluation muscle you'll re-use on code tokens in this module.

ipynb Spam_Classifiers_NBandRF.ipynb sync pending