Probabilistic Source Code Modeling

Probability Refresher

A quick recap of the probability concepts you will need throughout this module.

The sample space (S) is the set of all possible outcomes. For example, rolling a die gives S = {1, 2, 3, 4, 5, 6}. An event A = “roll an even number” = {2, 4, 6}, so P(A) = 3/6 = 0.5.

Conditional Probability

P(A | B) = P(A ∩ B) / P(B)

Example: A bag has 3 red and 2 blue marbles. You draw one and it is red (event B). What is the probability the next draw is also red (event A)? Since one red was removed, P(A|B) = 2/4 = 0.5.

The Chain Rule

To estimate the probability of an entire sentence, we use the joint probability P(A, B) = P(A) · P(B|A). For a sentence s = (w₁, w₂, …, wₙ), the chain rule expands this to P(w₁, …, wₙ) = ∏ⁱ P(wⁱ | w₁, …, wⁱ₋₁).

Walkthrough: “its water is so transparent”

Applying the chain rule step by step gives a product of P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so). The problem: finding exact long sequences in any corpus is extremely unlikely — even huge corpora will never contain most long word sequences.

Maximum Likelihood Estimation

When we estimate probabilities by counting, we are using Maximum Likelihood Estimation (MLE) — the simplest and most intuitive way to learn a language model. Count occurrences, divide by total observations, and that is your probability estimate.

P_MLE(w | context) = Count(context, w) / Count(context)

Worked Example

In a Java corpus, suppose Count(“(”, “for”) = 100 and Count(“(”) = 500. Then P_MLE(“for” | “(”) = 100 / 500 = 0.20.

The Data Sparsity Problem

To compute exact sequence probabilities, we would need enough data to observe every possible word combination. But the number of combinations is astronomical. With just 1,000 unique words, the number of possible 5-word combinations is C(1000, 5) = 8.25 × 10¹². No corpus is large enough to observe all of these.

The solution: instead of conditioning on the entire history, approximate by conditioning on only the last few words. This is the Markov assumption, and it gives us n-gram models.

N-gram Models & the Markov Assumption

The key idea: approximate the full conditional probability by looking at only the last N−1 tokens.

Model	Approximation
Unigram	P(w₁…wₙ) ≈ Π P(wⁱ) — each word independent
Bigram	P(wⁱ \| w₁…wⁱ₋₁) ≈ P(wⁱ \| wⁱ₋₁) — previous word only
Trigram	P(wⁱ \| w₁…wⁱ₋₁) ≈ P(wⁱ \| wⁱ₋₂, wⁱ₋₁)
N-gram	Condition on the last N−1 tokens. Higher N = more context but exponentially more data needed.

Start & End Tokens

Language models need to know where sequences begin and end. Special boundary tokens <s> and </s> make this possible. Without <s>, the model cannot learn which tokens start sequences. Without </s>, the model cannot learn when to stop generating.

Zipf's Law & the Naturalness Hypothesis

Word frequencies in natural language (and code) follow a power-law distribution known as Zipf's Law: a small number of words occur very frequently, while a large number occur rarely. About 20% of words account for 80% of all occurrences. In code, tokens like ;, =, (, ), return, if dominate, while most custom identifiers fall in the long tail.

The Naturalness Hypothesis

Hindle et al. (ICSE 2012) showed that despite the theoretical expressiveness of programming languages, real programs are mostly simple and rather repetitive, with predictable statistical properties that can be captured in statistical language models — in many cases more so than English prose.

~7.5

English cross-entropy (bits)

~4.4

Java cross-entropy (bits)

~41%

Lower entropy for code

Training a Bigram Model

Training an n-gram model is simply counting. Split your corpus into training (80%) and test (20%) sets, count all bigram occurrences, then normalize each row to get probabilities.

python

def train(self, methods, vocab):
    self.vocab = vocab
    self.vocab_size = len(vocab)
    for tokens in methods:
        p = pad(tokens, self.n)
        for i in range(self.n - 1, len(p)):
            for k in range(1, self.n + 1):
                ctx = tuple(p[i - k + 1 : i])
                self.counts[ctx][p[i]] += 1
                self.totals[ctx] += 1

In practice, a 3-way split is preferred: training (70%) for learning the counts, a held-out/validation set (10%) for tuning hyperparameters like interpolation weights λ, and a test set (20%) for final evaluation. To prevent data leakage, split by repository — all methods from one repo go into the same pool.

Evaluation: Perplexity & Cross-Entropy

Extrinsic evaluation tests a model on a downstream task and measures accuracy. It is the gold standard but expensive and time-consuming. Intrinsic evaluation uses perplexity — a measure of model quality independent of any application. Lower perplexity = better model.

PP(W) = ( Πⁱ 1/P(wⁱ | wⁱ₋₁) )^(1/N)

python

def perplexity(self, methods):
    log_prob, n_tok = 0.0, 0
    for tokens in methods:
        p = pad(tokens, self.n)
        for i in range(self.n - 1, len(p)):
            ctx = tuple(p[i - self.n + 1 : i])
            log_prob += math.log2(self.prob(p[i], ctx))
            n_tok += 1
    return 2.0 ** (-(log_prob / n_tok)) if n_tok else float("inf")

Smoothing: Handling Zero Counts

If count(wⁱ₋₁, wⁱ) = 0, then P(wⁱ|wⁱ₋₁) = 0. A single zero probability makes the entire sequence probability zero and perplexity infinite.

Laplace (Add-1) Smoothing

P(wⁱ | wⁱ₋₁) = (count(wⁱ₋₁, wⁱ) + 1) / (count(wⁱ₋₁) + V)

where V = vocabulary size. Simple but crude — steals too much mass from observed events.

Method	P(newList \| return)	P(null \| return)	Key trade-off
MLE (no smoothing)	0/200 = 0.0000	80/200 = 0.4000	Unseen events get zero — breaks perplexity
Laplace (add-1)	(0+1)/(200+1000) = 0.000833	(80+1)/(200+1000) = 0.0675	Steals too much mass
Add-k (k=0.01)	(0+0.01)/(200+10) = 0.0000476	(80+0.01)/(200+10) = 0.3810	Better balance
Kneser-Ney	Continuation prob: ~0.0003	Count minus discount d: ~0.3950	Best overall — uses context diversity

Interpolation & Backoff

Instead of relying on a single n-gram order, blend multiple models together for more robust probability estimates. Linear interpolation blends P_uni, P_bi, and P_tri with weights λ₁ + λ₂ + λ₃ = 1. Stupid backoff tries the n-gram first; if count is zero, backs off to (n−1)-gram with penalty λ = 0.4.

python

def prob_interpolation(self, token, ctx):
    lam = 1.0 / self.n
    p = 0.0
    for k in range(1, self.n + 1):
        sub_ctx = ctx[-(k-1):] if k > 1 else ()
        c = self.counts.get(sub_ctx, {}).get(token, 0)
        t = self.totals.get(sub_ctx, 0)
        p += lam * (c + self.alpha) / (t + self.alpha * self.vocab_size)
    return p

Temperature & Sampling

Always picking the most likely next token (greedy decoding) produces repetitive, boring output. Sampling introduces variety by choosing randomly according to the probability distribution. Temperature (T) controls the sharpness of the distribution.

T = 0 (greedy) — always pick the highest-probability token
T = 1 — sample from the actual learned distribution
T > 1 — flatten the distribution (more random/creative)
T < 1 — sharpen the distribution (more focused/deterministic)

Token	Raw logit	T = 0.3 (focused)	T = 1.0 (neutral)	T = 2.0 (creative)
`ArrayList`	5.0	0.876	0.437	0.288
`HashMap`	3.5	0.101	0.263	0.241
`String`	2.0	0.012	0.142	0.197
`Object`	1.5	0.006	0.105	0.175
`int`	1.0	0.003	0.053	0.099

Out-of-Vocabulary & the <UNK> Token

New code uses identifiers not in training data: myCustomHandler, calcNetRevenue. These have zero probability. Solution: replace rare tokens (count < threshold) with a special <UNK> token before training. The model learns “rare token behavior” — what typically follows or precedes an uncommon identifier. Modern solution: this is why modern LLMs use BPE (Byte Pair Encoding) instead of word-level tokenization. BPE breaks words into subword units, so even unseen identifiers can be represented as sequences of known subwords.

From N-grams to Neural Language Models

N-gram models have two fundamental limitations that neural language models overcome: parameter explosion (the number of parameters grows exponentially with the n-gram order) and no generalization (n-gram models cannot transfer knowledge between similar contexts).

N-gram LMs

Count-based statistics
Fixed context window
No semantic understanding
Cannot generalize
Fast, interpretable

Neural LMs

Learned representations
Long-range context
Semantic embeddings
Generalizes to new data
GPU-intensive, less interpretable

N-grams

Count-based

FFNNs

Bengio 2003 — embeddings + dense network

RNNs

Hidden state accumulates context

Transformers

Self-attention, parallel

LLMs

Scale to billions of parameters

Course materials

Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.

pdf Slides Lecture slides — Probabilistic Language Modeling 10795 KB · open → pdf Slides Lecture slides — Vocabulary & Embeddings 10064 KB · open →

Pre-lab · ML warmup — spam classifier

Before n-grams over code, build the simplest probabilistic classifier on text: Naive Bayes vs Random Forest on a spam corpus. Establishes the MLE / smoothing / evaluation muscle you'll re-use on code tokens in this module.

ipynb Spam_Classifiers_NBandRF.ipynb sync pending

Next module Evaluating Rigorously

→