
Module 04 · Deep Learning

Deep Learning Foundations

From non-generative classification tasks to pre-trained Transformers: the deep learning architectures and training paradigms that power modern AI-assisted software engineering.

Reading time: ~60 min · Instructor: Dr. Mastropaolo · Cohort: Spring 2026

Introduction & Module Overview

This module covers the core deep learning concepts and architectures that power modern AI-driven software engineering tools. We progress through two major pathways: non-generative tasks (classification without code generation) and generative tasks (producing new code sequences). By the end, you will understand why Transformers replaced RNNs, how pre-training and fine-tuning work, and which architectures suit which SE tasks.

Part A: Non-Generative Tasks

Clone detection, vulnerability prediction, code smell detection — classification and prediction tasks that analyze existing code without producing new source.

Part B: Embeddings & Seq2Seq

Word embeddings, BPE tokenization, LSTM architecture, encoder-decoder models with attention, and beam search for code generation.

Part C: Transformers & Pre-training

Self-attention, multi-head attention, positional encoding, pre-training objectives (MLM/CLM), fine-tuning, and the modern code model ecosystem (CodeBERT, CodeT5, StarCoder).

Neural Network Fundamentals

Every deep learning model is built from the same basic building blocks: artificial neurons organized into layers. A single neuron computes a weighted sum of inputs, adds a bias, and applies a non-linear activation function: y = σ(Σ_i w_i x_i + b). The weights (w_i) and bias (b) are learnable parameters; the activation (σ) introduces non-linearity that enables networks to learn complex patterns.
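The neuron formula can be written out in a few lines of plain Python. This is a minimal sketch, not a library implementation: the input values, weights, and the choice of sigmoid as the activation are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute activation(sum_i w_i * x_i + b) for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical inputs, e.g. two scaled code metrics.
x = [0.5, -1.0]
w = [2.0, 1.0]
b = 0.0

# Weighted sum: 2.0*0.5 + 1.0*(-1.0) + 0.0 = 0.0, and sigmoid(0.0) = 0.5.
y = neuron(x, w, b)
```

A layer is many such neurons sharing the same inputs, each with its own weights and bias.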

Neuron
A single computational unit that computes a weighted sum of inputs, adds bias, and applies an activation function.
Input Layer
Receives raw features (e.g., code metrics, token counts).
Hidden Layers
Learn intermediate representations. More layers enable more abstract features.
Output Layer
Produces the final prediction (a class label, probability, or continuous score).

ReLU

max(0, x) — fast, prevents vanishing gradients for positive inputs, default for hidden layers.

Sigmoid

1/(1+e^-x) — output between 0 and 1, used in binary classification and LSTM gates.

Tanh

(e^x - e^-x)/(e^x + e^-x) — zero-centered, output between -1 and 1, used in LSTM cells.

Softmax

Normalizes outputs to a probability distribution over multiple classes. Used in classification output layers.
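The four activations above can be sketched directly from their formulas. The softmax below subtracts the maximum score before exponentiating, a common numerical-stability trick that is an addition of this sketch and does not change the result.

```python
import math

def relu(x):
    """max(0, x): passes positive values through, zeroes out negatives."""
    return max(0.0, x)

def sigmoid(x):
    """1 / (1 + e^-x): output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """(e^x - e^-x) / (e^x + e^-x): zero-centered, output in (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def softmax(scores):
    """Turn a list of raw scores into a probability distribution."""
    m = max(scores)                           # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
```

The resulting `probs` sum to 1 and preserve the ordering of the raw scores, which is why softmax is the usual choice for a multi-class output layer.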

Backpropagation: How Networks Learn

Neural networks learn through a four-step iterative process that repeats thousands of times:

Forward Pass
Input flows through the network layer by layer. Each neuron applies its weights, bias, and activation function.
Loss Computation
Compare the prediction to ground truth using a loss function (cross-entropy for classification, MSE for regression).
Backward Pass
Use the chain rule from calculus to compute gradients — how much each weight contributed to the error.
Weight Update
Adjust weights in the direction that reduces loss: w = w - lr * gradient. The learning rate (lr) controls step size.
python
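# One PyTorch training step (model, criterion, optimizer, inputs, labels are defined elsewhere)
optimizer.zero_grad()                 # clear gradients left over from the previous step
prediction = model(inputs)            # forward pass
loss = criterion(prediction, labels)  # loss computation
loss.backward()                       # backward pass: backpropagation computes gradients
optimizer.step()                      # weight update: one gradient descent step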
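To see what the framework automates, the same four steps can be traced by hand for a one-weight model. This is a toy sketch under stated assumptions: the model is y = w * x, the loss is MSE, and the data, learning rate, and step count are made up for illustration.

```python
# Toy data generated from y = 3 * x; training should recover w close to 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = 0.0      # initial weight
lr = 0.05    # learning rate (step size)

for step in range(200):
    # 1. Forward pass: predictions of the model y = w * x.
    preds = [w * x for x, _ in data]
    # 2. Loss computation: mean squared error against ground truth.
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)
    # 3. Backward pass: dLoss/dw = mean(2 * (w*x - y) * x), by the chain rule.
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, data)) / len(data)
    # 4. Weight update: move against the gradient.
    w = w - lr * grad
```

With a much larger learning rate the update would overshoot and diverge, which is why `lr` is described as controlling step size.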

Non-Generative Tasks: Classification & Prediction

Non-generative tasks take code as input and output a label or score — they analyze existing code without producing new source. Examples include clone detection ("is this code a copy of another?"), vulnerability prediction ("does this method contain a security flaw?"), and code smell detection ("is this class poorly designed?"). All non-generative tasks follow the same pipeline: source code → feature extraction → classifier/DNN → label/score.

Clone Detection (Types I-IV)
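Type I: Identical except for whitespace and comments. Type II: Renamed identifiers and changed literals. Type III: Gapped clone, with statements added, removed, or modified. Type IV: Semantic clone, with the same behavior but different syntax. Type IV clones share little surface text, so token- and tree-matching tools generally miss them; learned (ML/DL) models are the main practical way to detect them.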
Vulnerability & Code Smell
Vulnerability: Classify code as vulnerable/safe using metrics (LOC, complexity, dangerous APIs), tokens, and change history. Challenge: class imbalance. Code Smell: Detect God Class, Long Method, Feature Envy from software metrics.
Task | Input | Output | Features | Challenge
Clone Detection | Code pair | Clone type (I-IV) | Tokens, AST, embeddings | Type IV semantic clones
Vulnerability Prediction | Code component | Risk score (0–1) | Metrics, API calls, history | Class imbalance (few vulns)
Code Smell Detection | Class / method | Smell type | LOC, complexity, coupling | Subjective thresholds
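The two simplest clone types need no learning at all, which makes them a useful baseline for the harder ones. The sketch below is an illustration only: it uses a naive regex tokenizer for Java-like code, a tiny hypothetical keyword set, and it does not handle block comments or comment markers inside string literals.

```python
import re

KEYWORDS = {"int", "return", "if", "else", "for", "while"}  # tiny illustrative subset

def normalize_type1(code):
    """Drop // line comments and collapse all whitespace."""
    code = re.sub(r"//[^\n]*", "", code)
    return " ".join(code.split())

def tokens(code):
    """Split normalized code into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", normalize_type1(code))

def normalize_type2(code):
    """Additionally replace identifiers with ID and numeric literals with LIT."""
    out = []
    for tok in tokens(code):
        if tok in KEYWORDS:
            out.append(tok)
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("ID")
        elif tok.isdigit():
            out.append("LIT")
        else:
            out.append(tok)
    return out

a = "int add(int x, int y) { return x + y; }  // sum"
b = "int add(int x, int y) {\n    return x + y;\n}"
c = "int plus(int a, int b) { return a + b; }"

is_type1 = tokens(a) == tokens(b)                    # differs only in layout and comments
is_type2 = normalize_type2(a) == normalize_type2(c)  # differs only in names
```

Type III and Type IV clones defeat this kind of exact matching after normalization, which is where the feature-extraction and classifier stages of the pipeline come in.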

Embeddings: From Symbols to Vectors

Before feeding code tokens to a neural network, we must represent them as numerical vectors. A naive approach, one-hot encoding, creates a sparse vector with a single 1 and 0s everywhere else. The problem: with a vocabulary of 50K+ tokens, every vector has 50K+ dimensions, and any two distinct tokens look equally unrelated. Dense embeddings (e.g., Word2Vec) instead map each token to a low-dimensional learned vector (128–768 dimensions). Tokens appearing in similar contexts cluster near each other in vector space.
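A small sketch makes the contrast concrete. The five-token vocabulary, the 3-dimensional embedding size, and the random (untrained) table are all assumptions for illustration; in a real model the table is learned.

```python
import random

vocab = ["def", "return", "if", "else", "for"]   # toy vocabulary (hypothetical)
index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    """Sparse vector: as long as the vocabulary, with a single 1."""
    vec = [0.0] * len(vocab)
    vec[index[token]] = 1.0
    return vec

EMBED_DIM = 3
rng = random.Random(0)
# Untrained embedding table: one dense row per vocabulary entry.
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab]

def embed(token):
    """Dense lookup: just select the token's row of the table."""
    return embedding_table[index[token]]

def one_hot_times_table(token):
    """Same result computed as (one-hot vector) x (embedding table)."""
    hot = one_hot(token)
    return [sum(hot[i] * embedding_table[i][d] for i in range(len(vocab)))
            for d in range(EMBED_DIM)]
```

The lookup and the matrix product give the same vector, so an embedding layer can be read as a one-hot encoding followed by a learned linear map, implemented as a cheap row selection.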

Cosine Similarity

Measures how close two embeddings are: cos(θ) = (A · B) / (||A|| × ||B||). Range: -1 (opposite direction) to +1 (same direction). Snippets with different syntax but the same logic can still score high (e.g., 0.98), which is the signal embedding-based detectors use to flag Type IV clones.
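The formula translates directly into code. The three vectors below are made-up 3-dimensional embeddings chosen only to show a similar pair and a dissimilar pair; real code embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of three snippets.
emb_loop_sum = [0.9, 0.1, 0.3]     # sums a list with a for-loop
emb_stream_sum = [0.8, 0.2, 0.3]   # sums a list with a stream/reduce call
emb_file_io = [-0.7, 0.6, -0.1]    # reads a file

close = cosine_similarity(emb_loop_sum, emb_stream_sum)  # near +1
far = cosine_similarity(emb_loop_sum, emb_file_io)       # negative
```

A clone detector built this way typically thresholds the score, and the threshold is a tuning choice rather than a fixed constant.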

Byte Pair Encoding (BPE)

Tokenize compound identifiers (e.g., getEmbeddedIPv4) into subword units. Start with characters, iteratively merge the most frequent adjacent pairs. Essential for code's open vocabulary.
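The merge loop can be sketched in a few functions. This is a minimal illustration of BPE training on a toy corpus of identifiers; the corpus and the number of merges are assumptions, and production tokenizers add details such as end-of-word markers and byte-level fallback.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Start from characters and apply `num_merges` greedy merges."""
    words = {tuple(word): freq for word, freq in Counter(corpus).items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

corpus = ["get", "get", "get", "set", "set", "getId"]   # toy identifier corpus
merges, vocab = learn_bpe(corpus, num_merges=2)
# First merge: ('e', 't') (6 occurrences); second: ('g', 'et') (4 occurrences).
```

After two merges, "get" is a single symbol while the rarer "getId" is split into "get", "I", "d", so an unseen identifier can still be represented from known pieces.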

LSTM & GRU: Gated Memory for Sequences

Vanilla RNNs process sequences token-by-token, passing a hidden state forward. But gradients during backpropagation through time can shrink exponentially with sequence length: after around 50 tokens, the gradient reaching the earliest tokens is often nearly zero. This vanishing gradient problem prevents learning long-range dependencies. LSTM (Long Short-Term Memory) mitigates this with gated memory cells. Three gates control information flow:

Forget Gate (f_t)
f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f): decides what to erase from the cell state.
Input Gate (i_t)
i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i): decides what new info to store.
Output Gate (o_t)
o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o): decides how much of the cell state to expose as the hidden state, h_t = o_t ⊙ tanh(c_t).
Cell State Update
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t, where g_t = tanh(W_g · [h_{t-1}, x_t] + b_g) is the candidate update. The forget gate erases old info; the input gate writes new info.
LSTM
3 Gates: Input, Forget, Output. 2 States: Cell + Hidden. More parameters, slower training, better on long sequences.
GRU
2 Gates: Update, Reset. 1 State: Hidden only. Faster training, fewer parameters, good on small datasets.
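One LSTM time step can be written out in plain Python to show how the gates combine. This sketch follows the standard formulation with a bias per gate and a tanh candidate vector g_t; the `params` layout, the helper names, and the toy values are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. params maps gate name -> (W, b) over [h_{t-1}, x_t]."""
    z = h_prev + x_t                                     # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]    # input gate
    o = [sigmoid(a) for a in affine(*params["o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(*params["g"], z)]  # candidate values
    # Cell state: keep part of the old memory, write part of the candidate.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Hidden state: gated, squashed view of the cell state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Toy setup: hidden size 1, input size 1, zero weights, hand-picked biases.
zero_w = [[0.0, 0.0]]
params = {
    "f": (zero_w, [10.0]),    # forget gate ~1: keep old memory
    "i": (zero_w, [-10.0]),   # input gate ~0: write almost nothing
    "o": (zero_w, [0.0]),     # output gate = 0.5
    "g": (zero_w, [0.0]),
}
h, c = lstm_step([1.0], [0.0], [2.0], params)
```

With the forget gate near 1 and the input gate near 0, the cell state passes through almost unchanged, which is the mechanism that lets gradients survive across many time steps.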

Seq2Seq & Attention: Breaking the Bottleneck

Sequence-to-sequence (Seq2Seq) models map variable-length input sequences to variable-length output sequences. Classic examples are code translation (Java → Python) and bug repair. An encoder LSTM reads the input and compresses it into a fixed-size context vector. A decoder LSTM then generates output tokens one at a time. During training, the decoder receives ground-truth previous tokens (teacher forcing). The single fixed-size vector is a bottleneck for long inputs; attention relieves it by letting the decoder consult every encoder state at each step:

Encoder Forward Pass
Input sequence processed left-to-right, producing one encoder state per token. The final hidden state h_last summarizes the entire input.
Attention Scoring
At each decoder step, compute dot-products: score_i = decoder_state · encoder_state_i. Softmax converts to weights.
Context Aggregation
Weighted sum of encoder states: context = Σ weight_i * encoder_state_i.
Token Generation
Concatenate context with decoder state. Feed through output layers to predict the next token.
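The scoring, aggregation, and concatenation steps above can be sketched for a single decoder step. The 2-dimensional states are toy values chosen for illustration, and the final output layers are omitted.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Dot-product attention for one decoder step."""
    # Attention scoring: score_i = decoder_state . encoder_state_i
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    # Context aggregation: weighted sum of encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Hypothetical encoder states for a 3-token input, and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
# Token generation would feed this concatenation through the output layers.
combined = context + decoder_state
```

Here the first and third encoder states align with the decoder state and receive most of the weight, so the context vector is dominated by those two tokens.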

Transformers: Self-Attention Replaces Recurrence

In 2017, the Transformer architecture (Vaswani et al., "Attention Is All You Need") changed the direction of NLP and, soon after, AI for code. Instead of processing sequences token-by-token (as RNNs do), self-attention lets every token attend to every other token simultaneously, which enables massive parallelism on GPUs. In self-attention, each token is projected into three vectors: Query (Q), Key (K), and Value (V). Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the key dimension and dividing by √d_k keeps the dot products from saturating the softmax.
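The attention formula can be sketched with plain lists. To keep the example short, the learned Q, K, and V projections are skipped and the toy token vectors are used directly as queries and keys; that shortcut is an assumption of this sketch, not how a trained model works.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One score per key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Two toy tokens; queries and keys are the token vectors themselves.
X = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

out = scaled_dot_product_attention(X, X, V)
uniform = scaled_dot_product_attention([[0.0, 0.0]], X, V)  # all scores equal
```

Each token attends most to itself here, so `out[0]` sits closer to `V[0]` and `out[1]` closer to `V[1]`, while a zero query spreads its weight evenly and returns the mean of the value vectors.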

Multi-Head Self-Attention
Each token queries all other tokens in parallel across h heads. Captures bracket matching, variable scoping, data flow.
Residual + Layer Norm
Add attention output to original input (residual connection). Normalize activations. Enables gradient flow through deep stacks.
Feed-Forward Network
Two-layer MLP applied independently to each position. Often described as where the model stores factual knowledge: API signatures, language syntax, common patterns.
Residual + Layer Norm Again
Second residual around FFN. Output ready for next Transformer block.
RNN/LSTM Processing
Sequential: Tokens are processed left-to-right. Token dependencies require O(n) hops: information from the token at position 1 needs 49 recurrent steps to reach the token at position 50. Limited parallelism.
Transformer Processing
Parallel: All tokens processed simultaneously. Any two tokens are connected by an O(1) direct path via attention, at the cost of attention compute that grows as O(n²) in sequence length. Scales to billions of parameters.
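The residual, layer-norm, and feed-forward steps of a Transformer block can be sketched for one token position. This is a simplified illustration: layer norm is shown without its learnable scale and shift, the attention output is a made-up vector, and the feed-forward weights are identity matrices chosen so the arithmetic is easy to follow.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer MLP with ReLU, applied to a single position."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

DIM = 4
identity = [[1.0 if r == c else 0.0 for c in range(DIM)] for r in range(DIM)]
zeros = [0.0] * DIM

x = [1.0, 2.0, 3.0, 4.0]             # one token's input vector (toy values)
attn_out = [0.5, -0.5, 0.5, -0.5]    # hypothetical self-attention output

h = add_and_norm(x, attn_out)                                     # residual + norm
out = add_and_norm(h, feed_forward(h, identity, zeros, identity, zeros))
```

Because each sublayer adds to its input rather than replacing it, the gradient always has a direct path back through the sum, which is what lets many blocks be stacked.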

Pre-training & Fine-tuning: Transfer Learning

Modern code models follow a two-stage paradigm: pre-training on massive unlabeled code, then fine-tuning on smaller task-specific datasets.

Pre-training objective (encoder-only)
Masked Language Modeling (MLM): randomly mask 15% of tokens, predict them from bidirectional context. Used by BERT, CodeBERT.
Pre-training objective (decoder-only)
Causal Language Modeling (CLM): predict the next token given all previous tokens. Used by GPT, Codex.
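Both objectives turn raw, unlabeled code into (input, target) pairs, which is why no manual labeling is needed. The sketch below builds one training example of each kind from a toy token sequence. It is a simplification: every selected token is replaced by a `[MASK]` placeholder, whereas BERT-style recipes also sometimes substitute a random token or keep the original.

```python
import random

MASK = "[MASK]"

def make_mlm_example(tokens, mask_prob=0.15, rng=None):
    """Mask each token with probability mask_prob; labels hold the hidden tokens."""
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # the model must predict this token
        else:
            inputs.append(tok)
            labels.append(None)     # position not scored in the loss
    return inputs, labels

def make_clm_example(tokens):
    """Next-token prediction: the target is the input shifted left by one."""
    return tokens[:-1], tokens[1:]

tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]

mlm_inputs, mlm_labels = make_mlm_example(tokens, rng=random.Random(0))
clm_inputs, clm_targets = make_clm_example(tokens)
```

The MLM model sees context on both sides of each masked position, while the CLM model at position t may only use tokens up to t, a constraint enforced in practice by a causal attention mask.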

Pre-training Stage

Goal: General code understanding. Data: 900GB+. Cost: Millions in GPU compute. Duration: Weeks on 64+ GPUs. Frequency: Once, shared publicly.

Fine-tuning Stage

Goal: Task specialization. Data: 100s–1000s examples. Cost: Cheap — hours on 1 GPU. Frequency: For each new task/domain.

Code Models Ecosystem

Between 2020 and 2024, the code AI ecosystem matured into specialized models for different tasks and architectures.

Model | Architecture | Training data | Key capability
CodeBERT | Encoder-only | CodeSearchNet (6 langs) | Code search, clone detection, defect prediction
GraphCodeBERT | Encoder-only | CodeSearchNet + data flow | Structure-aware code understanding
CodeT5 | Encoder-decoder | CodeSearchNet + C/C# | Code generation, summarization, translation
StarCoder | Decoder-only | The Stack (80+ langs) | Code completion, fill-in-the-middle
Codex / GPT | Decoder-only | GitHub public code | Code generation from NL, Copilot backend

Module Recap & Key Takeaways

Deep learning moved software engineering tooling from hand-crafted rules and features to learned representations. Together with the move from task-specific models to pre-trained and fine-tuned architectures, this enabled a step-change in AI for software engineering.

Non-Generative Tasks

Clone detection, vulnerability prediction, code smell detection. Key insight: semantic clones (Type IV) generally escape token- and tree-matching tools, so learned (ML/DL) models are the main practical way to detect them.

Embeddings & BPE

Dense vectors capture semantic similarity. Byte Pair Encoding handles code's open vocabulary. Foundation for all neural code models.

LSTM & Seq2Seq

Gated recurrent cells solve vanishing gradients. Encoder-decoder for variable-length-to-variable-length tasks. Attention breaks the bottleneck.

Transformers

Self-attention replaces recurrence. Parallel processing on GPUs. 8–16 attention heads learn diverse patterns. Enables scaling.

Pre-training

Train once on massive unlabeled code (MLM/CLM). Fine-tune cheaply on task-specific data. Transfer learning can cut the labeled data a task needs by roughly 100×.

Modern Code Models

CodeBERT, CodeT5, and StarCoder, alongside Codex (the original GitHub Copilot backend), represent the model families behind code assistants, automated code review, vulnerability scanners, and refactoring assistants.

Course materials

Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.

Lab · Deep code completion

An end-to-end seq2seq + attention setup for code completion, with a small hyperparameter tuner. The notebook reproduces every number in the corresponding lecture.

Lab_2_DL_Code_Completion.ipynb