Introduction & Module Overview
This module covers the core deep learning concepts and architectures that power modern AI-driven software engineering tools. We progress through two major pathways: non-generative tasks (classification without code generation) and generative tasks (producing new code sequences). By the end, you will understand why Transformers replaced RNNs, how pre-training and fine-tuning work, and which architectures suit which SE tasks.
Part A: Non-Generative Tasks
Part B: Embeddings & Seq2Seq
Part C: Transformers & Pre-training
Neural Network Fundamentals
Every deep learning model is built from the same basic building blocks: artificial neurons organized into layers. A single neuron computes a weighted sum of inputs, adds a bias, and applies a non-linear activation function: y = σ(Σᵢ wᵢxᵢ + b). The weights (wᵢ) and bias (b) are learnable parameters; the activation (σ) introduces the non-linearity that lets networks learn complex patterns.
- Neuron: A single computational unit that computes a weighted sum of inputs, adds a bias, and applies an activation function.
- Input Layer: Receives raw features (e.g., code metrics, token counts).
- Hidden Layers: Learn intermediate representations. More layers enable more abstract features.
- Output Layer: Produces the final prediction (a class label, probability, or continuous score).
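The neuron formula and the layer roles above can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the feature values, weights, and the helper names `neuron` and `dense_layer` are all made up for the example, and tanh is used as the activation only because it ships with the standard library.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """One neuron: weighted sum of inputs, plus bias, through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation=math.tanh):
    """A layer is several neurons reading the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Input layer: hypothetical, already-scaled code metrics
features = [0.5, -1.0, 2.0]

# Hidden layer with two neurons (weights are arbitrary, not learned)
hidden = dense_layer(features, [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]], [0.0, 0.1])

# Output layer: a single neuron producing one score
output = neuron(hidden, [1.0, -1.0], 0.0)
print(hidden, output)
```

In a trained network the weights and biases would come from gradient descent rather than being written by hand.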
- ReLU: max(0, x). Fast to compute, avoids vanishing gradients for positive inputs, the default for hidden layers.
- Sigmoid: 1/(1+e^-x). Output between 0 and 1, used in binary classification and LSTM gates.
- Tanh: (e^x - e^-x)/(e^x + e^-x). Zero-centered, output between -1 and 1, used in LSTM cells.
- Softmax: e^(x_i) / Σ_j e^(x_j). Turns a vector of scores into a probability distribution, used in multi-class output layers.
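The activation functions named above are short enough to write out directly. The sketch below uses only the standard library; the max-subtraction inside `softmax` is a common numerical-stability trick and is an addition here, not something the text above prescribes.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def softmax(scores):
    # Subtracting the max does not change the result but avoids overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))
print(sigmoid(0.0), tanh(0.0))
print(softmax([1.0, 2.0, 3.0]))
```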
Backpropagation: How Networks Learn
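Neural networks learn through a four-step iterative process that repeats thousands of times:
- Forward pass: compute a prediction from the input.
- Loss: measure how far the prediction is from the label.
- Backpropagation: compute the gradient of the loss with respect to every weight.
- Update: w = w - lr * gradient. The learning rate (lr) controls step size.
# PyTorch training step (core logic; model, criterion, and optimizer are defined elsewhere)
optimizer.zero_grad() # clear gradients left over from the previous step
prediction = model(inputs) # forward pass
loss = criterion(prediction, labels) # measure the error
loss.backward() # backpropagation computes gradients
optimizer.step() # gradient descent updates weights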
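To make the update rule w = w - lr * gradient concrete without a framework, here is a hedged sketch that fits a one-weight linear model by hand. The gradients are derived manually for mean squared error; in PyTorch, `loss.backward()` would compute them automatically. The data, the function name `train`, and the hyperparameters are illustrative assumptions.

```python
def train(xs, ys, lr=0.1, epochs=500):
    """Fit y = w*x + b with plain gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    loss = None
    for _ in range(epochs):
        preds = [w * x + b for x in xs]                          # forward pass
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n  # loss
        # Gradients of the loss with respect to w and b (derived by hand)
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        w -= lr * grad_w                                         # update
        b -= lr * grad_b
    return w, b, loss

# Toy data generated from y = 2x + 1
w, b, loss = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b, loss)
```

A learning rate that is too large makes the updates overshoot and diverge; one that is too small makes convergence slow.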
Non-Generative Tasks: Classification & Prediction
Non-generative tasks take code as input and output a label or score — they analyze existing code without producing new source. Examples include clone detection ("is this code a copy of another?"), vulnerability prediction ("does this method contain a security flaw?"), and code smell detection ("is this class poorly designed?"). All non-generative tasks follow the same pipeline: source code → feature extraction → classifier/DNN → label/score.
| Task | Input | Output | Features | Challenge |
|---|---|---|---|---|
| Clone Detection | Code pair | Clone type (I-IV) | Tokens, AST, embeddings | Type IV semantic clones |
| Vulnerability Prediction | Code component | Risk score (0–1) | Metrics, API calls, history | Class imbalance (few vulns) |
| Code Smell Detection | Class / method | Smell type | LOC, complexity, coupling | Subjective thresholds |
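The pipeline source code → feature extraction → classifier → label/score can be sketched end to end as follows. Everything specific here is an assumption for illustration: the three regex-based features are crude stand-ins for real metrics, and the weights and bias are hand-picked rather than learned from data.

```python
import math
import re

def extract_features(source):
    """Crude, hypothetical features: non-blank LOC, branch keywords, risky calls."""
    loc = len([ln for ln in source.splitlines() if ln.strip()])
    branches = len(re.findall(r"\b(?:if|for|while|case|catch)\b", source))
    risky_calls = len(re.findall(r"\b(?:strcpy|gets|system|eval)\s*\(", source))
    return [loc, branches, risky_calls]

# Hand-picked parameters for the sketch; a real model would learn these.
WEIGHTS = [0.02, 0.3, 1.5]
BIAS = -2.0

def risk_score(source):
    """Logistic score in (0, 1) computed from the extracted features."""
    z = sum(w * f for w, f in zip(WEIGHTS, extract_features(source))) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

def classify(source, threshold=0.5):
    return "risky" if risk_score(source) >= threshold else "ok"

safe = "int add(int a, int b) {\n    return a + b;\n}\n"
unsafe = (
    "void copy(char *dst, char *src) {\n"
    "    if (src) {\n"
    "        strcpy(dst, src);\n"
    "        system(src);\n"
    "    }\n"
    "}\n"
)
print(classify(safe), classify(unsafe))
```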
Embeddings: From Symbols to Vectors
Before feeding code tokens to a neural network, we must represent them as numerical vectors. A naive approach, one-hot encoding, represents each token as a sparse vector with a single 1 and 0s everywhere else. With a vocabulary of 50K+ tokens, every vector has 50K+ dimensions, and any two distinct tokens look equally unrelated. Dense embeddings (e.g., Word2Vec) instead map each token to a low-dimensional vector (128–768 dimensions). Tokens appearing in similar contexts cluster near each other in vector space.
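The difference between one-hot and dense vectors shows up as soon as closeness is measured. The sketch below uses cosine similarity (dot product divided by the product of the vector lengths) as that measure. The four-token vocabulary and the three-dimensional embedding values are invented for the example; real embeddings are learned and far larger.

```python
import math

vocab = ["for", "while", "if", "return"]

def one_hot(token):
    return [1.0 if t == token else 0.0 for t in vocab]

# Hypothetical dense embeddings, hand-written so that loop keywords sit close together
embedding = {
    "for":    [0.9, 0.1, 0.0],
    "while":  [0.8, 0.2, 0.1],
    "if":     [0.3, 0.9, 0.0],
    "return": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot: distinct tokens are always orthogonal, so similarity carries no information
print(cosine(one_hot("for"), one_hot("while")))

# Dense: related tokens score high, unrelated tokens score low
print(cosine(embedding["for"], embedding["while"]))
print(cosine(embedding["for"], embedding["return"]))
```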
Cosine Similarity
cos(θ) = (A · B) / (||A|| × ||B||). Range: -1 (opposite direction) to +1 (same direction). Snippets with different syntax but the same logic can have embeddings with cosine similarity close to 1 (e.g., 0.98), which is how embedding-based detectors find Type IV clones.
Byte Pair Encoding (BPE)
Splits rare identifiers (e.g., getEmbeddedIPv4) into subword units. Start with characters, then iteratively merge the most frequent adjacent pairs. Essential for code's open vocabulary.
LSTM & GRU: Gated Memory for Sequences
Vanilla RNNs process sequences token-by-token, passing a hidden state forward. But during backpropagation through time the gradient is multiplied by a similar factor at every step, so it tends to shrink exponentially with distance: after about 50 tokens it can be close to zero. This vanishing gradient problem prevents learning long-range dependencies. LSTM (Long Short-Term Memory) addresses this with gated memory cells. Three gates control information flow:
- Forget Gate (f_t): sigmoid(W_f * [h_{t-1}, x_t]). Decides what to erase from the cell state.
- Input Gate (i_t): sigmoid(W_i * [h_{t-1}, x_t]). Decides what new info to store.
- Output Gate (o_t): sigmoid(W_o * [h_{t-1}, x_t]). Decides how much of the cell state to expose as the hidden state h_t, which is passed to the next time step and the next layer.
- Cell State Update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t, where g_t = tanh(W_g * [h_{t-1}, x_t]) is the candidate update. The forget gate erases old info; the input gate writes new info.
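The gate equations above can be assembled into a single LSTM time step. This is a sketch under stated assumptions: biases are omitted to match the formulas as written, the candidate g_t = tanh(W_g * [h_{t-1}, x_t]) and the hidden state h_t = o_t ⊙ tanh(c_t) follow the standard LSTM formulation, and `lstm_step` and `matvec` are names chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_o, W_g):
    """One LSTM time step. Each W has one row per hidden unit,
    with len(h_prev) + len(x) columns."""
    concat = h_prev + x                               # [h_{t-1}, x_t]
    f = [sigmoid(z) for z in matvec(W_f, concat)]     # forget gate
    i = [sigmoid(z) for z in matvec(W_i, concat)]     # input gate
    o = [sigmoid(z) for z in matvec(W_o, concat)]     # output gate
    g = [math.tanh(z) for z in matvec(W_g, concat)]   # candidate values
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Hidden size 1, input size 1, all-zero weights: every gate outputs 0.5
zeros = [[0.0, 0.0]]
h, c = lstm_step([1.0], [0.0], [2.0], zeros, zeros, zeros, zeros)
print(h, c)
```

With all-zero weights the forget gate keeps half of the old cell state and the candidate contributes nothing, which makes the arithmetic easy to check by hand.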
Seq2Seq & Attention: Breaking the Bottleneck
Sequence-to-sequence (Seq2Seq) models map variable-length input sequences to variable-length output sequences. Classic examples are code translation (Java → Python) and bug repair. An encoder LSTM reads the input and compresses it into a fixed-size context vector. A decoder LSTM then generates output tokens one at a time. During training, the decoder receives ground-truth previous tokens (teacher forcing). The fixed-size context vector is the bottleneck: attention removes it by letting the decoder look back at all encoder hidden states at every step, weighted by relevance.
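Teacher forcing is easiest to see by comparing what the decoder is fed at each step with and without it. The sketch below replaces the decoder LSTM with a toy lookup table (`TABLE`, an invented stand-in that ignores the context vector entirely) containing one deliberate mistake, so the effect of that mistake on later steps is visible.

```python
def decode(next_token, target, teacher_forcing):
    """Run a decoder for len(target) steps and return its predictions.

    next_token: function mapping the previous token to a predicted token
                (a stand-in for one decoder step).
    """
    prev = "<sos>"
    outputs = []
    for gold in target:
        pred = next_token(prev)
        outputs.append(pred)
        # Teacher forcing feeds the ground-truth token; otherwise feed the prediction.
        prev = gold if teacher_forcing else pred
    return outputs

# Toy "decoder" with one wrong entry: after "def" it predicts "sub" instead of "add"
TABLE = {"<sos>": "def", "def": "sub", "add": "(", "(": "a", "sub": "[", "[": "?"}

def toy_decoder(prev):
    return TABLE.get(prev, "<unk>")

target = ["def", "add", "(", "a"]
forced = decode(toy_decoder, target, teacher_forcing=True)
free = decode(toy_decoder, target, teacher_forcing=False)
print(forced)  # one wrong token, later steps recover
print(free)    # the early mistake cascades through later steps
```

With teacher forcing, a single wrong prediction does not corrupt the inputs to later steps, which keeps training stable; at inference time no ground truth exists, so the decoder runs in the second mode.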
Transformers: Self-Attention Replaces Recurrence
In 2017, the Transformer architecture (Vaswani et al., "Attention Is All You Need") revolutionized NLP and, soon after, AI for code. Instead of processing sequences token-by-token (as RNNs do), self-attention allows every token to attend to every other token simultaneously, which enables massive parallelism on GPUs.
Self-attention: each token is projected into three vectors, Query (Q), Key (K), and Value (V). Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the dimension of the key vectors.
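The attention formula can be computed directly with nested lists. This is a minimal single-head sketch with no learned projections, masking, or batching; Q, K, and V are passed in as ready-made matrices, and the function names are chosen for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(attention([[0.0, 0.0]], K, V))   # neutral query: equal weights on both values
print(attention([[10.0, 0.0]], K, V))  # query aligned with the first key
```

Each query row is handled independently of the others, which is what makes the computation easy to parallelize.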
Pre-training & Fine-tuning: Transfer Learning
Modern code models follow a two-stage paradigm: pre-training on massive unlabeled code, then fine-tuning on smaller task-specific datasets.
- Pre-training objective (encoder-only): Masked Language Modeling (MLM). Randomly mask 15% of tokens and predict them from bidirectional context. Used by BERT, CodeBERT.
- Pre-training objective (decoder-only): Causal Language Modeling (CLM). Predict the next token given all previous tokens. Used by GPT, Codex.
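The two pre-training objectives differ mainly in how training examples are built from raw token sequences, which the sketch below illustrates. It is a simplification: every selected token is replaced with a `[MASK]` placeholder, whereas BERT-style recipes also sometimes keep or randomize the selected token. The function names and the seed parameter are assumptions for the example.

```python
import random

MASK = "[MASK]"

def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling: hide some tokens, keep them as labels."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss is computed at this position
    return inputs, labels

def clm_example(tokens):
    """Causal language modeling: the target is the input shifted by one token."""
    return tokens[:-1], tokens[1:]

tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
print(mlm_example(tokens))
print(clm_example(tokens))
```

In the MLM case the model sees context on both sides of each masked position; in the CLM case the prediction at position t may only use tokens before t.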
Pre-training Stage
Fine-tuning Stage
Code Models Ecosystem
Between 2020 and 2024, the code AI ecosystem matured into a set of models specialized by architecture and task.
| Model | Architecture | Training data | Key capability |
|---|---|---|---|
| CodeBERT | Encoder-only | CodeSearchNet (6 langs) | Code search, clone detection, defect prediction |
| GraphCodeBERT | Encoder-only | CodeSearchNet + data flow | Structure-aware code understanding |
| CodeT5 | Encoder-decoder | CodeSearchNet + C/C# | Code generation, summarization, translation |
| StarCoder | Decoder-only | The Stack (80+ langs) | Code completion, fill-in-the-middle |
| Codex / GPT | Decoder-only | GitHub public code | Code generation from NL, Copilot backend |
Module Recap & Key Takeaways
Deep learning moved software engineering tooling from hand-crafted rules and features to learned representations. Together with the move from task-specific models to pre-trained and fine-tuned architectures, this enabled a step change in AI for software engineering.
Non-Generative Tasks
Embeddings & BPE
LSTM & Seq2Seq
Transformers
Pre-training
Modern Code Models
Course materials
Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.
Lab · Deep code completion
An end-to-end seq2seq + attention setup for code completion, with a small hyperparameter tuner. Notebook reproduces every number in the corresponding lecture.