Deep Learning Foundations — CSCI 455 / 555

Introduction & Module Overview

This module covers the core deep learning concepts and architectures that power modern AI-driven software engineering tools. We progress through two major pathways: non-generative tasks (classification without code generation) and generative tasks (producing new code sequences). By the end, you will understand why Transformers replaced RNNs, how pre-training and fine-tuning work, and which architectures suit which SE tasks.

Part A: Non-Generative Tasks

Clone detection, vulnerability prediction, code smell detection — classification and prediction tasks that analyze existing code without producing new source.

Part B: Embeddings & Seq2Seq

Word embeddings, BPE tokenization, LSTM architecture, encoder-decoder models with attention, and beam search for code generation.

Part C: Transformers & Pre-training

Self-attention, multi-head attention, positional encoding, pre-training objectives (MLM/CLM), fine-tuning, and the modern code model ecosystem (CodeBERT, CodeT5, StarCoder).

Neural Network Fundamentals

Every deep learning model is built from the same basic building blocks: artificial neurons organized into layers. A single neuron computes a weighted sum of inputs, adds a bias, and applies a non-linear activation function: y = σ(Σ_i w_i x_i + b). The weights (w_i) and bias (b) are learnable parameters; the activation (σ) introduces the non-linearity that lets networks learn complex patterns.

Neuron: A single computational unit that computes a weighted sum of inputs, adds bias, and applies an activation function.
Input Layer: Receives raw features (e.g., code metrics, token counts).
Hidden Layers: Learn intermediate representations. More layers enable more abstract features.
Output Layer: Produces the final prediction (a class label, probability, or continuous score).
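
The neuron formula above can be traced by hand in a few lines. This is a minimal sketch, assuming a sigmoid activation and two made-up input features; the feature meanings and all numeric values are illustrative, not taken from the lecture.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical scaled features for one method: [lines of code, complexity]
features = [0.5, 0.8]
weights = [1.0, -0.5]   # illustrative values, not learned
bias = 0.1

output = neuron(features, weights, bias)
print(round(output, 4))  # 0.5498, i.e. sigmoid(0.5 - 0.4 + 0.1)
```

A layer is many such neurons sharing the same inputs, and training consists of adjusting `weights` and `bias`.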

ReLU

max(0, x) — fast, prevents vanishing gradients for positive inputs, default for hidden layers.

Sigmoid

1/(1+e^-x) — output between 0 and 1, used in binary classification and LSTM gates.

Tanh

(e^x - e^-x)/(e^x + e^-x) — zero-centered, output between -1 and 1, used in LSTM cells.

Softmax

Normalizes outputs to a probability distribution over multiple classes. Used in classification output layers.
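
The four activations above are short enough to write out directly. A plain-Python sketch, for illustration only (in practice a framework's built-in versions would be used); the max-subtraction inside `softmax` is a common numerical-stability trick and an addition to the formulas listed above.

```python
import math

def relu(x: float) -> float:
    """max(0, x): passes positive inputs through, zeroes out the rest."""
    return max(0.0, x)

def sigmoid(x: float) -> float:
    """1 / (1 + e^-x): output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    """(e^x - e^-x) / (e^x + e^-x): zero-centered, output between -1 and 1."""
    e_pos, e_neg = math.exp(x), math.exp(-x)
    return (e_pos - e_neg) / (e_pos + e_neg)

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into a probability distribution over classes."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # made-up raw scores for three classes
probs = softmax(logits)
print([round(p, 3) for p in probs])
```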

Backpropagation: How Networks Learn

Neural networks learn through a four-step iterative process that repeats thousands of times:

Forward Pass

Input flows through the network layer by layer. Each neuron applies its weights, bias, and activation function.

Loss Computation

Compare the prediction to ground truth using a loss function (cross-entropy for classification, MSE for regression).

Backward Pass

Use the chain rule from calculus to compute gradients — how much each weight contributed to the error.

Weight Update

Adjust weights in the direction that reduces loss: w = w - lr * gradient. The learning rate (lr) controls step size.

python

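# PyTorch training step (the core logic is four calls)
# Assumes model, criterion, optimizer, inputs, and label are already defined.
optimizer.zero_grad()                # clear gradients left over from the previous step
prediction = model(inputs)           # forward pass
loss = criterion(prediction, label)  # loss computation
loss.backward()                      # backward pass: backpropagation computes gradients
optimizer.step()                     # weight update: gradient descent adjusts the weights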
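
To see what the framework automates, the same four steps can be written by hand for the smallest possible model. This sketch assumes a one-weight model y = w * x, mean squared error, and a toy dataset generated from y = 2x; the data, learning rate, and step count are illustrative choices.

```python
# Toy data generated from y = 2x, so the weight should converge to 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0     # initial weight
lr = 0.05   # learning rate (step size)

for step in range(200):
    # 1. Forward pass
    preds = [w * x for x in xs]
    # 2. Loss computation (mean squared error)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backward pass: dLoss/dw via the chain rule
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Weight update: w = w - lr * gradient
    w = w - lr * grad

print(round(w, 4), round(loss, 8))
```

With a much larger learning rate the updates overshoot and diverge, which is why the step size matters.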

Non-Generative Tasks: Classification & Prediction

Non-generative tasks take code as input and output a label or score — they analyze existing code without producing new source. Examples include clone detection ("is this code a copy of another?"), vulnerability prediction ("does this method contain a security flaw?"), and code smell detection ("is this class poorly designed?"). All non-generative tasks follow the same pipeline: source code → feature extraction → classifier/DNN → label/score.

Clone Detection (Types I-IV)

Type I: Identical except whitespace/comments. Type II: Renamed identifiers and literals. Type III: Gapped clone, with statements added, removed, or modified. Type IV: Semantic clone, with the same functionality but different syntax. Type IV clones evade token- and syntax-matching detectors, which is where ML/DL approaches matter most.
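
Types I and II can be checked with plain token comparison, which makes the contrast with Type IV concrete. The sketch below is a simplified illustration for Java-like snippets, not a production clone detector: the keyword set is a small hypothetical subset, and the regex tokenizer ignores corner cases such as comment markers inside string literals.

```python
import re
from typing import Optional

KEYWORDS = {"int", "void", "return", "if", "else", "for", "while"}  # illustrative subset
TOKEN_RE = re.compile(r'[A-Za-z_]\w*|\d+|"[^"\n]*"|\S')

def tokens(code: str) -> list:
    """Strip comments and whitespace, then split the code into tokens."""
    code = re.sub(r'/\*.*?\*/', ' ', code, flags=re.S)  # block comments
    code = re.sub(r'//[^\n]*', ' ', code)               # line comments
    return TOKEN_RE.findall(code)

def abstract(toks: list) -> list:
    """Replace identifiers with ID and literals with LIT; keep keywords and symbols."""
    out = []
    for t in toks:
        if t in KEYWORDS:
            out.append(t)
        elif re.fullmatch(r'[A-Za-z_]\w*', t):
            out.append("ID")
        elif t[0].isdigit() or t[0] == '"':
            out.append("LIT")
        else:
            out.append(t)
    return out

def clone_type(a: str, b: str) -> Optional[str]:
    """Return 'Type I', 'Type II', or None if neither check matches."""
    ta, tb = tokens(a), tokens(b)
    if ta == tb:
        return "Type I"
    if abstract(ta) == abstract(tb):
        return "Type II"
    return None

original = "int sum(int a, int b) { return a + b; } // add two numbers"
reformatted = "int sum(int a,int b){\n    /* same body */\n    return a + b;\n}"
renamed = "int total(int x, int y) { return x + y; }"
rewritten = "int sum(int a, int b) { int r = a; r += b; return r; }"

print(clone_type(original, reformatted))  # Type I
print(clone_type(original, renamed))      # Type II
print(clone_type(original, rewritten))    # None
```

The last pair computes the same result with different statements, so token matching finds nothing; that gap is what embedding-based detectors aim to close.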

Vulnerability & Code Smell

Vulnerability: Classify code as vulnerable/safe using metrics (LOC, complexity, dangerous APIs), tokens, and change history. Challenge: class imbalance. Code Smell: Detect God Class, Long Method, Feature Envy from software metrics.

Task	Input	Output	Features	Challenge
Clone Detection	Code pair	Clone type (I-IV)	Tokens, AST, embeddings	Type IV semantic clones
Vulnerability Prediction	Code component	Risk score (0–1)	Metrics, API calls, history	Class imbalance (few vulns)
Code Smell Detection	Class / method	Smell type	LOC, complexity, coupling	Subjective thresholds
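
The shared pipeline (source code → feature extraction → classifier → score) can be sketched end to end for vulnerability prediction. Everything specific here is an assumption made for illustration: the list of dangerous APIs, the three features, and the hand-picked weights stand in for what a trained model would learn from labeled data.

```python
import math
import re

DANGEROUS_APIS = ("strcpy", "gets", "sprintf", "system")  # illustrative list

def extract_features(code: str) -> list:
    """Feature extraction: [non-blank lines, branch keywords, dangerous API calls]."""
    loc = len([line for line in code.splitlines() if line.strip()])
    branches = len(re.findall(r'\b(?:if|for|while|case)\b', code))
    dangerous = sum(
        len(re.findall(r'\b' + re.escape(api) + r'\s*\(', code))
        for api in DANGEROUS_APIS
    )
    return [loc, branches, dangerous]

def risk_score(features: list, weights: list, bias: float) -> float:
    """Classifier: a logistic model mapping features to a score in (0, 1)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hand-picked weights for illustration; a real model would learn these.
WEIGHTS = [0.01, 0.1, 1.5]
BIAS = -2.0

unbounded = "void copy(char *src) {\n    char buf[16];\n    strcpy(buf, src);\n}\n"
bounded = "void copy(char *src) {\n    char buf[16];\n    strncpy(buf, src, sizeof(buf) - 1);\n}\n"

for snippet in (unbounded, bounded):
    feats = extract_features(snippet)
    print(feats, round(risk_score(feats, WEIGHTS, BIAS), 3))
```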

Embeddings: From Symbols to Vectors

Before feeding code tokens to a neural network, we must represent them as numerical vectors. The naive approach, one-hot encoding, gives each token a sparse vector with a single 1 and 0s everywhere else. The problem: each vector is as long as the vocabulary (often 50K+ tokens), and every pair of tokens looks equally dissimilar. Dense embeddings (e.g., Word2Vec) instead map each token to a low-dimensional vector (128–768 dimensions), and tokens appearing in similar contexts cluster near each other in vector space.

Cosine Similarity

Measures how close two embeddings are: cos(θ) = (A · B) / (||A|| × ||B||). Range: -1 (opposite direction) to +1 (same direction). Snippets with different syntax but the same logic can still have embeddings with high cosine similarity (e.g., 0.98), which is how embedding-based detectors flag Type IV clones.
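
The cosine formula translates directly into code. A minimal sketch with made-up three-dimensional vectors standing in for real embeddings (which would have hundreds of dimensions and come from a trained model):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """cos(theta) = (A . B) / (||A|| * ||B||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of three code snippets
loop_sum = [0.9, 0.1, 0.3]
stream_sum = [0.8, 0.2, 0.35]   # different syntax, similar meaning
file_reader = [-0.2, 0.9, -0.4]

print(round(cosine_similarity(loop_sum, stream_sum), 3))
print(round(cosine_similarity(loop_sum, file_reader), 3))
```

Because the measure depends only on direction, a vector and any positive multiple of it score 1, so +1 means "same direction" rather than "identical".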

Byte Pair Encoding (BPE)

Tokenize compound identifiers (e.g., getEmbeddedIPv4) into subword units. Start with characters, iteratively merge the most frequent adjacent pairs. Essential for code's open vocabulary.
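
The merge loop described above fits in a short script. This is a simplified sketch of BPE training on a toy list of identifiers; real tokenizers train on a large corpus, handle word boundaries and bytes, and learn tens of thousands of merges. The identifier list and the tie-breaking rule are illustrative choices.

```python
from collections import Counter

def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words: list, num_merges: int) -> list:
    """Start from characters and repeatedly merge the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def segment(word: str, merges: list) -> list:
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

identifiers = ["getName", "getValue", "getIndex", "setName"]
merges = learn_bpe(identifiers, num_merges=2)
print(merges)                      # [('e', 't'), ('g', 'et')]
print(segment("getItem", merges))  # ['get', 'I', 't', 'e', 'm']
```

`getItem` never appeared in the training list, yet it still splits into known subwords; this is how BPE copes with identifiers it has not seen before.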

LSTM & GRU: Gated Memory for Sequences

Vanilla RNNs process sequences token-by-token, passing a hidden state forward. But gradients during backpropagation through time can shrink exponentially with sequence length: after about 50 tokens, the gradient is often nearly zero. This vanishing gradient problem prevents learning long-range dependencies. LSTM (Long Short-Term Memory) mitigates this with gated memory cells. Three gates control information flow:

Forget Gate (f_t): sigmoid(W_f * [h_{t-1}, x_t]); decides what to erase from the cell state.
Input Gate (i_t): sigmoid(W_i * [h_{t-1}, x_t]); decides how much new information to store.
Output Gate (o_t): sigmoid(W_o * [h_{t-1}, x_t]); decides how much of the cell state to expose as the hidden state h_t = o_t ⊙ tanh(c_t).
Cell State Update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t, where g_t = tanh(W_g * [h_{t-1}, x_t]) is the candidate value; the forget gate erases old information and the input gate writes new information. Bias terms are omitted for brevity.
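
One LSTM time step can be written out for the scalar case (hidden size 1, input size 1) to show how the gates combine. This sketch assumes the usual formulation in which the candidate value g_t is a tanh of the same inputs and the hidden state is h_t = o_t * tanh(c_t); it also adds a bias per gate. The parameter layout (`params` as a dict of `(w_h, w_x, b)` triples) is a hypothetical convenience, not a library API.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t: float, h_prev: float, c_prev: float, params: dict):
    """One time step of a scalar LSTM cell. params[gate] = (w_h, w_x, b)."""
    def pre_activation(gate: str) -> float:
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("f"))    # forget gate: what to erase
    i_t = sigmoid(pre_activation("i"))    # input gate: what to store
    o_t = sigmoid(pre_activation("o"))    # output gate: what to expose
    g_t = math.tanh(pre_activation("g"))  # candidate value to write

    c_t = f_t * c_prev + i_t * g_t        # cell state update
    h_t = o_t * math.tanh(c_t)            # new hidden state
    return h_t, c_t

# Illustrative parameters: forget gate wide open, input gate nearly closed,
# so the cell state is carried forward almost unchanged.
remember = {"f": (0.0, 0.0, 20.0), "i": (0.0, 0.0, -20.0),
            "o": (0.0, 0.0, 0.0), "g": (0.5, 0.5, 0.0)}

h, c = 0.0, 2.0
for x in [0.3, -1.2, 0.7]:
    h, c = lstm_step(x, h, c, remember)
print(round(c, 6))
```

When f_t is close to 1 and i_t close to 0, the cell state passes through each step nearly unchanged; this additive path is what lets gradients survive over long sequences.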

LSTM

3 Gates: Input, Forget, Output. 2 States: Cell + Hidden. More parameters, slower training, better on long sequences.

GRU

2 Gates: Update, Reset. 1 State: Hidden only. Faster training, fewer parameters, good on small datasets.

Seq2Seq & Attention: Breaking the Bottleneck

Sequence-to-sequence (Seq2Seq) models map variable-length input sequences to variable-length output sequences. Classic examples: code translation (Java → Python) and bug repair. An encoder LSTM reads the input and compresses it into a fixed-size context vector. A decoder LSTM then generates output tokens one at a time. During training, the decoder receives the ground-truth previous tokens (teacher forcing). The fixed-size context vector is a bottleneck for long inputs; attention removes it by letting the decoder consult every encoder state at each step, in the four steps that follow.

Encoder Forward Pass

Input sequence processed left-to-right, yielding one hidden state per token. Final hidden state h_last encodes the entire input.

Attention Scoring

At each decoder step, compute dot-products: score_i = decoder_state · encoder_state_i. Softmax converts the scores to weights that sum to 1.

Context Aggregation

Weighted sum of encoder states: context = Σ weight_i * encoder_state_i.

Token Generation

Concatenate context with decoder state. Feed through output layers to predict the next token.
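
The scoring, aggregation, and concatenation steps above can be traced for a single decoder step. A minimal sketch with made-up two-dimensional states; real encoder and decoder states come from trained LSTMs and have hundreds of dimensions.

```python
import math

def softmax(scores: list) -> list:
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_state: list, encoder_states: list):
    """Dot-product attention for one decoder step."""
    # Attention scoring: score_i = decoder_state . encoder_state_i
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    # Context aggregation: weighted sum of the encoder states
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one state per input token
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
combined = context + decoder_state  # concatenation fed to the output layers
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The first and third encoder states align with the decoder state and receive most of the weight, so the context vector is dominated by those two input positions.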

Transformers: Self-Attention Replaces Recurrence

In 2017, the Transformer architecture (Vaswani et al., "Attention Is All You Need") was introduced and went on to reshape NLP and AI for code. Instead of processing sequences token-by-token (as RNNs do), self-attention lets every token attend to every other token simultaneously, enabling massive parallelism on GPUs. Self-attention: Each token is projected into three vectors: Query (Q), Key (K), and Value (V). Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the dimension of the key vectors.
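
The attention formula can be executed on a tiny example using plain lists as matrices. This is a single-head sketch without masking or batching; the input matrix and the identity projection matrices are illustrative assumptions (trained models learn W_Q, W_K, and W_V).

```python
import math

def matmul(a: list, b: list) -> list:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m: list) -> list:
    return [list(row) for row in zip(*m)]

def softmax_rows(m: list) -> list:
    out = []
    for row in m:
        shift = max(row)
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x: list, w_q: list, w_k: list, w_v: list):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # every token scored against every token
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)    # one probability row per token
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]       # stand-in for learned projections
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, 2 dimensions each

output, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

All rows of `weights` are computed from the same matrix products, with no loop over time steps; that is the source of the parallelism described above.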

Multi-Head Self-Attention

Each token queries all other tokens in parallel across h heads. Captures bracket matching, variable scoping, data flow.

Residual + Layer Norm

Add attention output to original input (residual connection). Normalize activations. Enables gradient flow through deep stacks.

Feed-Forward Network

Two-layer MLP applied independently to each position. Often described as where the model stores factual knowledge: API signatures, language syntax, common patterns.

Residual + Layer Norm Again

Second residual around FFN. Output ready for next Transformer block.
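
The "Residual + Layer Norm" step that appears twice in each block is small enough to show in full. This sketch follows the order described above (add, then normalize) and leaves out the learnable scale and shift parameters that real layer normalization includes; the vectors are made up.

```python
import math

def layer_norm(vec: list, eps: float = 1e-5) -> list:
    """Normalize one token's activations to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x: list, sublayer_out: list) -> list:
    """Residual connection followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer_out)]
    return layer_norm(residual)

token_input = [1.0, 2.0, 3.0, 4.0]           # input to the sub-layer
attention_output = [0.5, -0.5, 0.5, -0.5]    # hypothetical sub-layer output

out = add_and_norm(token_input, attention_output)
print([round(v, 3) for v in out])
```

Because the input is added back before normalizing, the sub-layer only has to learn a correction to its input, and gradients have a direct path through the addition.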

RNN/LSTM Processing

Sequential: Process tokens left-to-right. Token dependencies require O(n) hops: the tokens at positions 1 and 50 need 49 recurrent steps to interact. Limited parallelism.

Transformer Processing

Parallel: All tokens processed simultaneously. Token dependencies are O(1) direct path via attention. Scales to billions of parameters.

Pre-training & Fine-tuning: Transfer Learning

Modern code models follow a two-stage paradigm: pre-training on massive unlabeled code, then fine-tuning on smaller task-specific datasets. Pre-training objective (encoder-only): Masked Language Modeling (MLM) — randomly mask 15% of tokens, predict them from bidirectional context. Used by BERT, CodeBERT. Pre-training objective (decoder-only): Causal Language Modeling (CLM) — predict the next token given all previous tokens. Used by GPT, Codex.
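
The two objectives differ mainly in how training examples are built from raw token sequences. The sketch below shows that data-preparation step only, under simplifying assumptions: every selected token is replaced by a `[MASK]` string (BERT-style training also sometimes keeps or randomizes the selected token), and tokens are plain strings rather than vocabulary ids. The function names are hypothetical.

```python
import random

MASK = "[MASK]"

def make_mlm_example(tokens: list, mask_prob: float = 0.15, seed: int = 0):
    """Masked LM: hide some tokens; the model predicts them from both sides."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # target: the original token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss at unmasked positions
    return inputs, labels

def make_clm_examples(tokens: list) -> list:
    """Causal LM: predict each token from all the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

code_tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]

mlm_inputs, mlm_labels = make_mlm_example(code_tokens, mask_prob=0.3)
print(mlm_inputs)

for context, target in make_clm_examples(code_tokens)[:3]:
    print(context, "->", target)
```

MLM sees context on both sides of a masked position, which suits understanding tasks; CLM sees only the left context, which matches how code is generated token by token.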

Pre-training Stage

Goal: General code understanding. Data: 900GB+. Cost: Millions in GPU compute. Duration: Weeks on 64+ GPUs. Frequency: Once, shared publicly.

Fine-tuning Stage

Goal: Task specialization. Data: 100s–1000s examples. Cost: Cheap — hours on 1 GPU. Frequency: For each new task/domain.

Code Models Ecosystem

Between 2020 and 2024, the code AI ecosystem matured into specialized models for different tasks and architectures.

Model	Architecture	Training data	Key capability
CodeBERT	Encoder-only	CodeSearchNet (6 langs)	Code search, clone detection, defect prediction
GraphCodeBERT	Encoder-only	CodeSearchNet + data flow	Structure-aware code understanding
CodeT5	Encoder-decoder	CodeSearchNet + C/C#	Code generation, summarization, translation
StarCoder	Decoder-only	The Stack (80+ langs)	Code completion, fill-in-the-middle
Codex / GPT	Decoder-only	GitHub public code	Code generation from NL, Copilot backend

Module Recap & Key Takeaways

Deep learning moved software engineering tooling from hand-crafted rules and features to learned representations. Together with the shift from task-specific models to pre-trained + fine-tuned architectures, this enabled a step-change in AI for software engineering.

Non-Generative Tasks

Clone detection, vulnerability prediction, code smell detection. Key insight: semantic clones (Type IV) evade syntax-based detectors and call for ML/DL.

Embeddings & BPE

Dense vectors capture semantic similarity. Byte Pair Encoding handles code's open vocabulary. Foundation for all neural code models.

LSTM & Seq2Seq

Gated recurrent cells solve vanishing gradients. Encoder-decoder for variable-length-to-variable-length tasks. Attention breaks the bottleneck.

Transformers

Self-attention replaces recurrence. Parallel processing on GPUs. 8–16 attention heads learn diverse patterns. Enables scaling.

Pre-training

Train once on massive unlabeled code (MLM/CLM). Fine-tune cheaply on task-specific data. Transfer learning can cut the labeled data a task needs by roughly 100×.

Modern Code Models

Models such as CodeBERT, CodeT5, StarCoder, and Codex power tools like GitHub Copilot (Codex), automated code review, vulnerability scanners, and refactoring assistants.

Course materials

Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.

Slides (pdf): Deep Learning 101
Slides (pdf): DL4SE Applications Overview
Slides (pdf): Pre-trained Models for SD Activities
Paper (pdf): GitHub Copilot AI pair programmer: Asset or Liability? (ICSE 2023)

Lab · Deep code completion

An end-to-end seq2seq + attention setup for code completion, with a small hyperparameter tuner. The notebook reproduces every number in the corresponding lecture.

Notebook (ipynb): Lab_2_DL_Code_Completion.ipynb
