Prompting LLMs

Why Prompting Matters

Prompting is a lightweight adaptation technique that guides a pre-trained language model toward specific tasks by embedding input-output examples and instructions directly in the prompt. Instead of retraining or fine-tuning, prompting changes what the model sees at inference time while keeping the model frozen. Prompting adapts the input; fine-tuning adapts the model.

Prompting / In-Context Learning

No parameter updates
No backpropagation
Adapts at inference on the fly
Works with just a few examples
Same model serves multiple tasks
Cost scales with API queries

Fine-Tuning

Updates model parameters
Requires backpropagation
Needs a training phase
Requires labeled training data
Produces task-specific models
Cost is upfront in training time

In-Context Learning Fundamentals

In-Context Learning (ICL) is the core mechanism behind prompting. A pre-trained language model observes demonstrations (input-output pairs) embedded in the prompt context and learns to replicate the pattern on a new target input. No model weights are updated; the learning happens entirely at inference time.

// Prompt Structure for ICL

1. TASK DESCRIPTION
   "Fix the following buggy Java method by adding appropriate validation."

2. DEMONSTRATIONS
   Demo 1: Buggy -> Fixed (guard clause for null check)
   Demo 2: Buggy -> Fixed (guard clause for range check)

3. TARGET INPUT
   "public String getElement(String[] arr, int idx) { return arr[idx]; }"

4. MODEL OUTPUT
   Applies learned pattern: adds bounds checking guard clause
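
The four-part structure above can be assembled programmatically. The following is a minimal sketch, not a prescribed format: the `Demo` class, the `build_icl_prompt` helper, the `###` section markers, and the two Java demonstrations are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One input-output demonstration (hypothetical container)."""
    buggy: str
    fixed: str


def build_icl_prompt(task: str, demos: list[Demo], target: str) -> str:
    """Assemble task description, demonstrations, and target input."""
    parts = [task, ""]
    for i, demo in enumerate(demos, start=1):
        parts += [f"### Example {i}", "Buggy:", demo.buggy, "Fixed:", demo.fixed, ""]
    # The prompt ends with an open "Fixed:" slot for the model to complete.
    parts += ["### Target", "Buggy:", target, "Fixed:"]
    return "\n".join(parts)


# Illustrative demonstrations: one null-check guard, one range-check guard.
demos = [
    Demo(
        buggy="int len(String s) { return s.length(); }",
        fixed="int len(String s) { if (s == null) return 0; return s.length(); }",
    ),
    Demo(
        buggy="int pct(int p) { return p; }",
        fixed=(
            "int pct(int p) { if (p < 0 || p > 100) "
            "throw new IllegalArgumentException(); return p; }"
        ),
    ),
]
prompt = build_icl_prompt(
    "Fix the following buggy Java method by adding appropriate validation.",
    demos,
    "public String getElement(String[] arr, int idx) { return arr[idx]; }",
)
print(prompt)
```

Ending the prompt on an empty "Fixed:" slot mirrors the demonstrations, so the most natural continuation is the fixed version of the target method.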

Few-Shot Prompting and Example Selection

Few-shot prompting provides 3–10 demonstrations to establish a clear pattern. Research on code summarization shows that moving from zero-shot to few-shot dramatically improves performance — but only when the examples are well-chosen. A critical insight: which examples you select matters as much as how many you provide.

Poor Example Selection (Random)

Demonstrations from unrelated projects with different naming conventions, APIs, and coding styles.

Result: the model receives conflicting signals. Performance degradation of 15–25%.

Careful Example Selection (Same-Project)

Demonstrations from the same codebase sharing vocabulary and style.

Result: the model receives clear, aligned signals. Performance improvement of 20–35%.

10-shot: Outperforms fine-tuned baselines on code summarization
0–1 shot: Insufficient to convey task pattern; high variance
Zero-shot to few-shot: 25–40% accuracy improvement
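
One way to approximate "same-project" selection when the pool of candidate demonstrations is mixed is to rank candidates by how much vocabulary they share with the target. The sketch below uses Jaccard overlap of identifier tokens; that metric, the helper names, and the sample pool are assumptions for illustration, not the selection method used in the research mentioned above.

```python
import re


def tokens(code: str) -> set[str]:
    """Split code into lowercase word pieces, breaking camelCase and snake_case."""
    words = re.findall(r"[A-Za-z][a-z]*|\d+", code)
    return {w.lower() for w in words}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0


def select_demos(target: str, pool: list[tuple[str, str]], k: int) -> list[tuple[str, str]]:
    """Return the k (code, summary) pairs whose code is most similar to the target."""
    t = tokens(target)
    ranked = sorted(pool, key=lambda pair: jaccard(t, tokens(pair[0])), reverse=True)
    return ranked[:k]


# Hypothetical demonstration pool: two user-profile helpers and one unrelated plot helper.
pool = [
    ("def loadUserProfile(userId): return db.fetchUser(userId)", "Load a user profile by id."),
    ("def render_chart(ax, series): ax.plot(series)", "Plot a series on an axis."),
    ("def saveUserProfile(userId, profile): db.storeUser(userId, profile)", "Persist a user profile."),
]
target = "def deleteUserProfile(userId): db.removeUser(userId)"
chosen = select_demos(target, pool, k=2)
for code, summary in chosen:
    print(summary)
```

Here the two user-profile functions are selected and the plotting helper is dropped, because it shares almost no vocabulary with the target.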

Chain-of-Thought Prompting for Reasoning

Chain-of-Thought (CoT) prompting asks the model to show its work step by step before reaching a conclusion. This simple modification, appending a cue such as "Let's analyze step by step", can improve accuracy on complex code analysis tasks by 25–40%. Why does it work? Each intermediate step the model writes becomes part of its context, so it constrains the steps that follow.

Standard Prompt (No CoT)

Q: "Is this code thread-safe?"

A: "No."

The model jumps to a verdict without explanation. The answer is hard to verify and more likely to be wrong.

Chain-of-Thought Prompt

Q: "Is this code thread-safe? Let's analyze step by step."

A: "Step 1: Identify shared state... Step 2: Check for synchronization... Step 3: Look for race conditions... Conclusion: No, because..."

Reasoning is transparent and constrained by logic.

// CoT Example: Debugging a Bubble Sort

Buggy Code:
def bubble_sort(arr):
  for i in range(len(arr)):
    for j in range(len(arr) - 1):
      if arr[j] > arr[j + 1]:
        arr[j] = arr[j + 1]        # Bug: overwrites arr[j]
        arr[j + 1] = arr[j]        # Now both are the same

CoT Trace (input [3, 1, 2]):
Step 1: i=0, j=0: arr[0]=3 > arr[1]=1 -> swap
        arr[0] = 1 makes arr = [1, 1, 2]
        arr[1] = arr[0] copies 1 back
        -> THE VALUE 3 IS LOST

Step 2: The swap never saves the original arr[j] value
        before overwriting it.

Conclusion: Needs temporary variable or tuple unpacking:
           arr[j], arr[j+1] = arr[j+1], arr[j]
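
The conclusion of the trace can be checked by running both versions side by side. In this sketch the functions copy the input and return the result, which is an assumption added for easy comparison (the original mutates the list in place and returns nothing); the fixed version also shortens the inner loop on each pass, a common optimization that is not part of the bug fix itself.

```python
def bubble_sort_buggy(arr):
    """The buggy version from the trace, kept for comparison."""
    arr = list(arr)  # copy so the caller's list is untouched
    for i in range(len(arr)):
        for j in range(len(arr) - 1):
            if arr[j] > arr[j + 1]:
                arr[j] = arr[j + 1]   # overwrites arr[j]
                arr[j + 1] = arr[j]   # copies the same value back
    return arr


def bubble_sort(arr):
    """Fixed version: tuple unpacking swaps both values at once."""
    arr = list(arr)
    for i in range(len(arr)):
        # After pass i, the last i elements are already in place.
        for j in range(len(arr) - 1 - i):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr


print(bubble_sort_buggy([3, 1, 2]))  # [1, 1, 2], the 3 is lost
print(bubble_sort([3, 1, 2]))        # [1, 2, 3]
```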

Prompt Engineering Best Practices

A systematic framework for writing effective prompts. Each principle is a concrete lever you can adjust to improve output quality.

Be Specific: Replace vague instructions ('Write a function') with precise requirements ('Write a Python function that takes a list of integers and returns the second-largest unique value'). Ambiguity forces the model to guess.
Provide Context: Include framework details, imports, class structure, and coding style guidelines. The model cannot read your codebase; it can only read what you paste.
Specify Output Format: Request JSON, XML, or specific docstring format. Show, don't tell — one example is worth ten words of explanation.
Use Delimiters: Wrap code in triple backticks or XML tags to separate instructions from code. Prevents the model from confusing the two.
Include Examples: Provide one or more input-output demonstrations that show your expectations for structure, style, and detail level.
Iterate and Refine: Your first prompt is rarely perfect. Each bad output reveals a gap in your instructions. Refine and re-run until you get the quality you want.
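
Several of these principles (specific task, explicit context, output format, delimiters) can be combined in one reusable template. The sketch below is one possible layout under stated assumptions: the template wording, the `<code>` tag, the JSON shape, and the helper names `build_review_prompt` and `parse_review` are hypothetical, and the sample reply is hand-written rather than real model output.

```python
import json

PROMPT_TEMPLATE = """You are reviewing Python code.

Task: {task}

Context: {context}

Return only a JSON object with this shape:
{{"issues": [{{"line": <int>, "message": "<string>"}}]}}

<code>
{code}
</code>
"""


def build_review_prompt(task: str, context: str, code: str) -> str:
    """Fill the template; the <code> tags separate instructions from code."""
    return PROMPT_TEMPLATE.format(task=task, context=context, code=code.strip())


def parse_review(reply: str) -> list[dict]:
    """Parse the model's reply; raise ValueError if it ignores the format."""
    data = json.loads(reply)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict) or not isinstance(data.get("issues"), list):
        raise ValueError("reply does not match the requested JSON shape")
    return data["issues"]


prompt = build_review_prompt(
    task="Find off-by-one errors in the loop bounds.",
    context="Python 3.11, no third-party libraries, PEP 8 style.",
    code="for i in range(len(xs) + 1):\n    print(xs[i])",
)
# Hand-written stand-in for a model reply, used only to exercise the parser.
sample_reply = '{"issues": [{"line": 1, "message": "range bound exceeds len(xs)"}]}'
issues = parse_review(sample_reply)
print(issues[0]["message"])
```

Requesting a machine-checkable format has a practical benefit: a reply that ignores the instructions fails in `parse_review` immediately, which makes the "iterate and refine" step easier to automate.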

Retrieval-Augmented Generation (RAG)

RAG combines retrieval with generation. Instead of hoping the model knows your codebase and APIs, you retrieve relevant code snippets at query time and inject them into the prompt as context. The LLM then generates answers grounded in your actual code rather than hallucinating generic patterns.

Query: User asks a question or requests code
Embed: Encode query into a vector
Vector DB Search: Find top-k nearest snippets in the codebase index
Retrieve Top-k: Pull the most relevant snippets
Augment Prompt: Prepend retrieved snippets to the user's prompt
LLM Generate: Model produces output grounded in real code

Without RAG

The LLM guesses based on general knowledge: 'Use OAuth2 with Spring Security.' Not your actual implementation. May hallucinate classes that don't exist in your codebase.

With RAG

The LLM sees your actual code: AuthMiddleware, JwtTokenService, @Authenticated. It grounds the answer in your project's real implementation. Specific, accurate, relevant.
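
The retrieve-then-augment flow can be illustrated without any external services. This is a toy sketch under loud assumptions: a bag-of-words count stands in for a learned embedding model, a Python list stands in for the vector database, and the snippets and the `<snippet>` tag are invented for the example. A production system would replace each of these pieces.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag of lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, snippets: list[str], k: int) -> list[str]:
    """Return the k snippets most similar to the query."""
    q = embed(query)
    return sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]


def augment(query: str, retrieved: list[str]) -> str:
    """Prepend the retrieved snippets to the user's question."""
    context = "\n\n".join(f"<snippet>\n{s}\n</snippet>" for s in retrieved)
    return f"Use only the project code below.\n\n{context}\n\nQuestion: {query}"


# Hypothetical codebase index.
snippets = [
    "class JwtTokenService:\n    def verify_token(self, token): ...",
    "class AuthMiddleware:\n    def authenticate(self, request): "
    "return JwtTokenService().verify_token(request.token)",
    "def render_invoice(order): ...",
]
query = "How does authenticate verify the request token?"
retrieved = retrieve(query, snippets, k=2)
prompt = augment(query, retrieved)
print(prompt)
```

The unrelated invoice snippet never reaches the prompt, so the model is shown only the two authentication classes alongside the question.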

Managing Context Windows

The context window is the maximum number of tokens a model can process at once. Modern models have large windows (Claude 3.5: 200K, Gemini 1.5: 1M), but even 200K tokens is only about 150K words of English prose, or on the order of tens of thousands of lines of code, which is far smaller than many production codebases. A critical finding: models degrade on very long contexts even when the input technically fits. Information in the middle of long prompts tends to be overlooked (the 'lost in the middle' effect), so shorter, more focused prompts often outperform longer ones.

Truncation: Cut files to the first N lines or the most relevant functions. Simple but lossy.
Chunking: Split large files into logical chunks (functions, classes, modules) and process each independently.
RAG / Selective Retrieval: Retrieve only relevant snippets instead of including everything. Most common production strategy.
Summarization: Use the LLM itself to compress 500-line files into 20-line synopses.
Hierarchical Context: Include a high-level map (file tree, function signatures) plus full code of 2–3 most-relevant files.
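
Chunking and the high-level map used in hierarchical context can both be derived from a parse tree rather than from raw line counts. The sketch below handles Python sources only and uses the standard-library `ast` module; the helper names `chunk_source` and `outline` and the sample module are hypothetical.

```python
import ast


def chunk_source(source: str) -> list[tuple[str, str]]:
    """Split a module into (name, source) chunks, one per top-level def or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


def outline(source: str) -> list[str]:
    """Build a high-level map: the first line of each top-level chunk."""
    return [text.splitlines()[0] for _, text in chunk_source(source)]


SOURCE = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Cache:
    def get(self, key):
        return None
'''

for name, text in chunk_source(SOURCE):
    print(f"{name}: {len(text.splitlines())} lines")
print(outline(SOURCE))
```

Splitting on syntactic boundaries keeps each function or class intact, which simple truncation to the first N lines does not guarantee.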

Tool Use and Function Calling

Modern LLMs can request to call external tools — transforming them from text generators into agents. For SE automation, tool use enables an iterative development loop: LLM generates code → calls run_tests() → sees test failures → fixes the code → calls run_tests() again. This is how Copilot Workspace, Cursor, and Claude Code work.

Without Tool Use

The LLM generates code and guesses whether it works. Hallucination risk is high. The model might confidently produce code with bugs it cannot detect.

With Tool Use

The LLM generates code, runs tests, sees failures, and iteratively fixes issues. The feedback loop grounds the model in actual execution results, so bugs it could not detect by inspection are surfaced and corrected.
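
The generate, test, fix loop can be sketched with a stand-in for the model. Everything here is illustrative: `fake_llm` is a hypothetical stub that returns a buggy answer first and a corrected one once it receives feedback, `run_tests` is a toy tool with two hard-coded cases, and `exec` is used without sandboxing, which would not be acceptable for untrusted model output in a real system.

```python
from typing import Callable


def run_tests(code: str) -> list[str]:
    """Tool: execute candidate code and return a list of failure messages."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # unsafe outside a sandbox; acceptable for a sketch
        add = namespace["add"]
        cases = [((1, 2), 3), ((-1, 1), 0)]
        return [f"add{args} returned {add(*args)}, expected {want}"
                for args, want in cases if add(*args) != want]
    except Exception as exc:
        return [f"error: {exc!r}"]


def fake_llm(task: str, feedback: list[str]) -> str:
    """Hypothetical stand-in for a model call: buggy first, fixed after feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def agent_loop(task: str, llm: Callable[[str, list[str]], str],
               max_rounds: int = 3) -> tuple[str, int]:
    """Alternate generation and testing until the tests pass or rounds run out."""
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        code = llm(task, feedback)
        feedback = run_tests(code)  # test failures become the next round's input
        if not feedback:
            return code, round_no
    raise RuntimeError(f"tests still failing after {max_rounds} rounds: {feedback}")


code, rounds = agent_loop("Write add(a, b).", fake_llm)
print(f"passed after {rounds} round(s)")
```

The `max_rounds` cap matters in practice: without it, a model that keeps producing failing code would loop indefinitely and accumulate cost.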

Prompt Chaining and Multi-Step Workflows

Complex tasks benefit from decomposition. Instead of one massive prompt requesting 'analyze bugs, suggest fixes, and generate tests,' chain multiple focused prompts where each step's output feeds the next. Each prompt is optimized independently and debugged in isolation.

Step 1: Understand

Read code, extract a clear specification of what it does.

Step 2: Identify Edge Cases

List boundary conditions, error paths, and invariants worth testing.

Step 3: Write Tests

Generate executable tests one edge case at a time.

Step 4: Review & Refine

Read the tests; merge duplicates; check for missing scenarios.

Single monolithic prompt: 2 tests for binary search
Four-step chain: 8+ tests, covering 4× more edge cases
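
The four steps above can be wired together so that each prompt receives the previous step's output. This is a minimal sketch: the step names, the template wording, and `echo_llm` (a stub that only reports which instruction it received) are hypothetical placeholders for real prompts and a real model call.

```python
from typing import Callable

STEPS = [
    ("understand", "Describe what this code does:\n{input}"),
    ("edge_cases", "List edge cases worth testing for this specification:\n{input}"),
    ("write_tests", "Write one executable test per edge case:\n{input}"),
    ("review", "Review these tests; merge duplicates and note gaps:\n{input}"),
]


def run_chain(code: str, llm: Callable[[str], str]) -> dict[str, str]:
    """Run each step in order, feeding the previous output into the next prompt."""
    outputs: dict[str, str] = {}
    current = code
    for name, template in STEPS:
        current = llm(template.format(input=current))
        outputs[name] = current  # keep every intermediate result for debugging
    return outputs


seen_prompts: list[str] = []


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in model: records the prompt and echoes its first line."""
    seen_prompts.append(prompt)
    return f"[reply to: {prompt.splitlines()[0]}]"


outputs = run_chain("def binary_search(a, x): ...", echo_llm)
for name, text in outputs.items():
    print(f"{name}: {text}")
```

Keeping every intermediate output is what makes each step debuggable in isolation: a weak final result can be traced back to the first step whose output looks wrong.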

Self-Consistency and Majority Voting

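Generate multiple responses to the same prompt and take the majority vote. This trades compute for accuracy. With temperature > 0, each generation can follow a different reasoning path, and an error that appears in only one path is outvoted. For code, where two correct solutions rarely match character for character, the vote is taken over behavior: generate, say, 5 solutions, run them on the same tests or inputs, and keep one from the largest group that agrees. The related pass@k metric measures the chance that at least one of k sampled solutions passes the tests.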

// Self-Consistency Example: Palindrome Checker

Generate 5 solutions at temperature=0.8:
  Solution 1 (Two-pointer):    CORRECT  ✓
  Solution 2 (Reverse slice):  CORRECT  ✓
  Solution 3 (Buggy recursion): INCORRECT ✗ (missing base case)
  Solution 4 (Case-insensitive): CORRECT ✓
  Solution 5 (Two-pointer loop): CORRECT ✓

Result: 4/5 correct
Selection: solution 3 is outvoted; among the four correct solutions, prefer solution 4 if the spec requires case-insensitive matching
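
Voting over code is easier to implement on behavior than on text. The sketch below fingerprints each candidate by its outputs on a shared set of probe inputs and keeps a candidate from the largest agreeing group. The three hand-written candidates stand in for sampled model outputs, and the probe list and helper names are assumptions for illustration.

```python
from collections import Counter


def two_pointer(s):
    i, j = 0, len(s) - 1
    while i < j:
        if s[i] != s[j]:
            return False
        i, j = i + 1, j - 1
    return True


def reverse_slice(s):
    return s == s[::-1]


def buggy_recursion(s):
    # Missing base case: recursion reaches the empty string and s[0] fails.
    if s[0] != s[-1]:
        return False
    return buggy_recursion(s[1:-1])


PROBES = ["", "a", "abba", "abc", "racecar"]


def behavior(fn):
    """Fingerprint a candidate by its outputs on shared probe inputs."""
    results = []
    for probe in PROBES:
        try:
            results.append(fn(probe))
        except Exception:
            results.append("error")
    return tuple(results)


def majority_vote(candidates):
    """Return one candidate from the largest group with identical behavior."""
    prints = [behavior(fn) for fn in candidates]
    winner, votes = Counter(prints).most_common(1)[0]
    return candidates[prints.index(winner)], votes


best, votes = majority_vote([two_pointer, reverse_slice, buggy_recursion])
print(best.__name__, votes)
```

Agreement is not proof of correctness: if most samples share the same mistake, the vote selects that mistake, which is why combining voting with real tests is the safer choice when tests exist.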

Evaluating Prompt Effectiveness

How do you know if your prompts are working? Systematic evaluation separates prompt engineering from guesswork.

Automated Metrics: pass@k (does at least one of k sampled solutions pass the tests?), BLEU (n-gram overlap with a reference), test pass rate. Quantitative and reproducible.
A/B Testing: Compare two prompt versions on the same inputs. Measure which produces better outputs.
Error Analysis: Categorize failures: wrong logic, wrong syntax, wrong API, hallucinated functions. Each suggests a specific prompt improvement.
Regression Testing: Save your best prompts. When the underlying model updates, re-run your suite to catch regressions.
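
For pass@k, a commonly used estimator draws n samples per problem, counts the c samples that pass, and computes 1 - C(n-c, k) / C(n, k), which is the probability that a random subset of k samples contains at least one passing solution. The sketch below implements that formula for a single problem; the function name and the argument checks are choices made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Five samples, four passing: a single draw passes 4 times out of 5.
print(round(pass_at_k(n=5, c=4, k=1), 3))
# With two draws, at least one must pass because only one sample fails.
print(round(pass_at_k(n=5, c=4, k=2), 3))
```

For a benchmark with many problems, the per-problem estimates are averaged to give the reported pass@k.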

Variant A (Zero-Shot)

No examples, task description only.
45% test pass rate

Variant B (Few-Shot, 3 examples)

Task + 3 demonstrations.
62% test pass rate

Variant C (Few-Shot + CoT + Format)

Examples + step-by-step reasoning + explicit format constraints.
78% test pass rate

Patterns, Anti-Patterns, and Synthesis

Prompting is a powerful, practical technique for software engineering automation. The major patterns are tools to reach for depending on task complexity and data availability.

Pattern	Description	When to use	Performance	Prompt length
Zero-Shot	Task description only	Simple, well-known tasks	Variable	Short
One-Shot	Single demonstration	Tasks with clear patterns	Moderate	Short
Few-Shot	3–10 demonstrations	Complex SE tasks, code summarization	Strong	Medium
Chain-of-Thought	Step-by-step reasoning in demos	Debugging, code review, analysis	Strong	Long
RAG	Retrieve relevant snippets	Large codebases, domain-specific tasks	Strong	Medium

Course materials

Lab C · Prompting & RAG