Module 06 · Reliability

Hallucinations

Code hallucinations occur when LLMs generate syntactically valid code that looks plausible but invokes non-existent APIs, uses wrong parameters, or implements incorrect logic — a uniquely dangerous form of AI error.

Reading time: ~50 min · Instructor: Dr. Mastropaolo · Cohort: Spring 2026

What Are Hallucinations in Code?

In general NLP, a hallucination occurs when a model generates content that is nonsensical, unfaithful to the source, or factually incorrect. In code generation, hallucinations are more dangerous because the output is syntactically valid and may even compile or run, yet it behaves incorrectly. Three properties make them worse than text hallucinations. First, a verifiability gap: detecting them requires execution or deep expertise. Second, downstream impact: runtime crashes, security vulnerabilities, and silent data corruption. Third, false confidence: syntactically correct code creates a strong illusion of correctness.

General LLM Hallucination

Output that is fluent and confident but factually wrong. Example: claiming a historical event occurred on the wrong date, or citing a paper that does not exist.

Code Hallucination

Generated code that looks plausible but calls non-existent APIs, uses wrong method signatures, introduces logical errors, or references unavailable libraries. The code often passes superficial review.

Why It Matters

Runtime crashes from hallucinated APIs, silent data corruption from logic errors, security vulnerabilities from knowledge-conflicting patterns, and slopsquatting attacks on fabricated package names.

  • 31.3%: Avg hallucination rate (CodeHaluEval)
  • 42.1%: API hallucinations (most common)
  • 19.7%: Logic hallucinations (hardest)
  • 58%: Reduction with RAG

How LLMs Generate Code: The Autoregressive Process

Before understanding hallucinations, we must understand how LLMs produce code token by token. The autoregressive generation process — where each token conditions on all previous tokens — is the root cause of many hallucination patterns. The model: (1) receives the prompt as input tokens, (2) computes a probability distribution over its vocabulary for the next token, (3) samples a token using temperature, top-k, or top-p, (4) appends and repeats.

Temperature
Controls how peaked the distribution is. Low values (0.0–0.3) give near-deterministic output (exactly greedy at 0.0), which hallucinates less but can be repetitive. High values (0.8–1.2) give more varied output that is more prone to hallucination.
Top-k Sampling
Samples only from the k most probable tokens, cutting off the long tail of very unlikely tokens that are more likely to be hallucinations.
Top-p (Nucleus)
Samples from the nucleus: the smallest set of tokens whose cumulative probability reaches the threshold p.
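
The three controls above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular model's decoder: the logits are invented, and the helper names `next_token_distribution` and `sample_token` are assumptions made for this example.

```python
import math
import random


def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        # Treat T=0 as greedy decoding: all mass on the single best token.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most probable tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:  # smallest prefix that reaches the threshold
                break
        ranked = kept

    # Renormalise so the surviving probabilities sum to 1.
    remaining = sum(prob for _, prob in ranked)
    return {tok: prob / remaining for tok, prob in ranked}


def sample_token(distribution, rng=random):
    """Draw one token according to the filtered distribution."""
    tokens = list(distribution)
    return rng.choices(tokens, weights=[distribution[t] for t in tokens], k=1)[0]


# Hypothetical logits for the token that follows "pd." in a pandas completion.
logits = {"merge": 4.0, "concat": 3.0, "read_csv": 2.5, "smart_merge": 0.5}
print(next_token_distribution(logits, temperature=0.2))
print(next_token_distribution(logits, temperature=1.0, top_k=2))
```

Lowering the temperature or tightening top-k/top-p shrinks the probability left for the invented `smart_merge` token, which is the mechanism behind the "less hallucinatory" settings described above.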

CodeHalu Taxonomy: Four Categories

Tian et al. (AAAI 2025) propose a systematic taxonomy of code hallucinations based on execution-based verification, dividing them into four categories. Understanding this taxonomy enables precise classification and targeted mitigation.

| Category | Definition | Examples | Detection |
| --- | --- | --- | --- |
| 1. Mapping | The LLM fails to correctly map task description to code. Generated solution does not align with what was asked. | Asked to sort descending but code sorts ascending; asked to return indices but returns values. | Execution-based testing against specification |
| 2. Naming | References identifiers that do not exist in current scope. Includes undefined references and name confusion. | df.length() instead of len(df); list.filterBy() instead of list.filter(); np.normalize() (doesn't exist). | Static analysis, type checkers, AST analysis |
| 3. Resource | References external resources (packages, files, services) that do not exist or are inaccessible. | Importing sklearn.neural (correct: sklearn.neural_network); assuming config files exist. | Dependency resolver, static analysis |
| 4. Logic | Code is syntactically valid and uses real APIs correctly, but implements flawed algorithms. | Off-by-one in loops; early return in sorting; palindrome check that compares to s[:-1] instead of s[::-1]. | Execution-based testing with comprehensive test suites |

Deep Dive: API and Parameter Hallucinations

API hallucinations — where the model generates calls to methods, classes, or modules that do not exist in the target library — are the most common form of code hallucination. Parameter hallucinations, where the model calls a real method but with non-existent or incorrect arguments, are subtler but equally dangerous.

python
# API Hallucination Example 1: Non-existent method
import pandas as pd
df1 = pd.read_csv("data1.csv")
df2 = pd.read_csv("data2.csv")

# HALLUCINATED: pandas has no smart_merge() function (raises AttributeError)
result = pd.smart_merge(df1, df2,
    on="id",
    strategy="fuzzy",        # not a real param
    threshold=0.85)          # not a real param

# CORRECT approach:
# result = pd.merge(df1, df2, on="id")
# For fuzzy: use thefuzz or recordlinkage library
API Hallucination Patterns
  • Naming interpolation: combines real API fragments
  • Cross-library confusion: mixing APIs from different ecosystems
  • Version drift: calling removed APIs from older versions
  • Fabricated submodules: inventing packages that sound real
Why This Happens
  • Training data contains multiple library versions
  • Model generalizes from similar APIs in different libraries
  • Parameter names are plausible by analogy
  • Method names follow predictable conventions, so interpolation works
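
Because interpolated names look so plausible, one cheap guard is to ask the installed environment whether a referenced API really exists. The sketch below is an illustration only: `module_exists` and `api_exists` are hypothetical helper names, and the examples use standard-library modules so the snippet runs without third-party packages. Note that the check imports the module in question, so it should only be pointed at libraries you already trust.

```python
import importlib
import importlib.util


def module_exists(module_name):
    """Return True if `module_name` can be located in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or is not a package at all.
        return False


def api_exists(dotted_path):
    """Check a dotted reference such as 'json.dumps'.

    Imports the longest importable module prefix, then walks the
    remaining parts as attributes.
    """
    parts = dotted_path.split(".")
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        if not module_exists(module_name):
            continue
        obj = importlib.import_module(module_name)
        for attribute in parts[split:]:
            if not hasattr(obj, attribute):
                return False
            obj = getattr(obj, attribute)
        return True
    return False


print(api_exists("json.dumps"))        # real function
print(api_exists("json.smart_dumps"))  # plausible-sounding but invented
print(module_exists("json.decoder"))   # real submodule
print(module_exists("json.fuzzy"))     # fabricated submodule
```

A check like this catches fabricated names and submodules for the installed version only; it says nothing about whether the parameters passed to a real function are valid.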

Spot the Hallucination: Interactive Examples

Practice identifying hallucinations by examining real code examples. The key is recognizing that hallucinations often look plausible — they follow naming conventions and patterns you would expect in real APIs.

python
# Challenge 1: Spot the hallucination
# Snippet A:
import random
data = [random.randint(1, 100) for _ in range(20)]
sorted_data = data.sortDescending()  # HALLUCINATED
print(sorted_data)

# Snippet B:
import random
data = [random.randint(1, 100) for _ in range(20)]
sorted_data = sorted(data, reverse=True)  # CORRECT
print(sorted_data)
javascript
// Challenge 2: String vs Array API confusion
// Hallucinated:
const text = "Hello, World!";
const broken = text.reverse();  // WRONG: strings have no .reverse(); throws TypeError

// Correct:
const reversed = text.split('').reverse().join('');
python
# Challenge 3: Logic hallucination (all APIs are real)
def is_palindrome(s):
    s = s.lower().strip()
    return s == s[:-1]  # WRONG: should be s[::-1]

# This code:
# - Is syntactically valid (no errors)
# - Uses real Python APIs (slice notation)
# - But implements incorrect logic: s[:-1] drops the last character,
#   so the function returns False for every non-empty string

Why LLMs Hallucinate Code

Hallucinations are not random errors — they emerge systematically from how transformers learn and generate code.

Training Data Gaps
APIs that are infrequent in training data are more likely to be hallucinated. Less popular libraries have far fewer training examples, leading to fabricated method names.
API Version Confusion
Training data contains code using multiple versions of the same library. The model may blend APIs across incompatible versions.
Distribution Shift
When a prompt asks for code combining libraries in novel ways, the model falls back on statistical patterns rather than genuine understanding.
Autoregressive Accumulation
Each generated token conditions on all previous tokens, so a small early error can cascade through the rest of the function.
Pattern Mimicry over Semantics
LLMs are pattern-matching systems. They synthesize plausible-sounding method names that follow conventions but do not actually exist.

Detecting Hallucinations: Multiple Strategies

No single detection method catches all hallucinations. Effective detection requires multiple complementary strategies layered into a comprehensive pipeline.

Execution-Based Verification
  • Execute generated code against test suites
  • Compare actual outputs to expected results
  • Catches any category of hallucination that changes observable behavior, provided a test exercises it
  • Requires comprehensive test suites
  • Cannot detect dead code or style issues
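
As a small illustration of execution-based verification, the sketch below runs the palindrome function from Challenge 3 against a handful of input/expected pairs. The harness `run_test_cases` and the test cases are assumptions made for this example, not part of any benchmark.

```python
def is_palindrome_candidate(s):
    """Hypothetical model output containing the logic hallucination from Challenge 3."""
    s = s.lower().strip()
    return s == s[:-1]


def is_palindrome_reference(s):
    """Corrected version that compares against the reversed string."""
    s = s.lower().strip()
    return s == s[::-1]


def run_test_cases(func, cases):
    """Return (input, expected, actual) for every failing case."""
    failures = []
    for argument, expected in cases:
        try:
            actual = func(argument)
        except Exception as error:  # a crash also counts as a failed verification
            actual = f"raised {type(error).__name__}"
        if actual != expected:
            failures.append((argument, expected, actual))
    return failures


CASES = [("level", True), ("Noon", True), ("python", False), ("", True)]
print(run_test_cases(is_palindrome_candidate, CASES))
print(run_test_cases(is_palindrome_reference, CASES))
```

The candidate passes the negative case and the empty string, which shows why a thin test suite can let a logic hallucination through: only the positive cases expose it.
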
Static Analysis & Type Checking
  • Parse code to AST, check for undefined references
  • Type checkers (mypy, TypeScript) find param mismatches
  • Linters catch unused imports
  • Excellent for naming and resource hallucinations
  • Misses logic errors and valid but wrong APIs
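| Method | Naming | Resource | Logic | API params | Cost |
| --- | --- | --- | --- | --- | --- |
| Linting (ruff, ESLint) | ✓ Excellent | ✓ Good | ✗ No | ✗ No | Very low |
| Type Checking (mypy, tsc) | ✓ Good | ✓ Good | ✗ No | ✓ Excellent | Low |
| Test Execution | ✓ Good | ✓ Good | ✓ Good | ✓ Good | Medium–High |
| Static Analysis (Semgrep) | ✓ Good | ✓ Good | ✗ No | ✓ Fair | Medium |
| AST-Based Analysis | ✓ Excellent | ✓ Fair | ✗ No | ✓ Fair | Low |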
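
To make the AST row concrete, here is a rough sketch of an undefined-reference check built on Python's `ast` module. It is a flow-insensitive approximation (it ignores scoping and ordering), and both the function name `find_undefined_names` and the sample code are invented for illustration; real linters do this far more precisely.

```python
import ast
import builtins


def find_undefined_names(source):
    """Return (name, line) pairs for names that are read but never bound anywhere."""
    tree = ast.parse(source)
    defined = set(dir(builtins))

    # Pass 1: collect every name the code binds, regardless of scope.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for item in node.names:
                defined.add(item.asname or item.name.split(".")[0])
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            defined.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, (ast.Store, ast.Del)):
            defined.add(node.id)

    # Pass 2: report every name that is loaded but was never bound.
    undefined = {
        (node.id, node.lineno)
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.id not in defined
    }
    return sorted(undefined, key=lambda pair: (pair[1], pair[0]))


# Hypothetical generated code with a misspelled parameter and an invented variable.
GENERATED = '''
import random

def top_scores(scores, limit):
    ranked = sorted(scores, reverse=True)
    return ranked[:limt]

print(top_scores(sample_scores, 3))
'''

for name, line in find_undefined_names(GENERATED):
    print(f"line {line}: undefined name {name!r}")
```

The check never executes the generated code, which is why it is cheap, and also why it cannot say anything about logic errors.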

RAG for Hallucination Reduction

Retrieval-Augmented Generation (RAG) is a foundational technique for reducing hallucinations. Instead of relying on the model's parametric memory alone, RAG retrieves relevant documentation at query time to ground the generation in factual data.

Step 1: User Query
User asks for code or explanation
Step 2: Embed Query
Encode query into a vector
Step 3: Search Docs
Vector search against documentation index
Step 4: Retrieve Context
Pull top-k relevant docs
Step 5: Augment Prompt
Prepend docs to the user prompt
Step 6: Generate with Grounding
Model generates code constrained by real docs
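
The retrieval and augmentation stages of this pipeline can be sketched without any infrastructure. The example below is a toy: the "embedding" is a bag-of-words count vector standing in for a learned embedding model, the three documentation strings stand in for a real index, and the helper names (`embed`, `retrieve`, `augment_prompt`) are assumptions for illustration. The final generation step is left out because it depends on the model API in use.

```python
import math
import re
from collections import Counter

# Hypothetical documentation snippets standing in for a real index.
DOCS = [
    "pandas.merge(left, right, how='inner', on=None): database-style join of two DataFrames.",
    "pandas.concat(objs, axis=0): concatenate pandas objects along an axis.",
    "sorted(iterable, key=None, reverse=False): return a new sorted list.",
]


def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(docs, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)
    return ranked[:k]


def augment_prompt(query, docs, k=2):
    """Prepend the retrieved documentation to the user's task."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, docs, k))
    return f"Use only the APIs documented below.\n{context}\n\nTask: {query}"


print(augment_prompt("merge two DataFrames on the id column", DOCS, k=1))
```

Retrieval quality bounds everything downstream: if the wrong snippet is retrieved, the model is grounded in the wrong documentation.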

Mitigation: De-Hallucinator's Iterative Approach

Eghbali & Pradel (2024) introduce De-Hallucinator, an iterative grounding approach that retrieves relevant API documentation to reduce hallucinations. Unlike single-pass RAG, De-Hallucinator iterates — extracting APIs from generated code, retrieving documentation for those specific APIs, and re-querying the model with verified documentation.

Step 1: User Prompt
Initial natural language request
Step 2: Initial Generation
LLM produces a first draft, possibly hallucinated
Step 3: Extract API Calls
Parse the output for API references
Step 4: Retrieve API Docs
Pull docs for each referenced API
Step 5: Re-query with Context
Feed verified docs back to the model
Step 6: Refined Output
Corrected code grounded in real APIs

  • 23.3–50.6%: Edit distance improvement
  • 23.9–61.0%: API recall improvement
  • 63.2%: More fixed tests
  • 15.5%: Statement coverage increase
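
The loop described above can be approximated as follows. This is a sketch in the spirit of the iterative approach, not the authors' implementation: the API index, the `tables` object, and the `fake_model` stand-in for an LLM call are all invented so the example runs on its own, and name matching uses `difflib` as a simple substitute for a real retriever.

```python
import ast
import difflib

# Hypothetical project API index: name -> one-line documentation.
API_DOCS = {
    "merge": "merge(left, right, on): join two tables on a shared key.",
    "concat": "concat(tables): stack tables vertically.",
}


def extract_called_names(source):
    """Collect the function or method name of every call in the code."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Attribute):
                names.append(target.attr)
            elif isinstance(target, ast.Name):
                names.append(target.id)
    return names


def retrieve_api_docs(names, index):
    """Map each predicted name to its closest documented API, if any."""
    docs = []
    for name in names:
        for match in difflib.get_close_matches(name, list(index), n=1, cutoff=0.5):
            if index[match] not in docs:
                docs.append(index[match])
    return docs


def fake_model(prompt):
    """Stand-in for an LLM call: hallucinates unless documentation is in the prompt."""
    if "merge(left, right, on)" in prompt:
        return "result = tables.merge(df1, df2, on='id')"
    return "result = tables.smart_merge(df1, df2, on='id')"


def iterative_grounding(task, model, index, max_rounds=3):
    """Generate, extract API names, retrieve docs for them, and re-query."""
    code = model(task)
    for _ in range(max_rounds):
        names = extract_called_names(code)
        if all(name in index for name in names):
            break  # every call resolves to a documented API
        docs = retrieve_api_docs(names, index)
        prompt = "API reference:\n" + "\n".join(docs) + "\n\nTask: " + task
        code = model(prompt)
    return code


print(iterative_grounding("join df1 and df2 on id", fake_model, API_DOCS))
```

The key design choice is that the first, possibly hallucinated, draft is used as a retrieval query: an invented name such as `smart_merge` is usually close enough to the real API to pull in the right documentation.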

Prompt Engineering as Mitigation

Careful prompt design is one of the simplest and most effective ways to reduce hallucinations. These techniques require no infrastructure changes — just better prompts.

Be Explicit About Constraints
Tell the model exactly what it can and cannot use: "Only use Python standard library functions. Do not import any third-party packages."
Provide API Documentation in Context
Paste the relevant API docs directly into the prompt. The model is far less likely to hallucinate APIs when it has the real docs in its context window.
Ask for Citations and Reasoning
Prompt: "Cite which library each method comes from" or "Explain why you chose this approach before writing code."
Add Negative Examples
"Do NOT use deprecated methods. Do NOT use eval(). Do NOT concatenate strings into SQL queries."
Request Self-Verification
"After generating the code, verify that all API methods you used actually exist in the specified library version."

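These techniques compose naturally into a reusable prompt template. The helper below is one possible arrangement, with a hypothetical function name and wording; the exact phrasing that works best will vary by model.

```python
def build_grounded_prompt(task, allowed_libraries, api_docs=(), forbidden_patterns=()):
    """Assemble a prompt with explicit constraints, in-context docs, and negative examples."""
    sections = [
        "You are writing Python code.",
        "Allowed libraries: " + ", ".join(allowed_libraries) + ". Do not import anything else.",
    ]
    if api_docs:
        sections.append("API documentation:\n" + "\n".join(f"- {doc}" for doc in api_docs))
    if forbidden_patterns:
        sections.append("Do NOT use: " + ", ".join(forbidden_patterns) + ".")
    sections.append("Before the code, state which library each function you call comes from.")
    sections.append("Task: " + task)
    return "\n\n".join(sections)


prompt = build_grounded_prompt(
    task="Load users.csv and return the ten most recent sign-ups.",
    allowed_libraries=["csv", "datetime"],
    api_docs=["csv.DictReader(f): iterate over rows as dictionaries."],
    forbidden_patterns=["eval()", "string-concatenated SQL"],
)
print(prompt)
```

Keeping the constraints in code rather than retyping them makes it easy to apply the same guardrails to every request and to version them alongside the project.
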
Tool-Augmented Generation: Real-Time Verification

Tool-augmented generation gives LLMs access to external tools that verify code in real time, dramatically reducing hallucinations by providing ground truth feedback during the generation process.

| Tool | Catches | Misses | Speed |
| --- | --- | --- | --- |
| Linter (ruff, ESLint) | Undefined names, unused imports, style | Logic errors, valid but wrong APIs | Instant |
| Type Checker (mypy, tsc) | Wrong param types, invalid signatures | Runtime behavior, algorithm correctness | Instant |
| Test Runner (pytest, JUnit) | Logic errors, edge cases, wrong outputs | Requires good test suite; cannot catch untested paths | Seconds–minutes |
| Static Analyzer (Semgrep) | Security vulnerabilities, anti-patterns | Novel hallucinations not in rule database | Seconds |
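
A feedback loop of this kind might look like the sketch below, where tool diagnostics from one attempt are handed back to the model for the next. Everything here is a simplification for illustration: `fake_model` stands in for an LLM call and reproduces the `sortDescending` hallucination from Challenge 1, the built-in `compile()` plays the role of a linter, and a direct `exec` plays the role of a test runner (real systems should run generated code in a sandbox).

```python
def check_syntax(source):
    """Linter stand-in: report syntax errors without running the code."""
    try:
        compile(source, "<generated>", "exec")
        return []
    except SyntaxError as error:
        return [f"syntax error on line {error.lineno}: {error.msg}"]


def check_behaviour(source, function_name, cases):
    """Test-runner stand-in: execute the code and compare outputs to expectations."""
    namespace = {}
    messages = []
    try:
        exec(source, namespace)  # assumption: the code is trusted or sandboxed
        func = namespace[function_name]
        for argument, expected in cases:
            actual = func(argument)
            if actual != expected:
                messages.append(
                    f"{function_name}({argument!r}) returned {actual!r}, expected {expected!r}"
                )
    except Exception as error:
        messages.append(f"{type(error).__name__}: {error}")
    return messages


def generate_with_feedback(model, task, function_name, cases, max_attempts=3):
    """Regenerate until the tools report no problems or attempts run out."""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        source = model(task, feedback)
        feedback = check_syntax(source) or check_behaviour(source, function_name, cases)
        if not feedback:
            return source, attempt
    return None, max_attempts


def fake_model(task, feedback):
    """Stand-in for an LLM: fixes its answer once it sees the AttributeError."""
    if any("AttributeError" in message for message in feedback):
        return "def sort_desc(data):\n    return sorted(data, reverse=True)\n"
    return "def sort_desc(data):\n    return data.sortDescending()\n"


source, attempts = generate_with_feedback(
    fake_model, "sort a list in descending order", "sort_desc", [([3, 1, 2], [3, 2, 1])]
)
print(attempts)
print(source)
```

The cheap check runs first and the expensive one only when it passes, mirroring the speed column in the table above.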

Real-World Hallucinations: Production Case Studies

Code hallucinations are not hypothetical; the following have all happened in production. Each case shows why code output from LLMs must be treated as untrusted input that requires verification.

Copilot Suggests Non-Existent npm Package
GitHub Copilot suggested require('express-validator-sanitizer') — a package that does not exist. A researcher later registered the hallucinated name and found that other developers had already attempted to install it, demonstrating real-world slopsquatting risk.
ChatGPT Generates Deprecated TensorFlow Code
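ChatGPT frequently generated TensorFlow 1.x session-based code for users on TensorFlow 2.x. tf.Session was removed from the top-level API in TF 2.0 (it survives only as tf.compat.v1.Session), so the pattern tf.Session().run() raises AttributeError there, yet the model generates it confidently.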
Hallucinated Django ORM Methods
LLM-generated Django code frequently hallucinates QuerySet methods: .filter_active(), .get_or_none(). These follow Django's naming conventions perfectly but do not exist.
Incorrect Cryptographic Patterns
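Security audits found LLM-generated code frequently uses hashlib.md5(password) for password hashing, which is insecure because MD5 is fast to brute-force and the call applies no salt. The correct approach is a dedicated password-hashing algorithm such as bcrypt or Argon2.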
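
For contrast with the MD5 pattern, the sketch below shows salted, deliberately slow password hashing. Because bcrypt and Argon2 need third-party packages, this example swaps in PBKDF2 from Python's standard library so that it runs anywhere; that substitution is a choice made for this illustration, and the iteration count is an assumption that should be tuned to current guidance and your hardware.

```python
import hashlib
import hmac
import os


def hash_password(password, iterations=600_000):
    """Derive a salted hash; store the salt, iteration count, and digest together."""
    salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def verify_password(password, salt, iterations, expected_digest):
    """Recompute the hash and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected_digest)


salt, iterations, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, iterations, digest))
print(verify_password("wrong guess", salt, iterations, digest))
```

The point for hallucination review is that the insecure and secure versions are both short and both run without error, so only a reviewer or a security rule that knows the difference will flag the first one.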

Building Hallucination-Resistant Workflows

A practical framework you can apply to production systems and course projects. This workflow layers multiple mitigation strategies for increasing levels of protection.

Step 1: Use RAG for Grounding
Provide relevant documentation and code context in your prompts.
Step 2: Set Low Temperature
Use T=0.0–0.2 for code you plan to ship.
Step 3: Add Explicit Constraints
Specify which libraries, versions, and patterns to use. Include negative constraints.
Step 4: Run Linting and Type Checking
Automatically catch naming and resource hallucinations with static analysis.
Step 5: Generate and Run Tests
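Ask the LLM to generate tests for its own code, then execute them. Failing tests point to possible hallucinations, though generated tests can share the code's mistakes, so check them against the specification.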
Step 6: Human Review for Critical Logic
Logic hallucinations cannot be caught by automated tools alone.
Step 7: Monitor in Production
Log LLM outputs, track error rates, set up alerts for unexpected failures.
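
The automated steps in this workflow can be chained so that cheap checks run first and later ones are skipped once something fails. The sketch below shows only the chaining idea with two simple gates; the gate names and the banned-function list are assumptions for illustration, and a real pipeline would add a linter, a type checker, and a test run as further gates.

```python
import ast


def syntax_gate(source):
    """Reject code that does not parse."""
    try:
        ast.parse(source)
        return []
    except SyntaxError as error:
        return [f"syntax error on line {error.lineno}"]


def banned_call_gate(source, banned=("eval", "exec")):
    """Enforce a negative constraint: no direct calls to banned built-ins."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in banned):
            issues.append(f"line {node.lineno}: call to banned function {node.func.id}()")
    return issues


def run_gates(source, gates):
    """Run gates in order; stop at the first one that reports issues."""
    for gate in gates:
        issues = gate(source)
        if issues:
            return gate.__name__, issues
    return None, []


GATES = [syntax_gate, banned_call_gate]
print(run_gates("total = eval('1 + 1')", GATES))
print(run_gates("total = 1 + 1", GATES))
```

Code that clears every automated gate still goes to human review for critical logic, as Step 6 requires.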

Measuring and Evaluating Hallucinations

Quantifying code hallucinations requires specialized benchmarks and metrics. The CodeHaluEval benchmark provides a rigorous framework.

MiHN (Micro Hallucination Number)
Total count of individual hallucinated elements across all generated samples. Lower is better.
MaHR (Macro Hallucination Rate)
Fraction of generated code samples that contain at least one hallucination. Ranges from 0 to 1.0.
Edit Distance
How many edits are needed to fix hallucinated code to match correct code.
API Recall
Fraction of required API calls that are correctly included in generated code.
Test Pass Rate
Percentage of generated code that passes the specification's test suite. The ultimate metric.

  • 31.3%: Avg hallucination rate (CodeHaluEval)
  • 42.1%: API hallucinations (most common)
  • 19.7%: Logic hallucinations (hardest)
  • 67.5%: MiHN decrease with MARIN
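
Using the definitions given above, several of these metrics reduce to a few lines of arithmetic. The sketch below follows this module's wording of MiHN, MaHR, API recall, and test pass rate; the function names and the sample numbers are invented for illustration and are not benchmark results.

```python
def hallucination_metrics(hallucinations_per_sample, tests_passed):
    """Compute MiHN, MaHR, and test pass rate over a set of generated samples.

    hallucinations_per_sample: number of hallucinated elements found in each sample.
    tests_passed: one boolean per sample, True if it passed the test suite.
    """
    n = len(hallucinations_per_sample)
    return {
        "MiHN": sum(hallucinations_per_sample),
        "MaHR": sum(1 for count in hallucinations_per_sample if count > 0) / n,
        "test_pass_rate": sum(tests_passed) / n,
    }


def api_recall(required_apis, generated_apis):
    """Fraction of required API calls that appear in the generated code."""
    required = set(required_apis)
    if not required:
        return 1.0
    return len(required & set(generated_apis)) / len(required)


# Hypothetical evaluation of four generated samples.
print(hallucination_metrics([0, 2, 1, 0], [True, False, False, True]))
print(api_recall({"merge", "read_csv"}, {"read_csv", "smart_merge"}))
```

MiHN and MaHR answer different questions: one sample with many hallucinated calls raises MiHN sharply but moves MaHR by only one sample's worth.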

Summary and Key Takeaways

Code hallucinations are a critical challenge in AI-assisted software development. Understanding their mechanisms, taxonomies, and mitigation strategies is essential for safely deploying LLM-generated code in production systems.

Hallucinations are Unavoidable but Manageable
LLMs will always hallucinate at some rate. The goal is not to eliminate them entirely but to detect and mitigate them systematically.
Taxonomies Enable Precision
CodeHalu provides a framework for precise classification. Different hallucination types require different detection strategies.
Multiple Defense Layers Work Better Than One
Linting catches naming errors. Type checkers catch parameter mismatches. Tests catch logic errors. RAG closes information gaps.
RAG and Iterative Refinement Are Powerful
Providing documentation and repeatedly refining generated code substantially reduces hallucinations. De-Hallucinator reports improvements of 23.3–61.0% on edit distance and API recall, and MARIN a 67.5% decrease in MiHN.
Production Code Requires Human Oversight
For critical code (security, infrastructure, algorithms), human code review remains essential.

Course materials

Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.

Hallucinations is taught as a red-team workshop on the cohort's own models from Labs A–C. No standalone notebook.
