Hallucinations

What Are Hallucinations in Code?

In general NLP, a hallucination occurs when a model generates content that is nonsensical, unfaithful to the source, or factually incorrect. In code generation, hallucinations are especially dangerous because the output usually looks like working code: it is syntactically valid and often compiles or runs, yet it behaves incorrectly or fails only when a fabricated API is reached. Three properties make code hallucinations worse than their prose counterparts. They open a verifiability gap, since detecting them requires execution or deep expertise. They have downstream impact: runtime crashes, security vulnerabilities, and silent data corruption. And they create false confidence, because syntactically correct code gives a strong illusion of correctness.

General LLM Hallucination

Output that is fluent and confident but factually wrong. Example: claiming a historical event occurred on the wrong date, or citing a paper that does not exist.

Code Hallucination

Generated code that looks plausible but calls non-existent APIs, uses wrong method signatures, introduces logical errors, or references unavailable libraries. The code often passes superficial review.

Why It Matters

Runtime crashes from hallucinated APIs, silent data corruption from logic errors, security vulnerabilities from knowledge-conflicting patterns, and slopsquatting attacks on fabricated package names.

Avg hallucination rate (CodeHaluEval): 31.3%
API hallucinations (most common): 42.1%
Logic hallucinations (hardest): 19.7%
Reduction with RAG: 58%

How LLMs Generate Code: The Autoregressive Process

Before understanding hallucinations, we must understand how LLMs produce code token by token. The autoregressive generation process — where each token conditions on all previous tokens — is the root cause of many hallucination patterns. The model: (1) receives the prompt as input tokens, (2) computes a probability distribution over its vocabulary for the next token, (3) samples a token using temperature, top-k, or top-p, (4) appends and repeats.

Temperature: Controls how peaked the distribution is. Low (0.0–0.3) = near-deterministic (fully greedy at 0), less prone to hallucination but repetitive. High (0.8–1.2) = more varied but more prone to hallucination.
Top-k Sampling: Samples only from the k most probable tokens, cutting off the low-probability tail where many implausible continuations sit. A hallucinated token can still rank inside the top k.
Top-p (Nucleus): Samples from the nucleus, the smallest set of tokens whose cumulative probability exceeds the threshold p.
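
The three sampling controls above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular model implements decoding: the four-token vocabulary and its logits are invented for the example, and temperature must be greater than zero here (a real decoder treats 0 as a plain argmax).

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(dist, rng):
    """Draw one token index from a {index: probability} mapping."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical candidates for the token after "pd." with made-up logits.
VOCAB = ["merge", "concat", "smart_merge", "join_all"]
LOGITS = [3.0, 2.0, 0.5, 0.1]

rng = random.Random(0)
probs = softmax_with_temperature(LOGITS, temperature=0.7)
nucleus = top_p_filter(probs, p=0.9)
print(VOCAB[sample(nucleus, rng)])
```

With these made-up logits, both filters drop the fabricated `smart_merge` candidate, but only because it happens to be unlikely. If a fabricated name is the most probable continuation, no sampling setting removes it.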

CodeHalu Taxonomy: Four Categories

Tian et al. (AAAI 2025) propose a systematic taxonomy of code hallucinations based on execution-based verification, dividing them into four categories. Understanding this taxonomy enables precise classification and targeted mitigation.

Category	Definition	Examples	Detection
1. Mapping	The LLM fails to correctly map task description to code. Generated solution does not align with what was asked.	Asked to sort descending but code sorts ascending; asked to return indices but returns values.	Execution-based testing against specification
2. Naming	References identifiers that do not exist in current scope. Includes undefined references and name confusion.	`df.length()` instead of `len(df)`; `items.filterBy()` instead of JavaScript's `items.filter()`; `np.normalize()` (doesn't exist).	Static analysis, type checkers, AST analysis
3. Resource	References external resources (packages, files, services) that do not exist or are inaccessible.	Importing `sklearn.neural` (correct: `sklearn.neural_network`); assuming config files exist.	Dependency resolver, static analysis
4. Logic	Code is syntactically valid and uses real APIs correctly, but implements flawed algorithms.	Off-by-one in loops; early return in sorting; palindrome check that compares to `s[:-1]` instead of `s[::-1]`.	Execution-based testing with comprehensive test suites

Deep Dive: API and Parameter Hallucinations

API hallucinations — where the model generates calls to methods, classes, or modules that do not exist in the target library — are the most common form of code hallucination. Parameter hallucinations, where the model calls a real method but with non-existent or incorrect arguments, are subtler but equally dangerous.

python

# API Hallucination Example 1: Non-existent method
import pandas as pd
df1 = pd.read_csv("data1.csv")
df2 = pd.read_csv("data2.csv")

# HALLUCINATED: pandas has no .smart_merge() method
result = pd.smart_merge(df1, df2,
    on="id",
    strategy="fuzzy",        # not a real param
    threshold=0.85)          # not a real param

# CORRECT approach:
# result = pd.merge(df1, df2, on="id")
# For fuzzy: use thefuzz or recordlinkage library

API Hallucination Patterns

Naming interpolation: combines real API fragments
Cross-library confusion: mixing APIs from different ecosystems
Version drift: calling removed APIs from older versions
Fabricated submodules: inventing packages that sound real

Why This Happens

Training data contains multiple library versions
Model generalizes from similar APIs in different libraries
Parameter names are plausible by analogy
Method names follow predictable conventions, so interpolation works

Spot the Hallucination: Interactive Examples

Practice identifying hallucinations by examining real code examples. The key is recognizing that hallucinations often look plausible — they follow naming conventions and patterns you would expect in real APIs.

python

# Challenge 1: Spot the hallucination
# Snippet A:
import random
data = [random.randint(1, 100) for _ in range(20)]
sorted_data = data.sortDescending()  # HALLUCINATED
print(sorted_data)

# Snippet B:
import random
data = [random.randint(1, 100) for _ in range(20)]
sorted_data = sorted(data, reverse=True)  # CORRECT
print(sorted_data)

javascript

// Challenge 2: String vs Array API confusion
const text = "Hello, World!";

// Hallucinated (throws TypeError: strings don't have .reverse()):
// const reversed = text.reverse();

// Correct:
const reversed = text.split('').reverse().join('');

python

# Challenge 3: Logic hallucination (all APIs are real)
def is_palindrome(s):
    s = s.lower().strip()
    return s == s[:-1]  # WRONG: should be s[::-1]

# This code:
# - Is syntactically valid (no errors)
# - Uses real Python APIs (slice notation)
# - But implements incorrect logic

Why LLMs Hallucinate Code

Hallucinations are not random errors — they emerge systematically from how transformers learn and generate code.

Training Data Gaps: APIs that are infrequent in training data are more likely to be hallucinated. Less popular libraries have far fewer training examples, leading to fabricated method names.
API Version Confusion: Training data contains code using multiple versions of the same library. The model may blend APIs across incompatible versions.
Distribution Shift: When a prompt asks for code combining libraries in novel ways, the model falls back on statistical patterns rather than genuine understanding.
Autoregressive Accumulation: Each generated token conditions on all previous tokens. A small early error cascades through the rest of the function.
Pattern Mimicry over Semantics: LLMs are pattern-matching systems. They synthesize plausible-sounding method names that follow conventions but do not actually exist.

Detecting Hallucinations: Multiple Strategies

No single detection method catches all hallucinations. Effective detection requires multiple complementary strategies layered into a comprehensive pipeline.

Execution-Based Verification

Execute generated code against test suites
Compare actual outputs to expected results
Catches all categories affecting behavior
Requires comprehensive test suites
Cannot detect dead code or style issues
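
A minimal sketch of execution-based verification, using the palindrome logic hallucination as the code under test. The harness, the function names, and the four test cases are assumptions made for illustration. A real suite would need far more cases.

```python
def is_palindrome_generated(s):
    """The logic-hallucinated version: s[:-1] drops the last character."""
    s = s.lower().strip()
    return s == s[:-1]

def is_palindrome_fixed(s):
    """The intended version: s[::-1] reverses the string."""
    s = s.lower().strip()
    return s == s[::-1]

# (input, expected output) pairs that encode the specification.
TEST_CASES = [
    ("racecar", True),
    ("Level", True),
    ("hello", False),
    ("", True),
]

def run_tests(func, cases):
    """Return the failing cases as (input, expected, actual) tuples."""
    failures = []
    for arg, expected in cases:
        try:
            actual = func(arg)
        except Exception as exc:  # a crash also counts as a failure
            actual = repr(exc)
        if actual != expected:
            failures.append((arg, expected, actual))
    return failures

for name, func in [("generated", is_palindrome_generated),
                   ("fixed", is_palindrome_fixed)]:
    print(name, run_tests(func, TEST_CASES))
```

The generated version still passes two of the four cases ("hello" and the empty string). A thin test suite would therefore let the hallucination through, which is why coverage matters.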

Static Analysis & Type Checking

Parse code to AST, check for undefined references
Type checkers (mypy, TypeScript) find param mismatches
Linters catch unused imports
Excellent for naming and resource hallucinations
Misses logic errors and valid but wrong APIs
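
As a rough sketch of the AST idea, the checker below parses generated code and flags `module.attr` references whose attribute does not exist on the imported module. It is deliberately narrow: it only understands plain `import x` and `import x as y`, it only looks one attribute level deep, and it imports the module to inspect it, so it should only be pointed at modules you already trust. The function name and the sample snippet are invented for the example.

```python
import ast
import importlib

def find_missing_module_attributes(source):
    """Report `module.attr` references where `attr` is not found on the module."""
    tree = ast.parse(source)

    # Map each local alias to the real module name it was imported from.
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                aliases[alias.asname or alias.name] = alias.name

    problems = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in aliases):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                problems.append((node.lineno, f"cannot import {module_name}"))
                continue
            if not hasattr(module, node.attr):
                problems.append((node.lineno, f"{module_name}.{node.attr} not found"))
    return sorted(problems)

SAMPLE = """
import math
import json as js

print(math.sqrt(16))
print(math.fast_sqrt(16))
print(js.dumps({}))
print(js.to_string({}))
"""

for lineno, message in find_missing_module_attributes(SAMPLE):
    print(f"line {lineno}: {message}")
```

On the sample, the two fabricated calls are reported and the two real ones are not. A call to a real function with the wrong meaning would pass untouched, which matches the limitation noted above.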

Method	Naming	Resource	Logic	API params	Cost
Linting (ruff, ESLint)	✓ Excellent	✓ Good	✗ No	✗ No	Very low
Type Checking (mypy, tsc)	✓ Good	✓ Good	✗ No	✓ Excellent	Low
Test Execution	✓ Good	✓ Good	✓ Good	✓ Good	Medium–High
Static Analysis (Semgrep)	✓ Good	✓ Good	✗ No	✓ Fair	Medium
AST-Based Analysis	✓ Excellent	✓ Fair	✗ No	✓ Fair	Low

RAG for Hallucination Reduction

Retrieval-Augmented Generation (RAG) is a foundational technique for reducing hallucinations. Instead of relying on the model's parametric memory alone, RAG retrieves relevant documentation at query time to ground the generation in factual data.

Step 1: User Query: User asks for code or explanation.
Step 2: Embed Query: Encode query into a vector.
Step 3: Search Docs: Vector search against documentation index.
Step 4: Retrieve Context: Pull top-k relevant docs.
Step 5: Augment Prompt: Prepend docs to the user prompt.
Step 6: Generate with Grounding: Model generates code constrained by real docs.
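
The retrieval half of this pipeline can be sketched without any external service. In the toy version below, a bag-of-words count vector stands in for a learned embedding, and a three-entry list stands in for the documentation index. The doc snippets are abridged paraphrases written for this example, and the final generation step is left out because it depends on whichever model you call.

```python
import math
import re
from collections import Counter

# Hypothetical, abridged documentation snippets standing in for a real index.
DOCS = [
    "pandas.merge(left, right, how='inner', on=None, ...): merge DataFrame objects with a database-style join.",
    "pandas.concat(objs, ...): concatenate pandas objects along a particular axis.",
    "sorted(iterable, key=None, reverse=False): return a new sorted list from the items in iterable.",
]

def embed(text):
    """Toy embedding: word counts (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=1):
    """Return the k docs most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augment_prompt(query, docs, k=1):
    """Prepend the retrieved docs to the user's task."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Use only the APIs documented below.\n\nDocs:\n{context}\n\nTask: {query}"

print(augment_prompt("merge two DataFrame objects on the id column", DOCS))
```

The augmented prompt now carries the real `pandas.merge` signature, so the model has something concrete to copy instead of interpolating a plausible name.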

Mitigation: De-Hallucinator's Iterative Approach

Eghbali & Pradel (2024) introduce De-Hallucinator, an iterative grounding approach that retrieves relevant API documentation to reduce hallucinations. Unlike single-pass RAG, De-Hallucinator iterates — extracting APIs from generated code, retrieving documentation for those specific APIs, and re-querying the model with verified documentation.

Step 1: User Prompt: Initial natural language request.
Step 2: Initial Generation: LLM produces a first draft, possibly hallucinated.
Step 3: Extract API Calls: Parse the output for API references.
Step 4: Retrieve API Docs: Pull docs for each referenced API.
Step 5: Re-query with Context: Feed verified docs back to the model.
Step 6: Refined Output: Corrected code grounded in real APIs.
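
The loop can be sketched as follows. This is an illustration of the iterate-and-ground idea only, not the De-Hallucinator implementation: `fake_model` is a stub standing in for an LLM call, `API_DOCS` is an invented two-function project, and retrieval is a crude word-overlap match on function names.

```python
import re

# Hypothetical API reference: the only calls that exist in this toy project.
API_DOCS = {
    "merge_frames": "merge_frames(left, right, on): join two tables on a key column.",
    "load_table": "load_table(path): read a table from disk.",
}

def fake_model(prompt):
    """Stand-in for an LLM call: hallucinates unless the docs are in the prompt."""
    if "merge_frames(" in prompt:
        return "result = merge_frames(load_table('a.csv'), load_table('b.csv'), on='id')"
    return "result = smart_merge(load_table('a.csv'), load_table('b.csv'), on='id')"

def extract_api_calls(code):
    """Collect every name that is immediately followed by an opening parenthesis."""
    return set(re.findall(r"\b([A-Za-z_]\w*)\s*\(", code))

def retrieve_docs(calls, api_docs):
    """Return docs for exact matches and for names sharing a word with a call."""
    hits = []
    for name, doc in api_docs.items():
        name_words = set(name.split("_"))
        if any(call == name or name_words & set(call.split("_")) for call in calls):
            hits.append(doc)
    return hits

def iterative_generate(task, model, api_docs, max_rounds=3):
    """Generate, check the calls against the reference, and re-query with docs."""
    code = model(task)
    for _ in range(max_rounds):
        calls = extract_api_calls(code)
        if not calls - set(api_docs):
            break  # every call is a known API
        docs = retrieve_docs(calls, api_docs)
        prompt = "API reference:\n" + "\n".join(docs) + "\n\nTask: " + task
        code = model(prompt)
    return code

print(iterative_generate("Join a.csv and b.csv on id.", fake_model, API_DOCS))
```

The first draft calls the fabricated `smart_merge`. Its name shares the word "merge" with a real function, so the second prompt includes the real signature and the stub switches to `merge_frames`. The `max_rounds` cap keeps the loop from running forever when the model never converges.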

Edit distance improvement: 23.3–50.6%
API recall improvement: 23.9–61.0%
More fixed tests: 63.2%
Statement coverage increase: 15.5%

Prompt Engineering as Mitigation

Careful prompt design is one of the simplest and most effective ways to reduce hallucinations. These techniques require no infrastructure changes — just better prompts.

Be Explicit About Constraints: Tell the model exactly what it can and cannot use: "Only use Python standard library functions. Do not import any third-party packages."
Provide API Documentation in Context: Paste the relevant API docs directly into the prompt. The model is far less likely to hallucinate APIs when it has the real docs in its context window.
Ask for Citations and Reasoning: Prompt: "Cite which library each method comes from" or "Explain why you chose this approach before writing code."
Add Negative Examples: "Do NOT use deprecated methods. Do NOT use eval(). Do NOT concatenate strings into SQL queries."
Request Self-Verification: "After generating the code, verify that all API methods you used actually exist in the specified library version."
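
These techniques combine naturally into a reusable prompt template. The helper below is one possible shape, with a hypothetical function name and wording. It assembles explicit constraints, negative examples, optional pasted docs, and the citation and self-verification requests into a single prompt string.

```python
def build_code_prompt(task, allowed_libraries, api_docs="", forbidden=()):
    """Assemble a prompt that states constraints and asks for self-verification."""
    lines = [
        "You are writing Python code.",
        "Only use these libraries: " + ", ".join(allowed_libraries) + ".",
        "Do not import anything else.",
    ]
    # Negative examples: one explicit prohibition per line.
    for item in forbidden:
        lines.append(f"Do NOT use {item}.")
    # Real documentation pasted into context, when available.
    if api_docs:
        lines += ["", "API documentation:", api_docs]
    lines += [
        "",
        "Task: " + task,
        "",
        "Before the code, cite which library each method comes from.",
        "After the code, verify that every API you used exists in the allowed libraries.",
    ]
    return "\n".join(lines)

prompt = build_code_prompt(
    task="Parse a CSV file and return the rows as dictionaries.",
    allowed_libraries=["csv", "pathlib"],
    forbidden=["eval()", "deprecated methods"],
)
print(prompt)
```

Keeping the template in code rather than retyping it makes the constraints consistent across requests and easy to review.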

Tool-Augmented Generation: Real-Time Verification

Tool-augmented generation gives LLMs access to external tools that verify code as it is produced. The tools return ground-truth feedback (an undefined name, a type error, a failing test) that the model can use to revise its output.

Tool	Catches	Misses	Speed
Linter (ruff, ESLint)	Undefined names, unused imports, style	Logic errors, valid but wrong APIs	Instant
Type Checker (mypy, tsc)	Wrong param types, invalid signatures	Runtime behavior, algorithm correctness	Instant
Test Runner (pytest, JUnit)	Logic errors, edge cases, wrong outputs	Requires good test suite; cannot catch untested paths	Seconds–minutes
Static Analyzer (Semgrep)	Security vulnerabilities, anti-patterns	Novel hallucinations not in rule database	Seconds
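
One way to picture the feedback loop is a small checker whose findings are formatted into a follow-up message for the model. The sketch below uses only the standard library: it reports syntax errors and imports that do not resolve in the local environment. The function names and message wording are assumptions for the example. Note that "not installed locally" is not the same as "does not exist", and a missing name should never be installed automatically, since that is exactly what slopsquatting exploits.

```python
import ast
import importlib.util

def check_generated_code(source):
    """Return tool findings: a syntax error, or imports that do not resolve locally."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            # find_spec returns None when no such top-level module can be found.
            if importlib.util.find_spec(top_level) is None:
                findings.append(f"line {node.lineno}: module '{top_level}' is not installed")
    return findings

def feedback_prompt(source, findings):
    """Turn tool findings into a follow-up message for the model."""
    issues = "\n".join(f"- {finding}" for finding in findings)
    return (f"The following problems were found in your code:\n{issues}\n\n"
            f"Revise the code:\n{source}")

GENERATED = "import json\nimport totally_made_up_pkg_xyz\nfrom os import path\n"
findings = check_generated_code(GENERATED)
if findings:
    print(feedback_prompt(GENERATED, findings))
```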

Real-World Hallucinations: Production Case Studies

Code hallucinations are not hypothetical: they have already shown up in real-world use. Each case below shows why code output from LLMs must be treated as untrusted input that requires verification.

Copilot Suggests Non-Existent npm Package: GitHub Copilot suggested require('express-validator-sanitizer') — a package that does not exist. A researcher later registered the hallucinated name and found that other developers had already attempted to install it, demonstrating real-world slopsquatting risk.
ChatGPT Generates Deprecated TensorFlow Code: ChatGPT frequently generated TensorFlow 1.x session-based code for users on TensorFlow 2.x. tf.Session was removed from the top-level API in TF 2.0 (it survives only as tf.compat.v1.Session), so the pattern tf.Session().run() fails there, yet the model generates it confidently.
Hallucinated Django ORM Methods: LLM-generated Django code frequently hallucinates QuerySet methods: .filter_active(), .get_or_none(). These follow Django's naming conventions perfectly but do not exist.
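Incorrect Cryptographic Patterns: Security audits found LLM-generated code frequently uses hashlib.md5(password) for password hashing. MD5 is fast to compute and is used here without a salt, which makes it unsuitable for passwords. The correct approach is a dedicated password-hashing function such as bcrypt or Argon2.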
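
For contrast with the MD5 pattern, here is a minimal sketch of salted, deliberately slow password hashing. bcrypt and Argon2 need third-party packages, so this example swaps in PBKDF2-HMAC-SHA256 from the standard library to stay self-contained. It is a stand-in for illustration, not a recommendation over the other two. The iteration count is an assumption and should be set from current guidance and your own hardware.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor; tune for your environment

def hash_password(password, salt=None):
    """Derive a salted, slow hash with PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Recompute the hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))
print(verify_password("wrong guess", salt, digest))
```

The salt and the digest are both stored. The salt is not secret; its job is to make identical passwords hash to different values.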

Building Hallucination-Resistant Workflows

The following is a practical framework you can apply to production systems and course projects. It layers multiple mitigation strategies, each adding a further level of protection.

Step 1: Use RAG for Grounding: Provide relevant documentation and code context in your prompts.
Step 2: Set Low Temperature: Use T=0.0–0.2 for code you plan to ship.
Step 3: Add Explicit Constraints: Specify which libraries, versions, and patterns to use. Include negative constraints.
Step 4: Run Linting and Type Checking: Automatically catch naming and resource hallucinations with static analysis.
Step 5: Generate and Run Tests: Ask the LLM to generate tests for its own code, then execute them. A failing test flags a possible hallucination, either in the code or in the test itself, so review both.
Step 6: Human Review for Critical Logic: Logic hallucinations cannot be caught by automated tools alone.
Step 7: Monitor in Production: Log LLM outputs, track error rates, set up alerts for unexpected failures.

Measuring and Evaluating Hallucinations

Quantifying code hallucinations requires specialized benchmarks and metrics. The CodeHaluEval benchmark provides a rigorous framework.

MiHN (Micro Hallucination Number): Total count of individual hallucinated elements across all generated samples. Lower is better.
MaHR (Macro Hallucination Rate): Fraction of generated code samples that contain at least one hallucination. Ranges from 0 to 1.0.
Edit Distance: How many edits are needed to fix hallucinated code to match correct code.
API Recall: Fraction of required API calls that are correctly included in generated code.
Test Pass Rate: Percentage of generated code that passes the specification's test suite. The ultimate metric.
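
The metrics above reduce to a few lines of arithmetic. The sketch below follows the definitions as stated in this section. The function names and sample data are invented for illustration, and edit distance is computed as a plain Levenshtein distance over tokens, which is one common choice rather than a fixed standard.

```python
def micro_hallucination_number(counts):
    """MiHN: total hallucinated elements across all samples."""
    return sum(counts)

def macro_hallucination_rate(counts):
    """MaHR: fraction of samples with at least one hallucination."""
    return sum(1 for c in counts if c > 0) / len(counts)

def api_recall(required, generated):
    """Fraction of required API calls that appear in the generated code."""
    required = set(required)
    return len(required & set(generated)) / len(required)

def pass_rate(results):
    """Percentage of samples whose test suite passed (results are booleans)."""
    return 100.0 * sum(results) / len(results)

def edit_distance(a, b):
    """Levenshtein distance between two sequences, such as token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hallucinated elements found in each of four generated samples.
counts = [0, 2, 0, 1]
print(micro_hallucination_number(counts))
print(macro_hallucination_rate(counts))
print(api_recall({"read_csv", "merge"}, {"read_csv", "smart_merge"}))
print(pass_rate([True, False, True, True]))
print(edit_distance(["s", "==", "s[:-1]"], ["s", "==", "s[::-1]"]))
```

MiHN and MaHR answer different questions: one sample with ten fabricated calls raises MiHN sharply but moves MaHR no more than a sample with a single fabricated call.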

Avg hallucination rate (CodeHaluEval): 31.3%
API hallucinations (most common): 42.1%
Logic hallucinations (hardest): 19.7%
MiHN decrease with MARIN: 67.5%

Summary and Key Takeaways

Code hallucinations are a critical challenge in AI-assisted software development. Understanding their mechanisms, taxonomies, and mitigation strategies is essential for safely deploying LLM-generated code in production systems.

Hallucinations are Unavoidable but Manageable: LLMs will always hallucinate at some rate. The goal is not to eliminate them entirely but to detect and mitigate them systematically.
Taxonomies Enable Precision: CodeHalu provides a framework for precise classification. Different hallucination types require different detection strategies.
Multiple Defense Layers Work Better Than One: Linting catches naming errors. Type checkers catch parameter mismatches. Tests catch logic errors. RAG fills gaps in the model's knowledge.
RAG and Iterative Refinement Are Powerful: Providing documentation and iteratively refining generated code substantially reduces hallucinations. De-Hallucinator improves API recall by 23.9–61.0%, and MARIN decreases MiHN by 67.5%.
Production Code Requires Human Oversight: For critical code (security, infrastructure, algorithms), human code review remains essential.

