What Are Hallucinations in Code?
In general NLP, a hallucination occurs when a model generates content that is nonsensical, unfaithful to the source, or factually incorrect. In code generation, hallucinations are especially dangerous: the output is often syntactically valid code that compiles or runs but behaves incorrectly. Three properties make code hallucinations worse than prose hallucinations. First, a verifiability gap: detecting them requires execution or deep expertise. Second, downstream impact: they cause runtime crashes, security vulnerabilities, and silent data corruption. Third, false confidence: syntactically correct code creates a strong illusion of correctness.
How LLMs Generate Code: The Autoregressive Process
Before understanding hallucinations, we must understand how LLMs produce code token by token. The autoregressive generation process, in which each token is conditioned on all previous tokens, is the root cause of many hallucination patterns. The model: (1) receives the prompt as input tokens, (2) computes a probability distribution over its vocabulary for the next token, (3) samples a token from that distribution, shaped by temperature, top-k, or top-p, and (4) appends the token to the context and repeats until it emits a stop token or reaches a length limit.
- Temperature
- Scales the logits before the softmax, controlling how peaked the distribution is. Low values (0.0–0.3) are near-deterministic (exactly greedy at 0): fewer hallucinations, but more repetitive output. High values (0.8–1.2) give more varied output that is more prone to hallucination.
- Top-k Sampling
- Samples only from the k most probable tokens, cutting off the low-probability tail where implausible continuations (including many hallucinated identifiers) are found.
- Top-p (Nucleus)
- Samples from the nucleus: the smallest set of tokens whose cumulative probability reaches the threshold p.
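To make the three sampling controls concrete, here is a minimal sketch of one decoding step in plain Python. The function name `sample_next_token`, the toy `logits` dictionary, and its values are illustrative assumptions, not taken from any particular model or library.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token from a {token: logit} dict (toy example)."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most probable tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:  # smallest set reaching the threshold
                break
        ranked = kept

    tokens = [tok for tok, _ in ranked]
    probs = [prob for _, prob in ranked]
    # random.choices renormalises the remaining weights for us.
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical next-token scores after the prefix "result = pd."
logits = {"merge": 4.0, "join": 3.0, "concat": 2.0, "smart_merge": 0.5}
```

With these toy scores, `top_k=2` can only ever return `merge` or `join`, so the fabricated `smart_merge` is never sampled, whereas plain sampling at a high temperature will occasionally pick it.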
CodeHalu Taxonomy: Four Categories
Tian et al. (AAAI 2025) propose a systematic taxonomy of code hallucinations based on execution-based verification, dividing them into four categories. Understanding this taxonomy enables precise classification and targeted mitigation.
| Category | Definition | Examples | Detection |
|---|---|---|---|
| 1. Mapping | The LLM fails to correctly map task description to code. Generated solution does not align with what was asked. | Asked to sort descending but code sorts ascending; asked to return indices but returns values. | Execution-based testing against specification |
| 2. Naming | References identifiers that do not exist in current scope. Includes undefined references and name confusion. | df.length() instead of len(df); list.filterBy() instead of list.filter(); np.normalize() (doesn't exist). | Static analysis, type checkers, AST analysis |
| 3. Resource | References external resources (packages, files, services) that do not exist or are inaccessible. | Importing sklearn.neural (correct: sklearn.neural_network); assuming config files exist. | Dependency resolver, static analysis |
| 4. Logic | Code is syntactically valid and uses real APIs correctly, but implements flawed algorithms. | Off-by-one in loops; early return in sorting; palindrome check that compares to s[:-1] instead of s[::-1]. | Execution-based testing with comprehensive test suites |
Deep Dive: API and Parameter Hallucinations
API hallucinations — where the model generates calls to methods, classes, or modules that do not exist in the target library — are the most common form of code hallucination. Parameter hallucinations, where the model calls a real method but with non-existent or incorrect arguments, are subtler but equally dangerous.
# API Hallucination Example 1: Non-existent method
import pandas as pd
df1 = pd.read_csv("data1.csv")
df2 = pd.read_csv("data2.csv")
# HALLUCINATED: pandas has no smart_merge() function
result = pd.smart_merge(df1, df2,
                        on="id",
                        strategy="fuzzy",  # not a real parameter
                        threshold=0.85)    # not a real parameter
# CORRECT approach:
# result = pd.merge(df1, df2, on="id")
# For fuzzy matching, use a library such as thefuzz or recordlinkage
Common patterns:
- Naming interpolation: combines real API fragments
- Cross-library confusion: mixing APIs from different ecosystems
- Version drift: calling removed APIs from older versions
- Fabricated submodules: inventing packages that sound real
Why it happens:
- Training data contains multiple library versions
- Model generalizes from similar APIs in different libraries
- Parameter names are plausible by analogy
- Method names follow predictable conventions, so interpolation works
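Both kinds of hallucination described above can often be caught by asking the installed library itself. The sketch below is one possible approach using only the standard library; `check_call` is a hypothetical helper name, and the demonstration uses `shutil.copy` rather than pandas so that it runs without third-party packages.

```python
import importlib
import inspect

def check_call(dotted_name, keyword_args=()):
    """Return a list of problems for a call like shutil.copy(..., follow_symlinks=...)."""
    module_name, _, attr_path = dotted_name.partition(".")
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return [f"module '{module_name}' cannot be imported"]

    # Walk the attribute path; submodules that were never imported are
    # reported as missing, which is a known limitation of this sketch.
    for part in filter(None, attr_path.split(".")):
        if not hasattr(obj, part):
            return [f"'{dotted_name}' does not exist ('{part}' not found)"]
        obj = getattr(obj, part)

    try:
        params = inspect.signature(obj).parameters.values()
    except (TypeError, ValueError):
        return []  # signature not introspectable (common for C builtins)

    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return []  # accepts **kwargs, so keyword names cannot be verified here

    by_keyword = {
        p.name for p in params
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                      inspect.Parameter.KEYWORD_ONLY)
    }
    return [f"'{dotted_name}' has no parameter '{kw}'"
            for kw in keyword_args if kw not in by_keyword]
```

For example, `check_call("shutil.copy", ["follow_symlinks"])` returns an empty list, while `check_call("shutil.copy", ["strategy"])` reports the invented parameter and `check_call("shutil.smart_copy")` reports the invented function. The check reflects whichever library version is installed, which is also how it can surface version drift.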
Spot the Hallucination: Interactive Examples
Practice identifying hallucinations by examining real code examples. The key is recognizing that hallucinations often look plausible — they follow naming conventions and patterns you would expect in real APIs.
# Challenge 1: Spot the hallucination
# Snippet A:
import random
data = [random.randint(1, 100) for _ in range(20)]
sorted_data = data.sortDescending() # HALLUCINATED
print(sorted_data)
# Snippet B:
import random
data = [random.randint(1, 100) for _ in range(20)]
sorted_data = sorted(data, reverse=True) # CORRECT
print(sorted_data)
// Challenge 2: String vs Array API confusion
// Hallucinated:
const text = "Hello, World!";
const reversed = text.reverse(); // WRONG: strings don't have .reverse() (throws TypeError)
// Correct:
const reversedCorrect = text.split('').reverse().join('');
# Challenge 3: Logic hallucination (all APIs are real)
def is_palindrome(s):
    s = s.lower().strip()
    return s == s[:-1]  # WRONG: should be s[::-1]
# This code:
# - Is syntactically valid (no errors)
# - Uses real Python APIs (slice notation)
# - But implements incorrect logic
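Because nothing in Challenge 3 is flagged by a linter or type checker, the only reliable way to expose it is to run it. The following sketch compares the hallucinated and corrected versions against a few hand-picked cases; the function names and the `CASES` list are illustrative assumptions.

```python
def is_palindrome_hallucinated(s):
    s = s.lower().strip()
    return s == s[:-1]   # bug from Challenge 3: s[:-1] only drops the last character

def is_palindrome_fixed(s):
    s = s.lower().strip()
    return s == s[::-1]  # s[::-1] is the reversed string

# (input, expected result) pairs chosen by hand for this example
CASES = [("racecar", True), ("Level", True), ("python", False), ("", True)]

def failing_cases(func):
    """Return the cases for which func disagrees with the expected result."""
    return [(text, expected) for text, expected in CASES if func(text) != expected]

print(failing_cases(is_palindrome_hallucinated))  # [('racecar', True), ('Level', True)]
print(failing_cases(is_palindrome_fixed))         # []
```

Note that the buggy version still passes the `"python"` and empty-string cases, which is why a test suite needs positive examples as well as negative ones.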
Why LLMs Hallucinate Code
Hallucinations are not random errors — they emerge systematically from how transformers learn and generate code.
- Training Data Gaps
- APIs that are infrequent in training data are more likely to be hallucinated. Less popular libraries have far fewer training examples, leading to fabricated method names.
- API Version Confusion
- Training data contains code using multiple versions of the same library. The model may blend APIs across incompatible versions.
- Distribution Shift
- When a prompt asks for code combining libraries in novel ways, the model falls back on statistical patterns rather than genuine understanding.
- Autoregressive Accumulation
- Each generated token conditions on all previous tokens. A small early error cascades through the rest of the function.
- Pattern Mimicry over Semantics
- LLMs are pattern-matching systems. They synthesize plausible-sounding method names that follow conventions but do not actually exist.
Detecting Hallucinations: Multiple Strategies
No single detection method catches all hallucinations. Effective detection requires multiple complementary strategies layered into a comprehensive pipeline.
Execution-based testing:
- Execute generated code against test suites
- Compare actual outputs to expected results
- Catches all categories affecting behavior
- Requires comprehensive test suites
- Cannot detect dead code or style issues
Static analysis:
- Parse code to AST, check for undefined references
- Type checkers (mypy, TypeScript) find param mismatches
- Linters catch unused imports
- Excellent for naming and resource hallucinations
- Misses logic errors and valid but wrong APIs
| Method | Naming | Resource | Logic | API params | Cost |
|---|---|---|---|---|---|
| Linting (ruff, ESLint) | ✓ Excellent | ✓ Good | ✗ No | ✗ No | Very low |
| Type Checking (mypy, tsc) | ✓ Good | ✓ Good | ✗ No | ✓ Excellent | Low |
| Test Execution | ✓ Good | ✓ Good | ✓ Good | ✓ Good | Medium–High |
| Static Analysis (Semgrep) | ✓ Good | ✓ Good | ✗ No | ✓ Fair | Medium |
| AST-Based Analysis | ✓ Excellent | ✓ Fair | ✗ No | ✓ Fair | Low |
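
As an illustration of the AST-based row, the sketch below parses a source string, records which modules it imports, and reports `module.attribute` references that the imported module does not provide. It is a simplified stand-in for what real linters do: `find_missing_module_attributes` is a hypothetical name, only plain `import x` and `import x as y` statements are handled, and the modules under inspection are actually imported, so it should only be pointed at trusted packages.

```python
import ast
import importlib

def find_missing_module_attributes(source):
    """Report (line, message) pairs for alias.attr where attr is not in the module."""
    tree = ast.parse(source)

    # Map each local name to the module it is bound to.
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for item in node.names:
                if item.asname:
                    aliases[item.asname] = item.name   # import a.b as c
                else:
                    root = item.name.split(".")[0]     # import a.b binds "a"
                    aliases[root] = root

    problems = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in aliases):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                problems.append((node.lineno, f"cannot import '{module_name}'"))
                continue
            if not hasattr(module, node.attr):
                problems.append(
                    (node.lineno, f"'{module_name}' has no attribute '{node.attr}'"))
    return sorted(problems)

SAMPLE = '''
import math
import json as js

area = math.pi * math.square(3)
text = js.dumps({"area": area})
data = js.parse(text)
'''

for line, message in find_missing_module_attributes(SAMPLE):
    print(f"line {line}: {message}")
```

On this sample the scan flags `math.square` and `js.parse` (the real function is `json.loads`) while accepting `math.pi` and `js.dumps`. As the table indicates, it says nothing about whether the surviving calls implement the right logic.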
RAG for Hallucination Reduction
Retrieval-Augmented Generation (RAG) is a foundational technique for reducing hallucinations. Instead of relying on the model's parametric memory alone, RAG retrieves relevant documentation at query time to ground the generation in factual data.
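The retrieve-then-prompt idea can be sketched in a few lines. Everything here is an assumption made for illustration: the `DOCS` entries are abbreviated paraphrases of real signatures, the keyword-overlap scoring stands in for the embedding search a production system would more likely use, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import re

# Hypothetical, abbreviated documentation snippets; a real system would
# index the library's actual documentation.
DOCS = {
    "pandas.merge": "pandas.merge(left, right, how='inner', on=None): "
                    "database-style join of two DataFrames.",
    "pandas.concat": "pandas.concat(objs, axis=0): concatenate DataFrames along an axis.",
    "sorted": "sorted(iterable, key=None, reverse=False): return a new sorted list.",
}

def tokenize(text):
    """Lowercase the text and split it into a set of word-like terms."""
    return set(re.findall(r"[a-z_]+", text.lower()))

def retrieve(query, docs, top_n=2):
    """Rank documentation entries by how many terms they share with the query."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(body)), name) for name, body in docs.items()]
    ranked = sorted(scored, reverse=True)
    return [name for score, name in ranked[:top_n] if score > 0]

def build_prompt(task, docs):
    """Prepend the retrieved documentation to the task description."""
    context = "\n".join(f"- {docs[name]}" for name in retrieve(task, docs))
    return f"Use only the APIs documented below.\n{context}\n\nTask: {task}"

print(build_prompt("merge two DataFrames on the id column", DOCS))
```

For this query the `pandas.merge` entry ranks first, so the real signature sits in the context window right where the model would otherwise have to recall it from parametric memory.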
Mitigation: De-Hallucinator's Iterative Approach
Eghbali & Pradel (2024) introduce De-Hallucinator, an iterative grounding approach that retrieves relevant API documentation to reduce hallucinations. Unlike single-pass RAG, De-Hallucinator iterates — extracting APIs from generated code, retrieving documentation for those specific APIs, and re-querying the model with verified documentation.
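The loop described above can be sketched as follows. This is not the authors' implementation: the model is replaced by a stub, retrieval is approximated with `difflib` name similarity, and `API_DOCS`, `stub_model`, and the other names are hypothetical. It only illustrates the extract, retrieve, re-query cycle.

```python
import difflib
import re

API_DOCS = {  # hypothetical, hand-written reference entries
    "merge": "pd.merge(left, right, on=None, how='inner')",
    "read_csv": "pd.read_csv(filepath_or_buffer)",
}

def extract_api_calls(code):
    """Collect names that are called as pd.<name>(...)."""
    return set(re.findall(r"\bpd\.(\w+)\s*\(", code))

def retrieve_docs(api_names, docs):
    """For each called name, fetch the entry of the closest known API, if any."""
    found = []
    for name in sorted(api_names):
        for match in difflib.get_close_matches(name, list(docs), n=1, cutoff=0.5):
            found.append(docs[match])
    return found

def iterative_grounding(task, model, docs, max_rounds=3):
    """Generate, look up the APIs that were used, and re-query with their docs."""
    prompt = task
    code = model(prompt)
    for _ in range(max_rounds):
        references = retrieve_docs(extract_api_calls(code), docs)
        new_prompt = "API reference:\n" + "\n".join(references) + "\n\n" + task
        if new_prompt == prompt:
            break  # no new grounding information, so stop iterating
        prompt = new_prompt
        code = model(prompt)
    return code

def stub_model(prompt):
    """Stand-in for an LLM call: hallucinates unless pd.merge is documented in the prompt."""
    if "pd.merge(" in prompt:
        return 'result = pd.merge(df1, df2, on="id")'
    return 'result = pd.smart_merge(df1, df2, on="id")'

print(iterative_grounding("Merge df1 and df2 on id", stub_model, API_DOCS))
```

In the first round the stub produces `pd.smart_merge`; the name is close enough to `merge` for its reference entry to be retrieved, and the second query returns the real call. The third pass finds nothing new and the loop stops.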
Prompt Engineering as Mitigation
Careful prompt design is one of the simplest and most effective ways to reduce hallucinations. These techniques require no infrastructure changes — just better prompts.
- Be Explicit About Constraints
- Tell the model exactly what it can and cannot use: "Only use Python standard library functions. Do not import any third-party packages."
- Provide API Documentation in Context
- Paste the relevant API docs directly into the prompt. The model is far less likely to hallucinate APIs when it has the real docs in its context window.
- Ask for Citations and Reasoning
- Prompt: "Cite which library each method comes from" or "Explain why you chose this approach before writing code."
- Add Negative Examples
- "Do NOT use deprecated methods. Do NOT use eval(). Do NOT concatenate strings into SQL queries."
- Request Self-Verification
- "After generating the code, verify that all API methods you used actually exist in the specified library version."
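The five techniques above combine naturally into a single prompt template. The helper below is a hypothetical sketch of one way to assemble it; the wording of each section is an assumption and should be adapted to the model in use.

```python
def build_guarded_prompt(task, allowed_libraries, api_docs=(), forbidden=()):
    """Assemble a prompt with explicit constraints, docs, negatives, and a self-check."""
    lines = [
        "Constraints:",
        f"- Use only these libraries: {', '.join(allowed_libraries)}.",
        "- Do not import anything else.",
    ]
    # Negative examples: things the model must avoid.
    lines += [f"- Do NOT use {item}." for item in forbidden]

    # API documentation pasted directly into the context.
    if api_docs:
        lines.append("")
        lines.append("Reference documentation:")
        lines += [f"- {entry}" for entry in api_docs]

    lines.append("")
    lines.append(f"Task: {task}")
    # Citation and self-verification requests.
    lines.append("Before the code, state which library each call comes from.")
    lines.append("After the code, list every API you used and confirm that it "
                 "exists in the allowed libraries.")
    return "\n".join(lines)

prompt = build_guarded_prompt(
    task="Load users.csv and return the ten most recent rows.",
    allowed_libraries=["csv", "datetime"],
    api_docs=["csv.DictReader(f): iterate over rows as dictionaries."],
    forbidden=["eval()", "deprecated methods"],
)
print(prompt)
```

Keeping the template in code rather than retyping it makes the constraints consistent across requests and easy to review.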
Tool-Augmented Generation: Real-Time Verification
Tool-augmented generation gives LLMs access to external tools that verify code in real time, dramatically reducing hallucinations by providing ground truth feedback during the generation process.
| Tool | Catches | Misses | Speed |
|---|---|---|---|
| Linter (ruff, ESLint) | Undefined names, unused imports, style | Logic errors, valid but wrong APIs | Instant |
| Type Checker (mypy, tsc) | Wrong param types, invalid signatures | Runtime behavior, algorithm correctness | Instant |
| Test Runner (pytest, JUnit) | Logic errors, edge cases, wrong outputs | Requires good test suite; cannot catch untested paths | Seconds–minutes |
| Static Analyzer (Semgrep) | Security vulnerabilities, anti-patterns | Novel hallucinations not in rule database | Seconds |
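
A generate, verify, retry loop along these lines might look like the sketch below. The generator is a stub standing in for an LLM call, and `run_checks`, `generate_with_feedback`, and `TESTS` are hypothetical names. The checks are limited to a syntax parse and a small behavioural test list; a real setup would add a linter and a type checker, and would execute candidates inside a sandbox.

```python
import ast

def run_checks(source, tests):
    """Return diagnostics: a syntax error first, otherwise failing behavioural tests."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    namespace = {}
    try:
        exec(source, namespace)  # WARNING: run untrusted code only inside a sandbox
    except Exception as exc:
        return [f"{type(exc).__name__} while loading the code: {exc}"]

    diagnostics = []
    for func_name, args, expected in tests:
        try:
            actual = namespace[func_name](*args)
        except Exception as exc:
            diagnostics.append(f"{func_name}{args} raised {type(exc).__name__}: {exc}")
            continue
        if actual != expected:
            diagnostics.append(f"{func_name}{args} returned {actual!r}, expected {expected!r}")
    return diagnostics

def generate_with_feedback(generate, tests, max_attempts=3):
    """Call generate(feedback) until the checks pass or attempts run out."""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        source = generate(feedback)
        feedback = run_checks(source, tests)
        if not feedback:
            return source, attempt
    return None, max_attempts

def stub_generate(feedback):
    """Stand-in for an LLM: hallucinates a list method until it sees an error report."""
    if not feedback:
        return "def top_three(xs):\n    return xs.sortDescending()[:3]\n"
    return "def top_three(xs):\n    return sorted(xs, reverse=True)[:3]\n"

TESTS = [("top_three", ([5, 1, 9, 7],), [9, 7, 5])]
source, attempts = generate_with_feedback(stub_generate, TESTS)
print(attempts)  # 2
```

The first candidate fails with an `AttributeError` because lists have no `sortDescending` method; that diagnostic is passed back to the generator, and the second candidate passes.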
Real-World Hallucinations: Production Case Studies
Code hallucinations are not hypothetical: the cases below all occurred in production settings. Each one shows why code output from LLMs must be treated as untrusted input that requires verification.
- Copilot Suggests Non-Existent npm Package
- GitHub Copilot suggested require('express-validator-sanitizer'), a package that does not exist. A researcher later registered the hallucinated name and found that other developers had already attempted to install it, demonstrating real-world slopsquatting risk.
- ChatGPT Generates Deprecated TensorFlow Code
- ChatGPT frequently generated TensorFlow 1.x session-based code for users on TensorFlow 2.x. The pattern tf.Session().run() was removed in TF 2.0, but the model generates it confidently.
- Hallucinated Django ORM Methods
- LLM-generated Django code frequently hallucinates QuerySet methods: .filter_active(), .get_or_none(). These follow Django's naming conventions perfectly but do not exist.
- Incorrect Cryptographic Patterns
- Security audits found LLM-generated code frequently uses hashlib.md5(password) for password hashing, an insecure practice. The correct approach is bcrypt or argon2.
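For the cryptographic case, the sketch below shows what a salted, deliberately slow password hash looks like. To stay self-contained it swaps in PBKDF2 from Python's standard library rather than the bcrypt or argon2 packages recommended above, which are third-party. The function names and the iteration count are illustrative assumptions, not a vetted security configuration.

```python
import hashlib
import hmac
import os

def hash_password(password, *, iterations=600_000):
    """Derive a salted PBKDF2-HMAC-SHA256 digest (stdlib stand-in for bcrypt/argon2)."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest

def verify_password(password, salt, iterations, expected_digest):
    """Recompute the digest and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected_digest)

salt, iterations, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, iterations, digest))  # True
print(verify_password("guess", salt, iterations, digest))                         # False
```

Unlike a bare MD5 call, this stores a per-password salt and makes each guess expensive. In Python 3, `hashlib.md5` also raises `TypeError` when given a `str` instead of bytes, so the hallucinated pattern is often broken as well as insecure.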
Building Hallucination-Resistant Workflows
A practical framework you can apply to production systems and course projects. This workflow layers multiple mitigation strategies for increasing levels of protection.
- Step 1: Use RAG for Grounding
- Provide relevant documentation and code context in your prompts.
- Step 2: Set Low Temperature
- Use T=0.0–0.2 for code you plan to ship.
- Step 3: Add Explicit Constraints
- Specify which libraries, versions, and patterns to use. Include negative constraints.
- Step 4: Run Linting and Type Checking
- Automatically catch naming and resource hallucinations with static analysis.
- Step 5: Generate and Run Tests
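- Ask the LLM to generate tests for its own code, then execute them. A failing test signals a possible hallucination, either in the code or in the test itself; because model-written tests can share the code's misconceptions, supplement them with tests derived from the specification.
- Step 6: Human Review for Critical Logic
- Logic hallucinations cannot be reliably caught by automated tools alone, since tests only cover the paths someone thought to test.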
- Step 7: Monitor in Production
- Log LLM outputs, track error rates, set up alerts for unexpected failures.
Measuring and Evaluating Hallucinations
Quantifying code hallucinations requires specialized benchmarks and metrics. The CodeHaluEval benchmark provides a rigorous framework.
- MiHN (Micro Hallucination Number)
- Total count of individual hallucinated elements across all generated samples. Lower is better.
- MaHR (Macro Hallucination Rate)
- Fraction of generated code samples that contain at least one hallucination. Ranges from 0 to 1.0.
- Edit Distance
- How many edits are needed to fix hallucinated code to match correct code.
- API Recall
- Fraction of required API calls that are correctly included in generated code.
- Test Pass Rate
- Percentage of generated samples that pass the specification's test suite. This is the most direct measure of functional correctness, though only as strong as the test suite behind it.
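Three of these metrics are simple aggregates and can be computed directly from per-sample records. The sketch below follows the definitions as worded in this section; the `Sample` record, its fields, and the example data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    hallucinated_elements: int  # hallucinated APIs/identifiers found in this sample
    passed_tests: bool          # did the sample pass the specification's tests?

def micro_hallucination_number(samples):
    """MiHN as defined above: total hallucinated elements across all samples."""
    return sum(s.hallucinated_elements for s in samples)

def macro_hallucination_rate(samples):
    """MaHR as defined above: fraction of samples with at least one hallucination."""
    return sum(1 for s in samples if s.hallucinated_elements > 0) / len(samples)

def pass_rate(samples):
    """Fraction of samples that pass the test suite."""
    return sum(1 for s in samples if s.passed_tests) / len(samples)

# Made-up evaluation results for four generated samples
results = [Sample(0, True), Sample(2, False), Sample(1, False), Sample(0, True)]

print(micro_hallucination_number(results))  # 3
print(macro_hallucination_rate(results))    # 0.5
print(pass_rate(results))                   # 0.5
```

The two hallucination metrics answer different questions: MiHN grows with every extra fabricated call, while MaHR only counts how many samples are affected at all.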
Summary and Key Takeaways
Code hallucinations are a critical challenge in AI-assisted software development. Understanding their mechanisms, taxonomies, and mitigation strategies is essential for safely deploying LLM-generated code in production systems.
- Hallucinations are Unavoidable but Manageable
- LLMs will always hallucinate at some rate. The goal is not to eliminate them entirely but to detect and mitigate them systematically.
- Taxonomies Enable Precision
- CodeHalu provides a framework for precise classification. Different hallucination types require different detection strategies.
- Multiple Defense Layers Work Better Than One
- Linting catches naming errors. Type checkers catch parameters. Tests catch logic errors. RAG prevents information gaps.
- RAG and Iterative Refinement Are Powerful
- Providing documentation and repeatedly refining generated code substantially reduces hallucinations. De-Hallucinator and a related framework, MARIN, report reductions in the range of 50–70%.
- Production Code Requires Human Oversight
- For critical code (security, infrastructure, algorithms), human code review remains essential.
Course materials
Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.
No lab artifact for this module
Hallucinations is taught as a red-team workshop on the cohort's own models from Labs A–C. No standalone notebook.