Genetic Algorithms & LLMs — CSCI 455 / 555

Introduction: Evolution Meets AI

Genetic algorithms (GAs) are optimization techniques inspired by Charles Darwin's theory of natural selection. Instead of computing an exact solution, GAs evolve a population of candidate solutions over successive generations, guided by a fitness function. When combined with Large Language Models, this creates a powerful framework for automated software engineering: the GA evolves prompts and specifications, the LLM converts them to code, and a fitness function measures quality.

Why Genetic Algorithms?

GAs are uniquely suited to problems with enormous search spaces where gradients don't exist or are expensive to compute. Unlike random search, they learn from previous evaluations.

Why Combine with LLMs?

LLMs excel at generating code from specifications. GAs excel at searching for the best specifications. Together, they automate both the design and implementation of software artifacts.

The Evaluation Challenge

The critical barrier: functional correctness is the gold standard for evaluating code, but running full test suites is expensive (minutes to hours per candidate). Fitness approximation provides orders-of-magnitude speedup.

Core Concepts: Building Blocks of Evolution

Every genetic algorithm operates with the same fundamental vocabulary. A population is a collection of candidate solutions called individuals or chromosomes. Each chromosome is a sequence of genes. The fitness function evaluates the quality of each individual; higher fitness means a better solution. A generation is one complete cycle of evaluation, selection, crossover, and mutation.

Genotype vs. Phenotype: The genotype is the encoded representation. The phenotype is the decoded solution it represents. Genetic Algorithms (GA) optimize parameters; Genetic Programming (GP) evolves programs as tree structures.

Population: A collection of candidate solutions. Typical sizes range from 20 to 200.
Gene: The smallest unit of a solution. Each chromosome is a sequence of genes.
Chromosome: A complete candidate solution encoded as a string of genes.
Fitness Function: Evaluates the quality of each individual. This is the only problem-specific component.
Generation: One complete cycle of evaluation, selection, crossover, and mutation.

The GA Lifecycle

A genetic algorithm follows a cyclical process that mimics natural evolution.

Initialize

Create a random population of N chromosomes.

Evaluate

Compute fitness of every individual. Often the most expensive step.

Select

Choose parent pairs based on fitness. Fitter individuals have higher probability of reproducing.

Crossover

Combine two parents to create offspring. Swap portions at crossover points.

Mutate

Randomly alter genes in offspring with small probability. Prevents premature convergence.

Terminate?

Check stopping criteria: max generations, fitness threshold met, or population converged.

Selection: Biasing Reproduction Toward Fitness

Selection determines which individuals get to reproduce. The goal is to bias reproduction toward fitter individuals while maintaining diversity to avoid premature convergence.

Strategy	Idea	Pressure	Diversity	When to use
Roulette Wheel	Probability proportional to fitness	High	Low	Uniform fitness distributions
Tournament (k=2)	Pick k randomly, keep the fittest	Moderate	Good	Most practical applications
Rank-Based	Assign probability based on rank	Controlled	Good	Fitness varies by orders of magnitude
Elitism	Copy top k unchanged	Guaranteed	Reduces slightly	Use alongside another method

Crossover: Combining Building Blocks

Crossover (recombination) combines genetic material from two parents to create offspring. It is the primary driver of exploration in the search space.

text

// Single-Point Crossover: cut at one position, swap tails
Parent A:  [1, 0, 1, 1, 0, 1]
Parent B:  [0, 1, 0, 0, 1, 1]
                     ^-- crossover point at 3
Child 1:   [1, 0, 1, 0, 1, 1]  // A's head + B's tail
Child 2:   [0, 1, 0, 1, 0, 1]  // B's head + A's tail

// Two-Point Crossover: two cut points, middle segment swaps
Parent A:  [1, 0, 1, 1, 0, 1]
Parent B:  [0, 1, 0, 0, 1, 1]
                ^   ^-- cuts at 2 and 4
Child:     [1, 0, 0, 0, 0, 1]  // A's edges + B's middle

// Uniform Crossover: coin flip per gene
Parent A:  [1, 0, 1, 1, 0, 1]
Parent B:  [0, 1, 0, 0, 1, 1]
Coin:       A   B   A   B   B   A
Child:     [1, 1, 1, 0, 1, 1]

For code and prompts, crossover must respect structure. Sentence-level crossover on prompts and function-level crossover on code templates produce valid offspring far more often than character-level cuts. This is why most GA+LLM work evolves natural language prompts rather than code tokens directly.

Mutation: Creating New Genetic Material

Mutation introduces random changes to individual genes. While crossover combines existing solutions, mutation creates genuinely new genetic material — essential for escaping local optima.

Bit-Flip Mutation

For binary chromosomes: flip a random bit. Each gene has probability 1/L of being flipped.

Gaussian Mutation

For real-valued chromosomes: add small random noise drawn from a Gaussian. Used in continuous optimization.

LLM-Assisted Mutation

For code/prompts: ask an LLM to suggest a mutation. Example: "Improve this prompt to better handle edge cases." The most powerful mutation type for code.

Fitness Landscapes: Understanding the Search Space

A fitness landscape conceptually maps every possible solution to its fitness value. Smooth, unimodal landscapes are easy. Rugged, multi-modal landscapes are hard. Code fitness landscapes are extremely rugged — a single character change can turn a working program into one that crashes.

The core tension: exploration (mutation, large populations) searches new regions. Exploitation (selection, crossover) refines known good solutions.

Parameter Tuning

GA behavior is controlled by several key parameters. Choosing the right values is critical — poor settings lead to premature convergence or wasted computation.

Parameter	Typical range	Too low	Too high	Recommendation
Population Size	20–200	Low diversity	Slow evaluation	Start with 50–100
Crossover Rate	0.6–0.9	Limited recombination	Disrupts good solutions	0.8 is a safe default
Mutation Rate	0.01–0.05	Cannot escape local optima	Becomes random search	1/L is classic
Generations	50–500	May not converge	Diminishing returns	Use convergence detection
Tournament k	2–7	Near-random selection	Loss of diversity	k=2 or k=3

The Evaluation Bottleneck

The critical challenge in GA for code generation: how do you know if generated code is actually good? Functional correctness requires compiling the code and executing it against a test suite. With 500 individuals × 100 generations = 50,000 evaluations, and each test execution taking 2 hours, that's 100,000 hours = 11.4 years of computation.

~0.1s

CodeBLEU per eval — text similarity, does NOT tell you if code works

minutes-hours

Test execution — gold standard, computationally infeasible at scale

~0.5s

LLM predictor — ~95% accuracy, the practical middle ground

Fitness Approximation: The Key Insight

What if we could predict whether code passes tests without actually running them? Fitness approximation replaces expensive test execution with a fast predictor trained to estimate correctness. The predictor analyzes code across multiple dimensions: structural features (AST depth, branches, loops), specification alignment, pattern matching, and semantic similarity to known-good references.

text

// Exact fitness evaluation: 11.4 YEARS
Evaluations: 500 individuals × 100 gens = 50,000
Time per eval: 2 hours for test execution
Total: 100,000 hours = 11.4 years

// Approximate fitness: ~7 HOURS
Evaluations: 500 individuals × 100 gens = 50,000
Time per eval: 0.5 seconds for LLM predictor
Total: 25,000 seconds ≈ 7 hours

// Speedup: 5 orders of magnitude (11.4 years / 7 hours)

GA + LLM: The Combined Architecture

The breakthrough idea: Use LLMs as a code generation function and genetic algorithms to search for the best inputs (prompts, specifications) to that function.

GA Evolves Prompts

Population of natural language prompts describing code to generate

LLM Generates Code

Each prompt produces a code implementation via an LLM API call

Tests Evaluate Code

Compile and run the code against a test suite

Fitness = Test Results

Measure pass/fail, coverage, or other quality metrics

GA Evolves Better Prompts

Selection, crossover, mutation on the best-performing prompts

The key innovation: Expensive test execution happens ONCE at the end, on a handful of final candidates. During evolution, each new individual costs only an LLM call for generation plus an LLM predictor call for fitness estimation — both seconds, not hours.

Application: Evolving Test Descriptions

One of the most promising applications of GA+LLM is automated test generation. Instead of manually writing test cases, we evolve natural language descriptions that an LLM converts into executable tests. The fitness is code coverage achieved by the generated tests.

Generate Initial Descriptions

Create N random test description texts.

LLM → Executable Tests

Convert each description into JUnit/pytest.

Execute & Measure Coverage

Compile and run each test; measure line/branch/mutation coverage.

Selection

Tournament selection on descriptions whose tests achieved highest coverage.

Crossover & Mutation

Combine clauses from two descriptions; rephrase; add constraints.

Repeat for N Generations

Continue until coverage plateaus.

Honest Limitations and Open Challenges

GA+LLM is a promising research direction, not a production-ready solution. Intellectual honesty requires acknowledging significant open challenges.

Fitness Landscape Mismatch

The LLM predictor may not capture the true fitness landscape. Candidates that look correct may fail real tests.

Representation Problem

How do you represent code as a chromosome? Naive string-level operations produce syntactically invalid offspring.

Deceptive Fitness

If the predictor consistently gets the same 5% wrong, the GA evolves solutions that game the predictor rather than solve the actual problem.

Constraint Modeling

Real SE problems have complex constraints (performance, memory, concurrency, backward compatibility). Modeling all of these in one fitness function is extremely difficult.

The combinatorial explosion: A 50-token prompt with a vocabulary of 50,000 tokens has 50,000^50 possible configurations. Economic constraints: At $0.01 per LLM call, a population of 500 across 100 generations costs $1,000 in API fees alone.

Design Checklist: Building a GA for Code

1. Define the Objective Precisely: What exactly does 'good code' mean for your task? Write it down as a measurable criterion.
2. Choose the Representation: Will you evolve prompts, code templates, parameters, or raw code? Prompts are almost always the best starting point.
3. Design the Fitness Function: Prefer fine-grained, continuous fitness over binary pass/fail. Include partial credit.
4. Select Genetic Operators: Choose crossover and mutation that respect your representation's structure.
5. Set Initial Parameters: Start with: population=50, tournament k=2, crossover=0.8, mutation=1/L, elitism=2.
6. Establish a Baseline: Before running the GA, measure what random search achieves with the same budget.
7. Monitor Diversity: Track population diversity alongside fitness. If diversity drops to near-zero while fitness stalls, increase mutation or population size.
8. Validate Final Output: Always run actual tests on the top-K solutions. Never trust approximate fitness for the final result.

Research Frontiers

Multi-Objective Optimization

Instead of collapsing correctness, efficiency, and readability into one score, optimize them simultaneously. Maintain a Pareto front.

Neural-Guided Search

Use a neural network to bias the GA's mutation operator toward likely-useful changes.

Surrogate-Assisted Evolution

Train increasingly accurate fitness predictors as the GA runs. Each generation provides new training data.

Coevolution

Evolve test cases alongside code solutions in parallel populations. Code evolves to pass tests; tests evolve to break code.

Summary & Key Takeaways

GAs Evolve Solutions

Maintain a population of candidates, using selection, crossover, and mutation to iteratively improve without requiring gradients.

Fitness Drives Everything

The fitness function is the only problem-specific component. For SE, fitness can be test pass rate, coverage, or compilation success.

Balance Explore vs Exploit

Mutation and large populations drive exploration; selection and crossover exploit known good solutions.

GA + LLM = Synergy

GAs evolve prompts and specifications; LLMs convert them to code and tests. This combination automates both design and implementation.

Fitness Approximation

When test execution is expensive, approximate fitness with an LLM predictor. Accept small error for orders-of-magnitude speedup.

Honest Assessment

GA+LLM is promising research, not production-ready. Fitness landscape mismatch, deceptive fitness, and representation problems remain open challenges.