mastropaolo.dev playbook Back to playbook

Module GA · Optimization

Genetic Algorithms & LLMs

Discover how genetic algorithms evolve prompts, test cases, and code specifications to dramatically improve AI-generated software artifacts.

Reading time~45 min read InstructorDr. Mastropaolo CohortSpring 2026

Introduction: Evolution Meets AI

Genetic algorithms (GAs) are optimization techniques inspired by Charles Darwin's theory of natural selection. Instead of computing an exact solution, GAs evolve a population of candidate solutions over successive generations, guided by a fitness function. When combined with Large Language Models, this creates a powerful framework for automated software engineering: the GA evolves prompts and specifications, the LLM converts them to code, and a fitness function measures quality.

Why Genetic Algorithms?

GAs are uniquely suited to problems with enormous search spaces where gradients don't exist or are expensive to compute. Unlike random search, they learn from previous evaluations.

Why Combine with LLMs?

LLMs excel at generating code from specifications. GAs excel at searching for the best specifications. Together, they automate both the design and implementation of software artifacts.

The Evaluation Challenge

The critical barrier: functional correctness is the gold standard for evaluating code, but running full test suites is expensive (minutes to hours per candidate). Fitness approximation provides orders-of-magnitude speedup.

Core Concepts: Building Blocks of Evolution

Every genetic algorithm operates with the same fundamental vocabulary. A population is a collection of candidate solutions called individuals or chromosomes. Each chromosome is a sequence of genes. The fitness function evaluates the quality of each individual; higher fitness means a better solution. A generation is one complete cycle of evaluation, selection, crossover, and mutation.

Genotype vs. Phenotype: The genotype is the encoded representation. The phenotype is the decoded solution it represents. Genetic Algorithms (GA) optimize parameters; Genetic Programming (GP) evolves programs as tree structures.

Population
A collection of candidate solutions. Typical sizes range from 20 to 200.
Gene
The smallest unit of a solution. Each chromosome is a sequence of genes.
Chromosome
A complete candidate solution encoded as a string of genes.
Fitness Function
Evaluates the quality of each individual. This is the only problem-specific component.
Generation
One complete cycle of evaluation, selection, crossover, and mutation.

The GA Lifecycle

A genetic algorithm follows a cyclical process that mimics natural evolution.

Initialize
Create a random population of N chromosomes.
Evaluate
Compute fitness of every individual. Often the most expensive step.
Select
Choose parent pairs based on fitness. Fitter individuals have higher probability of reproducing.
Crossover
Combine two parents to create offspring. Swap portions at crossover points.
Mutate
Randomly alter genes in offspring with small probability. Prevents premature convergence.
Terminate?
Check stopping criteria: max generations, fitness threshold met, or population converged.

Selection: Biasing Reproduction Toward Fitness

Selection determines which individuals get to reproduce. The goal is to bias reproduction toward fitter individuals while maintaining diversity to avoid premature convergence.

StrategyIdeaPressureDiversityWhen to use
Roulette WheelProbability proportional to fitnessHighLowUniform fitness distributions
Tournament (k=2)Pick k randomly, keep the fittestModerateGoodMost practical applications
Rank-BasedAssign probability based on rankControlledGoodFitness varies by orders of magnitude
ElitismCopy top k unchangedGuaranteedReduces slightlyUse alongside another method

Crossover: Combining Building Blocks

Crossover (recombination) combines genetic material from two parents to create offspring. It is the primary driver of exploration in the search space.

text
// Single-Point Crossover: cut at one position, swap tails
Parent A:  [1, 0, 1, 1, 0, 1]
Parent B:  [0, 1, 0, 0, 1, 1]
                     ^-- crossover point at 3
Child 1:   [1, 0, 1, 0, 1, 1]  // A's head + B's tail
Child 2:   [0, 1, 0, 1, 0, 1]  // B's head + A's tail

// Two-Point Crossover: two cut points, middle segment swaps
Parent A:  [1, 0, 1, 1, 0, 1]
Parent B:  [0, 1, 0, 0, 1, 1]
                ^   ^-- cuts at 2 and 4
Child:     [1, 0, 0, 0, 0, 1]  // A's edges + B's middle

// Uniform Crossover: coin flip per gene
Parent A:  [1, 0, 1, 1, 0, 1]
Parent B:  [0, 1, 0, 0, 1, 1]
Coin:       A   B   A   B   B   A
Child:     [1, 1, 1, 0, 1, 1]

For code and prompts, crossover must respect structure. Sentence-level crossover on prompts and function-level crossover on code templates produce valid offspring far more often than character-level cuts. This is why most GA+LLM work evolves natural language prompts rather than code tokens directly.

Mutation: Creating New Genetic Material

Mutation introduces random changes to individual genes. While crossover combines existing solutions, mutation creates genuinely new genetic material — essential for escaping local optima.

Bit-Flip Mutation

For binary chromosomes: flip a random bit. Each gene has probability 1/L of being flipped.

Gaussian Mutation

For real-valued chromosomes: add small random noise drawn from a Gaussian. Used in continuous optimization.

LLM-Assisted Mutation

For code/prompts: ask an LLM to suggest a mutation. Example: "Improve this prompt to better handle edge cases." The most powerful mutation type for code.

Fitness Landscapes: Understanding the Search Space

A fitness landscape conceptually maps every possible solution to its fitness value. Smooth, unimodal landscapes are easy. Rugged, multi-modal landscapes are hard. Code fitness landscapes are extremely rugged — a single character change can turn a working program into one that crashes.

The core tension: exploration (mutation, large populations) searches new regions. Exploitation (selection, crossover) refines known good solutions.

Parameter Tuning

GA behavior is controlled by several key parameters. Choosing the right values is critical — poor settings lead to premature convergence or wasted computation.

ParameterTypical rangeToo lowToo highRecommendation
Population Size20–200Low diversitySlow evaluationStart with 50–100
Crossover Rate0.6–0.9Limited recombinationDisrupts good solutions0.8 is a safe default
Mutation Rate0.01–0.05Cannot escape local optimaBecomes random search1/L is classic
Generations50–500May not convergeDiminishing returnsUse convergence detection
Tournament k2–7Near-random selectionLoss of diversityk=2 or k=3

The Evaluation Bottleneck

The critical challenge in GA for code generation: how do you know if generated code is actually good? Functional correctness requires compiling the code and executing it against a test suite. With 500 individuals × 100 generations = 50,000 evaluations, and each test execution taking 2 hours, that's 100,000 hours = 11.4 years of computation.

~0.1s
CodeBLEU per eval — text similarity, does NOT tell you if code works
minutes-hours
Test execution — gold standard, computationally infeasible at scale
~0.5s
LLM predictor — ~95% accuracy, the practical middle ground

Fitness Approximation: The Key Insight

What if we could predict whether code passes tests without actually running them? Fitness approximation replaces expensive test execution with a fast predictor trained to estimate correctness. The predictor analyzes code across multiple dimensions: structural features (AST depth, branches, loops), specification alignment, pattern matching, and semantic similarity to known-good references.

text
// Exact fitness evaluation: 11.4 YEARS
Evaluations: 500 individuals × 100 gens = 50,000
Time per eval: 2 hours for test execution
Total: 100,000 hours = 11.4 years

// Approximate fitness: ~7 HOURS
Evaluations: 500 individuals × 100 gens = 50,000
Time per eval: 0.5 seconds for LLM predictor
Total: 25,000 seconds ≈ 7 hours

// Speedup: 5 orders of magnitude (11.4 years / 7 hours)

GA + LLM: The Combined Architecture

The breakthrough idea: Use LLMs as a code generation function and genetic algorithms to search for the best inputs (prompts, specifications) to that function.

GA Evolves Prompts
Population of natural language prompts describing code to generate
LLM Generates Code
Each prompt produces a code implementation via an LLM API call
Tests Evaluate Code
Compile and run the code against a test suite
Fitness = Test Results
Measure pass/fail, coverage, or other quality metrics
GA Evolves Better Prompts
Selection, crossover, mutation on the best-performing prompts

The key innovation: Expensive test execution happens ONCE at the end, on a handful of final candidates. During evolution, each new individual costs only an LLM call for generation plus an LLM predictor call for fitness estimation — both seconds, not hours.

Application: Evolving Test Descriptions

One of the most promising applications of GA+LLM is automated test generation. Instead of manually writing test cases, we evolve natural language descriptions that an LLM converts into executable tests. The fitness is code coverage achieved by the generated tests.

Generate Initial Descriptions
Create N random test description texts.
LLM → Executable Tests
Convert each description into JUnit/pytest.
Execute & Measure Coverage
Compile and run each test; measure line/branch/mutation coverage.
Selection
Tournament selection on descriptions whose tests achieved highest coverage.
Crossover & Mutation
Combine clauses from two descriptions; rephrase; add constraints.
Repeat for N Generations
Continue until coverage plateaus.

Honest Limitations and Open Challenges

GA+LLM is a promising research direction, not a production-ready solution. Intellectual honesty requires acknowledging significant open challenges.

Fitness Landscape Mismatch

The LLM predictor may not capture the true fitness landscape. Candidates that look correct may fail real tests.

Representation Problem

How do you represent code as a chromosome? Naive string-level operations produce syntactically invalid offspring.

Deceptive Fitness

If the predictor consistently gets the same 5% wrong, the GA evolves solutions that game the predictor rather than solve the actual problem.

Constraint Modeling

Real SE problems have complex constraints (performance, memory, concurrency, backward compatibility). Modeling all of these in one fitness function is extremely difficult.

The combinatorial explosion: A 50-token prompt with a vocabulary of 50,000 tokens has 50,000^50 possible configurations. Economic constraints: At $0.01 per LLM call, a population of 500 across 100 generations costs $1,000 in API fees alone.

Design Checklist: Building a GA for Code

1. Define the Objective Precisely
What exactly does 'good code' mean for your task? Write it down as a measurable criterion.
2. Choose the Representation
Will you evolve prompts, code templates, parameters, or raw code? Prompts are almost always the best starting point.
3. Design the Fitness Function
Prefer fine-grained, continuous fitness over binary pass/fail. Include partial credit.
4. Select Genetic Operators
Choose crossover and mutation that respect your representation's structure.
5. Set Initial Parameters
Start with: population=50, tournament k=2, crossover=0.8, mutation=1/L, elitism=2.
6. Establish a Baseline
Before running the GA, measure what random search achieves with the same budget.
7. Monitor Diversity
Track population diversity alongside fitness. If diversity drops to near-zero while fitness stalls, increase mutation or population size.
8. Validate Final Output
Always run actual tests on the top-K solutions. Never trust approximate fitness for the final result.

Research Frontiers

Multi-Objective Optimization

Instead of collapsing correctness, efficiency, and readability into one score, optimize them simultaneously. Maintain a Pareto front.

Neural-Guided Search

Use a neural network to bias the GA's mutation operator toward likely-useful changes.

Surrogate-Assisted Evolution

Train increasingly accurate fitness predictors as the GA runs. Each generation provides new training data.

Coevolution

Evolve test cases alongside code solutions in parallel populations. Code evolves to pass tests; tests evolve to break code.

Summary & Key Takeaways

GAs Evolve Solutions

Maintain a population of candidates, using selection, crossover, and mutation to iteratively improve without requiring gradients.

Fitness Drives Everything

The fitness function is the only problem-specific component. For SE, fitness can be test pass rate, coverage, or compilation success.

Balance Explore vs Exploit

Mutation and large populations drive exploration; selection and crossover exploit known good solutions.

GA + LLM = Synergy

GAs evolve prompts and specifications; LLMs convert them to code and tests. This combination automates both design and implementation.

Fitness Approximation

When test execution is expensive, approximate fitness with an LLM predictor. Accept small error for orders-of-magnitude speedup.

Honest Assessment

GA+LLM is promising research, not production-ready. Fitness landscape mismatch, deceptive fitness, and representation problems remain open challenges.

— · No lab artifact this module

GA is a research-oriented capstone module — the work happens inside each team's final project rather than a shared notebook.

No lab artifact — this module is taught without a standalone notebook.