Introduction: Evolution Meets AI
Genetic algorithms (GAs) are optimization techniques inspired by Charles Darwin's theory of natural selection. Instead of computing an exact solution, GAs evolve a population of candidate solutions over successive generations, guided by a fitness function. When combined with Large Language Models, this creates a powerful framework for automated software engineering: the GA evolves prompts and specifications, the LLM converts them to code, and a fitness function measures quality.
Why Genetic Algorithms?
Why Combine with LLMs?
The Evaluation Challenge
Core Concepts: Building Blocks of Evolution
Every genetic algorithm operates with the same fundamental vocabulary. A population is a collection of candidate solutions called individuals or chromosomes. Each chromosome is a sequence of genes. The fitness function evaluates the quality of each individual; higher fitness means a better solution. A generation is one complete cycle of evaluation, selection, crossover, and mutation.
Genotype vs. Phenotype: The genotype is the encoded representation. The phenotype is the decoded solution it represents. Genetic Algorithms (GA) optimize parameters; Genetic Programming (GP) evolves programs as tree structures.
- Population
- A collection of candidate solutions. Typical sizes range from 20 to 200.
- Gene
- The smallest unit of a solution. Each chromosome is a sequence of genes.
- Chromosome
- A complete candidate solution encoded as a string of genes.
- Fitness Function
- Evaluates the quality of each individual. This is the only problem-specific component.
- Generation
- One complete cycle of evaluation, selection, crossover, and mutation.
The GA Lifecycle
A genetic algorithm follows a cyclical process that mimics natural evolution.
Selection: Biasing Reproduction Toward Fitness
Selection determines which individuals get to reproduce. The goal is to bias reproduction toward fitter individuals while maintaining diversity to avoid premature convergence.
| Strategy | Idea | Pressure | Diversity | When to use |
|---|---|---|---|---|
| Roulette Wheel | Probability proportional to fitness | High | Low | Uniform fitness distributions |
| Tournament (k=2) | Pick k randomly, keep the fittest | Moderate | Good | Most practical applications |
| Rank-Based | Assign probability based on rank | Controlled | Good | Fitness varies by orders of magnitude |
| Elitism | Copy top k unchanged | Guaranteed | Reduces slightly | Use alongside another method |
Crossover: Combining Building Blocks
Crossover (recombination) combines genetic material from two parents to create offspring. It is the primary driver of exploration in the search space.
// Single-Point Crossover: cut at one position, swap tails
Parent A: [1, 0, 1, 1, 0, 1]
Parent B: [0, 1, 0, 0, 1, 1]
^-- crossover point at 3
Child 1: [1, 0, 1, 0, 1, 1] // A's head + B's tail
Child 2: [0, 1, 0, 1, 0, 1] // B's head + A's tail
// Two-Point Crossover: two cut points, middle segment swaps
Parent A: [1, 0, 1, 1, 0, 1]
Parent B: [0, 1, 0, 0, 1, 1]
^ ^-- cuts at 2 and 4
Child: [1, 0, 0, 0, 0, 1] // A's edges + B's middle
// Uniform Crossover: coin flip per gene
Parent A: [1, 0, 1, 1, 0, 1]
Parent B: [0, 1, 0, 0, 1, 1]
Coin: A B A B B A
Child: [1, 1, 1, 0, 1, 1]
For code and prompts, crossover must respect structure. Sentence-level crossover on prompts and function-level crossover on code templates produce valid offspring far more often than character-level cuts. This is why most GA+LLM work evolves natural language prompts rather than code tokens directly.
Mutation: Creating New Genetic Material
Mutation introduces random changes to individual genes. While crossover combines existing solutions, mutation creates genuinely new genetic material — essential for escaping local optima.
Bit-Flip Mutation
Gaussian Mutation
LLM-Assisted Mutation
Fitness Landscapes: Understanding the Search Space
A fitness landscape conceptually maps every possible solution to its fitness value. Smooth, unimodal landscapes are easy. Rugged, multi-modal landscapes are hard. Code fitness landscapes are extremely rugged — a single character change can turn a working program into one that crashes.
The core tension: exploration (mutation, large populations) searches new regions. Exploitation (selection, crossover) refines known good solutions.
Parameter Tuning
GA behavior is controlled by several key parameters. Choosing the right values is critical — poor settings lead to premature convergence or wasted computation.
| Parameter | Typical range | Too low | Too high | Recommendation |
|---|---|---|---|---|
| Population Size | 20–200 | Low diversity | Slow evaluation | Start with 50–100 |
| Crossover Rate | 0.6–0.9 | Limited recombination | Disrupts good solutions | 0.8 is a safe default |
| Mutation Rate | 0.01–0.05 | Cannot escape local optima | Becomes random search | 1/L is classic |
| Generations | 50–500 | May not converge | Diminishing returns | Use convergence detection |
| Tournament k | 2–7 | Near-random selection | Loss of diversity | k=2 or k=3 |
The Evaluation Bottleneck
The critical challenge in GA for code generation: how do you know if generated code is actually good? Functional correctness requires compiling the code and executing it against a test suite. With 500 individuals × 100 generations = 50,000 evaluations, and each test execution taking 2 hours, that's 100,000 hours = 11.4 years of computation.
Fitness Approximation: The Key Insight
What if we could predict whether code passes tests without actually running them? Fitness approximation replaces expensive test execution with a fast predictor trained to estimate correctness. The predictor analyzes code across multiple dimensions: structural features (AST depth, branches, loops), specification alignment, pattern matching, and semantic similarity to known-good references.
// Exact fitness evaluation: 11.4 YEARS
Evaluations: 500 individuals × 100 gens = 50,000
Time per eval: 2 hours for test execution
Total: 100,000 hours = 11.4 years
// Approximate fitness: ~7 HOURS
Evaluations: 500 individuals × 100 gens = 50,000
Time per eval: 0.5 seconds for LLM predictor
Total: 25,000 seconds ≈ 7 hours
// Speedup: 5 orders of magnitude (11.4 years / 7 hours)
GA + LLM: The Combined Architecture
The breakthrough idea: Use LLMs as a code generation function and genetic algorithms to search for the best inputs (prompts, specifications) to that function.
The key innovation: Expensive test execution happens ONCE at the end, on a handful of final candidates. During evolution, each new individual costs only an LLM call for generation plus an LLM predictor call for fitness estimation — both seconds, not hours.
Application: Evolving Test Descriptions
One of the most promising applications of GA+LLM is automated test generation. Instead of manually writing test cases, we evolve natural language descriptions that an LLM converts into executable tests. The fitness is code coverage achieved by the generated tests.
Honest Limitations and Open Challenges
GA+LLM is a promising research direction, not a production-ready solution. Intellectual honesty requires acknowledging significant open challenges.
Fitness Landscape Mismatch
Representation Problem
Deceptive Fitness
Constraint Modeling
The combinatorial explosion: A 50-token prompt with a vocabulary of 50,000 tokens has 50,000^50 possible configurations. Economic constraints: At $0.01 per LLM call, a population of 500 across 100 generations costs $1,000 in API fees alone.
Design Checklist: Building a GA for Code
- 1. Define the Objective Precisely
- What exactly does 'good code' mean for your task? Write it down as a measurable criterion.
- 2. Choose the Representation
- Will you evolve prompts, code templates, parameters, or raw code? Prompts are almost always the best starting point.
- 3. Design the Fitness Function
- Prefer fine-grained, continuous fitness over binary pass/fail. Include partial credit.
- 4. Select Genetic Operators
- Choose crossover and mutation that respect your representation's structure.
- 5. Set Initial Parameters
- Start with: population=50, tournament k=2, crossover=0.8, mutation=1/L, elitism=2.
- 6. Establish a Baseline
- Before running the GA, measure what random search achieves with the same budget.
- 7. Monitor Diversity
- Track population diversity alongside fitness. If diversity drops to near-zero while fitness stalls, increase mutation or population size.
- 8. Validate Final Output
- Always run actual tests on the top-K solutions. Never trust approximate fitness for the final result.
Research Frontiers
Multi-Objective Optimization
Neural-Guided Search
Surrogate-Assisted Evolution
Coevolution
Summary & Key Takeaways
GAs Evolve Solutions
Fitness Drives Everything
Balance Explore vs Exploit
GA + LLM = Synergy
Fitness Approximation
Honest Assessment
— · No lab artifact this module
GA is a research-oriented capstone module — the work happens inside each team's final project rather than a shared notebook.