Fifteen undergraduate scholars spent twelve weeks at the College of William & Mary learning how code-aware language models actually behave: where they hallucinate, how to measure their failures, and how to build workflows that don't collapse in production. Eleven projects shipped this spring.
The 2026 class.
Eight chapters of theory.
An overview of how the term is built — each chapter is read end-to-end, then drilled in a notebook. Lectures hand off to lab handouts; lab handouts hand off to seminar discussions. The notebook stays open the entire semester.
Mining repositories
Collecting, cleaning, and tokenizing source-code data from public repositories. PyDriller, BPE, deduplication with MinHash, and the ethics of training on copyleft code.
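As a rough illustration of the MinHash deduplication named above, the sketch below estimates the Jaccard similarity of two token sequences from short signatures, using only the standard library. The shingle size, the number of hash functions, the salted-hash scheme, and the sample Java snippets are all assumptions chosen for illustration, not the course's settings.

```python
import hashlib

def shingles(tokens, k=3):
    """Set of overlapping k-token windows; short inputs yield one shingle."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash of the set under each of num_hashes salted hashes."""
    signature = []
    for seed in range(num_hashes):
        # Illustrative way to derive distinct hash functions: one salt per slot.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical tokenized methods: a near-duplicate pair and an unrelated one.
original = "public int add ( int a , int b ) { return a + b ; }".split()
renamed = "public int add ( int x , int y ) { return x + y ; }".split()
unrelated = "while ( queue . isEmpty ( ) == false ) { process ( queue . poll ( ) ) ; }".split()

sig_original = minhash_signature(shingles(original))
sig_renamed = minhash_signature(shingles(renamed))
sig_unrelated = minhash_signature(shingles(unrelated))

print("original vs renamed:  ", estimated_jaccard(sig_original, sig_renamed))
print("original vs unrelated:", estimated_jaccard(sig_original, sig_unrelated))
```

A deduplication pass would then drop one member of every pair whose estimate exceeds a chosen threshold; the threshold itself is a judgment call and is left out here.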
Modeling code
From n-grams to the naturalness hypothesis. Probability refresher, MLE, perplexity, smoothing, sampling temperature — and why code is more predictable than English.
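To make MLE, smoothing, and perplexity concrete, here is a minimal bigram sketch with add-one (Laplace) smoothing. The toy token stream and the choice of add-one smoothing are assumptions for illustration; the chapter may use other smoothing methods and real corpora.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count contexts and bigrams; the last token never acts as a context."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return contexts, bigrams, vocab

def bigram_prob(prev, cur, contexts, bigrams, vocab_size):
    """Add-one smoothed estimate of P(cur | prev)."""
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)

def perplexity(tokens, contexts, bigrams, vocab_size):
    """exp of the average negative log-probability per predicted token."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score a bigram model")
    log_sum = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        log_sum += math.log(bigram_prob(prev, cur, contexts, bigrams, vocab_size))
    return math.exp(-log_sum / (len(tokens) - 1))

# Toy training stream: one highly repetitive line of code, repeated.
train = "for i in range ( n ) :".split() * 20
contexts, bigrams, vocab = train_bigram(train)

seen = "for i in range ( n ) :".split()
shuffled = ": ) n ( range in i for".split()

print("perplexity, familiar order:", perplexity(seen, contexts, bigrams, len(vocab)))
print("perplexity, shuffled order:", perplexity(shuffled, contexts, bigrams, len(vocab)))
```

The familiar ordering scores a much lower perplexity than the shuffled one, which is the same effect the naturalness hypothesis describes at corpus scale.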
Evaluating rigorously
Classification metrics, BLEU and its discontents, CodeBLEU, pass@k, embeddings, SIDE, and the unglamorous human-evaluation rubric the best papers include without making a show of it.
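Of the metrics listed, pass@k is the easiest to get subtly wrong, so a short sketch may help. It uses the widely used unbiased estimator: generate n samples per problem, count the c samples that pass the tests, and compute 1 - C(n-c, k) / C(n, k). The function name and the example numbers are illustrative assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimated probability that at least one of k samples passes,
    given that c of n generated samples passed the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 of them pass.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.4f}")
```

A benchmark score is the mean of this quantity over all problems, which differs from simply checking whether the first k samples contain a pass.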
Deep learning
Neural networks, backpropagation, embeddings, LSTM/GRU, attention, transformers, autoregressive generation, pre-training, and fine-tuning — the engine room.
Prompting LLMs
In-context learning, few-shot, chain-of-thought, prompt engineering, RAG, tool use, context-window management, prompt chaining, self-consistency, and prompt evaluation.
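Self-consistency, one of the techniques listed above, reduces to sampling several answers and taking a majority vote. The sketch below assumes a hypothetical `sample_answer` callable that maps a prompt to a final answer string; the stub standing in for a model, the prompt, and the canned answers are invented for illustration and do not call any real API.

```python
from collections import Counter
from itertools import cycle

def self_consistent_answer(sample_answer, prompt, n_samples=5):
    """Sample n_samples answers and return (majority answer, agreement rate)."""
    answers = [sample_answer(prompt).strip() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

def make_stub(canned_answers):
    """Hypothetical stand-in for a sampled LLM call: replays canned answers."""
    replay = cycle(canned_answers)
    return lambda prompt: next(replay)

stub = make_stub(["42", "41", "42 ", "42", "7"])
answer, agreement = self_consistent_answer(stub, "What is 6 * 7?", n_samples=5)
print(answer, agreement)
```

The agreement rate is a cheap signal worth logging: a low value suggests the samples disagree and the majority answer deserves less trust.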
Hallucinations in code
How LLMs fabricate, the CodeHalu taxonomy, RAG mitigation, prompt defenses, tool-augmented generation, production case studies, and hallucination-resistant workflows.
NP-completeness
Reductions, hardness, and what LLMs do when the underlying problem isn't tractable. Where statistical pattern-matching collides with the unforgiving floor of computational complexity.
Genetic algorithms
Population search, fitness landscapes, crossover, fitness approximation with LLM predictors, the GA+LLM architecture, and the honest limits of evolutionary search over code.
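The population-search loop behind this chapter can be sketched in a few lines. This is a minimal genetic algorithm over bitstrings with tournament selection, single-point crossover, bit-flip mutation, and elitism; the OneMax fitness function (count of 1 bits) and every hyperparameter are assumptions for illustration. In a GA+LLM setup, the `fitness` callable is where an LLM-based predictor might be substituted.

```python
import random

def evolve(fitness, genome_len=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Return (best genome found, best fitness in the initial population)."""
    rng = random.Random(seed)  # seeded for reproducible runs
    pop = [[rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    initial_best = max(fitness(g) for g in pop)

    def tournament():
        # Pick two distinct individuals at random and keep the fitter one.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        elite = max(pop, key=fitness)
        children = [elite[:]]  # elitism: the current best survives unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randint(1, genome_len - 1)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation applied independently to each position.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        pop = children

    return max(pop, key=fitness), initial_best

best, initial_best = evolve(fitness=sum)  # OneMax: fitness is the number of 1 bits
print("initial best:", initial_best, "final best:", sum(best))
```

Because the elite individual is copied forward each generation, the best fitness never decreases; whether it reaches the optimum depends on the landscape and the settings.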
Grading scheme
Assignments avg · midterm · capstone split · participation

| Deliverable | What it is | Weight |
|---|---|---|
| Assignments I-III (avg) | Average of three coding assignments — mining, modeling, evaluation. | 40% |
| Midterm | Mid-semester written examination of theory and methods. | 10% |
| Final project | Capstone block · 5–7 page write-up paired with a ten-minute in-class demo of the shipped product. Graded on five rubric criteria. | 45% |
| Participation | Office-hour engagement, seminar discussion, peer review. | 5% |
| Bonus | Additive, not weighted — for exceptional contributions. | +0–5 |
| Total | Weighted components sum to 100%; bonus remains additive. | 100% |
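The weighting in the table works out as simple arithmetic, sketched below. The sketch assumes every component is scored on a 0 to 100 scale and that the bonus is added as percentage points on top of the weighted total without a cap at 100; neither detail is stated in the table, and the sample scores are made up.

```python
# Weights taken from the grading table; they sum to 1.0.
WEIGHTS = {
    "assignments": 0.40,
    "midterm": 0.10,
    "final_project": 0.45,
    "participation": 0.05,
}

def final_grade(assignment_scores, midterm, final_project, participation, bonus=0.0):
    """Weighted total on a 0-100 scale, plus an additive bonus of 0 to 5 points."""
    if not 0 <= bonus <= 5:
        raise ValueError("bonus must be between 0 and 5")
    assignments_avg = sum(assignment_scores) / len(assignment_scores)
    weighted = (
        WEIGHTS["assignments"] * assignments_avg
        + WEIGHTS["midterm"] * midterm
        + WEIGHTS["final_project"] * final_project
        + WEIGHTS["participation"] * participation
    )
    return weighted + bonus  # assumption: the bonus is not capped at 100

# Hypothetical student: three assignments, then midterm, project, participation.
print(final_grade([90, 80, 100], midterm=70, final_project=85, participation=100))
```

With these sample scores the weighted components contribute 36, 7, 38.25, and 5 points respectively.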
Five labs, one notebook each.
Each lab pairs a chapter of theory with a hands-on notebook — the artifact a future student inherits. Run them locally, modify them, break them. Lab handouts and source notebooks are linked from each module.
Repository mining & dedup
Clone three permissively licensed Java repos, extract methods, tokenize, and deduplicate. Deliverable: a clean JSONL corpus plus a short report on what got cut and why.
ML warmup — spam
Naive Bayes vs. Random Forest on a spam corpus: the MLE, smoothing, and evaluation muscle you'll reuse on code tokens in the n-gram lab.
Evaluating a code model
Fine-tune CodeT5 for code translation, then evaluate with BLEU and CodeBLEU. Two notebooks — one runs the model, one computes the metrics.
Deep code completion
An end-to-end seq2seq + attention setup for code completion with a small hyperparameter tuner. Reproduces every number in the corresponding lecture.
Prompting & RAG
Talk to an LLM API through code. ICL, chain-of-thought, and a simple regression suite against your own prompts.
Real shipped products.
Each group chose a real problem, scoped a system, and built something that runs. Scored on market analysis, differentiation, and technical framework. All eleven groups shipped on schedule — five cleared the bar, six fell short. Results below tell the whole story, sorted by group number.


Stock Investment AI
An algorithmic stock-prediction interface with explainable retrieval-grounded recommendations. Pairs price-signal modeling with LLM-generated reasoning over filings.


Multimodal Video Indexing
Natural-language search across video archives, replacing brittle metadata-only retrieval with vision-and-language embeddings indexed at scene granularity.

Sports Betting Arbitrage
Real-time cross-sportsbook arbitrage detection with risk-aware position sizing. Surfaces price disagreements before they close.


PlotForge
A plotting interface for data analysis aimed at students, educators, and lightweight analysts. Natural-language to charting with iterative refinement.

BURT++
A bug-report assistant that translates non-technical user complaints into actionable engineering tickets — clarifying reproduction steps as it goes.


GenAI Claim Verification
Retrieval-augmented evidence pipeline for verifying factual claims, attaching source citations with calibrated confidence.

W&M Degree Map
A planning tool for liberal-arts students navigating complex general-education requirements. Goal-aware course recommendations with clear-eyed prerequisite traversal.

RAG Rules · Ultimate Frisbee
A retrieval-augmented rules interpreter for self-officiated Ultimate Frisbee. Answers in-game questions by grounding in the official rulebook.

CodeCaster
A coding assistant for social-science students learning to program for data analysis. Designed for the first hundred lines, not the next thousand.

Youth Sports Registration
A multilingual youth-sports registration platform with a serious accessibility focus, built to reach families that current platforms exclude.

AI-Powered Job Search
A unified career platform consolidating fragmented job-seeker tooling into one assistant — resume, search, outreach, and prep in a single workflow.