Ultimate Frisbee Rules Interpreter
A RAG-powered rules adjudication assistant: describe a play in plain English, get a cited ruling grounded in the official USAU rulebook.
Ultimate Frisbee is one of the only competitive sports where the players are also the referees, and disputes routinely stall games because almost nobody has read the 28-page USAU rulebook end-to-end. The Ultimate Rules Interpreter is a web application where a player describes what happened on the field in plain English and gets a structured ruling back in seconds — verdict, plain-English explanation, exact rule citation, and an ambiguity note. The system uses Retrieval-Augmented Generation against the rulebook so the LLM cites source material rather than guessing from training data.
The brief.
Ultimate runs on "Spirit of the Game" — a shared agreement that players call their own fouls and resolve disputes through calm discussion. In theory it works great; in practice, when a high-stakes point gets stopped by a contested call, things get messy fast. Players argue about what they think the rulebook says, someone confidently declares the wrong ruling, and the game sits on pause for five minutes while everyone talks over each other.
The core problem is that almost nobody has actually read the rulebook. The USAU 11th Edition is 28 pages of dense, legalistic text full of cross-references that are hard to follow even at home with time to read carefully — let alone sweaty and tired in the middle of a game. Most players go off memory or defer to whoever sounds most confident, which is a bad way to settle anything. USA Ultimate has over 50,000 registered members in the United States, and the casual base is much larger; the addressable audience for a rulebook-grounded interpreter is the entire competitive and recreational player community.
Target: Recreational and competitive Ultimate players — especially college teams who compete without observers — plus league directors who want a quick, accurate way to settle on-field disputes.
What makes this more than just asking ChatGPT a question is that it uses Retrieval-Augmented Generation. Instead of relying on a general AI model to remember sports rules from its training data (which is exactly how you end up with confidently wrong answers), the system turns the official rulebook into a searchable database.From the write-up
The landscape.
| Tool | Approach | Weakness | Our edge |
|---|---|---|---|
| ChatGPT / Claude | General LLM, no rulebook grounding | Hallucinates plausible but wrong rules | Grounded in official PDF via RAG |
| USAU Website FAQ | Static Q&A pages | Cannot handle novel or composite scenarios | Natural-language scenario input |
| Rules PDF (generic) | Rule lookup by number | Requires knowing the rule number first | Scenario-first workflow — describe the play, not the rule |
| Disc golf apps (UDisc) | Different sport entirely | No Ultimate coverage at all | Ultimate-specific accuracy |
The Ultimate Rules Interpreter is the only tool that combines natural language input with rulebook-grounded output. General LLMs are accessible and conversational but routinely hallucinate plausible-sounding but incorrect rules. Static resources like the USAU FAQ require the user to already know which rule applies. The RAG architecture closes the gap by retrieving the actual relevant rule text and forcing the model to reason from it.
The system.
The application has three main parts: an ingestion pipeline that processes the rulebook, a backend API server that handles queries, and a frontend interface that players actually interact with. Data flows in one direction — the rulebook gets ingested and stored once, and then every player query triggers a retrieval and generation cycle against that stored data.
The ingestion pipeline uses pypdf to extract text from the official USAU PDF. The most important decision was chunking strategy. Storing the entire 28-page document as one block overwhelms retrieval; chunks that are too small lose context and sever connections between related rules. The final approach splits by major rule section, keeping all subsections together under their parent rule. When a chunk exceeds 2,000 characters it splits at a paragraph break, and the parent rule name gets prepended to the next piece so the AI never loses track of where it is in the document. Each chunk is converted into a vector embedding via the Hugging Face Inference API and stored in Pinecone.
The backend is a Flask application. When a player submits a scenario, the backend embeds it with the same Hugging Face model and queries Pinecone for the six most semantically relevant rule sections. Those chunks and the player's original question are sent to Llama 3.3 70B via the Groq API, which uses specialized LPU hardware so responses arrive in near real-time — critical when players are standing on a sideline waiting. The system prompt instructs the model to act as a professional rules official and base its ruling only on the text it has been given, and to respond in JSON so the frontend can cleanly separate the verdict, the citation, and the plain-English explanation.
The implementation.
The frontend is a single-page web application built with HTML, CSS, and vanilla JavaScript. The design priority was mobile usability since most players will pull it up on a phone during a game. It shows a live progress view of the pipeline as it runs so players see the system actively searching and analyzing rather than staring at a loading spinner. A history sidebar persists earlier rulings within the same session so teams can reference previous calls if a similar situation comes up again.
Getting the app running without a hosting bill was its own engineering challenge. The original local sentence-transformers model consumed 650MB of RAM on startup, which instantly crashes Render's free tier (capped at 512MB). Switching to Voyage AI for cloud embeddings solved the RAM problem but introduced a 3-request-per-minute rate limit that was too slow to finish ingesting the 20-chunk rulebook. The final architecture landed on the Hugging Face Inference API for embeddings (low RAM, reasonable rate limits) and Pinecone for persistent cloud storage so the database survives server restarts. The frontend uses relative API paths so the same codebase runs locally and in production without environment-specific config.
The most measurable improvement during development came from switching to parent-level chunking. Across ten representative queries on common call types (travel, pick, strip foul, stall count, down disc), parent-level chunking retrieved the correct definition-containing chunk in 9 out of 10 cases compared to 5 out of 10 with sub-bullet chunking — a clean, repeatable win attributable to a single design choice.
Built with AI.
Where AI helped
- REST endpoint structure, JSON response formatting, and CSS layout came out clean on the first or second try.
- Python data-processing pipelines around pypdf and the embedding loop were largely first-pass correct.
- Library documentation (ChromaDB and then Pinecone upsert/query semantics) was explained well enough to skip reading docs in detail.
- Flask API scaffolding and the base CSS layout were dramatically faster than writing them by hand.
Where AI struggled
- AI could not validate anything that required running the system against real inputs — bad retrieval results only surfaced after I built tests and described the failure pattern back.
- Long sessions accumulated redundant functions and slightly inconsistent patterns that looked fine in isolation but created confusion together; I had to learn to stop and explicitly request refactor passes.
- Production hosting tradeoffs (RAM caps, rate limits, embedding-model swaps) confused the LLM and I had to work through them myself.
Effective vibe coding is fundamentally a communication skill — the more precisely you can describe what failed, what you expected, and what the structure looks like, the better the output gets. Vague prompts produce generic suggestions; specific prompts produce specific solutions.
The evidence.
Limits & next.
Limits
- Plays involving multiple simultaneous events (foul + pick + contested possession in one sequence) can confuse the model about which rule takes priority — the rulebook itself does not always give a clean priority ordering.
- No memory between sessions: refreshing the browser clears the ruling history, which is a real limitation for tournament-day use.
- Retrieval validation rests on a hand-curated 10-query test set; there is no formal held-out evaluation harness yet.
- Coverage is USAU-only; the tool does not handle WFDF (international) rules.
- No voice input — typing a detailed play description while out of breath after a hard point is genuinely difficult.
Next
- Build a proper evaluation dataset of around 50 scenarios with documented correct rulings so future changes can be measured against a ground truth.
- Add voice input so players can describe a play hands-free.
- Add WFDF rule support to open the tool to the global Ultimate community and double the addressable market.
- Persist ruling history across sessions to support full tournament-day workflows.
- Explore a premium tier for leagues and national governing bodies — a flat annual fee for embedding the tool in league websites or scheduling platforms.