Group IV · Generative analytics · CSCI 455 / 555 · Spring 2026

PlotForge

An AI-built interactive plotting and statistical analysis interface, delivered end-to-end through vibe coding with Claude Code.

James He & Jack Stawasz · William & Mary · May 2026

PlotForge is a browser-based platform for mathematical function plotting and statistical data analysis, built end-to-end through vibe coding with Claude Code. It targets students, educators, and lightweight analysts who need to explore datasets without writing code. The platform spans function plotting, data import, eight statistical analysis modules, and seven machine-learning algorithms in a single dark-theme web application backed by Python Flask. The codebase totals approximately 17,700 lines across 31 source files.

Python Flask · SciPy · scikit-learn · Chart.js · Vanilla JS · Claude Code

The brief.

PlotForge addresses three underserved user segments. STEM students regularly need to plot functions, run inferential tests, and inspect model results, but face steep setup costs in Python or MATLAB. Educators need live interactive demonstrations of statistical concepts without managing classroom software environments; Desmos and GeoGebra are interactive but support neither real data nor inferential statistics. Lightweight analysts need a fast scratchpad for uploading a CSV, visualizing distributions, running a quick regression, and exporting findings — before committing to a full programming stack.

The Anaconda 2022 State of Data Science report found that 45% of data practitioners spend significant time navigating between multiple tools rather than performing analysis, and "inadequate tools" ranked among the top three workflow challenges. Python and R dominate data-science usage but consistently require environment setup, package management, and scripting skills that exclude non-programmers. No lightweight, browser-based tool covers the full introductory curriculum from descriptive statistics through supervised ML without requiring any code.

Target: STEM students, educators teaching introductory statistics, and lightweight analysts who need a fast, code-free scratchpad for CSV exploration, distribution fitting, regression, and ML.

Running python src/app.py is the only setup step. The tool covers the complete introductory statistics curriculum (descriptive statistics, distribution fitting, 8 hypothesis test types, correlation, linear/logistic regression, and supervised/unsupervised ML) with no package installation or coding by end users.

From the write-up

The landscape.

Tool	Approach	Weakness	Our edge
Desmos / GeoGebra	Interactive function graphers	Limited to function graphing; no data import or inferential statistics	Adds real data, inferential stats, and ML in the same browser surface
MATLAB	Numerical computing environment	Not interactive in a browser; requires install	Zero-install and browser-native
Jupyter Notebook	Code-first scientific notebook	Requires Python scripting skill and environment management	Code-free chip UI exposes the same statistical coverage to non-programmers
JASP	Free GUI for statistics	Lacks function plotting, machine learning, and a web-based deployment	Adds plotting, full ML coverage, and a browser deployment

PlotForge's advantages are concrete and feature-specific: a single-command setup with full introductory-stats coverage; an integrated workflow where data imported once is immediately available in all analysis tabs without re-importing; transparent one-hot encoding for categorical variables (chip UI marks them with a dashed border and a cat badge); interactive curve fitting where users set initial parameters by clicking on the plot instead of entering numerical guesses; and an animated ML training progress bar with a model-specific ETA so users do not abandon training on larger datasets.

The system.

PlotForge uses a two-tier architecture: a vanilla HTML/CSS/JavaScript frontend with no build step and no framework, and a Python Flask backend that serves both static files and a REST API. The Flask root route (/) serves index.html via send_from_directory, ensuring API fetch paths resolve correctly regardless of how the HTML is opened.

The backend (app.py, 1,451 lines) exposes 15 REST endpoints under /api/stats/. SciPy powers the inferential layer — one-sample and two-sample t-tests, Mann-Whitney, Kruskal-Wallis, ANOVA, chi-squared, Shapiro-Wilk, curve fitting (Gaussian, exponential, power, logistic, polynomial, sine), Q-Q plots, ECDF, and preprocessing transforms. scikit-learn supplies linear and logistic regression, random forest, gradient boosting, decision tree, KNN, SVM, K-Means, and PCA, with cross-validation, ROC/PR curves, confusion matrices, and feature importance.

The frontend (stats.js 2,378 lines, index.html 646 lines, components.css 1,168 lines, plus 20 additional JS/CSS files — roughly 17,700 lines total) renders all visualizations with Chart.js and implements a chip-style variable selection system across every analysis tab. Chips support three selection modes: single-select radio, multi-select, and type-filtered (Hypothesis Testing automatically switches between numeric-only chips for t-tests/ANOVA and categorical-only chips for Chi-squared).
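The type-filtered mode can be illustrated with a short sketch. The real chip logic lives in the JavaScript frontend; the Python below only mirrors the idea, and the names Variable, ACCEPTED_TYPE, and selectable_chips are hypothetical, not taken from the PlotForge codebase.

```python
from dataclasses import dataclass

# Hypothetical mapping from a test to the variable type its chips accept.
# Only the pairing (t-tests/ANOVA -> numeric, chi-squared -> categorical)
# comes from the write-up; the key names are illustrative.
ACCEPTED_TYPE = {
    "t_test": "numeric",
    "anova": "numeric",
    "chi_squared": "categorical",
}


@dataclass(frozen=True)
class Variable:
    name: str
    var_type: str  # "numeric" or "categorical"


def selectable_chips(variables, test_name):
    """Return the variables whose type matches what the chosen test accepts."""
    wanted = ACCEPTED_TYPE[test_name]
    return [v for v in variables if v.var_type == wanted]


variables = [
    Variable("score", "numeric"),
    Variable("hours", "numeric"),
    Variable("group", "categorical"),
]
print([v.name for v in selectable_chips(variables, "t_test")])
print([v.name for v in selectable_chips(variables, "chi_squared")])
```

Keeping the test-to-type mapping in one table means a new test only needs one new entry rather than a new branch in the filtering code.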

The implementation.

The data flow is linear and stateless on the client. A user imports a CSV, JSON, Excel, pickle, HDF5, or NumPy file, and each column becomes a list-variable object with kind:'list'. The user selects variables via the chip UI; categorical variables are one-hot encoded client-side via _expandVar(v); and _statsPost(endpoint, payload) then posts the payload as JSON to Flask. Flask returns JSON results, which the frontend renders as tables, Chart.js visualizations, confusion matrices, ROC/PR curves, and feature-importance bars.
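To make the one-hot step concrete, here is a minimal sketch of the technique. PlotForge does this in JavaScript inside _expandVar(v); the Python function below is an illustration only, and its name, its "column=level" naming scheme, and its decision to keep every level (rather than drop a reference level) are assumptions.

```python
def expand_var(name, values):
    """One-hot encode a categorical column into 0/1 indicator columns.

    Returns a dict mapping "<name>=<level>" to a list of 0/1 flags, with
    levels in order of first appearance. A column whose values are all
    numeric is passed through unchanged under its own name.
    """
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        return {name: list(values)}
    levels = list(dict.fromkeys(values))  # unique levels, insertion order kept
    return {
        f"{name}={level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


print(expand_var("group", ["a", "b", "a"]))
print(expand_var("score", [71.5, 88.0, 64.2]))
```

Each categorical column becomes one indicator column per level, so a payload built this way contains only numeric lists by the time it is serialized to JSON.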

All features proposed in the original specification were delivered (interactive plotting, data import, regression, hypothesis testing). Additions beyond the proposal included chip-style variable selectors, transparent categorical variable support, seven ML algorithms with full evaluation metrics, an animated training progress bar whose fill follows pct = 90(1 − e^(−3t/T̂)), where t is the elapsed time and T̂ is the model-specific ETA, and six histogram subtypes (histogram, violin, Q-Q, ECDF, boxplot, frequency bar).
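The progress-bar formula is easy to evaluate directly. The sketch below assumes t and the ETA are both in seconds; the function name progress_pct and the guard against a non-positive ETA are illustrative additions, not part of the write-up.

```python
import math


def progress_pct(elapsed_s, eta_s):
    """Evaluate pct = 90 * (1 - exp(-3 * t / T_hat)) from the write-up.

    elapsed_s is t and eta_s is T_hat, both assumed to be in seconds.
    """
    if eta_s <= 0:
        raise ValueError("eta_s must be positive")
    t = max(elapsed_s, 0.0)
    return 90.0 * (1.0 - math.exp(-3.0 * t / eta_s))


for t in (0, 5, 10, 20):
    print(t, round(progress_pct(t, eta_s=10), 1))
```

With this shape the bar starts at 0, sits at roughly 85.5% when the elapsed time equals the ETA, and approaches 90% asymptotically if training overruns. How the remaining span is filled once training completes is not stated in the write-up.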

The entire platform was built using Claude Code (Claude Sonnet 4.6) running in the terminal with direct file-editing access. A CLAUDE.md at the project root captured architecture, file roles, and naming conventions, allowing Claude Code to resume context across sessions without re-explanation. The chip-selector retrofit across all stats tabs, estimated at 3-4 hours of manual work, was completed in roughly 5 minutes with a single prompt.

Built with AI.

Where AI helped

UI scaffolding at scale: chip system, progress bar, and results rendering across 8 tabs were generated with high first-pass correctness from single specifications.
Backend statistics implementation: every SciPy and scikit-learn call (cross-validation, ROC curves, Kruskal-Wallis, PCA) was implemented correctly without iteration.
CSS consistency: the existing design token system (--acc1, --border2) was maintained across every new component without visual drift.

Where AI struggled

Frontend-backend field name mismatches (elbow_k vs elbow_ks, y_test vs y_orig, flat vs nested roc object) required runtime error output to be pasted back for diagnosis — Claude could not predict them from code alone.
A CSS height-collapse edge case where the confusion-matrix wrapper had height:0 despite an inner <table> of 120px (caused by overflow-x:auto collapsing flex-column children) needed getBoundingClientRect() inspection and one debug pass to fix.
Open-ended architectural questions ("React or vanilla JS?") drew confident answers without full codebase context, sometimes requiring course-correction.

Runtime context is essential for debugging: show Claude the exact error output, not just the code. Scope specificity outperforms vague directives by a wide margin. Humans should make the structural decisions (REST API design, single-file Flask, shared variable state) and hand them to Claude as constraints to execute against.
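One way field-name drift of this kind could be surfaced before it reaches the renderer is a small contract check on each response. This is not something the write-up says the project does; the function missing_fields is hypothetical, and of the key names below only elbow_k, elbow_ks, and roc appear in the write-up (which side used which name is not stated, and fpr and tpr are placeholders).

```python
def missing_fields(response, expected):
    """Return expected keys absent from a response dict, as dotted paths.

    `expected` maps each key to None (a leaf) or to a nested dict of
    expected keys, so flat-versus-nested mismatches are reported too.
    """
    missing = []
    for key, sub in expected.items():
        if key not in response:
            missing.append(key)
        elif sub is not None:
            child = response[key]
            if isinstance(child, dict):
                missing.extend(f"{key}.{m}" for m in missing_fields(child, sub))
            else:
                # Key exists but is not nested as the renderer expects.
                missing.extend(f"{key}.{k}" for k in sub)
    return missing


# Illustrative contract: what a renderer might expect from one endpoint.
expected = {"elbow_ks": None, "roc": {"fpr": None, "tpr": None}}
response = {"elbow_k": [2, 3, 4], "roc": {"fpr": [0.0, 1.0], "tpr": [0.0, 1.0]}}
print(missing_fields(response, expected))
```

A check like this turns a silent rendering failure into an explicit list of missing keys, which is the kind of runtime output that was needed for diagnosis.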

The evidence.

~17,700 lines

Codebase size

31 source files, mostly AI-authored

15

REST endpoints

All under /api/stats/

7

ML algorithms

Linear/logistic regression, RF, GB, DT, KNN, SVM, plus K-Means and PCA

8 types

Hypothesis tests

t-tests, Mann-Whitney, Kruskal-Wallis, ANOVA, chi-squared, Shapiro-Wilk, plus more

~5 minutes

Chip-selector retrofit

Estimated 3-4 hours manual; Claude Code, one pass

0.517

ML CV mean accuracy

Random Forest on random synthetic data — correctly exposes overfitting vs 100% in-sample

One-Sample t-Test on score (μ₀ = 0), n = 60 synthetic observations from Uniform[50,100]: t = 39.10, df = 59, p = 6.90 × 10⁻⁴⁴, 95% CI = [71.61, 79.34]. Switching to Chi-Squared automatically filters chips to show only the categorical group variable, confirming type-aware filtering. End-to-end verification run on a synthetic 60-observation dataset.
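The reported figures can be sanity-checked without the raw data. The sketch below first shows the textbook one-sample t statistic using only the standard library (the backend itself uses SciPy; the function one_sample_t is an illustrative stand-in), then recovers the implied standard error and critical value from the reported t and confidence interval.

```python
import math
import statistics


def one_sample_t(sample, mu0=0.0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # sample SD, n - 1 denominator
    return (mean - mu0) / std_err, n - 1


# Consistency check using only the numbers reported above.
ci_low, ci_high, t_reported = 71.61, 79.34, 39.10
mean_est = (ci_low + ci_high) / 2              # the CI is centred on the sample mean
std_err_est = mean_est / t_reported            # with mu0 = 0, t = mean / SE
crit_implied = (ci_high - ci_low) / 2 / std_err_est  # CI half-width / SE
print(round(mean_est, 3), round(std_err_est, 3), round(crit_implied, 2))
```

The implied critical value comes out near 2.00, in line with the two-sided 95% critical value of a t distribution with 59 degrees of freedom, and the implied mean of about 75.5 is close to the centre of Uniform[50,100], so the reported t, df, and interval are mutually consistent.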

Limits & next.

Limits

Open-ended architectural questions to the LLM (vanilla vs framework, file structure) returned confident answers without full codebase context.
Backend field-name drift between Python results and JS renderers needed explicit runtime error output to diagnose — code alone was insufficient.
Categorical encoding is automatic but limited to one-hot; the write-up does not document ordinal or target encoding support.

Artifacts · Group IV

Source GitHub repository jackStawasz/PlotForge
