Group II · Multimodal media · CSCI 455 / 555 · Spring 2026

Multimodal Video Indexing

Replacing flawed metadata search with natural-language retrieval over short-form video, powered by a locally-hosted multimodal LLM.

Nathaniel Callabresi&Lily Walker William & Mary · May 2026

Short-form video lacks searchable metadata, relying on user-supplied captions or tags. The team builds a minimal video platform powered by a localized multimodal pipeline that enables natural-language search of short-form videos. The pipeline runs Qwen2.5-Omni 7B locally to auto-generate text descriptions from synchronized audio and visual streams, then indexes those descriptions in Elasticsearch. One-shot prompting lets a 7B model recognize specific entities typically reserved for 30B+ models.

React / ViteFastAPIQwen2.5-Omni 7BElasticsearchBackblaze B2all-MiniLM-L6-v2

The brief.

Short-form video on TikTok, Instagram Reels, and YouTube Shorts is discoverable almost entirely through user-generated captions, hashtags, and creator account names. That metadata is subjective, spam-prone, and frequently fails to describe the actual visual and auditory events inside the clip. When a user remembers a specific quote or a visual moment but does not know the exact user-supplied tags, retrieving the video becomes nearly impossible.

A resource-efficient multimodal pipeline closes the gap. By running Qwen2.5-Omni 7B locally inside a 16GB RAM constraint and synchronizing tokens from audio and video via Time-aligned Multimodal RoPE, the platform produces structured textual summaries that an Elasticsearch index can serve as natural-language search results — without sending sensitive video to a third-party cloud.

Target: Niche social-media platforms, internal enterprise asset managers, and smaller creator-economy startups that need automated metadata enrichment for private or localized video repositories.

This technical efficiency ensures that our platform remains responsive and scalable on local infrastructure, bypassing the latency and data-privacy concerns associated with sending sensitive video data to third-party cloud providers.From the write-up

The landscape.

Tool	Approach	Weakness	Our edge
Google Cloud Video AI	Cloud-hosted multimodal video indexing	Prohibitive per-minute cost for smaller platforms and individual creators	Local-first inference inside 16GB RAM, no per-minute cloud bill
AWS Rekognition	Cloud-hosted visual recognition and indexing	Same per-minute cost structure; sensitive video must leave the user's infrastructure	Sensitive video never leaves the user's machine
ASR-only search (Whisper-style)	Index spoken words inside videos	Captures speech but misses the visual context and the timing between audio and visuals	TMRoPE anchors visual entities to specific audio cues — searchable visual + sound events

The pipeline anchors visual entities to specific audio cues through time-aligned tokenization, enabling queries like a person jumping paired with a specific sound — a task tag-based systems cannot perform. One-shot prompting on a 7B model also delivers entity-level specificity (e.g., naming a specific public figure) at a fraction of the memory cost of the 70B+ models competitors rely on.

The system.

The platform is a minimal video upload and search application. Users drag-and-drop a short-form clip into a glassmorphism web UI; the FastAPI backend writes a temporary copy, pushes the file to a private Backblaze B2 bucket, and enqueues a job into a Strict Single-Worker FIFO Queue. The queue is load-bearing — running concurrent inferences would crash the PyTorch MPS allocator on the 16GB device the platform targets.

Deep multimodal inference is performed by Qwen2.5-Omni 7B, loaded via Hugging Face transformers and qwen_vl_utils on the Apple-Silicon MPS backend. The model ingests synchronized audio and visual frames in a single pass — no separate speech-to-text or frame-analysis sub-models — and emits a structured textual summary (topic, what happens, audio and speech, tone, key takeaway).

The generated summary is indexed in a Dockerized Elasticsearch 8.13.2 node. The text is also embedded with all-MiniLM-L6-v2 into a 384-dimensional dense_vector, so the search UI can offer both a BM25 keyword mode and a k-NN semantic mode that retrieves videos whose meaning matches the query even when the exact keyword does not appear in the summary.

Qwen2.5-Omni Thinker-Talker architecture diagram showing vision and audio encoders feeding synchronized tokens into a streaming codec decoder. — **Figure 1** · Qwen2.5-Omni architecture · time-aligned multimodal tokenization across audio and vision streams.

The implementation.

The pipeline is split across a React + Vite + react-router-dom frontend (glassmorphism UI, dynamic timers, drag-and-drop upload) and an ASGI FastAPI + Uvicorn backend. Videos are uploaded to private Backblaze B2 buckets through boto3 (Signature Version 4), and the backend mints short-lived presigned URLs so the browser can stream the file directly without exposing the bucket.

A major engineering hurdle was getting the 7B model to actually run inside the 16GB target. Native FP16 inference swapped to SSD and pushed latency to 6–8 hours per video. Switching to INT8 via QuantoConfig brought a single video down to 50–60 minutes; pushing further to INT4 cut it to 25–35 minutes without significant semantic degradation.

On the search side, BM25 handles term-frequency / inverse-document-frequency keyword matching, while approximate k-NN with k=5 retrieves the top semantically similar videos. The dual-mode search lets a user query for cat and retrieve the clip with BM25, then re-query for kitten and still retrieve the same clip via the embedding space — proving the semantic pipeline works even when the literal word is absent.

Dark glassmorphism upload UI with a dashed drop zone labeled Upload Video for Analysis. — **Figure 2** · VideoSense AI upload surface · drag-and-drop entry into the single-worker inference queue.

Built with AI.

Where AI helped

Generated standard web-server logic, boilerplate files, and CSS styling exceptionally fast.
Drafting the initial Product Requirements Document and parsing it into routed views via Gemini 3.1 Pro.
Iterative meta-prompting to refine the analyze-page flow into the exact manual-trigger states the team wanted.

Where AI struggled

Suggested CUDA-only bitsandbytes quantization for an Apple Silicon MPS backend — incompatible and silent.
Crashed repeatedly on a Python 3.9 types.GenericAlias type-hinting issue inside qwen_vl_utils until the team wrote a manual patch.
Universally defaulted UI styling to indigo-500 (the Tailwind UI fingerprint) — the team kept the indigo intentionally but acknowledged the bias.

Generative coding tools lack intrinsic awareness of undocumented library bugs or localized hardware limitations; the developer must still enforce physical constraints.

The evidence.

19–36 ms

Search latency

Per Elasticsearch query, BM25 or semantic

25–35 min

Inference time (INT4)

Per video, on a 16GB Apple M3 Air

50–60 min

Inference time (INT8)

Same hardware, less aggressive quantization

384

Embedding dimension

all-MiniLM-L6-v2 dense vectors

10–30 s

Clip length tested

Reels, Shorts, and TikTok samples

kitten Successfully retrieved a clip whose AI-generated summary describes a cat being bottle-fed milk — the literal word kitten never appears in the summary, but the embedding model resolves the semantic proximity between cat and kitten. Semantic-search demonstration · semantic recall when literal keyword matching fails.

Completed processing screen showing a cat video alongside a multi-section AI Summary with topic, what happens, and audio and speech fields. — **Figure 3** · Structured AI summary · Qwen2.5-Omni emits topic, scene, audio, and tone fields the index can search.

Limits & next.

Limits

Inference is slow: even at INT4 a single short clip still takes 25–35 minutes on the 16GB target machine.
The strict single-worker queue serializes uploads — multiple concurrent users would block on each other.
One-shot prompting works for known entities (e.g., Donald Trump) but requires a hand-curated reference image per recognized figure.
Pipeline tested on a small dataset of 10–30 second clips; behavior on longer-form video is unverified.

Batch and parallelize inference across multiple devices to break the single-worker bottleneck.
Expand the one-shot reference library so the model recognizes a broader set of public figures.
Extend support beyond 30-second clips toward longer-form video.

Artifacts · Group II

Source GitHub repository F4llow/video_search_app

← All capstones

Other capstones.

Group I Stock Investment AI An educational overlay on equities — AI explanations layered on top of live stock data. → Group III Sports Betting Arbitrage Detection Cross-book arbitrage finder for live sportsbook odds. → Group IV PlotForge Conversational interface for plotting and data analysis. →