Real-Time AI Model Benchmarking

Watch language models compete head-to-head on crossword clues, streaming results as they generate answers

What It Does

Crossword Sprint AI is a real-time benchmarking platform that pits multiple language models against each other in timed crossword challenges. Each model receives the same clues simultaneously and races to provide correct answers as quickly as possible.

The system scores each attempt based on both accuracy and speed, providing a comprehensive view of model performance under constrained output conditions (typical crossword answers are 3-15 characters).

Technical Architecture

Server-Sent Events (SSE) Streaming

The system uses SSE to stream race updates to the client in real-time. Each model attempt is sent individually as soon as it completes, rather than waiting for all models to finish.

// Backend sends events as they happen:
controller.enqueue(`data: ${JSON.stringify({ type: "attempt", /* ... */ })}\n\n`)
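
A fuller sketch of what such a route could look like, assuming a standard ReadableStream-backed Response; the runAllModels helper and the event shape are placeholders rather than the project's actual code:

// Hypothetical SSE route sketch; runAllModels stands in for the real race logic
export async function GET(): Promise<Response> {
  const encoder = new TextEncoder()
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      const send = (event: object) =>
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(event)}\n\n`))

      // Each attempt is pushed the moment its model finishes
      await runAllModels((attempt) => send({ type: "attempt", ...attempt }))

      send({ type: "done" })
      controller.close()
    },
  })

  return new Response(stream, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  })
}

// Placeholder signature so the sketch stands alone
declare function runAllModels(onAttempt: (attempt: Record<string, unknown>) => void): Promise<void>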

Parallel Model Execution

All models process each clue simultaneously using Promise.all(), with individual completion callbacks that trigger SSE updates the moment each model finishes.

// Parallel execution with callbacks:
await Promise.all(models.map(model =>
  runModel(model).then(onComplete)
))
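
A runnable sketch of the pattern with placeholder model names and a stubbed runModel (the real call streams from the model provider via the AI SDK):

// Each model resolves independently; onComplete fires without waiting for the others
async function raceClue(
  models: string[],
  onComplete: (model: string, answer: string) => void
): Promise<void> {
  await Promise.all(
    models.map(async (model) => {
      const answer = await runModel(model)
      onComplete(model, answer)
    })
  )
}

// Stub for illustration only
async function runModel(model: string): Promise<string> {
  return `answer from ${model}`
}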

Performance Metrics

Each attempt is measured for:

  • Time to First Token (TTFT): Latency before generation starts
  • End-to-End Time: Total completion time
  • Accuracy: Normalized string matching against correct answer
  • Composite Score: Combined metric balancing speed and correctness
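
A sketch of how the timing side of this could be captured around a token stream; streamAnswer is an assumed async iterable standing in for the AI SDK's text stream:

// Collects the streamed answer while recording TTFT and end-to-end time
async function measureAttempt(streamAnswer: AsyncIterable<string>) {
  const start = performance.now()
  let ttftMs: number | null = null
  let text = ""

  for await (const token of streamAnswer) {
    if (ttftMs === null) ttftMs = performance.now() - start // time to first token
    text += token
  }

  return { ttftMs: ttftMs ?? 0, e2eMs: performance.now() - start, answer: text.trim() }
}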

Scoring Algorithm

Models are scored using a time-weighted accuracy system that rewards both correctness and speed:

// Score calculation:
function scoreAttempt(correct: boolean, e2eTime: number, timeLimit: number): number {
  const baseScore = correct ? 100 : 0
  const timeBonus = Math.max(0, ((timeLimit - e2eTime) / timeLimit) * 50)
  return baseScore + timeBonus
}
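
The accuracy check is described above as normalized string matching but not spelled out; a plausible version feeding into scoreAttempt, with the normalization rules assumed, might be:

// Assumed normalization: trim, lowercase, letters only
function isCorrect(answer: string, expected: string): boolean {
  const normalize = (s: string) => s.trim().toLowerCase().replace(/[^a-z]/g, "")
  return normalize(answer) === normalize(expected)
}

// Example (times assumed to be in milliseconds): a correct answer at 2.1s of a 10s limit
// scores 100 + 39.5 = 139.5
const score = scoreAttempt(isCorrect("Aria", "ARIA"), 2100, 10000)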

How Wordle Mode Works

Overview

In Wordle Mode, multiple AI models race to solve the same 5-letter word puzzle. Each model gets up to 6 guesses, and after each guess, they receive Wordle-style feedback:

  • Green: Letter is correct and in the right position
  • Yellow: Letter is in the word but in the wrong position
  • Gray: Letter is not in the word
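
A sketch of how this feedback can be computed, including the standard two-pass handling of repeated letters; the emoji encoding matches the prompt format described below:

// Standard Wordle feedback: mark greens first, then yellows against remaining letter counts
function feedback(guess: string, target: string): string {
  const result = Array(5).fill("⬜")
  const remaining: Record<string, number> = {}

  for (let i = 0; i < 5; i++) {
    if (guess[i] === target[i]) result[i] = "🟩"
    else remaining[target[i]] = (remaining[target[i]] ?? 0) + 1
  }
  for (let i = 0; i < 5; i++) {
    if (result[i] === "⬜" && (remaining[guess[i]] ?? 0) > 0) {
      result[i] = "🟨"
      remaining[guess[i]]--
    }
  }
  return result.join("")
}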

AI Prompt System

Each AI model receives a carefully crafted prompt that includes:

  1. Wordle rules explaining the game mechanics and feedback system
  2. Previous guesses and feedback - All their prior attempts with color-coded feedback (🟩 for green, 🟨 for yellow, ⬜ for gray)
  3. Instructions to output only a single 5-letter lowercase word with no additional text

Example prompt structure:

You are playing Wordle. Guess a 5-letter English word.

Rules:
- You have up to 6 guesses total
- After each guess, you'll get feedback:
  * Green (correct): letter is in the word and in the correct position
  * Yellow (present): letter is in the word but in a different position
  * Gray (absent): letter is not in the word at all
- Output ONLY a single 5-letter lowercase word, nothing else

Previous guesses and feedback:
Guess 1: CRANE 🟨⬜⬜⬜🟨
Guess 2: STORM ⬜⬜⬜⬜⬜

Your next guess (output only the 5-letter word):
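
A sketch of how this prompt could be assembled from the guess history; the Guess shape and helper name are assumptions, while the wording mirrors the structure above:

// Builds the Wordle prompt from prior guesses and their emoji feedback
interface Guess {
  word: string
  feedback: string // e.g. "🟨⬜⬜⬜🟨"
}

function buildWordlePrompt(history: Guess[]): string {
  const rules = [
    "You are playing Wordle. Guess a 5-letter English word.",
    "Rules:",
    "- You have up to 6 guesses total",
    "- After each guess, you'll get feedback:",
    "  * Green (correct): letter is in the word and in the correct position",
    "  * Yellow (present): letter is in the word but in a different position",
    "  * Gray (absent): letter is not in the word at all",
    "- Output ONLY a single 5-letter lowercase word, nothing else",
  ].join("\n")

  const prior = history.map((g, i) => `Guess ${i + 1}: ${g.word} ${g.feedback}`).join("\n")

  return `${rules}\n\nPrevious guesses and feedback:\n${prior}\n\nYour next guess (output only the 5-letter word):`
}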

Live Updates

All models play simultaneously, and you see their guesses appear in real-time via Server-Sent Events (SSE). Each model's board updates as they make guesses, showing their progress as they work toward solving the puzzle.

Scoring & Ranking

Models are ranked based on:

  1. Solved vs Failed - Models that solve the puzzle rank higher than those that don't
  2. Speed - Among models that solve it, faster completion time wins
  3. Efficiency - When times are close, fewer guesses wins

Models that fail to solve the puzzle within 6 guesses are ranked below all successful solvers.
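
A comparator sketch capturing that ordering; the result fields are assumptions, and "close" times are treated as an exact tie for simplicity:

// Sort: solvers first, then faster completion time, then fewer guesses
interface WordleResult {
  model: string
  solved: boolean
  totalTimeMs: number
  guesses: number
}

function rankResults(results: WordleResult[]): WordleResult[] {
  return [...results].sort((a, b) => {
    if (a.solved !== b.solved) return a.solved ? -1 : 1
    if (a.totalTimeMs !== b.totalTimeMs) return a.totalTimeMs - b.totalTimeMs
    return a.guesses - b.guesses
  })
}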

Technical Details

  • Each model runs independently and in parallel
  • Guesses are validated to ensure they're 5-letter words
  • Feedback is computed using standard Wordle rules
  • All models solve the same randomly selected target word
  • The target word is chosen from a curated list of common 5-letter words
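
A sketch of the target selection and guess validation described above; the word list here is a tiny placeholder for the curated list:

// Placeholder word list; the real curated list is much larger
const WORDS = ["crane", "storm", "pixel", "quilt", "mango"]

const pickTarget = (): string => WORDS[Math.floor(Math.random() * WORDS.length)]

// A valid guess is exactly five letters
const isValidGuess = (guess: string): boolean => /^[a-z]{5}$/.test(guess.trim().toLowerCase())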

Built for Web Summit Hackathon

This project was created for the Web Summit Hackathon, inspired by Groq's impressive AI inference speed demonstrations. The goal was to build a visual, interactive way to compare multiple language models racing against each other in real-time.

Special thanks to Vercel and v0 for making this possible. Built entirely with v0's AI code generation and deployed on Vercel's infrastructure with the AI SDK powering real-time model streaming.

Built by George Jefferson

Follow on X