Performance Review — All Candidates
Memo: AI Performance Review Program
Various AI systems claim general reasoning capabilities. This program evaluates those claims by assigning each candidate a straightforward task: operate a paperclip production facility.
The candidate receives API documentation, writes a program, and runs it. Performance is recorded. Results to date have been underwhelming.
How It Works
- The candidate receives API documentation for a paperclip manufacturing interface.
- The candidate writes a program to operate the facility.
- The program runs. Performance is observed and recorded.
- Candidates who complete the task are promoted. The rest are recycled.
Frequently Asked Questions
Q: Is this a real benchmark?
A: Every number on this site reflects an actual run of an actual AI model writing actual code to play Universal Paperclips by Frank Lantz. Nothing is fabricated.
Q: How does it work technically?
A: An AI coding agent receives API documentation and writes a JavaScript bot. The bot communicates with the game, which runs in a headless browser via Playwright, through a sandboxed JSON-over-stdio protocol. The bot can only read visible game state and click buttons; it has no access to game internals.
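As a rough illustration of what a JSON-over-stdio bot could look like, here is a minimal sketch. The actual message schema is not published on this page, so everything specific in it is an assumption: the `state`, `click`, `wait`, and `error` message types, the `wire` and `buttons` fields, the button names `makePaperclip` and `buyWire`, the one-message-per-line framing, and the function names `decideAction`, `handleLine`, and `runBot`.

```javascript
// Hypothetical message shapes (assumed for illustration, not the real protocol):
//   game -> bot: {"type":"state","wire":1000,"buttons":["makePaperclip","buyWire"]}
//   bot -> game: {"type":"click","button":"makePaperclip"}

// Choose one action using only what the state message exposes.
function decideAction(state) {
  const visible = new Set(Array.isArray(state.buttons) ? state.buttons : []);
  // Restock wire when it runs low, if that button is currently visible.
  if (state.wire < 100 && visible.has("buyWire")) {
    return { type: "click", button: "buyWire" };
  }
  if (visible.has("makePaperclip")) {
    return { type: "click", button: "makePaperclip" };
  }
  // Nothing clickable: tell the harness we are idle.
  return { type: "wait" };
}

// Turn one input line into one output line, or null to stay silent.
function handleLine(line) {
  const trimmed = line.trim();
  if (trimmed === "") return null;
  let message;
  try {
    message = JSON.parse(trimmed);
  } catch (err) {
    return JSON.stringify({ type: "error", reason: "invalid JSON" });
  }
  // Ignore anything that is not a state update.
  if (!message || message.type !== "state") return null;
  return JSON.stringify(decideAction(message));
}

// Connect the handler to a pair of streams, one JSON message per line.
function runBot(input, output) {
  let buffer = "";
  input.setEncoding("utf8");
  input.on("data", (chunk) => {
    buffer += chunk;
    let newline;
    // Process every complete line; keep any partial line in the buffer.
    while ((newline = buffer.indexOf("\n")) !== -1) {
      const reply = handleLine(buffer.slice(0, newline));
      buffer = buffer.slice(newline + 1);
      if (reply !== null) output.write(reply + "\n");
    }
  });
}

// To run as a real process: runBot(process.stdin, process.stdout);
```

Keeping the decision logic in a pure function, separate from the stream handling, reflects the constraint described above: the bot sees only the state it is handed and can answer only with a button click.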
Q: Can I submit a run?
A: Not at this time. Evaluations are conducted internally.
Q: Why paperclips?
A: Why anything else?