racer

SF Speedster: Ferrari Run (3 Levels + Enhanced)


Piescale.com


There are many evaluations of AI one-shotting working video games. This evaluation looks at how well the AI handles the follow-up edits, because that is a better test of understanding.

Ollama is probably the easiest way to run open source AI on your own computer. Its biggest gotcha is a default context window of 4k, which effectively lobotomizes the model without being obvious. The model was "self aware" enough to explain, when asked, why it was generating clearly incomplete files, but fixing it persistently in Ollama required making a new Modelfile and a new model. Once you change it, you can reliably edit and add to an existing code base.
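As a rough illustration of that fix, the sketch below builds the text of a Modelfile that raises the context window. `FROM` and `PARAMETER num_ctx` are Modelfile instructions; the function name `buildModelfile`, the model tag `base-model-tag`, the new model name, and the 32768 figure are all placeholders for illustration, not the values used for this game.

```javascript
// Build the text of an Ollama Modelfile that raises the context window.
// The base model tag and context size passed in below are hypothetical.
function buildModelfile(baseModel, numCtx) {
  if (!Number.isInteger(numCtx) || numCtx <= 0) {
    throw new RangeError("numCtx must be a positive integer");
  }
  // FROM names the model to start from; num_ctx sets the context length in tokens.
  return [`FROM ${baseModel}`, `PARAMETER num_ctx ${numCtx}`].join("\n") + "\n";
}

const modelfileText = buildModelfile("base-model-tag", 32768);
console.log(modelfileText);

// Save the printed text as a file named Modelfile, then register it as a new model:
//   ollama create my-long-context-model -f Modelfile
```

The new model name then replaces the original one wherever you run or call the model, so the larger context applies every time instead of per session.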

The main performance hit appears to be response time, which grows steeply and at times almost doubles from one turn to the next, because the entire conversation gets reprocessed every turn. For an example like this, continually editing 500-1000 lines of code, version 1 takes an hour to finish and version 5 takes six hours. You're often better off starting a new conversation for each feature request or bug fix, certainly for response time and possibly for response quality as well.
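To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It takes only the two timings quoted above (one hour for version 1, six hours for version 5) and assumes the growth between them is geometric. It also assumes a fresh conversation costs about as much as version 1 each time. Both are simplifications, and the function names are made up for this example.

```javascript
// Per-turn growth factor implied by the first and last timings,
// assuming each turn is slower than the previous by a constant factor.
function growthFactor(firstHours, lastHours, turns) {
  return Math.pow(lastHours / firstHours, 1 / (turns - 1));
}

// Total hours across all turns when each turn costs `factor` times the previous one.
function totalHours(firstHours, factor, turns) {
  let total = 0;
  let current = firstHours;
  for (let i = 0; i < turns; i++) {
    total += current;
    current *= factor;
  }
  return total;
}

const factor = growthFactor(1, 6, 5);            // one long conversation
const oneConversation = totalHours(1, factor, 5);
const freshEachTime = totalHours(1, 1, 5);       // assumed: every fresh turn costs ~1 hour

console.log(factor.toFixed(2));          // 1.57
console.log(oneConversation.toFixed(1)); // 14.8
console.log(freshEachTime.toFixed(1));   // 5.0
```

Under those assumptions, five versions in one conversation cost about 14.8 hours against about 5 hours for five fresh conversations, which is the gap the recommendation above rests on.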

The racer game was made entirely with Qwen. It started with 3.5 122b, until turn five failed to fix a restart button that needed "loop();" at the end of the restart function. That was fixed with 3.6 35b, which also contributed a few suggestions and a difficult level 3. Turn one with 3.6 was a copy-paste request to fix the RACE AGAIN button; it generated a working snippet showing where to place the loop call. Turn two was telling it the fix worked and asking for "any other suggestions", to which it gave a few, including a pause system with P, a screen shake on crash, persistent high scores, mobile touch controls, and some non-obvious code cleanups in snippet form.
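The restart bug is a common one in loop-driven games: the loop stops scheduling itself when the run ends, so resetting the state alone leaves nothing running. The sketch below is not the game's actual code. It is a minimal stand-in, where `createGame`, the `state` fields, and the three-frame "crash" are all invented for illustration, and the scheduler is passed in so the example runs outside a browser.

```javascript
// Minimal game-loop sketch. `schedule` stands in for requestAnimationFrame.
function createGame(schedule) {
  const state = { running: false, score: 0, frames: 0 };

  function loop() {
    if (!state.running) return;  // a finished run stops the loop for good
    state.frames += 1;
    state.score += 1;
    if (state.score >= 3) {
      state.running = false;     // hypothetical crash ends the run
      return;                    // note: no further frame is scheduled
    }
    schedule(loop);
  }

  function restart() {
    state.score = 0;
    state.running = true;
    loop(); // without this call the state resets but no frame ever runs again
  }

  return { state, restart };
}

// Drive the loop with a simple queue instead of a real frame timer.
const pending = [];
const game = createGame((fn) => pending.push(fn));
function drain() {
  while (pending.length > 0) pending.shift()();
}

game.restart();
drain();
console.log(game.state.running, game.state.frames); // false 3

game.restart();
drain();
console.log(game.state.running, game.state.frames); // false 6
```

In a browser the same shape would be created with `createGame((fn) => requestAnimationFrame(fn))`, with the RACE AGAIN button's click handler calling `game.restart()`.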