There are many evaluations of whether an AI can one-shot a working video game. This evaluation is about how well the AI handles all the follow-up edits, because that is a better test of understanding.

Once you dial up the context window, Qwen 3.6 is better than free-tier ChatGPT at these follow-up edits.

If you have 32 GB of RAM, you can run it.

I'm curious whether this carries over to other domains that are not as easy to test, or whether the Qwen team just did a really good job fine-tuning on video games after seeing game development become a trending AI test. Watching Gemma 4 vs Qwen 3.5 tests, it's clear that the ability to one-shot anything resembling a working game is new for any model this small.

Limited testing already suggests the 35 billion parameter model (Qwen 3.6) is outperforming the 122 billion parameter model (Qwen 3.5). To me this is further evidence that intelligence is going to be small, distributed, able to run on garbage hardware, and impossible to embargo or monopolize. This will not be the winner-take-all network effects of social media. This will not generate supernormal profit for anyone but a few Nvidia/Google types who already have healthy moats, margins built on physical scale, and a tech lead that doesn't depend on IP law. Anyone still dependent on those types will at best become a hot restaurant whose rent immediately gets jacked up until the proprietor is left with subsistence pay.

Ollama is probably the easiest way to run open source AI on your own computer. Its biggest gotcha is a default context window of 4k, which effectively lobotomizes the model without being obvious. The model was "self aware" enough to explain, when asked, why it was generating clearly incomplete files, but fixing it persistently in Ollama required making a new Modelfile and creating a new model from it. Once you change it, you can reliably edit and add to an existing code base.
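For anyone wanting to try the same fix, here is a minimal sketch of what that new Modelfile can look like. `FROM` and `PARAMETER num_ctx` are standard Modelfile instructions, but the base model tag, the new model name, and the 32768 value are placeholders I made up for illustration, not the settings used for this test. Pick a context size your RAM can actually hold.

```javascript
// Build the text of an Ollama Modelfile that raises the context window.
// The helper name, model tag, and context size are illustrative assumptions.
function buildModelfile(baseModel, numCtx) {
  if (!Number.isInteger(numCtx) || numCtx <= 0) {
    throw new Error("numCtx must be a positive integer");
  }
  // FROM names the model to start from; num_ctx sets the context window in tokens.
  return `FROM ${baseModel}\nPARAMETER num_ctx ${numCtx}\n`;
}

const modelfile = buildModelfile("qwen-base-tag", 32768); // hypothetical tag
console.log(modelfile);

// Save the printed text as a file named Modelfile, then create and run
// the new model (qwen-bigctx is a made-up name):
//   ollama create qwen-bigctx -f Modelfile
//   ollama run qwen-bigctx
```

The point of creating a separate model is that the larger window then applies every time you run it, instead of having to be set again each session.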

The main performance hit shows up in response times, which appear to almost double every turn because the entire conversation gets reprocessed each time. For an example like this one, continually editing 500-1000 lines of code, version 1 takes an hour to finish and version 5 takes six hours. You are often better off starting a new conversation for each feature request or bug fix, certainly for response time and possibly for response quality as well.
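As a back-of-envelope check on those numbers, the sketch below works out the average per-turn slowdown implied by going from one hour at version 1 to six hours at version 5. It assumes smooth geometric growth between the two measurements, which is a simplification, and the function name is made up for illustration.

```javascript
// Average per-turn growth factor implied by two measured response times,
// assuming the slowdown compounds evenly from turn to turn.
function growthPerTurn(firstHours, lastHours, firstTurn, lastTurn) {
  const steps = lastTurn - firstTurn;
  return Math.pow(lastHours / firstHours, 1 / steps);
}

// Figures from the text: version 1 took 1 hour, version 5 took 6 hours.
const factor = growthPerTurn(1, 6, 1, 5);
console.log(factor.toFixed(2)); // 1.57

// Total waiting time across versions 1 to 5 under the same assumption.
let totalHours = 0;
for (let turn = 0; turn < 5; turn++) {
  totalHours += Math.pow(factor, turn);
}
console.log(totalHours.toFixed(1)); // 14.8
```

So the measured endpoints average out to roughly 1.57x per turn, somewhat under a strict doubling (which would put version 5 at 16 hours), and to about 15 hours of waiting across the five versions. A fresh conversation resets that clock, which is why it tends to win on response time.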

The racer game was made entirely with Qwen. I started with 3.5 122b, until turn five failed to fix a restart button that needed "loop();" at the end of the restart function. That was fixed with 3.6 35b, which also added a few of its own suggestions and a difficult Level 3. Turn one with 3.6 was a copy-paste request to fix the RACE AGAIN button. It generated a working snippet showing where to place the loop call. Turn two was telling it the fix worked and asking for "any other suggestions", to which it gave a few, including a pause system with P, a screen shake on crash, persistent high scores, mobile touch controls, and some non-obvious code cleanups in snippet form. I was unable to paste its snippets into the existing code and get anything to work, so I told it that on turn three and asked for a complete .html file. It generated one, shrinking the code from 608 to 574 lines while growing the file size from 18.9 to 19.9 kB. Turn four asked for a Level 3.
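To show why a single missing "loop();" can break a restart button, here is a hypothetical reconstruction of that kind of bug. It is not the game's actual code: the `state` object, the three-frame "race", and the `schedule` queue (a stand-in for the browser's `requestAnimationFrame` so the sketch runs anywhere) are all my assumptions.

```javascript
// Stand-in for requestAnimationFrame: queue callbacks, run them on demand.
const pending = [];
const schedule = (fn) => pending.push(fn);
function runFrames(maxFrames) {
  for (let i = 0; i < maxFrames && pending.length > 0; i++) {
    pending.shift()();
  }
}

const state = { running: true, frames: 0 };

function loop() {
  if (!state.running) return; // race over: the loop stops rescheduling itself
  state.frames += 1;
  if (state.frames >= 3) state.running = false; // pretend the race ends here
  schedule(loop);
}

function restart() {
  state.running = true;
  state.frames = 0;
  loop(); // without this call the state resets but no frame ever runs again
}

loop();
runFrames(10);
console.log(state.running, state.frames); // false 3

restart();
runFrames(10);
console.log(state.running, state.frames); // false 3
```

Once the loop has returned without rescheduling, resetting the variables is not enough, because nothing is left in the queue to call it. The restart function has to kick the loop off again itself.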

The context window is a control knob that ChatGPT, Grok, Gemini, and Claude use to turn compute resources up and down. The details are a bit of a mystery, but it is becoming common knowledge that all the hosted models use some equivalent of rolling brownouts, which feel like rolling lobotomies, to manage peak demand. At the free tier, ChatGPT and Gemini are unable to coherently edit a bare-minimum game framework. Asking for one thing loses another, or sometimes several. Forward progress in a coherent environment is impossible. ChatGPT burns tokens talking up "visual enhancements" that transform squares into rectangles. Grok appeared capable, consistent, and coherent for fourteen turns, then hit a wall: it could not fix a win screen after seemingly infinite attempts. Starting a new Grok conversation with the most recent working version and asking for the fix did not work either.

For the time being, if I had to pick a tool to hammer out a complete working game of roughly 8-bit Nintendo quality, it would be Qwen 3.6.