Miles to go...

Microsoft shows progress toward real-time AI-generated game worlds

Despite improvements, Microsoft's new model is still mainly useful for low-res prototypes.

Kyle Orland
Adding a character using WHAM is as simple as dropping an image into existing footage. Credit: Microsoft / Nature

For a while now, many AI researchers have been working to integrate a so-called "world model" into their systems. Ideally, these models could infer, from video footage alone, how in-game objects and characters should behave, then create fully interactive video that instantly simulates new playable worlds from that understanding.

Microsoft Research's new World and Human Action Model (WHAM), revealed today in a paper published in the journal Nature, shows how far those models have advanced in a short time. But it also shows how much further we have to go before the dream of AI crafting complete, playable gameplay from just some basic prompts and sample video footage becomes a reality.

More consistent, more persistent

Much like Google's Genie model before it, WHAM starts by training on "ground truth" gameplay video and input data provided by actual players. In this case, that data comes from Bleeding Edge, a four-on-four online brawler released in 2020 by Microsoft subsidiary Ninja Theory. By collecting actual player footage since launch (as allowed under the game's user agreement), Microsoft gathered the equivalent of seven player-years' worth of gameplay video paired with real player inputs.
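To make that training setup concrete, here is a minimal, hypothetical sketch of how paired gameplay footage and controller input might be organized for a model like this. The class and field names are illustrative assumptions, not Microsoft's actual data schema.

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class TimeStep:
        frame: bytes                         # one encoded video frame captured from the game
        controller_state: Dict[str, float]   # e.g. {"stick_x": 0.4, "stick_y": -0.1, "attack": 1.0}

    @dataclass
    class GameplayClip:
        match_id: str
        steps: List[TimeStep]                # frames and inputs sampled at a fixed rate

        def training_pairs(self, context: int) -> List[Tuple[List[TimeStep], TimeStep]]:
            # The model learns to predict the next frame given the preceding
            # frames and the player inputs that accompanied them.
            return [
                (self.steps[i - context:i], self.steps[i])
                for i in range(context, len(self.steps))
            ]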

Early in that training process, Microsoft Research's Katja Hofmann said the model would get easily confused, generating inconsistent clips that would "deteriorate [into] these blocks of color." After 1 million training updates, though, the WHAM model started showing a basic understanding of complex gameplay interactions, such as a power cell item exploding after three hits from the player or the movements of a specific character's flight abilities. The results continued to improve as the researchers threw more computing resources and larger models at the problem, according to the Nature paper.

To see how well WHAM generates new gameplay sequences, Microsoft gave the model up to one second's worth of real gameplay footage and asked it to generate the frames that would follow based on new simulated inputs. To test the model's consistency, Microsoft used actual human input strings to generate up to two minutes of new AI-generated footage, which was then compared to actual gameplay using the Fréchet Video Distance metric.
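As a rough illustration of that evaluation loop (not the actual WHAM or FVD code), the sketch below assumes a model object exposing a generate_frame(history, action) method and a callable video-distance metric supplied by the caller; both names are placeholders.

    def evaluate_consistency(model, real_frames, real_actions, fvd_metric, context_frames=10):
        # Condition on roughly one second of real footage, then roll the model
        # forward using the recorded human inputs and compare against the real video.
        history = list(real_frames[:context_frames])      # real "prompt" footage
        generated = []
        for action in real_actions[context_frames:]:      # replay the real input string
            next_frame = model.generate_frame(history, action)
            generated.append(next_frame)
            history.append(next_frame)                    # autoregressive: feed output back in
        # A lower Fréchet Video Distance means the generated clip is statistically
        # closer to the real gameplay it is compared against.
        return fvd_metric(real_frames[context_frames:], generated)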

Microsoft boasts that WHAM's outputs can stay broadly consistent for up to two minutes without falling apart, with simulated footage lining up well with actual footage even as items and environments come in and out of view. That's an improvement over even the "long horizon memory" of Google's Genie 2 model, which topped out at a minute of consistent footage.

Microsoft also tested WHAM's ability to respond to a diverse set of randomized inputs not found in its training data. Based on human annotations of the resulting footage, the model produced broadly appropriate responses to many different input sequences, even as the best models fell a bit short of the "human-to-human baseline."

The most interesting result of Microsoft's WHAM tests, though, might be in the persistence of in-game objects. Microsoft provided examples of developers inserting images of new in-game objects or characters into pre-existing gameplay footage. The WHAM model could then incorporate that new image into its subsequent generated frames, with appropriate responses to player input or camera movements. With just five edited frames, the new object "persisted" appropriately in subsequent frames anywhere from 85 to 98 percent of the time, according to the Nature paper.
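As a hedged illustration of how that persistence figure might be computed, the sketch below assumes each generated frame has been annotated (by a person or a detector) with whether the inserted object is still visible; the function name and annotation format are assumptions, not the paper's methodology verbatim.

    def persistence_rate(object_visible_per_frame):
        # Fraction of generated frames in which the inserted object persists.
        # `object_visible_per_frame` holds one boolean per generated frame that
        # follows the handful of hand-edited frames introducing the new object.
        if not object_visible_per_frame:
            return 0.0
        return sum(object_visible_per_frame) / len(object_visible_per_frame)

    # Example: the object survives in 17 of 20 generated frames -> 85 percent,
    # the low end of the range Microsoft reports.
    print(persistence_rate([True] * 17 + [False] * 3))  # 0.85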

A long way to go

Despite all the improvements Microsoft boasts about in its WHAM model, the company says it still sees rough prototyping by game developers as the primary current use case. Developers can play around with a prototype "WHAM Demonstrator" on the Azure AI Foundry to see how the system can generate new interactive gameplay sequences based on just a few frames of video.

That demonstrator currently generates the resulting video based on pre-recorded inputs, at a rate much slower than necessary for actual live gameplay. In a private demonstration for press, though, Microsoft also showed an early prototype of a real-time WHAM-powered video-generation tool, which instantly generates new frames of gameplay based on immediate inputs from the user. Users can even jump from scene to scene instantly just by feeding a fresh set of sample frames into the system.
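A speculative sketch of that "generate as you go" loop might look like the following; the model method and the helper callables (read_controller, display) are illustrative stand-ins, not the WHAM Demonstrator's real API.

    def interactive_session(model, read_controller, display, seed_frames, max_history=16):
        # Drive the world model in real time: each user input produces the next
        # frame, which is fed back in as context for the frame after it.
        history = list(seed_frames)        # a few sample frames set the scene
        while True:
            action = read_controller()     # immediate input from the user
            if action is None:             # e.g. the user quits
                break
            frame = model.generate_frame(history[-max_history:], action)
            display(frame)
            history.append(frame)          # keep a rolling context for the next step

    # Feeding in a fresh set of `seed_frames` and restarting the loop is how a
    # user would "jump from scene to scene," per the demo described above.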

That kind of real-time, "generate as you go" world model is something of a holy grail for this branch of AI research. And while the current version Microsoft showed off "is definitely not the same as playing the game," as Hofmann said during the demonstration, it's also "decidedly not like a traditional video game experience," she said. "It has a new quality. It's really interesting to explore and see what I can do in this setting."

Don't get your hopes up for a new wave of AI-generated games any time soon, though. Microsoft's prototype WHAM tool is still limited to a muddy 300×180 resolution (comparable to a screen on the original Nintendo DS) at 10 frames per second, well below the playable baseline for modern games.

And despite all the much-ballyhooed improvements in consistency and persistence, there's still an ethereal, dreamlike quality to many of the objects shown, even in the low-res WHAM footage. The player character, in particular, tends to morph and stretch like a shapeshifter rather than holding together as a tight player model with a solid, consistent skeleton.

Still, Microsoft says it hopes WHAM is a first step toward a future where AI can craft high-end interactive experiences at the drop of a hat. "Hopefully this gives you a sense of just what we might be thinking about as we start to work towards interactive experiences that are generated on the fly by these real-time-capable generative AI models," Hofmann said.

Kyle Orland, Senior Gaming Editor
Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.