Kyle Orland
2024-09-05 13:29:40
arstechnica.com
Last month, Google’s GameNGen AI model showed that generalized image diffusion techniques can be used to generate a passable, playable version of Doom. Now, researchers are using some similar techniques with a model called MarioVGG to see if an AI model can generate plausible video of Super Mario Bros. in response to user inputs.
The results of the MarioVGG model—available as a pre-print paper published by the crypto-adjacent AI company Virtuals Protocol—still display a lot of apparent glitches, and it’s too slow for anything approaching real-time gameplay at the moment. But the results show how even a limited model can infer some impressive physics and gameplay dynamics just from studying a bit of video and input data.
The researchers hope this represents a first step toward “producing and demonstrating a reliable and controllable video game generator,” or possibly even “replacing game development and game engines completely using video generation models” in the future.
Watching 737,000 frames of Mario
To train their model, the MarioVGG researchers (GitHub users erniechew and Brian Lim are listed as contributors) started with a public data set of Super Mario Bros. gameplay containing 280 levels' worth of input and image data arranged for machine-learning purposes (level 1-1 was removed from the training data so images from it could be used in the evaluation). The more than 737,000 individual frames in that data set were “preprocessed” into 35-frame chunks so the model could start to learn what the immediate results of various inputs generally looked like.
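The paper doesn't publish its preprocessing code, but the chunking step it describes amounts to slicing aligned frame and input streams into fixed 35-frame windows. Here is a minimal sketch of that idea; the function name, data layout, and non-overlapping stride are assumptions for illustration, not the researchers' actual pipeline:

```python
from typing import Iterator

CHUNK_LEN = 35  # chunk size stated in the paper

def chunk_gameplay(frames: list, inputs: list,
                   step: int = CHUNK_LEN) -> Iterator[dict]:
    """Yield windows of CHUNK_LEN consecutive frames paired with the
    controller inputs recorded on those same frames."""
    assert len(frames) == len(inputs), "streams must be aligned"
    for start in range(0, len(frames) - CHUNK_LEN + 1, step):
        yield {
            "frames": frames[start:start + CHUNK_LEN],
            "inputs": inputs[start:start + CHUNK_LEN],
        }
```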
To “simplify the gameplay situation,” the researchers decided to focus only on two potential inputs in the data set: “run right” and “run right and jump.” Even this limited movement set presented some difficulties for the machine-learning system, though, since the preprocessor had to look backward for a few frames before a jump to figure out if and when the “run” started. Any jumps that included mid-air adjustments (i.e., the “left” button) also had to be thrown out because “this would introduce noise to the training dataset,” the researchers write.
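In code, that filtering logic might look something like the sketch below: keep only chunks that read as “run right” or “run right and jump,” scan backward a few frames before a jump to confirm the run had started, and discard any chunk containing a “left” press. Button names and the lookback length are hypothetical; the paper doesn't specify them in this form:

```python
LOOKBACK = 4  # assumed number of frames to scan back before a jump

def label_chunk(inputs: list[set[str]]) -> str | None:
    """Return 'run', 'jump', or None if the chunk should be discarded.
    Each element of `inputs` is the set of buttons held on that frame."""
    if any("left" in pressed for pressed in inputs):
        return None  # mid-air adjustments would add noise; throw it out
    jump_frames = [i for i, pressed in enumerate(inputs) if "jump" in pressed]
    if not jump_frames:
        # No jump anywhere: keep only if the player ran right throughout.
        return "run" if all("right" in pressed for pressed in inputs) else None
    # Look backward from the first jump to see if and when the run started.
    first = jump_frames[0]
    window = inputs[max(0, first - LOOKBACK):first]
    return "jump" if all("right" in pressed for pressed in window) else None
```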
After preprocessing (and about 48 hours of training on a single RTX 4090 graphics card), the researchers used a standard convolution and denoising process to generate new frames of video from a static starting game image and a text input (either “run” or “jump” in this limited case). While these generated sequences only last for a few frames, the last frame of one sequence can be used as the first frame of a new sequence, in theory creating gameplay videos of any length that still show “coherent and consistent gameplay,” according to the researchers.
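That chaining trick is simple to express as a loop: generate a short clip, seed the next clip with its final frame, and concatenate. The sketch below illustrates the structure; `generate_clip` stands in for the diffusion model's denoising step, and its name and signature are assumptions rather than the paper's actual API:

```python
def generate_video(start_frame, action: str, model, num_clips: int) -> list:
    """Chain short generated clips into one long video by reusing the
    last frame of each clip as the seed frame of the next."""
    video = [start_frame]
    for _ in range(num_clips):
        # The model returns a few frames conditioned on a single seed
        # frame plus a text action ("run" or "jump" in this case).
        clip = model.generate_clip(frame=video[-1], action=action)
        video.extend(clip[1:])  # clip[0] duplicates the seed frame
    return video
```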