Abstract
We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
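As a rough, unofficial illustration of how the three components named above might fit together at inference time, here is a minimal Python sketch. Every interface in it (`tokenizer.encode`/`decode`, `dynamics_model.predict`) is a hypothetical stand-in rather than a released API; only the overall flow, tokenize a prompt frame, choose a discrete latent action each step, then predict and decode the next frame's tokens, follows the paper's description.

```python
# Hypothetical sketch of Genie-style inference (stand-in interfaces, not released code).
def play(tokenizer, dynamics_model, prompt_frame, user_actions):
    """Roll out an interactive episode from a single prompt image.

    prompt_frame : a rendered prompt (text-to-image output, photograph, or sketch)
    user_actions : iterable of discrete latent-action ids, one chosen per frame
    """
    tokens = [tokenizer.encode(prompt_frame)]   # z_1: spatiotemporal tokens of the prompt
    actions = []                                # latent actions chosen so far
    frames = [prompt_frame]
    for a_t in user_actions:
        actions.append(a_t)
        z_next = dynamics_model.predict(tokens, actions)  # z_{t+1} | z_{1:t}, a_{1:t}
        tokens.append(z_next)
        frames.append(tokenizer.decode(z_next))           # render the generated frame
    return frames
```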
Community
Looks really interesting!
Great work!
Incredible!
What.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning (2024)
- Compositional Generative Modeling: A Single Model is Not All You Need (2024)
- Collaboratively Self-supervised Video Representation Learning for Action Recognition (2024)
- An Interactive Agent Foundation Model (2024)
- Generative Human Motion Stylization in Latent Space (2024)
Can this model be Lego-ed / broken apart and recombined to do other things? Especially since the LAM already seems quite capable with a bit of additional fine-tuning.
For example, I'd imagine you could predict the next optimal action (assuming the dynamics model is trained to play the game well) just by reconfiguring things a bit, without any additional training or fine-tuning:
Send a sequence of frame tokens $(z_1, \ldots, z_{t-1})$ and actions $(a_1, \ldots)$ into the dynamics model to get the next frame's tokens $z_t$, then use the LAM encoder to annotate that transition and output the next action to take (assuming the LAM encoder can be run on the autoregressively generated inputs).
You can run this autoregressively to "hallucinate" gameplay, or use it as a frame-by-frame agent to play the game.
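For what it's worth, here is a minimal sketch of that reconfiguration, under the assumptions stated in the comment. The interfaces (`dynamics_model.predict`, `lam_encoder.infer_action`) are hypothetical stand-ins, not released code, and the output is a latent action that would still need to be mapped onto real game controls.

```python
# Hypothetical sketch of the proposed reconfiguration (stand-in interfaces, not released code).
def act_frame_by_frame(dynamics_model, lam_encoder, tokens, actions, num_steps):
    """Use the dynamics model to imagine the next frame, then ask the LAM encoder
    which latent action that imagined transition corresponds to.

    tokens  : frame-token history z_1 .. z_{t-1} observed or generated so far
    actions : latent-action history a_1 .. a_{t-2} inferred so far
    Returns the latent actions proposed for the next `num_steps` steps.
    """
    proposed = []
    for _ in range(num_steps):
        z_next = dynamics_model.predict(tokens, actions)       # imagine z_t | z_{1:t-1}, a_{1:t-2}
        a_next = lam_encoder.infer_action(tokens[-1], z_next)  # label the transition z_{t-1} -> z_t
        proposed.append(a_next)
        # Feed both back to continue autoregressively ("hallucinated" gameplay); a real
        # frame-by-frame agent would instead execute a_next and append the observed frame.
        tokens.append(z_next)
        actions.append(a_next)
    return proposed
```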