We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior works focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and receive a short video chunk in response. To enable such realistic and responsive generation, we address key challenges including iterative chunk-wise prediction for low-latency feedback, temporal consistency across iterations, and accurate control response. RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, without requiring real-world action annotations. Notably, we observe two forms of generalization: (1) control transfer—RealPlay effectively maps control signals from virtual to real-world scenarios; and (2) entity transfer—although training labels originate solely from a car racing game, RealPlay generalizes to control diverse real-world entities, including bicycles and pedestrians, beyond vehicles.
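The interactive loop described above can be pictured as a simple chunk-wise rollout. The sketch below is illustrative only, assuming a hypothetical chunk-wise generator module that takes the previous chunk and an action tensor; the class and method names (`InteractiveSession`, `step`) are placeholders, not RealPlay's actual API.

```python
# Minimal sketch of the observe -> command -> chunk loop, assuming a
# hypothetical chunk-wise video generator (not RealPlay's real interface).
from dataclasses import dataclass, field
from typing import List

import torch


@dataclass
class InteractiveSession:
    """Holds the rolling context for iterative chunk-wise generation."""
    generator: torch.nn.Module      # hypothetical chunk-wise generator
    context_chunk: torch.Tensor     # last generated chunk, shape (T, C, H, W)
    history: List[torch.Tensor] = field(default_factory=list)

    @torch.no_grad()
    def step(self, action: torch.Tensor) -> torch.Tensor:
        """Generate the next short video chunk conditioned on the
        previously generated chunk and the user's control command."""
        next_chunk = self.generator(self.context_chunk, action)
        self.history.append(next_chunk)
        # The new chunk becomes the conditioning context for the next
        # iteration, which is what keeps the rollout temporally consistent.
        self.context_chunk = next_chunk
        return next_chunk
```

In each iteration the user inspects `next_chunk` and issues the next `action`, so latency is bounded by the cost of generating a single short chunk rather than a full video.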
RealPlay involves a two-stage training process. Stage-1: We adapt a pre-trained image-to-video generator (Figure (a))—which generates an entire video in a single pass conditioned on a single frame—into a chunk-wise generation model (Figure (b)), which generates video chunks iteratively, conditioned on the previously generated chunk. This adaptation includes several key modifications detailed in Section 3.1. Stage-2: RealPlay (Figure (c)) is trained on a combination of a labeled game dataset and an unlabeled real-world dataset, enabling action transfer from controlling a car in the game environment to manipulating various entities in the real world. This is achieved by modifying the chunk-wise generation model to incorporate action control through an adaptive LayerNorm mechanism.
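To make the Stage-2 conditioning concrete, the following is a minimal sketch of action control via adaptive LayerNorm, assuming the action has already been mapped to a learned embedding. The layer name and its placement are assumptions for illustration, not the exact design used in RealPlay.

```python
import torch
import torch.nn as nn


class ActionAdaLN(nn.Module):
    """Adaptive LayerNorm: the scale and shift of a LayerNorm are regressed
    from an action embedding, so the control signal modulates the features."""

    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        # The affine parameters come from the action, so the norm itself
        # carries no learnable affine terms.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(action_dim, 2 * hidden_dim),
        )
        # Zero-init so the layer starts as an identity modulation and the
        # pre-trained chunk-wise generator is not disturbed at the start
        # of Stage-2 training (a common choice, assumed here).
        nn.init.zeros_(self.to_scale_shift[-1].weight)
        nn.init.zeros_(self.to_scale_shift[-1].bias)

    def forward(self, x: torch.Tensor, action_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, hidden_dim); action_emb: (batch, action_dim)
        scale, shift = self.to_scale_shift(action_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

Because only the labeled game data provides action embeddings, this modulation path is what the model must learn to reuse when the same control signals are applied to unlabeled real-world scenes.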