Abstract: A mobile robotic system can gather a large amount of visual data associated with its own actions by exploring an environment and interacting with objects, with little or no human assistance. These data can be leveraged by a learning agent to construct a dynamic model of how visual scenes change in response to its own actions. This project explores the design of a deep neural network architecture and a staged training scheme for acquiring such visual dynamics models. A generative adversarial network (GAN) is used to approximate the manifold of real scenes, a corresponding deep encoder is trained to project novel scenes onto the manifold, and robot actions are interpreted as traversals on the manifold, modeled by a fully connected network. The proposed training scheme and architecture have been tested on both synthetic and real data. Qualitative results and analysis are detailed in this report.
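
Purely as an illustration of the three components named above (not the report's actual implementation), the following minimal PyTorch sketch wires a GAN generator, an encoder that projects scenes onto the latent manifold, and a fully connected action network that models manifold traversal. All module names, layer sizes, the latent dimension, and the action dimension are assumptions introduced here for clarity.

```python
# Sketch of the described architecture; every size and name below is an
# illustrative assumption, not the report's actual configuration.
import torch
import torch.nn as nn

LATENT_DIM = 64   # assumed size of the GAN's latent (manifold) space
ACTION_DIM = 4    # assumed size of the robot action vector

class Generator(nn.Module):
    """GAN generator: maps a latent code to a scene (manifold of real scenes)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, 64 * 64 * 3), nn.Tanh(),  # 64x64 RGB scene, flattened
        )
    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

class Encoder(nn.Module):
    """Projects a novel scene onto the learned manifold (image -> latent code)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 64 * 3, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )
    def forward(self, x):
        return self.net(x)

class ActionNet(nn.Module):
    """Fully connected net: interprets an action as a traversal on the manifold,
    i.e. maps (current latent code, action) -> next latent code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, LATENT_DIM),
        )
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

# One forward pass of the full visual dynamics model: encode the current scene,
# apply the action in latent space, then decode the predicted next scene.
G, E, A = Generator(), Encoder(), ActionNet()
scene = torch.randn(1, 3, 64, 64)    # stand-in for a camera frame
action = torch.randn(1, ACTION_DIM)  # stand-in for a robot action vector
z_next = A(E(scene), action)
predicted_next_scene = G(z_next)     # predicted scene after the action
```

In this reading, the staged training scheme would fit naturally around these pieces (e.g., train the GAN first, then the encoder against the frozen generator, then the action network on latent-code pairs), though the exact staging is specified in the body of the report rather than here.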