Figure 2: Overview of our scene-representation methodology for a sample text observation taken from Zork1, using a text-to-image generative model. We call this model SceneIT and use it by default for all experiments in this paper. In this case, we use the built-in CNN-based image encoder (Inception v3 (Szegedy et al., 2016)) to map the generated images to image features.
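As a concrete illustration of this last step, the sketch below maps a generated scene image to a feature vector with a pre-trained Inception v3 from torchvision. This is a minimal sketch, not the authors' code: the 2048-dimensional pooled features and the preprocessing constants follow standard torchvision usage, and encode_image is a hypothetical helper name.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load Inception v3 pre-trained on ImageNet and drop the classification
# head so the network outputs 2048-d pooled features instead of logits.
encoder = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()
encoder.eval()

# Inception v3 expects 299x299 inputs normalized with ImageNet statistics.
preprocess = T.Compose([
    T.Resize((299, 299)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_image(path):
    """Map one generated scene image to a 2048-d feature vector."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        return encoder(preprocess(img).unsqueeze(0)).squeeze(0)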
Since the reward from the game can guide the text-to-image generator (AttnGAN) to produce images that are meaningful for the current context of the game, we fine-tune the pre-trained AttnGAN together with the encoders and the action selector to yield the best results.
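The following is a hedged sketch of what such end-to-end fine-tuning could look like. The four modules here are tiny stand-ins (the actual AttnGAN generator and the text/image encoders are far larger, and AttnGAN is not a pip package), the feature dimensions are illustrative, and the REINFORCE-style update is an assumption: the text above says the reward guides all components but does not pin down the exact RL algorithm.

import torch
import torch.nn as nn

# Tiny stand-ins so the sketch runs end to end; dimensions are illustrative.
text_encoder = nn.Linear(300, 128)      # placeholder text encoder
attngan = nn.Linear(128, 64)            # placeholder for the AttnGAN generator
image_encoder = nn.Linear(64, 128)      # placeholder image encoder
action_selector = nn.Linear(256, 10)    # MLP head scoring 10 candidate actions

# One optimizer over all four modules, so the game reward reaches each of them.
optimizer = torch.optim.Adam(
    [p for m in (text_encoder, attngan, image_encoder, action_selector)
     for p in m.parameters()],
    lr=1e-4)

def step(text_feat, reward):
    """Select one action, then let the game reward update all four modules."""
    t = text_encoder(text_feat)                          # encode the observation
    v = image_encoder(attngan(t))                        # "generate" and encode an image
    scores = action_selector(torch.cat([t, v], dim=-1))  # score candidate actions
    dist = torch.distributions.Categorical(logits=scores)
    action = dist.sample()
    loss = -dist.log_prob(action) * reward               # REINFORCE-style objective
    optimizer.zero_grad()
    loss.backward()                                      # gradients reach the generator too
    optimizer.step()
    return action.item()

# Example: one update with a random 300-d text feature and reward +1.
step(torch.randn(300), reward=1.0)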
Specifically, we use ResNet-50 for encoding the retrieved images and the images generated from the pre-trained AttnGAN. The text and image encoding features are then concatenated and passed to the action selector (as shown in Figure 2), which maps the encoded features to action scores using a multi-layer perceptron (MLP) to select the next action. Based on the reward from the game environment, we update the text and image encoders and the action selector.
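A minimal sketch of this architecture is given below, assuming illustrative sizes: ResNet-50 with its classifier removed yields 2048-dimensional image features, text_dim stands in for whatever the text encoder produces, and ActionSelector is a hypothetical class name. This is an interpretation of the description above, not the authors' implementation.

import torch
import torch.nn as nn
import torchvision.models as models

class ActionSelector(nn.Module):
    """Concatenate text and image features; score candidate actions with an MLP."""
    def __init__(self, text_dim, image_dim, num_actions):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),  # hidden size is illustrative
            nn.ReLU(),
            nn.Linear(512, num_actions),           # one score per candidate action
        )

    def forward(self, text_feat, image_feat):
        return self.mlp(torch.cat([text_feat, image_feat], dim=-1))

# ResNet-50 with the classification layer replaced by identity produces
# 2048-d pooled features for both retrieved and AttnGAN-generated images.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = nn.Identity()
resnet.eval()

selector = ActionSelector(text_dim=768, image_dim=2048, num_actions=10)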
Text-based games (TBGs) have become a popular proving ground for the demonstration of learning-based agents that make decisions in quasi real-world settings. The crux of the problem for a reinforcement learning agent in such TBGs is identifying the objects in the world, and those objects' relations with that world. While the recent use of text-based resources for increasing an agent's knowledge and improving its generalization has shown promise, we posit in this paper that there is much yet to be learned from visual representations of these worlds. Specifically, we propose to retrieve images that represent specific instances of text observations from the world and train our agents on such images. This improves the agent's overall understanding of the game 'scene' and objects' relationships to the world around them, and the variety of visual representations on offer allows the agent to better generalize those relationships, which improves the performance of agents in various TBG settings.