Learning to Play Minecraft with Video PreTraining
Paper
Notes while Reading
- General idea is to pretrain a model on video, since this learns a behavioral prior. This is done in a semi-supervised way: a small labeled set is collected to train an inverse dynamics model, which is then used to pseudo-label the rest of the video data for large-scale imitation learning.
- Mentions future work as including general computer usage - wonder if this pipeline is generally similar to how Adept AI trains their models?
- Imitation learning is more sample-efficient than reinforcement learning (especially for hard tasks where reward is sparse), but in robotics we generally lack the labeled demonstrations needed to do imitation learning
- Inverse Dynamics Modeling for sample efficient behavioral cloning
- Previous work (Torabi et al. - Behavioral Cloning from Observation) approached the problem of learning from unlabeled demonstrations by simultaneously training an inverse dynamics model (which aims to recover \(p(a_t \mid o_t, o_{t+1})\)) alongside a behavioral cloning model trained on trajectories of observations labeled with the IDM. Thus, as the IDM labels improve, so does the behavioral cloning model. The problem with this approach is that if the IDM labels some sequences of observations poorly, the BC model imitates those incorrect labels, which leads to poor actions and inefficient exploration of the action space.
- This paper first gets a good IDM via supervised training on a smaller amount of data collected from humans playing Minecraft (with keystrokes and mouse presses recorded). It then keeps this IDM fixed, uses it to generate pseudo-labels for the rest of the data, and then carries out large-scale BC (see the pipeline sketch after these notes).
- The small amount of collected data for training the supervised IDM still requires some easy way to record human labels. The equivalent for a robot in the real world would be something like a human expert controlling the robot while the input actions are logged, with a video of the robot serving as the observations.
- Related work/cool papers I found while digging around:
- Trained for 4 days on 32 A100 GPUs - the bigger VPT model was trained with 720 V100s
- This approach is shown to be more sample-efficient for modeling intent, since the IDM can train on both past and future observations, as opposed to causal behavioral cloning, which can only condition on past observations
- The larger-scale pretraining is just simple behavioral cloning using the pseudo-labels for actions generated by the IDM
- Data filtering for videos is done by first collecting labels for clean frames, then generating embeddings for those images using a ResNet CLIP model, then training an SVM on the embeddings to get a frame classifier (see the filtering sketch after these notes)
- In order to combat “catastrophic forgetting”, a KL divergence loss is added between the RL model and the frozen pretrained policy to ensure that the learned policy does not diverge too drastically from the pretrained one
- During fine-tuning, this KL loss replaces the entropy-maximization term normally used to encourage exploration, which doesn't work as well for very large action spaces with sparse rewards like Minecraft's (see the fine-tuning loss sketch after these notes)
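
A minimal sketch of the two-stage pipeline described above, assuming a small discrete action space and single-frame pairs for brevity (the paper's IDM looks at a longer window of past and future frames, and its real architecture and keyboard/mouse action space are much larger). All model sizes and hyperparameters here are placeholders.

```python
# Hedged sketch of the VPT-style pipeline: (1) supervised IDM training on a
# small labeled set, (2) pseudo-labeling unlabeled video with the frozen IDM,
# (3) large-scale behavioral cloning on those pseudo-labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACTIONS = 16  # placeholder; the real action space covers keyboard + mouse

class InverseDynamicsModel(nn.Module):
    """Non-causal: predicts p(a_t | o_t, o_{t+1}), so it sees the *next* frame too."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(N_ACTIONS)

    def forward(self, obs_t, obs_tp1):
        x = torch.cat([obs_t, obs_tp1], dim=1)  # stack consecutive RGB frames
        return self.head(self.encoder(x))       # action logits

class BCPolicy(nn.Module):
    """Causal: only the current observation is available when acting."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(N_ACTIONS)

    def forward(self, obs_t):
        return self.head(self.encoder(obs_t))

def train_idm(idm, labeled_loader, lr=1e-4):
    """Stage 1: supervised IDM training on the small human-labeled set."""
    opt = torch.optim.Adam(idm.parameters(), lr=lr)
    for obs_t, obs_tp1, action in labeled_loader:
        loss = F.cross_entropy(idm(obs_t, obs_tp1), action)
        opt.zero_grad()
        loss.backward()
        opt.step()

@torch.no_grad()
def pseudo_label(idm, obs_t, obs_tp1):
    """Stage 2: the frozen IDM labels frames from unlabeled web video."""
    return idm(obs_t, obs_tp1).argmax(dim=-1)

def train_bc(policy, idm, unlabeled_loader, lr=1e-4):
    """Stage 3: large-scale behavioral cloning on the IDM's pseudo-labels."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for obs_t, obs_tp1 in unlabeled_loader:
        targets = pseudo_label(idm, obs_t, obs_tp1)
        loss = F.cross_entropy(policy(obs_t), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
```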
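
A sketch of the data-filtering step (CLIP image embeddings plus an SVM frame classifier). It assumes the open-source `clip` package from OpenAI; the file paths, labels, and SVM settings are placeholders, not the paper's actual configuration.

```python
# Hedged sketch: embed frames with a ResNet-based CLIP image encoder, fit an
# SVM on a small hand-labeled set of clean vs. unclean frames, and use it to
# filter the rest of the video frames.
import clip                    # OpenAI's open-source CLIP package
import numpy as np
import torch
from PIL import Image
from sklearn.svm import SVC

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)   # ResNet CLIP variant

@torch.no_grad()
def embed(paths):
    """CLIP image embeddings for a list of frame image paths."""
    batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    return model.encode_image(batch).float().cpu().numpy()

# Small hand-labeled set: 1 = clean gameplay frame, 0 = overlays/menus/etc.
labeled_paths = ["clean_000.png", "clean_001.png", "unclean_000.png"]  # placeholders
labels = np.array([1, 1, 0])

frame_classifier = SVC()
frame_classifier.fit(embed(labeled_paths), labels)

# Keep only the frames the classifier predicts as clean.
candidate_paths = ["frame_000.png", "frame_001.png"]                  # placeholders
keep_mask = frame_classifier.predict(embed(candidate_paths)) == 1
```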
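
A sketch of the fine-tuning objective where a KL term to the frozen pretrained policy stands in for the usual entropy bonus. The policy-gradient term is a generic advantage-weighted log-probability; the KL direction, coefficient, and the actual RL algorithm used in the paper may differ.

```python
# Hedged sketch of the RL fine-tuning loss: a policy-gradient term plus a KL
# penalty that keeps the fine-tuned policy close to the frozen pretrained
# (behavioral-prior) policy, replacing an entropy-maximization bonus.
import torch
import torch.nn.functional as F

def finetune_loss(policy_logits, pretrained_logits, actions, advantages, kl_coef=0.2):
    """policy_logits:     (B, A) logits from the policy being fine-tuned with RL
       pretrained_logits: (B, A) logits from the frozen pretrained BC policy
       actions, advantages: (B,) sampled actions and their advantage estimates
       kl_coef: placeholder weight; the paper tunes/schedules its own coefficient"""
    log_probs = F.log_softmax(policy_logits, dim=-1)

    # Generic policy-gradient term (advantage-weighted log-prob of taken actions).
    taken_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(advantages * taken_log_probs).mean()

    # KL(pretrained || current): penalizes drifting away from the behavioral
    # prior; used here in place of an entropy bonus.
    with torch.no_grad():
        pretrained_log_probs = F.log_softmax(pretrained_logits, dim=-1)
    kl = F.kl_div(log_probs, pretrained_log_probs, log_target=True,
                  reduction="batchmean")

    return pg_loss + kl_coef * kl
```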
Summary
- One of the main contributions of this paper is proposing to use inverse dynamics models to learn the action distribution of an environment as a behavioral prior for encouraging effective exploration using RL.
- Future directions:
- Using VPT as a general representation learning method, since learning a distribution over actions agents are taking in a video should lead to a good understanding of what is happening in a scene
- Asking VPT to perform specific tasks by conditioning on text