Abstract: Large Vision Language Models (VLMs) have been adopted in robotics for their strong common sense understanding and generalization capabilities. Existing works leverage VLMs for task and ...