Policy Improvement using Language Feedback Models

prompt-engineering
architectures
production
LFMs identify desirable behavior for imitation learning, outperforming LLM experts that directly predict actions and improving task-completion rates.
Author

Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté

Published

February 12, 2024

Summary:

The article introduces Language Feedback Models (LFMs), which identify desirable behavior for imitation learning in instruction following. LFMs are trained on feedback from Large Language Models (LLMs) about visual trajectories that have been verbalized into language descriptions. The article reports three major findings, listed under Major Findings below: LFMs improve task-completion rates over strong behavioral cloning baselines, outperform LLMs used directly as action-predicting experts at a matched output-token budget, and generalize to unseen environments through a single round of adaptation. Additionally, LFMs can provide human-interpretable feedback without loss of performance, allowing humans to verify the behavior selected for imitation learning.
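
The pipeline described above can be pictured as three stages: verbalize the policy's steps into text, ask an LLM whether each step is desirable, train a small feedback model on those labels, and then behavior-clone only the steps the feedback model approves. The sketch below illustrates that flow under stated assumptions; every name (verbalize, llm_judge, FeedbackModel, policy, env) is a hypothetical stand-in, not the authors' implementation.

```python
# Hedged sketch of the LFM training-and-imitation loop described above.
# All interfaces here (verbalize, llm_judge, feedback_model, policy, env)
# are hypothetical stand-ins, not the paper's actual code.

def collect_llm_feedback(policy, env, llm_judge, verbalize, episodes=100):
    """Roll out the base policy and label each step with LLM feedback."""
    data = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs)
            # Verbalize the visual observation and action into a text description.
            text = verbalize(obs, action, env.instruction)
            # Ask the LLM whether this step makes progress on the instruction.
            data.append((text, llm_judge(text)))  # label: True = desirable
            obs, done = env.step(action)
    return data

def train_feedback_model(feedback_model, data):
    """Fit a small feedback model (the LFM) on the LLM-provided labels."""
    texts, labels = zip(*data)
    feedback_model.fit(texts, labels)
    return feedback_model

def imitate_desirable_steps(policy, env, lfm, verbalize, episodes=100):
    """Policy improvement: behavior-clone only the steps the LFM marks desirable."""
    demos = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs)
            if lfm.predict(verbalize(obs, action, env.instruction)):
                demos.append((obs, action))
            obs, done = env.step(action)
    policy.behavior_clone(demos)
    return policy
```

Because the LLM is queried only during label collection, the trained LFM can later score new rollouts cheaply, which is what enables the adaptation result discussed below.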

Major Findings:

  1. LFMs improve task-completion rate over strong behavioral cloning baselines on three distinct language grounding environments (Touchdown, ScienceWorld, and ALFWorld).
  2. LFMs outperform LLMs used as experts to directly predict actions when controlling for the number of LLM output tokens.
  3. LFMs generalize to unseen environments, improving task-completion rate by 3.5-12.0% through one round of adaptation (sketched below).
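
Finding 3 corresponds to reusing the trained LFM in a new environment without further LLM queries: the policy's own rollouts in the unseen environment are scored by the LFM, and the policy imitates the steps judged desirable. Continuing the hypothetical sketch from the Summary section:

```python
# One round of adaptation to an unseen environment (hypothetical names as above):
# the pretrained LFM selects desirable steps from rollouts in the new environment,
# and the policy is behavior-cloned on them; no additional LLM calls are made.
adapted_policy = imitate_desirable_steps(policy, unseen_env, lfm, verbalize)
```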

Analysis and Critique:

  • The article does not address potential biases in the LLM feedback that could influence the training of LFMs.
  • The comparison to DAgger shows that LFMs outperform LLMs used as experts for imitation learning, but it would be beneficial to investigate the reasons for this performance difference further.
  • The article does not discuss the potential ethical implications of using LFMs for policy improvement, especially in real-world applications. Further exploration of the broader impact of LFMs is necessary.

Appendix

Model gpt-3.5-turbo-1106
Date Generated 2024-02-26
Abstract https://arxiv.org/abs/2402.07876v1
HTML https://browse.arxiv.org/html/2402.07876v1
Truncated False
Word Count 8834