LLMs for Robotic Object Disambiguation

prompt-engineering
Large language models (LLMs) excel at decision-making challenges in robotics but struggle with object disambiguation unless given additional prompting.
Authors

Connie Jiang

Yiqing Xu

David Hsu

Published

January 7, 2024

Major Findings

  1. Pre-trained large language models (LLMs) can solve complex decision-making challenges in robotics, such as object disambiguation in tabletop environments: the model efficiently identifies and retrieves a desired object from a cluttered scene.
  2. LLMs can efficiently disambiguate an object from an arbitrarily large tabletop scene by harnessing the “common sense” knowledge embedded in the model.
  3. Few-shot prompt engineering significantly improves the LLM’s ability to pose disambiguating queries, letting the model build and navigate a precise decision tree down to the correct object, even when faced with identical options (a sketch of such a prompt follows this list).
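
What such few-shot prompting might look like, as a minimal sketch: the exemplars, wording, and helper name here are illustrative assumptions, not the authors’ actual prompts.

```python
# Hypothetical few-shot prompt for eliciting disambiguating questions.
# Exemplars and wording are illustrative, not the paper's exact prompts.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXEMPLARS = """\
Scene: a red mug, a blue mug, a red bowl
User: "Pass me the red one."
Question: "Do you want the red mug or the red bowl?"

Scene: two green cups, one of them chipped
User: "Give me a green cup."
Question: "Is either green cup fine, or do you want the unchipped one?"
"""

def ask_disambiguating_question(scene: str, request: str) -> str:
    """Ask the LLM for one question that best narrows the candidate objects."""
    prompt = (
        "You help a robot identify which object a user wants.\n"
        "Given a scene and an ambiguous request, ask ONE question that "
        "splits the remaining candidates as evenly as possible.\n\n"
        f"{FEW_SHOT_EXEMPLARS}\n"
        f"Scene: {scene}\n"
        f'User: "{request}"\n'
        "Question:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",  # model listed in the appendix
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```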

I Introduction

  • Disambiguating objects in a scene poses several challenges:
    • developing a multi-step plan for disambiguation;
    • inferring new features when the provided scene description is insufficient.
  • Previous approaches to this task have notable limitations.

III Problem Formulation

  • Generalizing across user requests: the system should interpret and respond to any reasonable phrasing of a request.
  • Maneuvering around occlusions, e.g., relocating obstructing objects to reach the desired one.
  • Disambiguating the target object is the primary focus; the task and its limitations are described in detail (a minimal encoding of the problem is sketched below).
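
One hypothetical way to encode the task, purely for illustration; the paper’s actual formalism may differ.

```python
# Minimal, assumed encoding of the disambiguation problem; the paper's
# formalism may differ.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str                                                  # e.g. "mug"
    attributes: dict[str, str] = field(default_factory=dict)   # e.g. {"color": "red"}
    occluded_by: list[str] = field(default_factory=list)       # objects blocking access

@dataclass
class DisambiguationTask:
    scene: list[SceneObject]   # cluttered tabletop contents
    request: str               # possibly ambiguous user instruction

    def candidates(self) -> list[SceneObject]:
        """Objects consistent with the request. The goal is to shrink this
        set to exactly one via targeted questions, then clear any occluders
        before retrieval."""
        return [obj for obj in self.scene if obj.name in self.request.lower()]
```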

IV Proposed Method

  • A few-shot prompt-engineering approach is proposed that enables the LLM to generate its own distinguishing features.
  • Results from this approach, with a worked example, illustrate the improvement in the model’s ability to infer features (see the sketch after this list).
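
A hedged sketch of the feature-generation step: when the listed attributes cannot separate the remaining candidates, the model is asked to propose a new distinguishing feature. The function name and prompt wording are assumptions, not the paper’s implementation.

```python
# Illustrative sketch: prompt the model to invent a new distinguishing
# feature when listed attributes cannot separate the candidates.
from openai import OpenAI

client = OpenAI()

def propose_new_feature(candidates: list[str]) -> str:
    prompt = (
        "These objects match the user's request but cannot be told apart "
        "by their listed attributes:\n"
        + "\n".join(f"- {c}" for c in candidates)
        + "\nName ONE additional physical feature (e.g. material, size, wear) "
        "that is most likely to differ between them, and phrase a question "
        "about it."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```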

V Experiments

  • The model’s performance is compared against four baselines: optimal split, enumeration, human performance, and POMDP-ATTR.
  • Experiments in twelve distinct scenes evaluate the model’s performance and accuracy (an assumed reading of the optimal-split baseline is sketched below).
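
The summary does not spell out the baselines; assuming “optimal split” means each yes/no question halves the remaining candidates, it gives an information-theoretic floor on the number of queries:

```python
# Assumed reading of the "optimal split" baseline: each binary question
# halves the candidate set, so isolating one object among n candidates
# needs about ceil(log2 n) questions.
import math

def optimal_split_queries(n_candidates: int) -> int:
    """Lower bound on yes/no questions needed to isolate one candidate."""
    return math.ceil(math.log2(n_candidates)) if n_candidates > 1 else 0

for n in (2, 8, 20):
    print(f"{n} candidates -> {optimal_split_queries(n)} question(s)")
```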

VI Results

  • The proposed model performs effectively; its efficiency and success rate in disambiguating target objects are detailed.
  • The findings are presented through visual representations of the results.

VII Next Steps

  • Next steps for the research include completing the visual portion of the pipeline and elaborating on zero-shot and few-shot prompting.

Critique

  • While the study demonstrates the effectiveness of LLMs for object disambiguation, it also highlights limitations in inferring unspecified features, which could pose challenges in more complex scenes.
  • The baseline comparison provides a useful benchmark, but the study would benefit from comparison against a wider range of existing methods in the field.

Appendix

Model: gpt-3.5-turbo-1106
Date Generated: 2024-02-26
Abstract: http://arxiv.org/abs/2401.03388v1
HTML: https://browse.arxiv.org/html/2401.03388v1
Truncated: False
Word Count: 5796