LLMs for Robotic Object Disambiguation

prompt-engineering
Large language models (LLMs) excel at decision-making challenges in robotics but struggle with object disambiguation unless given additional prompting.
Authors

Connie Jiang

Yiqing Xu

David Hsu

Published

January 7, 2024

Major Findings

  1. Pre-trained large language models (LLMs) can solve complex decision-making challenges in robotics, such as object disambiguation in tabletop environments: the model efficiently identifies and retrieves a desired object from a cluttered scene.
  2. LLMs can efficiently disambiguate an object from an arbitrarily large tabletop scene by harnessing the “common sense” knowledge embedded in the model.
  3. Few-shot prompt engineering significantly improves the LLM’s ability to pose disambiguating queries, letting the model build and navigate a precise decision tree down to the correct object, even when faced with identical options (a sketch of such a prompt follows this list).
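
What such few-shot prompting might look like, as a minimal sketch: the exemplars, wording, and helper name here are illustrative assumptions, not the authors’ actual prompts.

```python
# Hypothetical few-shot prompt for eliciting disambiguating questions.
# Exemplars and wording are illustrative, not the paper's exact prompts.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXEMPLARS = """\
Scene: a red mug, a blue mug, a red bowl
User: "Pass me the red one."
Question: "Do you want the red mug or the red bowl?"

Scene: two green cups, one of them chipped
User: "Give me a green cup."
Question: "Is either green cup fine, or do you want the unchipped one?"
"""

def ask_disambiguating_question(scene: str, request: str) -> str:
    """Ask the LLM for one question that best narrows the candidate objects."""
    prompt = (
        "You help a robot identify which object a user wants.\n"
        "Given a scene and an ambiguous request, ask ONE question that "
        "splits the remaining candidates as evenly as possible.\n\n"
        f"{FEW_SHOT_EXEMPLARS}\n"
        f"Scene: {scene}\n"
        f'User: "{request}"\n'
        "Question:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",  # model listed in the appendix
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```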

I Introduction

  • Disambiguating objects in a scene poses several challenges:
    • developing a multi-step plan for disambiguation;
    • inferring new features when the provided scene description is insufficient.
  • Previous approaches to this task have notable limitations.

III Problem Formulation

  • Generalizing across user requests: the system should interpret and respond to any reasonable phrasing of a request.
  • Maneuvering around occlusions, e.g., relocating obstructing objects to reach the desired one.
  • Disambiguating the target object is the primary focus; the task and its limitations are described in detail (a minimal encoding of the problem is sketched below).
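
One hypothetical way to encode the task, purely for illustration; the paper’s actual formalism may differ.

```python
# Minimal, assumed encoding of the disambiguation problem; the paper's
# formalism may differ.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str                                                  # e.g. "mug"
    attributes: dict[str, str] = field(default_factory=dict)   # e.g. {"color": "red"}
    occluded_by: list[str] = field(default_factory=list)       # objects blocking access

@dataclass
class DisambiguationTask:
    scene: list[SceneObject]   # cluttered tabletop contents
    request: str               # possibly ambiguous user instruction

    def candidates(self) -> list[SceneObject]:
        """Objects consistent with the request. The goal is to shrink this
        set to exactly one via targeted questions, then clear any occluders
        before retrieval."""
        return [obj for obj in self.scene if obj.name in self.request.lower()]
```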

IV Proposed Method

  • A few-shot prompt-engineering approach is proposed that enables the LLM to generate its own distinguishing features.
  • Results from this approach, with a worked example, illustrate the improvement in the model’s ability to infer features (see the sketch after this list).
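
A hedged sketch of the feature-generation step: when the listed attributes cannot separate the remaining candidates, the model is asked to propose a new distinguishing feature. The function name and prompt wording are assumptions, not the paper’s implementation.

```python
# Illustrative sketch: prompt the model to invent a new distinguishing
# feature when listed attributes cannot separate the candidates.
from openai import OpenAI

client = OpenAI()

def propose_new_feature(candidates: list[str]) -> str:
    prompt = (
        "These objects match the user's request but cannot be told apart "
        "by their listed attributes:\n"
        + "\n".join(f"- {c}" for c in candidates)
        + "\nName ONE additional physical feature (e.g. material, size, wear) "
        "that is most likely to differ between them, and phrase a question "
        "about it."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```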

V Experiments

  • The model’s performance is compared against four baselines: optimal split, enumeration, human performance, and POMDP-ATTR.
  • Experiments in twelve distinct scenes evaluate the model’s performance and accuracy (an assumed reading of the optimal-split baseline is sketched below).
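
The summary does not spell out the baselines; assuming “optimal split” means each yes/no question halves the remaining candidates, it gives an information-theoretic floor on the number of queries:

```python
# Assumed reading of the "optimal split" baseline: each binary question
# halves the candidate set, so isolating one object among n candidates
# needs about ceil(log2 n) questions.
import math

def optimal_split_queries(n_candidates: int) -> int:
    """Lower bound on yes/no questions needed to isolate one candidate."""
    return math.ceil(math.log2(n_candidates)) if n_candidates > 1 else 0

for n in (2, 8, 20):
    print(f"{n} candidates -> {optimal_split_queries(n)} question(s)")
```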

VI Results

  • The proposed model performs effectively; its efficiency and success rate in disambiguating target objects are detailed.
  • The findings are presented through visual representations of the results.

VII Next Steps

  • Next steps for the research include completing the visual portion of the pipeline and elaborating on zero-shot and few-shot prompting.

Critique

  • While the study demonstrates the effectiveness of LLMs for object disambiguation, it also highlights limitations in inferring unspecified features, which could pose challenges in more complex scenes.
  • The baseline comparison provides a useful benchmark, but the study would benefit from comparison against a wider range of existing methods in the field.

Appendix

Model: gpt-3.5-turbo-1106
Date Generated: 2024-02-26
Abstract: http://arxiv.org/abs/2401.03388v1
HTML: https://browse.arxiv.org/html/2401.03388v1
Truncated: False
Word Count: 5796