LLMs for Robotic Object Disambiguation
prompt-engineering
Large language models (LLMs) excel at solving decision-making challenges in robotics, but struggle with object disambiguation without additional prompting.
Major Findings
- Pre-trained large language models (LLMs) can solve complex decision-making challenges in robotics, such as object disambiguation in tabletop environments, efficiently identifying and retrieving a desired object from a cluttered scene.
- By harnessing the "common sense" knowledge embedded in the model, LLMs can efficiently disambiguate a target object from arbitrarily large tabletop scenes.
- Few-shot prompt engineering significantly improves the LLM's ability to pose disambiguating queries, allowing the model to generate and navigate a precise decision tree to the correct object, even when faced with identical options (see the sketch below).
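To make the few-shot idea concrete, here is a minimal sketch, not the authors' pipeline, of how worked disambiguation exchanges could be prepended to a scene description so the model poses a clarifying question. It assumes the OpenAI Python SDK and gpt-3.5-turbo-1106 (the model listed in the Appendix); the example scenes, prompt wording, and the `pose_disambiguating_query` helper are invented for illustration.

```python
# Minimal sketch of few-shot prompting for object disambiguation.
# Assumes the OpenAI Python SDK (>=1.0) with OPENAI_API_KEY set;
# the example dialogues below are illustrative, not taken from the paper.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXAMPLES = [
    # Worked exchanges: scene + ambiguous request -> one clarifying question.
    {"role": "user", "content": "Scene: red mug, blue mug, red bowl. Request: 'Hand me the mug.'"},
    {"role": "assistant", "content": "Clarifying question: Is the mug you want red or blue?"},
    {"role": "user", "content": "Scene: small knife, large knife, spoon. Request: 'Pass the knife.'"},
    {"role": "assistant", "content": "Clarifying question: Do you want the small knife or the large one?"},
]

def pose_disambiguating_query(scene: str, request: str) -> str:
    """Return one clarifying question for an ambiguous object request."""
    messages = (
        [{"role": "system",
          "content": "You help a tabletop robot identify which object the user wants. "
                     "If the request is ambiguous, ask exactly one clarifying question."}]
        + FEW_SHOT_EXAMPLES
        + [{"role": "user", "content": f"Scene: {scene}. Request: '{request}'"}]
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106", messages=messages, temperature=0
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(pose_disambiguating_query(
        "two identical white plates, one chipped and one intact", "Give me the plate."
    ))
```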
I Introduction
- Disambiguating objects from a scene presents several challenges:
  - Developing a multi-step plan for disambiguation.
  - Inferring new features when the provided scene description is insufficient.
- Previous methods of solving this task have limitations.
III Problem Formulation
- Interpreting and responding to any reasonable, open-ended user request.
- Handling occlusions, such as relocating obstructing objects to reach the desired one.
- Disambiguating the target object is the primary focus; the task and its limitations are described in detail.
IV Proposed Method
- A few-shot prompt-engineering approach is proposed to enable the LLM to infer its own distinguishing features.
- Results from this approach, along with an illustrative example, show the improvement in the model's ability to infer features (a sketch of this step follows the list).
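Below is a hedged sketch of the feature-inference step: given candidates whose provided descriptions are identical, the model is prompted with a few worked examples to propose new distinguishing features. The prompt text, example scenes, and `infer_features` helper are assumptions for illustration, not taken from the paper.

```python
# Sketch of few-shot feature inference; prompt text and examples are illustrative.
# Assumes the OpenAI Python SDK (>=1.0) with OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

FEATURE_EXAMPLES = (
    'Scene: two identical blue cups.\n'
    'Inferred features: ["position on the table", "distance to the user"]\n\n'
    'Scene: three books with the same cover.\n'
    'Inferred features: ["thickness", "position in the stack"]\n\n'
)

def infer_features(scene: str) -> str:
    """Ask the model to propose features that could tell identical candidates apart."""
    prompt = (
        "Propose, as a JSON list of strings, features that would distinguish the "
        "candidate objects below even though their given descriptions are identical.\n\n"
        + FEATURE_EXAMPLES
        + f"Scene: {scene}\nInferred features:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Returned as raw text; downstream code would parse the JSON list.
    return response.choices[0].message.content

# e.g. infer_features("two identical white mugs on a cluttered table")
```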
V Experiments
- Comparison of the model's performance against four baselines: optimal split (sketched below), enumeration, human performance, and POMDP-ATTR.
- Conducted experiments in twelve distinct scenes to evaluate the model’s performance and accuracy.
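For context on the optimal-split baseline, the following is a hypothetical sketch assuming it greedily asks about the attribute that partitions the remaining candidates most evenly (a binary-search-style strategy). The paper's exact implementation may differ; the object representation and helper names here are invented.

```python
# Hypothetical "optimal split" baseline: at each turn, query the attribute that
# partitions the remaining candidates most evenly. This is an assumption about
# the baseline's behaviour, not the paper's implementation.
from collections import Counter

def best_attribute(candidates: list[dict]) -> str | None:
    """Pick the attribute whose most common value covers the fewest candidates."""
    best, best_size = None, len(candidates) + 1
    attributes = {attr for obj in candidates for attr in obj}
    for attr in attributes:
        counts = Counter(obj.get(attr) for obj in candidates)
        largest = max(counts.values())
        if len(counts) > 1 and largest < best_size:
            best, best_size = attr, largest
    return best

def disambiguate(candidates: list[dict], answer) -> dict:
    """Narrow the candidate set by querying one attribute at a time."""
    while len(candidates) > 1:
        attr = best_attribute(candidates)
        if attr is None:  # no attribute separates the remaining objects
            break
        value = answer(attr)  # e.g. ask the user "what is its colour?"
        candidates = [obj for obj in candidates if obj.get(attr) == value]
    return candidates[0]

# Usage: scripted answers stand in for the user.
scene = [
    {"type": "mug", "colour": "red", "size": "small"},
    {"type": "mug", "colour": "blue", "size": "small"},
    {"type": "bowl", "colour": "red", "size": "large"},
]
print(disambiguate(scene, {"type": "mug", "colour": "blue", "size": "small"}.get))
```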
VI Results
- The proposed model performs effectively; its efficiency and success rate in disambiguating target objects are detailed.
- Visual representations of the results are used to present the findings effectively.
VII Next Steps
- Plans for completing the visual portion of the pipeline and further details on zero-shot and few-shot prompting are outlined as the next steps for the research.
Critique
- While the study demonstrates the effectiveness of LLMs for object disambiguation, limitations in inferring unspecified features are highlighted, posing potential challenges in more complex scenes.
- The comparison with baseline methods provides a benchmark, but the study could benefit from a more extensive comparison with a wider range of existing methods in the field.
Appendix
| Field | Value |
| --- | --- |
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-26 |
| Abstract | http://arxiv.org/abs/2401.03388v1 |
| HTML | https://browse.arxiv.org/html/2401.03388v1 |
| Truncated | False |
| Word Count | 5796 |