ALRM: Agentic LLM for Robotic Manipulation

ICRA 2026 (Under Review)

2 Vitor Gaboardi dos Santos, 1 Ibrahim Khadraoui, 1 Ibrahim Farhat, 1 Hamza Yous, 1 Samy Teffahi, 1 Hakim Hacid
1 Technology Innovation Institute, 2 Dublin City University

Paper · Video

Abstract

The ALRM framework consists of three modules. The Task Planner Agent, built on ReAct, decomposes high-level instructions into subtasks through iterative cycles of reasoning and feedback. The Task Executor Agent translates these subtasks into actions via two strategies: Code-as-Policy (CAP), which generates Python code to call functions in one run, and Tool-as-Policy (TAP), which generates nested tool calls. The API Server provides RESTful endpoints for robotic control, including pick-and-place, motion, and perception. Execution results are returned as observations, enabling the planner to refine subsequent actions until the task is completed or the step limit is reached.
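As a concrete illustration of the two execution strategies, here is a minimal sketch of what a CAP-style program might look like for the basket task listed below. The base URL, endpoint paths, and the detect/pick/place wrappers are our own assumptions for illustration, not the actual ALRM API; under TAP, the same behaviour would instead be expressed as a sequence of individual tool calls returned by the LLM.

# Hypothetical CAP-style output for "move the spoon, the coke can, and the
# spatula to the basket". The wrapper functions and REST paths below are
# illustrative assumptions, not the ALRM API Server's real endpoints.
import requests

API = "http://localhost:8000"  # assumed address of the API Server

def detect(obj: str) -> dict:
    # Ask the (hypothetical) perception endpoint whether the object is visible.
    return requests.get(f"{API}/perception/detect", params={"object": obj}).json()

def pick(obj: str) -> None:
    requests.post(f"{API}/motion/pick", json={"object": obj}).raise_for_status()

def place(target: str) -> None:
    requests.post(f"{API}/motion/place", json={"target": target}).raise_for_status()

# Code-as-Policy: one generated Python program covers the whole subtask sequence.
for item in ["spoon", "coke can", "spatula"]:
    if detect(item).get("found"):
        pick(item)
        place("basket")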

Task: Can you move the spoon, the coke can, and the spatula to the basket

Task: Put first the metal box at the center-right, the wood box at the center, and the cardboard box at the center-left position

Task: Pickup the sourest and the biggest fruits and place them in the bin

Task: I hate lemons. Throw it away. I'm also in a diet and want to eat two fruits. Put two fruits with the lowest calories in the bowl

Architecture

The proposed LLM-based agent architecture for solving high-level robotic arm manipulation tasks contains three main modules: (1) the Task Planner Agent, (2) the Task Executor Agent, and (3) the API Server.
TBC...

[Figure: ALRM agent architecture]
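To make the interaction between the modules concrete, the following sketch shows a ReAct-style planning loop in the spirit of the Task Planner Agent described above. The function names, prompt format, and step limit are assumptions for illustration, not the released implementation: the planner proposes the next subtask, the executor turns it into API calls (via CAP or TAP), and the resulting observation is fed back until the task is done or the step budget is exhausted.

# Sketch of a ReAct-style plan-act-observe loop; `llm`, `execute_subtask`,
# the prompt wording, and MAX_STEPS are placeholders, not ALRM's interfaces.
MAX_STEPS = 20

def run_task(llm, execute_subtask, instruction: str) -> bool:
    history = []  # (subtask, observation) pairs accumulated so far
    for _ in range(MAX_STEPS):
        # Reason: ask the planner LLM for the next subtask given past feedback.
        reply = llm(
            f"Task: {instruction}\nHistory: {history}\n"
            "Propose the next subtask, or answer DONE if the task is complete."
        )
        if reply.strip() == "DONE":
            return True
        # Act: the executor translates the subtask into robot API calls.
        observation = execute_subtask(reply)
        # Observe: append the result so the planner can refine later steps.
        history.append((reply, observation))
    return False  # step limit reached before the task was completed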

Benchmark

We created a benchmark of 54 high-level, linguistically diverse tasks across three environments to evaluate LLM performance under the ALRM framework. Evaluation uses an LLM-as-a-judge, which compares the natural language task, the ground-truth action sequence, and the LLM-generated actions, assigning a score of 0 (no subtasks solved), 1 (some subtasks solved), or 2 (all subtasks solved).
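As an illustration of the scoring step (the exact judge prompt is not reproduced here, so the wording below is an assumption), the judge receives the task, the ground-truth action sequence, and the generated actions, and returns one of the three rubric scores:

# Hypothetical LLM-as-a-judge call; `judge_llm` and the prompt wording are
# assumptions, only the 0/1/2 rubric comes from the benchmark description.
RUBRIC = {0: "no subtasks solved", 1: "some subtasks solved", 2: "all subtasks solved"}

def judge(judge_llm, task: str, ground_truth: list[str], generated: list[str]) -> int:
    prompt = (
        f"Task: {task}\n"
        f"Ground-truth actions: {ground_truth}\n"
        f"Generated actions: {generated}\n"
        f"Score the attempt: 0 ({RUBRIC[0]}), 1 ({RUBRIC[1]}), or 2 ({RUBRIC[2]}). "
        "Reply with the number only."
    )
    return int(judge_llm(prompt).strip())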

Success rate

Evaluation involved 10 LLMs under both CAP and TAP, grouped into large-scale models (GPT-5, Gemini-2.5-Pro, Claude-4.1-Opus, DeepSeek-V3.1) and small-scale models (Falcon-H1-7B, Qwen3-8B, Llama-3.1-8B, DeepSeek-R1-7B, Granite-3.3-8B, Mistral-7B). Three distinct LLMs (GPT-4.1, Claude-Sonnet-4, Gemini-2.5-Flash) served as judges; their scores were combined by majority vote, or averaged in case of disagreement.
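A minimal sketch of how the three judges' scores could be combined is shown below. It assumes one reading of the rule above, a strict majority vote over the 0/1/2 scores with the average used only when all three judges disagree; the judge models named in the comment are those listed above.

from collections import Counter

def aggregate(scores: list[int]) -> float:
    # Combine per-judge scores (e.g. from GPT-4.1, Claude-Sonnet-4, and
    # Gemini-2.5-Flash): majority vote if two or more agree, else the mean.
    value, count = Counter(scores).most_common(1)[0]
    if count > len(scores) // 2:
        return float(value)
    return sum(scores) / len(scores)

# Example: scores [2, 2, 1] -> 2.0 (majority); [0, 1, 2] -> 1.0 (average).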

[Figure: Success rate plot]

In summary, the main findings related to success rate are:

  1. Large-scale LLMs achieve higher success rates with TAP than with CAP.
  2. Claude-4.1-Opus delivered the best performance across all models for both CAP and TAP.
  3. Some small-scale LLMs achieve competitive results when using CAP despite having far fewer parameters.
  4. Falcon-H1-7B achieved the highest performance under CAP across all small-scale models, even surpassing Gemini-2.5-Pro and matching DeepSeek-V3.1.
  5. Small-scale LLMs struggle with tool calling.
  6. Qwen3-8B is the best small-scale LLM under TAP.

Latency

[Figure: Latency plot]

In summary, the main findings related to latency are:

  1. TAP generally incurs higher latency than CAP.
  2. Claude-4.1-Opus has the lowest latency among large-scale LLMs, underscoring its strong performance.
  3. Falcon-H1-7B under CAP offers a good trade-off between success rate and latency.
  4. Qwen3-8B has the longest processing time, mainly due to its <thinking> reasoning output.

BibTeX


      @misc{santos2026alrmagenticllmrobotic,
          title={ALRM: Agentic LLM for Robotic Manipulation}, 
          author={Vitor Gaboardi dos Santos and Ibrahim Khadraoui and Ibrahim Farhat and Hamza Yous and Samy Teffahi and Hakim Hacid},
          year={2026},
          eprint={2601.19510},
          archivePrefix={arXiv},
          primaryClass={cs.RO},
          url={https://arxiv.org/abs/2601.19510}, 
      }