Humanoid Agent via Embodied Chain-of-Action Reasoning with Multimodal Foundation Models for Zero-Shot Loco-Manipulation

Authors: Congcong Wen, Geeta Chandra Raju Bethala, Yu Hao, Niraj Pudasaini, Hao Huang, Shuaihang Yuan, Baoru Huang, Anh Nguyen, Mengyu Wang, Anthony Tzes, Yi Fang | Date: 2025-04-13 | URL: https://arxiv.org/abs/2504.09532


Essence

Figure 1

The paper presents a zero-shot agent framework for whole-body loco-manipulation on humanoid robots that integrates the reasoning capabilities of foundation models with an Embodied Chain-of-Action (CoA) mechanism. High-level human instructions are decomposed into a structured sequence of locomotion and manipulation primitives through affordance analysis, spatial reasoning, and whole-body motion reasoning.
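To make the decomposition concrete, the following is a minimal Python sketch of a staged pipeline in the spirit of the description above. It is not the authors' implementation: the stage functions, the `Primitive` type, the primitive names (`walk_to`, `grasp`), and the context keys are all hypothetical, and each stage is a hard-coded stand-in for what would be a foundation-model query in the real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Primitive:
    kind: str    # "locomotion" or "manipulation"
    name: str    # hypothetical primitive name, e.g. "walk_to"
    target: str  # object the primitive refers to


# Each stage reads a shared context dict and returns an extended copy.
Stage = Callable[[Dict[str, str]], Dict[str, str]]


def affordance_analysis(ctx: Dict[str, str]) -> Dict[str, str]:
    # Stand-in assumption: the last word of the instruction names the object,
    # and the object is graspable.
    obj = ctx["instruction"].rstrip(".").split()[-1]
    return {**ctx, "object": obj, "affordance": "graspable"}


def spatial_reasoning(ctx: Dict[str, str]) -> Dict[str, str]:
    # Stand-in assumption: the object is out of arm's reach.
    return {**ctx, "reachable": "no"}


def whole_body_reasoning(ctx: Dict[str, str]) -> Dict[str, str]:
    # Stand-in assumption: no posture change is needed.
    return {**ctx, "posture": "stand"}


def plan(instruction: str, stages: List[Stage]) -> List[Primitive]:
    """Run the reasoning stages in order, then emit a primitive sequence."""
    ctx: Dict[str, str] = {"instruction": instruction}
    for stage in stages:
        ctx = stage(ctx)

    primitives: List[Primitive] = []
    if ctx["reachable"] == "no":
        primitives.append(Primitive("locomotion", "walk_to", ctx["object"]))
    if ctx["posture"] != "stand":
        primitives.append(Primitive("locomotion", ctx["posture"], ctx["object"]))
    if ctx["affordance"] == "graspable":
        primitives.append(Primitive("manipulation", "grasp", ctx["object"]))
    return primitives


STAGES: List[Stage] = [affordance_analysis, spatial_reasoning, whole_body_reasoning]
```

Calling `plan("Pick up the bottle", STAGES)` under these stand-in assumptions yields a locomotion primitive followed by a manipulation primitive. The point of the sketch is only the ordering: each reasoning stage enriches a shared context before any primitive is emitted.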

Motivation

Achievement

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper is a meaningful contribution as the first to integrate the reasoning capabilities of foundation models into humanoid loco-manipulation, and it presents a new approach that uses the CoA Reasoning mechanism to convert natural-language instructions into physically feasible action sequences. Its demonstration of robust zero-shot generalization on a real humanoid robot gives it high practical value.
