Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World

Authors: Yingzhao Jian, Zhongan Wang, Yi Yang, Hehe Fan | Date: 2025-10-28 | URL: https://arxiv.org/abs/2511.00041


Essence

Figure 1: BiBo is a humanoid agent powered by an off-the-shelf VLM. It consists of an embodied

The paper proposes BiBo, a framework that pairs an embodied instruction compiler with a diffusion-based motion executor so that an off-the-shelf VLM (GPT-4) can control a humanoid agent. This enables flexible interaction in open environments without large-scale data collection.
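The two-stage structure described above (a compiler that turns an instruction into structured commands, followed by an executor that turns each command into motion) can be illustrated with a minimal sketch of the data flow only. Everything here is an assumption made for illustration: `MotionCommand`, `compile_instruction`, `execute`, and `run_agent` are hypothetical names, the compiler is a trivial string parser standing in for a VLM query, and the executor emits placeholder frame labels rather than running any diffusion model. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MotionCommand:
    """Hypothetical structured command; the paper's real schema is not given here."""
    action: str
    target: str


def compile_instruction(instruction: str) -> List[MotionCommand]:
    """Stand-in for the embodied instruction compiler.

    A real system would query an off-the-shelf VLM; this stub only splits
    the text on " then " and treats the first word of each clause as the action.
    """
    commands = []
    for clause in instruction.lower().split(" then "):
        words = clause.split()
        if not words:
            continue  # skip empty clauses
        commands.append(MotionCommand(action=words[0], target=" ".join(words[1:])))
    return commands


def execute(command: MotionCommand, num_frames: int = 4) -> List[str]:
    """Stand-in for the motion executor.

    The paper's executor is diffusion-based; this stub performs no diffusion
    and just returns one placeholder label per frame.
    """
    return [f"{command.action}:{command.target}:frame{i}" for i in range(num_frames)]


def run_agent(
    instruction: str,
    compiler: Callable[[str], List[MotionCommand]] = compile_instruction,
    executor: Callable[[MotionCommand], List[str]] = execute,
) -> List[str]:
    """Chain the two stages: compile once, then execute each command in order."""
    frames: List[str] = []
    for command in compiler(instruction):
        frames.extend(executor(command))
    return frames


if __name__ == "__main__":
    for frame in run_agent("walk to the chair then sit on the chair"):
        print(frame)
```

Passing the compiler and executor as arguments keeps the two stages decoupled, which mirrors the idea that the VLM is used off the shelf and only the interface between the stages is fixed.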

Motivation

Achievement

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ off-the-shelf VLM๊ณผ humanoid control์„ ์—ฐ๊ฒฐํ•˜๋Š” ์ฐฝ์˜์ ์ธ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์‹œํ•˜๊ณ , structured representation๊ณผ LDM์˜ novel application์„ ํ†ตํ•ด ๊ธฐ์ˆ ์  ๊ธฐ์—ฌ๋ฅผ ํ•˜์˜€์œผ๋ฉฐ, ์‹ค์ œ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘์˜ ๋ณ‘๋ชฉ์„ ํ•ด์†Œํ•˜๋ ค๋Š” ์‹ค์งˆ์  ์˜์˜๊ฐ€ ์žˆ์Œ. ๋‹ค๋งŒ ์‹ค์ œ ๋ฌผ๋ฆฌ ํ™˜๊ฒฝ์—์„œ์˜ ๊ฒ€์ฆ๊ณผ robustness ๋ถ„์„์ด ๋ณด๊ฐ•๋œ๋‹ค๋ฉด ๋”์šฑ ๊ฐ•๋ ฅํ•œ ์ž‘์—…์ด ๋  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒ๋จ.
