$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Authors: Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, James Tanner, Quan Vuong, Homer Walke, Anna Walling, Haohuan Wang, Lili Yu, Ury Zhilinsky | Date: 2025-04-22 | URL: https://arxiv.org/abs/2504.16054


Essence

Figure 1

Fig. 1: The π0.5 model transfers knowledge from a heterogeneous range of data sources, including other robots, high-leve

π0.5 is a Vision-Language-Action model that is co-trained on multiple heterogeneous data sources (data from diverse robots, web data, and semantic prediction tasks) and can carry out long-horizon, complex manipulation tasks in real homes.
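The co-training idea can be pictured as weighted sampling over data sources, where each example's source determines what it supervises. This is a minimal sketch under stated assumptions: the source names in `MIXTURE_WEIGHTS`, their weights, the `HAS_ACTIONS` split, and the helpers `sample_batch_sources` and `loss_terms` are all hypothetical and are not the paper's actual data mixture or training code.

```python
import random
from collections import Counter
from typing import Dict, List

# Hypothetical data sources and mixture weights, for illustration only.
# They loosely mirror the categories named above (robot data, web data,
# semantic prediction) and are NOT the paper's actual mixture.
MIXTURE_WEIGHTS: Dict[str, float] = {
    "mobile_manipulator": 0.4,   # target-robot demonstrations
    "other_robots": 0.3,         # cross-embodiment robot data
    "subtask_prediction": 0.1,   # high-level semantic labels
    "web_data": 0.2,             # captioning / VQA style examples
}

# Assumption: only robot sources carry action targets; the rest supervise text.
HAS_ACTIONS = {"mobile_manipulator", "other_robots"}


def sample_batch_sources(weights: Dict[str, float], batch_size: int,
                         rng: random.Random) -> List[str]:
    """Pick a data source for every slot in a batch, proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=batch_size)


def loss_terms(source: str) -> List[str]:
    """Return the (hypothetical) loss term an example from `source` contributes to."""
    return ["action"] if source in HAS_ACTIONS else ["text"]


if __name__ == "__main__":
    rng = random.Random(0)
    batch = sample_batch_sources(MIXTURE_WEIGHTS, batch_size=32, rng=rng)
    print(Counter(batch))           # how many slots each source received
    print(loss_terms(batch[0]))     # which loss the first example feeds
```

Because `random.choices` normalizes the weights itself, the mixture can be tuned by editing relative values without keeping them summing to one.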

Motivation

Achievement

Figure 2

Fig. 2: π0.5 cleaning a new kitchen. The robot is tasked with cleaning a kitchen in a home that was not in the training

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: By systematically integrating heterogeneous data sources, π0.5 is the first work to substantively address the real-world generalization problem for VLA models, and its hierarchical semantic structure and co-training framework point to important design principles for robot learning.
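Assuming the hierarchical semantic structure means the model first predicts a high-level subtask in text and then predicts low-level actions conditioned on that subtask, one inference step can be sketched as below. The function names (`hierarchical_step`, `toy_subtask`, `toy_actions`), the string observations, and the 7-dimensional action vectors are hypothetical stand-ins, not the paper's interface.

```python
from typing import Callable, List, Tuple

# Hypothetical: one low-level command vector (e.g. 7 numbers for an arm).
Action = List[float]


def hierarchical_step(
    observation: str,
    instruction: str,
    predict_subtask: Callable[[str, str], str],
    predict_actions: Callable[[str, str], List[Action]],
) -> Tuple[str, List[Action]]:
    """Run one two-stage step: pick a semantic subtask, then actions conditioned on it."""
    subtask = predict_subtask(observation, instruction)  # high level: text subtask
    actions = predict_actions(observation, subtask)      # low level: action chunk
    return subtask, actions


# Toy stubs standing in for a learned model; purely illustrative.
def toy_subtask(observation: str, instruction: str) -> str:
    return "pick up the plate" if "plate" in observation else "look around"


def toy_actions(observation: str, subtask: str) -> List[Action]:
    # A chunk of zero vectors; a real policy would output motor commands.
    steps = 4 if subtask.startswith("pick") else 1
    return [[0.0] * 7 for _ in range(steps)]


if __name__ == "__main__":
    subtask, chunk = hierarchical_step(
        "plate on the counter", "clean the kitchen", toy_subtask, toy_actions
    )
    print(subtask, len(chunk))
```

Keeping the intermediate subtask as plain text makes each high-level decision inspectable, and swapping the stubs for calls into a real model leaves `hierarchical_step` unchanged.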
