Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Authors: Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu | Date: 2025-07-21 | URL: https://arxiv.org/abs/2507.15597


Essence

Figure 1: Being-H0 acquires dexterous manipulation skills by learning from large-scale human videos in the

Being-H0 is a dexterous Vision-Language-Action model trained on large-scale human videos. Through its physical instruction tuning paradigm, it explicitly models human hand motion and transfers that knowledge to robot manipulation tasks.

Motivation

Achievement

Figure 2: Overview of Being-H0. The text tokenizer and visual encoder are shared by both pretraining

How

Figure 3: Physical Instruction Tuning. Our training paradigm bridges human video datasets and robotic

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: Being-H0 presents a new paradigm for learning dexterous robot manipulation from large-scale human videos. Through physical instruction tuning and part-level motion tokenization, it addresses the data scarcity problem of existing VLAs. Its explicit motion modeling approach and the UniHand dataset are important contributions to robotics.
