Being-H0.7: A Latent World-Action Model from Egocentric Videos

Authors: Hao Luo, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Haiweng Xu, Chaoyi Xu, Ziheng Xi, Yuhui Fu, Zongqing Lu | Date: 2026-04-30 | URL: https://arxiv.org/abs/2605.00078


Essence

Figure 2: Latent reasoning and latent world-action model. Left: Learnable latent queries are inserted

This paper presents Being-H0.7, a latent world-action model learned from egocentric video. It introduces learnable latent queries as a reasoning interface between perception and action generation and, through a future-informed dual-branch design, combines the predictive ability of world models with the efficiency of VLAs without generating future frames.

Motivation

Achievement

Figure 3: Being-H0.7 Architecture. We pack the prior and posterior branches into a single MoT sequence

• Simulation performance: state-of-the-art or comparable results on six benchmarks (LIBERO, LIBERO Plus, RoboTwin 2.0, RoboCasa, CALVIN ABC, CALVIN ABCD).

• Real-robot tasks: evaluated on 12 challenging tasks across three robot platforms (catching a fast rolling ball, pouring into a moving container, folding clothes, sorting packages on a conveyor, hammering nails, and others), with the best performance on all five capability-oriented suites.

• Deployment efficiency: runs in the 3-4 ms/step regime with no test-time future-generation overhead.

How

• Introduces learnable latent queries as an explicit reasoning interface between perception and action.

• Future-informed dual-branch design: the prior branch infers latent states from the current context; the posterior branch is used only during training and replaces the queries with future-observation embeddings.

• Dual-branch alignment lets the prior queries learn future-aware, action-useful structure.

• Regularization through norm and rank constraints prevents collapse of the latent states.

• The dual branch is implemented as a single Mixture-of-Transformers sequence, so context is shared efficiently.

• Latency-aware universal asynchronous chunking (UAC) optimizes deployment.
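The review does not describe how UAC works internally, so the following is only a generic sketch of latency-aware action chunking, not the authors' method. It assumes that a new action chunk is predicted while the robot keeps executing, so the first few actions of the new chunk are already stale when inference finishes. The function names and the example numbers (25 ms latency, 10 ms control period) are hypothetical.

```python
import math


def stale_steps(inference_latency_ms: float, control_period_ms: float) -> int:
    """Number of control steps that elapse while one inference call runs.

    Hypothetical helper: both arguments are illustrative assumptions.
    """
    return math.ceil(inference_latency_ms / control_period_ms)


def usable_actions(chunk, inference_latency_ms, control_period_ms):
    """Drop the leading actions of a freshly predicted chunk that are stale.

    `chunk` is a list of actions indexed by the control step they were
    planned for, counted from the moment inference started.
    """
    skip = stale_steps(inference_latency_ms, control_period_ms)
    return chunk[skip:]


if __name__ == "__main__":
    # Eight planned actions, labelled by their step index for readability.
    chunk = list(range(8))
    print(usable_actions(chunk, inference_latency_ms=25.0, control_period_ms=10.0))
```

With these assumed numbers, three control steps pass during inference, so execution would resume from the fourth planned action. The lower the per-step latency, the fewer predicted actions are discarded.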

Originality

• Recasts world-action modeling in latent space, addressing the inefficiency of pixel-space prediction (prior WAMs rely on video generation).

• The future-informed dual-branch design lets the posterior branch be dropped at test time (training-only privileged supervision).

• Introduces learnable latent queries as an explicit reasoning interface (existing VLAs lack such explicit structure).

• Hidden-state alignment and lightweight regularization give stable, scalable latent learning.
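To make the idea of hidden-state alignment with light regularization concrete, here is a minimal sketch under stated assumptions. It is not the paper's loss: it assumes the alignment term is a mean squared error between prior-branch and posterior-branch latent states, and it approximates the norm constraint with a penalty that pulls each prior state's L2 norm toward a target. The rank constraint is omitted, and `norm_weight` and `target_norm` are hypothetical parameters. A real training setup would compute this in an autodiff framework; the code below only evaluates the scalar.

```python
import math


def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    total, count = 0.0, 0
    for va, vb in zip(a, b):
        for x, y in zip(va, vb):
            total += (x - y) ** 2
            count += 1
    return total / count


def norm_penalty(states, target_norm=1.0):
    """Mean squared deviation of each vector's L2 norm from target_norm."""
    devs = [(math.sqrt(sum(x * x for x in v)) - target_norm) ** 2 for v in states]
    return sum(devs) / len(devs)


def alignment_loss(prior_states, posterior_states, norm_weight=0.1):
    """Assumed form: MSE alignment plus a weighted norm penalty on the prior.

    prior_states: latent states produced from the current context only.
    posterior_states: latent states produced with future-observation embeddings.
    """
    align = mse(prior_states, posterior_states)
    return align + norm_weight * norm_penalty(prior_states)


if __name__ == "__main__":
    prior = [[2.0, 0.0]]
    posterior = [[0.0, 0.0]]
    print(alignment_loss(prior, posterior))
```

Because only the prior states enter the penalty and the posterior states appear only as alignment targets, nothing in this sketch is needed at inference time, which matches the review's point that the posterior branch can be dropped at test time.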

Limitation & Further Study

• Justification of the posterior-branch design: there is no clear theoretical account of why the posterior embeddings contain exactly the information that is useful for action. It is unclear whether embedding the future observation is enough on its own to guarantee future-relevant information.

• No stated criteria for choosing the number and dimensionality of latent queries: ablation studies should analyze in more detail how these hyperparameters affect performance and deployment efficiency.

• Limited generalization evaluation: the real-robot evaluation covers only 12 tasks, and results under highly diverse domain shift or zero-shot generalization scenarios are lacking.

• Insufficient analysis of computational overhead: the memory and compute cost of training the prior and posterior branches together needs detailed analysis. The deployment-efficiency claim rests mainly on latency.

• Future work: performance at larger model scale, more diverse robot embodiments, and improved stability at extreme future-prediction horizons.

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: Being-H0.7 makes a strong contribution by recasting world-action modeling in latent space, keeping the benefits of future prediction while removing the inefficiency of pixel generation. The future-informed dual-branch design and the latent-query interface are creative and effective, and the broad simulation and real-robot evaluations show consistent performance gains. However, the justification of the posterior branch, the theoretical grounding of the latent structure, and the choice of some hyperparameters need to be clarified.
