InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

Authors: Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, Bolun Wang, Fangjing Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jinhui Ye, Junqiu Yu, Jia Zeng, Jingjing Zhang, Jinyu Zhang, Shi Zhang, Feng Zheng, Bowen Zhou, Yangkun Zhu | Date: 2025-10-15 | URL: https://arxiv.org/abs/2510.13778


Essence

Figure 1. InternVLA-M1 integrates spatial grounding into the vision–language–action training pipeline.

InternVLA-M1 is a unified framework that uses spatial grounding as the central link in vision-language-action learning to realize scalable, general-purpose intelligence for instruction-following robots.

Motivation

Achievement

Figure 2. Overview of InternVLA-M1. InternVLA-M1 adopts a spatially guided two-stage training

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: InternVLA-M1 offers a clear interface between instruction following and embodied control through a dual-system design built around spatial grounding. It is a very solid study that demonstrates consistent performance gains and scalability across a broad range of benchmarks.
