OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation

Authors: Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qiu | Date: 2025-11-03 | URL: https://arxiv.org/abs/2511.01210


Essence

Figure 2

Fig. 2: System Overview. OmniVLA processes diverse sensor data into image-like 2D spatial representations, and then

OmniVLA is the first VLA model to integrate multiple sensors, including RGB, infrared, mmWave radar, and acoustic microphones. Through a unified representation called the sensor-masked image, it enables physically informed robotic manipulation.

Motivation

Achievement

Figure 5

Fig. 5: Examples of Robotic Manipulation Task Completion

How


Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: OmniVLA presents an elegant and practical solution to the problem of integrating multiple sensors into a VLA. Its sensor-masked image, a simple yet effective representation, achieves both scalability and data efficiency, making this a meaningful contribution.
