GR-3 Technical Report

์ €์ž: Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, Yichu Yang | ๋‚ ์งœ: 2025-07-21 | URL: https://arxiv.org/abs/2507.15493 📄 PDF


Essence

Figure 1 Overview. GR-3 is able to learn from three types of data: vision-language data, robot trajectory data,

GR-3 is a vision-language-action (VLA) model. By co-training on web-scale vision-language data and robot trajectory data, it implements a generalist robot policy that generalizes, fine-tunes efficiently, and carries out long-horizon tasks.

Motivation

Achievement

Figure 2 Capabilities. GR-3 strictly follows instructions and is capable of understanding unseen instructions involving

How

Figure 3 The GR-3 Model. GR-3 is co-trained on both robot trajectories and vision-language data with a flow-matching
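The caption above names two ingredients: a flow-matching objective for actions and co-training with vision-language data. The sketch below illustrates one common form of each, and it is not taken from the report. It assumes a linear interpolation path between Gaussian noise and the action chunk, which may differ from the paper's convention. The names `predict_velocity`, `flow_matching_loss`, `co_training_loss`, and `vl_weight` are hypothetical and chosen only for illustration.

```python
import random


def flow_matching_loss(predict_velocity, action_chunk, rng):
    """Single-sample estimate of a conditional flow-matching loss.

    Assumption: linear path x_t = (1 - t) * noise + t * action, whose
    target velocity is (action - noise). `predict_velocity` is a
    hypothetical stand-in for an action head mapping (x_t, t) to a list
    with the same length as `action_chunk`.
    """
    t = rng.random()  # flow time sampled uniformly from [0, 1)
    noise = [rng.gauss(0.0, 1.0) for _ in action_chunk]
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action_chunk)]
    target = [a - n for a, n in zip(action_chunk, noise)]
    pred = predict_velocity(x_t, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def co_training_loss(action_loss, vl_token_loss, vl_weight=1.0):
    """Weighted sum of the robot-action loss and the vision-language loss.

    `vl_weight` is an illustrative knob, not a value from the report.
    """
    return action_loss + vl_weight * vl_token_loss
```

In a real training loop the two loss terms would come from different batches (robot trajectories and vision-language samples) that pass through a shared backbone. Here they are plain floats to keep the sketch self-contained.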

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: GR-3๋Š” co-training, auxiliary supervision, VR ๊ธฐ๋ฐ˜ ํšจ์œจ์  ์ ์‘ ๋“ฑ ์—ฌ๋Ÿฌ ํ˜์‹  ๊ธฐ๋ฒ•์„ ์ข…ํ•ฉํ•œ ์‹ค์งˆ์ ์œผ๋กœ ๊ฒฌ๊ณ ํ•œ VLA ๋ชจ๋ธ๋กœ์„œ, ์žฅ๊ธฐ ์ง€ํ‰๊ณผ ์ •๊ตํ•œ ์กฐ์ž‘ ์ž‘์—…์—์„œ SOTA๋ฅผ ๋‹ฌ์„ฑํ–ˆ์œผ๋‚˜, ํ‰๊ฐ€ ๋ฒ”์œ„์˜ ์ œํ•œ๊ณผ ๋ถ€๋ถ„์  ablation ๋ถ„์„์œผ๋กœ ์ธํ•ด ์™„์ „ํ•œ ๊ธฐ์—ฌ ๋ช…ํ™•ํ™”์—๋Š” ๋‹ค์†Œ ๋ฏธํกํ•˜๋‹ค.
