RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

์ €์ž: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich | ๋‚ ์งœ: 2023-07-28 | URL: https://arxiv.org/abs/2307.15818 📄 PDF


Essence

Figure 1 | RT-2 overview: we represent robot actions as another language, which can be cast into text tokens and

์ธํ„ฐ๋„ท ๊ทœ๋ชจ์˜ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šตํ•œ vision-language ๋ชจ๋ธ์„ ๋กœ๋ด‡ ์ œ์–ด์— ์ง์ ‘ ํ†ตํ•ฉํ•˜์—ฌ end-to-end ๋กœ๋ด‡ ์ •์ฑ…์„ ํ•™์Šตํ•˜๋Š” RT-2 ๋ชจ๋ธ์„ ์ œ์•ˆํ•œ๋‹ค. ๋กœ๋ด‡ ์•ก์…˜์„ ํ…์ŠคํŠธ ํ† ํฐ์œผ๋กœ ํ‘œํ˜„ํ•˜์—ฌ VLM์˜ ์‚ฌ์ „ํ•™์Šต ์ด์ ์„ ํ™œ์šฉํ•˜๋ฉด์„œ๋„ ์ €์ˆ˜์ค€์˜ ๋กœ๋ด‡ ์ œ์–ด๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•œ๋‹ค.

Motivation

Achievement

Figure 2 | RT-2 is able to generalize to a variety of real-world situations that require reasoning, symbol

How


Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: RT-2๋Š” ์›น ๊ทœ๋ชจ vision-language ๋ชจ๋ธ์˜ ์˜๋ฏธ๋ก ์  ์ง€์‹์„ ๋กœ๋ด‡ ์ œ์–ด์— ์ง์ ‘ ํ†ตํ•ฉํ•˜๋Š” ์šฐ์•„ํ•˜๊ณ  ํšจ๊ณผ์ ์ธ ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•˜๋ฉฐ, ๊ด‘๋ฒ”์œ„ํ•œ ์‹คํ—˜์„ ํ†ตํ•ด ๋ฏธํ•™์Šต ๊ฐ์ฒด ์ผ๋ฐ˜ํ™”์™€ ์˜๋„ํ•œ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ์ž…์ฆํ•œ๋‹ค. ๋กœ๋ด‡ ๊ณตํ•™์—์„œ ๋Œ€๊ทœ๋ชจ ์‚ฌ์ „ํ•™์Šต ํ™œ์šฉ์˜ ์ƒˆ๋กœ์šด ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ œ์•ˆํ•œ ๊ฒƒ์œผ๋กœ ์‚ฐ์—…์ , ํ•™๋ฌธ์  ๊ธฐ์—ฌ๋„๊ฐ€ ํฌ๋‹ค.

← ๋ชฉ๋ก์œผ๋กœ ๋Œ์•„๊ฐ€๊ธฐ

๐ŸŽง Audio Overview

์ด ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋ฅผ ํŒŸ์บ์ŠคํŠธํ˜• ์˜ค๋””์˜ค๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. (Gemini ยท ํ‚ค๋Š” ๋ธŒ๋ผ์šฐ์ €์—๋งŒ ์ €์žฅ ยท ์™„์„ฑ๋ณธ์€ ์ด๋ฉ”์ผ๋กœ๋„ ์ „์†ก)
โ–ธ ๊ณ ๊ธ‰: ๊ตฌ์„ฑ ๋ฐฉํ–ฅ(๋Œ€๋ณธ ์ž‘์„ฑ ์ง€์นจ) ์ง์ ‘ ์ˆ˜์ •