RVT-2: Learning Precise Manipulation from Few Demonstrations

Authors: Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, Dieter Fox | Date: 2024-06-12 | URL: https://arxiv.org/abs/2406.08545


Essence

Figure 1

Fig. 1: RVT-2 performing high precision tasks. Given a language instruction, a single RVT-2 model can perform multiple 3

RVT-2 is a multi-task robot manipulation model that learns high-precision 3D manipulation tasks from a small number of demonstrations. Compared with its predecessor RVT, it trains 6x faster and runs inference 2x faster, while reaching a state-of-the-art 82% on RLBench.

Motivation

Achievement

Figure 3

Fig. 3: Training time vs Success rate on RLBench. All

  1. Performance: raises the success rate on RLBench from 65% to 82% (state-of-the-art)
  2. Speed: 6x faster training (2.4M → 16M samples/day) and 2x faster inference (11.6 fps → 20.6 fps)
  3. Real-world validation: performs plug insertion and peg insertion, tasks that require millimeter-level precision, from only 10 demonstrations
  4. Generalization: handles multiple manipulation tasks with a single RGB-D camera and a single multi-task model
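
The "6x" and "2x" headline numbers are rounded, so it can help to recompute the ratios from the raw figures quoted in the list above. The following is only a small arithmetic sketch: the variable and function names are illustrative, and the "relative failure reduction" metric is an extra derived quantity assumed here for comparison, not something reported in the review.

```python
# Figures quoted in the list above (RVT -> RVT-2).
train_before, train_after = 2.4e6, 16e6   # training throughput, samples/day
fps_before, fps_after = 11.6, 20.6        # inference speed, frames/s
succ_before, succ_after = 65.0, 82.0      # RLBench success rate, percent


def speedup(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


train_speedup = speedup(train_before, train_after)
infer_speedup = speedup(fps_before, fps_after)

# Absolute gain in percentage points.
succ_gain_points = succ_after - succ_before

# Assumed extra metric (not from the source): fraction of the remaining
# failures (100 - 65 = 35 points) that the new model eliminates.
rel_failure_reduction = succ_gain_points / (100.0 - succ_before)

print(f"training speedup:  {train_speedup:.2f}x")   # 6.67x
print(f"inference speedup: {infer_speedup:.2f}x")   # 1.78x
print(f"success-rate gain: {succ_gain_points:.0f} points")
print(f"failure reduction: {rel_failure_reduction:.1%}")
```

Under these figures the training speedup is closer to 6.7x and the inference speedup closer to 1.8x, consistent with the rounded "6x" and "2x" claims.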

How

Figure 2

Fig. 2: RVT-2 Architecture. Given the current scene and a task instruction, RVT-2 predicts the next key-frame pose. It c

Originality

Limitation & Further Study

Evaluation

Novelty: 3/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall: Through architectural and system-level optimizations, RVT-2 achieves a meaningful performance gain in high-precision 3D manipulation. It is also the first to demonstrate that precise real-world tasks can be performed from few demonstrations, which makes it an important contribution to the field of robot manipulation.
