Open-World Object Manipulation using Pre-trained Vision-Language Models

Authors: Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Sean Kirmani, Brianna Zitkovich, Fei Xia, Chelsea Finn, Karol Hausman | Date: 2023-03-02 | URL: https://arxiv.org/abs/2303.00905


Essence

Figure 1: Overview of MOO. We train a language-conditioned policy conditioned on object locations from a

The paper proposes MOO (Manipulation of Open-World Objects), a method that interfaces a pre-trained vision-language model (VLM) with a robot policy so that the robot can follow instructions about novel object categories it has never directly experienced.

Motivation

Achievement

Figure 5: Main Results. While baseline methods perform competitively on in-distribution combinations of

How

Figure 2: MOO architecture: We extract object location (represented as the center of the bounding box) on
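
The caption above says an object's location is represented as the center of its bounding box. A minimal sketch of that reduction follows, assuming boxes arrive as `(x_min, y_min, x_max, y_max)` pixel coordinates. The names `bbox_center` and `center_mask` are hypothetical, and the single-pixel mask is only one illustrative way to hand the location to a policy; the text above does not confirm that encoding.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
BBox = Tuple[float, float, float, float]


def bbox_center(box: BBox) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    x_min, y_min, x_max, y_max = box
    if x_max < x_min or y_max < y_min:
        raise ValueError("box corners are out of order")
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def center_mask(box: BBox, height: int, width: int) -> List[List[int]]:
    """Build a height x width grid of zeros with a single 1 at the box center.

    Illustrative encoding only (an assumption, not taken from the review).
    The center is clamped so it always falls inside the grid.
    """
    cx, cy = bbox_center(box)
    col = min(max(int(cx), 0), width - 1)
    row = min(max(int(cy), 0), height - 1)
    mask = [[0] * width for _ in range(height)]
    mask[row][col] = 1
    return mask
```

For example, `bbox_center((10, 20, 30, 60))` gives the midpoint of the two corners, and `center_mask` marks that one pixel in an otherwise empty grid.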

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper is an important contribution that integrates a pre-trained VLM into robot manipulation in a practical way to achieve semantic generalization, and it demonstrates practicality through real-robot experiments and an extension to multiple modalities.
