ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Authors: Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox | Date: 2019-12-03 | URL: https://arxiv.org/abs/1912.01734


Essence

Figure 1: ALFRED consists of 25k language directives

ALFRED is a benchmark for learning a mapping from natural language instructions and egocentric vision to action sequences for household tasks. It comprises 25k natural language directives and includes irreversible state changes, narrowing the gap to real-world robotics applications.
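To make the mapping concrete, here is a minimal sketch of how one benchmark episode could be represented: a high-level goal, low-level step instructions, and an action sequence in which interaction steps carry a pixelwise mask. All class names, field names, action labels, and the example text are hypothetical, chosen only for illustration; they are not the schema of the released ALFRED data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# NOTE: every name here is an illustrative assumption, not ALFRED's actual format.

@dataclass
class Step:
    action: str  # placeholder action label (navigation or interaction)
    # Pixel coordinates selecting the target object; None for navigation steps.
    mask: Optional[List[Tuple[int, int]]] = None

@dataclass
class Episode:
    goal: str                # high-level language directive
    instructions: List[str]  # low-level, step-by-step directives
    steps: List[Step] = field(default_factory=list)

def interaction_steps(episode: Episode) -> List[Step]:
    """Return only the steps that act on an object (those with a mask)."""
    return [s for s in episode.steps if s.mask is not None]

# A made-up episode, not taken from the dataset.
example = Episode(
    goal="Put a mug on the table",
    instructions=["Walk to the counter", "Pick up the mug"],
    steps=[
        Step("MoveAhead"),
        Step("PickupObject", mask=[(10, 12), (10, 13)]),
    ],
)

print([s.action for s in interaction_steps(example)])
```

Separating maskless navigation steps from masked interaction steps reflects the idea that an agent must both choose an action and indicate which object in its egocentric view the action applies to.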

Motivation

Achievement

Figure 2: ALFRED annotations. We introduce 7 different task types parameterized by 84 object classes in 120 scenes.

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: ALFRED is an important benchmark that brings together realistic challenges for research on grounding natural language in actions. Design choices such as high-level and low-level language annotations, irreversible state changes, and pixelwise interaction masks bring it closer to real robot applications than prior datasets.
