Text2Reward: Reward Shaping with Language Models for Reinforcement Learning

Authors: Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, Tao Yu | Date: 2023-09-20 | URL: https://arxiv.org/abs/2309.11489


Essence

Figure 1: An overview of TEXT2REWARD of three stages: Expert Abstraction provides an abstraction

The paper presents Text2Reward, a data-free framework that uses an LLM to automatically generate and shape dense reward functions from goals described in natural language. The generated reward code is an interpretable, executable program, and it supports a broader range of tasks than existing inverse RL or sparse-reward-based methods.
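To make the idea of an interpretable, executable dense reward concrete, here is a minimal Python sketch contrasting it with a sparse reward. It is not code from the paper: the `State` fields, the `tanh` shaping, the stage bonus, and the success tolerance are all illustrative assumptions for a hypothetical cube-placing task.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class State:
    """Hypothetical environment abstraction (not the paper's API)."""
    gripper_pos: Vec3
    cube_pos: Vec3
    goal_pos: Vec3
    is_grasped: bool


def sparse_reward(s: State, tol: float = 0.02) -> float:
    """Signal only on success: 1 when the cube is within tol of the goal."""
    return 1.0 if math.dist(s.cube_pos, s.goal_pos) < tol else 0.0


def dense_reward(s: State) -> float:
    """Staged shaping: reach the cube, grasp it, then carry it to the goal."""
    # Stage 1: approaches 1 as the gripper nears the cube.
    reward = 1.0 - math.tanh(5.0 * math.dist(s.gripper_pos, s.cube_pos))
    if s.is_grasped:
        # Stage 2: fixed bonus for holding the cube (assumed weight).
        reward += 1.0
        # Stage 3: approaches 1 as the cube nears the goal.
        reward += 1.0 - math.tanh(5.0 * math.dist(s.cube_pos, s.goal_pos))
    return reward
```

For two states in which the gripper has not yet reached the cube, `sparse_reward` returns 0 for both, while `dense_reward` is larger for the state whose gripper is closer. That difference is what gives an RL agent a learning signal before its first success.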

Motivation

Achievement

Figure 2: Learning curves on MANISKILL2 under zero-shot and few-shot reward generation settings,

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper addresses reward design, a long-standing challenge in RL, in an innovative way through automatic LLM-based generation of reward code, and it achieves high interpretability and reliability through a Pythonic abstraction and code execution feedback. It demonstrates practicality on a broad set of robotics benchmarks and in real-robot deployment, and its human-in-the-loop pipeline shows that the approach can be applied in practice. It is a strong ICLR 2024 paper.
