Improving generalization of robot locomotion policies via sharpness-aware reinforcement learning

Authors: S. Bochem, E. Gonzalez-Sanchez, Y. Bicker, G. Fadini (ETH Zürich) | Date: 2024 | DOI: arXiv:2411.19732


Essence

First-order policy gradient methods built on differentiable simulators are sample-efficient but generalize poorly. To address this, the study introduces Sharpness-Aware Minimization (SAM) into robot reinforcement learning, which this review describes as a first. The SHAC-ASAM algorithm searches for flat minima of the loss function and thereby achieves both robustness and efficiency in contact-rich robot control environments.
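
To make the flat-minimum idea concrete, here is a minimal sketch of one ASAM-style update on a toy problem. It is not the paper's SHAC-ASAM implementation: `toy_loss_grad`, `asam_step`, and the values of `lr`, `rho`, and `eta` are hypothetical names and settings chosen for illustration. In SHAC-ASAM the gradient would come from backpropagating through the differentiable simulator rather than from a closed-form toy loss.

```python
import math


def toy_loss_grad(w):
    """Gradient of a hypothetical toy loss L(w) = 10*w0^2 + 0.1*w1^2."""
    curvature = [10.0, 0.1]
    return [2.0 * c * wi for c, wi in zip(curvature, w)]


def asam_step(w, grad_fn, lr=0.02, rho=0.5, eta=0.01):
    """One adaptive sharpness-aware update (illustrative sketch).

    1. Compute the gradient g at the current weights w.
    2. Build the adaptive perturbation eps = rho * T^2 g / ||T g||,
       with T = |w| + eta applied elementwise.
    3. Evaluate the gradient at the perturbed weights w + eps.
    4. Apply that gradient to the ORIGINAL weights w.
    """
    g = grad_fn(w)
    t = [abs(wi) + eta for wi in w]                   # elementwise scale T
    tg = [ti * gi for ti, gi in zip(t, g)]            # T g
    norm = math.sqrt(sum(x * x for x in tg)) + 1e-12  # ||T g||, guarded against 0
    eps = [rho * ti * x / norm for ti, x in zip(t, tg)]
    w_adv = [wi + ei for wi, ei in zip(w, eps)]       # worst-case neighbour
    g_adv = grad_fn(w_adv)                            # sharpness-aware gradient
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]
```

Calling `asam_step(w, toy_loss_grad)` repeatedly drives the toy loss down. Setting `rho=0.0` removes the perturbation and reduces the update to plain gradient descent, which is a convenient sanity check.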

Motivation

Achievement

Figure 2: Average episode reward as a function of the noise strength for SHAC, SHAC-ASAM, and PPO

Comparison of average episode reward across action noise strengths

  1. Improved robustness: SHAC-ASAM tolerates significantly more action noise than standard SHAC. In the Ant and Humanoid environments in particular, performance degrades only slightly as the noise increases.
  2. Generalization: it reaches generalization comparable to the zeroth-order method (PPO) while keeping the sample efficiency of first-order methods.
Figure 3: Average episode reward as a function of the contact Coulomb friction for SHAC, SHAC-ASAM, and PPO

Performance comparison across contact friction coefficients

  3. Handling environment variability: improved adaptation to changes in environment parameters such as Coulomb friction.
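
The robustness sweeps summarized in this section can be sketched as follows. This is a toy stand-in rather than the paper's evaluation code: the 1-D point mass, the additive Gaussian action noise, the crude Coulomb-style friction term, and the names `rollout`, `pd_policy`, and `sweep` are all assumptions made for illustration. The summary does not specify the noise model, and the paper evaluates on environments such as Ant and Humanoid.

```python
import random


def rollout(policy, noise_std, friction, steps=200, seed=0):
    """Total reward of one episode on a hypothetical 1-D point mass."""
    rng = random.Random(seed)
    x, v, dt = 1.0, 0.0, 0.05
    total = 0.0
    for _ in range(steps):
        # Assumed noise model: additive Gaussian noise on the action.
        a = policy(x, v) + rng.gauss(0.0, noise_std)
        # Crude Coulomb-style friction: constant magnitude, opposes motion.
        if v > 0:
            fr = -friction
        elif v < 0:
            fr = friction
        else:
            fr = 0.0
        v += (a + fr) * dt
        x += v * dt
        total += -(x * x)  # reward for staying near the origin
    return total


def pd_policy(x, v):
    """Hypothetical hand-tuned policy standing in for a trained one."""
    return -4.0 * x - 2.0 * v


def sweep(policy, noise_levels, friction=0.1, episodes=5):
    """Average episode reward for each action-noise strength."""
    results = {}
    for s in noise_levels:
        rewards = [rollout(policy, s, friction, seed=k) for k in range(episodes)]
        results[s] = sum(rewards) / len(rewards)
    return results
```

A friction sweep follows the same pattern: hold `noise_std` fixed and loop over values of `friction` instead. Plotting the resulting averages for each policy would give curves analogous in form to Figures 2 and 3.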

How

Figure 1: Average episode reward heatmaps for SHAC (left) and PPO (right) policies under varying noise conditions

Heatmaps of policy performance under varying noise conditions

Originality

Limitation & Further Study

Evaluation

Overall: By combining SHAC and ASAM, this is a practical approach that effectively balances sample efficiency and robustness in policy learning with differentiable simulators, but it still needs validation on real robots and a stronger theoretical analysis.

Related papers worth reading

Foundational work
688 covers generalization evaluation in offline reinforcement learning and robustness experiments across varied environments, so it serves as an experimental reference when applying the sharpness-aware minimization of 422.
Foundational work
Provides a theoretical basis for the generalization of robot policies and for the use of LLMs and scaling strategies in RL training.
Alternative approach
The paper "Iterative Distillation for Reward-Guided Fine-Tuning of Diff" explores the generalization of diffusion models through reward-guided fine-tuning, which makes it an alternative to this paper's application of SAM to robot policy optimization.
Alternative approach
Paper 395 describes a different approach that embeds safe reinforcement learning policies directly through control barrier functions.
Alternative approach
Shows two distinct physics-based approaches to robot control: reinforcement learning (SAM-based) and sensorless torque control with PINN-UKF.
Follow-up work
Uses LLM- and RL-based task automation to extend the limits of policy optimization and generalization more experimentally, from an agent-design perspective.
Follow-up work
Points to directions for RL reward optimization methods similar to SHAC-ASAM, such as inference-time reward and alignment improvement techniques.
Follow-up work
Focuses on building datasets for generalizing and sharing real-world policies such as robot locomotion, and can be tied directly to the hierarchical locomotion control proposed in 010 and its practical applicability.