Residual Off-Policy RL for Finetuning Behavior Cloning Policies

Authors: Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, Anusha Nagabandi | Date: 2025-09-23 | URL: https://arxiv.org/abs/2509.19301


Essence

Fig. 2. Off-policy residual fine-tuning (ResFiT): A two-phase approach using online RL to improve BC policies. First, we

The paper applies residual off-policy RL on top of a Behavior Cloning (BC) policy to improve manipulation policies sample-efficiently, and achieves the first real-time RL training on a high-degree-of-freedom bipedal robot.
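As a rough illustration of the residual idea summarized above, the sketch below keeps a base policy fixed and adds a small learned correction to its action. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `compose_residual_policy`, `toy_base`, and `toy_residual`, the `scale` factor, the clipping bounds, and the choice to feed the base action to the residual are all hypothetical.

```python
from typing import Callable, List, Sequence

Action = List[float]
BasePolicy = Callable[[Sequence[float]], Action]
ResidualPolicy = Callable[[Sequence[float], Sequence[float]], Action]


def compose_residual_policy(
    base: BasePolicy,
    residual: ResidualPolicy,
    scale: float = 0.25,   # assumption: bounds how far the correction can move the action
    low: float = -1.0,     # assumption: actions live in [low, high]
    high: float = 1.0,
) -> Callable[[Sequence[float]], Action]:
    """Return a policy that outputs base action + scaled residual, clipped to bounds."""

    def act(obs: Sequence[float]) -> Action:
        # The base policy is only queried, never updated (treated as a black box).
        base_action = base(obs)
        # Assumption: the residual sees both the observation and the base action.
        correction = residual(obs, base_action)
        if len(correction) != len(base_action):
            raise ValueError("residual and base action dimensions differ")
        combined = [a + scale * d for a, d in zip(base_action, correction)]
        return [min(high, max(low, x)) for x in combined]

    return act


# Hypothetical stand-ins for a trained BC policy and a learned residual.
def toy_base(obs: Sequence[float]) -> Action:
    return [0.5 * x for x in obs]


def toy_residual(obs: Sequence[float], base_action: Sequence[float]) -> Action:
    return [1.0 for _ in base_action]


policy = compose_residual_policy(toy_base, toy_residual, scale=0.25)
action = policy([1.0, -1.0, 4.0])
```

In this toy setup only the residual would be trained by the RL algorithm. Setting `scale` to zero recovers the base policy exactly, which is one reason a residual parameterization is a conservative starting point for fine-tuning.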

Motivation

Achievement

Fig. 5. Success rates of different approaches on our simulation tasks, showing ResFiT converging to high-performing poli

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: By effectively combining BC and off-policy RL through residual learning, the paper presents a practical path toward real-time learning on high-degree-of-freedom robots. The generality of its black-box approach and the first demonstration of RL on a humanoid make a meaningful contribution to the field of robot learning.
