Authors: Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov | Date: 2024-12-17 | URL: https://arxiv.org/abs/2412.12953
Figure 1: The proposed MoDE architecture (left) uses a transformer with causal masking, where each
MoDE is an efficient imitation learning policy that applies a Mixture-of-Experts architecture to Diffusion Policy. Using noise-conditioned routing and noise-conditioned self-attention, it achieves higher performance with 90% fewer FLOPs while reducing parameters by 40%.
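To make the idea of noise-conditioned routing concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the sinusoidal noise embedding, the single linear router layer, the top-k renormalization, and all names (`noise_embedding`, `NoiseConditionedRouter`, `routing_table`) are assumptions chosen for illustration. The one property it tries to show is that the router sees only the noise level, not the tokens, so the expert choice for each denoising step can be computed once and cached.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def noise_embedding(sigma, dim=8):
    """Hypothetical sinusoidal embedding of the log noise level (assumption)."""
    half = dim // 2
    freqs = [math.exp(-math.log(1000.0) * i / half) for i in range(half)]
    x = math.log(sigma)
    return [math.sin(x * f) for f in freqs] + [math.cos(x * f) for f in freqs]


class NoiseConditionedRouter:
    """Toy router whose only input is the noise level (illustrative only)."""

    def __init__(self, num_experts=4, top_k=2, dim=8, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.dim = dim
        # Random weights stand in for a trained linear routing layer.
        self.weights = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)
        ]

    def route(self, sigma):
        """Return [(expert_index, gate_weight), ...] for the top-k experts."""
        emb = noise_embedding(sigma, self.dim)
        logits = [sum(w * e for w, e in zip(row, emb)) for row in self.weights]
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        top = top[: self.top_k]
        # Renormalizing over the selected experts is an assumption of this sketch.
        norm = sum(probs[i] for i in top)
        return [(i, probs[i] / norm) for i in top]


# Routing depends only on sigma, so it can be tabulated before inference.
router = NoiseConditionedRouter()
sigmas = [80.0, 10.0, 1.0, 0.1]  # hypothetical noise schedule
routing_table = {s: router.route(s) for s in sigmas}
```

At inference time, a sketch like this would look up `routing_table[sigma]` at each denoising step instead of running the router again. That lookup is the kind of pre-computation a noise-only router permits.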
Figure 2: After training MoDE, the router is noise-conditioned, allowing pre-computation of the
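Overall assessment: MoDE is a strong contribution. Its central idea, noise-conditioned routing, substantially improves the computational efficiency of Diffusion Policy while also raising performance. The claims are backed by extensive experiments and ablation studies, but the work would benefit from a stronger theoretical foundation and evaluation across a wider range of domains.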