Semantic Co-Speech Gesture Synthesis and Real-Time Control for Humanoid Robots

Author: Gang Zhang | Date: 2025-12-19 | DOI: 10.48550/arXiv.2512.17183


Essence

Figure 1: System Overview: Training and Inference Pipeline.

This study presents an end-to-end framework that generates semantically appropriate gestures from speech input and deploys them on a humanoid robot in real time. It integrates gesture generation based on an LLM and Motion-GPT with MotionTracker, an imitation-learning-based control policy, to achieve meaningful nonverbal communication.

Motivation

Achievement

Figure 2: Comparison of Original vs. Reconstructed G1 Mo-

How


Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper is a meaningful study that integrates speech-driven semantic gesture generation with real-time robot deployment, creatively combining an LLM, Motion-GPT, and imitation learning into a complete end-to-end pipeline. However, the evaluation needs to be more quantitative, and robustness still needs to be verified across diverse environments.
