📄 PaperBytes

Weekly AI Papers - 2026-06-01

📄 10 papers 🏛️ Big Tech 10 🔥 Trending 3
1
🏛️ Big Tech 🔥 Trending 208+
Microsoft Research

🚀 "The era of hand-crafting LLM agent skills is over: skills are becoming a self-evolving learning mechanism!"

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

🏛️ Affiliation: Microsoft Research (Big Tech)

🏷️ Keywords: skill optimization, text-space optimizer, self-evolving agent, bounded edits, validation-driven learning

💭 Have you ever asked yourself:

  • “What does it take to improve an LLM agent's skills?”
  • “How should a skill be ‘trained’ so that performance actually goes up?”
  • “When a skill update fails, how do you recover?”

[Key idea: Skills used to be tuned by hand, generated once, or improved through limited self-evolution. This paper introduces a separate optimizer model that works in text space and learns by ‘editing’ the skill document. An edit is kept only when evaluation shows a performance gain, and the approach is described as adding no model calls at inference time.]

Particularly noteworthy:

  • On GPT-5.5 in direct chat mode, average accuracy rises +23.5 points over the no-skill baseline
  • +24.8 points inside the Codex agent loop and +19.1 points inside Claude Code
  • Best or tied-best in all 52 (model × benchmark × execution environment) cells, ranking first across 7 target models and 6 benchmarks

🎯 Why is this a game changer?

“Tuning skills by hand, or relying on a trained model” → “an automated evolution system in which a separate text-space optimizer edits the skill document and accepts only performance gains”
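
The accept-only-if-better idea can be illustrated with a minimal sketch. This is not SkillOpt's implementation: `evolve_skill`, the toy `validate` scorer, and the hand-written `edits` are all hypothetical stand-ins for the optimizer model and the validation benchmark.

```python
from typing import Callable, List, Tuple

def evolve_skill(
    skill: str,
    edits: List[Callable[[str], str]],
    validate: Callable[[str], float],
) -> Tuple[str, List[bool]]:
    """Apply candidate edits one at a time; keep an edit only if the score improves."""
    best, best_score = skill, validate(skill)
    accepted_log = []
    for edit in edits:
        candidate = edit(best)               # propose an edit to the current best document
        cand_score = validate(candidate)     # score the edited document
        accepted = cand_score > best_score   # strict improvement is required
        accepted_log.append(accepted)
        if accepted:
            best, best_score = candidate, cand_score
    return best, accepted_log

# Toy validator (assumption): fraction of checklist phrases found in the skill text.
CHECKLIST = ("run tests", "cite sources")

def validate(doc: str) -> float:
    return sum(phrase in doc for phrase in CHECKLIST) / len(CHECKLIST)

# Hypothetical edits: two helpful additions and one destructive rewrite.
edits = [
    lambda d: d + "\n- always run tests",
    lambda d: "",                            # wipes the document, so it should be rejected
    lambda d: d + "\n- cite sources",
]

final_skill, log = evolve_skill("# Skill", edits, validate)
print(log)
print(final_skill)
```

A rejected edit leaves the document untouched, which gives one simple answer to the "how do you recover from a failed update" question above: the previous version is never overwritten.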

2
🏛️ Big Tech 🔥 Trending 106+
Microsoft

🚀 “3.8B parameters beating 6B+ models: has AI training found its ‘power-saving mode’?”

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

🏛️ Affiliation: Microsoft (Big Tech)

🏷️ Keywords: text-to-image, training efficiency, compact model, semantic VAE, GPT-4 captioning

💭 Have you ever asked yourself:

  • “We assumed 6B models were better, so how does a 3.8B model reach SOTA faster and at lower cost?”
  • “How much does ‘data density’ affect performance when training T2I models?”
  • “Generating a 1024×1024 image in about 3 seconds on a single GPU: is that realistic?”

[Key idea: Until now it took a large model to deliver both performance and efficiency; in this paper a smaller model reaches SOTA performance faster and at lower cost.]

Particularly noteworthy:

  • The 3.8B-parameter model matches Z-Image (6B+) with only 19.3% of the training compute
  • 1024×1024 generation takes 3.15 s on a single NVIDIA H100 GPU; the distilled version finishes in 0.84 s (4-step inference)

🎯 Why is this a game changer?

“A performance race among large models → an efficiency race among small models”

Models of 6B+ parameters used to set the bar for T2I quality. This paper presents a ‘new efficiency paradigm’: a 3.8B model surpasses SOTA with training resources cut to 19.3%, and inference speed improves 3.7× as well.
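
As a quick sanity check on the numbers, the sketch below assumes the 3.7× speedup compares the base model's 3.15 s latency with the distilled model's 0.84 s. The digest does not say which two figures the ratio comes from, so treat that pairing as an assumption.

```python
def speedup(base_latency_s: float, fast_latency_s: float) -> float:
    """Ratio of two per-image latencies (how many times faster the second one is)."""
    return base_latency_s / fast_latency_s

def throughput(latency_s: float) -> float:
    """Images per second for a single GPU generating one image at a time."""
    return 1.0 / latency_s

base, distilled = 3.15, 0.84  # seconds per 1024x1024 image, as quoted in the digest
print(f"speedup: {speedup(base, distilled):.2f}x")
print(f"throughput: {throughput(base):.2f} -> {throughput(distilled):.2f} images/s")
```

Under that assumption the ratio is 3.75, which is consistent with the quoted 3.7× if the digest truncated rather than rounded.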

3
🏛️ Big Tech 🔥 Trending 407+
NVIDIA

🎮 “Can AI handle a game where several people move at once with a single ‘world model’?”

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

🏛️ Affiliation: NVIDIA (Big Tech)

🏷️ Keywords: multi-agent world modeling, rotary encoding, sparse attention, diffusion distillation, real-time generation

💭 Have you ever asked yourself:

  • “AI world models support only two players; can they be extended to four?”
  • “In a game where every player affects every other, can each player's actions be controlled independently?”
  • “Can AI control several characters at once without performance dropping?”

Single-agent world models have been the mainstream, and environments where several agents move at once were handled with a costly ‘all-to-all’ attention structure.

This paper uses ‘Simplex Rotary Agent Encoding’ and ‘Sparse Hub Attention’ to control agents independently while cutting compute cost to linear, which makes real-time generation at 24 FPS possible.

Particularly noteworthy:

  • In 4-player settings, **video reliability improves by 38%** over earlier slot-based models
  • **Cross-agent attention cost drops from quadratic to linear**, so real-time processing remains possible even with 10 agent viewpoints

🎯 Why is this a game changer?

Old approach → “a dense structure in which every agent attends to every other agent”

New approach → “learnable hub tokens relay attention between agents, cutting compute cost to linear”
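
To make the quadratic-versus-linear claim concrete, here is a minimal counting sketch. It is not the paper's Sparse Hub Attention: it only counts agent-level attention links, ignores the number of tokens per agent, and assumes each hub both reads from and writes to every agent. The function names and the default of one hub are illustrative.

```python
def dense_links(num_agents: int) -> int:
    """All-to-all: every agent attends to every agent (including itself)."""
    return num_agents * num_agents

def hub_links(num_agents: int, num_hubs: int = 1) -> int:
    """Hub relay: agents -> hubs to gather, then hubs -> agents to broadcast."""
    return 2 * num_agents * num_hubs

for n in (2, 4, 10, 100):
    print(f"{n:>3} agents: dense={dense_links(n):>5}  hub={hub_links(n):>4}")
```

With a fixed number of hubs the link count grows in proportion to the number of agents, while the dense count grows with its square; for very small agent counts the two are comparable.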

View paper → Fangfu Liu, Kai He, Tianchang Shen, and 7 others
4
🏛️ Big Tech
Tencent

🤖 "The real future of multimodal modeling is not ‘natural integration’ but ‘baking modalities into the model's own DNA’"

Toward Native Multimodal Modeling: A Roadmap

🏛️ Affiliation: Tencent (Big Tech)

🏷️ Keywords: native multimodal modeling, architectural nativity, multi-to-text, multi-to-target, multi-to-multi

💭 Have you ever asked yourself:

  • “Why do image+text models actually do better?”
  • “Is it impossible to merge every modality into one model?”
  • “Can image generation models really break away from text generation models?”

[Key idea: ‘Late fusion’, which merges modalities at the end of the pipeline, has been the mainstream. This paper shifts the paradigm to ‘native multimodal modeling (NMM)’, which integrates modalities intrinsically.]

Particularly noteworthy:

  • **A three-way taxonomy of native architectures** (Multi-to-Text, Multi-to-Target, Multi-to-Multi) is defined systematically, with each one's way of handling modalities clearly distinguished
  • **An end-to-end pipeline** is laid out at industrial grade, covering architecture coordination, large-scale data collection, a full-stack training recipe, inference/deployment, and evaluation

🎯 Why is this a game changer?

**“Late fusion that merges modalities at the end” → “native multimodal modeling that integrates modalities intrinsically inside the architecture”**

5
🏛️ Big Tech
NVIDIA

🖼️ “Why does ‘far’ look like ‘up’? The hidden cause of VLMs' ‘spatial errors’, revealed!”

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

🏛️ Affiliation: NVIDIA (Big Tech)

🏷️ Keywords: spatial reasoning, representation disentanglement, perspective bias, VLM, shortcut bias

💭 Have you ever asked yourself:

  • “Why are ‘up’ and ‘far’ in an image perceived as the same direction?”
  • “Does the model get the answer right because it understands space, or because of simple statistical patterns?”
  • “Do all VLMs perceive space the same way, or do differences in internal structure explain differences in performance?”

[Key idea: Strong spatial-reasoning scores from VLMs used to be read as ‘3D structure understanding’. This paper shows that the models instead rely on a simple statistical cue: the viewpoint bias of photographs.]

Particularly noteworthy:

  • **Models confuse ‘vertical position’ with ‘distance’ consistently across all VLMs, and where this bias is present, accuracy can drop by 15% or more**
  • **Scaling up data strengthens the bias rather than removing it, and the ‘natural image bias’ of existing test sets is observed to distort apparent accuracy gains**

🎯 Why is this a game changer?

**Old approach → judge spatial reasoning ability by ‘accuracy gains’ alone**

**New approach → measure ‘whether spatial axes are disentangled’, revealing a direct link between a model's internal structure and its reliability**
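
One generic way to quantify whether two concept directions are entangled is the cosine similarity between them. The sketch below is not the paper's probing method. The vectors are made-up placeholders for ‘up’ and ‘far’ directions that a linear probe might recover from a model's hidden states.

```python
import math
from typing import Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical probe directions (illustrative values only).
up_direction = [0.0, 1.0, 0.0]
far_disentangled = [0.0, 0.0, 1.0]   # 'far' is orthogonal to 'up'
far_entangled = [0.0, 1.0, 1.0]      # 'far' partly reuses the 'up' axis

print(cosine(up_direction, far_disentangled))
print(cosine(up_direction, far_entangled))
```

A value near zero would suggest the two axes are represented separately; a value near one would suggest the model treats ‘far’ and ‘up’ as nearly the same thing.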

6
🏛️ Big Tech
Tencent Hunyuan

🤖 "Did it need more depth than you thought? For a VLM to act in the physical world ‘without falling apart’, this is essential."

GEM: Generative Supervision Helps Embodied Intelligence

🏛️ Affiliation: Tencent Hunyuan (Big Tech)

🏷️ Keywords: embodied vision-language model, generative supervision, depth map generation, action planning, robotic execution

💭 Have you ever asked yourself:

  • “If a VLM only understands in words, the robot doesn't move. Why?”
  • “What data does a VLM need in order to understand physical space?”
  • “If a robot has no sense of ‘depth’, how can it be kept from ‘failing’?”

[Key idea: VLMs used to reason in text, with physical-space understanding supplemented by external data. This paper integrates a generative supervision signal, ‘depth map generation’, directly into VLM training, strengthening physical-space understanding and action execution at the same time.]

Particularly noteworthy:

  • A model trained on the GEM-4M dataset achieves an average **23.7% improvement** over the previous best results on four representative embodied-intelligence benchmarks.
  • Tested in real-world settings, the GEM-VLA model records a **72.1% success rate** against the existing baseline, bringing the failure rate down to 27.9%.

🎯 Why is this a game changer?

“Text-based VLMs had to supplement physical space with external data” → “the VLM generates depth maps itself and learns space and action together”

7
🏛️ Big Tech
alibaba-inc

🧠 “LLM memory errors are a problem that is ‘hard for people to find’... and this paper ‘traces and fixes them automatically’!”

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

🏛️ Affiliation: alibaba-inc (Big Tech)

🏷️ Keywords: memory tracing, error attribution, LLM memory systems, operational information flow, closed-loop optimization

💭 Have you ever asked yourself:

  • “Why does my LLM spit out errors on long contexts?”
  • “How do you trace the cause of a failure in memory systems such as RAG or Long-Context?”
  • “Is there a system that doesn't just find errors but ‘fixes’ them?”

[Key idea: Memory errors used to be traced and analyzed by hand. This paper converts the memory flow into an ‘executable graph’, traces the cause of each error automatically, and uses the result to optimize prompts automatically, building a closed-loop system that raises performance.]

Particularly noteworthy:

  • Precise analysis of memory failure causes **improved end-task performance by 7.62%**.
  • MemTraceBench was used to identify **systematic error patterns across four representative memory systems (Long-Context, RAG, Mem0, EverMemOS)**.

🎯 Why is this a game changer?

**Manual error analysis → a closed loop of automatic memory-flow tracing + root-cause analysis + prompt optimization**

Memory errors now turn from ‘debugging pain’ into ‘an opportunity to improve performance’.
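
The idea of attributing an error to a step in the memory flow can be sketched as follows. This is not MemTrace's algorithm: the stage names, the substring check, and the example trace are hypothetical, and a real system would need a far richer notion of what information a stage carries.

```python
from typing import List, Optional, Tuple

Trace = List[Tuple[str, str]]  # (stage name, text produced by that stage)

def attribute_error(trace: Trace, gold_fact: str) -> Optional[str]:
    """Return the first stage whose output no longer contains the needed fact."""
    for stage, output in trace:
        if gold_fact not in output:
            return stage
    return None  # the fact survived every stage

# Hypothetical trace of one question through a retrieval-style memory system.
trace = [
    ("store", "Alice moved to Paris in 2021."),
    ("retrieve", "Alice likes green tea."),      # the relevant memory was not retrieved
    ("answer", "Alice lives in London."),
]

print(attribute_error(trace, "Paris"))
```

Here the wrong final answer is blamed on retrieval rather than on the answering step, which is the kind of distinction that tells you which prompt or component to adjust.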

8
🏛️ Big Tech
Tencent Hunyuan

🎨 “Less an AI that draws pictures... more an artist manipulating the canvas with code?”

GenClaw: Code-Driven Agentic Image Generation

🏛️ Affiliation: Tencent Hunyuan (Big Tech)

🏷️ Keywords: code-driven, agentic image generation, visual reasoning, multimodal agent, executable sketching

💭 Have you ever asked yourself:

  • “When AI generates an image, why do I have to ‘rewrite the prompt again’?”
  • “Why can't the AI ‘directly manipulate’ the picture I want? I want to use it like a paintbrush!”
  • “Why is image generation a ‘black box’? What if the canvas could be manipulated directly with code?”

[Key idea: AI used to generate images from text, and improving the result meant rewriting the prompt over and over. This paper turns the AI into an ‘artist that can be steered with code’, moving through concept → sketch → coloring to reproduce a human-like creative workflow.]

Particularly noteworthy:

  • When code (for example SVG, HTML, Three.js) was used to produce visual sketches, 85% of users experienced “intuitive and controllable results”.
  • At the final image generation stage, texture and photorealism improved 3.7× over existing models in 92% of cases.

🎯 Why is this a game changer?

“Black-box prompt iteration” → “a code-controllable visual canvas combined with a generative model”
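
To show what a code-driven sketch can look like, here is a minimal example that emits SVG from a list of shapes. It is an illustration only, not GenClaw's pipeline: the `svg_sketch` helper and the shape dictionaries are hypothetical, and only the standard SVG `circle` and `rect` elements are used.

```python
from typing import Dict, List

def svg_sketch(width: int, height: int, shapes: List[Dict]) -> str:
    """Render a list of simple shape dictionaries as an SVG document string."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for s in shapes:
        if s["kind"] == "circle":
            parts.append(f'  <circle cx="{s["cx"]}" cy="{s["cy"]}" r="{s["r"]}" fill="{s["fill"]}"/>')
        elif s["kind"] == "rect":
            parts.append(
                f'  <rect x="{s["x"]}" y="{s["y"]}" width="{s["w"]}" height="{s["h"]}" fill="{s["fill"]}"/>'
            )
        else:
            raise ValueError(f"unsupported shape kind: {s['kind']}")
    parts.append("</svg>")
    return "\n".join(parts)

# A rough layout: ground strip plus a sun. Moving the sun means changing two numbers.
layout = [
    {"kind": "rect", "x": 0, "y": 160, "w": 320, "h": 80, "fill": "green"},
    {"kind": "circle", "cx": 260, "cy": 50, "r": 30, "fill": "gold"},
]
sketch = svg_sketch(320, 240, layout)
print(sketch)
```

Because the layout is data, a correction such as "move the sun to the left" becomes an edit to `cx` rather than another round of prompt rewriting.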

9
🏛️ Big Tech
alibaba-inc

🧠 “The way LoRA remembers: is that really memory? A mathematically proven ‘memory law’ is here!”

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

🏛️ Affiliation: alibaba-inc (Big Tech)

🏷️ Keywords: LoRA, Parametric Memory, Power Law, Fine-tuning, Token-level Recall

💭 Have you ever asked yourself:

  • When tuning with LoRA, why are some tokens forgotten completely while others are always remembered?
  • What formula defines ‘memory’ inside the model?
  • If memory capacity differs per token, how should it be allocated efficiently?

[Key idea: LoRA's memory capacity and operating mechanism had not been quantified before. This paper proposes a ‘parametric memory law’ that links parameter count and sequence length.]

Particularly noteworthy:

  • In token-level analysis, when the predicted probability p > 0.5, exact recall under greedy decoding reaches 98.7%
  • With the MemFT strategy, memory reliability improves by 17.3%, and reallocating the training budget raises compute efficiency by 22.1%

🎯 Why is this a game changer?

Conventional LoRA ignores per-token differences in memorability and trains uniformly → this paper reallocates the budget dynamically per token, based on the ‘p > 0.5’ threshold
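
The p > 0.5 threshold has a simple justification: if one token holds more than half of the probability mass, no other token can exceed it, so greedy decoding must select it. The sketch below illustrates that fact plus a naive way to flag tokens still below the threshold. It is not MemFT, whose procedure the digest does not describe, and the helper names are hypothetical.

```python
from typing import Dict, List

def greedy_pick(probs: Dict[str, float]) -> str:
    """Greedy decoding for one step: choose the highest-probability token."""
    return max(probs, key=probs.get)

def below_threshold(target_probs: List[float], threshold: float = 0.5) -> List[int]:
    """Positions whose target-token probability has not yet passed the threshold."""
    return [i for i, p in enumerate(target_probs) if p <= threshold]

# With p(target) > 0.5, the target wins regardless of how the rest is distributed.
step = {"Paris": 0.51, "London": 0.30, "Rome": 0.19}
print(greedy_pick(step))

# Probability assigned to the correct token at each position of a memorized sequence.
target_probs = [0.90, 0.40, 0.55, 0.50]
print(below_threshold(target_probs))
```

Positions returned by `below_threshold` are the ones a per-token budget scheme could prioritize, since the others are already recalled under greedy decoding.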

10
🏛️ Big Tech
Microsoft Research

🧠 “Even if AI builds skills from accumulated experience... are those skills actually useful?”

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

🏛️ Affiliation: Microsoft Research (Big Tech)

🏷️ Keywords: skill extraction, model-generated skills, agent evaluation, negative transfer, utility-grounded framework

💭 Have you ever asked yourself:

  • “Do model-generated skills actually help?”
  • “Even when one model extracts skills well, why does performance drop when another model uses them?”
  • “Does skill quality depend on model scale or on baseline task strength?”

[Key idea: Skill extraction and skill consumption used to be evaluated separately. This paper systematically analyzes the full lifecycle (experience generation → extraction → consumption) to measure ‘real utility’.]

Particularly noteworthy:

  • **Model-generated skills improved performance by 15.2% on average**, but **the negative transfer rate reached up to 37.1%**, with abnormal performance drops
  • **More than 100 skill pairs were tested across 5 diverse agentic domains**, and the asymmetry between extractors and consumers (for example, a strong extractor can be a weak consumer) was confirmed in 100% of cases

🎯 Why is this a game changer?

“The size or capability of the model that creates the skill is what matters” → “the skill's real utility (the practical value it delivers to its user) is what matters”
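
A negative transfer rate can be computed as the share of runs where adding a skill makes the consumer worse than its no-skill baseline. That definition is an assumption made for illustration (the digest does not give the paper's exact formula), and the scores below are made-up numbers.

```python
from typing import List, Tuple

def negative_transfer_rate(results: List[Tuple[float, float]]) -> float:
    """Fraction of (baseline, with_skill) pairs where the skill lowered the score."""
    if not results:
        raise ValueError("results must not be empty")
    worse = sum(1 for baseline, with_skill in results if with_skill < baseline)
    return worse / len(results)

def mean_gain(results: List[Tuple[float, float]]) -> float:
    """Average change in score from adding the skill."""
    return sum(with_skill - baseline for baseline, with_skill in results) / len(results)

# Hypothetical (baseline, with_skill) accuracies for four extractor/consumer pairs.
results = [(0.50, 0.60), (0.70, 0.60), (0.40, 0.40), (0.80, 0.90)]
print(negative_transfer_rate(results))
print(round(mean_gain(results), 3))
```

Reporting both numbers matters because a positive average gain can hide a sizeable share of pairs that got worse, which is the pattern the highlights above describe.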
