📄 PaperBytes

Weekly AI Papers: 2026-06-08

📄 10 papers 🏛️ Big Tech: 10
1
🏛️ Big Tech
NVIDIA

🤖 "Want to feed every kind of sensor data into one machine? Now it is actually possible."

Cosmos 3: Omnimodal World Models for Physical AI

🏛️ Affiliation: NVIDIA (Big Tech)

🏷️ Keywords: omnimodal, world model, physical AI, mixture-of-transformers, embodied agent

💭 Ever asked yourself these questions?

  • Can "video + audio + text + action" all be handled by one model?
  • Can a robot both "understand" and "generate" its environment with a single model?
  • Are existing video generation models too slow, and robot policy models too limited?

[Key idea: Until now, each modality (video, audio, text, and so on) was handled by its own model, and robot actions needed a separate policy model. This paper unifies all modalities in a single mixture-of-transformers architecture, putting "physical AI" on one backbone.]

Highlights:

  • **Ranked #1 among Text-to-Image and Image-to-Video models**: the top open-source model as evaluated by Artificial Analysis, with 2.3x higher generation quality than the previous best model
  • **Best policy model on RoboArena**: scores 1.8x higher than the previous best model, showing training performance well suited to policy learning for physical agents

🎯 Why is this a game changer?

**Separate models that each handle one modality independently → one world model that processes every modality at once in a unified architecture**

2
🏛️ Big Tech
HUAWEI Computing Systems Lab

🧠 "Why did KV cache compression fail? Because token-scale errors accumulate!"

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

🏛️ Affiliation: HUAWEI Computing Systems Lab (Big Tech)

🏷️ Keywords: KV-cache quantization, error accumulation, variance normalization, autoregressive decoding, test-time scaling

💭 Ever asked yourself these questions?

  • "Why does accuracy actually drop when the KV cache is compressed to 2 bits?"
  • "The longer the generation, the less stable the model gets... why?"
  • "If test-time scaling works so well, why does the cache run into memory limits?"

[Key idea: Earlier methods tried to reduce error by simply normalizing or rescaling the KV cache. This paper pinpoints the real problem, token-scale errors that accumulate during autoregressive decoding and degrade performance, and addresses it at the root with variance normalization built on a Hadamard rotation plus dual scaling.]

Highlights:

  • MATH500: accuracy rises to **72.3%** from the previous best of 62.1% (2-bit precision)
  • HumanEval: accuracy rises to **57.8%** from the previous best of 49.2% (2-bit precision)
  • AIME24: accuracy rises to **46.1%** from the previous best of 38.7% (2-bit precision)

🎯 Why is this a game changer?

"Simple scaling → variance normalization + Hadamard rotation"

(Earlier methods ignored token-scale errors or handled them only partially. KVarN detects them precisely and dynamically adjusts each token's scale on a histogram basis, cutting off accumulated error at the source.)
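To make the Hadamard rotation idea concrete, here is a minimal Python sketch. It is not KVarN itself: the paper's dual scaling and variance normalization are not reproduced, and the function names (`fwht`, `quantize_dequantize`), the 8-channel toy token, and the per-token min-max quantizer are all assumptions chosen for illustration. It only shows why rotating a token that has one outlier channel before 2-bit quantization can shrink the reconstruction error.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

def quantize_dequantize(vec, bits=2):
    """Per-token asymmetric min-max quantization, immediately dequantized."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant vector
    return [lo + round((v - lo) / scale) * scale for v in vec]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical key/value vector for one token, with one outlier channel.
token = [8.0, 0.1, -0.2, 0.3, 0.1, -0.1, 0.2, -0.3]

# Path 1: quantize the raw token.
plain_err = mse(token, quantize_dequantize(token))

# Path 2: rotate, quantize, rotate back (the orthonormal transform is its own inverse).
restored = fwht(quantize_dequantize(fwht(token)))
rotated_err = mse(token, restored)

print(f"plain 2-bit MSE:   {plain_err:.4f}")
print(f"rotated 2-bit MSE: {rotated_err:.4f}")
```

On this toy token the rotated path reconstructs with a much smaller mean squared error, because the rotation spreads the single outlier evenly across channels and leaves the quantizer a narrow range to cover.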

3
🏛️ Big Tech
Tencent

🧠 "Even a simulated future can be wrong? Then why use it at all?"

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

🏛️ Affiliation: Tencent (Big Tech)

🏷️ Keywords: world model, multimodal LLM, controlled reasoning, self-distillation, future simulation

💭 Ever asked yourself these questions?

  • "If a visualized prediction of the future can be wrong, why use it?"
  • "LLM reasoning is fine, but what do you do when the visualized future turns out to be wrong?"
  • "If simulating the future is so hard, why not just reason instead?"

[Key idea: Earlier approaches simply trusted the simulation output. This paper flips that with "controlled concrete reasoning," which balances simulation accuracy against reasoning consistency.]

Highlights:

  • VRQABench: **10.6% higher score** than the previous best
  • OpenWorldQA: **10.9% higher score** than the previous best
  • **Greater robustness** to simulation errors: the model can still reach the correct answer when the simulated future is wrong

🎯 Why is this a game changer?

"Trust the simulation output and stop there" → "Verify the simulation and combine it with reasoning to reach an accurate conclusion"

4
🏛️ Big Tech
NVIDIA

🎬 "Long video generation needs 'memory'? This paper has an answer!"

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

🏛️ Affiliation: NVIDIA (Big Tech)

🏷️ Keywords: retrieval-augmented generation, video diffusion, latent history, temporal delta loss, long-horizon generation

💭 Ever asked yourself these questions?

  • "Why do faces drift as the video gets longer?"
  • "With only a sliding window, why can't the error that piles up from older frames be fixed?"
  • "If the model could remember the video it has already generated, would the results get better?"

[Key idea: Earlier methods generated with a sliding window alone, and the output grew blurrier as errors accumulated. This paper treats the latents (latent representations) it has already generated as a "searchable history" and reuses accurate information from the past.]

Highlights:

  • **+1.25 points on average on VBench-Long**: the average score is 1.25 points above the previous best result.
  • **3.8x less error accumulation**: the rate at which errors accumulate is 3.8x lower than with sliding-window methods.

🎯 Why is this a game changer?

**"Generation that accumulates error by relying on a sliding window alone" → "RAG-based generation that retrieves past latents and reuses accurate context"**
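The retrieval idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's pipeline: latents are plain lists of floats, similarity is cosine similarity, and `build_context`, `window`, and `top_k` are hypothetical names. The point is only that the conditioning context combines the most recent latents with older latents retrieved by similarity, instead of the recent window alone.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_context(history, query, window=2, top_k=1):
    """Return retrieved older latents (in temporal order) followed by the recent window.

    Assumes window >= 1.
    """
    recent = history[-window:]
    older = history[:-window]
    # Rank older latents by similarity to the query, best first.
    ranked = sorted(range(len(older)), key=lambda i: cosine(older[i], query), reverse=True)
    # Keep the top_k hits, restored to their original temporal order.
    retrieved = [older[i] for i in sorted(ranked[:top_k])]
    return retrieved + recent

# Toy latent history: the first latent matches the query direction best.
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
query = [1.0, 0.0]
context = build_context(history, query)
print(context)
```

A sliding-window baseline would pass only the last two latents; here the earliest latent is brought back into the context because it best matches the query.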

5
🏛️ Big Tech
Microsoft

🚀 "Open web agents can be trained with online RL too? 67% success rate on real websites"

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

🏛️ Affiliation: Microsoft (Big Tech)

🏷️ Keywords: online reinforcement learning, visual web agents, multi-turn RL, live-browser infrastructure, open-source agent training

💭 Ever asked yourself these questions?

  • "Why do open-source agents always lag behind in performance?"
  • "Is RL training on live websites even possible?"
  • "Can a web agent really be trained with 2,200 RL tasks?"

[Key idea: Open agents used to depend on fixed, manually collected datasets. This paper applies online multi-turn RL on real websites and redesigns the training pipeline from the ground up.]

Highlights:

  • With 0.4K initial trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B reaches a 67.0% success rate on Online-Mind2Web
  • Outperforms open agents of the same or larger size and stays competitive with OpenAI CUA and Gemini CUA

🎯 Why is this a game changer?

"Open agents that depend on fixed datasets" → "Open agents trained with online multi-turn RL on live websites"

6
🏛️ Big Tech
BAIDU

🗺️ "AI draws the roads itself? Then why do humans still have to step in?"

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

🏛️ Affiliation: BAIDU (Big Tech)

🏷️ Keywords: agentic framework, lane-level mapping, specification verification, vision-language reasoning, map editing

💭 Ever asked yourself these questions?

  • "If AI can draw roads, why do people still have to correct them?"
  • "Can AI pick out lanes accurately even when the road markings are faded?"
  • "Is automatic road generation by AI across hundreds of cities realistic?"

[Key idea: Earlier methods predicted roads directly from sensor data. This paper rethinks road generation as a "verification-driven agent loop" that applies explicit rules and constraints.]

Highlights:

  • Deployed across 360 cities, raising the overall production automation rate to 95%
  • In complex scenes (damaged markings, occlusion, and so on), accuracy improves by more than 15% over existing baselines

🎯 Why is this a game changer?

"Manual work in which people correct roads by hand → an agent system in which AI explicitly verifies the rules and applies the fixes automatically"

7
🏛️ Big Tech
ByteDance

🎤 "Amazing no matter how many times you say it... speech synthesis as natural as a conversation is possible 'zero-shot'?"

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

🏛️ Affiliation: ByteDance (Big Tech)

🏷️ Keywords: zero-shot TTS, expressive speech, dialogue synthesis, speaker-turn conditioning, diffusion model

💭 Ever asked yourself these questions?

  • Why are "choppiness" and "emotional mismatch" always a problem in conversational speech synthesis?
  • What goes wrong when dialogue is synthesized with a single-voice model?
  • Can speech synthesis keep emotion and tone intact even "zero-shot"?

[Key idea: Earlier systems had to synthesize each dialogue turn independently and stitch the pieces together. This paper handles one to four speakers zero-shot in a single model, preserving conversational flow and emotional continuity.]

Highlights:

  • On SwanBench-Speech, records a **"richness" score 25% higher than every existing open-source model**
  • **In the dialogue setting as well, achieves a "hierarchy" score 30% higher than existing models**, reproducing the layering of emotion and structure far better

🎯 Why is this a game changer?

"Synthesizing each turn independently and stitching the pieces together" → "Zero-shot synthesis in which one model conditionally controls the voices and turns of one to four speakers"

View paper → Ruiqi Li, Yu Zhang, Changhao Pan, and 3 others
8
🏛️ Big Tech
ByteDance Seed

🎨 "Images that flow like Bitcoin? A multimodal model has pulled off 'self-generation' even without a VAE!"

Representation Forcing for Bottleneck-Free Unified Multimodal Models

🏛️ Affiliation: ByteDance Seed (Big Tech)

🏷️ Keywords: Representation Forcing, Bottleneck-Free, Unified Multimodal Models, Pixel-Space Generation, VAE-Free

💭 Ever asked yourself these questions?

  • "Why does image generation need a VAE at all?"
  • "If the model can generate pixels directly, why put a VAE in between?"
  • "Is a model that is good at both image understanding and image generation really possible?"

[Key idea: Until now, a VAE handled image generation through a fixed, external latent space. This paper redesigns the architecture so that the model itself "predicts representations" and "generates pixels" without a VAE.]

Highlights:

  • In image generation, the model with Representation Forcing (RF) reaches **the same level of quality** as state-of-the-art VAE-based models (that is, it is **fully competitive**)
  • In image understanding, records an **average 5.2% improvement** over VAE-based models (better at capturing fine structure)

🎯 Why is this a game changer?

**"An architecture that inserts an external VAE and uses a fixed latent space" → "An intrinsic generation architecture that predicts representations on its own and generates pixels directly"**

View paper → Yuqing Wang, Zhijie Lin, Ceyuan Yang, and 10 others
9
🏛️ Big Tech
Samsung Research

🤖 "Even on-policy learning can break down? OPD stabilized with a trust region has arrived!"

Trust Region On-Policy Distillation

🏛️ Affiliation: Samsung Research (Big Tech)

🏷️ Keywords: On-Policy Distillation, Trust Region, Policy Gradient, KL Divergence, Token-Level Supervision

💭 Ever asked yourself these questions?

  • "Why does on-policy learning fail?"
  • "Is it inevitable that training fails when the teacher and student distributions differ?"
  • "If on-policy learning has no supervision it can trust, how can the student be trained at all?"

[Key idea: Previously the student learned from the teacher's token-level signal, but when the two distributions were too far apart the policy gradient became unstable and training failed. This paper defines a "trust region" and performs on-policy learning only where the teacher's supervision can be trusted, which greatly improves stability.]

Highlights:

  • Math reasoning tests: **12.7% higher score** than standard OPD
  • Code generation evaluation: **19.3% better than EOPD**
  • General-domain evaluation: **21.1% improvement over REOPOLD**

🎯 Why is this a game changer?

"On-policy supervision that is fragile under teacher-student distribution gaps" → "Trust region on-policy distillation that learns only inside the trust region"
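One way to picture the trust region idea is a per-token filter on teacher-student divergence. The Python sketch below is an assumption-laden illustration, not the paper's objective: it measures KL(student || teacher) at each position of a student rollout and averages the loss only over positions whose divergence stays within a threshold. The names `reverse_kl`, `trust_region_loss`, and `delta`, the threshold value, and the toy distributions are all hypothetical, and the paper's actual criterion may differ.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) for one next-token distribution.

    Assumes every teacher probability is strictly positive.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def trust_region_loss(student_dists, teacher_dists, delta=0.5):
    """Average the per-token KL over positions inside the trust region only."""
    kls = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    mask = [kl <= delta for kl in kls]  # True where supervision is kept
    kept = [kl for kl, keep in zip(kls, mask) if keep]
    loss = sum(kept) / len(kept) if kept else 0.0
    return loss, mask

# Two positions of a student rollout over a two-token vocabulary.
student = [[0.5, 0.5], [0.9, 0.1]]
teacher = [[0.6, 0.4], [0.01, 0.99]]  # the teacher strongly disagrees at position 2

loss, mask = trust_region_loss(student, teacher)
print(mask, round(loss, 4))
```

In this toy rollout the second position is dropped: the student and teacher disagree so strongly there that its divergence would dominate the loss, which is the kind of unstable signal a trust region is meant to exclude.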

View paper → Xingrun Xing, Haoqing Wang, Boyan Gao, and 2 others
10
🏛️ Big Tech
DeepMind

🎨 "With AI doing 3D modeling 'via code', the question is no longer 'whether it can' but 'how well'."

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

🏛️ Affiliation: DeepMind (Big Tech)

🏷️ Keywords: procedural 3D modeling, vision-language models, agent benchmark, code generation, 3DCodeArena

💭 Ever asked yourself these questions?

  • "If AI can build a 3D model automatically from an image or a sentence, why does writing it as code still matter more?"
  • "If VLMs can do 3D modeling, why do 90% of 12 advanced models fail?"
  • "What does it take to keep the model a user asked for from ending up with 'floating parts'?"

[Key idea: VLMs used to be evaluated only on whether they could "generate" a 3D model. This paper is the first to systematically measure the ability to "generate it as code" and the importance of the "execution environment."]

Highlights:

  • Of 12 advanced VLMs, 90% failed because of API mismatches, and even among the models that succeeded, 67% produced parts that were not connected.
  • With test-time scaling (a larger thinking budget and multi-step feedback), performance improved 2.3x on average.

🎯 Why is this a game changer?

"VLMs 'generating' 3D models" → "VLMs 'writing and executing code' for 3D modeling"
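The "disconnected parts" failure mode can be illustrated with a simple connectivity check. The Python sketch below is hypothetical and is not the benchmark's metric: it assumes each part is summarized by an axis-aligned bounding box, treats two parts as connected when their boxes overlap or touch, and counts connected components with union-find. The names `boxes_touch` and `count_components` and the toy chair are made up for illustration.

```python
def boxes_touch(a, b):
    """True if two axis-aligned boxes overlap or touch; box = (min_xyz, max_xyz)."""
    return all(a[0][i] <= b[1][i] and b[0][i] <= a[1][i] for i in range(3))

def count_components(boxes):
    """Count groups of parts that are connected through touching bounding boxes."""
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_touch(boxes[i], boxes[j]):
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(boxes))})

# Toy chair: a seat, one leg touching the seat, and a stray part floating far away.
seat = ((0.0, 0.0, 1.0), (1.0, 1.0, 1.1))
leg = ((0.0, 0.0, 0.0), (0.1, 0.1, 1.0))
stray = ((3.0, 3.0, 3.0), (4.0, 4.0, 4.0))

components = count_components([seat, leg, stray])
print(components)
```

A result greater than 1 flags a model with at least one floating part. Bounding boxes are a coarse proxy, so a real check would more likely test mesh-level contact.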
