Speculative Decoding Vllm - Search Videos

🌵 Speculative Speculative Decoding: What if your draft model could speculate while the target model is still verifying? That's the idea behind Speculative Speculative Decoding (SSD). I've been… | Maxime Labonne | 15 comments

15 views | 2 months ago

Measuring Qwen3.6-27B NVFP4 MTP on vLLM: ~190 tok/s TG on Dual RTX PRO 6000 Blackwell Max-Q

Vienna vLLM Meetup Live Stream - March 11, 2026 | Ajit Joshi

15.3K views | 2 months ago

Speculative Decoding — Think Fast⚡, Then Think Right✅

How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100

Faster LLMs: Accelerate Inference with Speculative Decoding

Multi-Token Prediction (MTP): Accelerating Local Models with no Quality Loss

1.4K views | 1 week ago

YouTube | Onchain AI Garage

Speculative Decoding: Make AI 2-3x Faster for Free | Tech Decoded

3 views | 1 month ago

Multi-Token Prediction: Why Your GPU Runs LLMs 3x Faster

4 views | 1 week ago

YouTube | Devsplainers

What is Speculative Decoding ?

38 views | 2 weeks ago

YouTube | DeepManim

Don't use speculative decoding until you watch this

7 views | 3 weeks ago

YouTube | DigitalOcean

DFlash Just Hit Google TPUs — 3x Faster LLM Inference is Now Real

3K views | 2 weeks ago

YouTube | Fahd Mirza

DFlash Drafter for Gemma 4 26B - Official Speculative Decoding is Here: Run Locally

4.8K views | 2 weeks ago

YouTube | Fahd Mirza

ContextForge — AMD AI Hackathon 2026

YouTube | Pablo Manuel Suarez

Speculation is all you need: Intro to Speculative Decoding for High Performance Inference

753 views | 2 months ago

600 Toks/Second Gemma4-26B —The Setting That Actually Wins (vLLM + Dflash Speculative Decoding)

3.4K views | 1 week ago

YouTube | Tech-Practice

Speculative Decoding: 2-3x Faster LLMs for Free

1 view | 1 month ago

YouTube | The AI Century

The regex trick that beats structured output #ai #coding #performance

1.3K views | 1 month ago

YouTube | Jimi V. (Bitswired)

3.5K+ Stars • AI/ML | DFlash — Faster LLM Inference via Block Diffusion #shorts

1.1K views | 1 week ago

YouTube | neural-nexus

Speculative Decoding • LLM Acceleration Patterns

1 view | 1 month ago

YouTube | Technical Interview Essentials A–Z

5 AI Terms Devs Are Quietly Searching More — April 2026

194 views | 3 weeks ago

YouTube | Colony-AI

Qwen3.6-27B NVFP4+MTP vLLM Benchmark TG 190tok/s — RTX PRO 6000 Blackwell Max-Q x 2

303 views | 2 weeks ago

Speculative Decoding & Inference Speed — 2-3x Faster LLMs With Zero Quality Loss

YouTube | Jeff Heidelberger

Don't follow blindly! SPEED-Bench tests whether Speculative Decoding in vLLM is worth it

4 views | 2 months ago

YouTube | AI 决策内参

NeMo RL: Speculative Decoding speeds up 8B rollout to 1.8×; 235B estimated to reach 2.5×

5 views | 2 weeks ago

How ChatGPT Serves 100M Users in Real Time ⚡ (LLM Inference, Explained)

4 views | 2 weeks ago

YouTube | Priya Bansal

2026-04-30 | AI inference engineering choices for backend engineers: from batching to workload-specific runtime

YouTube | TodayShip

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding. But current drafters in Speculati

10K views | 1 week ago

x.com | Avi Chawla
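
The draft-then-verify loop described in the post above can be sketched in a few lines. This is a minimal toy illustration under stated assumptions, not how vLLM or any particular library implements it: `toy_target`, `toy_draft`, `speculative_step`, `speculative_decode`, and `greedy_decode` are all hypothetical names, both "models" are deterministic arithmetic functions over token IDs, decoding is greedy, and the verification is written as a loop where a real system would use one batched forward pass. The sketch also appends the target's own token at the first mismatch, which is a common variant of "restart from there".

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its next token.
NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large model (greedy next token)."""
    return (sum(ctx) * 3 + len(ctx)) % 7


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small drafter: usually agrees with
    the target, but is deliberately wrong on every fourth position."""
    guess = toy_target(ctx)
    return (guess + 1) % 7 if len(ctx) % 4 == 0 else guess


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    # 1) The drafter proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target checks every proposed position. A real engine scores
    #    all k positions in a single forward pass; this loop only mimics it.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First wrong token: keep everything before it and substitute
            # the target's own choice, so each step yields at least one token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # All k drafts accepted: the target's pass also yields one extra token.
    accepted.append(target(ctx))
    return accepted


def speculative_decode(prompt: List[int], draft: NextToken, target: NextToken,
                       k: int, max_new_tokens: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prompt) + max_new_tokens]


def greedy_decode(prompt: List[int], target: NextToken,
                  max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(prompt, toy_draft, toy_target, k=4, max_new_tokens=10)
    slow = greedy_decode(prompt, toy_target, max_new_tokens=10)
    print(fast == slow)
```

In this greedy setting every accepted token is exactly what the target would have chosen, so the output matches plain target-only decoding; the drafter only affects how many target passes are needed.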

Lecture 22 - Hacker's Guide to Speculative Decoding in vLLM

1 view | 3 months ago

bilibili | 安得广厦千万间678

Google just made Gemma 4 up to 3x faster. Zero quality loss.
Multi-Token Prediction (MTP) drafters use speculative decoding:
→ A lightweight drafter predicts several tokens at once
→ The main model verifies them all in one pass
→ Same output quality, up to 3x less wait time
Where it matters:
→ 26B MoE and 31B Dense on consumer GPUs
→ E2B and E4B on edge/mobile devices
→ Coding assistants, agents, voice apps
Apache 2.0. Available now on Hugging Face, Kaggle, Ollama, vLLM, SGLang.
Gemma 4 hit 60M downloads

68 views | 2 weeks ago

x.com | Ramesh Dontha 🦉
