Block Diffusion for Flash Speculative Decoding

A speculative decoding framework that replaces sequential autoregressive drafters with a block diffusion model, generating all draft tokens in parallel for dramatically faster inference.

How It Works

Autoregressive Draft

Tokens are drafted one at a time, sequentially. Each token depends on the previous, creating a bottleneck. Draft budget: 16 tokens.

Block Diffusion Draft

An entire block of 16 tokens is generated in a single parallel forward pass using diffusion denoising. Then verified all at once.

Choose a Scenario

Speed 1.0x

Lower the speed to observe the cycle: draft → verify → accept / reject

Autoregressive Draft

Sequential | 16 tokens

Block Diffusion Draft

Parallel | 16 tokens