Block Diffusion for Flash Speculative Decoding

A speculative decoding framework that replaces sequential autoregressive drafters with a block diffusion model, generating all draft tokens in parallel for dramatically faster inference.
How It Works

Autoregressive Draft

Tokens are drafted one at a time, sequentially. Each token depends on the previous, creating a bottleneck. Draft budget: 16 tokens.

Block Diffusion Draft

An entire block of 16 tokens is generated in a single parallel forward pass using diffusion denoising. Then verified all at once.
Choose a Scenario
1.0x
Lower the speed to observe the cycle: draftverifyaccept / reject
Autoregressive Draft
Sequential | 16 tokens
Block Diffusion Draft
Parallel | 16 tokens