A speculative decoding framework that replaces the sequential autoregressive drafter with a block diffusion model, which generates all draft tokens in parallel to speed up inference.
How It Works
Autoregressive Draft
Tokens are drafted one at a time. Each token depends on the one before it, so the drafting steps cannot run in parallel and become a bottleneck. Draft budget: 16 tokens.
Block Diffusion Draft
An entire block of 16 tokens is generated in a single parallel forward pass using diffusion denoising. The target model then verifies the whole block at once.
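The difference between the two drafting styles can be sketched by counting drafter calls. This is a toy illustration, not the framework's implementation: `toy_next_token` and `toy_fill_block` are hypothetical stand-ins for real drafter models, and the only figures taken from the text are the 16-token budget and the single forward pass for the block drafter.

```python
from typing import Callable, List, Tuple

BLOCK_SIZE = 16  # draft budget stated in the text


def autoregressive_draft(
    next_token: Callable[[List[int]], int],
    prefix: List[int],
    budget: int = BLOCK_SIZE,
) -> Tuple[List[int], int]:
    """Draft `budget` tokens one at a time; return (tokens, drafter_calls)."""
    context = list(prefix)
    calls = 0
    for _ in range(budget):
        # Each call needs the token produced by the previous call.
        context.append(next_token(context))
        calls += 1
    return context[len(prefix):], calls


def block_draft(
    fill_block: Callable[[List[int], int], List[int]],
    prefix: List[int],
    budget: int = BLOCK_SIZE,
) -> Tuple[List[int], int]:
    """Draft `budget` tokens with one call that fills every position at once."""
    tokens = fill_block(list(prefix), budget)
    return tokens, 1


# Hypothetical toy drafters: both simply count upward from the last token.
def toy_next_token(context: List[int]) -> int:
    return (context[-1] + 1) % 100


def toy_fill_block(prefix: List[int], n: int) -> List[int]:
    return [(prefix[-1] + 1 + i) % 100 for i in range(n)]


ar_tokens, ar_calls = autoregressive_draft(toy_next_token, [7])
blk_tokens, blk_calls = block_draft(toy_fill_block, [7])
print(f"autoregressive drafter calls: {ar_calls}")
print(f"block drafter calls: {blk_calls}")
```

In this toy setup both drafters propose the same 16 tokens, but the autoregressive one needs 16 dependent calls while the block drafter needs one. A real diffusion drafter may differ in per-call cost, which this sketch does not model.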
Each round follows the same cycle: draft → verify → accept / reject
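A minimal sketch of one draft → verify → accept / reject round is shown below. The page does not state the acceptance rule, so this assumes greedy verification: a drafted token is accepted only while it matches the token the target model would have picked at that position, and the target's own token replaces the first mismatch. The function name `verify_greedy` and the token values are hypothetical. Setups that sample instead of decoding greedily typically use a rejection-sampling rule, which is not shown here.

```python
from typing import List, Tuple


def verify_greedy(draft: List[int], target_choice: List[int]) -> Tuple[List[int], int]:
    """Accept the longest draft prefix that matches the target's greedy tokens.

    draft:         k drafted tokens.
    target_choice: k + 1 tokens from one target forward pass, where
                   target_choice[i] is the target's greedy token given the
                   prefix followed by draft[:i].
    Returns (tokens to append to the output, number of drafted tokens accepted).
    """
    if len(target_choice) != len(draft) + 1:
        raise ValueError("target_choice must have one more entry than draft")

    accepted = 0
    for drafted, wanted in zip(draft, target_choice):
        if drafted != wanted:
            break  # reject this token and everything after it
        accepted += 1

    # The target's token at the first mismatch (or the extra token after a
    # fully accepted draft) is always kept, so every round makes progress.
    return draft[:accepted] + [target_choice[accepted]], accepted


# Partial acceptance: the third drafted token disagrees with the target.
partial_out, partial_n = verify_greedy([5, 6, 7, 8], [5, 6, 9, 1, 2])

# Full acceptance: every drafted token matches, plus one extra target token.
full_out, full_n = verify_greedy([5, 6], [5, 6, 7])
```

Under this rule the output is identical to what the target model would produce on its own with greedy decoding; the draft only changes how many tokens are committed per target forward pass.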