Adaptive Draft Length for Speculative Decoding: Self-Calibrating Adaptive Length Drafts for Faster Language Model Inference
Large language models (LLMs) achieve state-of-the-art performance across diverse tasks but face latency challenges in real-time applications due to their autoregressive decoding. Speculative decoding accelerates inference by generating multiple tokens per forward pass: a smaller draft model proposes tokens that the target model verifies in parallel, improving throughput by 2-5x. However, existing methods fix the draft length a priori, which is suboptimal because different inputs require different draft lengths to balance acceptance rate and speed. This study proposes adaptive draft-length mechanisms for speculative decoding that dynamically adjust the number of draft tokens based on input characteristics. We implement self-calibrating methods that monitor draft acceptance rates and adjust the draft length in real time without retraining. Our approach uses lightweight heuristics: (1) acceptance-rate-based adjustment, (2) input-length-aware adjustment, and (3) entropy-based confidence scoring for draft-length selection. Experiments on LLaMA-7B and CodeLLaMA-7B show that adaptive draft length improves token throughput by 15-25% over fixed draft length across diverse benchmarks (MMLU, HellaSwag, HumanEval). In particular, for long-context inputs (>2000 tokens), adaptive methods achieve 1.3-1.8x throughput improvement while maintaining <1% accuracy loss relative to baseline outputs. Our technique requires no additional model training, works with any existing draft model, and is compatible with other speculative decoding variants such as Jacobi decoding. Analyzing the draft-length distribution across inputs, we find that optimal draft lengths vary significantly: short inputs benefit from longer drafts (8-12 tokens), while long contexts prefer shorter drafts (3-5 tokens). Our self-calibration mechanism learns these patterns within 100 inference steps, enabling immediate deployment without offline profiling. The framework generalizes across model sizes and draft-model architectures.
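The acceptance-rate-based heuristic above can be sketched as a small controller that tracks a moving average of the per-step draft acceptance rate and nudges the draft length up or down. This is a minimal illustrative sketch, not the paper's exact method; the class name, the EWMA smoothing, and all thresholds and bounds are assumptions chosen for clarity (the bounds mirror the 3-12 token range reported in the abstract).

```python
class DraftLengthController:
    """Self-calibrating draft-length heuristic (illustrative sketch).

    Tracks an exponential moving average (EWMA) of the per-step draft
    acceptance rate; grows the draft when most draft tokens are being
    accepted, shrinks it when they are being rejected. All thresholds
    are assumptions for illustration, not the paper's exact values.
    """

    def __init__(self, init_len=5, min_len=3, max_len=12,
                 ewma_alpha=0.1, raise_at=0.8, lower_at=0.4):
        self.draft_len = init_len
        self.min_len = min_len
        self.max_len = max_len
        self.alpha = ewma_alpha
        self.raise_at = raise_at    # grow drafts when acceptance is high
        self.lower_at = lower_at    # shrink drafts when acceptance is low
        self.accept_ewma = 0.5      # neutral prior before calibration

    def update(self, num_accepted, num_drafted):
        """Record one speculative step; return the next draft length."""
        rate = num_accepted / max(num_drafted, 1)
        self.accept_ewma = (1 - self.alpha) * self.accept_ewma + self.alpha * rate
        if self.accept_ewma > self.raise_at:
            self.draft_len = min(self.draft_len + 1, self.max_len)
        elif self.accept_ewma < self.lower_at:
            self.draft_len = max(self.draft_len - 1, self.min_len)
        return self.draft_len
```

Because the controller only consumes accept/reject counts already produced by the verification step, it adds no extra forward passes, which is consistent with the abstract's claim of no additional computational overhead.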
This work demonstrates that adaptive inference strategies can provide substantial speedups for speculative decoding without additional computational overhead or model modifications.
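The entropy-based confidence heuristic mentioned in the abstract could be realized along the following lines: map the draft model's next-token entropy to a draft length, drafting longer when the distribution is peaked (the draft is confident) and shorter when it is flat. This is a hedged sketch under assumed thresholds; the function name and the linear mapping are illustrative, not taken from the paper.

```python
import math

def entropy_draft_len(probs, min_len=3, max_len=12):
    """Illustrative entropy-based draft-length selection.

    Low entropy (confident draft model) -> longer drafts;
    high entropy (uncertain draft model) -> shorter drafts.
    The linear confidence-to-length mapping is an assumption.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0)  # Shannon entropy
    h_max = math.log(len(probs))        # max entropy for this vocab size
    confidence = 1.0 - h / h_max if h_max > 0 else 1.0
    return min_len + round(confidence * (max_len - min_len))
```

A uniform distribution yields the minimum draft length, while a near-one-hot distribution yields the maximum; in practice the entropy would be computed over the draft model's logits at the current position.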


