Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding without compromising performance. Inspired by Speculative Decoding, in which a small proposal model generates tokens sequentially and a larger target model verifies them in parallel, our approach builds on two key insights: (1) the verification distribution can be the combined distribution of the proposal and target models, and (2) alternating each model as proposer and verifier can further enhance efficiency.
We generalize this method to collaboration among n models and theoretically prove that CoS is never slower than standard collaborative decoding and is typically faster. Extensive experiments demonstrate that CoS is 1.11x-2.23x faster than standard collaborative decoding without compromising generation quality.
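As background, standard speculative decoding verifies each drafted token with a rejection-sampling rule; under insight (1), CoS keeps this rule but uses the combined distribution of the proposal and target models as the verification distribution. A worked statement of the rule, where Combine is a placeholder for any combination function (e.g., weighted ensemble or contrastive decoding):

```latex
% Acceptance rule for a drafted token x sampled from the proposal distribution p.
% On rejection, a replacement token x' is resampled from the normalized residual.
P(\text{accept } x) = \min\!\left(1, \frac{q(x)}{p(x)}\right),
\qquad
q = \mathrm{Combine}(p_{\text{proposal}},\, p_{\text{target}}),
\qquad
x' \sim \mathrm{norm}\!\big(\max(q - p,\, 0)\big).
```

Because accepted tokens are distributed exactly according to q, verifying against the combined distribution preserves the output distribution of standard collaborative decoding.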
We propose Collaborative Decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding (e.g., contrastive decoding or weighted ensemble) by leveraging speculative decoding principles. Its core innovations are:
(1) Speculative decoding allows sampling not only from the target model's distribution, but also from any combined distribution of the proposal and target models.
(2) In standard speculative decoding, the proposer and verifier roles are fixed: one model always acts as the proposer and the other as the verifier, which is suboptimal in the collaborative decoding setting.
(3) We observe that alternating the two models between the proposer and verifier roles can further speed up the collaboration process, as sketched below.
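The following is a minimal sketch of these two ideas for a pair of models, assuming each model is a callable that returns a next-token probability distribution for a given token sequence; the helper names, the fixed draft length, and the per-position verifier calls are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def combine(p_probs, q_probs, w=0.5):
    """Weighted-ensemble combination at the distribution level.
    With w = 0.5 the combination is symmetric, so alternating which model
    proposes leaves the combined (verification) distribution unchanged."""
    return w * p_probs + (1.0 - w) * q_probs

def cos_step(proposer, verifier, prefix, gamma=4, w=0.5):
    """One CoS round: the proposer drafts `gamma` tokens sequentially, and the
    draft is verified against the *combined* distribution (insight 1).
    In practice the verifier scores all draft positions in one parallel
    forward pass; it is called per position here only for clarity."""
    draft, draft_probs, ctx = [], [], list(prefix)
    for _ in range(gamma):
        p = proposer(ctx)                              # proposal distribution
        tok = int(rng.choice(len(p), p=p))
        draft.append(tok)
        draft_probs.append(p)
        ctx.append(tok)

    accepted = []
    for i, tok in enumerate(draft):
        p = draft_probs[i]
        q = combine(p, verifier(list(prefix) + draft[:i]), w)   # verification distribution
        if rng.random() < min(1.0, q[tok] / p[tok]):             # acceptance rule
            accepted.append(tok)
        else:
            residual = np.maximum(q - p, 0.0)                    # resample on rejection
            accepted.append(int(rng.choice(len(q), p=residual / residual.sum())))
            break
    # (The extra "bonus" token normally drawn when every draft token is accepted
    # is omitted for brevity.)
    return accepted

def cos_generate(model_a, model_b, prefix, max_new_tokens=64):
    """Alternate the proposer and verifier roles between rounds (insight 2)."""
    out = list(prefix)
    roles = [(model_a, model_b), (model_b, model_a)]
    r = 0
    while len(out) - len(prefix) < max_new_tokens:
        proposer, verifier = roles[r % 2]
        out += cos_step(proposer, verifier, out)
        r += 1
    return out

# Toy usage with two "models" returning fixed distributions over a 4-token vocabulary.
model_a = lambda ctx: np.array([0.7, 0.1, 0.1, 0.1])
model_b = lambda ctx: np.array([0.4, 0.3, 0.2, 0.1])
print(cos_generate(model_a, model_b, prefix=[0], max_new_tokens=8))
```

Swapping `combine` for a logits-level contrastive combination yields the CD variant of the same loop.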
Dataset and evaluation. We evaluate CoS across multiple tasks, including code generation, mathematical reasoning, multi-task understanding, and text summarization, on HumanEval, GSM8K, MMLU, and CNNDM, respectively. We measure each method's speed as the average number of tokens generated per second and compute the speedup ratio relative to standard collaborative decoding (as sketched below). All experiments are conducted on an RTX 3090 GPU, except for evaluations involving the Llama-Vicuna model pair, which use an A6000 GPU.
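A minimal sketch of that throughput and speedup computation; `generate_fn` and the prompt loop are illustrative placeholders rather than the paper's evaluation harness.

```python
import time

def tokens_per_second(generate_fn, prompts):
    """Average decoding throughput: generated tokens per wall-clock second."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate_fn(prompt))   # tokens generated for this prompt
    return total_tokens / (time.perf_counter() - start)

# Speedup ratio of a method relative to standard collaborative decoding:
# speedup = tokens_per_second(cos_decode, prompts) / tokens_per_second(standard_decode, prompts)
```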
Combination functions and methods. We experiment with two combination functions: weighted ensemble (WE) at the distribution level and contrastive decoding (CD) at the logits level. For each combination function, four methods are compared: (1) standard collaborative decoding (WE, CD); (2) parallel collaborative decoding (WE-P, CD-P); (3) an accelerated variant using speculative decoding (SD), with the smallest model as the proposer and the combined distribution as the target (WE-SD, CD-SD); and (4) CoS (WE-CoS, CD-CoS). A sketch of the two combination functions follows.
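For concreteness, a minimal sketch of the two combination functions; the weight `w`, the contrastive strength `beta`, and the omission of the usual adaptive plausibility constraint are simplifying assumptions rather than the paper's exact formulations.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def weighted_ensemble(probs_large, probs_small, w=0.5):
    """WE: combine at the distribution level via a weighted average."""
    return w * probs_large + (1.0 - w) * probs_small

def contrastive_decoding(logits_large, logits_small, beta=0.5):
    """CD: combine at the logits level by down-weighting tokens the smaller
    model also finds likely (plausibility constraint omitted for brevity)."""
    return softmax(logits_large - beta * logits_small)
```

Either combined distribution can serve directly as the verification distribution q in the acceptance rule above.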
Model pair configuration. We experiment with different types and pairs of LLMs, as shown below.
From these tables, we observe the following findings:
@inproceedings{fu2025speculative,
title={Fast Large Language Model Collaborative Decoding via Speculation},
author={Fu, Jiale and Jiang, Yuchu and Chen, Junkai and Fan, Jiaming and Geng, Xin and Yang, Xu},
booktitle={Forty-second International Conference on Machine Learning},
year={2025}
}