Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding without compromising performance. Inspired by Speculative Decoding, in which a small proposal model generates tokens sequentially and a larger target model verifies them in parallel, our approach builds on two key insights: (1) the verification distribution can be the combined distribution of the proposal and target models, and (2) alternating each model as proposer and verifier can further enhance efficiency.
We generalize this method to collaboration among n models and theoretically prove that CoS is never slower than standard collaborative decoding and is typically faster. Extensive experiments demonstrate that CoS is 1.11x-2.23x faster than standard collaborative decoding without compromising generation quality.
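As background, standard speculative decoding verifies each drafted token with a rejection-sampling rule; under insight (1), CoS keeps this rule but uses the combined distribution of the proposal and target models as the verification distribution. A worked statement of the rule, where Combine is a placeholder for any combination function (e.g., weighted ensemble or contrastive decoding):

```latex
% Acceptance rule for a drafted token x sampled from the proposal distribution p.
% On rejection, a replacement token x' is resampled from the normalized residual.
P(\text{accept } x) = \min\!\left(1, \frac{q(x)}{p(x)}\right),
\qquad
q = \mathrm{Combine}(p_{\text{proposal}},\, p_{\text{target}}),
\qquad
x' \sim \mathrm{norm}\!\big(\max(q - p,\, 0)\big).
```

Because accepted tokens are distributed exactly according to q, verifying against the combined distribution preserves the output distribution of standard collaborative decoding.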
We propose Collaborative Decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding (e.g., contrastive decoding or weighted ensemble) by leveraging speculative decoding principles. Its core innovations are:
(1) Speculative decoding allows sampling not only from the target model's distribution, but also from any combined distribution of the proposal and target models.
(2) In standard speculative decoding, the proposer and verifier roles are fixed: one model always acts as the proposer and the other as the verifier, which is suboptimal in the collaborative decoding setting.
(3) We observe that alternating the two models between the proposer and verifier roles can further speed up the collaboration process, as sketched below.
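The following is a minimal sketch of these two ideas for a pair of models, assuming each model is a callable that returns a next-token probability distribution for a given token sequence; the helper names, the fixed draft length, and the per-position verifier calls are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def combine(p_probs, q_probs, w=0.5):
    """Weighted-ensemble combination at the distribution level.
    With w = 0.5 the combination is symmetric, so alternating which model
    proposes leaves the combined (verification) distribution unchanged."""
    return w * p_probs + (1.0 - w) * q_probs

def cos_step(proposer, verifier, prefix, gamma=4, w=0.5):
    """One CoS round: the proposer drafts `gamma` tokens sequentially, and the
    draft is verified against the *combined* distribution (insight 1).
    In practice the verifier scores all draft positions in one parallel
    forward pass; it is called per position here only for clarity."""
    draft, draft_probs, ctx = [], [], list(prefix)
    for _ in range(gamma):
        p = proposer(ctx)                              # proposal distribution
        tok = int(rng.choice(len(p), p=p))
        draft.append(tok)
        draft_probs.append(p)
        ctx.append(tok)

    accepted = []
    for i, tok in enumerate(draft):
        p = draft_probs[i]
        q = combine(p, verifier(list(prefix) + draft[:i]), w)   # verification distribution
        if rng.random() < min(1.0, q[tok] / p[tok]):             # acceptance rule
            accepted.append(tok)
        else:
            residual = np.maximum(q - p, 0.0)                    # resample on rejection
            accepted.append(int(rng.choice(len(q), p=residual / residual.sum())))
            break
    # (The extra "bonus" token normally drawn when every draft token is accepted
    # is omitted for brevity.)
    return accepted

def cos_generate(model_a, model_b, prefix, max_new_tokens=64):
    """Alternate the proposer and verifier roles between rounds (insight 2)."""
    out = list(prefix)
    roles = [(model_a, model_b), (model_b, model_a)]
    r = 0
    while len(out) - len(prefix) < max_new_tokens:
        proposer, verifier = roles[r % 2]
        out += cos_step(proposer, verifier, out)
        r += 1
    return out

# Toy usage with two "models" returning fixed distributions over a 4-token vocabulary.
model_a = lambda ctx: np.array([0.7, 0.1, 0.1, 0.1])
model_b = lambda ctx: np.array([0.4, 0.3, 0.2, 0.1])
print(cos_generate(model_a, model_b, prefix=[0], max_new_tokens=8))
```

Swapping `combine` for a logits-level contrastive combination yields the CD variant of the same loop.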
Dataset and evaluation. We evaluate CoS across multiple tasks, including code generation, mathematical reasoning, multi-task understanding, and text summarization, on HumanEval, GSM8K, MMLU, and CNNDM, respectively. We measure each method's speed as the average number of tokens generated per second and compute the speedup ratio relative to standard collaborative decoding (as sketched below). All experiments are conducted on an RTX 3090 GPU, except for evaluations involving the Llama-Vicuna model pair, which use an A6000 GPU.
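A minimal sketch of that throughput and speedup computation; `generate_fn` and the prompt loop are illustrative placeholders rather than the paper's evaluation harness.

```python
import time

def tokens_per_second(generate_fn, prompts):
    """Average decoding throughput: generated tokens per wall-clock second."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate_fn(prompt))   # tokens generated for this prompt
    return total_tokens / (time.perf_counter() - start)

# Speedup ratio of a method relative to standard collaborative decoding:
# speedup = tokens_per_second(cos_decode, prompts) / tokens_per_second(standard_decode, prompts)
```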
Combination functions and methods. We experiment with two combination functions: weighted ensemble (WE) at the distribution level and contrastive decoding (CD) at the logits level. For each combination function, four methods are compared: (1) standard collaborative decoding (WE, CD); (2) parallel collaborative decoding (WE-P, CD-P); (3) an accelerated variant using speculative decoding (SD), with the smallest model as the proposer and the combined distribution as the target (WE-SD, CD-SD); and (4) CoS (WE-CoS, CD-CoS). A sketch of the two combination functions follows.
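For concreteness, a minimal sketch of the two combination functions; the weight `w`, the contrastive strength `beta`, and the omission of the usual adaptive plausibility constraint are simplifying assumptions rather than the paper's exact formulations.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def weighted_ensemble(probs_large, probs_small, w=0.5):
    """WE: combine at the distribution level via a weighted average."""
    return w * probs_large + (1.0 - w) * probs_small

def contrastive_decoding(logits_large, logits_small, beta=0.5):
    """CD: combine at the logits level by down-weighting tokens the smaller
    model also finds likely (plausibility constraint omitted for brevity)."""
    return softmax(logits_large - beta * logits_small)
```

Either combined distribution can serve directly as the verification distribution q in the acceptance rule above.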
Model pair configuration. We experiment with different types and pairs of LLMs, as shown below.
From these tables, we observe the following findings:
@inproceedings{fu2025speculative,
title={Fast Large Language Model Collaborative Decoding via Speculation},
author={Fu, Jiale and Jiang, Yuchu and Chen, Junkai and Fan, Jiaming and Geng, Xin and Yang, Xu},
booktitle={Forty-second International Conference on Machine Learning},
year={2025}
}