Fast Large Language Model Collaborative Decoding via Speculation

Published in ICML, 2025

1. Introduction: The Need for Efficient Collaborative Inference

Collaborative decoding for Large Language Models (LLMs) enhances output quality by combining the token distributions of multiple models at each generation step. However, this approach incurs significant computational overhead: standard collaborative methods require each model to perform a forward pass for every token, leading to a total cost of \(O(nT)\) forward passes for \(n\) models and \(T\) tokens.

We propose Collaborative Decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding by leveraging speculative decoding principles. CoS achieves 1.11×–2.23× speedups across diverse settings without compromising generation quality. Its core innovations are: (1) sampling from a combined distribution of models rather than a single model, and (2) an alternate proposal framework that utilizes bonus tokens efficiently by alternating proposer and verifier roles.

2. Background: Speculative Decoding and Its Extension

Figure 1. (a) Vanilla collaborative decoding, (b) speculative decoding, (c) CoS (Naive-CoS variant).

2.1 Limitations of Vanilla Collaborative Decoding

Standard collaborative decoding computes the token distribution using functions like weighted averaging or contrastive subtraction:

\[ r_i(x) = \sum_{k=1}^n \lambda_k \, p_i^{(k)}(x) \]

This requires \(n\) forward passes per token, causing high latency, as shown in Figure 1(a).
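As a toy illustration, the weighted-average combination above can be computed directly from the models' per-token probability vectors (a minimal NumPy sketch; the function name and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def combined_distribution(dists, weights):
    """Weighted-average ensemble: r_i(x) = sum_k lambda_k * p_i^(k)(x).

    dists:   list of n probability vectors over the vocabulary (each sums to 1)
    weights: list of n mixing weights lambda_k (non-negative, summing to 1)
    """
    r = np.zeros_like(dists[0], dtype=float)
    for lam, p in zip(weights, dists):
        r += lam * np.asarray(p, dtype=float)
    return r

# Two models over a toy 4-token vocabulary, mixed with lambda = (0.6, 0.4)
p1 = np.array([0.7, 0.1, 0.1, 0.1])
p2 = np.array([0.25, 0.25, 0.25, 0.25])
r = combined_distribution([p1, p2], [0.6, 0.4])  # a valid distribution again
```

Because the weights sum to 1, the mixture is itself a valid distribution; the cost issue is that producing each \(p_i^{(k)}\) requires a separate forward pass.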

2.2 Speculative Decoding Revisited

Speculative Decoding (SD) accelerates generation using:

  • A proposal model \( \mathcal{M}_q \) to generate candidate tokens.
  • A verifier model \( \mathcal{M}_p \) to verify them in parallel.

Given \( \gamma \) proposal tokens \( x_{i+1}, \ldots, x_{i+\gamma} \), each token \( x_{i+j} \) is accepted if

\[ u_j \leq \min\left(1, \frac{p_{i+j}(x_{i+j})}{q_{i+j}(x_{i+j})}\right), \quad u_j \sim U(0, 1). \]

Rejected tokens are resampled from the renormalized residual distribution \( \propto \max(p - q, 0) \).
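In code, the accept/reject step for a single proposed token looks roughly like this (a sketch, assuming `p` and `q` are the verifier's and proposer's probability vectors at one position; the helper name and fixed seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative fixed seed

def verify_token(x, p, q):
    """Accept token x with probability min(1, p[x] / q[x]); on rejection,
    resample from the renormalized residual max(p - q, 0)."""
    u = rng.random()
    if u <= min(1.0, p[x] / q[x]):
        return x, True               # token accepted as-is
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()       # renormalize the residual distribution
    return rng.choice(len(p), p=residual), False
```

The residual resampling is what makes speculative decoding lossless: accepted and resampled tokens together are distributed exactly according to the verifier's distribution \(p\).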

3. Collaborative Decoding via Speculation (CoS)

3.1 Naive-CoS: Speculative Decoding with Combined Distribution

Naive-CoS extends SD by verifying tokens using a combined distribution \(r(x)\):

  • For weighted ensemble (WE):
    \[ r(x) = \lambda q(x) + (1-\lambda) p(x) \]
  • For contrastive decoding (CD):
    \[ r(x) = \text{Softmax}(l_p - \mu l_q) \]

Verification then uses \( r(x) \) in place of \( p(x) \):

\[ u_j \leq \min\left(1, \frac{r_{i+j}(x_{i+j})}{q_{i+j}(x_{i+j})}\right) \]

This ensures generated tokens are distributed according to \( r(x) \), forming the foundation of CoS.
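Both combination rules and the modified acceptance test fit in a few lines (a sketch, assuming `p`, `q` are probability vectors and `l_p`, `l_q` are raw logits; the function names are illustrative):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # numerically stabilized
    return z / z.sum()

def r_weighted(p, q, lam):
    """Weighted ensemble (WE): r(x) = lam * q(x) + (1 - lam) * p(x)."""
    return lam * q + (1.0 - lam) * p

def r_contrastive(l_p, l_q, mu):
    """Contrastive decoding (CD): r(x) = softmax(l_p - mu * l_q)."""
    return softmax(l_p - mu * l_q)

def accept_prob(x, r, q):
    """Naive-CoS acceptance: verify against the combined r, not p alone."""
    return min(1.0, r[x] / q[x])
```

Note that the only change from vanilla speculative decoding is swapping \(p\) for \(r\) in the acceptance ratio; the proposal and resampling machinery is unchanged.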

3.2 Alternate Proposal Framework

Figure 2. Alternate Proposal Framework alternating proposer/verifier roles to utilize bonus tokens.

When all proposed tokens are accepted, the verifier emits a bonus token. CoS treats this bonus as the next proposal, enabling the models to alternate proposer/verifier roles:

  1. \( \mathcal{M}_q \) proposes \( \gamma_q \) tokens → verified by \( \mathcal{M}_p \)
  2. If accepted, \( \mathcal{M}_p \) emits a bonus token → verified by \( \mathcal{M}_q \)

This feedback loop improves utilization and throughput.
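The alternating control flow can be sketched abstractly; `propose` and `verify` below are hypothetical callables standing in for the draft and verification passes (they are not from the paper's code):

```python
def alternate_decode(m_q, m_p, propose, verify, max_tokens, gamma=4):
    """Alternate-proposal loop: the bonus token from a fully accepted draft
    becomes the next proposal, and the two models swap roles."""
    out = []
    proposer, verifier = m_q, m_p
    pending = propose(proposer, gamma)               # initial draft
    while len(out) < max_tokens:
        accepted, bonus = verify(verifier, pending)  # parallel verification;
        out.extend(accepted)                         # a rejection still yields
        if bonus is not None:                        # at least one resampled token
            pending = [bonus]                        # bonus -> next proposal
            proposer, verifier = verifier, proposer  # roles swap
        else:
            pending = propose(proposer, gamma)       # draft again after rejection
    return out[:max_tokens]
```

The key design point is that the bonus token is never wasted: instead of being discarded or merely appended, it seeds the next round with the roles reversed.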

3.3 Generalization to \(n\)-Model Collaboration

Figure 3. CoS applied to three-model setting with scoring and bonus token chaining.

In the \(n\)-model CoS:

  • Each model scores proposals from others in parallel.
  • Each scoring pass also emits a bonus token.
  • A token is verified only after it has been scored by all other models.

This extension preserves the theoretical efficiency guarantees.
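For intuition, a toy weighted-ensemble acceptance check in the \(n\)-model case might look like this (illustrative only; `dists` holds each model's scored distribution at the position, and the proposer's own distribution plays the role of \(q\)):

```python
import numpy as np

def n_model_accept_prob(x, dists, proposer_idx, weights):
    """Acceptance probability for token x once every model has scored it:
    r(x) = sum_k lambda_k * p_k(x), verified against the proposer's q(x)."""
    q = dists[proposer_idx]
    r = sum(lam * p for lam, p in zip(weights, dists))
    return min(1.0, r[x] / q[x])
```

When all models agree, \(r = q\) and every proposal is accepted; disagreement lowers the acceptance probability exactly as in the two-model case.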

4. Experimental Results

4.1 Setup

We evaluate CoS on four benchmarks: HumanEval, GSM8K, MMLU, and CNNDM, under both weighted-ensemble and contrastive-decoding settings.

We compare the following decoding strategies:

  • Ensemble Decoding (WE, CD) – standard ensemble with weighted or contrastive decoding.
  • Parallel Ensemble Decoding (WE-P, CD-P) – parallel processing version of standard ensemble methods, where each model computes its distribution independently and in parallel.
  • Speculative Decoding (WE-SD, CD-SD) – acceleration using the smallest model as proposer and the ensemble distribution as verifier.
  • Collaborative Decoding via Speculation (WE-CoS, CD-CoS) – our proposed method using alternating roles and speculative ensemble sampling.

Model pairs:

| Type | Name | \( \mathcal{M}_q \) | \( \mathcal{M}_p \) |
|---|---|---|---|
| Weighted Ensemble (WE) | Llama-Vicuna | Llama-2-7B | Vicuna-7B-V1.5 |
| | Qwen-3b | Qwen2.5-3B-Instruct | Qwen2.5-Coder-3B-Instruct |
| | Qwen-1.5b | Qwen2.5-1.5B-Instruct | Qwen2.5-Coder-1.5B-Instruct |
| Contrastive Decoding (CD) | Llama-3 | Llama-3.2-1B | Llama-3.1-8B-Instruct |
| | Llama-2 | Llama-68M | Llama-2-7B |
| | OPT | OPT-125M | OPT-13B |

4.2 Main Results

Weighted Ensemble (WE) Performance

| Model | Method | HumanEval | GSM8K | MMLU | CNNDM |
|---|---|---|---|---|---|
| Llama-Vicuna | WE | 1.00x | 1.00x | 1.00x | 1.00x |
| | WE-P | 0.69x | 0.73x | 0.70x | 0.75x |
| | SD | 1.27x | 1.21x | 1.19x | 1.15x |
| | CoS | 1.58x | 1.52x | 1.41x | 1.46x |
| Qwen-3b | WE | 1.00x | 1.00x | 1.00x | 1.00x |
| | WE-P | 0.74x | 0.79x | 0.79x | 0.77x |
| | SD | 1.13x | 1.06x | 1.09x | 1.08x |
| | CoS | 1.62x | 1.52x | 1.42x | 1.38x |
| Qwen-1.5b | WE | 1.00x | 1.00x | 1.00x | 1.00x |
| | WE-P | 0.63x | 0.62x | 0.64x | 0.63x |
| | SD | 1.11x | 1.13x | 1.08x | 1.10x |
| | CoS | 1.56x | 1.46x | 1.34x | 1.35x |
| Qwen-1.5b (3 Model) | WE | 1.00x | 1.00x | 1.00x | 1.00x |
| | WE-P | 0.54x | 0.73x | 0.80x | 0.82x |
| | SD | 0.96x | 0.92x | 0.98x | 0.95x |
| | CoS | 1.85x | 1.53x | 1.38x | 1.27x |

Contrastive Decoding (CD) Performance

| Model | T | Method | HumanEval | GSM8K | MMLU | CNNDM |
|---|---|---|---|---|---|---|
| Llama-3 | 0 | CD | 1.00x | 1.00x | 1.00x | 1.00x |
| | | CD-P | 0.41x | 0.40x | 0.41x | 0.41x |
| | | SD | 2.04x | 1.81x | 1.52x | 1.58x |
| | | CoS | 2.23x | 2.00x | 1.77x | 1.61x |
| | 1 | CD | 1.00x | 1.00x | 1.00x | 1.00x |
| | | CD-P | 0.39x | 0.41x | 0.42x | 0.41x |
| | | SD | 1.55x | 1.21x | 1.20x | 1.07x |
| | | CoS | 1.65x | 1.44x | 1.31x | 1.18x |
| Llama-2 | 0 | CD | 1.00x | 1.00x | 1.00x | 1.00x |
| | | CD-P | 0.59x | 0.50x | 0.54x | 0.48x |
| | | SD | 1.15x | 1.62x | 1.08x | 0.93x |
| | | CoS | 1.26x | 1.65x | 1.68x | 1.30x |
| | 1 | CD | 1.00x | 1.00x | 1.00x | 1.00x |
| | | CD-P | 0.56x | 0.51x | 0.53x | 0.49x |
| | | SD | 0.94x | 1.16x | 1.23x | 1.10x |
| | | CoS | 1.15x | 1.20x | 1.37x | 1.11x |

Recommended citation: Fu J, Jiang Y, Chen J, et al. Fast Large Language Model Collaborative Decoding via Speculation[J]. arXiv preprint arXiv:2502.01662, 2025.