Fast Large Language Model Collaborative Decoding via Speculation

Published in ICML, 2025

1. Introduction: The Need for Efficient Collaborative Inference

Collaborative decoding for Large Language Models (LLMs) enhances output quality by combining the token distributions of multiple models at each generation step. However, this approach incurs significant computational overhead: standard collaborative methods require each model to perform a forward pass for every token, leading to a total cost of \(O(nT)\) forward passes for \(n\) models and \(T\) tokens.

We propose Collaborative Decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding by leveraging speculative decoding principles. CoS achieves 1.11×–2.23× speedups across diverse settings without compromising generation quality. Its core innovations are: (1) sampling from a combined distribution of models rather than a single model, and (2) an alternate proposal framework that utilizes bonus tokens efficiently by alternating proposer and verifier roles.

2. Background: Speculative Decoding and Its Extension

Figure 1. (a) Vanilla collaborative decoding, (b) speculative decoding, (c) CoS (Naive-CoS variant).

2.1 Limitations of Vanilla Collaborative Decoding

Standard collaborative decoding computes the token distribution using functions like weighted averaging or contrastive subtraction:

\[ r_i(x) = \sum_{k=1}^n \lambda_k \, p_i^{(k)}(x) \]

This requires \(n\) forward passes per token, causing high latency, as shown in Figure 1(a).
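As a toy illustration, the weighted-average combination above can be computed directly from the models' per-token probability vectors (a minimal NumPy sketch; the function name and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def combined_distribution(dists, weights):
    """Weighted-average ensemble: r_i(x) = sum_k lambda_k * p_i^(k)(x).

    dists:   list of n probability vectors over the vocabulary (each sums to 1)
    weights: list of n mixing weights lambda_k (non-negative, summing to 1)
    """
    r = np.zeros_like(dists[0], dtype=float)
    for lam, p in zip(weights, dists):
        r += lam * np.asarray(p, dtype=float)
    return r

# Two models over a toy 4-token vocabulary, mixed with lambda = (0.6, 0.4)
p1 = np.array([0.7, 0.1, 0.1, 0.1])
p2 = np.array([0.25, 0.25, 0.25, 0.25])
r = combined_distribution([p1, p2], [0.6, 0.4])  # a valid distribution again
```

Because the weights sum to 1, the mixture is itself a valid distribution; the cost issue is that producing each \(p_i^{(k)}\) requires a separate forward pass.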

2.2 Speculative Decoding Revisited

Speculative Decoding (SD) accelerates generation using:

  • A proposal model \( \mathcal{M}_q \) to generate candidate tokens.
  • A verifier model \( \mathcal{M}_p \) to verify them in parallel.

Given \( \gamma \) proposal tokens \( x_{i+1}, \ldots, x_{i+\gamma} \), each token \( x_{i+j} \) is accepted if

\[ u_j \leq \min\left(1, \frac{p_{i+j}(x_{i+j})}{q_{i+j}(x_{i+j})}\right), \quad u_j \sim U(0, 1). \]

Rejected tokens are resampled from the renormalized residual distribution \( \propto \max(p - q, 0) \).
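In code, the accept/reject step for a single proposed token looks roughly like this (a sketch, assuming `p` and `q` are the verifier's and proposer's probability vectors at one position; the helper name and fixed seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative fixed seed

def verify_token(x, p, q):
    """Accept token x with probability min(1, p[x] / q[x]); on rejection,
    resample from the renormalized residual max(p - q, 0)."""
    u = rng.random()
    if u <= min(1.0, p[x] / q[x]):
        return x, True               # token accepted as-is
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()       # renormalize the residual distribution
    return rng.choice(len(p), p=residual), False
```

The residual resampling is what makes speculative decoding lossless: accepted and resampled tokens together are distributed exactly according to the verifier's distribution \(p\).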

3. Collaborative Decoding via Speculation (CoS)

3.1 Naive-CoS: Speculative Decoding with Combined Distribution

Naive-CoS extends SD by verifying tokens using a combined distribution \(r(x)\):

  • For weighted ensemble (WE):
    \[ r(x) = \lambda q(x) + (1-\lambda) p(x) \]
  • For contrastive decoding (CD):
    \[ r(x) = \text{Softmax}(l_p - \mu l_q) \]

Verification then uses \( r(x) \) in place of \( p(x) \):

\[ u_j \leq \min\left(1, \frac{r_{i+j}(x_{i+j})}{q_{i+j}(x_{i+j})}\right) \]

This ensures generated tokens are distributed according to \( r(x) \), forming the foundation of CoS.
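Both combination rules and the modified acceptance test fit in a few lines (a sketch, assuming `p`, `q` are probability vectors and `l_p`, `l_q` are raw logits; the function names are illustrative):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # numerically stabilized
    return z / z.sum()

def r_weighted(p, q, lam):
    """Weighted ensemble (WE): r(x) = lam * q(x) + (1 - lam) * p(x)."""
    return lam * q + (1.0 - lam) * p

def r_contrastive(l_p, l_q, mu):
    """Contrastive decoding (CD): r(x) = softmax(l_p - mu * l_q)."""
    return softmax(l_p - mu * l_q)

def accept_prob(x, r, q):
    """Naive-CoS acceptance: verify against the combined r, not p alone."""
    return min(1.0, r[x] / q[x])
```

Note that the only change from vanilla speculative decoding is swapping \(p\) for \(r\) in the acceptance ratio; the proposal and resampling machinery is unchanged.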

3.2 Alternate Proposal Framework

Figure 2. Alternate Proposal Framework alternating proposer/verifier roles to utilize bonus tokens.

When all proposed tokens are accepted, the verifier emits a bonus token. CoS treats this bonus as the next proposal, enabling the models to alternate proposer/verifier roles:

  1. \( \mathcal{M}_q \) proposes \( \gamma_q \) tokens → verified by \( \mathcal{M}_p \)
  2. If accepted, \( \mathcal{M}_p \) emits a bonus token → verified by \( \mathcal{M}_q \)

This feedback loop improves utilization and throughput.
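The alternating control flow can be sketched abstractly; `propose` and `verify` below are hypothetical callables standing in for the draft and verification passes (they are not from the paper's code):

```python
def alternate_decode(m_q, m_p, propose, verify, max_tokens, gamma=4):
    """Alternate-proposal loop: the bonus token from a fully accepted draft
    becomes the next proposal, and the two models swap roles."""
    out = []
    proposer, verifier = m_q, m_p
    pending = propose(proposer, gamma)               # initial draft
    while len(out) < max_tokens:
        accepted, bonus = verify(verifier, pending)  # parallel verification;
        out.extend(accepted)                         # a rejection still yields
        if bonus is not None:                        # at least one resampled token
            pending = [bonus]                        # bonus -> next proposal
            proposer, verifier = verifier, proposer  # roles swap
        else:
            pending = propose(proposer, gamma)       # draft again after rejection
    return out[:max_tokens]
```

The key design point is that the bonus token is never wasted: instead of being discarded or merely appended, it seeds the next round with the roles reversed.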

3.3 Generalization to \(n\)-Model Collaboration

Figure 3. CoS applied to three-model setting with scoring and bonus token chaining.

In the \(n\)-model CoS:

  • Each model scores proposals from others in parallel.
  • Each scoring pass also emits a bonus token.
  • A token is verified only after it has been scored by all other models.

This extension preserves the theoretical efficiency guarantees.
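For intuition, a toy weighted-ensemble acceptance check in the \(n\)-model case might look like this (illustrative only; `dists` holds each model's scored distribution at the position, and the proposer's own distribution plays the role of \(q\)):

```python
import numpy as np

def n_model_accept_prob(x, dists, proposer_idx, weights):
    """Acceptance probability for token x once every model has scored it:
    r(x) = sum_k lambda_k * p_k(x), verified against the proposer's q(x)."""
    q = dists[proposer_idx]
    r = sum(lam * p for lam, p in zip(weights, dists))
    return min(1.0, r[x] / q[x])
```

When all models agree, \(r = q\) and every proposal is accepted; disagreement lowers the acceptance probability exactly as in the two-model case.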

4. Experimental Results

4.1 Setup

We evaluate CoS on four benchmarks: HumanEval, GSM8K, MMLU, and CNNDM, under both weighted-ensemble and contrastive-decoding settings.

We compare the following decoding strategies:

  • Ensemble Decoding (WE, CD) – standard ensemble with weighted or contrastive decoding.
  • Parallel Ensemble Decoding (WE-P, CD-P) – parallel processing version of standard ensemble methods, where each model computes its distribution independently and in parallel.
  • Speculative Decoding (WE-SD, CD-SD) – acceleration using the smallest model as proposer and the ensemble distribution as verifier.
  • Collaborative Decoding via Speculation (WE-CoS, CD-CoS) – our proposed method using alternating roles and speculative ensemble sampling.

Model pairs:

| Type | Name | \( \mathcal{M}_q \) | \( \mathcal{M}_p \) |
|---|---|---|---|
| Weighted Ensemble (WE) | Llama-Vicuna | Llama-2-7B | Vicuna-7B-V1.5 |
| | Qwen-3b | Qwen2.5-3B-Instruct | Qwen2.5-Coder-3B-Instruct |
| | Qwen-1.5b | Qwen2.5-1.5B-Instruct | Qwen2.5-Coder-1.5B-Instruct |
| Contrastive Decoding (CD) | Llama-3 | Llama-3.2-1B | Llama-3.1-8B-Instruct |
| | Llama-2 | Llama-68M | Llama-2-7B |
| | OPT | OPT-125M | OPT-13B |

4.2 Main Results

Weighted Ensemble (WE) Performance

| Model | Method | HumanEval | GSM8K | MMLU | CNNDM |
|---|---|---|---|---|---|
| Llama-Vicuna | WE | 1.00x | 1.00x | 1.00x | 1.00x |
| | WE-P | 0.69x | 0.73x | 0.70x | 0.75x |
| | SD | 1.27x | 1.21x | 1.19x | 1.15x |
| | CoS | 1.58x | 1.52x | 1.41x | 1.46x |
| Qwen-3b | WE | 1.00x | 1.00x | 1.00x | 1.00x |
| | WE-P | 0.74x | 0.79x | 0.79x | 0.77x |
| | SD | 1.13x | 1.06x | 1.09x | 1.08x |
| | CoS | 1.62x | 1.52x | 1.42x | 1.38x |
| Qwen-1.5b | WE | 1.00x | 1.00x | 1.00x | 1.00x |
| | WE-P | 0.63x | 0.62x | 0.64x | 0.63x |
| | SD | 1.11x | 1.13x | 1.08x | 1.10x |
| | CoS | 1.56x | 1.46x | 1.34x | 1.35x |
| Qwen-1.5b (3 Model) | WE | 1.00x | 1.00x | 1.00x | 1.00x |
| | WE-P | 0.54x | 0.73x | 0.80x | 0.82x |
| | SD | 0.96x | 0.92x | 0.98x | 0.95x |
| | CoS | 1.85x | 1.53x | 1.38x | 1.27x |

Contrastive Decoding (CD) Performance

| Model | T | Method | HumanEval | GSM8K | MMLU | CNNDM |
|---|---|---|---|---|---|---|
| Llama-3 | 0 | CD | 1.00x | 1.00x | 1.00x | 1.00x |
| | | CD-P | 0.41x | 0.40x | 0.41x | 0.41x |
| | | SD | 2.04x | 1.81x | 1.52x | 1.58x |
| | | CoS | 2.23x | 2.00x | 1.77x | 1.61x |
| | 1 | CD | 1.00x | 1.00x | 1.00x | 1.00x |
| | | CD-P | 0.39x | 0.41x | 0.42x | 0.41x |
| | | SD | 1.55x | 1.21x | 1.20x | 1.07x |
| | | CoS | 1.65x | 1.44x | 1.31x | 1.18x |
| Llama-2 | 0 | CD | 1.00x | 1.00x | 1.00x | 1.00x |
| | | CD-P | 0.59x | 0.50x | 0.54x | 0.48x |
| | | SD | 1.15x | 1.62x | 1.08x | 0.93x |
| | | CoS | 1.26x | 1.65x | 1.68x | 1.30x |
| | 1 | CD | 1.00x | 1.00x | 1.00x | 1.00x |
| | | CD-P | 0.56x | 0.51x | 0.53x | 0.49x |
| | | SD | 0.94x | 1.16x | 1.23x | 1.10x |
| | | CoS | 1.15x | 1.20x | 1.37x | 1.11x |

Recommended citation: Fu J, Jiang Y, Chen J, et al. Fast Large Language Model Collaborative Decoding via Speculation[J]. arXiv preprint arXiv:2502.01662, 2025.