Rethinking LLM Ensembling from the Perspective of Mixture Models
1. Introduction
Model ensembling in LLMs combines multiple models’ next-token distributions to improve quality, robustness, and reliability across tasks. The downside is cost: conventional ensembling runs every model at every decoding step, which makes latency scale linearly with the number of models. Ensembling heterogeneous models also requires vocabulary alignment and careful weighting.
We address this with a mixture-model interpretation of ensembling and a Mixture-model-like Ensemble (ME) algorithm. ME samples from the same ensemble distribution while invoking only one model per step: the output distribution is identical to that of conventional ensembling (CE), but the per-token compute approaches that of single-model decoding. Experiments show that ME matches CE accuracy and delivers 1.78x-2.68x speedups.
2. Mixture-model-like Ensemble
2.1 Conventional LLM Ensemble
Unlike standard decoding, which samples the next token from a single model, conventional LLM ensembling combines the next-token distributions predicted by multiple models. Let the input prefix be \(x_{\le t}\) and let \(\mathcal{M}_i\) denote the \(i\)-th model with weight \(\lambda_i\), where \(\lambda_i \ge 0\) and \(\sum_i \lambda_i = 1\). At every decoding step, each model performs a forward pass and produces its own distribution over the vocabulary. These distributions are then aggregated into an ensemble distribution:
\[ P(x_{t+1} = y \mid x_{\le t}) = \sum_{i=1}^{n} \lambda_i \, \mathcal{M}_i(y \mid x_{\le t}). \]
The next token is sampled from this weighted distribution, appended to the prefix, and the same procedure repeats until generation ends. This direct implementation is simple and often improves quality over a single model, but it requires \(n\) model evaluations for every generated token. Even if the models are evaluated in parallel, the method still needs to load and execute every model at each step, so inference cost grows with the number of ensemble members.
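For concreteness, here is a minimal sketch of one CE decoding step. It assumes all models share a vocabulary and expose a HuggingFace-style call returning `.logits`; the `models` list and function name are illustrative, not code from the paper:

```python
import torch

def ce_step(models, weights, input_ids):
    """One CE decoding step: every model runs, distributions are averaged."""
    probs = None
    for model, lam in zip(models, weights):
        logits = model(input_ids).logits[:, -1, :]   # next-token logits
        p = torch.softmax(logits, dim=-1)            # M_i(y | x_<=t)
        probs = lam * p if probs is None else probs + lam * p
    # Sample the next token from the weighted ensemble distribution.
    return torch.multinomial(probs, num_samples=1)
```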
2.2 Mixture-model-like Ensemble
ME treats the ensemble distribution as a mixture model and moves the randomness one step earlier. Instead of explicitly computing every model’s distribution and averaging them, ME samples in two stages: (1) draw a model index \(m \sim \mathrm{Cat}(\lambda)\) according to the ensemble weights; (2) run only \(\mathcal{M}_m\) and sample the next token from \(\mathcal{M}_m(y \mid x_{\le t})\).
This produces the same marginal next-token distribution as conventional ensembling:
\[ P(x_{t+1}=y \mid x_{\le t}) = \sum_{i=1}^{n} P(\mathcal{M}_i \text{ is selected}) \, P(y \text{ is generated by } \mathcal{M}_i) = \sum_{i=1}^{n} \lambda_i \, \mathcal{M}_i(y \mid x_{\le t}). \]
Thus, ME preserves the ensemble sampling distribution while evaluating only one model per token. The practical implication is that generation quality should match conventional ensembling under the same weights, while decoding cost becomes much closer to single-model inference.
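The two-stage procedure is only a few lines; a sketch under the same assumptions as the CE snippet above:

```python
import torch

def me_step(models, weights, input_ids):
    """One ME decoding step: only the sampled model runs a forward pass."""
    # Stage 1: draw a model index m ~ Cat(lambda).
    m = torch.multinomial(torch.tensor(weights), num_samples=1).item()
    # Stage 2: sample the next token from M_m alone.
    logits = models[m](input_ids).logits[:, -1, :]
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

Marginalizing over the random index \(m\) recovers exactly the distribution computed by `ce_step`, which is the equivalence stated above.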
2.3 Lazy Key-Value Cache Synchronization
Sampling only one model per step introduces a KV-cache issue. In autoregressive decoding, each model normally caches the keys and values of previous tokens. If ME selects model \(\mathcal{M}_i\) at one step and later switches to \(\mathcal{M}_j\), then \(\mathcal{M}_j\) may not have cached the tokens generated while it was inactive. Naively updating every model’s cache at every step would recover correctness, but it would also bring back the overhead that ME is designed to avoid.
ME handles this with lazy synchronization. Each model maintains its own cache asynchronously and is synchronized only when it is selected again. If \(\mathcal{M}_i\) was last used at step \(t-k\) and is selected at step \(t\), the missing tokens \(x_{t-k+1:t}\) are first processed in a single forward-extend (prefill) operation to update its cache; only then does \(\mathcal{M}_i\) generate \(x_{t+1}\). Because LLM decoding is often memory-bandwidth bound, processing a short accumulated span in one prefill pass has latency comparable to a single decoding step, so the synchronization overhead is small in practice.
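A sketch of the full decoding loop with lazy synchronization, assuming a HuggingFace-style `past_key_values` cache interface; the bookkeeping (`caches`, `seen`) is our illustration, not the paper's implementation:

```python
import torch

def me_generate(models, weights, input_ids, max_new_tokens):
    """ME decoding with lazy KV-cache synchronization."""
    caches = [None] * len(models)
    seen = [0] * len(models)       # tokens each model has already cached
    seq, w = input_ids, torch.tensor(weights)
    for _ in range(max_new_tokens):
        m = torch.multinomial(w, num_samples=1).item()   # stage 1
        # Lazy sync: one prefill pass over the tokens model m missed
        # while it was inactive (including the current position).
        out = models[m](seq[:, seen[m]:], past_key_values=caches[m],
                        use_cache=True)
        caches[m] = out.past_key_values
        seen[m] = seq.shape[1]     # cache now covers the whole prefix
        probs = torch.softmax(out.logits[:, -1, :], dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # stage 2
        seq = torch.cat([seq, next_token], dim=-1)
    return seq
```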
2.4 Ensembling with Heterogeneous Vocabularies
When models use different vocabularies, their probability distributions cannot be averaged directly. Let the models have vocabularies \(V_1,\dots,V_n\) and next-token distributions \(P_1,\dots,P_n\). A vocabulary alignment function \(\mathcal{F}_i\) maps each \(P_i\) from its native vocabulary \(V_i\) to a distribution \(\tilde{P}_i\) over a unified vocabulary \(U\). The ensemble distribution can then be written as:
\[ P(x_{t+1}=y \mid x_{\le t}) = \sum_{i=1}^{n} \lambda_i \, \mathcal{F}_i[\mathcal{M}_i(y \mid x_{\le t})], \quad y \in U. \]
ME integrates this naturally: after sampling model \(\mathcal{M}_i\), it applies \(\mathcal{F}_i\) to the selected model’s distribution and samples the next token from the aligned distribution. In our experiments we use a top-\(k\) alignment strategy inspired by UniTe, but the framework is compatible with other alignment methods.
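A minimal sketch of one possible alignment function \(\mathcal{F}_i\). The string-matching rule and renormalization below are our simplification of UniTe-style top-\(k\) alignment, not its exact procedure; `unified_vocab` (a token-string-to-id map for \(U\)) is an assumed input:

```python
import torch

def align_topk(probs, src_tokenizer, unified_vocab, k=10):
    """Project a model's top-k next-token mass onto a unified vocabulary U.

    Top-k tokens whose strings do not appear in U are dropped,
    and the remaining mass is renormalized.
    """
    aligned = torch.zeros(len(unified_vocab))
    top_p, top_ids = torch.topk(probs, k)
    for p, tid in zip(top_p.tolist(), top_ids.tolist()):
        token = src_tokenizer.convert_ids_to_tokens(tid)
        if token in unified_vocab:
            aligned[unified_vocab[token]] += p
    return aligned / aligned.sum().clamp_min(1e-12)  # renormalize over U
```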
2.5 Connection to Token-level Routing
ME also gives a simple view of the relationship between LLM ensembling and token-level routing. A token-level router chooses one expert model at each generation step; ME can be interpreted as the simplest possible router, where the routing decision is random and follows the fixed ensemble weights. Conversely, conventional ensembling can be viewed as marginalizing over this random router. This connection highlights the tradeoff: ME is training-free and plug-and-play, while learned routers may achieve stronger routing decisions at the cost of additional training.
3. Experiments
3.1 Experimental Setup
We evaluate ME on GSM8K, MMLU, BBH, and ARC. Accuracy results are averaged over five runs, and speed is measured in tokens per second (on an H100 GPU unless stated otherwise). We test similar-model ensembles (Qwen-3B + Qwen-Math-1.5B, and Openchat/Nous-Hermes/OpenHermes) and a heterogeneous ensemble (Openchat + Deepseek-7B + Mistral-7B). We compare ME against CE implemented both sequentially and in parallel, using the same weights and top-\(k\) vocabulary alignment.
3.2 Main Results
ME consistently matches CE accuracy across both similar and heterogeneous ensembles, confirming the distributional equivalence of the two approaches. The tables below report performance; values in parentheses denote the gain or drop relative to the best single model.
Performance comparison on ensembling similar models.
| Model / Setting | GSM8K | MMLU | BBH | ARC |
|---|---|---|---|---|
| Qwen-3B | 79.77 | 66.75 | 51.94 | 81.81 |
| Qwen-Math-1.5B | 79.39 | 39.54 | 39.75 | 46.23 |
| Two-model ensembling: Qwen-3B + Qwen-Math-1.5B | | | | |
| CE (k=5) | 83.14 (+3.37) | 66.05 (-0.70) | 52.74 (+0.80) | 81.14 (-0.67) |
| ME (k=5) | 82.97 (+3.20) | 65.61 (-1.14) | 53.04 (+1.10) | 81.12 (-0.69) |
| CE (k=10) | 82.62 (+2.85) | 66.67 (-0.08) | 52.25 (+0.31) | 81.57 (-0.24) |
| ME (k=10) | 82.83 (+3.06) | 67.90 (+1.15) | 52.51 (+0.57) | 81.10 (-0.71) |
| Openchat | 68.02 | 56.47 | 44.85 | 73.39 |
| Nous-Hermes | 67.11 | 58.37 | 46.72 | 73.02 |
| OpenHermes | 67.59 | 59.84 | 47.13 | 75.25 |
| Two-model ensembling: Openchat + Nous-Hermes | | | | |
| CE (k=5) | 69.34 (+1.32) | 60.60 (+2.23) | 48.12 (+1.40) | 78.84 (+5.45) |
| ME (k=5) | 69.11 (+1.09) | 60.95 (+2.58) | 47.33 (+1.22) | 78.78 (+5.39) |
| CE (k=10) | 68.19 (+0.17) | 60.28 (+1.91) | 47.82 (+1.10) | 78.70 (+5.31) |
| ME (k=10) | 68.74 (+0.72) | 60.63 (+2.26) | 47.25 (+0.53) | 80.06 (+6.67) |
| Three-model ensembling: Openchat + Nous-Hermes + OpenHermes | | | | |
| CE (k=5) | 69.05 (+1.03) | 60.60 (+0.76) | 47.82 (+0.69) | 78.38 (+3.13) |
| ME (k=5) | 69.42 (+1.40) | 59.97 (+0.13) | 48.04 (+0.91) | 78.42 (+3.17) |
| CE (k=10) | 68.47 (+0.45) | 61.19 (+1.35) | 46.87 (-0.26) | 77.34 (+2.09) |
| ME (k=10) | 67.93 (-0.09) | 60.80 (+0.96) | 47.40 (+0.27) | 76.29 (+1.04) |
Performance comparison on ensembling heterogeneous models.
| Model / Setting | GSM8K | MMLU | BBH | ARC |
|---|---|---|---|---|
| Openchat | 68.02 | 56.47 | 44.85 | 73.39 |
| Deepseek-7B | 53.63 | 46.10 | 36.02 | 56.41 |
| Mistral-7B | 46.90 | 56.22 | 41.25 | 68.75 |
| Three heterogeneous model ensembling: Openchat + Deepseek-7B + Mistral-7B | | | | |
| CE (k=5) | 69.20 (+1.18) | 57.98 (+1.51) | 45.61 (+0.76) | 75.00 (+1.61) |
| ME (k=5) | 69.86 (+1.84) | 57.79 (+1.32) | 45.76 (+0.91) | 74.19 (+0.80) |
| CE (k=10) | 68.23 (+0.21) | 58.50 (+2.03) | 44.96 (+0.11) | 78.69 (+5.30) |
| ME (k=10) | 67.81 (-0.21) | 58.46 (+1.99) | 45.38 (+0.53) | 78.58 (+5.19) |