Large language models can feel slow when prompts get long. Each extra token adds work across the attention stack, and because self-attention compares every token with every other, that work grows roughly quadratically with sequence length. Token merging tackles the slowdown by fusing nearby, similar tokens into a single stand-in that carries the same idea.
The trick is to combine vectors without washing out their meaning. SLERP, short for spherical linear interpolation, gives a neat path. It blends two unit vectors along the surface of a sphere, keeping direction sharp and magnitude stable. That is the whole idea in one breath: merge redundant tokens, keep their semantic direction, and shorten the sequence the model must read.
A token becomes a vector once it passes through the embedding layer. After normalization, the vector’s direction hints at meaning, while its length acts like strength. If you average two unit vectors in a straight line, the result lands on the chord inside the sphere: it drifts toward the origin and weakens the signal.
SLERP moves along the arc between the vectors on the unit sphere. The angle comes from the dot product, and a mixing weight selects a point on that arc. You end up with a unit vector that points between the inputs without shrinking. This keeps meaning intact and avoids a dull midpoint that the network might treat as low energy.
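In symbols, with θ = arccos(u·v), the blend is slerp(u, v, t) = sin((1−t)θ)/sin θ · u + sin(tθ)/sin θ · v. Here is a minimal NumPy sketch of that formula; the function name and the small-angle threshold are illustrative choices, not a fixed standard.

```python
import numpy as np

def slerp(u, v, t=0.5, eps=1e-4):
    """Spherical linear interpolation between two token vectors.

    Inputs are cast to float64 and normalized to the unit sphere;
    t=0 returns u's direction, t=1 returns v's, and values between
    walk the shorter arc.
    """
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    dot = np.clip(np.dot(u, v), -1.0, 1.0)
    theta = np.arccos(dot)  # angle between the vectors
    if theta < eps:
        # Nearly parallel: sin(theta) is tiny and the division is
        # noisy, so fall back to a normalized linear average.
        w = (1.0 - t) * u + t * v
        return w / np.linalg.norm(w)
    sin_theta = np.sin(theta)
    return (np.sin((1.0 - t) * theta) * u + np.sin(t * theta) * v) / sin_theta
```

At t = 0.5 the result is the geodesic midpoint: a unit vector pointing evenly between the two inputs, with no loss of magnitude.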
Good merges start with smart selection. Look for neighbors in the sequence that have high cosine similarity, since that signals the same idea said twice or minor phrasing changes. Repeated boilerplate often lights up as tight clusters.
You can set a similarity floor so random pairs do not fuse. Keep rare tokens, numbers, negations, and code symbols off limits. Those pieces carry outsized meaning and deserve their own space. A sliding window over the prompt, paired with a cap on the share of tokens merged per window, keeps things steady and predictable.
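A sketch of that selection loop, assuming unit-normalized embeddings and a precomputed boolean `protected` mask; the 0.9 floor, 64-token window, and 25% cap are placeholder knobs, not canonical values.

```python
import numpy as np

def find_merge_pairs(embs, protected, floor=0.9, window=64, max_frac=0.25):
    """Pick adjacent token pairs to merge.

    embs: (seq_len, dim) array of unit-normalized token embeddings.
    protected: boolean array marking tokens that must never merge
               (numbers, negations, code symbols, rare tokens).
    Returns a list of (i, i+1) index pairs, capped per window.
    """
    pairs = []
    seq_len = len(embs)
    for start in range(0, seq_len, window):
        end = min(start + window, seq_len)
        budget = int(max_frac * (end - start))  # cap merges per window
        used = set()
        for i in range(start, end - 1):
            if i in used or (i + 1) in used:
                continue
            if protected[i] or protected[i + 1]:
                continue
            sim = float(np.dot(embs[i], embs[i + 1]))  # cosine similarity
            if sim >= floor and budget > 0:
                pairs.append((i, i + 1))
                used.update((i, i + 1))
                budget -= 1
    return pairs
```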
After you blend two vectors, you still need a position. A simple plan keeps the first token’s position and drops the second. That choice plays nicely with rotary positional embeddings, since you can reuse the original index without recalculating anything fancy.
The model sees a shorter sequence with a token that “feels” like both originals. If your stack supports it, you may encode a soft midpoint, yet most teams find the first-position rule fast, clean, and easy to debug.
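Putting the pieces together, a sketch that applies the selected merges and keeps the first token’s position id, reusing the `slerp` helper from the earlier sketch:

```python
import numpy as np

def apply_merges(embs, position_ids, pairs, t=0.5):
    """Fuse each selected pair and keep the first token's position id.

    Returns the shortened embedding sequence plus position ids that
    can be fed to rotary position embedding without recomputation.
    """
    drop = {j for _, j in pairs}          # second tokens get absorbed
    merged = {i: j for i, j in pairs}     # first token -> its partner
    new_embs, new_pos = [], []
    for i in range(len(embs)):
        if i in drop:
            continue
        if i in merged:
            new_embs.append(slerp(embs[i], embs[merged[i]], t))
        else:
            new_embs.append(embs[i])
        new_pos.append(position_ids[i])   # first token's index survives
    return np.stack(new_embs), np.array(new_pos)
```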
Speed is not only about the prefill pass. During generation, key and value caches grow with each token. Long chats and tool calls can swell those caches until memory gets tight. Many entries end up as near duplicates when the prompt repeats patterns.
You can search the cache for highly similar keys or values and merge them with SLERP. This trims memory use and reduces time spent moving tensors. Apply strict thresholds and avoid sensitive heads or layers if you see drift in tests. Even light cache merging can make a laptop or small GPU breathe easier.
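A per-head sketch of that idea, assuming `(seq_len, head_dim)` arrays and a strict 0.98 floor. Since raw keys and values are not unit length, rescaling the SLERP result to the kept entry’s norm is one simple choice here, not the only one.

```python
import numpy as np

def merge_kv_cache(keys, values, floor=0.98, t=0.5):
    """Fuse adjacent near-duplicate cache entries for one attention head.

    keys, values: (seq_len, head_dim) arrays. Entries whose keys have
    cosine similarity >= floor merge via SLERP; the strict floor and
    per-head application limit drift.
    """
    keep_k, keep_v = [keys[0]], [values[0]]
    for i in range(1, len(keys)):
        a, b = keep_k[-1], keys[i]
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim >= floor:
            # Absorb entry i into the previous kept slot, restoring
            # the kept entry's original scale after the unit-sphere blend.
            keep_k[-1] = slerp(a, b, t) * np.linalg.norm(a)
            keep_v[-1] = slerp(keep_v[-1], values[i], t) * np.linalg.norm(keep_v[-1])
        else:
            keep_k.append(keys[i])
            keep_v.append(values[i])
    return np.stack(keep_k), np.stack(keep_v)
```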

The blend weight should reflect importance. If attention scores, token saliency, or simple heuristics hint that one token matters more, lean the weight toward it. When both tokens play equal roles, pick the midpoint.
Another safe rule is to weight by inverse frequency, so common fillers bend toward the rarer neighbor. Keep the math in higher precision for near-parallel vectors: as the angle shrinks, sin θ in the denominator approaches zero and the division turns noisy. When the angle is extremely small, a normalized linear average is close enough and cheaper to compute.
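One way to turn inverse frequency into a blend weight, where `freq_a` and `freq_b` stand in for corpus frequencies of the two tokens (illustrative inputs, not a fixed API):

```python
def blend_weight(freq_a: float, freq_b: float) -> float:
    """Choose the SLERP weight t by inverse token frequency.

    Rarer tokens get more pull: t is the share of total inverse
    frequency owned by token b, so the merged vector leans toward
    the less common, more informative neighbor. Equal frequencies
    give the midpoint t = 0.5.
    """
    inv_a = 1.0 / max(freq_a, 1e-9)
    inv_b = 1.0 / max(freq_b, 1e-9)
    return inv_b / (inv_a + inv_b)
```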
Any compression brings risk, so guard the edges. Merges can blur small but key words like “not” or “off” if you fuse them with a stronger neighbor. Numbers, dates, units, currency symbols, and code markers deserve a no-merge tag.
Names and rare entities also benefit from protection. Keep a strict ceiling on how many tokens you merge per window and across the whole prompt. If a section reads like legal text or medical advice, consider turning merging off for that span. A light touch brings most of the gain with little change to meaning.
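A heuristic predicate for that no-merge tag might look like the following; the string checks and the rarity floor are illustrative, and real tokenizers will need their own rules.

```python
import re

NEGATIONS = {"not", "no", "never", "off", "none", "n't"}

def is_protected(token: str, token_freq: float, rare_floor: float = 1e-5) -> bool:
    """Heuristic no-merge tag for tokens that carry outsized meaning.

    Flags negations, digits (numbers, dates, versions), currency and
    code symbols, capitalized names, and rare tokens.
    """
    text = token.strip()
    if text.lower() in NEGATIONS:
        return True
    if re.search(r"[0-9]", text):
        return True
    if re.search(r"[$€£¥%{}\[\]()<>=+\-*/;:#_`]", text):
        return True
    if text[:1].isupper():          # likely a name or rare entity
        return True
    return token_freq < rare_floor  # rare tokens stay whole
```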
Measure before and after with the same prompts. Track prefill latency, tokens per second, peak memory, and cache size. Then check answer quality with a small audit set. Compare n-gram overlap, semantic similarity scores, and human spot checks on tricky cases.
Pay special attention to long prompts with repeated framing text. Those are sweet spots for merging because they contain clusters of near copies. Record both mean and tail latency, since merges often trim the slowest runs the most.
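A minimal harness for that comparison, where `run_model` is a hypothetical callable that returns once prefill completes; run it once with merging off and once with it on, over the same prompts.

```python
import time
import statistics

def benchmark(run_model, prompts, repeats=5):
    """Measure prefill latency per prompt; report mean and p95 tail.

    run_model is assumed to take a prompt string and return when the
    first token is ready. Swap in your own stack's entry point.
    """
    latencies = []
    for prompt in prompts:
        for _ in range(repeats):
            start = time.perf_counter()
            run_model(prompt)
            latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "mean_s": statistics.mean(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
    }
```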

If you ship this feature in a product, explain it in plain terms. People should know when prompts might be compacted during processing. Provide a simple switch to turn it off when fidelity matters more than speed, such as legal reviews or strict audits. For research or hobby use, note the setting in the release notes so others can compare results fairly and reproduce your setup without guessing.
Make the feature accessible and fair from day one. Tell users when compacting is active, ask for consent where needed, and give a plain audit view that shows when merges happened and why. Document known risks, like possible loss of rare details, and state the steps you take to limit them in simple language. Keep reviewers in the loop with change logs and reproducible configs so they can trace behavior across versions.
SLERP-based token merging trims sequence length and eases attention work while keeping meaning pointed in the right direction. The method blends similar token vectors on the unit sphere, places the result back into the sequence, and applies careful rules so key details stay intact.
By guarding sensitive tokens, choosing weights that reflect importance, and limiting how much you merge, you gain faster responses and lower memory use without turning answers mushy. Add clean metrics, roll it out with restraint, and you get a tidy speedup that fits today’s transformer stacks and plays nicely with real prompts that tend to repeat themselves.