
Intel and Weizmann Institute Speed AI with Speculative Decoding Advance
At the International Conference on Machine Learning (ICML) in Vancouver, Canada, researchers from Intel Labs and the Weizmann Institute of Science introduced a major advance in speculative decoding. The new technique enables any small "draft" model to accelerate any large language model (LLM), regardless of vocabulary differences. "We have solved a core inefficiency in generative AI. Our research shows how to turn speculative acceleration into a universal tool. This isn't just a theoretical improvement; these are practical tools that are already helping developers build faster and smarter applications today," said Oren Pereg, senior researcher, Natural Language Processing Group, Intel Labs.
Speculative decoding is an inference optimization technique designed to make LLMs faster and more efficient without compromising accuracy. It works by pairing a small, fast model with a larger, more accurate one, creating a "team effort" between models. Consider a prompt to an AI model: "What is the capital of France…" A traditional LLM generates each word step by step: it fully computes "Paris," then "a," then "famous," then "city," and so on, consuming significant resources at every step. With speculative decoding, the small assistant model quickly drafts the full phrase "Paris, a famous city…" and the large model then verifies the sequence, dramatically reducing the compute cycles per output token.
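To make the draft-and-verify flow concrete, here is a minimal Python sketch of the basic speculative decoding loop. It is an illustration only, not Intel's or the Weizmann Institute's implementation: the names speculative_decode, draft_model, target_model, and toy_model are hypothetical, the "models" are simple stand-in functions, and verification is done by exact token match rather than the probabilistic acceptance used in real systems.

def speculative_decode(prompt, draft_model, target_model, num_tokens=8, draft_len=4):
    """Generate num_tokens continuation tokens using a draft/verify loop.

    draft_model and target_model are assumed to be callables that take a
    token sequence and return the next token (a toy interface, not a real API).
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. The small draft model cheaply proposes a block of candidate tokens.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_model(tokens + draft))

        # 2. The large target model verifies the drafted block. A real system
        #    scores all drafted positions in a single forward pass; this toy
        #    version checks them one at a time for clarity.
        accepted = []
        for proposed in draft:
            expected = target_model(tokens + accepted)
            if proposed == expected:
                accepted.append(proposed)   # draft token matches: accept it
            else:
                accepted.append(expected)   # first mismatch: take the target's
                break                       # token and discard the rest

        tokens.extend(accepted)
    return tokens[len(prompt):][:num_tokens]


# Toy usage: both "models" deterministically continue the same sentence, so
# every drafted token is accepted and the large model does far fewer steps.
SENTENCE = "Paris , a famous city and the capital of France .".split()

def toy_model(tokens):
    return SENTENCE[len(tokens) % len(SENTENCE)]

print(speculative_decode([], toy_model, toy_model, num_tokens=6))
# ['Paris', ',', 'a', 'famous', 'city', 'and']

The speedup comes from step 2: when the draft model's guesses are accepted, the large model effectively validates several tokens for roughly the cost of one of its own generation steps.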