Boffins detail new algorithms that boost AI perf up to 2.8x (www.theregister.com)
from Powderhorn@beehaw.org to technology@beehaw.org on 17 Jul 17:49
https://beehaw.org/post/21149377

We all know that AI is expensive, but a new set of algorithms developed by researchers at the Weizmann Institute of Science, Intel Labs, and d-Matrix could significantly reduce the cost of serving up your favorite large language model (LLM) with just a few lines of code.

Presented at the International Conference on Machine Learning this week and detailed in an accompanying paper, the algorithms offer a new spin on speculative decoding that the researchers say can boost token generation rates by as much as 2.8x while also eliminating the need for specialized draft models.

Speculative decoding, if you’re not familiar, isn’t a new concept. It works by using a small “draft” model (“drafter” for short) to predict the output of a larger, slower, but higher-quality “target” model.

If the draft model can successfully predict, say, the next four tokens in the sequence, that’s four tokens the bigger model doesn’t have to generate, and so we get a speed-up. If it’s wrong, the larger model discards the draft tokens and generates new ones itself. That last bit is important as it means the entire process is lossless — there’s no trade-off in quality required to get that speed-up.
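To make the mechanism concrete, here’s a toy sketch of the classic draft-and-verify loop (the greedy variant, not the new algorithms from the paper). The two “models” are hypothetical stubs I made up for illustration: `draft_next` plays the cheap drafter, `target_next` the slow, authoritative target.

```python
# Toy sketch of classic speculative decoding (greedy variant). The two
# "models" below are hypothetical stubs, not real LLMs.

K = 4  # how many tokens the drafter speculates per round

def draft_next(context):
    # Stub drafter: fast, but only approximates the target.
    return (sum(context) + 1) % 50

def target_next(context):
    # Stub target: the model whose output we must reproduce exactly.
    s = sum(context)
    return (s + 1) % 50 if s % 3 else 7

def speculative_decode(context, n_tokens):
    out = list(context)
    while len(out) - len(context) < n_tokens:
        # 1. Drafter proposes K tokens autoregressively (cheap).
        draft = []
        for _ in range(K):
            draft.append(draft_next(out + draft))
        # 2. Target verifies the proposals. In a real system all K
        #    positions are checked in one batched forward pass, which
        #    is where the speed-up comes from.
        for i in range(K):
            t = target_next(out + draft[:i])
            out.append(t)        # always keep the target's token...
            if t != draft[i]:
                break            # ...and discard the rest of a bad draft
    return out[:len(context) + n_tokens]

print(speculative_decode([1, 2, 3], 10))
```

Because the target’s own token replaces the first bad guess, the final sequence is identical to what the target would have produced on its own; the win is that the target checks K draft tokens in one pass instead of generating them one at a time.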

#technology


Hirom@beehaw.org on 17 Jul 18:15

There’s a GitHub issue to enable speculative decoding in Ollama.

Quexotic@beehaw.org on 21 Jul 10:58

Thank you!