If you want to use the native Multi-Token Prediction (MTP) drafters in Google’s new Gemma 4 models, you don’t have to wait around for upstream pull requests to merge into llama.cpp.
MTP is a specialized flavor of speculative decoding. It uses a lightweight draft model to predict several future tokens at once, and the main model then verifies them in a single parallel pass. Because the main model checks every drafted token and keeps only those that pass verification, throughput goes up on consumer hardware without degrading output quality.
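To make the draft-then-verify loop concrete, here is a minimal toy sketch of greedy speculative decoding in plain Python. It is not Gemma 4's MTP implementation or anything from LiteRT-LM: the "models" are hypothetical stand-in functions over integer tokens (target_next, draft_next), and the verification step is written as a loop where a real runtime would use one batched forward pass. It only illustrates why the output matches what the main model would have produced on its own while needing fewer main-model passes.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the main model: next token from the last two."""
    return (context[-1] + context[-2]) % 10


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for the drafter: usually agrees with the main model,
    but always guesses 0 right after a 7 (hypothetical error pattern)."""
    if context[-1] == 7:
        return 0
    return (context[-1] + context[-2]) % 10


def greedy_decode(prompt: List[Token], target: NextToken, num_new: int) -> List[Token]:
    """Baseline: one main-model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    prompt: List[Token],
    target: NextToken,
    draft: NextToken,
    num_new: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Draft k tokens, verify them with the main model, keep the agreed prefix.

    Returns the generated sequence and the number of verification passes.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new:
        # 1. The cheap drafter proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The main model checks every drafted position. A real runtime
        #    scores all positions in one parallel pass; we count it as one.
        target_passes += 1
        accepted: List[Token] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the main model's token
            if guess != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[: len(prompt) + num_new], target_passes


if __name__ == "__main__":
    prompt = [1, 1]
    baseline = greedy_decode(prompt, target_next, 20)
    spec, passes = speculative_decode(prompt, target_next, draft_next, 20, k=4)
    print("identical output:", spec == baseline)
    print("main-model passes:", passes, "vs", 20, "for plain greedy decoding")
```

In this sketch a wrong guess from the drafter never reaches the output; it only costs the tokens drafted after it. That is the property that lets the speedup coexist with unchanged output.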
If you need this performance today for local app deployments, look at LiteRT-LM (Google’s LLM runner built on LiteRT, the rebranded TensorFlow Lite). Because Gemma 4 and its MTP drafters are first-party Google designs, the framework supports them natively out of the gate.
According to Google’s official benchmarks, enabling MTP in LiteRT-LM delivers:
- Up to a 2.2x speedup on mobile GPUs.
- Up to a 1.5x speedup on mobile and desktop CPUs, with optimized pipelines that prevent costly data transfers between hardware backends.
If you’re targeting on-device deployments (mobile, desktop apps, or embedded hardware like a Raspberry Pi) and building around Gemma 4, LiteRT-LM is an incredibly fast path to production.
Get Started
It’s available as a simple PyPI package (pip install litert-lm-api), and turning on MTP takes just one parameter in your configuration:
import litert_lm
with litert_lm.Engine(
    "path/to/model.litertlm",
    backend=litert_lm.Backend.GPU(),
    enable_speculative_decoding=True,  # Enabling MTP
) as engine:
    # Your local inference pipeline is ready
    pass
You can find the full setup guides in the official Google AI Edge LiteRT-LM Documentation.
Need to ship optimized local builds? At Helixbound, we help engineering teams design and deploy high-efficiency, on-device architectures and agentic workflows. If you want to integrate models like Gemma 4 directly into your product line without the cloud overhead, let’s talk. Reach out through our Contact page to schedule a call!

