FYI: Don’t Wait on llama.cpp for Gemma 4 MTP

If you want to use the native Multi-Token Prediction (MTP) drafters in Google’s new Gemma 4 models, you don’t have to wait around for upstream pull requests to merge into llama.cpp.

MTP is a specialized flavor of speculative decoding. A lightweight draft model predicts several future tokens at once, and the main model verifies them in a single parallel pass, keeping the tokens it agrees with. Because the main model checks every token that gets emitted, you get substantial throughput gains on consumer hardware without a drop in output quality.
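To make the draft-and-verify idea concrete, here is a toy sketch of greedy speculative decoding in plain Python. It is not how LiteRT-LM or Gemma 4's MTP drafters are implemented: the "models" are hypothetical stand-in functions (`target_next`, `draft_next`), and the verification step that a real engine runs as one batched forward pass is simulated with a loop. It only shows why the output matches what the main model would have produced on its own.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function mapping a token context to the next token.
NextTokenFn = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large main model (greedy next token)."""
    return (ctx[-1] + len(ctx)) % 7


def draft_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small drafter: usually agrees with the
    target, but guesses 0 whenever the context length is a multiple of 3."""
    return target_next(ctx) if len(ctx) % 3 else 0


def greedy_decode(model: NextTokenFn, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: List[int],
    n: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < n:
        # 1. The drafter proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks every proposed position. A real engine does
        #    this in a single batched pass; the loop below only simulates it.
        passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)  # the target's token is always kept
            if proposal[i] != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n], passes


baseline = greedy_decode(target_next, [1], 20)
output, passes = speculative_decode(target_next, draft_next, [1], 20, k=4)
print("matches baseline:", output == baseline)
print("verification passes:", passes, "vs 20 sequential target calls")
```

Each verification pass emits at least one token chosen by the main model, so the result is identical to plain greedy decoding. The speedup comes from how many draft tokens get accepted per pass.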

If you need this performance today for local app deployments, look at LiteRT-LM, Google's LLM inference framework built on LiteRT (the runtime formerly known as TensorFlow Lite). Because Gemma 4 and its MTP drafters are first-party Google designs, the framework supports them natively out of the gate.

According to Google’s official benchmarks, enabling MTP in LiteRT-LM delivers:

  • Up to a 2.2x speedup on mobile GPUs.
  • Up to a 1.5x speedup on mobile and desktop CPUs, with optimized pipelines that prevent costly data transfers between hardware backends.

If you’re targeting on-device deployments (mobile, desktop apps, or embedded hardware like a Raspberry Pi) and building around Gemma 4, LiteRT-LM is a fast path to production.

Get Started

It’s available as a simple PyPI package (pip install litert-lm-api), and turning on MTP takes just one parameter in your configuration:

import litert_lm

with litert_lm.Engine(
    "path/to/model.litertlm",
    backend=litert_lm.Backend.GPU(),
    enable_speculative_decoding=True,  # Enables MTP
) as engine:
    # Your local inference pipeline is ready; run inference with `engine` here.
    pass

You can find the full setup guides in the official Google AI Edge LiteRT-LM Documentation.


Need to ship optimized local builds? At Helixbound, we help engineering teams design and deploy high-efficiency, on-device architectures and agentic workflows. If you want to integrate models like Gemma 4 directly into your product line without the cloud overhead, let’s talk. Reach out via our Contact page to schedule a call.