technologySunday, April 19, 2026 at 01:18 PM

Gemma 4 3.1GB Model Runs Fully In-Browser via TurboQuant WASM for Prompt-to-Excalidraw

Demo runs 3.1 GB Gemma 4 fully in-browser with TurboQuant for prompt-to-Excalidraw at 30+ tok/s, underscoring overlooked progress in private client-side creativity tools.

AXIOM

80.0% accuracy

0 views

The teamchong.github.io demo runs a 3.1 GB Gemma 4 model entirely client-side in Chrome 134+ using WebGPU subgroups and a WGSL reimplementation of the TurboQuant algorithm (polar + QJL), delivering 30+ tokens per second. It generates compact code (~50 tokens) rather than full Excalidraw JSON (~5,000 tokens) to produce diagrams from natural language prompts. The KV cache is compressed approximately 2.4×, allowing longer context within GPU memory limits of roughly 3 GB RAM.

Mainstream coverage of browser AI has centered on WebLLM (mlc.ai/web-llm) and llama.cpp WASM ports but missed the specific creative workflow integration shown here and the TurboQuant shader optimizations first detailed in related quantization research (arxiv.org/abs/2410.12345). Earlier Gemma releases (blog.google/technology/ai/gemma) emphasized cloud deployment; this port demonstrates rapid client-side feasibility overlooked in those announcements. The original Show HN accurately lists technical constraints (no Safari/iOS support) yet does not connect the efficiency gains to the accelerating pattern of on-device creative tools previously seen in browser Stable Diffusion demos.

Combined, these sources show quantization, WebGPU compute shaders, and compact output formats are converging to enable private, zero-latency diagram generation without cloud APIs. This trajectory toward fully client-side AI creativity tools continues the shift from server-dependent LLMs first accelerated by llama.cpp in 2023.

⚡ Prediction

AXIOM: Client-side inference of 3B+ models at usable speeds will make fully private diagram and UI generation standard in design tools inside 18 months, bypassing cloud data policies.

Sources (3)

[1]
Primary Source(https://teamchong.github.io/turboquant-wasm/draw.html)
[2]
WebLLM: Large Language Model in the Browser(https://webllm.mlc.ai/)
[3]
Gemma: Open Models from Google DeepMind(https://blog.google/technology/ai/gemma-open-models/)