TurboQuant - Google's Vector Quantization Algorithm
TurboQuant is a vector quantization algorithm from Google Research that compresses a large language model's KV cache from 32-bit to 3-bit, cutting memory usage by 6x and speeding up inference by 8x with zero loss of accuracy. It applies a random rotation to transform vectors into a coordinate system whose coordinates follow a Beta distribution, then adds a 1-bit QJL residual correction; no calibration constants or model fine-tuning are needed, so it is plug and play. TurboQuant has been validated on long-context tasks with models such as Gemma and Mistral, providing a key enabler for edge-device deployment and lower cloud inference costs.
TurboQuant’s main features
- Extreme compression: compresses the 32-bit floating-point KV cache to 3-bit, reducing memory usage by more than 6x, with an optional 1-bit extreme-compression mode.
- Faster inference: highly vectorized quantization kernels speed up attention computation by 8x on an H100 GPU, significantly reducing inference latency.
- Accuracy preserved: on long-context benchmarks such as LongBench and Needle in a Haystack, the compressed model scores exactly match the original model, achieving true zero accuracy loss.
- Plug and play: a data-independent online quantization strategy requires no model retraining, fine-tuning, or dataset-specific calibration, keeping the deployment barrier low.
- Dual-mode quantization: an MSE-optimized mode minimizes reconstruction error, while an inner-product-optimized mode gives unbiased attention-score estimates, covering different application scenarios.
- Multiple use cases: applies to large-model KV cache compression for ultra-long contexts and to nearest-neighbor search in vector databases, outperforming traditional methods on both recall and indexing speed.
TurboQuant’s technical principles
- Random rotation: applying a random rotation matrix transforms high-dimensional vectors from Cartesian coordinates into a space where each coordinate follows a Beta distribution and coordinates are nearly independent; each coordinate can then be quantized independently with an optimal scalar quantizer, without storing data-dependent calibration constants.
- Optimal scalar quantization: using the statistical properties of the Beta distribution, the Lloyd-Max algorithm solves the continuous one-dimensional k-means problem, and an optimal quantization codebook is precomputed for each coordinate, achieving near-optimal MSE rate-distortion.
- Two-stage residual correction: the MSE-optimal quantizer performs the main compression, then a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform is applied to the residual vector, removing the bias of inner-product estimates for unbiased, low-distortion attention computation.
- Information-theoretic optimality: TurboQuant's distortion rate is proven to be within a constant factor of about 2.7x of the Shannon lower bound, and closer still at low bit widths, theoretically validating the algorithm's performance.
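The two-stage pipeline above can be conveyed in a minimal NumPy sketch. This is an illustrative simplification, not the paper's implementation: it trains a per-coordinate Lloyd-Max codebook on Gaussian samples (a stand-in for the Beta-distributed coordinates the paper derives), and replaces the QJL residual stage with a plain sign-plus-shared-scale correction.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def lloyd_max_codebook(samples, k, iters=100):
    # 1-D Lloyd-Max (continuous k-means) trained on sample coordinates
    centers = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(idx == j):
                centers[j] = samples[idx == j].mean()
    return np.sort(centers)

def quantize(v, centers):
    # Snap each coordinate to its nearest codebook center
    return centers[np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)]

d = 64
R = random_rotation(d)
x = rng.normal(size=d)
y = R @ x                                   # rotated vector

# Stage 1: 3-bit (k = 8) MSE-optimal scalar quantization per coordinate
centers = lloyd_max_codebook(rng.normal(size=20000), k=8)
y_hat = quantize(y, centers)

# Stage 2: 1-bit residual correction (sign + one shared scale;
# the paper uses a QJL transform here instead)
r = y - y_hat
r_hat = np.sign(r) * np.abs(r).mean()

x_hat = R.T @ (y_hat + r_hat)               # decode: undo the rotation
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Storing only the 3-bit codebook indices plus one residual sign bit per coordinate, the reconstruction stays within a small relative error of the original vector, which is the essence of the memory savings.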
TurboQuant key information and usage requirements
- Publisher: jointly developed by Google Research and Google DeepMind; the paper was published at ICLR 2026.
- Core metrics: KV cache compressed to 3-bit, memory reduced 6x, inference sped up 8x, zero accuracy loss.
- Technique: two stages, PolarQuant (random rotation + Beta-distribution quantization) followed by QJL (1-bit residual correction).
- Theoretical guarantee: distortion rate within 2.7x of the information-theoretic lower bound, and only 1.45x at 1-bit.
- Validated models: open-source models such as Gemma and Mistral, tested on 5 long-context benchmarks including LongBench and Needle in a Haystack.
- Community implementations: multiple third-party versions have appeared in PyTorch, MLX, C/CUDA, and more.
- No training required: no retraining or fine-tuning of the model; applied directly to the pretrained model.
- No calibration required: a data-independent online quantization strategy needs no offline calibration or dataset-specific preprocessing.
- Hardware: an AI accelerator with vector units (e.g. a GPU) is needed for best performance, but the algorithm itself is not tied to specific hardware.
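The 1-bit QJL residual stage mentioned above is built on sign random projections. A SimHash-style estimator conveys the core idea, though the paper's actual QJL estimator differs in its details: project onto shared random Gaussian directions, keep only the signs, and recover the inner product from the sign-agreement rate.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 4096                  # vector dimension, number of 1-bit projections
S = rng.normal(size=(m, d))      # shared random projection matrix

def sketch(x):
    # 1-bit signature: sign of each random projection
    return (S @ x) >= 0

def estimate_inner(sig_x, sig_y, norm_x, norm_y):
    # Fraction of disagreeing signs estimates the angle between x and y
    theta = np.pi * np.mean(sig_x != sig_y)
    return norm_x * norm_y * np.cos(theta)

x = rng.normal(size=d)
y = x + 0.3 * rng.normal(size=d)             # a correlated query vector
true_ip = x @ y
approx = estimate_inner(sketch(x), sketch(y),
                        np.linalg.norm(x), np.linalg.norm(y))
print(true_ip, approx)
```

Each vector is reduced to m bits plus its norm, yet the inner-product estimate concentrates around the true value as m grows, which is why a 1-bit residual code can debias attention-score estimates so cheaply.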
TurboQuant’s Core Advantages
- Extreme compression ratio: compresses the 32-bit KV cache to 3-bit, cutting memory usage by more than 6x, with a minimum 1-bit extreme mode, significantly easing the memory bottleneck in long-context scenarios.
- Zero accuracy loss: across 5 long-context benchmarks the compressed model scored exactly the same as the original model, achieving truly lossless rather than near-lossless compression.
- Significant inference speedup: the highly vectorized algorithm design speeds up attention computation by 8x, effectively reducing inference latency and raising throughput.
- Plug-and-play deployment: no model retraining, fine-tuning, or data calibration is needed; it works out of the box, substantially lowering implementation and deployment costs.
- Near-optimal in theory: the distortion rate is within only about a 2.7x constant factor of the Shannon information-theoretic lower bound, with an even smaller gap at low bit widths, approaching the theoretical limit.
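The memory claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below uses hypothetical model dimensions (not taken from the paper) and assumes 3-bit codes plus a 1-bit residual per element; real deployments also store small per-vector scales and metadata, which pushes the realized ratio from this idealized 8x toward the reported ~6x.

```python
# Back-of-envelope KV cache sizing (hypothetical dimensions, not from the paper)
layers, kv_heads, head_dim = 32, 8, 128
ctx_tokens = 1_000_000

elems = 2 * layers * kv_heads * head_dim * ctx_tokens   # K and V tensors
fp32_gib = elems * 32 / 8 / 2**30                       # 32 bits per element

bits_per_elem = 3 + 1            # 3-bit main codes + 1-bit residual correction
quant_gib = elems * bits_per_elem / 8 / 2**30

print(f"fp32:  {fp32_gib:.1f} GiB")
print(f"quant: {quant_gib:.1f} GiB  ({fp32_gib / quant_gib:.0f}x smaller)")
```

At these illustrative dimensions a million-token fp32 KV cache runs to hundreds of GiB, while the quantized version fits in tens, which is exactly the regime where ultra-long contexts become feasible.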
How to use TurboQuant
The official open-source code has not yet been released. Watch the Google Research repositories or the arXiv paper page for the latest open-source news.
TurboQuant project address
- Official website: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
- arXiv paper: https://arxiv.org/pdf/2504.19874
How TurboQuant compares with similar approaches
| Dimension | TurboQuant | H2O | GPTQ |
|---|---|---|---|
| Technical route | Vector quantization (3-bit compression) | Sparsity: retain heavy hitters | Static weight quantization (4-bit) |
| Compression target | KV cache (activations) | KV cache (selective eviction) | Model weights |
| Compression ratio | 6x (32-bit → 3-bit) | ~2–4x (configuration-dependent) | 4x (weights) |
| Accuracy loss | Zero (benchmark-identical) | Minor | Minor |
| Training required | No | No | No |
| Calibration required | No (data-independent) | No | Yes (calibration dataset) |
| Dynamic input support | Yes (online quantization) | Yes | No (offline quantization) |
| Speedup | 8x (attention computation) | Limited | Limited; mainly saves memory |
Application scenarios of TurboQuant
- Long-context LLM serving: 6x KV cache compression lets cloud APIs support million-token contexts, significantly cutting compute cost and improving concurrency.
- Consumer GPU deployment: enables consumer GPUs with 32GB of VRAM to smoothly run long-context tasks on models larger than 7B, breaking the memory bottleneck of local deployment.
- Edge-device inference: a compression path for memory-constrained devices such as phones and IoT hardware, bringing large-model capability on-device.
- Vector database search: replaces traditional Product Quantization to deliver higher-recall, lower-latency semantic search in RAG systems.