TurboQuant - Google's Vector Quantization Algorithm
TurboQuant is a vector quantization algorithm from Google Research that compresses a large language model's KV cache from 32-bit to 3-bit, cutting memory usage by 6x and speeding up inference by 8x with zero loss of accuracy. It applies a random rotation to transform vectors into a coordinate system whose coordinates follow a Beta distribution, then adds a 1-bit QJL residual correction; no calibration constants or model fine-tuning are needed, so it is plug and play. TurboQuant has been validated on long-context tasks with models such as Gemma and Mistral, providing a key enabler for edge-device deployment and lower cloud inference costs.
TurboQuant’s main features
- Extreme compression: compresses the 32-bit floating-point KV cache to 3-bit, reducing memory usage by more than 6x, with an optional 1-bit extreme-compression mode.
- Faster inference: highly vectorized quantization kernels speed up attention computation by 8x on an H100 GPU, significantly reducing inference latency.
- Accuracy preserved: on long-context benchmarks such as LongBench and Needle in a Haystack, the compressed model scores exactly match the original model, achieving true zero accuracy loss.
- Plug and play: a data-independent online quantization strategy requires no model retraining, fine-tuning, or dataset-specific calibration, keeping the deployment barrier low.
- Dual-mode quantization: an MSE-optimized mode minimizes reconstruction error, while an inner-product-optimized mode gives unbiased attention-score estimates, covering different application scenarios.
- Multiple use cases: applies to large-model KV cache compression for ultra-long contexts and to nearest-neighbor search in vector databases, outperforming traditional methods on both recall and indexing speed.
TurboQuant’s technical principles
- Random rotation: applying a random rotation matrix transforms high-dimensional vectors from Cartesian coordinates into a space where each coordinate follows a Beta distribution and coordinates are nearly independent; each coordinate can then be quantized independently with an optimal scalar quantizer, without storing data-dependent calibration constants.
- Optimal scalar quantization: using the statistical properties of the Beta distribution, the Lloyd-Max algorithm solves the continuous one-dimensional k-means problem, and an optimal quantization codebook is precomputed for each coordinate, achieving near-optimal MSE rate-distortion.
- Two-stage residual correction: the MSE-optimal quantizer performs the main compression, then a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform is applied to the residual vector, removing the bias of inner-product estimates for unbiased, low-distortion attention computation.
- Information-theoretic optimality: TurboQuant's distortion rate is proven to be within a constant factor of about 2.7x of the Shannon lower bound, and closer still at low bit widths, theoretically validating the algorithm's performance.
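The two-stage pipeline above can be conveyed in a minimal NumPy sketch. This is an illustrative simplification, not the paper's implementation: it trains a per-coordinate Lloyd-Max codebook on Gaussian samples (a stand-in for the Beta-distributed coordinates the paper derives), and replaces the QJL residual stage with a plain sign-plus-shared-scale correction.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def lloyd_max_codebook(samples, k, iters=100):
    # 1-D Lloyd-Max (continuous k-means) trained on sample coordinates
    centers = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(idx == j):
                centers[j] = samples[idx == j].mean()
    return np.sort(centers)

def quantize(v, centers):
    # Snap each coordinate to its nearest codebook center
    return centers[np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)]

d = 64
R = random_rotation(d)
x = rng.normal(size=d)
y = R @ x                                   # rotated vector

# Stage 1: 3-bit (k = 8) MSE-optimal scalar quantization per coordinate
centers = lloyd_max_codebook(rng.normal(size=20000), k=8)
y_hat = quantize(y, centers)

# Stage 2: 1-bit residual correction (sign + one shared scale;
# the paper uses a QJL transform here instead)
r = y - y_hat
r_hat = np.sign(r) * np.abs(r).mean()

x_hat = R.T @ (y_hat + r_hat)               # decode: undo the rotation
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Storing only the 3-bit codebook indices plus one residual sign bit per coordinate, the reconstruction stays within a small relative error of the original vector, which is the essence of the memory savings.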
TurboQuant key information and usage requirements
- Publisher: jointly developed by Google Research and Google DeepMind; the paper was published at ICLR 2026.
- Core metrics: KV cache compressed to 3-bit, memory reduced 6x, inference sped up 8x, zero accuracy loss.
- Technique: two stages, PolarQuant (random rotation + Beta-distribution quantization) followed by QJL (1-bit residual correction).
- Theoretical guarantee: distortion rate within 2.7x of the information-theoretic lower bound, and only 1.45x at 1-bit.
- Validated models: open-source models such as Gemma and Mistral, tested on 5 long-context benchmarks including LongBench and Needle in a Haystack.
- Community implementations: multiple third-party versions have appeared in PyTorch, MLX, C/CUDA, and more.
- No training required: no retraining or fine-tuning of the model; applied directly to the pretrained model.
- No calibration required: a data-independent online quantization strategy needs no offline calibration or dataset-specific preprocessing.
- Hardware: an AI accelerator with vector units (e.g. a GPU) is needed for best performance, but the algorithm itself is not tied to specific hardware.
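The 1-bit QJL residual stage mentioned above is built on sign random projections. A SimHash-style estimator conveys the core idea, though the paper's actual QJL estimator differs in its details: project onto shared random Gaussian directions, keep only the signs, and recover the inner product from the sign-agreement rate.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 4096                  # vector dimension, number of 1-bit projections
S = rng.normal(size=(m, d))      # shared random projection matrix

def sketch(x):
    # 1-bit signature: sign of each random projection
    return (S @ x) >= 0

def estimate_inner(sig_x, sig_y, norm_x, norm_y):
    # Fraction of disagreeing signs estimates the angle between x and y
    theta = np.pi * np.mean(sig_x != sig_y)
    return norm_x * norm_y * np.cos(theta)

x = rng.normal(size=d)
y = x + 0.3 * rng.normal(size=d)             # a correlated query vector
true_ip = x @ y
approx = estimate_inner(sketch(x), sketch(y),
                        np.linalg.norm(x), np.linalg.norm(y))
print(true_ip, approx)
```

Each vector is reduced to m bits plus its norm, yet the inner-product estimate concentrates around the true value as m grows, which is why a 1-bit residual code can debias attention-score estimates so cheaply.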
TurboQuant’s Core Advantages
- Extreme compression ratio: compresses the 32-bit KV cache to 3-bit, cutting memory usage by more than 6x, with a minimum 1-bit extreme mode, significantly easing the memory bottleneck in long-context scenarios.
- Zero accuracy loss: across 5 long-context benchmarks the compressed model scored exactly the same as the original model, achieving truly lossless rather than near-lossless compression.
- Significant inference speedup: the highly vectorized algorithm design speeds up attention computation by 8x, effectively reducing inference latency and raising throughput.
- Plug-and-play deployment: no model retraining, fine-tuning, or data calibration is needed; it works out of the box, substantially lowering implementation and deployment costs.
- Near-optimal in theory: the distortion rate is within only about a 2.7x constant factor of the Shannon information-theoretic lower bound, with an even smaller gap at low bit widths, approaching the theoretical limit.
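The memory claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below uses hypothetical model dimensions (not taken from the paper) and assumes 3-bit codes plus a 1-bit residual per element; real deployments also store small per-vector scales and metadata, which pushes the realized ratio from this idealized 8x toward the reported ~6x.

```python
# Back-of-envelope KV cache sizing (hypothetical dimensions, not from the paper)
layers, kv_heads, head_dim = 32, 8, 128
ctx_tokens = 1_000_000

elems = 2 * layers * kv_heads * head_dim * ctx_tokens   # K and V tensors
fp32_gib = elems * 32 / 8 / 2**30                       # 32 bits per element

bits_per_elem = 3 + 1            # 3-bit main codes + 1-bit residual correction
quant_gib = elems * bits_per_elem / 8 / 2**30

print(f"fp32:  {fp32_gib:.1f} GiB")
print(f"quant: {quant_gib:.1f} GiB  ({fp32_gib / quant_gib:.0f}x smaller)")
```

At these illustrative dimensions a million-token fp32 KV cache runs to hundreds of GiB, while the quantized version fits in tens, which is exactly the regime where ultra-long contexts become feasible.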
How to use TurboQuant
The official open-source code has not yet been released. Watch the Google Research repositories or the arXiv paper page for the latest open-source news.
TurboQuant project address
- Official website: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
- arXiv paper: https://arxiv.org/pdf/2504.19874
How TurboQuant compares with similar approaches
| Dimension | TurboQuant | H2O | GPTQ |
|---|---|---|---|
| Technical route | Vector quantization (3-bit compression) | Sparsity: retain heavy hitters | Static weight quantization (4-bit) |
| Compression target | KV cache (activations) | KV cache (selective eviction) | Model weights |
| Compression ratio | 6x (32-bit → 3-bit) | ~2–4x (configuration-dependent) | 4x (weights) |
| Accuracy loss | Zero (benchmark-identical) | Minor | Minor |
| Training required | No | No | No |
| Calibration required | No (data-independent) | No | Yes (calibration dataset) |
| Dynamic input support | Yes (online quantization) | Yes | No (offline quantization) |
| Speedup | 8x (attention computation) | Limited | Limited; mainly saves memory |
Application scenarios of TurboQuant
- Long-context LLM serving: 6x KV cache compression lets cloud APIs support million-token contexts, significantly cutting compute cost and improving concurrency.
- Consumer GPU deployment: enables consumer GPUs with 32GB of VRAM to smoothly run long-context tasks on models larger than 7B, breaking the memory bottleneck of local deployment.
- Edge-device inference: a compression path for memory-constrained devices such as phones and IoT hardware, bringing large-model capability on-device.
- Vector database search: replaces traditional Product Quantization to deliver higher-recall, lower-latency semantic search in RAG systems.