IndexCache - A sparse attention acceleration technology jointly developed by Tsinghua University and Zhipu

IndexCache is a sparse attention acceleration technology developed by teams from Tsinghua University and Zhipu. It targets the high computational overhead of the indexer in DeepSeek Sparse Attention (DSA) and reduces redundant computation by reusing indexes across layers. The team found that the overlap rate of the top-k tokens selected by adjacent layers is as high as 70%-100%, so layers are divided into "full layers" (which compute and cache indexes) and "shared layers" (which directly reuse the cache). This removes 75% of indexer computation and achieves 1.82x prefill and 1.48x decoding speedups in a 200K-context scenario, with almost no loss in model quality. The method has been validated on models ranging from a 30B-parameter DSA model to the 744B-parameter GLM-5.

Main functions of IndexCache

  • Cross-layer index reuse: Exploits the 70%-100% overlap between the top-k indexes of adjacent layers, letting shared layers directly reuse the indexes cached by full layers instead of recomputing them (see the sketch after this list).
  • Drastically reduced indexer overhead: Removes 75% of indexer computation, retaining only 1/4 of the indexers while maintaining model quality.
  • Faster inference: Delivers 1.82x prefill and 1.48x decoding speedups at 200K context, shortening user wait times.
  • Zero additional memory overhead: Reuse is implemented as a single conditional branch and allocates no extra GPU memory.
  • Two deployment options: A training-free scheme that finds the optimal layer pattern via greedy search, and a training-aware scheme that optimizes indexer parameters with a multi-layer distillation loss.
  • Production-level verification: Validated on both a 30B-parameter model and the 744B-parameter GLM-5, with support for the SGLang and vLLM inference frameworks.
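
Below is a minimal PyTorch sketch of the reuse mechanism. Everything here is illustrative rather than IndexCache's actual API (the class and cache names are invented), and DSA's per-query top-k selection is simplified to one token set shared by all queries:

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, topk_idx):
    """Attend only to the selected tokens (single sequence, single head)."""
    k_sel, v_sel = k[topk_idx], v[topk_idx]              # [top_k, dim]
    attn = F.softmax(q @ k_sel.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v_sel

class IndexCachedLayer(torch.nn.Module):
    def __init__(self, dim: int, is_full_layer: bool, top_k: int):
        super().__init__()
        self.is_full_layer = is_full_layer  # fixed by the chosen layer pattern
        self.top_k = top_k
        self.index_proj = torch.nn.Linear(dim, dim)  # stand-in for DSA's indexer

    def forward(self, q, k, v, index_cache: dict):
        if self.is_full_layer:
            # Full layer: run the indexer, select top-k tokens, cache indices.
            scores = self.index_proj(q) @ k.T            # [q_len, seq]
            index_cache["latest"] = scores.mean(0).topk(self.top_k).indices
        # Shared layers skip the branch above and reuse the indices cached by
        # the most recent full layer as-is: one conditional, no extra memory.
        return sparse_attention(q, k, v, index_cache["latest"])

# With is_full_layer=(i % 4 == 0), only one layer in four runs its indexer,
# matching the 1/4 retention ratio described above.
layers = [IndexCachedLayer(64, is_full_layer=(i % 4 == 0), top_k=8) for i in range(8)]
x, cache = torch.randn(16, 64), {}
for layer in layers:
    x = layer(x, x, x, cache)
```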

Technical principles of IndexCache

  • Cross-layer index similarity: Heat-map analysis revealed that the top-k token sets produced by the indexers of adjacent DSA layers are highly similar, with overlap rates generally between 70% and 100%, meaning much of the index computation is redundant.
  • Layer role division: Model layers are split into two kinds. Full layers keep the original indexer and compute and cache the latest top-k indexes; shared layers skip their own indexer and directly reuse the indexes cached by the most recent full layer for sparse attention.
  • Layer-pattern selection: For already-trained models, a greedy search over calibration data converts layers into shared layers one by one, evaluating the impact on model output and keeping critical layers as full layers (a sketch follows this list); for training from scratch, a multi-layer distillation loss teaches each full-layer indexer to serve several subsequent shared layers at once.
  • Inference-path optimization: At inference time each layer adds only a simple conditional check that switches between computing new indexes and reusing cached ones according to the preset pattern, sharing indexers across layers without modifying the model architecture or adding storage.
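
A hedged sketch of that greedy search follows. `set_shared`, `num_layers`, the `evaluate` callback, and the acceptance rule are hypothetical stand-ins; the authors' exact procedure may differ:

```python
from typing import Callable, Set

def greedy_layer_pattern(model, evaluate: Callable, calib_batches,
                         budget: int, tolerance: float) -> Set[int]:
    """Greedily convert layers to shared layers while quality stays in tolerance.

    evaluate(model, batches) returns a quality score on calibration data
    (e.g. average log-likelihood); higher is better.
    """
    baseline = evaluate(model, calib_batches)
    shared: Set[int] = set()
    while len(shared) < budget:
        best_layer, best_score = None, float("-inf")
        for i in range(1, model.num_layers):   # layer 0 stays a full layer
            if i in shared:
                continue
            model.set_shared(i, True)          # trial conversion
            score = evaluate(model, calib_batches)
            model.set_shared(i, False)
            if score > best_score:
                best_layer, best_score = i, score
        if best_layer is None or best_score < baseline - tolerance:
            break                              # further conversions hurt quality
        shared.add(best_layer)                 # commit the least harmful layer
        model.set_shared(best_layer, True)
    return shared
```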

Key information and usage requirements of IndexCache

  • Proposing organizations: Jointly developed by Tsinghua University and Zhipu (Z.ai).
  • Target problem: The computational bottleneck of the DSA indexer in long-context scenarios, where it accounts for up to 81% of prefill time at 200K tokens.
  • Core principle: The 70%-100% overlap between adjacent layers' top-k indexes lets cross-layer reuse eliminate redundant computation.
  • Acceleration effect: Retaining 1/4 of the indexers yields 1.82x prefill and 1.48x decoding speedups (a back-of-the-envelope bound is worked through after this list).
  • Performance loss: Almost none; the model even performs slightly better on some reasoning tasks.
  • Validated models: A 30B-parameter DSA model and the 744B-parameter GLM-5.
  • Hardware requirements: An NVIDIA GPU (e.g., H100); no additional GPU memory is needed, since standard DSA memory is reused.
  • Software environment: Supports the SGLang and vLLM frameworks, with ready-made patches that apply directly to models such as DeepSeek-V3.2 and GLM-5.
  • Training-free scheme: Applies to already-trained DSA models; a small batch of calibration data is needed to run the greedy search that determines the optimal layer pattern.
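
As a sanity check on these figures, the arithmetic below (my own, not from the paper) gives the idealized Amdahl-style ceiling implied by an 81% indexer share and 1/4 indexer retention:

```python
indexer_share = 0.81   # indexer's share of 200K-token prefill time (quoted above)
kept_fraction = 0.25   # 1/4 of indexers retained as full layers

remaining = (1 - indexer_share) + indexer_share * kept_fraction
print(f"idealized prefill speedup ceiling: {1 / remaining:.2f}x")  # ~2.55x
# The measured 1.82x sits below this ceiling, plausibly because non-indexer
# kernels and memory traffic do not shrink when indexer work is removed.
```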

IndexCache’s core advantages

  • Significant acceleration: 1.82x faster prefill and 1.48x faster decoding at 200K context, markedly reducing user wait times.
  • Near-zero performance loss: After removing 75% of indexer computation, model quality is almost unchanged, and some tasks even improve slightly.
  • Zero additional overhead: A single conditional branch implements reuse without increasing GPU memory usage, relying on the memory already allocated by standard DSA.
  • Plug and play: Ships SGLang and vLLM patches that require no changes to the model architecture and apply directly to mainstream models such as DeepSeek-V3.2 and GLM-5.
  • Flexible deployment: Supports both training-free and training-aware schemes, covering already-trained models as well as training from scratch, with a configurable indexer retention ratio (a loss sketch follows this list).
  • Production-level verification: Proven effective on the 744B-parameter GLM-5, demonstrating readiness for large-scale deployment.
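
For the training-aware scheme, a plausible shape for the multi-layer distillation loss is sketched below. The exact objective is not specified in this summary, so treat the KL form and the uniform averaging as assumptions:

```python
import torch
import torch.nn.functional as F

def multi_layer_distill_loss(indexer_logits: torch.Tensor,
                             group_attn_dists: list) -> torch.Tensor:
    """Train one full layer's indexer against the attention distributions of
    every layer in its sharing group (itself plus the shared layers that
    will reuse its indices).

    indexer_logits: [seq] raw scores from the full layer's indexer
    group_attn_dists: per-layer [seq] reference attention distributions
    """
    log_p = F.log_softmax(indexer_logits, dim=-1)
    loss = indexer_logits.new_zeros(())
    for target in group_attn_dists:
        # KL(target || indexer): pull the indexer toward the tokens each
        # downstream shared layer actually attends to.
        loss = loss + F.kl_div(log_p, target, reduction="sum")
    return loss / len(group_attn_dists)
```

Averaging over the group encourages a single full-layer indexer to cover the union of tokens its shared layers need, which is what lets those layers drop their own indexers.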

IndexCache project address

Comparison of IndexCache with similar approaches

| Comparison Dimension | IndexCache | Native DSA | Full-Attention Anchor Method |
| --- | --- | --- | --- |
| Core mechanism | Cross-layer reuse of the indexer's top-k output | Each layer runs a lightweight indexer independently | Relies on full-attention anchor layers for index reuse |
| Computational overhead | Removes 75% of indexers; 1.82x prefill speedup | Indexer takes 81% of prefill time at 200K context | Must retain full-attention layers; high compute cost |
| Applicable scenarios | DSA architectures that fully eliminate full attention | Standard DSA deployment | Architectures that require full attention as an anchor |
| Implementation complexity | One if/else branch; zero extra GPU memory | Standard implementation | Requires designing an anchor-layer strategy |
| Training requirements | Training-free deployment or training-aware optimization | Requires full training | Usually requires joint training |
| Production verification | Validated on the 744B GLM-5 | In production use in DeepSeek-V3 | Mostly small- and medium-scale experiments |

Application scenarios of IndexCache

  • Long-document processing: For scenarios such as paper reading and legal contract analysis, 1.82x faster prefill at 200K context sharply reduces time to first token.
  • Multi-step reasoning tasks: Supports complex logical chains such as mathematical proofs and code generation; the 1.48x decoding speedup accelerates chain-of-thought generation.
  • Agent workflows: Powers agentic processes such as multi-turn tool invocation and autonomous task planning, lowering the cost of long-context inference and enabling more complex agent interactions.
  • RAG systems: Enhances retrieval-augmented generation over large knowledge bases, efficiently integrating web-scale retrieval results into long-context generation.
  • Real-time conversation services: Suits online services such as customer-service bots and intelligent assistants, increasing throughput, lowering serving costs, and improving end-user experience.