C-GSPN Logo

Compact GSPN: Scaling Spatial Propagation to Vision Foundation Models

1NVIDIA, 2The Chinese University of Hong Kong

Abstract

Scaling vision foundation models is limited by the quadratic cost of self-attention. Generalized Spatial Propagation Networks (GSPN) provide a linear-time alternative that propagates context directly on the 2D grid and removes positional embeddings, but have not been scaled to foundation-level training. We present Compact GSPN (C-GSPN), a ViT block with a compressed propagation space that preserves accuracy while cutting propagation latency by nearly 10×, complemented by lightweight projections and fused CUDA kernels for further efficiency. To pretrain at scale, we use a two-stage distillation scheme with module-wise supervision and end-to-end alignment. In a representative 1K configuration (batch 32, C=1152), C-GSPN yields up to 2× speedup, while maintaining competitive zero-shot accuracy and improving segmentation by +2.1%. Extensive experiments and ablations confirm that the proposed compression and two-stage distillation are key to achieving strong transfer while substantially reducing compute, offering a practical path toward subquadratic vision foundation models.

Performance vs. Latency vs. Model Size comparison

C-GSPN achieves 2.64× faster inference while maintaining competitive segmentation performance.

Motivation

Scaling vision foundation models is bottlenecked by self-attention's quadratic O(N²) cost, which explodes with higher resolutions and limits memory, latency, and high-resolution deployment. Generalized Spatial Propagation Networks (GSPN) offer a linear-time alternative, propagating features directly on 2D grids with O(√N) effective depth and no positional embeddings. However, original GSPN faces CUDA limitations that hinder scaling with increasing batch and channel sizes, and it has not yet been adapted to foundation-level data volumes and model scales.

We introduce Compact GSPN (C-GSPN), which leverages distillation for effective pretraining while enabling efficient scaling of subquadratic architectures to vision foundation models through compressed latent-space propagation, redesigned projections, and fused CUDA kernels—yielding ~10× latency reductions and addressing channel and batch bottlenecks for practical high-resolution encoders.

2D Propagation Performance vs Channels and Batch Size
Block Performance vs Image Resolution

Left: C-GSPN demonstrates significant speedups across different channel dimensions and batch sizes. Right: C-GSPN achieves 13.67× speedup over FlashAttention at 1036 resolution.

Method

Our method introduces three key innovations to enable efficient scaling of spatial propagation networks to foundation models: (1) Compressed latent-space propagation that reduces computational overhead by nearly 10×, (2) Unified CUDA Kernel for Overhead Reduction, (3) A two-stage distillation scheme with layer-wise pretraining and end-to-end alignment for effective foundation-level training, and (4) Progressive high-resolution transfer with feature interpolation for efficient adaptation under limited computational budgets.

Overview of the Compact GSPN method, showing the architecture design, two-stage distillation strategy, and high-resolution transfer approach.

Results

Latency Improvements

C-GSPN achieves significant latency reductions compared to both FlashAttention and the original GSPN across different resolutions. The compression and optimization techniques enable practical deployment at higher resolutions.

Sublayer and Block Time Comparison

Left: C-GSPN achieves 10× lower sublayer latency compared to GSPN across resolutions. Right: C-GSPN maintains consistently low block-level latency, approaching FlashAttention Tiling efficiency.

Detailed Latency and Throughput Comparison Table

Detailed latency and throughput measurements for sublayer, layer, and block levels across different resolutions.

High-Resolution Transfer with Limited Budget

When transferring to higher resolutions with limited computational budget, C-GSPN demonstrates strong performance consistency. With knowledge distillation, C-GSPN achieves 2.64× speedup over ViT-Distill while maintaining consistent segmentation performance across multiple resolutions (378, 518, 756, 1036).

High-Resolution Transfer Results with Limited Budget

C-GSPN with knowledge distillation (KD) achieves 2.64× speedup over ViT-Distill while maintaining consistent segmentation performance across resolutions, demonstrating effective high-resolution transfer with limited computational budget.

Model Quality: Zero-Shot Transfer and Dense Tasks

C-GSPN maintains competitive performance on both classification and dense prediction tasks. Through two-stage distillation, C-GSPN achieves 81.3% zero-shot accuracy (vs. 82.2% ViT-Distill baseline) while significantly improving segmentation performance by +2.1% on ADE20K compared to the baseline.

Model Quality Comparison Table

C-GSPN achieves competitive results across classification and dense prediction tasks with 14% fewer parameters than GSPN, demonstrating effective knowledge distillation and architectural efficiency.

Related Links

Our work builds upon the foundation of spatial propagation networks for vision tasks.

Generalized Spatial Propagation Network (GSPN) introduces a linear attention mechanism optimized for multi-dimensional data that directly operates on spatially coherent image data. Our Compact GSPN extends this work by introducing compression techniques for efficient foundation model training.

GSPN-2: Efficient Parallel Sequence Modeling builds upon GSPN with further optimizations for efficient parallel sequence processing.

BibTeX

@article{park2021nerfies,
  author    = {yitong jiang, Collin McCarthy, Hongjun Wang, Hanrong Ye, Qi Dou, Tianfan Xue, Jinwei Gu, Jan Kautz, Hongxu Yin, Pavlo Molchanov, Sifei Liu},
  title     = {Compact GSPN: Scaling Spatial Propagation to Vision Foundation Models},
  journal   = {arXiv},
  year      = {2026},
}