Native Spatio-Temporal 4D Variational AutoEncoder

ICML 2026

¹CUHK MMLab, ²Kuaishou Technology, ³HKUST, ⁴CPII under InnoHK

Method

Overall architecture. We introduce a novel 4D VAE that operates directly in native 4D space, i.e., a dynamic colored voxel space, without projecting to 2D. This preserves explicit spatio-temporal coordinates throughout the learned encoder and decoder, and allows both partial and complete 4D content to be encoded. To support a flexible temporal compression ratio, we design a spatio-temporal window attention module that performs attention within local 4D windows. We also propose a differentiable voxel rendering loss, based on sparse voxel rasterization, that improves geometry and color reconstruction quality.
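To make the idea of attention within local 4D windows concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names (`window_ids`, `window_attention`), the window size `(2, 4, 4, 4)`, and the use of identity projections for queries, keys, and values are all assumptions made for illustration. Voxels are given as sparse `(t, x, y, z)` coordinates, and each voxel attends only to voxels that fall in the same window.

```python
import numpy as np

def window_ids(coords, window=(2, 4, 4, 4)):
    """Map each (t, x, y, z) voxel coordinate to an integer id of its 4D window."""
    coords = np.asarray(coords, dtype=np.int64)
    win = np.asarray(window, dtype=np.int64)
    cells = coords // win  # window index along each of the four axes
    # Collapse identical 4D window indices into one integer id per voxel.
    _, ids = np.unique(cells, axis=0, return_inverse=True)
    return ids.reshape(-1)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(feats, coords, window=(2, 4, 4, 4)):
    """Single-head self-attention restricted to voxels sharing a 4D window.

    Assumption: Q, K, and V are the raw features (identity projections);
    a real module would use learned linear projections.
    """
    feats = np.asarray(feats, dtype=np.float64)
    ids = window_ids(coords, window)
    out = np.empty_like(feats)
    scale = 1.0 / np.sqrt(feats.shape[1])
    for w in np.unique(ids):
        idx = np.nonzero(ids == w)[0]      # voxels belonging to this window
        x = feats[idx]                     # (n_w, C)
        attn = softmax(x @ x.T * scale)    # (n_w, n_w) attention weights
        out[idx] = attn @ x
    return out

# Three voxels: the first two share a window, the third lies in another one.
coords = [[0, 0, 0, 0], [1, 3, 3, 3], [0, 4, 0, 0]]
feats = np.eye(3)
out = window_attention(feats, coords)
```

In this sketch the first entry of `window` is the temporal extent, so changing it alters how many frames a window spans while leaving the spatial extent untouched. This is one plausible reading of how local 4D windows can accommodate different temporal settings, not a description of the authors' module.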

BibTeX

@inproceedings{ding2026native,
  title={Native Spatio-Temporal 4D Variational Autoencoder},
  author={Ding, Lihe and Ye, Weicai and Dong, Shaocong and Wang, Xintao and Wan, Pengfei and Gai, Kun and Xue, Tianfan},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
}