Overall architecture: we introduce a novel 4D VAE that operates directly in native 4D space, that is, a dynamic colored voxel space, without 2D projection. This preserves explicit spatio-temporal coordinates throughout the learned encoder and decoder, enabling the encoding of both partial and complete 4D content. To support a flexible temporal compression ratio, we also design a spatio-temporal window attention module that performs attention within local 4D windows. Additionally, we propose a differentiable voxel rendering loss based on sparse voxel rasterization to improve geometry and color reconstruction quality.
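To make the idea of local 4D windows concrete, the sketch below groups active voxel coordinates `(t, x, y, z)` by the window they fall into, which is the partitioning step that would precede per-window attention. It is an illustration only, not the paper's implementation: the function name `partition_4d_windows`, the window size `(2, 4, 4, 4)`, and the toy coordinates are all assumptions made for this example.

```python
from collections import defaultdict


def partition_4d_windows(coords, window=(2, 4, 4, 4)):
    """Group (t, x, y, z) voxel coordinates by local 4D window.

    `window` is a hypothetical (wt, wx, wy, wz) window size. Returns a dict
    mapping each window index to the list of voxel indices inside it.
    """
    wt, wx, wy, wz = window
    groups = defaultdict(list)
    for idx, (t, x, y, z) in enumerate(coords):
        # Integer division maps a coordinate to the window that contains it.
        key = (t // wt, x // wx, y // wy, z // wz)
        groups[key].append(idx)
    return dict(groups)


# Four active voxels of a toy sparse 4D grid (assumed data).
coords = [(0, 0, 0, 0), (1, 3, 3, 3), (2, 0, 0, 0), (0, 5, 0, 0)]
windows = partition_4d_windows(coords)
for key, members in sorted(windows.items()):
    print(key, members)
```

In a scheme like this, attention would be computed only among the voxels that share a window key, so the cost scales with window occupancy rather than with the full 4D volume. Voxels 0 and 1 share a window even though they lie in different frames, because the assumed temporal extent of the window is 2.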
@inproceedings{
ding2026native,
title={Native Spatio-Temporal 4D Variational Autoencoder},
author={Ding, Lihe and Ye, Weicai and Dong, Shaocong and Wang, Xintao and Wan, Pengfei and Gai, Kun and Xue, Tianfan},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
}