Velox 🚀: Learning Representations of 4D Geometry and Appearance

Anagh Malik1,2, Dorian Chan1, Xiaoming Zhao1, David B. Lindell2, Oncel Tuzel1, Jen-Hao Rick Chang1

1 Apple     2 University of Toronto


We introduce a framework for learning latent representations of 4D objects that are descriptive, faithfully capturing object geometry and appearance; compressive, improving downstream efficiency; and accessible, requiring only minimal input (an unstructured dynamic point cloud) to construct.
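
To make the input format concrete, here is a minimal sketch of what an unstructured dynamic point cloud might look like as a data structure. It is an illustration, not the project's actual loader: the names `PointCloudFrame` and `validate_dynamic_point_cloud` are hypothetical, and it assumes that "unstructured" means each frame may hold a different number of colored points with no correspondence between frames.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # xyz position
Color = Tuple[float, float, float]  # rgb in [0, 1] (assumed; appearance needs some color signal)


@dataclass
class PointCloudFrame:
    """One time step of a dynamic point cloud (hypothetical container)."""
    time: float
    points: List[Point]
    colors: List[Color]

    def __post_init__(self) -> None:
        if len(self.points) != len(self.colors):
            raise ValueError("each point needs exactly one color")


def validate_dynamic_point_cloud(frames: List[PointCloudFrame]) -> int:
    """Check that timestamps strictly increase and return the total point count.

    No per-frame point count is enforced, and no point in one frame is
    assumed to correspond to any point in another frame.
    """
    for prev, cur in zip(frames, frames[1:]):
        if cur.time <= prev.time:
            raise ValueError("frames must be ordered by strictly increasing time")
    return sum(len(f.points) for f in frames)


# Two frames with different point counts: allowed for an unstructured cloud.
frames = [
    PointCloudFrame(
        time=0.0,
        points=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    ),
    PointCloudFrame(
        time=0.1,
        points=[(0.1, 0.0, 0.0), (1.1, 0.0, 0.0), (0.5, 0.5, 0.0)],
        colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    ),
]
total_points = validate_dynamic_point_cloud(frames)
```

The point of the sketch is the absence of structure: no mesh, no fixed point count, and no tracked identities are required to build the input.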


Video-to-4D Gaussians viewer

We train a generative model that maps a monocular input video to our latent space of dynamic tokens, which our decoder then converts into 3D Gaussians. For an interactive look at the Gaussians produced by our video-to-4D model, just click any thumbnail below!
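
For readers unfamiliar with the decoder's output type, the sketch below shows the parameterization commonly used for 3D Gaussians in Gaussian splatting: a mean, per-axis scales, a rotation quaternion, an opacity, and a color, with the covariance built as Σ = R diag(s²) Rᵀ. Whether the Velox decoder emits exactly these fields is an assumption; `Gaussian3D`, `quat_to_matrix`, and `covariance` are illustrative names, not the project's API.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gaussian3D:
    """One 3D Gaussian primitive (assumed, splatting-style parameterization)."""
    mean: Tuple[float, float, float]
    scale: Tuple[float, float, float]            # per-axis standard deviation
    rotation: Tuple[float, float, float, float]  # quaternion (w, x, y, z)
    opacity: float
    color: Tuple[float, float, float]


def quat_to_matrix(q: Tuple[float, float, float, float]) -> List[List[float]]:
    """Convert a (w, x, y, z) quaternion to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize to a unit quaternion
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(g: Gaussian3D) -> List[List[float]]:
    """Build the covariance Sigma = R diag(s^2) R^T of a Gaussian."""
    R = quat_to_matrix(g.rotation)
    s2 = [s * s for s in g.scale]
    return [
        [sum(R[i][k] * s2[k] * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


# A dynamic scene can then be held as one list of Gaussians per time step.
half = math.sqrt(0.5)
frame_gaussians = [
    Gaussian3D((0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0), 0.9, (1.0, 0.0, 0.0)),
    # Same scales, rotated 90 degrees about the z axis.
    Gaussian3D((1.0, 0.0, 0.0), (1.0, 2.0, 3.0), (half, 0.0, 0.0, half), 0.9, (0.0, 1.0, 0.0)),
]
covariances = [covariance(g) for g in frame_gaussians]
```

Storing scale and rotation separately, rather than a raw covariance matrix, keeps the covariance symmetric and positive semi-definite by construction.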


Cloth simulation (image-to-4D) Gaussians viewer

We train a generative model to map an initial cloth position (given as an image) to our latent space of dynamic tokens, essentially solving an image-to-4D problem. Our trained decoder then converts the generated tokens into 3D Gaussians. For an interactive look at the Gaussians produced by our cloth simulation model, just click any thumbnail below!


3D tracking viewer

We train a separate model that, given an input RGBD video encoded into the latent space of dynamic tokens, learns to track query points specified on the first frame through the rest of the video in 3D. Click any thumbnail below for an interactive viewer of the predicted tracks!



Citation

@inproceedings{malik2025velox,
  author    = {Malik, Anagh and Chan, Dorian and Zhao, Xiaoming and Lindell, David B. and Tuzel, Oncel and Chang, Jen-Hao Rick},
  title     = {Velox: Learning Representations of 4D Geometry and Appearance},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}