A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044×1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics.
Our method consists of two main components: (1) a temporal module that enforces consistency across the video sequence, implemented using Mamba, and (2) a hybrid setup that combines the speed of a smaller model with the accuracy of a larger model. See our paper and code for details.
Given input frames (Frame T, T+1), a lightweight Mamba model aligns their features to the same scale for temporal consistency. The other components (ViT encoder, DPT decoder) are based on Depth Anything V2 and process each frame independently.
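To make the temporal module concrete, here is a minimal PyTorch sketch of how a Mamba block could be wrapped around per-frame encoder tokens so that state carries across frames. The class name `TemporalAligner`, the feature shapes, and the residual design are illustrative assumptions, not our released module; it also assumes a CUDA build of the `mamba_ssm` package. See our code for the actual implementation.

```python
# Schematic sketch only: wraps a Mamba-2 block over per-frame encoder tokens.
# Assumptions: the `mamba_ssm` package with CUDA kernels, token dim 256,
# a 32x32 token grid per frame, and a simple residual update.
import torch
import torch.nn as nn
from mamba_ssm import Mamba2


class TemporalAligner(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mamba = Mamba2(d_model=dim)  # lightweight sequence model

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, N, C) per-frame token features from the encoder,
        # with T frames and N spatial tokens per frame.
        B, T, N, C = feats.shape
        x = feats.reshape(B, T * N, C)    # scan tokens in frame order
        x = x + self.mamba(self.norm(x))  # residual update for stability
        return x.reshape(B, T, N, C)


# Toy usage: align features of two consecutive frames before the DPT decoder.
# Requires a GPU because mamba_ssm's kernels are CUDA/Triton based.
feats = torch.randn(1, 2, 32 * 32, 256, device="cuda")
aligned = TemporalAligner(dim=256).cuda()(feats)
print(aligned.shape)  # torch.Size([1, 2, 1024, 256])
```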
A smaller model (Depth Anything V2 Small) ensures real-time inference at 2K resolution, while a larger model (Depth Anything V2 Large) provides accurate and robust features at lower resolution.
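As a rough illustration of the hybrid setup, the sketch below runs a large encoder on a downsampled copy of each frame and a small encoder at full resolution, then fuses the two feature streams before a DPT-style decoder. The 1×1-convolution fusion, module names, and shapes are assumptions for illustration, not the exact FlashDepth architecture or the Depth Anything V2 encoders.

```python
# Illustrative hybrid sketch: fast small encoder at full resolution plus an
# accurate large encoder at reduced resolution, fused before decoding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridDepth(nn.Module):
    def __init__(self, small_encoder, large_encoder, decoder,
                 c_small=64, c_large=256):
        super().__init__()
        self.small = small_encoder   # runs on every full-resolution frame
        self.large = large_encoder   # runs on a downsampled copy of the frame
        self.fuse = nn.Conv2d(c_small + c_large, c_small, kernel_size=1)
        self.decoder = decoder       # maps fused features to a depth map

    def forward(self, frame):
        # frame: (B, 3, H, W) at full resolution, e.g. 2044x1148
        lowres = F.interpolate(frame, scale_factor=0.25, mode="bilinear",
                               align_corners=False)
        f_small = self.small(frame)   # (B, c_small, h, w)
        f_large = self.large(lowres)  # (B, c_large, h', w')
        f_large = F.interpolate(f_large, size=f_small.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.decoder(self.fuse(torch.cat([f_small, f_large], dim=1)))


# Toy usage with stand-in conv encoders/decoder (not Depth Anything V2).
small = nn.Conv2d(3, 64, 3, stride=4, padding=1)
large = nn.Conv2d(3, 256, 3, stride=4, padding=1)
decoder = nn.Sequential(nn.Conv2d(64, 1, 1), nn.Upsample(scale_factor=4))
model = HybridDepth(small, large, decoder)
depth = model(torch.randn(1, 3, 224, 224))
print(depth.shape)  # torch.Size([1, 1, 224, 224])
```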
FlashDepth achieves competitive accuracy and long-range temporal consistency compared to offline methods, while running faster and at higher resolution. See the paper for more details.
While FlashDepth achieves competitive quantitative results, we observe noticeable flickering when testing on in-the-wild videos, especially compared to diffusion-based methods like DepthCrafter. These artifacts likely stem from lighting fluctuations in outdoor footage, such as small specks or minor pixel-level changes between frames. We believe more diverse training data would improve temporal stability. Additionally, we did not extensively tune Mamba, which was primarily designed for language tasks; further tuning may yield better results.
Gene Chou was supported by an NSF graduate fellowship (2139899). We thank Chi-Chih Chang for discussions on Mamba and efficiency; Jennifer Lao, Tarik Thompson, and Daniel Heckenberg for their operational support; Nhat Phong Tran, Miles Lauridsen, Oliver Hermann, Oliver Walter, and David Shorey for helping integrate FlashDepth into Eyeline's internal VFX pipeline.
Our code is adapted from and borrows heavily from the following projects:
Depth Anything V2: Our base models for training.
Mamba 2: Our module for temporal consistency.
@inproceedings{chou2025flashdepth,
  title     = {FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution},
  author    = {Chou, Gene and Xian, Wenqi and Yang, Guandao and Abdelfattah, Mohamed and Hariharan, Bharath and Snavely, Noah and Yu, Ning and Debevec, Paul},
  booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
  year      = {2025},
}