FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

ICCV 2025

Netflix Eyeline Studios, Cornell University, Stanford University

FlashDepth enables high-resolution, real-time video depth estimation. It captures fine details like fur, hair, and poles. We provide additional snapshots in the paper (Figs. 2 and 4).

Abstract

A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044×1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics.

Method

Our method consists of two main components: (1) a temporal module that enforces consistency across the video sequence, implemented using Mamba, and (2) a hybrid setup that combines the speed of a smaller model with the accuracy of a larger model. See our paper and code for details.


Enforcing Temporal Consistency

Given consecutive input frames (frames T and T+1), a lightweight Mamba module aligns their features to the same scale for temporal consistency. The other components (ViT encoder, DPT decoder) are based on Depth Anything V2 and process each frame independently.
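Below is a minimal PyTorch sketch of the streaming idea, not the released implementation: per-frame features from the single-image backbone pass through a lightweight recurrent block that carries state from frame to frame and nudges each frame's features toward a consistent scale. The paper uses a Mamba layer for this block; the GRU cell here is only a stand-in, and the class and parameter names (StreamingScaleAligner, feat_dim, hidden_dim) are hypothetical.

# Minimal sketch of the streaming alignment idea (not the released implementation).
# A GRU cell stands in for the paper's Mamba layer; names and dimensions are hypothetical.
import torch
import torch.nn as nn


class StreamingScaleAligner(nn.Module):
    """Carries a hidden state across frames so per-frame depth features share a scale."""

    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.temporal = nn.GRUCell(feat_dim, hidden_dim)            # stand-in for the Mamba block
        self.to_scale_shift = nn.Linear(hidden_dim, 2 * feat_dim)   # per-frame correction

    def forward(self, frame_feats, state=None):
        # frame_feats: (B, C, H, W) features for the current frame from the image backbone.
        pooled = frame_feats.mean(dim=(2, 3))                       # (B, C) global summary
        state = self.temporal(pooled, state)                        # update the streaming state
        scale, shift = self.to_scale_shift(state).chunk(2, dim=-1)
        aligned = frame_feats * (1 + scale)[..., None, None] + shift[..., None, None]
        return aligned, state                                       # reuse `state` for frame T+1


# Usage: iterate over the stream, passing the state from one frame to the next.
aligner = StreamingScaleAligner()
state = None
for _ in range(3):                                                  # stand-in for a video stream
    feats = torch.randn(1, 256, 64, 64)                             # per-frame DPT-style features
    feats, state = aligner(feats, state)

Because the state is updated once per incoming frame and past frames are never revisited, the per-frame cost stays constant with video length, which is what makes a streaming setting feasible.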


Hybrid Model for Efficiency at High-Resolution

A smaller model (Depth Anything V2 Small) ensures real-time inference at 2K resolution, while a larger model (Depth Anything V2 Large) provides accurate and robust features at lower resolution.
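The sketch below illustrates this hybrid routing under stated assumptions (it is not the released code): the full-resolution frame goes through a small, fast path, a downsampled copy goes through a larger path, and the larger path's features are upsampled and fused into the fast path before decoding. HybridDepth, the single-convolution "encoders", and the fusion layer are hypothetical stand-ins for Depth Anything V2 Small/Large and the paper's fusion mechanism.

# Minimal sketch of the hybrid routing (assumptions only, not the released code).
# Single convolutions stand in for Depth Anything V2 Small/Large; the fusion layer is hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridDepth(nn.Module):
    def __init__(self, small_dim=128, large_dim=256):
        super().__init__()
        self.small = nn.Conv2d(3, small_dim, 3, stride=2, padding=1)  # fast path, full 2K frame
        self.large = nn.Conv2d(3, large_dim, 3, stride=2, padding=1)  # accurate path, downsampled frame
        self.fuse = nn.Conv2d(small_dim + large_dim, small_dim, 1)    # lightweight feature fusion
        self.head = nn.Conv2d(small_dim, 1, 1)                        # depth prediction head

    def forward(self, frame_2k):
        f_small = self.small(frame_2k)                                # features at full resolution
        low = F.interpolate(frame_2k, scale_factor=0.25, mode="bilinear", align_corners=False)
        f_large = self.large(low)                                     # robust features at low resolution
        f_large = F.interpolate(f_large, size=f_small.shape[-2:], mode="bilinear", align_corners=False)
        fused = self.fuse(torch.cat([f_small, f_large], dim=1))       # inject large-model features
        depth = self.head(fused)
        return F.interpolate(depth, size=frame_2k.shape[-2:], mode="bilinear", align_corners=False)


model = HybridDepth()
pred = model(torch.randn(1, 3, 1148, 2044))   # a 2044×1148 frame (H=1148, W=2044)

The trade-off is that only the cheap path ever sees full-resolution pixels: boundary detail comes from the 2K branch, while accuracy and robustness come from the low-resolution branch.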

Qualitative Comparisons

(Videos are downsampled for faster loading.)
DepthCrafter is the most consistent method on in-the-wild videos, but it runs at 2.1 FPS at 1024×576 resolution and requires optimizing 110 images at once.
CUT3R supports streaming inputs, but only at 14 FPS and 512×288 resolution, and it produces blurry depth.
In contrast, FlashDepth achieves 24 FPS at 2K resolution while remaining competitive with offline methods.

Quantitative Comparisons

FlashDepth achieves competitive accuracy and long-range temporal consistency compared to offline methods, while running faster and at higher resolution. See the paper for more details.

Quantitative Result 1: Absolute relative error and accuracy.
Quantitative Result 2: Boundary sharpness and FPS / resolution.
Quantitative Result 3: Temporal consistency on the Waymo test sets.

Applications

On-Set Live Previews

FlashDepth has been integrated into Eyeline Studios' production stages to support live video effects. In this example, it segments the actors, chairs, and set pieces so they can be composited with the virtual background in near real time. This allows filmmakers to preview how actors fit within the virtual scene without relying on post-production, and makes it easier to get the shot right.
Left screen: Two actors in front of LED screens whose brightness and intensity can change based on the position and depth of the subjects, for special effects such as relighting.
Right screen: Real-time depth estimation. Effects, in the order shown: 1) depth estimation; 2) depth slicing (colorization) to carve out depth ranges of the set (specifically, we interactively adjust the depth threshold to highlight the actors and props within the same, green-coded range); 3) segmentation based on the depth threshold.
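As a concrete illustration of the depth-slicing and segmentation steps above, here is a minimal NumPy sketch (the on-set pipeline itself is not public; depth_slice and the chosen near/far values are hypothetical): pixels whose depth falls inside an interactively chosen range are highlighted in green and turned into a binary mask.

# Minimal sketch of depth slicing and threshold-based segmentation (hypothetical helper).
import numpy as np


def depth_slice(depth, near, far):
    """Return a mask of pixels whose depth falls in [near, far] and a color overlay."""
    mask = (depth >= near) & (depth <= far)
    overlay = np.zeros((*depth.shape, 3), dtype=np.uint8)
    overlay[mask] = (0, 255, 0)            # highlight the selected depth range in green
    return mask, overlay


# Interactive use: nudge `near`/`far` until the actors and props sit inside the slice.
depth = np.random.rand(1148, 2044).astype(np.float32)   # stand-in for a FlashDepth output
mask, overlay = depth_slice(depth, near=0.3, far=0.6)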


Depth-based Visual Effects

FlashDepth is sufficiently accurate to enable multiple visual effects, such as depth-based composition, unprojection (novel view synthesis), and relighting.
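To make "unprojection" concrete, here is a minimal NumPy sketch under an assumed pinhole camera model (fx, fy, cx, cy are illustrative intrinsics, not values from the paper): each pixel is lifted to a 3D point using its depth, which is the starting point for effects like novel view synthesis and relighting.

# Minimal sketch of depth-based unprojection with assumed pinhole intrinsics.
import numpy as np


def unproject(depth, fx, fy, cx, cy):
    """Convert an HxW depth map into an HxWx3 point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


depth = np.random.rand(1148, 2044).astype(np.float32)           # stand-in for a FlashDepth output
points = unproject(depth, fx=1600.0, fy=1600.0, cx=1022.0, cy=574.0)   # illustrative intrinsics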

Limitations and Future Work

While FlashDepth achieves competitive quantitative results, we observe noticeable flickering when testing on in-the-wild videos, especially when compared to diffusion-based methods like DepthCrafter. These artifacts likely stem from lighting fluctuations, small specks, or other minor pixel-level changes introduced by outdoor filming. We believe more diverse training data will improve temporal stability. Additionally, we did not extensively tune Mamba, which was primarily designed for language-based tasks; further tuning may yield better results.

Acknowledgements

Gene Chou was supported by an NSF graduate fellowship (2139899). We thank Chi-Chih Chang for discussions on Mamba and efficiency; Jennifer Lao, Tarik Thompson, and Daniel Heckenberg for their operational support; Nhat Phong Tran, Miles Lauridsen, Oliver Hermann, Oliver Walter, and David Shorey for helping integrate FlashDepth into Eyeline's internal VFX pipeline.

References

Our code is modified from and heavily borrows from the following projects:

Depth Anything V2: Our base models for training.
Mamba 2: Our module for temporal consistency.

BibTeX


      @inproceedings{chou2025flashdepth,
        title     = {FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution},
        author    = {Chou, Gene and Xian, Wenqi and Yang, Guandao and Abdelfattah, Mohamed and Hariharan, Bharath and Snavely, Noah and Yu, Ning and Debevec, Paul},
        booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
        year      = {2025},
      }