Take a single photo. No depth sensor, no stereo pair, no training on your specific scene. The goal: decompose it into layers at different depths, fill in what's hidden behind foreground objects, and render the result as a navigable 3D scene. This is 2D → 2.5D conversion.
The pipeline has three stages: harmonize the depth map with object boundaries, extract layers, and inpaint the occluded regions. Each stage solves a distinct problem, and together they hit 30fps on consumer GPUs.
Stage 1: Making Depth Respect Object Boundaries
Off-the-shelf depth estimators (Depth Anything) produce smooth depth maps, but the edges are soft — they bleed across object boundaries. Meanwhile, SAM (Segment Anything) gives pixel-perfect masks but no depth information. Stage 1 marries the two.
The harmonized depth map minimizes an energy function with three terms:
- Data fidelity — stay close to the original depth, especially in mask interiors where confidence is high
- Intra-mask smoothness — enforce smooth depth within each SAM mask, but allow arbitrary jumps across mask boundaries
- Boundary contrast — actively encourage depth discontinuities at mask edges
The key: the smoothness term is multiplied by an indicator function that's 1 when two adjacent pixels share a mask and 0 when they don't. This single constraint forces depth edges to snap exactly to object boundaries.
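To make the three terms concrete, here is one plausible way to write the energy; the weights $\lambda_s$, $\lambda_b$, the confidence $w$, the neighbour set $\mathcal{N}$, and the contrast penalty $C$ are assumed names rather than the pipeline's own. $d_0$ is the raw depth and $M(p)$ is the SAM mask containing pixel $p$:

```latex
E(d) \;=\; \sum_{p} w(p)\,\bigl(d(p)-d_0(p)\bigr)^2
\;+\; \lambda_s \sum_{(p,q)\in\mathcal{N}} \mathbf{1}\!\left[M(p)=M(q)\right]\bigl(d(p)-d(q)\bigr)^2
\;+\; \lambda_b \sum_{(p,q)\in\mathcal{N}} \mathbf{1}\!\left[M(p)\neq M(q)\right] C\bigl(d(p)-d(q)\bigr)
```

A natural choice for $C$ is a margin penalty that charges cross-boundary pairs whose depth jump falls below some threshold, which is one way to "actively encourage" discontinuities.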
Minimizing this energy comes down to a preconditioned conjugate gradient solve on a bilateral grid, effectively a sparse linear system $A\hat{d} = b$ whose coefficients come from the terms above; it converges in 15–30 iterations, 3–5ms on GPU at 1080p.
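A minimal sketch of that solve, under stated assumptions: it builds only the data and intra-mask smoothness terms, works on the flat pixel grid rather than the bilateral grid, and runs Jacobi-preconditioned CG. Names like `harmonize_depth`, `lam`, and the confidence map `w` are illustrative, not from the pipeline itself.

```python
# Sketch: minimize  sum_p w_p (d_p - d0_p)^2 + lam * sum_{p~q, same mask} (d_p - d_q)^2
# which is the SPD linear system (W + lam * L_masked) d = W d0, solved with PCG.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def harmonize_depth(d0, mask_id, w, lam=4.0):
    """d0: raw depth (H,W); mask_id: SAM label per pixel (H,W); w: positive confidence (H,W)."""
    H, Wd = d0.shape
    idx = np.arange(H * Wd).reshape(H, Wd)

    # 4-neighbour edges whose endpoints share a mask (the indicator from the text).
    h_same = mask_id[:, :-1] == mask_id[:, 1:]
    v_same = mask_id[:-1, :] == mask_id[1:, :]
    i = np.concatenate([idx[:, :-1][h_same], idx[:-1, :][v_same]])
    j = np.concatenate([idx[:, 1:][h_same], idx[1:, :][v_same]])

    n = H * Wd
    adj = sp.coo_matrix((np.ones_like(i, dtype=np.float64), (i, j)), shape=(n, n))
    adj = adj + adj.T
    L = sp.diags(np.asarray(adj.sum(axis=1)).ravel()) - adj   # intra-mask graph Laplacian

    Wdiag = sp.diags(w.ravel().astype(np.float64))
    A = Wdiag + lam * L                                       # SPD system matrix
    b = Wdiag @ d0.ravel().astype(np.float64)

    M = sp.diags(1.0 / A.diagonal())                          # Jacobi preconditioner
    d, _ = cg(A, b, x0=d0.ravel().astype(np.float64), M=M, maxiter=30)
    return d.reshape(H, Wd)
```

The real path solves in bilateral-grid space, which is far smaller than the pixel grid.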
Stage 2: From Pixels to Layers
With harmonized depth in hand, each SAM mask gets robust depth statistics (median and MAD — more resilient to boundary noise than mean/std). Then agglomerative clustering groups masks into layers:
Two masks belong to the same depth layer if their depth distributions overlap significantly — controlled by a statistical overlap factor and a minimum absolute gap. Complete linkage ensures no two masks within a layer are too far apart in depth.
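A rough sketch of that grouping, assuming the overlap test compares MAD-scaled ranges around each mask's median; the exact criterion and names like `overlap_factor` and `min_gap` are assumptions.

```python
# Group SAM masks into depth layers with complete-linkage agglomerative clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def layer_masks(depth, masks, overlap_factor=1.5, min_gap=0.05):
    """depth: harmonized depth (H,W); masks: list of boolean (H,W) SAM masks."""
    med = np.array([np.median(depth[m]) for m in masks])
    mad = np.array([np.median(np.abs(depth[m] - np.median(depth[m]))) for m in masks])

    n = len(masks)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            gap = abs(med[i] - med[j])
            # Distance is 0 when the MAD-scaled ranges overlap, otherwise the leftover gap.
            D[i, j] = D[j, i] = max(0.0, gap - overlap_factor * (mad[i] + mad[j]))

    # Complete linkage: no two masks in a layer end up further apart than min_gap.
    Z = linkage(squareform(D), method="complete")
    return fcluster(Z, t=min_gap, criterion="distance")  # layer index per mask
```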
Each resulting layer gets per-pixel depth, RGB, alpha, and a sort key for back-to-front compositing. For real-time rendering, layers can be triangulated into meshes (constrained Delaunay, decimated to ~10K triangles per layer) or quantized into a Multi-Plane Image with 32–64 depth planes.
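One plausible data layout for an extracted layer, with field names assumed rather than taken from the pipeline; either the mesh path or the MPI path can be fed from a record like this.

```python
# Illustrative container for one depth layer; field names are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class DepthLayer:
    rgb: np.ndarray       # (H, W, 3) color, inpainted where the layer was occluded
    alpha: np.ndarray     # (H, W) soft matte from the union of the layer's masks
    depth: np.ndarray     # (H, W) per-pixel harmonized depth
    sort_key: float       # e.g. median depth, used for back-to-front ordering
```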
Stage 3: Filling In What's Hidden
When you lift a foreground object to its depth plane, you expose a hole in the background layer — pixels that were never observed. Both RGB and depth need to be hallucinated.
The fast path uses LaMa (a Fourier convolution inpainter) for RGB and Telea/harmonic PDE for depth — deterministic, 8–15ms total, good enough for real-time. For higher quality on static frames, a 4-step DDIM diffusion refinement produces more coherent textures at ~50–80ms.
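The depth side of the fast path can be sketched with OpenCV's Telea inpainting; the normalization round-trip is an assumption of this sketch, and the LaMa RGB call is omitted since it's a learned model.

```python
# Fast-path hole filling for the depth channel using OpenCV's Telea inpainting.
import cv2
import numpy as np

def inpaint_depth(depth, hole_mask, radius=5):
    """depth: float32 (H, W); hole_mask: uint8 (H, W), 255 where depth is unknown."""
    d_min, d_max = float(depth.min()), float(depth.max())
    # cv2.inpaint works on 8-bit images, so round-trip through a normalized copy.
    d8 = np.uint8(255 * (depth - d_min) / max(d_max - d_min, 1e-6))
    filled = cv2.inpaint(d8, hole_mask, radius, cv2.INPAINT_TELEA)
    return filled.astype(np.float32) / 255.0 * (d_max - d_min) + d_min
```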
The inpainting is embarrassingly parallel across layers — each layer's hole is independent. Launch concurrent GPU kernels or batch all holes into a single padded tensor.
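A sketch of the batched variant, assuming an inpainting network with an `(image, mask) -> image` interface; the name `inpaint_net` and the fixed padding size are placeholders.

```python
# Batch every layer's hole into one padded tensor so a single forward pass covers them all.
import torch
import torch.nn.functional as F

def batch_inpaint(layers_rgb, layers_mask, inpaint_net, pad_to=512):
    """layers_rgb: list of (3,H,W) tensors; layers_mask: list of (1,H,W) hole masks."""
    imgs, masks = [], []
    for rgb, m in zip(layers_rgb, layers_mask):
        _, h, w = rgb.shape
        # Right/bottom zero-padding to a common square size keeps one batch shape.
        pad = (0, pad_to - w, 0, pad_to - h)
        imgs.append(F.pad(rgb, pad))
        masks.append(F.pad(m, pad))
    batch_img = torch.stack(imgs)    # (N, 3, pad_to, pad_to)
    batch_mask = torch.stack(masks)  # (N, 1, pad_to, pad_to)

    with torch.no_grad():
        out = inpaint_net(batch_img, batch_mask)

    # Crop each result back to its original layer size.
    return [out[k, :, :rgb.shape[1], :rgb.shape[2]]
            for k, rgb in enumerate(layers_rgb)]
```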
Performance Budget at 30fps
| Stage | Algorithm | GPU latency (1080p) |
|---|---|---|
| 1. Harmonization | Bilateral solver (PCG) | 3–5ms |
| 2. Layer extraction | Agglomerative clustering | 1–2ms |
| 3a. Depth inpainting | Telea / harmonic PDE | 1–2ms |
| 3b. RGB inpainting | LaMa (fast) or DDIM-4 (quality) | 8–15ms / 40–80ms |
| 4. Render | MPI alpha compositing | 5–8ms |
Fast path total: ~20–30ms — fits in 33ms at 30fps. Quality path (with diffusion): ~60–100ms, suitable for offline or keyframe workflows.
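For reference, the render row's alpha compositing is the standard back-to-front "over" operation; this sketch assumes premultiplied alpha and uses each plane's depth key for ordering.

```python
# Back-to-front "over" compositing of layers / MPI planes.
import numpy as np

def composite_planes(planes):
    """planes: list of (rgb (H,W,3), alpha (H,W), depth_key) tuples."""
    ordered = sorted(planes, key=lambda p: p[2], reverse=True)  # far -> near
    out = np.zeros_like(ordered[0][0])
    for rgb, alpha, _ in ordered:
        a = alpha[..., None]
        out = rgb * a + out * (1.0 - a)   # standard alpha-over
    return out
```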
Temporal Coherence for Video
For video, you don't recompute everything per frame. Four optimizations:
- Mask tracking — use SAM's prompt-based mode with tracked points (optical flow) instead of full re-segmentation. 5× SAM cost reduction.
- Depth temporal filtering — exponential moving average on the harmonized depth, warped by optical flow to align across frames.
- Inpainting caching — if a hole region changes by less than 10% (IoU > 0.9 with previous frame), reuse the cached result with flow-based warping (see the sketch after this list). Eliminates ~70% of inpainting calls.
- Layer persistence — maintain stable layer assignment across frames, only re-clustering when mask topology changes significantly.
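The caching bullet above, sketched with an IoU gate and flow-based reuse; the cv2.remap warp and the cache dict layout are assumptions of this sketch.

```python
# Reuse the previous frame's inpainted fill when the hole barely changed.
import cv2
import numpy as np

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / max(union, 1)

def warp_with_flow(img, flow):
    """Backward-warp img by a dense flow field (H, W, 2)."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)

def fill_hole(rgb, hole_mask, flow, cache, inpaint_fn):
    """cache: dict with 'mask' and 'fill' from the previous frame, or None."""
    if cache is not None and iou(hole_mask, cache["mask"]) > 0.9:
        fill = warp_with_flow(cache["fill"], flow)   # cheap reuse path
    else:
        fill = inpaint_fn(rgb, hole_mask)            # full inpainting call
    return fill, {"mask": hole_mask, "fill": fill}
```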
For throughput, a 3-stage inter-frame pipeline processes harmonization, layer extraction, and inpainting on different frames concurrently — 2 frames of latency (~66ms) but full 30fps throughput.
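A toy version of that staggering, with one worker per stage; the thread-and-queue plumbing is illustrative only, and a real-time build would more likely overlap stages with CUDA streams.

```python
# While frame N is being inpainted, frame N+1 is in layer extraction
# and frame N+2 in harmonization.
import threading
import queue

def stage_worker(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:          # shutdown sentinel, pass it downstream
            q_out.put(None)
            return
        q_out.put(fn(item))

def run_pipeline(frames, harmonize, extract_layers, inpaint):
    # Unbounded queues keep the sketch deadlock-free; a real loop would bound them
    # and consume results while still producing frames.
    q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
    stages = [(harmonize, q0, q1), (extract_layers, q1, q2), (inpaint, q2, q3)]
    for s in stages:
        threading.Thread(target=stage_worker, args=s, daemon=True).start()

    for f in frames:
        q0.put(f)
    q0.put(None)

    results = []
    while (out := q3.get()) is not None:
        results.append(out)
    return results
```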
Hardware Requirements
| Target | Hardware |
|---|---|
| Real-time (30fps, 1080p) | RTX 4070+ with TensorRT-compiled LaMa, CUDA bilateral solver |
| Quality (10–15fps) | RTX 4090 / A100 with DDIM-4 diffusion inpainting |
| Edge (20fps, 720p, AR glasses) | Snapdragon XR2 Gen 2 with INT8 quantized models |
The core insight: depth estimation and segmentation have both crossed the "good enough" threshold as separate models. The real problem isn't predicting depth or finding objects — it's making them agree at the pixel level and filling in what neither model can see. That's a systems problem, not a model problem.