Computer Vision · July 2025

Turning a Photo Into a 3D Scene — Zero-Shot

A single RGB image in, a multi-layer 3D scene out. No training data, no depth sensors. Here's the three-stage pipeline that makes it work at 30fps.

Take a single photo. No depth sensor, no stereo pair, no training on your specific scene. The goal: decompose it into layers at different depths, fill in what's hidden behind foreground objects, and render the result as a navigable 3D scene. This is 2D → 2.5D conversion.

The pipeline has three stages: harmonize the depth map with object boundaries, extract layers, and inpaint the occluded regions. Each stage solves a distinct problem, and together they hit 30fps on consumer GPUs.

Stage 1: Making Depth Respect Object Boundaries

Off-the-shelf depth estimators (Depth Anything) produce smooth depth maps, but the edges are soft — they bleed across object boundaries. Meanwhile, SAM (Segment Anything) gives pixel-perfect masks but no depth information. Stage 1 marries the two.

The harmonized depth map minimizes an energy function with three terms:

  • Data fidelity — stay close to the original depth, especially in mask interiors where confidence is high
  • Intra-mask smoothness — enforce smooth depth within each SAM mask, but allow arbitrary jumps across mask boundaries
  • Boundary contrast — actively encourage depth discontinuities at mask edges

The key: the smoothness term multiplies by an indicator function that's 1 when two adjacent pixels share a mask and 0 when they don't. This single constraint forces depth edges to snap exactly to object boundaries.

The quadratic energy reduces to a sparse linear system Ax = b, where A combines the data-fidelity diagonal with the mask-gated Laplacian. It's solved via preconditioned conjugate gradient on a bilateral grid, converging in 15–30 iterations — 3–5ms on GPU at 1080p.
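The indicator-gated energy can be sketched in miniature. Below is a 1D toy version (the name `harmonize_1d` is made up here; it uses plain conjugate gradient rather than the preconditioned bilateral-grid solver, and a single smoothness weight) showing how the mask indicator lets depth snap at an object boundary while smoothing inside each mask:

```python
import numpy as np

def harmonize_1d(depth, mask_id, lam_smooth=10.0):
    """Minimize  sum_i (x_i - d_i)^2 + lam * sum_adj w_ij (x_i - x_j)^2,
    where w_ij = 1 iff adjacent pixels i, j share a SAM mask (the indicator
    gate). Solves the normal equations A x = b with conjugate gradient."""
    w = (mask_id[:-1] == mask_id[1:]).astype(float)  # indicator per adjacent pair

    def A(x):  # matrix-free application of (I + lam * masked Laplacian)
        y = x.copy()
        diff = w * (x[:-1] - x[1:])
        y[:-1] += lam_smooth * diff
        y[1:] -= lam_smooth * diff
        return y

    # plain conjugate gradient (no preconditioner, for brevity)
    x = depth.astype(float).copy()
    r = depth - A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(100):
        Ap = A(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < 1e-12:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# two objects whose depth edge bleeds across 2 pixels
mask = np.array([0, 0, 0, 0, 1, 1, 1, 1])
soft = np.array([1.0, 1.0, 1.0, 1.2, 2.8, 3.0, 3.0, 3.0])
hard = harmonize_1d(soft, mask)
```

Because the indicator zeroes the smoothness weight across the mask boundary, the solver flattens depth inside each mask and steepens the jump between them — exactly the "snap to object boundaries" behavior described above.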

Stage 2: From Pixels to Layers

With harmonized depth in hand, each SAM mask gets robust depth statistics (median and median absolute deviation, MAD — more resilient to boundary noise than mean/std). Then agglomerative clustering groups masks into layers:

Two masks belong to the same depth layer if their depth distributions overlap significantly — controlled by a statistical overlap factor and a minimum absolute gap. Complete linkage ensures no two masks within a layer are too far apart in depth.
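A minimal sketch of the grouping logic, under two stated assumptions: the overlap test merges masks when the gap between medians falls within the overlap factor times the summed MADs (or under the minimum absolute gap), and a greedy complete-linkage loop stands in for a real agglomerative-clustering library:

```python
import numpy as np

def robust_stats(depths):
    """Median and MAD of a mask's per-pixel depths."""
    med = np.median(depths)
    mad = np.median(np.abs(depths - med))
    return med, mad

def cluster_layers(mask_depths, overlap_k=2.0, min_gap=0.05):
    """Greedy complete-linkage agglomeration: merge two layers only when
    EVERY cross pair of their masks passes the overlap test."""
    stats = [robust_stats(d) for d in mask_depths]

    def compatible(i, j):
        (mi, si), (mj, sj) = stats[i], stats[j]
        return abs(mi - mj) <= max(overlap_k * (si + sj), min_gap)

    layers = [[i] for i in range(len(mask_depths))]
    merged = True
    while merged:
        merged = False
        for a in range(len(layers)):
            for b in range(a + 1, len(layers)):
                if all(compatible(i, j) for i in layers[a] for j in layers[b]):
                    layers[a] += layers.pop(b)
                    merged = True
                    break
            if merged:
                break
    return layers

# three masks: two near the camera at similar depth, one far away
mask_depths = [np.array([1.0, 1.1, 1.05]),
               np.array([1.1, 1.15, 1.2]),
               np.array([3.0, 3.1, 3.05])]
layers = cluster_layers(mask_depths)
```

The complete-linkage check (`all(...)` over cross pairs) is what guarantees no two masks inside a layer drift too far apart in depth.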

Each resulting layer gets per-pixel depth, RGB, alpha, and a sort key for back-to-front compositing. For real-time rendering, layers can be triangulated into meshes (constrained Delaunay, decimated to ~10K triangles per layer) or quantized into a Multi-Plane Image with 32–64 depth planes.
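The back-to-front compositing step itself is a few lines. A sketch using hypothetical layer dicts (`rgb`, `alpha`, and `depth_key` are assumed field names, not the pipeline's actual schema) and the standard "over" operator:

```python
import numpy as np

def composite_back_to_front(layers):
    """Composite RGBA layers far-to-near with the 'over' operator.
    Each layer: {'rgb': (H,W,3), 'alpha': (H,W,1), 'depth_key': float}."""
    ordered = sorted(layers, key=lambda l: l["depth_key"], reverse=True)  # far first
    h, w, _ = ordered[0]["rgb"].shape
    out = np.zeros((h, w, 3))
    for layer in ordered:
        a = layer["alpha"]
        out = layer["rgb"] * a + out * (1.0 - a)  # 'over' blend
    return out

# far layer: opaque red; near layer: green, opaque on the left pixel only
far = {"rgb": np.ones((1, 2, 3)) * [1, 0, 0],
       "alpha": np.ones((1, 2, 1)), "depth_key": 5.0}
near = {"rgb": np.ones((1, 2, 3)) * [0, 1, 0],
        "alpha": np.array([[[1.0], [0.0]]]), "depth_key": 1.0}
out = composite_back_to_front([near, far])  # order-independent input
```

Sorting by the depth key inside the function is what makes the input order irrelevant — the renderer only needs a consistent sort key per layer.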

Stage 3: Filling In What's Hidden

When you lift a foreground object to its depth plane, you expose a hole in the background layer — pixels that were never observed. Both RGB and depth need to be hallucinated.

The fast path uses LaMa (a Fourier convolution inpainter) for RGB and Telea/harmonic PDE for depth — deterministic, 8–15ms total, good enough for real-time. For higher quality on static frames, a 4-step DDIM diffusion refinement produces more coherent textures at ~50–80ms.
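The harmonic-PDE depth fill can be sketched with plain Jacobi iteration — a toy stand-in for the production solver, assuming the hole doesn't touch the image border (the `np.roll` neighbours wrap around):

```python
import numpy as np

def inpaint_depth_harmonic(depth, hole, iters=500):
    """Fill hole pixels by solving Laplace's equation (harmonic
    interpolation): repeatedly replace each hole pixel with the mean of
    its 4 neighbours, holding known pixels fixed as boundary conditions."""
    d = depth.copy()
    d[hole] = depth[~hole].mean()  # neutral initialization
    for _ in range(iters):
        nb = (np.roll(d, 1, 0) + np.roll(d, -1, 0) +
              np.roll(d, 1, 1) + np.roll(d, -1, 1)) / 4.0
        d[hole] = nb[hole]  # Jacobi update on unknowns only
    return d

# a depth ramp with a punched-out hole: harmonic fill recovers the ramp
# exactly, because linear functions are harmonic
depth = np.tile(np.arange(8.0), (8, 1))
hole = np.zeros((8, 8), dtype=bool)
hole[3:5, 3:5] = True
out = inpaint_depth_harmonic(depth, hole)
```

Depth, unlike RGB, is piecewise smooth within a layer, which is why a cheap PDE fill suffices here while RGB needs a learned inpainter.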

The inpainting is embarrassingly parallel across layers — each layer's hole is independent. Launch concurrent GPU kernels or batch all holes into a single padded tensor.
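Packing the holes into one padded batch might look like the sketch below (`batch_holes` is a hypothetical helper; a real pipeline would crop the RGB and depth channels alongside the masks):

```python
import numpy as np

def batch_holes(masks, pad=4):
    """Crop each layer's hole mask to its bounding box (plus context
    padding), then stack the crops into one zero-padded batch tensor for
    a single inpainting forward pass. Returns the batch and crop boxes."""
    crops, boxes = [], []
    for m in masks:
        ys, xs = np.nonzero(m)
        y0 = max(ys.min() - pad, 0)
        y1 = min(ys.max() + 1 + pad, m.shape[0])
        x0 = max(xs.min() - pad, 0)
        x1 = min(xs.max() + 1 + pad, m.shape[1])
        crops.append(m[y0:y1, x0:x1])
        boxes.append((y0, x0, y1, x1))
    h = max(c.shape[0] for c in crops)
    w = max(c.shape[1] for c in crops)
    batch = np.zeros((len(crops), h, w), dtype=masks[0].dtype)
    for i, c in enumerate(crops):
        batch[i, :c.shape[0], :c.shape[1]] = c
    return batch, boxes

# two layers with differently sized holes in 16x16 frames
m1 = np.zeros((16, 16), dtype=bool); m1[2:5, 2:6] = True
m2 = np.zeros((16, 16), dtype=bool); m2[8:10, 8:10] = True
batch, boxes = batch_holes([m1, m2])
```

The returned boxes let you scatter each inpainted crop back into its layer after the batched forward pass.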

Performance Budget at 30fps

Stage                   Algorithm                          GPU latency (1080p)
1. Harmonization        Bilateral solver (PCG)             3–5ms
2. Layer extraction     Agglomerative clustering           1–2ms
3a. Depth inpainting    Telea / harmonic PDE               1–2ms
3b. RGB inpainting      LaMa (fast) or DDIM-4 (quality)    8–15ms / 40–80ms
4. Render               MPI alpha compositing              5–8ms

Fast path total: ~20–30ms — fits in 33ms at 30fps. Quality path (with diffusion): ~60–100ms, suitable for offline or keyframe workflows.

Temporal Coherence for Video

For video, you don't recompute everything per frame. Four optimizations:

  1. Mask tracking — use SAM's prompt-based mode with tracked points (optical flow) instead of full re-segmentation. 5× SAM cost reduction.
  2. Depth temporal filtering — exponential moving average on the harmonized depth, warped by optical flow to align across frames.
  3. Inpainting caching — if a hole region changes by less than 10% (IoU > 0.9 with previous frame), reuse the cached result with flow-based warping. Eliminates ~70% of inpainting calls.
  4. Layer persistence — maintain stable layer assignment across frames, only re-clustering when mask topology changes significantly.
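The caching rule in step 3 can be sketched as a small wrapper, with `inpaint_fn` standing in for the real LaMa/diffusion call and the flow-based warping of cached results omitted:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

class InpaintCache:
    """Reuse the cached inpainting result while the hole mask stays
    stable (IoU with the cached mask above a threshold); recompute only
    when the hole changes substantially."""
    def __init__(self, inpaint_fn, iou_thresh=0.9):
        self.inpaint_fn = inpaint_fn
        self.iou_thresh = iou_thresh
        self.mask = None
        self.result = None
        self.calls = 0  # how many real inpainting calls were made

    def __call__(self, frame, hole):
        if self.mask is not None and iou(hole, self.mask) > self.iou_thresh:
            return self.result  # cache hit (a real system would flow-warp this)
        self.calls += 1
        self.mask = hole.copy()
        self.result = self.inpaint_fn(frame, hole)
        return self.result

frame = np.full((8, 8), 1.0)
hole = np.zeros((8, 8), dtype=bool); hole[2:6, 2:6] = True
cache = InpaintCache(lambda f, h: f.mean(), iou_thresh=0.9)
cache(frame, hole)
cache(frame, hole)  # identical hole: cache hit, no second call
moved = np.zeros((8, 8), dtype=bool); moved[0:3, 0:3] = True
cache(frame, moved)  # low IoU with cached mask: recompute
```

On stable video, hits like the second call above are what eliminate the bulk of inpainting work.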

For throughput, a 3-stage inter-frame pipeline processes harmonization, layer extraction, and inpainting on different frames concurrently — 2 frames of latency (~66ms) but full 30fps throughput.

Hardware Requirements

Target                      Hardware
Real-time (30fps, 1080p)    RTX 4070+ with TensorRT-compiled LaMa, CUDA bilateral solver
Quality (10–15fps)          RTX 4090 / A100 with DDIM-4 diffusion inpainting
Edge (AR glasses, 720p)     Snapdragon XR2 Gen 2 with INT8 quantized models, 20fps

The core insight: depth estimation and segmentation have both crossed the "good enough" threshold as separate models. The real problem isn't predicting depth or finding objects — it's making them agree at the pixel level and filling in what neither model can see. That's a systems problem, not a model problem.