Diffusion Priors for Dynamic View Synthesis from Monocular Videos

Snap Inc., KAUST

Abstract

Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos. Existing methods struggle to distinguish between motion and structure, particularly in scenarios where camera poses are either unknown or constrained compared to object motion. Furthermore, with information solely from reference images, it is extremely challenging to hallucinate regions that are occluded or only partially observed in the given videos. To address these issues, we first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique. Subsequently, we distill the knowledge from the finetuned model to a 4D representation encompassing both dynamic and static Neural Radiance Field (NeRF) components. The proposed pipeline achieves geometric consistency while preserving the scene identity. We perform thorough experiments to evaluate the efficacy of the proposed method qualitatively and quantitatively. Our results demonstrate the robustness and utility of our approach in challenging cases, further advancing dynamic novel view synthesis.

Method

To perform dynamic novel view synthesis given a video, we adopt a 4D representation consisting of dynamic and static parts. We use two types of supervision. First, we render the input viewpoints at the input times to reconstruct the reference frames. Second, we distill the prior knowledge of a pretrained RGB-D diffusion model on random novel views using score distillation sampling. Finally, to mitigate the domain gap between the diffusion model's training distribution and in-the-wild images, we finetune the RGB-D diffusion model on the reference images with a customization technique prior to distillation, as sketched below.
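To make the two supervision signals concrete, the following is a minimal PyTorch-style sketch of a single training step, assuming a hypothetical 4D renderer (dynamic plus static NeRF) and a hypothetical customized RGB-D diffusion prior. All names here (FourDRepresentation, RGBDDiffusionPrior, sds_grad, lambda_sds) are illustrative placeholders, not the authors' implementation.

```python
# Sketch: reconstruction loss on input views + SDS loss on random novel views.
# Placeholder modules stand in for the real NeRF renderer and diffusion prior.
import torch
import torch.nn as nn


class FourDRepresentation(nn.Module):
    """Placeholder for the 4D representation (dynamic + static NeRF parts)."""
    def __init__(self):
        super().__init__()
        self.dummy = nn.Parameter(torch.zeros(1))  # stands in for NeRF weights

    def render(self, camera, t):
        # Returns an RGB-D image of shape (B, 4, H, W) for a camera pose at time t.
        h = w = 64
        return torch.rand(1, 4, h, w) + self.dummy  # fake render, keeps gradients


class RGBDDiffusionPrior(nn.Module):
    """Placeholder for the customized (finetuned) RGB-D diffusion model."""
    def predict_noise(self, noisy_rgbd, t):
        return torch.randn_like(noisy_rgbd)  # stands in for the denoiser output


def sds_grad(prior, rgbd, alphas_cumprod):
    """Score distillation sampling: add noise to the render, let the prior
    predict it, and use (predicted noise - injected noise) as a gradient."""
    t = torch.randint(1, len(alphas_cumprod), (1,))
    alpha_bar = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(rgbd)
    noisy = alpha_bar.sqrt() * rgbd + (1 - alpha_bar).sqrt() * noise
    with torch.no_grad():
        pred = prior.predict_noise(noisy, t)
    return pred - noise  # timestep weighting omitted for brevity


def training_step(model, prior, ref_rgbd, ref_camera, ref_time,
                  novel_camera, novel_time, alphas_cumprod, lambda_sds=0.1):
    # 1) Reconstruction supervision: render the input viewpoint at the input time.
    rendered_ref = model.render(ref_camera, ref_time)
    recon_loss = nn.functional.mse_loss(rendered_ref, ref_rgbd)

    # 2) Distillation supervision: SDS on a random novel view.
    rendered_novel = model.render(novel_camera, novel_time)
    grad = sds_grad(prior, rendered_novel, alphas_cumprod)
    # Surrogate loss whose gradient w.r.t. the render equals `grad`.
    sds_loss = (grad.detach() * rendered_novel).sum()

    return recon_loss + lambda_sds * sds_loss
```

The surrogate term `(grad.detach() * rendered_novel).sum()` is the standard trick for injecting an SDS gradient through a rendered image; the relative weight between reconstruction and distillation (here `lambda_sds`) is an assumed hyperparameter.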


Figure: Stabilized-view and bullet-time comparisons of T-NeRF, Nerfies, HyperNeRF, and DpDy (Ours).