VistaDream

Sampling multiview consistent images for single-view scene reconstruction

Haiping Wang¹, Yuan Liu²,³,†, Ziwei Liu³, Wenping Wang⁴, Zhen Dong¹,†, Bisheng Yang¹

¹Wuhan University, ²Hong Kong University of Science and Technology, ³Nanyang Technological University, ⁴Texas A&M University

Scene Reconstruction from a Single Image

RGB Renderings (Left) & Depth Renderings (Right) of the Scene reconstructed from a Single Image (below)


    


Abstract


In this paper, we propose VistaDream, a novel framework for reconstructing a 3D scene from a single-view image. Recent diffusion models can generate high-quality novel-view images from a single input image. However, most existing methods concentrate only on consistency between the input image and the generated images, and lose consistency among the generated images themselves. VistaDream addresses this problem with a two-stage pipeline. In the first stage, VistaDream builds a global coarse 3D scaffold by zooming out slightly from the input view, using outpainted boundaries and an estimated depth map. On this global scaffold, we then apply iterative diffusion-based RGB-D inpainting to generate novel-view images that fill the holes of the scaffold. In the second stage, we further enhance the consistency between the generated novel-view images with a novel training-free Multi-view Consistency Sampling (MCS), which introduces multi-view consistency constraints into the reverse sampling process of diffusion models. Experimental results demonstrate that, without training or fine-tuning existing diffusion models, VistaDream achieves consistent and high-quality novel view synthesis from just a single-view image and outperforms baseline methods by a large margin.

Coarse Gaussian Field Refinement

Coarse Scene (Left) vs. Refined Scene (Right)


    


Scene Refinement with SDS or MCS

Score Distillation Sampling (Left) vs. Multi-view Consistency Sampling (Right)

* Without scene refinement on the input (and outpainted) image, for a better and fairer comparison.


    


LLaVA - Image Quality Assessment (LLaVA-IQA)

We ask LLaVA the following questions about each rendered multiview image and report the ratio of "YES" answers:

[Noise-Free]: Is the image free of noise or distortion?
[Edge]: Does the image show clear objects and sharp edges?
[Detail]: Does this image show detailed textures and materials?
[Structure]: Is the overall scene coherent and realistic in terms of layout and proportions in this image?
[Quality]: Is this image overall a high-quality image with clear objects, sharp edges, nice color, good overall structure, and good visual quality?
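
The scoring described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the `ask` callable is a hypothetical stand-in for whatever LLaVA inference wrapper is used, and the rule that an answer counts as "YES" when it starts with "yes" (case-insensitive) is an assumption.

```python
from typing import Callable, Dict, Iterable, TypeVar

Image = TypeVar("Image")

# The five LLaVA-IQA questions, keyed by the metric names used on this page.
QUESTIONS: Dict[str, str] = {
    "Noise-Free": "Is the image free of noise or distortion?",
    "Edge": "Does the image show clear objects and sharp edges?",
    "Detail": "Does this image show detailed textures and materials?",
    "Structure": (
        "Is the overall scene coherent and realistic in terms of layout "
        "and proportions in this image?"
    ),
    "Quality": (
        "Is this image overall a high-quality image with clear objects, "
        "sharp edges, nice color, good overall structure, and good visual quality?"
    ),
}


def is_yes(answer: str) -> bool:
    # Assumption: any answer beginning with "yes" counts as a YES.
    return answer.strip().lower().startswith("yes")


def llava_iqa(
    images: Iterable[Image],
    ask: Callable[[Image, str], str],
) -> Dict[str, float]:
    """Return, per question, the fraction of images answered with YES.

    `ask(image, question)` is a hypothetical wrapper that queries a
    vision-language model and returns its text answer.
    """
    image_list = list(images)
    if not image_list:
        raise ValueError("at least one rendered image is required")
    scores: Dict[str, float] = {}
    for name, question in QUESTIONS.items():
        yes_count = sum(is_yes(ask(img, question)) for img in image_list)
        scores[name] = yes_count / len(image_list)
    return scores
```

Here `images` would be the renderings of one reconstructed scene, and the returned dictionary holds one ratio in [0, 1] per question.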

Table 1: Quantitative evaluation of renderings from the scenes reconstructed by different methods.

We borrow this website template from RealmDreamer.