Pipeline Overview
SPARC's reconstruction pipeline transforms a single 2D image into a complete 3D mesh in under 10 seconds. The process involves three main stages:
Stage 1: Multi-View Generation
Given a single image, SPARC generates 6 canonical views (front, back, left, right, top, bottom) using a fine-tuned multi-view diffusion model. Unlike approaches that generate arbitrary views, the canonical 6-view setup ensures complete coverage with minimal redundancy.
Camera Pose Convention
| View | Azimuth | Elevation | Purpose |
|---|---|---|---|
| Front | 0° | 0° | Primary identity view, from input |
| Right | 90° | 0° | Side profile geometry |
| Back | 180° | 0° | Occluded region hallucination |
| Left | 270° | 0° | Symmetry verification |
| Top | 0° | 90° | Top surface geometry |
| Bottom | 0° | -90° | Bottom surface (often flat) |
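Under a look-at convention, the table above maps directly to camera positions on a sphere around the object. The sketch below shows one way to realize it; the radius, the y-up axis convention, and the helper name are illustrative assumptions, not SPARC's published code.

```python
import numpy as np

def canonical_camera_position(azimuth_deg, elevation_deg, radius=2.0):
    # Hypothetical helper: place the camera on a sphere of the given radius,
    # looking at the origin. Azimuth rotates around the up (y) axis;
    # elevation tilts toward the top view.
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    x = radius * np.cos(el) * np.sin(az)
    y = radius * np.sin(el)
    z = radius * np.cos(el) * np.cos(az)
    return np.array([x, y, z])

# The six canonical poses from the table above
CANONICAL_VIEWS = {
    "front": (0, 0), "right": (90, 0), "back": (180, 0),
    "left": (270, 0), "top": (0, 90), "bottom": (0, -90),
}
```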
Stage 2: Depth Estimation
Each of the 6 generated views is processed through a monocular depth estimator. SPARC uses a modified version of DPT (Dense Prediction Transformer) fine-tuned on our paired RGB-depth dataset of 500K synthetic objects.
Key challenge: depth maps from different views must be metrically consistent. If the front view predicts the object is 2 meters deep but the side view predicts 3 meters, the fusion will fail. SPARC addresses this with a cross-view depth consistency loss during training.
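The text does not give the exact form of the consistency loss. One common formulation, shown here purely as a sketch, back-projects one view's depth into 3D, reprojects those points into a second view, and penalizes disagreement with that view's own depth prediction:

```python
import numpy as np

def cross_view_depth_consistency(depth_a, K, T_a2b, depth_b):
    """Illustrative L1 consistency term between two metric depth maps.

    depth_a, depth_b: (H, W) depth maps; K: shared (3, 3) intrinsics;
    T_a2b: (4, 4) rigid transform from camera A to camera B.
    (All conventions here are assumptions, not SPARC's actual loss.)
    """
    H, W = depth_a.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project view A's pixels into camera-A 3D coordinates
    pts_a = (np.linalg.inv(K) @ pix.T).T * depth_a.reshape(-1, 1)
    # Transform into camera-B coordinates
    pts_b = (T_a2b[:3, :3] @ pts_a.T).T + T_a2b[:3, 3]
    # Reproject into view B and sample its predicted depth
    proj = (K @ pts_b.T).T
    z = pts_b[:, 2]
    valid = z > 1e-6
    ub = np.round(proj[valid, 0] / z[valid]).astype(int)
    vb = np.round(proj[valid, 1] / z[valid]).astype(int)
    inside = (ub >= 0) & (ub < W) & (vb >= 0) & (vb < H)
    # Penalize the gap between transformed depth and view B's prediction
    return np.abs(z[valid][inside] - depth_b[vb[inside], ub[inside]]).mean()
```

With an identity transform and identical depth maps the term is zero; any metric mismatch between views shows up directly as a positive penalty.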
Stage 3: Mesh Fusion
The 6 depth maps are back-projected into 3D point clouds using known camera parameters, then fused using a truncated signed distance function (TSDF). The resulting volume is converted to a mesh via Marching Cubes. Texture is projected from the original generated views onto the mesh faces.
```python
# Pseudo-code for TSDF fusion (TSDFVolume and the helpers are illustrative)
tsdf = TSDFVolume(resolution=256, voxel_size=0.004)
for view_id in range(6):
    color = load_image(f"view_{view_id}.png")
    depth = load_depth(f"depth_{view_id}.npy")
    camera = get_camera_pose(view_id)  # Known canonical pose
    tsdf.integrate(color, depth, camera)
vertices, faces, colors = tsdf.extract_mesh()
export_mesh("output.obj", vertices, faces, colors)
```
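The `integrate` step can be sketched in plain NumPy. Everything below (grid layout, truncation distance, uniform weighting) is an illustrative assumption following the classic Curless & Levoy running-average fusion, not SPARC's actual implementation:

```python
import numpy as np

def integrate(tsdf, weight, depth, K, cam_T_world, origin, voxel_size, trunc=0.02):
    """Minimal single-view TSDF update into a cubic voxel grid (sketch)."""
    res = tsdf.shape[0]
    # World-space center of every voxel, in the grid's C-order flattening
    idx = np.indices((res, res, res)).reshape(3, -1).T
    pts_w = origin + (idx + 0.5) * voxel_size
    # Transform into the camera frame and project with the pinhole model
    pts_c = (cam_T_world[:3, :3] @ pts_w.T).T + cam_T_world[:3, 3]
    z = pts_c[:, 2]
    proj = (K @ pts_c.T).T
    u = np.round(proj[:, 0] / np.maximum(z, 1e-9)).astype(int)
    v = np.round(proj[:, 1] / np.maximum(z, 1e-9)).astype(int)
    H, W = depth.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Signed distance to the observed surface along the viewing ray
    sdf = np.full(z.shape, -np.inf)
    sdf[valid] = depth[v[valid], u[valid]] - z[valid]
    # Only update voxels in front of, or just behind, the surface
    update = valid & (sdf > -trunc)
    d = np.clip(sdf[update] / trunc, -1.0, 1.0)
    flat_t, flat_w = tsdf.reshape(-1), weight.reshape(-1)
    # Running weighted average, as in Curless & Levoy fusion
    flat_t[update] = (flat_t[update] * flat_w[update] + d) / (flat_w[update] + 1)
    flat_w[update] += 1
```

Voxels well in front of the surface saturate at +1, voxels just behind it go negative, and voxels far behind are left untouched so occluded geometry from other views is not overwritten.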
Quality Metrics
| Metric | SPARC | Previous SOTA | Ground Truth |
|---|---|---|---|
| Chamfer Distance (↓) | 0.0082 | 0.0145 | 0.0 |
| [email protected] (↑) | 0.87 | 0.72 | 1.0 |
| Normal Consistency (↑) | 0.91 | 0.84 | 1.0 |
| Inference Time (↓) | 8.2s | 45s | — |
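For reference, both geometry metrics in the table can be computed brute-force from sampled point clouds. The exact definitions vary across papers (squared vs. unsquared distances, threshold units), so treat this as one common convention rather than SPARC's evaluation code:

```python
import numpy as np

def chamfer_distance(P, Q):
    # Symmetric Chamfer distance between point sets P (N,3) and Q (M,3),
    # squared-distance variant, averaged in both directions (one common choice)
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return d2.min(1).mean() + d2.min(0).mean()

def f_score(P, Q, tau=0.1):
    # F-score at threshold tau: harmonic mean of precision (P points within
    # tau of Q) and recall (Q points within tau of P)
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    precision = (np.sqrt(d2.min(1)) < tau).mean()
    recall = (np.sqrt(d2.min(0)) < tau).mean()
    return 2 * precision * recall / max(precision + recall, 1e-8)
```

The pairwise distance matrix is O(N·M) in memory, so real evaluations typically use a KD-tree for the nearest-neighbor queries instead.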
Limitations and Future Work
- Thin structures (bicycle spokes, tree branches) are poorly reconstructed due to depth resolution limits
- Highly reflective surfaces cause depth estimation errors
- The back view is fully hallucinated—accuracy depends on the diffusion model's prior knowledge
- Current resolution is limited to 256³ voxels; higher resolution requires more VRAM
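To make the last point concrete, voxel memory grows cubically with resolution. Assuming one float32 distance value plus one float32 weight per voxel (a typical TSDF layout, not a figure from the text):

```python
def tsdf_memory_gib(resolution, channels=2, bytes_per_value=4):
    # channels: float32 TSDF value + float32 weight per voxel (assumption)
    return channels * bytes_per_value * resolution ** 3 / 2 ** 30

for res in (256, 512, 1024):
    print(f"{res}^3 voxels: {tsdf_memory_gib(res):.3f} GiB")
```

Under these assumptions 256³ needs about 0.125 GiB, 512³ about 1 GiB, and 1024³ about 8 GiB, before counting color channels or activations.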