Pipeline Overview

SPARC's reconstruction pipeline transforms a single 2D image into a complete 3D mesh in under 10 seconds. The process involves four main stages:

1. Input Image: single 2D photo
2. Multi-View Generation: 6 canonical views
3. Depth Estimation: per-view depth maps
4. Mesh Fusion: textured 3D mesh
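As a rough sketch, the four stages compose into a simple driver. The function names below are placeholders for the real models, not SPARC's actual API:

```python
from typing import Callable

def reconstruct(image, gen_views: Callable, estimate_depth: Callable, fuse: Callable):
    """Hypothetical end-to-end driver mirroring the four stages above.

    All three callables are illustrative stand-ins for the real models."""
    views = gen_views(image)                     # stage 1: 6 canonical views
    depths = [estimate_depth(v) for v in views]  # stage 2: per-view monocular depth
    return fuse(views, depths)                   # stage 3: TSDF fusion -> textured mesh
```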

Stage 1: Multi-View Generation

Given a single image, SPARC generates 6 canonical views (front, back, left, right, top, bottom) using a fine-tuned multi-view diffusion model. Unlike approaches that generate arbitrary views, the canonical 6-view setup ensures complete coverage with minimal redundancy.

Camera Pose Convention

| View   | Azimuth | Elevation | Purpose                           |
|--------|---------|-----------|-----------------------------------|
| Front  | 0°      | 0°        | Primary identity view, from input |
| Right  | 90°     | 0°        | Side profile geometry             |
| Back   | 180°    | 0°        | Occluded region hallucination     |
| Left   | 270°    | 0°        | Symmetry verification             |
| Top    | 0°      | 90°       | Top surface geometry              |
| Bottom | 0°      | -90°      | Bottom surface (often flat)       |
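Under one plausible convention (azimuth rotating about the up axis, elevation tilting toward it; not necessarily SPARC's exact frame), the table's poses map to camera centers on a sphere around the object:

```python
import math

def canonical_camera_position(azimuth_deg: float, elevation_deg: float,
                              radius: float = 2.0):
    """Camera center on a sphere of the given radius, looking at the origin.

    Assumed convention, for illustration only: +Z is the front direction,
    +Y is up, azimuth rotates about +Y, elevation tilts toward +Y."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (radius * math.cos(el) * math.sin(az),   # x
            radius * math.sin(el),                  # y
            radius * math.cos(el) * math.cos(az))   # z

# The six canonical (azimuth, elevation) pairs from the table above
CANONICAL_VIEWS = {
    "front": (0, 0), "right": (90, 0), "back": (180, 0),
    "left": (270, 0), "top": (0, 90), "bottom": (0, -90),
}
```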

Stage 2: Depth Estimation

Each of the 6 generated views is processed through a monocular depth estimator. SPARC uses a modified version of DPT (Dense Prediction Transformer) fine-tuned on our paired RGB-depth dataset of 500K synthetic objects.

Key challenge: depth maps from different views must be metrically consistent. If the front view predicts the object is 2 meters deep but the side view predicts 3 meters, the fusion will fail. SPARC addresses this with a cross-view depth consistency loss during training.
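A consistency check of this kind can be sketched by reprojecting one view's depth into another and comparing. All names here are illustrative; the actual training loss would need to be differentiable (e.g. bilinear sampling) rather than nearest-pixel like this:

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map to camera-frame 3D points. depth: (H, W), K: (3, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T        # rays with z = 1
    return rays * depth.reshape(-1, 1)     # (H*W, 3) points in camera frame

def depth_consistency_loss(depth_a, depth_b, K, T_ab):
    """Mean |reprojected - observed| depth for view-A pixels visible in view B.

    T_ab is the 4x4 rigid transform from camera A to camera B, which is
    known here because the canonical poses are fixed."""
    pts_b = backproject(depth_a, K) @ T_ab[:3, :3].T + T_ab[:3, 3]
    z = pts_b[:, 2]
    proj = pts_b @ K.T
    u = np.round(proj[:, 0] / z).astype(int)   # nearest pixel in view B
    v = np.round(proj[:, 1] / z).astype(int)
    H, W = depth_b.shape
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return np.abs(z[ok] - depth_b[v[ok], u[ok]]).mean()
```

A perfectly consistent pair of depth maps drives this quantity to zero; metric disagreement between views (the 2 m vs. 3 m case above) shows up directly as reprojected-depth error.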

Stage 3: Mesh Fusion

The 6 depth maps are back-projected into 3D point clouds using known camera parameters, then fused using a truncated signed distance function (TSDF). The resulting volume is converted to a mesh via Marching Cubes. Texture is projected from the original generated views onto the mesh faces.

# Pseudo-code for TSDF fusion (TSDFVolume is an illustrative API, not a real library)
tsdf = TSDFVolume(resolution=256, voxel_size=0.004)    # 256^3 grid, ~1 m per side
for view_id in range(6):
    color = load_image(f"view_{view_id}.png")    # generated RGB view
    depth = load_depth(f"depth_{view_id}.npy")   # predicted metric depth map
    camera = get_camera_pose(view_id)            # known canonical pose
    tsdf.integrate(color, depth, camera)         # weighted SDF update per voxel
vertices, faces, colors = tsdf.extract_mesh()    # Marching Cubes on the zero level set
export_mesh("output.obj", vertices, faces, colors)
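The integrate step above boils down to a weighted running average per voxel. A minimal sketch of that update (standard Curless-and-Levoy-style fusion; parameter names are illustrative):

```python
def integrate_voxel(tsdf: float, weight: float, sdf_obs: float,
                    trunc: float = 0.02, w_obs: float = 1.0):
    """Fold one new signed-distance observation into a voxel's running average.

    tsdf/weight: the voxel's accumulated value and weight so far.
    sdf_obs: signed distance to the surface from this view's depth map (meters).
    trunc: truncation band; distances are clamped to [-trunc, trunc]."""
    d = max(-trunc, min(trunc, sdf_obs)) / trunc   # clamp, normalize to [-1, 1]
    new_weight = weight + w_obs
    new_tsdf = (tsdf * weight + d * w_obs) / new_weight
    return new_tsdf, new_weight
```

Each of the 6 views contributes one such observation per voxel it sees; Marching Cubes then extracts the zero level set of the fused field.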

Quality Metrics

| Metric                 | SPARC  | Previous SOTA | Ground Truth |
|------------------------|--------|---------------|--------------|
| Chamfer Distance (↓)   | 0.0082 | 0.0145        | 0.0          |
| [email protected] (↑)    | 0.87   | 0.72          | 1.0          |
| Normal Consistency (↑) | 0.91   | 0.84          | 1.0          |
| Inference Time (↓)     | 8.2 s  | 45 s          | n/a          |

Limitations and Future Work
