Pipeline Overview
SPARC's reconstruction pipeline transforms a single 2D image into a complete 3D mesh in under 10 seconds. The process involves three main stages:
Stage 1: Multi-View Generation
Given a single image, SPARC generates 6 canonical views (front, back, left, right, top, bottom) using a fine-tuned multi-view diffusion model. Unlike approaches that generate arbitrary views, the canonical 6-view setup ensures complete coverage with minimal redundancy.
Camera Pose Convention
| View | Azimuth | Elevation | Purpose |
|---|---|---|---|
| Front | 0° | 0° | Primary identity view, from input |
| Right | 90° | 0° | Side profile geometry |
| Back | 180° | 0° | Occluded region hallucination |
| Left | 270° | 0° | Symmetry verification |
| Top | 0° | 90° | Top surface geometry |
| Bottom | 0° | -90° | Bottom surface (often flat) |
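Under a look-at convention, the table above maps directly to camera positions on a sphere around the object. The sketch below shows one way to realize it; the radius, the y-up axis convention, and the helper name are illustrative assumptions, not SPARC's published code.

```python
import numpy as np

def canonical_camera_position(azimuth_deg, elevation_deg, radius=2.0):
    # Hypothetical helper: place the camera on a sphere of the given radius,
    # looking at the origin. Azimuth rotates around the up (y) axis;
    # elevation tilts toward the top view.
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    x = radius * np.cos(el) * np.sin(az)
    y = radius * np.sin(el)
    z = radius * np.cos(el) * np.cos(az)
    return np.array([x, y, z])

# The six canonical poses from the table above
CANONICAL_VIEWS = {
    "front": (0, 0), "right": (90, 0), "back": (180, 0),
    "left": (270, 0), "top": (0, 90), "bottom": (0, -90),
}
```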
Stage 2: Depth Estimation
Each of the 6 generated views is processed through a monocular depth estimator. SPARC uses a modified version of DPT (Dense Prediction Transformer) fine-tuned on our paired RGB-depth dataset of 500K synthetic objects.
Key challenge: depth maps from different views must be metrically consistent. If the front view predicts the object is 2 meters deep but the side view predicts 3 meters, the fusion will fail. SPARC addresses this with a cross-view depth consistency loss during training.
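The text does not give the exact form of the consistency loss. One common formulation, shown here purely as a sketch, back-projects one view's depth into 3D, reprojects those points into a second view, and penalizes disagreement with that view's own depth prediction:

```python
import numpy as np

def cross_view_depth_consistency(depth_a, K, T_a2b, depth_b):
    """Illustrative L1 consistency term between two metric depth maps.

    depth_a, depth_b: (H, W) depth maps; K: shared (3, 3) intrinsics;
    T_a2b: (4, 4) rigid transform from camera A to camera B.
    (All conventions here are assumptions, not SPARC's actual loss.)
    """
    H, W = depth_a.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project view A's pixels into camera-A 3D coordinates
    pts_a = (np.linalg.inv(K) @ pix.T).T * depth_a.reshape(-1, 1)
    # Transform into camera-B coordinates
    pts_b = (T_a2b[:3, :3] @ pts_a.T).T + T_a2b[:3, 3]
    # Reproject into view B and sample its predicted depth
    proj = (K @ pts_b.T).T
    z = pts_b[:, 2]
    valid = z > 1e-6
    ub = np.round(proj[valid, 0] / z[valid]).astype(int)
    vb = np.round(proj[valid, 1] / z[valid]).astype(int)
    inside = (ub >= 0) & (ub < W) & (vb >= 0) & (vb < H)
    # Penalize the gap between transformed depth and view B's prediction
    return np.abs(z[valid][inside] - depth_b[vb[inside], ub[inside]]).mean()
```

With an identity transform and identical depth maps the term is zero; any metric mismatch between views shows up directly as a positive penalty.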
Stage 3: Mesh Fusion
The 6 depth maps are back-projected into 3D point clouds using known camera parameters, then fused using a truncated signed distance function (TSDF). The resulting volume is converted to a mesh via Marching Cubes. Texture is projected from the original generated views onto the mesh faces.
```python
# Pseudo-code for TSDF fusion (TSDFVolume and the helpers are illustrative)
tsdf = TSDFVolume(resolution=256, voxel_size=0.004)
for view_id in range(6):
    color = load_image(f"view_{view_id}.png")
    depth = load_depth(f"depth_{view_id}.npy")
    camera = get_camera_pose(view_id)  # Known canonical pose
    tsdf.integrate(color, depth, camera)
vertices, faces, colors = tsdf.extract_mesh()
export_mesh("output.obj", vertices, faces, colors)
```
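The `integrate` step can be sketched in plain NumPy. Everything below (grid layout, truncation distance, uniform weighting) is an illustrative assumption following the classic Curless & Levoy running-average fusion, not SPARC's actual implementation:

```python
import numpy as np

def integrate(tsdf, weight, depth, K, cam_T_world, origin, voxel_size, trunc=0.02):
    """Minimal single-view TSDF update into a cubic voxel grid (sketch)."""
    res = tsdf.shape[0]
    # World-space center of every voxel, in the grid's C-order flattening
    idx = np.indices((res, res, res)).reshape(3, -1).T
    pts_w = origin + (idx + 0.5) * voxel_size
    # Transform into the camera frame and project with the pinhole model
    pts_c = (cam_T_world[:3, :3] @ pts_w.T).T + cam_T_world[:3, 3]
    z = pts_c[:, 2]
    proj = (K @ pts_c.T).T
    u = np.round(proj[:, 0] / np.maximum(z, 1e-9)).astype(int)
    v = np.round(proj[:, 1] / np.maximum(z, 1e-9)).astype(int)
    H, W = depth.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Signed distance to the observed surface along the viewing ray
    sdf = np.full(z.shape, -np.inf)
    sdf[valid] = depth[v[valid], u[valid]] - z[valid]
    # Only update voxels in front of, or just behind, the surface
    update = valid & (sdf > -trunc)
    d = np.clip(sdf[update] / trunc, -1.0, 1.0)
    flat_t, flat_w = tsdf.reshape(-1), weight.reshape(-1)
    # Running weighted average, as in Curless & Levoy fusion
    flat_t[update] = (flat_t[update] * flat_w[update] + d) / (flat_w[update] + 1)
    flat_w[update] += 1
```

Voxels well in front of the surface saturate at +1, voxels just behind it go negative, and voxels far behind are left untouched so occluded geometry from other views is not overwritten.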
Quality Metrics
| Metric | SPARC | Previous SOTA | Ground Truth |
|---|---|---|---|
| Chamfer Distance (↓) | 0.0082 | 0.0145 | 0.0 |
| [email protected] (↑) | 0.87 | 0.72 | 1.0 |
| Normal Consistency (↑) | 0.91 | 0.84 | 1.0 |
| Inference Time (↓) | 8.2s | 45s | — |
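For reference, both geometry metrics in the table can be computed brute-force from sampled point clouds. The exact definitions vary across papers (squared vs. unsquared distances, threshold units), so treat this as one common convention rather than SPARC's evaluation code:

```python
import numpy as np

def chamfer_distance(P, Q):
    # Symmetric Chamfer distance between point sets P (N,3) and Q (M,3),
    # squared-distance variant, averaged in both directions (one common choice)
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return d2.min(1).mean() + d2.min(0).mean()

def f_score(P, Q, tau=0.1):
    # F-score at threshold tau: harmonic mean of precision (P points within
    # tau of Q) and recall (Q points within tau of P)
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    precision = (np.sqrt(d2.min(1)) < tau).mean()
    recall = (np.sqrt(d2.min(0)) < tau).mean()
    return 2 * precision * recall / max(precision + recall, 1e-8)
```

The pairwise distance matrix is O(N·M) in memory, so real evaluations typically use a KD-tree for the nearest-neighbor queries instead.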
Limitations and Future Work
- Thin structures (bicycle spokes, tree branches) are poorly reconstructed due to depth resolution limits
- Highly reflective surfaces cause depth estimation errors
- The back view is fully hallucinated—accuracy depends on the diffusion model's prior knowledge
- Current resolution is limited to 256³ voxels; higher resolution requires more VRAM
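To make the last point concrete, voxel memory grows cubically with resolution. Assuming one float32 distance value plus one float32 weight per voxel (a typical TSDF layout, not a figure from the text):

```python
def tsdf_memory_gib(resolution, channels=2, bytes_per_value=4):
    # channels: float32 TSDF value + float32 weight per voxel (assumption)
    return channels * bytes_per_value * resolution ** 3 / 2 ** 30

for res in (256, 512, 1024):
    print(f"{res}^3 voxels: {tsdf_memory_gib(res):.3f} GiB")
```

Under these assumptions 256³ needs about 0.125 GiB, 512³ about 1 GiB, and 1024³ about 8 GiB, before counting color channels or activations.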