Workshop Retrospective - 3D Reconstruction of Objects and Scenes
Workshop goals
- ✔️ Reconstruct small 3D objects using various methods.
- ❌ Reconstruct a room or a larger outdoor scene using various methods.
- ✔️ Compare results from different devices and techniques.
- ✔️ Reflect on the effectiveness of traditional versus model-based methods.
Unfortunately, we were unable to gather a larger dataset of a room or outdoor scene in time for the workshop. We therefore focused on evaluating these model-based approaches using a small dataset of a teddy bear figurine.
Reconstructing objects

Input
The following images were captured using a Samsung Galaxy S23 in professional mode (allowing manual setting of focus, etc.).
Input images

*(Figure: grid of the twelve input photos of the teddy bear figurine, omitted here.)*
Results - Point clouds
The following point clouds were rendered in Blender 4.4 with equally sized point primitives and global lighting to make qualitative comparison easier.
Point clouds

*(Renders omitted; the table lists each tool with its point count.)*

| Tool/Method | 0° | 90° | 180° | 270° | No. of points |
|---|---|---|---|---|---|
| Colmap (Apr 7, 2016) | *(image)* | *(image)* | *(image)* | *(image)* | 20 728 |
| Dust3r (21 Dec 2023) | *(image)* | *(image)* | *(image)* | *(image)* | 451 620 |
| Mast3r (14 Jun 2024) | *(image)* | *(image)* | *(image)* | *(image)* | 157 807 |
| Spann3r (28 Aug 2024) | *(image)* | *(image)* | *(image)* | *(image)* | 19 552 |
| VGGT (14 Mar 2025 at CVPR) | *(image)* | *(image)* | *(image)* | *(image)* | 301 823 |
| Dune (18 Mar 2025 at CVPR) | *(image)* | *(image)* | *(image)* | *(image)* | 504 990 |
| Polycam (online service) | *(image)* | *(image)* | *(image)* | *(image)* | 301 823 |
Evaluation

- Colmap: The SfM tool of choice in many scientific papers on 3D reconstruction seems to struggle with the rather small number of images: the entire back side of the figurine was not reconstructed. That said, the comparatively few points do a good job of capturing the entire front side. Compared to the Spann3r result (which contains slightly fewer points), Colmap's points are distributed across a larger surface area, making the original object easier to discern.
- Dust3r: This point cloud contains the second-highest number of points, achieving an impressively detailed and complete reconstruction. Interestingly, much of the black fabric in the background was reconstructed as well, albeit poorly, which likely contributes greatly to the high point count. Furthermore, some sets of points float off to the side relative to their correct position, even though most other points making up the same area are correctly placed.
- Mast3r: These results are very similar to those of Dust3r. The figurine is well reconstructed; however, points appear to be grouped into numerous rectangular "sheets" fitted to the surface of the figurine, making the surface seem disconnected. Disregarding the artifacts, Dust3r seems to reconstruct the geometry more accurately.
- Spann3r: This point cloud is even sparser and more incomplete than the Colmap result, making the original object very hard to discern.
- VGGT: This reconstruction looks quite complete but seems to suffer from "over-reconstruction". It has nearly twice the point count of Mast3r, yet the additional points make the original object harder to discern rather than adding detail. The effect resembles the floating artifacts of the Dust3r result, but is even more prevalent here.
- Dune: As with Dust3r, much of the background was reconstructed, and the floating artifacts are quite prevalent. This result contains the most points, but once again, many of them likely go into reconstructing the background.
- Polycam: While Polycam generated a very high point count (over 400,000 points), the reconstruction focuses almost entirely on the front of the figurine, much like Colmap. Polycam delivers the most detailed result, but falls short in terms of completeness.
Reflection on model-based approach
Dust3r directly generates 2D-to-3D mappings in the form of a point map per image, operating on either one or two images. When using two images, both point maps are expressed in the same coordinate frame. With more than two images, pairwise point maps are produced and a global alignment is performed, which explains the disproportionate rise in runtime as the number of images increases.
The exact distinction from a model like Depth-Anything-V1/V2 is not immediately obvious to me, other than that this method produces point maps rather than depth maps. A critical difference seems to be that the point maps inherently contain information about the images' relation to one another, since pairwise point maps are expressed in the same coordinate frame. Essentially, as far as I understand, this method directly creates a 3D reconstruction, i.e. a point cloud. The Dust3r website also mentions "dense" 3D reconstruction as a downstream task, even though reconstructions with Dust3r (and its later iterations) are already much denser than what Colmap was able to produce.
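The global-alignment idea can be illustrated with a toy sketch (this is not the actual Dust3r code): if each pairwise point map lives in its own coordinate frame, bringing it into a shared global frame amounts to estimating a similarity transform (scale, rotation, translation) from points the pairs have in common, e.g. with Umeyama's method:

```python
# Toy sketch of the alignment step behind Dust3r-style global alignment:
# estimate s, R, t such that dst ≈ s * R @ src + t (Umeyama's method).
import numpy as np

def umeyama(src: np.ndarray, dst: np.ndarray):
    """Similarity transform mapping point set src onto dst."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)          # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                          # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# Usage: recover a known transform between two frames sharing 100 points.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))               # points shared by two pairs
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0,          0,         1]])
s_true, t_true = 1.7, np.array([0.5, -1.0, 2.0])
pts_other = s_true * pts @ R_true.T + t_true  # same points, other pair's frame
s, R, t = umeyama(pts, pts_other)
aligned = s * pts @ R.T + t
print(np.abs(aligned - pts_other).max())      # ≈ 0
```

The real global alignment in Dust3r jointly optimizes over all pairs rather than aligning them one by one, but each pair contributes exactly this kind of rigid/similarity constraint.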
Dune, a recent model developed by Naver Labs, was also included in the evaluation to compare correspondence matching performance. The results speak for themselves:
The following results depict the correspondences produced with Dune and Colmap comparing the first image in the dataset to all other images. Each line connects a common object point in the two images.
Image matching

*(Per-pair match visualizations omitted; each row matches the first image against one of the remaining eleven images.)*

| Image pair | Dune: correspondences (time) | Colmap: correspondences |
|---|---|---|
| Pair 1 | 880 (0.84 s) | 66 |
| Pair 2 | 602 (0.53 s) | 28 |
| Pair 3 | 358 (0.52 s) | 33 |
| Pair 4 | 374 (0.53 s) | 21 |
| Pair 5 | 221 (0.50 s) | 31 |
| Pair 6 | 192 (0.50 s) | 16 |
| Pair 7 | 216 (0.51 s) | 17 |
| Pair 8 | 436 (0.47 s) | 24 |
| Pair 9 | 711 (0.49 s) | 31 |
| Pair 10 | 922 (0.48 s) | 98 |
| Pair 11 | 1192 (0.50 s) | 2265 |
Colmap took a total of ~11 seconds for feature extraction and matching combined, making the average time per image pair just under one second, compared to Dune's ~0.5 s average. Furthermore, Dune identified over ten times as many matches as Colmap on average.
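As a sanity check, these averages can be recomputed from the per-pair numbers in the matching table above (reading "on average" as the mean of the per-pair ratios):

```python
# Per-pair values transcribed from the matching table above.
dune_matches   = [880, 602, 358, 374, 221, 192, 216, 436, 711, 922, 1192]
dune_times_s   = [0.84, 0.53, 0.52, 0.53, 0.50, 0.50, 0.51, 0.47, 0.49, 0.48, 0.50]
colmap_matches = [66, 28, 33, 21, 31, 16, 17, 24, 31, 98, 2265]

avg_dune_time = sum(dune_times_s) / len(dune_times_s)
avg_ratio = sum(d / c for d, c in zip(dune_matches, colmap_matches)) / len(dune_matches)
print(f"Dune avg time: {avg_dune_time:.2f} s, avg match ratio: {avg_ratio:.1f}x")
# -> Dune avg time: 0.53 s, avg match ratio: 13.3x
```

Note that the last pair, where Colmap found 2265 matches, pulls the per-pair ratio below 1 for that row; the "over ten times" figure holds on average across the eleven pairs.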
Dune is notably more sensitive than Colmap when it comes to detecting and matching small highlights in the background. In our dataset, the teddy bear figurine was photographed inside a box lined with very dark fabric, and with carefully adjusted exposure settings the remaining background highlights were barely visible. Under these conditions, Colmap detected and matched only a few of these faint details, while Dune picked up and matched significantly more.

To capture full coverage of the object, we rotated the figurine rather than the camera, a setup used in several other workshop datasets as well. However, when exposure was not properly controlled and more background highlights became visible, the quality of the reconstruction deteriorated noticeably. This appears to be due to Dune's heightened sensitivity: it not only detects subtle background features, but also tends to match them even when they clearly do not correspond, leading to incorrect geometry and floating artifacts in the reconstruction.

While Dune's aggressive matching behavior can be impressive in scenes with rich visual texture, it may also backfire in controlled environments like ours, where small, irrelevant features (such as fabric highlights) can be overemphasized. If Dune shares this characteristic with the other model-based approaches, this helps explain why they sometimes underperform in this particular setup when exposure is not tightly managed.
Different use cases

Surface reconstruction is a traditional use case for point clouds. Applying it to the above point clouds produced the following results:
These meshes were created in MeshLab using Screened Poisson surface reconstruction, after manually removing some outlier points and computing normals with MeshLab's *Compute normals for point sets* filter.
Poisson surface reconstruction

*(Mesh renders omitted.)*

| Tool/Method | 0° | 90° | 180° | 270° |
|---|---|---|---|---|
| Colmap | *(image)* | *(image)* | *(image)* | *(image)* |
| Dust3r (21 Dec 2023) | *(image)* | *(image)* | *(image)* | *(image)* |
| Mast3r (14 Jun 2024) | *(image)* | *(image)* | *(image)* | *(image)* |
| Spann3r (28 Aug 2024) | *(image)* | *(image)* | *(image)* | *(image)* |
| VGGT (14 Mar 2025 at CVPR) | *(image)* | *(image)* | *(image)* | *(image)* |
| Dune (18 Mar 2025 at CVPR) | *(image)* | *(image)* | *(image)* | *(image)* |
| Polycam | *(image)* | *(image)* | *(image)* | *(image)* |
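The normal-computation step can be sketched in a few lines. MeshLab's filter, like most point-cloud normal estimators, fits a plane to each point's nearest neighbours; the normal is the eigenvector of the local covariance with the smallest eigenvalue. The following is a simplified NumPy sketch of that idea, not MeshLab's implementation (it brute-forces neighbour search and skips consistent normal orientation):

```python
# Minimal local-PCA normal estimation for a point cloud.
import numpy as np

def estimate_normals(points: np.ndarray, k: int = 10) -> np.ndarray:
    # Brute-force pairwise distances; fine for small clouds only.
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbours (incl. self)
    normals = np.empty_like(points)
    for i, idx in enumerate(nn):
        nbrs = points[idx] - points[idx].mean(0)
        # Eigenvector of the smallest eigenvalue = fitted plane's normal.
        _, vecs = np.linalg.eigh(nbrs.T @ nbrs)
        normals[i] = vecs[:, 0]              # eigh sorts eigenvalues ascending
    return normals

# Usage: points sampled on the z = 0 plane should get normals ±(0, 0, 1).
rng = np.random.default_rng(1)
pts = np.c_[rng.uniform(size=(200, 2)), np.zeros(200)]
n = estimate_normals(pts)
print(np.abs(n[:, 2]).min())                 # ≈ 1.0
```

Screened Poisson reconstruction then solves for an implicit surface whose gradient matches these oriented normals, which is why poorly estimated or inconsistently oriented normals show up as blobby or inverted patches in the meshes above.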
A recent and growing use case for scene and camera pose reconstruction is the optimization of scene representations like 3D Gaussians or NeRFs. This Hugging Face space allows uploading small datasets to create a 3D reconstruction with Mast3r and use it to optimize a 3D Gaussian scene representation, showcasing the promise of model-based approaches for modern use cases. A video of a 3D Gaussian representation of the teddy bear figurine created with this pipeline can be seen here.
Installation of Mast3r and usage of Dune

We installed Mast3r on a Windows 11 machine according to the instructions in the project's GitHub repository. We got it working with:
- Visual Studio 2022
- CUDA Toolkit 12.4
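For reference, the core of the setup roughly follows the commands from the Mast3r README (reproduced from memory as a sketch; check the repository for the current instructions, including the optional requirements and the CUDA kernel compilation step):

```shell
# Clone Mast3r together with its submodules (e.g. the bundled dust3r).
git clone --recursive https://github.com/naver/mast3r
cd mast3r

# If the repository was cloned without --recursive:
# git submodule update --init --recursive

# Install the Python dependencies. A PyTorch build matching the installed
# CUDA Toolkit (12.4 in our case) must be available first.
pip install -r requirements.txt
```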
Partial support for Dune was added to the Mast3r repository in a very recent commit, pushed on the same day as the workshop (June 26th), and was therefore not included in the version we had cloned at the time. After the workshop, Noam and I pulled the updated version to test the new functionality and generate the results presented in this retrospective. Both the 3D reconstruction and the feature matching were performed using the newly integrated Dune features, following the updated instructions in the repository's README.