
Learning Objectives and story arc

Story of the course

The goal of the course is to give an introduction to 3D Computer Vision.

We will start with the basics of how images are formed and how cameras work. We will then see how 3D points are projected onto 2D images and which parameters need to be estimated. This makes it clear that 3D Computer Vision is an ill-posed problem, because we want to recover 3D information from 2D images.

We will have a closer look at the mathematical models of cameras and how to calibrate them. The goal of calibration is to estimate the parameters of the camera and to be able to correct distortions in the images. This reduces the ill-posedness of the problem and gives us more accurate 3D information from the images. From a pixel position we can then compute a ray in 3D space, which is the basis for further methods that recover 3D information from images.
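
As a first taste, here is a minimal sketch of that pixel-to-ray step, assuming NumPy and a made-up intrinsic matrix K (not values from the lecture):

```python
import numpy as np

# Back-projection sketch: a calibrated camera turns a pixel into a viewing ray.
# K is a made-up intrinsic matrix; the ray lives in the camera coordinate frame.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

pixel = np.array([400.0, 260.0, 1.0])  # homogeneous pixel coordinates
ray = np.linalg.inv(K) @ pixel         # direction of the viewing ray
ray /= np.linalg.norm(ray)             # any depth along this ray hits that pixel
print(ray)
```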

Then we will learn about projective geometry and how to use it to understand the geometry of images. This already yields some depth information from a single image: if there is a reference object in the image, we can measure the size of other objects, and projective geometry provides a quick way to compute the necessary quantities. We will also learn about deep-learning-based methods for estimating depth from a single image, and about Time-of-Flight cameras, a special type of camera that measures depth directly.

Before going further with 3D reconstruction, we will learn how 3D data can be represented and processed. We will learn about different data structures for 3D data (meshes, point clouds, voxel grids, point maps, depth maps) and about algorithms to process them.

Using geometry, we will learn how to get 3D information from two images, which is called stereo vision. We will learn about the principle of epipolar geometry and how to find correspondences between two images. We will also learn about triangulation and structured light, methods that recover 3D information based on the same principles. Then we will move on to multi-view stereo, which uses many images to obtain more accurate 3D information, and to NeRFs, a more recent deep-learning method for reconstructing 3D information from many images.

All previous methods rely on calibrated cameras, i.e. known camera parameters. In the last part of the course, we will learn about structure from motion, a method to recover 3D information from many images without knowing the camera parameters. We will briefly review how feature matching finds the correspondences between images that structure from motion builds on. We will then have a look at geometric foundation models like Dust3r, which use a large neural network to recover 3D information from many images without known camera parameters.

Learning objectives

We track the learning progress on the Miro board (the password is given in the lecture).

01 Introduction and motivation

  • I am motivated.
  • I have an idea of what is coming.

What is 3D Computer Vision about

  • I know what computer vision means and encompasses.
  • I have first ideas of how 3D information can be reconstructed from images.

Cameras

Cameras and perception

  • I know how a camera creates an image.
  • I know the pinhole camera model.
  • I understand why we use lenses in cameras.
  • I know that human eyes are not cameras and perception is tricky.

Projection

  • I understood that an image is a projection from 3D to 2D, and therefore information is lost.
  • I know how a projection can be calculated using matrix multiplication (see the sketch after this list).
  • I know that the projection matrix only maps 3D points to image coordinates and not vice versa.
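
A minimal sketch of such a projection, assuming NumPy and made-up intrinsic and extrinsic parameters:

```python
import numpy as np

# Pinhole projection sketch with illustrative parameters (not course values).
f, cx, cy = 800.0, 320.0, 240.0        # focal length and principal point, pixels
K = np.array([[f, 0.0, cx],
              [0.0, f, cy],
              [0.0, 0.0, 1.0]])        # intrinsics

R = np.eye(3)                          # extrinsics: rotation ...
t = np.array([[0.0], [0.0], [2.0]])    # ... and translation of the camera

P = K @ np.hstack([R, t])              # 3x4 projection matrix

X = np.array([0.1, -0.2, 3.0, 1.0])    # 3D point, homogeneous coordinates
x = P @ X                              # one matrix multiplication ...
u, v = x[:2] / x[2]                    # ... plus dehomogenization
print(u, v)                            # pixel the 3D point lands on
```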

Camera parameters

  • I know that there are internal and external camera parameters and how they are related.
  • I know that the internal camera parameters are fixed unless the lens is changed.
  • I know that the external camera parameters define the pose (position and direction) of the camera.
  • I understand that if I know both parameters, I can compute on which pixel a 3D point is projected, but not vice versa.

Camera calibration

  • I understand the principle of camera calibration.
  • I know that there is a well-established method using planar targets.
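
A hedged sketch of that method (Zhang's planar-target calibration) using OpenCV; the chessboard size and file paths are assumptions for illustration:

```python
import glob

import cv2
import numpy as np

# Calibration sketch: detect a 9x6 inner-corner chessboard in several images
# (assumed to live in "calib/") and estimate intrinsics plus distortion.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)  # board frame

obj_points, img_points = [], []
for fname in glob.glob("calib/*.png"):
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K: intrinsic matrix, dist: distortion coefficients for undistorting images,
# rvecs/tvecs: extrinsics (pose of the target) for each image.
err, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```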

Distortion

  • I learned that distortion of an image can occur due to perspective as well as the lens.
  • I learned about a method that can correct distortion in wide-angle shots.
  • I know that there are other cameras where the pinhole model is not suitable, but those are out of scope of this lecture.

Monocular reconstruction

Projective geometry

  • I understood that projective geometry can be used to easily calculate intersections of lines in images.
  • I can imagine that it also works in 3D with planes.
  • I have understood how to use vanishing points and a reference object to measure the height of other objects in an image.
  • I know how to calculate the position of a vanishing point.
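
In homogeneous coordinates both points and lines are 3-vectors, so a vanishing point falls out of two cross products. A minimal sketch with made-up pixel coordinates:

```python
import numpy as np

def line_through(p, q):
    # Line through two image points, homogeneous coordinates: l = p x q.
    return np.cross([*p, 1.0], [*q, 1.0])

# Two image lines that are parallel in the scene (made-up pixel coordinates).
l1 = line_through((100, 400), (300, 300))
l2 = line_through((100, 500), (300, 360))

v = np.cross(l1, l2)  # intersection of the two lines = vanishing point
v = v / v[2]          # dehomogenize; v[2] == 0 would mean parallel image lines
print(v[:2])
```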

Deep Learning for depth estimation

  • I know what an artificial neuron is and I can imagine that millions of them can be connected to each other.
  • I understand why it is called “deep” learning.
  • I know why you should never use training data for testing.
  • I understand what training, testing and prediction mean and can understand the process.
  • I understand the role of the loss function.
  • I know the aim and principle of the gradient descent method and have understood what it has to do with derivatives (see the sketch after this list).
  • I know that backpropagation is the algorithm used to train a neural network.
  • I know about DepthAnything and other deep learning based methods to get a depth map from a single image.
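
To make the loss function and gradient descent items concrete, here is a toy sketch that fits a line by gradient descent; the data and learning rate are made up, and a real depth network differs mainly in scale:

```python
import numpy as np

# Toy gradient descent: fit y = w*x + b by minimizing the mean squared error.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])       # generated by w = 2, b = 1

w, b, lr = 0.0, 0.0, 0.05                # initial parameters and learning rate
for _ in range(500):
    err = (w * x + b) - y                # prediction residuals
    loss = np.mean(err ** 2)             # the loss function
    dw = 2 * np.mean(err * x)            # derivative of the loss w.r.t. w
    db = 2 * np.mean(err)                # derivative of the loss w.r.t. b
    w -= lr * dw                         # step against the gradient
    b -= lr * db
print(w, b, loss)                        # approaches (2, 1, 0)
```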

3D cameras

  • I know different 3D camera technologies.
  • I have understood how the Time-of-Flight measurement principle works (see the sketch after this list).
  • I know the necessary hardware components of a ToF camera.
  • I know which problems exist in ToF imaging and how they can be solved.
  • I know different applications for 3D cameras.
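
A small worked example of the continuous-wave ToF principle; the modulation frequency and phase are illustrative, and real cameras estimate the phase from several raw measurements:

```python
import math

# Continuous-wave ToF sketch: depth from the phase shift of modulated light.
c = 299_792_458.0      # speed of light, m/s
f_mod = 20e6           # modulation frequency, Hz (assumed)
phi = 1.2              # measured phase shift, radians (assumed)

depth = c * phi / (4 * math.pi * f_mod)  # d = c * phi / (4 * pi * f_mod)
ambiguity = c / (2 * f_mod)              # unambiguous range, ~7.5 m here
print(depth, ambiguity)
```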

Data structures and algorithms

  • I know what data structures exist for 3D data.
  • I understand how the ICP algorithm works.
  • I know octrees.
  • I know how a kD tree is constructed and how to find the nearest neighbors in it.
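
A minimal kD-tree sketch: construction by median splits on alternating axes plus nearest-neighbor search. ICP typically relies on exactly this kind of search for its correspondence step; all names here are illustrative:

```python
import numpy as np

def build(points, depth=0):
    # Split on alternating axes at the median point.
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, q, best=None):
    if node is None:
        return best
    p, axis = node["point"], node["axis"]
    if best is None or np.sum((q - p) ** 2) < np.sum((q - best) ** 2):
        best = p
    near, far = ((node["left"], node["right"]) if q[axis] < p[axis]
                 else (node["right"], node["left"]))
    best = nearest(near, q, best)
    # Descend the far side only if the splitting plane is closer than the best hit.
    if (q[axis] - p[axis]) ** 2 < np.sum((q - best) ** 2):
        best = nearest(far, q, best)
    return best

tree = build(np.random.rand(100, 2))
print(nearest(tree, np.array([0.5, 0.5])))
```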

Stereo and Triangulation

Stereo

  • I have understood the principle of epipolar geometry.
  • I have understood the purpose and use of rectification.
  • I have understood why stereo matching is so difficult, especially on edges.
  • I know the distinction between local and global stereo algorithms (a local matcher is sketched after this list).
  • I understand how a global stereo match can be found using dynamic programming on the disparity space image.
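
A sketch of a local, winner-takes-all matcher on one rectified scanline with SSD costs; window size and disparity range are assumptions. A global method would instead optimize a path through the disparity space image:

```python
import numpy as np

def disparity_row(left_row, right_row, patch=5, max_disp=32):
    # SSD block matching along a single rectified scanline, winner takes all.
    half = patch // 2
    disp = np.zeros(len(left_row), dtype=int)
    for u in range(half + max_disp, len(left_row) - half):
        ref = left_row[u - half:u + half + 1]
        costs = [np.sum((ref - right_row[u - d - half:u - d + half + 1]) ** 2)
                 for d in range(max_disp)]
        disp[u] = int(np.argmin(costs))
    return disp

# Synthetic scanline shifted by 8 pixels; the matcher should recover d = 8.
sig = np.sin(np.linspace(0.0, 20.0, 200))
print(disparity_row(sig, np.roll(sig, -8))[40:50])
```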

Triangulation

  • I understand the calculation process of stereo triangulation and can reproduce it (see the sketch after this list).
  • I understand the link between structured illumination and stereo triangulation.
  • I understand the idea of using binary codes for Structured Light scanners and why this speeds up the process.
  • I have understood how triangulation can be used to reconstruct the shape of objects and I have learned about various applications.
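
A minimal sketch of the triangulation calculation for a rectified stereo pair (depth from disparity via Z = f·B/d), with made-up camera values:

```python
# Rectified-stereo triangulation sketch; all numbers are illustrative.
f = 700.0              # focal length in pixels
B = 0.12               # baseline between the cameras in meters
cx, cy = 320.0, 240.0  # principal point

uL, vL = 350.0, 260.0  # matched point in the left image
uR = 310.0             # same point in the right image (same row, rectified)

d = uL - uR            # disparity in pixels
Z = f * B / d          # depth: Z = f * B / d
X = (uL - cx) * Z / f  # back-project into the left camera frame
Y = (vL - cy) * Z / f
print(X, Y, Z)
```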

Epipolar geometry

  • I have understood what epipoles are.
  • I have learned that the mapping from one point to the epipolar line in another image is described by the fundamental matrix.
  • I can estimate where the epipolar lines are in two images of the same scene.
  • I know that you can rectify a stereo image pair if you know the fundamental matrix.
  • I have understood the difference between essential matrix and fundamental matrix.
  • I have understood that the fundamental matrix can be calculated from matching image points (see the sketch after this list).
  • I have learned that estimating it is still a relatively simple minimization problem.
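
A hedged sketch of the eight-point algorithm, which reduces the estimation of F from at least eight matches to a linear least-squares problem solved with an SVD; in practice the coordinates are normalized first (Hartley's normalization):

```python
import numpy as np

def eight_point(pts1, pts2):
    # Each match (x1, x2) contributes one row of A, since x2^T F x1 = 0
    # is linear in the nine entries of F.
    A = np.array([[u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1.0]
                  for (u1, v1), (u2, v2) in zip(pts1, pts2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)       # null-space direction = least-squares solution
    U, S, Vt = np.linalg.svd(F)    # enforce rank 2: a valid F is singular
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```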

Multiview Stereo (MVS) and NeRF (and tbd: 3DGS)

MVS

  • I have understood that multi-view stereo assumes that one knows the intrinsic and extrinsic camera parameters.
  • I learned about different weighting functions for matching.
  • I learned that you can reconstruct very accurately with many cameras and I know some applications.
  • I learned about the influence of the baseline length.
  • I have understood the principle of how to arrive at 3D models instead of a single depth map.

NeRF (and Gaussian Splatting)

  • I know what a NeRF is and how it is created.
  • I have understood that a NeRF does not store geometries.
  • I can distinguish between how NeRFs are trained and how they are rendered.
  • I understand why controlling and editing a NeRF is complicated.

Feature Matching and Structure from motion (SfM)

Features

  • I understand what is meant by image feature.
  • I am aware of different applications that use image features and am motivated to learn more about features.
  • I understand the advantages of image features over individual pixels, regions, or whole images.
  • I understand the approach: detection, description, matching.
  • I know what a blob is and know and understand the concept of scale invariance.
  • I have understood what invariance is and why it is necessary.
  • I know applications based on finding pairs of features.
  • I understand how invariance can be achieved by the descriptor.
  • I know SIFT as a detector and descriptor.
  • I can calculate the distance between two features.
  • I have understood that repetitive structures cause matching problems and know possible solutions.
  • I understand RANSAC and have ideas about where to use it (see the sketch after this list).
  • I go through the day exhilarated and whistling the RANSAC song!
  • I can distinguish between feature detection and feature description.
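
A generic RANSAC sketch, here fitting a line to points with outliers; the threshold and iteration count are illustrative, and in feature matching the model would be a homography or fundamental matrix instead:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 80)
inliers = np.c_[x, 2 * x + 1 + rng.normal(0, 0.1, 80)]   # points near y = 2x + 1
outliers = rng.uniform(0, 20, (40, 2))                   # gross mismatches
pts = np.vstack([inliers, outliers])

best, best_count = None, 0
for _ in range(200):
    p, q = pts[rng.choice(len(pts), 2, replace=False)]   # minimal sample
    if np.isclose(q[0], p[0]):
        continue                                         # degenerate sample
    w = (q[1] - p[1]) / (q[0] - p[0])                    # candidate line y = w*x + b
    b = p[1] - w * p[0]
    residuals = np.abs(pts[:, 1] - (w * pts[:, 0] + b))
    count = int(np.sum(residuals < 0.3))                 # size of the consensus set
    if count > best_count:
        best, best_count = (w, b), count
print(best, best_count)                                  # close to (2, 1), ~80 inliers
```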

SfM

  • I know what Structure From Motion means and why it is called what it is called.
  • I understand that you can take advantage of the low rank when solving the system of equations (see the sketch after this list).
  • I understand that SfM involves a lot of data and large matrices that must be processed together, which is why iterative methods are used.
  • I know many different applications of SfM.
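
A sketch of the low-rank idea in the spirit of Tomasi-Kanade factorization for affine cameras, run on synthetic data; it illustrates the rank-3 structure, not the full projective SfM pipeline:

```python
import numpy as np

# The centered measurement matrix W (2 * views x points) has rank <= 3,
# so one SVD splits it into motion M and structure S (up to an affine ambiguity).
rng = np.random.default_rng(1)
S_true = rng.normal(size=(3, 50))                # 50 3D points
W = np.vstack([rng.normal(size=(2, 3)) @ S_true  # 6 affine views
               for _ in range(6)])

W0 = W - W.mean(axis=1, keepdims=True)           # center each row
U, s, Vt = np.linalg.svd(W0, full_matrices=False)
M = U[:, :3] * s[:3]                             # camera motion
S = Vt[:3]                                       # 3D structure
print(np.allclose(W0, M @ S))                    # rank-3 model fits exactly
```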