How do orthographic and perspective camera models in structure from motion differ from each other?

Question

How are the orthographic and perspective camera models used in structure from motion?

Also, how do these techniques differ from each other?

Answer 1:

Say you have a static scene and a moving camera (or, equivalently, a rigidly moving scene and a static camera), and you want to reconstruct the scene geometry and the camera motion from two or more images. The reconstruction is usually based on point correspondences: each correspondence gives you equations that have to be solved for the 3D points and the camera motion.

The solution can be based either on nonlinear minimization or on various approximations. The camera can be approximated by orthographic or by perspective projection. In the simplest SfM case the camera is approximated by orthographic projection (or, more generally, by weak perspective projection), and the scene can be recovered up to scale. However, translation perpendicular to the image plane, that is, along the optical axis, can never be recovered, because orthographic projection discards depth.

Newer SfM methods use perspective projection, because with orthographic projection we can't recover all of the information. With full perspective projection we can recover, for example, the translation along the optical axis. That is, the geometry and the full motion can be recovered up to a global scale factor.
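To illustrate why translation along the optical axis is invisible under orthographic projection but visible under perspective projection, here is a minimal NumPy sketch. The function names, the sample points, and the unit focal length are assumptions made for illustration; they are not taken from any particular SfM library.

```python
import numpy as np

def project_orthographic(points, scale=1.0):
    """Drop Z and apply a uniform scale: (X, Y, Z) -> scale * (X, Y)."""
    points = np.asarray(points, dtype=float)
    return scale * points[:, :2]

def project_perspective(points, focal_length=1.0):
    """Ideal pinhole projection: (X, Y, Z) -> focal_length * (X / Z, Y / Z)."""
    points = np.asarray(points, dtype=float)
    return focal_length * points[:, :2] / points[:, 2:3]

# Hypothetical scene points in camera coordinates (Z is the optical axis).
points = np.array([[1.0, 2.0, 5.0],
                   [-1.0, 0.5, 6.0],
                   [0.0, -1.0, 4.0]])

# Move the whole scene 2 units along the optical axis.
shift_z = np.array([0.0, 0.0, 2.0])

# Orthographic image is identical before and after the shift,
# so this motion cannot be recovered from the images.
ortho_unchanged = np.allclose(project_orthographic(points),
                              project_orthographic(points + shift_z))

# Perspective image changes, so the motion is observable.
persp_unchanged = np.allclose(project_perspective(points),
                              project_perspective(points + shift_z))

print("orthographic image unchanged:", ortho_unchanged)
print("perspective image unchanged:", persp_unchanged)
```

The orthographic image does not depend on Z at all, which is exactly why that component of the translation is lost.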

Answer 2:

To understand why each method is chosen, we need to look at the camera model when we treat the camera as orthographic and when we treat it as perspective.

The orthographic camera model is a special case where we assume that the distance of the scene from the center of projection is infinite. This means we assume there is no perspective distortion caused by the distance between the object and the image plane. As a consequence, we expect the object's X and Y coordinates in the real world to match its image coordinates up to a scaling factor.

So, for example, if we have a triangle in the real world with vertices (X1,Y1,Z1), (X2,Y2,Z2), (X3,Y3,Z3), we expect to see the triangle in the image at (x1,y1), (x2,y2), (x3,y3), where X1 = w*x1, X2 = w*x2, ..., Y1 = w*y1, and so on, and where w is some scaling factor.

When is this a good assumption? Note that I didn't take the Z value of each point into consideration. So this assumption is good when we look at a scene whose distance from the camera is almost constant.
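The effect of depth variation can be checked numerically. The sketch below compares an ideal pinhole projection with a scaled orthographic projection that uses a single scaling factor w (taken here, as an assumption, to be the mean depth divided by the focal length). The function names, the random scene, and the depth ranges are hypothetical and only meant to illustrate the point.

```python
import numpy as np

def perspective(points, f=1.0):
    """Ideal pinhole projection: x = f * X / Z, y = f * Y / Z."""
    return f * points[:, :2] / points[:, 2:3]

def scaled_orthographic(points, f=1.0):
    """One scaling factor for every point: x = X / w, with w = mean(Z) / f."""
    w = points[:, 2].mean() / f
    return points[:, :2] / w

def max_error(depth_spread, mean_depth=10.0, seed=0):
    """Largest image-plane gap between the two models for a random scene."""
    rng = np.random.default_rng(seed)
    xy = rng.uniform(-1.0, 1.0, size=(100, 2))
    z = mean_depth + rng.uniform(-depth_spread, depth_spread, size=(100, 1))
    points = np.hstack([xy, z])
    return np.abs(perspective(points) - scaled_orthographic(points)).max()

# Scene whose depth is almost constant (10 +/- 0.1).
shallow_error = max_error(depth_spread=0.1)
# Scene with large depth variation (10 +/- 5).
deep_error = max_error(depth_spread=5.0)

print("max error, nearly constant depth:", shallow_error)
print("max error, large depth variation:", deep_error)
```

When the depth spread is small relative to the mean depth, the two models produce nearly the same image, which is the regime where the orthographic approximation is reasonable.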

Note: This is a very simplified explanation that doesn't take into account many other factors, such as the camera itself, lens distortion, and more.

Source: https://stackoverflow.com/questions/39521396/how-do-orthographic-and-perspective-camera-models-in-structure-from-motion-diffe

Tags

OpenCV

computer-vision

augmented-reality

structure-from-motion