Direct Linear Transformation (DLT) problems in computer vision involve messy coefficients in the linear system of equations. Using an orthogonalization routine together with einsum simplifies the implementation and makes the code more standardized and readable.
I was TA-ing for the vision class this quarter.
The curriculum has always included classic machine vision staples: camera calibration, epipolar geometry, SfM, etc. Many of these are instances of Direct Linear Transformation (DLT) problems. We have a projective equality of the form
λx=Ty,
where x,y are 2D or 3D homogeneous coordinates in R3 or R4. T is a transform matrix. λ is an arbitrary scaling factor: the equality is satisfied as long as x and Ty are parallel. Here are four common problem setups:
2D homography: x∈R3,y∈R3,T∈R3×3. The goal is to compute T given {xi,yi}.
PnP camera localization: x∈R3,y∈R4,T∈R3×4. The goal is to compute T given {xi,yi}.
3D Triangulation: x∈R3,y∈R4,T∈R3×4. The goal is to compute y given {xi,Ti}.
Fundamental matrix: xTFy=0,x∈R3,y∈R3,F∈R3×3. The goal is to compute F given {xi,yi}. The setup is a bit different from eq. 1 but related.
We often need multiple such equalities to provide a sufficient number of constraints on the unknown variables.
We will discuss the requirements for each case.
The overall linear system is assembled into Aq=0, and the unknown q is solved in a least-squares / ℓ2 sense by finding the null space of A, often via SVD/QR.
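As a minimal sketch of this null-space step (the function name null_vector is mine, not from the post): the least-squares solution of Aq=0 with ‖q‖=1 is the right singular vector belonging to the smallest singular value.

```python
import numpy as np

def null_vector(A):
    # Least-squares solution of A q = 0 subject to ||q|| = 1:
    # the right singular vector for the smallest singular value.
    _, _, V_t = np.linalg.svd(A)
    return V_t[-1]
```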
2. Messy coefficients in the linear system
What I find unsatisfactory is that when deriving the constraint matrix A from λx=Ty, many lecture slides (such as this, this, and many others) go through a series of somewhat ad-hoc steps. This presentation has two downsides: 1) the connection between these problems becomes harder to see, and 2) in code, the entries of A have to be filled in one by one, which is error-prone and a bit messy.
Take 2D homography for example, where we solve for the entries {hi} of the matrix H. H has 8 degrees of freedom due to the scale ambiguity of the projective setup. Each point pair yields 2 row constraints, so 4 point pairs produce the 8 constraints needed to solve for h. The problem is that each block Ai has to be entered element-wise, and the expressions get messier for PnP camera localization and the other problems.
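For concreteness, here is a sketch of that element-wise filling for one point pair, using one common sign convention derived from the cross product x2×(Hx1)=0 (the function name is mine):

```python
import numpy as np

def homography_rows(x1, x2):
    # Two rows of A for a single point pair (x1 -> x2), filled in
    # element by element -- exactly the manual bookkeeping the
    # einsum formulation avoids. h is H flattened row-major.
    u, v, _ = x1
    up, vp, _ = x2
    return np.array([
        [0, 0, 0, -u, -v, -1, vp * u, vp * v, vp],
        [u, v, 1, 0, 0, 0, -up * u, -up * v, -up],
    ], dtype=float)
```

Stacking these 2×9 blocks for 4 point pairs gives the 8×9 system for h.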
3. Standardizing the linear system
The aim of the DLT eq. 1 is to make x and Ty parallel. Say x∈R3, and denote by x⊥∈R2×3 the vertical stack of independent row vectors, each orthogonal to x, so that x⊥x=0. The parallelism requirement can then be written as
x∝Ty ⟹ x⊥Ty=0 ⟹ (x⊥⊗y)⋅T=0,

where x⊥⊗y∈R2×3×3 and T∈R3×3.
The ⊗ in eq. 6 is outer product applied to the last two dimensions, and the dot product is applied on the last two dimensions as well. This type of multidimensional linear system has become popular in scientific computing. The code is compact when written with einsum, and it looks like this:
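A sketch of that construction (hypothetical helper name; x_perp holds the stacked x⊥ for n points and y the corresponding homogeneous points):

```python
import numpy as np

def dlt_system(x_perp, y):
    # x_perp: [n, 2, 3] orthogonal stacks, y: [n, d] homogeneous points.
    # The outer product over the last dims gives the [n, 2, 3, d] tensor
    # (x_perp ⊗ y); flattening the trailing dims yields the rows of
    # A q = 0, where q is T flattened row-major.
    A = np.einsum("nab, nc -> nabc", x_perp, y)
    n = x_perp.shape[0]
    return A.reshape(2 * n, -1)
```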
The orthogonal stack x⊥ is easy to obtain in R3: it is just the skew-symmetric / cross-product matrix [x^], since [x^]x=0 and [x^] has rank 2, even though it is R3×3 instead of R2×3.
For R4 and beyond, the straightforward way to get x⊥ is a QR factorization via Gram-Schmidt / Givens / Householder. The routine is efficient, and it implicitly performs all the manual algebra steps from before.
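The later snippets call an ortho helper that the post does not show; a possible implementation of both variants might look like this (batched cross-product matrices for R3, and a QR-based version for general dimension):

```python
import numpy as np

def ortho(pts):
    # Batched cross-product matrices [x]^ for pts: [n, 3], so that
    # ortho(pts)[i] @ pts[i] = 0 and each matrix has rank 2.
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    O = np.zeros((pts.shape[0], 3, 3))
    O[:, 0, 1], O[:, 0, 2] = -z, y
    O[:, 1, 0], O[:, 1, 2] = z, -x
    O[:, 2, 0], O[:, 2, 1] = -y, x
    return O

def ortho_qr(x):
    # General-dimension variant: full QR of x as a single column.
    # The remaining d-1 columns of Q are orthogonal to x.
    Q, _ = np.linalg.qr(x[:, None], mode="complete")
    return Q[:, 1:].T  # [d-1, d], rows orthogonal to x
```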
x⊥ also makes it clear that 1 equality provides d−1 constraints; hence 8/(3−1)=4 point pairs are required for 2D homography.
3.1. 2D Homography
Here is the code for 2D homography estimation.
```python
def compute_homography(pts1, pts2):
    n = pts1.shape[0]
    pts2 = ortho(pts2)  # [n, 3, 3]
    A = np.einsum("nab, nc -> nabc", pts2, pts1)
    A = A.reshape(3 * n, -1)
    _, _, V_t = np.linalg.svd(A)
    L = V_t[-1] / V_t[-1, -1]
    H1to2 = L.reshape(3, 3)
    return H1to2
```
3.2. PnP camera localization
Here is the code for PnP camera localization.
If we only want to estimate the extrinsics and have ground-truth intrinsics, it is easy to factor them out too.
```python
def pnp_camera_localization(pts_2d, pts_3d):
    n = pts_2d.shape[0]
    pts_2d = ortho(pts_2d)  # [n, 3, 3]
    A = np.einsum("nab, nc -> nabc", pts_2d, pts_3d)
    A = A.reshape(3 * n, -1)
    _, _, V_t = np.linalg.svd(A)
    L = V_t[-1] / V_t[-1, -1]
    P = L.reshape(3, 4)
    return P
```
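One way to factor out known intrinsics is to normalize the pixels by K⁻¹ and run the same DLT on the normalized coordinates; a self-contained sketch (the function names are mine, and the R3 orthogonal stack is inlined so the snippet runs standalone):

```python
import numpy as np

def skew(pts):
    # Batched cross-product matrices: the orthogonal stack for R^3.
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    O = np.zeros((pts.shape[0], 3, 3))
    O[:, 0, 1], O[:, 0, 2] = -z, y
    O[:, 1, 0], O[:, 1, 2] = z, -x
    O[:, 2, 0], O[:, 2, 1] = -y, x
    return O

def pnp_extrinsics(pts_2d, pts_3d, K):
    # With ground-truth intrinsics K, normalize the pixels first;
    # the DLT then estimates the extrinsics [R | t] up to scale.
    pts_norm = (np.linalg.inv(K) @ pts_2d.T).T
    A = np.einsum("nab, nc -> nabc", skew(pts_norm), pts_3d)
    _, _, V_t = np.linalg.svd(A.reshape(-1, 12))
    return V_t[-1].reshape(3, 4)
```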
3.3. Triangulation
For triangulation we solve for y via (xi⊥Ti)y=0, where (xi,Ti) are the 2D pixel coordinates and the camera projection matrix of one image. A single (xi,Ti) only constrains y to a light ray (2 constraints for the 3 unknowns up to scale); with 2 or more views we find the best intersection of the rays.
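A sketch of the triangulation routine in the same style (the function names are mine; the R3 orthogonal stack is inlined so the snippet runs standalone):

```python
import numpy as np

def skew(pts):
    # Batched cross-product matrices: the orthogonal stack for R^3.
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    O = np.zeros((pts.shape[0], 3, 3))
    O[:, 0, 1], O[:, 0, 2] = -z, y
    O[:, 1, 0], O[:, 1, 2] = z, -x
    O[:, 2, 0], O[:, 2, 1] = -y, x
    return O

def triangulate(pts_2d, P_mats):
    # pts_2d: [n, 3] homogeneous pixels, P_mats: [n, 3, 4] projections.
    # Stack the constraints (x_i^perp T_i) y = 0 and take the null vector.
    A = np.einsum("nab, nbc -> nac", skew(pts_2d), P_mats)
    _, _, V_t = np.linalg.svd(A.reshape(-1, 4))
    y = V_t[-1]
    return y / y[-1]  # dehomogenize
```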
3.4. Fundamental matrix
The fundamental matrix is the simplest case: the constraint already has the form xTFy=0, so no orthogonal stack is needed. Each point pair contributes 1 scalar constraint, and F has 8 degrees of freedom up to scale, so 8 point pairs are needed.
```python
def eight_point_algorithm(x1s, x2s):
    n, _ = x1s.shape
    A = np.einsum("np, nq -> npq", x2s, x1s)
    A = A.reshape(n, -1)
    _, _, V_t = np.linalg.svd(A, full_matrices=False)
    F = V_t[-1].reshape(3, 3)
    F = F / F[2, 2]
    F = enforce_rank_2(F)
    return F
```
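The enforce_rank_2 helper is not shown in the post; the standard choice is to zero out the smallest singular value, which gives the closest rank-2 matrix in Frobenius norm:

```python
import numpy as np

def enforce_rank_2(F):
    # Project F onto the rank-2 matrices (the fundamental matrix must
    # be rank 2) by zeroing the smallest singular value.
    U, S, V_t = np.linalg.svd(F)
    S[-1] = 0.0
    return U @ np.diag(S) @ V_t
```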
Last updated on 2024-12-31. Design inspired by distill.