Concise DLT with Einsum

2022-02

Direct Linear Transformation (DLT) problems in computer vision involve messy coefficients in their linear systems of equations. Appropriate use of an orthogonalization routine and einsum simplifies the implementation and makes the code more standardized and more readable.

1. DLT problems in Computer Vision

I was TA-ing for the vision class this quarter. The curriculum has always included classic machine vision staples: camera calibration, epipolar geometry, SfM, etc. Many of these are instances of Direct Linear Transformation (DLT) problems. We have a projective equality of the form
\begin{align} \lambda \vx = \mT \vy, \label{dlt}\end{align}
where $\vx, \vy$ are 2D or 3D homogeneous coordinates in $\R^3$ or $\R^4$, $\mT$ is a transform matrix, and $\lambda$ is an arbitrary scaling factor: the equality is satisfied as long as $\vx$ and $\mT \vy$ are parallel. Here are four common problem setups:
  • 2D homography: $\vx \in \R^3, \vy \in \R^3, \mT \in \R^{3 \times 3}$. The goal is to compute $\mT$ given $\{\vx_i, \vy_i\}$.
  • PnP camera localization: $\vx \in \R^3, \vy \in \R^4, \mT \in \R^{3 \times 4}$. The goal is to compute $\mT$ given $\{\vx_i, \vy_i\}$.
  • 3D triangulation: $\vx \in \R^3, \vy \in \R^4, \mT \in \R^{3 \times 4}$. The goal is to compute $\vy$ given $\{\vx_i, \mT_i\}$.
  • Fundamental matrix: $\vx^T \mF \vy = 0$, with $\vx \in \R^3, \vy \in \R^3, \mF \in \R^{3 \times 3}$. The goal is to compute $\mF$ given $\{\vx_i, \vy_i\}$. The setup differs slightly from eq. 1 but is related.
We often need multiple such equalities to provide a sufficient number of constraints on the unknown variables; we will discuss the requirements for each case. The overall linear system is assembled into $\mA \bm{q} = \bm{0}$, and the unknown $\bm{q}$ is solved in a least-squares ($l_2$) sense by finding the null space of $\mA$, often via SVD/QR.
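As a minimal sketch of this null-space recipe: fabricate a matrix $\mA$ whose rows are all orthogonal to a known direction, then check that the right singular vector belonging to the smallest singular value recovers it (all names here are illustrative, not from the original post).

```python
import numpy as np

# Build an A whose rows are all orthogonal to a known q_true, then recover
# q_true as the least-squares solution of A q = 0 with ||q|| = 1.
rng = np.random.default_rng(0)
q_true = rng.normal(size=9)
q_true /= np.linalg.norm(q_true)

B = rng.normal(size=(20, 9))
A = B - np.outer(B @ q_true, q_true)  # project the q_true component out of each row

_, _, V_t = np.linalg.svd(A)
q = V_t[-1]  # right singular vector of the smallest singular value
```

With noisy data the null space is only approximate, but the same last-row read-off still gives the $l_2$-optimal unit-norm solution.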

2. Messy coefficients in the linear system

What I find unsatisfactory is that when deriving the constraint matrix $\mA$ from $\lambda \vx = \mT \vy$, many lecture slides (such as this, this, and many others) go through a series of somewhat ad-hoc steps. This presentation has two downsides: 1) the connection between these problems becomes harder to see, and 2) in code, we have to fill in the entries of $\mA$ one by one, which is error-prone and a bit messy.
Take 2D homography as an example. To solve for the entries $\{h_i\}$ of the matrix in
\begin{align} \lambda \begin{bmatrix}x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \\ \end{bmatrix} \begin{bmatrix}x \\ y \\ 1\end{bmatrix}\end{align}
we expand the rows, divide rows 1 and 2 by row 3 to eliminate $\lambda$, and obtain
\begin{align} h_1 x + h_2 y + h_3 &= x'(h_7 x + h_8 y + h_9) \nonumber \\ h_4 x + h_5 y + h_6 &= y'(h_7 x + h_8 y + h_9) .\end{align}
Let $\vh$ be the vector $[h_1 \dots h_9]^T$. The corresponding coefficients are collected in the matrix
\begin{align} \mA_i = \begin{bmatrix*}[r] -x & -y & -1 & 0 & 0 & 0 & xx' & yx' & x' \\ 0 & 0 & 0 & -x & -y & -1 & xy' & yy' & y' \end{bmatrix*} \Longrightarrow \mA_i \vh = \bm{0}.\end{align}
$\vh$ has 8 degrees of freedom due to the scale ambiguity of the projective setup. 1 point pair yields 2 row constraints, so 4 point pairs produce the 8 row constraints sufficient to solve for $\vh$. The problem is that $\mA_i$ must be entered manually element by element, and the expression gets messier for PnP camera localization and the other problems.
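For concreteness, here is what the manual element-wise construction of eq. 4 looks like in code (a sketch; `homography_constraints_manual` is a hypothetical helper name, and points are assumed normalized so the last homogeneous coordinate is 1):

```python
import numpy as np

def homography_constraints_manual(pts1, pts2):
    """Fill the A_i blocks of eq. 4 element by element (the style this post
    tries to avoid). pts1, pts2: [n, 3] homogeneous points with x' = H x."""
    rows = []
    for (x, y, _), (xp, yp, _) in zip(pts1, pts2):
        rows.append([-x, -y, -1, 0, 0, 0, x * xp, y * xp, xp])
        rows.append([0, 0, 0, -x, -y, -1, x * yp, y * yp, yp])
    return np.asarray(rows)  # [2n, 9], two rows per point pair
```

Every entry has to be typed in the right slot with the right sign, which is exactly the error-prone bookkeeping the next section removes.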

3. Standardizing the linear system

The aim of the DLT eq. 1 is to make $\vx$ and $\mT \vy$ parallel. Say $\vx \in \R^3$, and denote by $\vx^{\perp} \in \R^{2 \times 3}$ the vertical stack of independent row vectors, each of which is orthogonal to $\vx$, so that $\vx^{\perp} \vx = \bm{0}$. The parallelism requirement can be written as
\begin{align} \vx \propto \mT \vy ~\Longrightarrow~ \vx^{\perp} \mT \vy &= \bm{0} \\ \underbrace{ (\vx^{\perp} \otimes \vy) }_{\R^{2 \times 3 \times 3}} \cdot \underbrace{\mT \vphantom{(} }_{\R^{3 \times 3}} &= \bm{0}. \label{outer}\end{align}
The $\otimes$ in eq. 6 is the outer product applied to the last dimensions, and the dot product likewise contracts over the last two dimensions. This type of multidimensional linear system has become common in scientific computing. The code is compact when written with einsum, and it looks like this:
pts_2d  # [n, 3]
pts_3d  # [n, 4]
n = pts_2d.shape[0]
ortho_pts_2d = ortho(pts_2d)  # [n, 2, 3]
A = np.einsum("nab, nc -> nabc", ortho_pts_2d, pts_3d)
A = A.reshape(2 * n, -1)
_, _, V_t = np.linalg.svd(A)
# ...
The orthogonal stack $\vx^{\perp}$ is easy in $\R^3$: it is just the skew-symmetric / cross-product matrix $[\hat{\vx}]$, since $[\hat{\vx}] \vx = \bm{0}$ and $[\hat{\vx}]$ has rank 2, even though it is $3 \times 3$ instead of $2 \times 3$.
For $\R^4$ and beyond, the straightforward way to obtain $\vx^{\perp}$ is QR factorization via Gram-Schmidt / Givens / Householder. The routine is efficient and implicitly performs all the manual algebra steps from before.
$\vx^{\perp}$ also makes it clear that 1 equality provides $d-1$ constraints; hence $8 / (3-1) = 4$ input pairs are required for 2D homography.
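The helpers `skew` and `ortho` used by the code below are not spelled out in this post; here is one possible implementation (a sketch). For 3-vectors, `ortho` falls back to the skew stack so its output matches the `[n, 3, 3]` shapes used later; for higher dimensions it uses an SVD, though QR would work equally well.

```python
import numpy as np

def skew(x):
    """Cross-product matrices [x^] for a batch x: [n, 3]; [x^] @ x = 0."""
    n = x.shape[0]
    S = np.zeros((n, 3, 3))
    S[:, 0, 1], S[:, 0, 2] = -x[:, 2], x[:, 1]
    S[:, 1, 0], S[:, 1, 2] = x[:, 2], -x[:, 0]
    S[:, 2, 0], S[:, 2, 1] = -x[:, 1], x[:, 0]
    return S

def ortho(x):
    """Orthogonal stack x^⊥ for a batch x: [n, d]."""
    if x.shape[1] == 3:
        return skew(x)  # rank 2, shape [n, 3, 3]
    # Treat each x as a [1, d] matrix: the right singular vectors beyond
    # the first span its orthogonal complement.
    _, _, V_t = np.linalg.svd(x[:, None, :])  # V_t: [n, d, d]
    return V_t[:, 1:, :]  # [n, d-1, d]
```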

3.1. 2D Homography

Here is the code for 2D homography estimation.
def compute_homography(pts1, pts2):
    n = pts1.shape[0]

    pts2 = ortho(pts2)  # [n, 3, 3]
    A = np.einsum("nab, nc -> nabc", pts2, pts1)
    A = A.reshape(3 * n, -1)

    _, _, V_t = np.linalg.svd(A)
    L = V_t[-1] / V_t[-1, -1]
    H1to2 = L.reshape(3, 3)
    return H1to2
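A quick sanity check (a sketch, with a skew-based orthogonal stack inlined so it runs standalone): push points through a known homography and verify the einsum DLT recovers it on clean data.

```python
import numpy as np

def skew(x):  # cross-product matrices, serving as x^⊥ for 3-vectors
    n = x.shape[0]
    S = np.zeros((n, 3, 3))
    S[:, 0, 1], S[:, 0, 2] = -x[:, 2], x[:, 1]
    S[:, 1, 0], S[:, 1, 2] = x[:, 2], -x[:, 0]
    S[:, 2, 0], S[:, 2, 1] = -x[:, 1], x[:, 0]
    return S

H_true = np.array([[1.0, 0.2, 3.0], [0.1, 1.0, -2.0], [0.001, 0.002, 1.0]])
rng = np.random.default_rng(1)
pts1 = np.column_stack([rng.uniform(0, 100, size=(4, 2)), np.ones(4)])
pts2 = pts1 @ H_true.T
pts2 /= pts2[:, 2:3]

# Same recipe as compute_homography: outer product, flatten, null space.
A = np.einsum("nab, nc -> nabc", skew(pts2), pts1).reshape(-1, 9)
_, _, V_t = np.linalg.svd(A)
H = V_t[-1].reshape(3, 3)
H /= H[2, 2]
```

With exact correspondences the null space is one-dimensional, so `H` matches `H_true` up to numerical precision.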

3.2. PnP camera localization

Here is the code for PnP camera localization. If we only want to estimate the extrinsics and have ground-truth intrinsics, it is easy to factor them out as well.
def pnp_camera_localization(pts_2d, pts_3d):
    n = pts_2d.shape[0]

    pts_2d = ortho(pts_2d)  # [n, 3, 3]
    A = np.einsum("nab, nc -> nabc", pts_2d, pts_3d)
    A = A.reshape(3 * n, -1)

    _, _, V_t = np.linalg.svd(A)
    L = V_t[-1] / V_t[-1, -1]
    P = L.reshape(3, 4)
    return P
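As a sketch of that factoring (assuming known intrinsics `K`; `pnp_extrinsics` and the inlined `skew` are hypothetical names, not from the original post): pre-multiplying the pixels by $K^{-1}$ makes the DLT estimate $[R \mid t]$ up to scale instead of the full $P = K[R \mid t]$.

```python
import numpy as np

def skew(x):  # cross-product matrices, serving as x^⊥ for 3-vectors
    n = x.shape[0]
    S = np.zeros((n, 3, 3))
    S[:, 0, 1], S[:, 0, 2] = -x[:, 2], x[:, 1]
    S[:, 1, 0], S[:, 1, 2] = x[:, 2], -x[:, 0]
    S[:, 2, 0], S[:, 2, 1] = -x[:, 1], x[:, 0]
    return S

def pnp_extrinsics(pts_2d, pts_3d, K):
    """With ground-truth K, normalize the pixels so the DLT target is [R | t]."""
    pts_norm = pts_2d @ np.linalg.inv(K).T  # x_norm = K^{-1} x
    A = np.einsum("nab, nc -> nabc", skew(pts_norm), pts_3d)
    _, _, V_t = np.linalg.svd(A.reshape(-1, 12))
    return V_t[-1].reshape(3, 4)  # [R | t] up to scale and sign
```

The remaining scale and sign can then be fixed from the rotation block, e.g. by forcing $R$ to be orthonormal with $\det R = +1$.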

3.3. Triangulation

For triangulation we solve for $\vy$ from $(\vx_i^{\perp} \mT_i) \vy = \bm{0}$, where $(\vx_i, \mT_i)$ are the 2D pixel coordinates and the camera projection matrix from one image. 1 set of $(\vx_i, \mT_i)$ already provides enough constraints for a unique $\vy$ up to scale, but we are trying to find the best intersection of 2 or more light rays.
def triangulate(P1, x1s, P2, x2s):
    # x1s: [n, 3]

    n = len(x1s)
    x1s = skew(x1s)  # [n, 3, 3]
    x2s = skew(x2s)

    constraint = np.stack([x1s @ P1, x2s @ P2], axis=1)
    constraint = constraint.reshape(n, -1, 4)
    _, _, V_t = np.linalg.svd(constraint, full_matrices=False)
    pts = V_t[:, -1]
    pts = pts / pts[:, -1].reshape(n, 1)
    return pts
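Another standalone sanity check (a sketch, with `skew` inlined and two made-up cameras): project known 3D points into both views and confirm that the batched SVD recovers them.

```python
import numpy as np

def skew(x):  # cross-product matrices, serving as x^⊥ for 3-vectors
    n = x.shape[0]
    S = np.zeros((n, 3, 3))
    S[:, 0, 1], S[:, 0, 2] = -x[:, 2], x[:, 1]
    S[:, 1, 0], S[:, 1, 2] = x[:, 2], -x[:, 0]
    S[:, 2, 0], S[:, 2, 1] = -x[:, 1], x[:, 0]
    return S

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])              # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])  # baseline along x
X = np.array([[0.1, 0.2, 4.0, 1.0], [-0.3, 0.1, 5.0, 1.0]])  # [n, 4] points
x1 = X @ P1.T
x1 /= x1[:, 2:3]
x2 = X @ P2.T
x2 /= x2[:, 2:3]

# Same stacking as triangulate(): [n, 6, 4], then one SVD per point.
constraint = np.stack([skew(x1) @ P1, skew(x2) @ P2], axis=1).reshape(len(X), -1, 4)
_, _, V_t = np.linalg.svd(constraint)  # batched over the leading n axis
pts = V_t[:, -1]
pts /= pts[:, -1:]
```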

3.4. Fundamental Matrix Estimation

The fundamental matrix is the simplest: it is already in the form $\vx^T \mF \vy = 0$, and 8 point pairs are needed.
def eight_point_algorithm(x1s, x2s):
    n, _ = x1s.shape
    A = np.einsum("np, nq -> npq", x2s, x1s)
    A = A.reshape(n, -1)
    _, _, V_t = np.linalg.svd(A)

    F = V_t[-1].reshape(3, 3)
    F = F / F[2, 2]

    F = enforce_rank_2(F)
    return F
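`enforce_rank_2` is not defined above; the usual recipe (a sketch of the standard post-processing step) projects $\mF$ to the nearest rank-2 matrix by zeroing the smallest singular value.

```python
import numpy as np

def enforce_rank_2(F):
    """Project F to the closest rank-2 matrix in the Frobenius norm: a valid
    fundamental matrix has rank 2, so all epipolar lines meet at the epipole."""
    U, s, V_t = np.linalg.svd(F)
    s[-1] = 0.0  # drop the smallest singular value
    return U @ np.diag(s) @ V_t
```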

Last updated on 2024-12-31. Design inspired by distill.