Direct Linear Transformation (DLT) problems in computer vision involve messy coefficients in the linear system of equations. Using an orthogonalization routine together with einsum simplifies the implementation and makes the code more standardized and readable.
I was TA-ing for the vision class this quarter.
The curriculum has always included classic machine vision staples: camera calibration, epipolar geometry, SfM, etc. Many of these are instances of Direct Linear Transformation (DLT) problems. We have a projective equality of the form
λx=Ty,
where x,y are 2D or 3D homogeneous coordinates in R3 or R4. T is a transform matrix. λ is an arbitrary scaling factor: the equality is satisfied as long as x and Ty are parallel. Here are four common problem setups:
2D homography: x∈R3,y∈R3,T∈R3×3. The goal is to compute T given {xi,yi}.
PnP camera localization: x∈R3,y∈R4,T∈R3×4. The goal is to compute T given {xi,yi}.
3D Triangulation: x∈R3,y∈R4,T∈R3×4. The goal is to compute y given {xi,Ti}.
Fundamental matrix: xTFy=0,x∈R3,y∈R3,F∈R3×3. The goal is to compute F given {xi,yi}. The setup is a bit different from eq. 1 but related.
We often need multiple such equalities to provide a sufficient number of constraints on the unknown variables.
We will discuss the requirements for each case.
The overall linear system is assembled into Aq=0, and the unknown q is solved in a least-squares / ℓ2 sense by finding the null space of A, often via SVD/QR.
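As a minimal sketch of this null-space step (the function name null_vector is mine, not from the post): the least-squares solution of Aq=0 with ‖q‖=1 is the right singular vector belonging to the smallest singular value.

```python
import numpy as np

def null_vector(A):
    # Least-squares solution of A q = 0 subject to ||q|| = 1:
    # the right singular vector for the smallest singular value.
    _, _, V_t = np.linalg.svd(A)
    return V_t[-1]
```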
2. Messy coefficients in the linear system
What I find unsatisfactory is that when deriving the constraint matrix A from λx=Ty, many lecture slides (such as this, this, and many others) go through a series of somewhat ad-hoc steps. This presentation has two downsides: 1) the connection between these problems becomes harder to see, and 2) in code, the entries of A have to be filled in one by one, which is error-prone and a bit messy.
Take 2D homography for example, where we solve for the entries {hi} of the matrix H. H has 8 degrees of freedom due to the scale ambiguity of the projective setup. Each point pair yields 2 row constraints, so 4 point pairs produce the 8 constraints needed to solve for h. The problem is that each block Ai has to be entered element-wise, and the expressions get messier for PnP camera localization and the other problems.
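For concreteness, here is a sketch of that element-wise filling for one point pair, using one common sign convention derived from the cross product x2×(Hx1)=0 (the function name is mine):

```python
import numpy as np

def homography_rows(x1, x2):
    # Two rows of A for a single point pair (x1 -> x2), filled in
    # element by element -- exactly the manual bookkeeping the
    # einsum formulation avoids. h is H flattened row-major.
    u, v, _ = x1
    up, vp, _ = x2
    return np.array([
        [0, 0, 0, -u, -v, -1, vp * u, vp * v, vp],
        [u, v, 1, 0, 0, 0, -up * u, -up * v, -up],
    ], dtype=float)
```

Stacking these 2×9 blocks for 4 point pairs gives the 8×9 system for h.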
3. Standardizing the linear system
The aim of the DLT eq. 1 is to make x and Ty parallel. Say x∈R3, and denote by x⊥∈R2×3 the vertical stack of independent row vectors, each orthogonal to x, so that x⊥x=0. The parallelism requirement can then be written as
x∝Ty ⟹ x⊥Ty=0 ⟹ (x⊥⊗y)⋅T=0,

where x⊥⊗y∈R2×3×3 and T∈R3×3.
The ⊗ in eq. 6 is outer product applied to the last two dimensions, and the dot product is applied on the last two dimensions as well. This type of multidimensional linear system has become popular in scientific computing. The code is compact when written with einsum, and it looks like this:
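A sketch of that construction (hypothetical helper name; x_perp holds the stacked x⊥ for n points and y the corresponding homogeneous points):

```python
import numpy as np

def dlt_system(x_perp, y):
    # x_perp: [n, 2, 3] orthogonal stacks, y: [n, d] homogeneous points.
    # The outer product over the last dims gives the [n, 2, 3, d] tensor
    # (x_perp ⊗ y); flattening the trailing dims yields the rows of
    # A q = 0, where q is T flattened row-major.
    A = np.einsum("nab, nc -> nabc", x_perp, y)
    n = x_perp.shape[0]
    return A.reshape(2 * n, -1)
```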
The orthogonal stack x⊥ is easy to obtain in R3: it is just the skew-symmetric / cross-product matrix [x^], since [x^]x=0 and [x^] has rank 2, even though it is R3×3 instead of R2×3.
For R4 and beyond, the straightforward way to get x⊥ is a QR factorization via Gram-Schmidt / Givens / Householder. The routine is efficient, and it implicitly performs all the manual algebra steps from before.
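The later snippets call an ortho helper that the post does not show; a possible implementation of both variants might look like this (batched cross-product matrices for R3, and a QR-based version for general dimension):

```python
import numpy as np

def ortho(pts):
    # Batched cross-product matrices [x]^ for pts: [n, 3], so that
    # ortho(pts)[i] @ pts[i] = 0 and each matrix has rank 2.
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    O = np.zeros((pts.shape[0], 3, 3))
    O[:, 0, 1], O[:, 0, 2] = -z, y
    O[:, 1, 0], O[:, 1, 2] = z, -x
    O[:, 2, 0], O[:, 2, 1] = -y, x
    return O

def ortho_qr(x):
    # General-dimension variant: full QR of x as a single column.
    # The remaining d-1 columns of Q are orthogonal to x.
    Q, _ = np.linalg.qr(x[:, None], mode="complete")
    return Q[:, 1:].T  # [d-1, d], rows orthogonal to x
```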
x⊥ also makes it clear that 1 equality provides d−1 constraints; hence 8/(3−1)=4 point pairs are required for 2D homography.
3.1. 2D Homography
Here is the code for 2D homography estimation.
```python
def compute_homography(pts1, pts2):
    n = pts1.shape[0]
    pts2 = ortho(pts2)  # [n, 3, 3]
    A = np.einsum("nab, nc -> nabc", pts2, pts1)
    A = A.reshape(3 * n, -1)
    _, _, V_t = np.linalg.svd(A)
    L = V_t[-1] / V_t[-1, -1]
    H1to2 = L.reshape(3, 3)
    return H1to2
```
3.2. PnP camera localization
Here is the code for PnP camera localization.
If we only want to estimate the extrinsics and have ground-truth intrinsics, it is easy to factor them out too.
```python
def pnp_camera_localization(pts_2d, pts_3d):
    n = pts_2d.shape[0]
    pts_2d = ortho(pts_2d)  # [n, 3, 3]
    A = np.einsum("nab, nc -> nabc", pts_2d, pts_3d)
    A = A.reshape(3 * n, -1)
    _, _, V_t = np.linalg.svd(A)
    L = V_t[-1] / V_t[-1, -1]
    P = L.reshape(3, 4)
    return P
```
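One way to factor out known intrinsics is to normalize the pixels by K⁻¹ and run the same DLT on the normalized coordinates; a self-contained sketch (the function names are mine, and the R3 orthogonal stack is inlined so the snippet runs standalone):

```python
import numpy as np

def skew(pts):
    # Batched cross-product matrices: the orthogonal stack for R^3.
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    O = np.zeros((pts.shape[0], 3, 3))
    O[:, 0, 1], O[:, 0, 2] = -z, y
    O[:, 1, 0], O[:, 1, 2] = z, -x
    O[:, 2, 0], O[:, 2, 1] = -y, x
    return O

def pnp_extrinsics(pts_2d, pts_3d, K):
    # With ground-truth intrinsics K, normalize the pixels first;
    # the DLT then estimates the extrinsics [R | t] up to scale.
    pts_norm = (np.linalg.inv(K) @ pts_2d.T).T
    A = np.einsum("nab, nc -> nabc", skew(pts_norm), pts_3d)
    _, _, V_t = np.linalg.svd(A.reshape(-1, 12))
    return V_t[-1].reshape(3, 4)
```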
3.3. Triangulation
For triangulation we solve for y via (xi⊥Ti)y=0, where (xi,Ti) are the 2D pixel coordinates and the camera projection matrix of one image. A single (xi,Ti) only constrains y to a light ray (2 constraints for the 3 unknowns up to scale); with 2 or more views we find the best intersection of the rays.
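A sketch of the triangulation routine in the same style (the function names are mine; the R3 orthogonal stack is inlined so the snippet runs standalone):

```python
import numpy as np

def skew(pts):
    # Batched cross-product matrices: the orthogonal stack for R^3.
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    O = np.zeros((pts.shape[0], 3, 3))
    O[:, 0, 1], O[:, 0, 2] = -z, y
    O[:, 1, 0], O[:, 1, 2] = z, -x
    O[:, 2, 0], O[:, 2, 1] = -y, x
    return O

def triangulate(pts_2d, P_mats):
    # pts_2d: [n, 3] homogeneous pixels, P_mats: [n, 3, 4] projections.
    # Stack the constraints (x_i^perp T_i) y = 0 and take the null vector.
    A = np.einsum("nab, nbc -> nac", skew(pts_2d), P_mats)
    _, _, V_t = np.linalg.svd(A.reshape(-1, 4))
    y = V_t[-1]
    return y / y[-1]  # dehomogenize
```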
3.4. Fundamental matrix
The fundamental matrix is the simplest case: the constraint already has the form xTFy=0, so no orthogonal stack is needed. Each point pair contributes 1 scalar constraint, and F has 8 degrees of freedom up to scale, so 8 point pairs are needed.
```python
def eight_point_algorithm(x1s, x2s):
    n, _ = x1s.shape
    A = np.einsum("np, nq -> npq", x2s, x1s)
    A = A.reshape(n, -1)
    _, _, V_t = np.linalg.svd(A, full_matrices=False)
    F = V_t[-1].reshape(3, 3)
    F = F / F[2, 2]
    F = enforce_rank_2(F)
    return F
```
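The enforce_rank_2 helper is not shown in the post; the standard choice is to zero out the smallest singular value, which gives the closest rank-2 matrix in Frobenius norm:

```python
import numpy as np

def enforce_rank_2(F):
    # Project F onto the rank-2 matrices (the fundamental matrix must
    # be rank 2) by zeroing the smallest singular value.
    U, S, V_t = np.linalg.svd(F)
    S[-1] = 0.0
    return U @ np.diag(S) @ V_t
```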
Last updated on 2024-12-31. Design inspired by distill.