In this section I will rederive the coherent point-drift algorithm, described in A. Myronenko and X. Song, "Point Set Registration: Coherent Point Drift", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2262–2275, Dec. 2010.
| Meaning | Notation |
|---|---|
| Probability space | $(\Omega, \mathcal{F}, \mathbb{P})$ |
| Actual data; random element | $X \colon \Omega \to \mathcal{X}$ |
| Measurable space | $(\mathcal{X}, \Sigma_{\mathcal{X}})$ |
| Measurable space | $(\mathcal{Y}, \Sigma_{\mathcal{Y}})$ |
| Observation function; measurable | $g \colon \mathcal{X} \to \mathcal{Y}$ |
| Observed data | $Y = g \circ X$ |
| Set of distribution parameters | $\Theta$ |
| Set of probability densities | $\{f_{\theta} \colon \mathcal{X} \to \mathbb{R}_{\geq 0} \mid \theta \in \Theta\}$ |
The maximum-likelihood estimator for the distribution parameter — given $X$ — is a random element $\hat{\theta} \colon \Omega \to \Theta$ such that

$$\hat{\theta} \in \underset{\theta \in \Theta}{\operatorname{arg\,max}}\, f_{\theta}(X).$$

Since $\log$ is an increasing function, and probability densities are non-negative, this is equivalent to

$$\hat{\theta} \in \underset{\theta \in \Theta}{\operatorname{arg\,max}}\, \log f_{\theta}(X).$$
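For concreteness, here is a minimal numerical sketch of this definition — not from the paper — for a normal distribution with unknown mean and variance. The grid search directly implements the arg-max of the log-likelihood, and the closed-form estimates (sample mean and biased sample variance) serve as a check.

```python
import numpy as np

def log_likelihood(x, mu, sigma2):
    # Sum of log N(x_i; mu, sigma2); taking logs is safe, the density is positive.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu)**2 / (2 * sigma2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Crude grid search over (mu, sigma2), implementing the arg-max directly.
mus = np.linspace(0, 4, 201)
sigma2s = np.linspace(0.5, 5, 201)
ll = np.array([[log_likelihood(x, m, s2) for s2 in sigma2s] for m in mus])
i, j = np.unravel_index(np.argmax(ll), ll.shape)

print(mus[i], sigma2s[j])  # grid maximum-likelihood estimate
print(x.mean(), x.var())   # closed-form maximum-likelihood estimate
```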
The expectation-maximization algorithm, or the EM algorithm, is an iterative algorithm for finding a maximum-likelihood estimator in a neighborhood of the initial parameter $\theta^{(0)} \in \Theta$ — given observed data $Y = g \circ X$. The EM-iteration is given by

$$\theta^{(k + 1)} \in \underset{\theta \in \Theta}{\operatorname{arg\,max}}\, \mathbb{E}_{\theta^{(k)}}\left[\log f_{\theta}(X) \mid Y\right]$$

for all $k \in \mathbb{N}$.
Ideally, we would like to compute $\hat{\theta}$ from the maximum-likelihood condition above. However, this is not possible, since we only observe $Y$, and not $X$.
Therefore, we use the conditional expectation to integrate out of $\log f_{\theta}(X)$ the information that is not contained in $Y$, and then try to maximize the remaining expected log-likelihood. However, this is often not possible in closed form.
Therefore, we change the strategy to an iterative approximation instead. We assume that the conditional expectation is taken with respect to constant parameters — those from the previous step. If the resulting expression can be maximized in closed form, then we can improve the approximation by repeating the local maximization. This provides us with a local maximum in the limit. The local maximum may or may not be a global maximum.
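To make the iteration concrete, here is a minimal EM sketch under illustrative assumptions: a one-dimensional mixture of two unit-variance normal distributions with equal weights, where only the two means are estimated and the component labels play the role of the unobserved part of the actual data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Actual data: (label, value) pairs; only the values are observed.
labels = rng.integers(0, 2, size=500)
y = rng.normal(loc=np.where(labels == 0, -2.0, 3.0), scale=1.0)

mu = np.array([-1.0, 1.0])  # initial parameter theta^(0)
for _ in range(50):
    # E-step: posterior probability of each label given y, computed with
    # the previous parameters (unit variances, equal weights).
    log_p = -0.5 * (y[:, None] - mu[None, :])**2
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood;
    # for normal components this is a posterior-weighted mean.
    mu = (p * y[:, None]).sum(axis=0) / p.sum(axis=0)

print(mu)  # close to (-2, 3) for this initialization
```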
The coherent point-drift algorithm is a point-set registration algorithm. It treats the points in one point-set as the centroids of normal distributions, and then finds the parameters for the normal distributions that best fit the other point-set. This is done by maximizing likelihood using the EM algorithm. This is useful, because the information on which normal distribution generated each point is lost — something the EM algorithm can deal with.
The normal distributions share the same isotropic variance. The coherent point-drift algorithm includes the variance as an estimated parameter. This requires adding yet another component to the mixture model. The choice here is an improper uniform "distribution". It is here that the probabilistic interpretability of the algorithm suffers somewhat. However, from the viewpoint of likelihood maximization there are no problems.
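As a sketch of the resulting model — with names and conventions chosen to match the derivation below — the mixture density at a point combines $M$ normal components with shared isotropic variance and a uniform component with weight $w$:

```python
import numpy as np

def mixture_density(x, centroids, sigma2, w, n_scene):
    """Density of the mixture at a point x in R^D.

    Each of the M centroids carries weight (1 - w) / M; the improper
    uniform component carries weight w and constant density 1 / n_scene.
    """
    m, d = centroids.shape
    sq_dist = np.sum((centroids - x)**2, axis=1)
    normal = np.exp(-sq_dist / (2 * sigma2)) / (2 * np.pi * sigma2)**(d / 2)
    return (1 - w) / m * normal.sum() + w / n_scene

q = np.array([[0.0, 0.0], [1.0, 0.0]])
print(mixture_density(np.array([0.5, 0.0]), q, sigma2=0.1, w=0.1, n_scene=100))
```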
| Meaning | Notation |
|---|---|
| Model point-set | $Q = \{q_1, \dots, q_M\} \subset \mathbb{R}^D$ |
| Scene point-set | $S = \{s_1, \dots, s_N\} \subset \mathbb{R}^D$ |
| Set of transformation parameters | $A$ |
| Set of transformations | $\{T_a \colon \mathbb{R}^D \to \mathbb{R}^D \mid a \in A\}$ |
| Initial transformation parameter | $a^{(0)} \in A$ |
| Initial variance parameter | $(\sigma^2)^{(0)} \in \mathbb{R}_{> 0}$ |
| Noise tolerance | $w \in [0, 1)$ |
| Name | Notation |
|---|---|
| Transformation parameter | $a \in A$ |
| Variance parameter | $\sigma^2 \in \mathbb{R}_{> 0}$ |
Here the transformed model point-set $T_a(Q) = \{T_a(q_1), \dots, T_a(q_M)\}$ matches the scene point-set $S$ well — in some sense.
Let $(X_1, K_1), \dots, (X_N, K_N)$ be i.i.d. random elements — the actual data — such that

$$\mathbb{P}(K_n = m) =
\begin{cases}
\frac{1 - w}{M}, & m \in [1, M], \\
w, & m = M + 1,
\end{cases}$$

for all $n \in [1, N]$ and $m \in [1, M + 1]$. Here $K_n$ is the index of the mixture component which generates the point $X_n$, and the components $m \in [1, M]$ are normal:

$$f_{a, \sigma^2}(x \mid K_n = m) = \frac{1}{(2 \pi \sigma^2)^{D / 2}} \exp\left(-\frac{\lVert x - T_a(q_m) \rVert^2}{2 \sigma^2}\right).$$

The missing component is an improper uniform distribution:

$$f_{a, \sigma^2}(x \mid K_n = M + 1) = \frac{1}{N}.$$

We assume the scene point-set to be a realization of the actual data — there exists $\omega \in \Omega$ such that $s_n = X_n(\omega)$ for all $n \in [1, N]$ — and that the observation function is such that the component indices are lost: $g((x_1, k_1), \dots, (x_N, k_N)) = (x_1, \dots, x_N)$.
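A sketch of this generative model, with illustrative names; sampling the component index first and then the point shows exactly which information the observation function discards. Since the improper uniform distribution cannot be sampled, a bounding box stands in for it here:

```python
import numpy as np

def sample_scene(q_transformed, sigma2, w, n_points, rng):
    """Sample n_points from the mixture; q_transformed is (M, D).

    Returns both points and component indices, but only the points
    are 'observed' in the registration problem. The improper uniform
    component is approximated by a box around the centroids.
    """
    m, d = q_transformed.shape
    points, indices = [], []
    lo, hi = q_transformed.min(0) - 1, q_transformed.max(0) + 1
    for _ in range(n_points):
        if rng.random() < w:
            indices.append(m)                 # noise component
            points.append(rng.uniform(lo, hi))
        else:
            k = rng.integers(0, m)            # centroid index, lost by g
            indices.append(k)
            points.append(rng.normal(q_transformed[k], np.sqrt(sigma2)))
    return np.array(points), np.array(indices)
```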
The probability density of $(X_n, K_n)$ is given by

$$f_{a, \sigma^2}(x, m) = \frac{1 - w}{M} \cdot \frac{1}{(2 \pi \sigma^2)^{D / 2}} \exp\left(-\frac{\lVert x - T_a(q_m) \rVert^2}{2 \sigma^2}\right)$$

for $m \in [1, M]$, and

$$f_{a, \sigma^2}(x, M + 1) = \frac{w}{N}.$$

The logarithm of the former is

$$\log f_{a, \sigma^2}(x, m) = \log \frac{1 - w}{M} - \frac{D}{2} \log(2 \pi \sigma^2) - \frac{\lVert x - T_a(q_m) \rVert^2}{2 \sigma^2}.$$
As a marginal distribution, we have that

$$f_{a, \sigma^2}(x) = \sum_{m = 1}^{M + 1} f_{a, \sigma^2}(x, m) = \frac{1 - w}{M} \sum_{m = 1}^{M} \frac{1}{(2 \pi \sigma^2)^{D / 2}} \exp\left(-\frac{\lVert x - T_a(q_m) \rVert^2}{2 \sigma^2}\right) + \frac{w}{N}$$

for all $x \in \mathbb{R}^D$. Substituting location-index and location, the conditional probability of the component index is

$$\mathbb{P}(K_n = m \mid X_n = s_n) = \frac{f_{a, \sigma^2}(s_n, m)}{f_{a, \sigma^2}(s_n)}.$$
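These conditional probabilities form the E-step. A vectorized sketch, with names matching the earlier snippets; the entry at $(n, m)$ of the returned matrix is $\mathbb{P}(K_n = m \mid X_n = s_n)$ for the normal components, after cancelling the factors shared by numerator and denominator:

```python
import numpy as np

def posterior(scene, q_transformed, sigma2, w):
    """E-step: P(K_n = m | X_n = s_n) for the M normal components.

    scene is (N, D), q_transformed is (M, D); returns an (N, M) matrix.
    The column for the uniform component is 1 - p.sum(axis=1).
    """
    n, d = scene.shape
    m = q_transformed.shape[0]
    sq_dist = np.sum((scene[:, None, :] - q_transformed[None, :, :])**2, axis=2)
    normal = np.exp(-sq_dist / (2 * sigma2))  # unnormalized, (N, M)
    # Constant contributed by the uniform component, after cancellation.
    c = (2 * np.pi * sigma2)**(d / 2) * (w / (1 - w)) * (m / n)
    return normal / (normal.sum(axis=1, keepdims=True) + c)
```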
The conditional expectation of $\log f_{a, \sigma^2}(X_n, K_n)$ given $X_n = s_n$ is given by

$$\mathbb{E}\left[\log f_{a, \sigma^2}(X_n, K_n) \mid X_n = s_n\right] = \sum_{m = 1}^{M + 1} \mathbb{P}(K_n = m \mid X_n = s_n) \log f_{a, \sigma^2}(s_n, m).$$

The conditional expectation of $\sum_{n = 1}^{N} \log f_{a, \sigma^2}(X_n, K_n)$ given $(X_1, \dots, X_N) = (s_1, \dots, s_N)$ is given by

$$\sum_{n = 1}^{N} \sum_{m = 1}^{M + 1} \mathbb{P}(K_n = m \mid X_n = s_n) \log f_{a, \sigma^2}(s_n, m),$$

by the independence of the pairs. Let us denote

$$P_{n m} = \mathbb{P}(K_n = m \mid X_n = s_n), \qquad N_P = \sum_{n = 1}^{N} \sum_{m = 1}^{M} P_{n m}.$$
Since the first and the last term of the joint conditional expectation do not depend on $a$ or $\sigma^2$, maximizing it is equivalent to maximizing

$$-\sum_{n = 1}^{N} \sum_{m = 1}^{M} P_{n m} \left(\frac{D}{2} \log(2 \pi \sigma^2) + \frac{\lVert s_n - T_a(q_m) \rVert^2}{2 \sigma^2}\right).$$

However, this does not seem to be possible in closed form, since the $P_{n m}$ also depend on $a$ and $\sigma^2$. Instead — following the EM algorithm — we assume constant parameters $(a^{(k)}, (\sigma^2)^{(k)})$ in the $P_{n m}$, and minimize

$$E(a, \sigma^2) = \frac{1}{2 \sigma^2} \sum_{n = 1}^{N} \sum_{m = 1}^{M} P_{n m} \lVert s_n - T_a(q_m) \rVert^2 + \frac{N_P D}{2} \log \sigma^2,$$

where the additive constant $\frac{N_P D}{2} \log(2 \pi)$ has been dropped.
A necessary condition for minimizing $E$ is that the partial derivative of $E$ with respect to $\sigma^2$ is zero:

$$\frac{\partial E}{\partial (\sigma^2)} = -\frac{1}{2 \sigma^4} \sum_{n = 1}^{N} \sum_{m = 1}^{M} P_{n m} \lVert s_n - T_a(q_m) \rVert^2 + \frac{N_P D}{2 \sigma^2} = 0.$$

This can be used to solve for $\sigma^2$:

$$\sigma^2 = \frac{1}{N_P D} \sum_{n = 1}^{N} \sum_{m = 1}^{M} P_{n m} \lVert s_n - T_a(q_m) \rVert^2.$$
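As a sketch, this update is a posterior-weighted mean squared distance; names match the earlier snippets:

```python
import numpy as np

def update_sigma2(scene, q_transformed, p):
    """M-step variance update: posterior-weighted mean squared distance.

    scene is (N, D), q_transformed is (M, D), p is the (N, M) posterior.
    """
    d = scene.shape[1]
    sq_dist = np.sum((scene[:, None, :] - q_transformed[None, :, :])**2, axis=2)
    return np.sum(p * sq_dist) / (np.sum(p) * d)
```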
Substituting back, we get that we must minimize

$$\frac{N_P D}{2} \log\left(\frac{1}{N_P D} \sum_{n = 1}^{N} \sum_{m = 1}^{M} P_{n m} \lVert s_n - T_a(q_m) \rVert^2\right) + \frac{N_P D}{2}.$$

Since $\log$ is increasing, minimizing this is equivalent to minimizing

$$\sum_{n = 1}^{N} \sum_{m = 1}^{M} P_{n m} \lVert s_n - T_a(q_m) \rVert^2.$$

Therefore,

$$a^{(k + 1)} \in \underset{a \in A}{\operatorname{arg\,min}} \sum_{n = 1}^{N} \sum_{m = 1}^{M} P_{n m} \lVert s_n - T_a(q_m) \rVert^2.$$
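For the full affine group — one of the subgroups mentioned below — this weighted least-squares problem has a closed-form solution: eliminating the translation by centering at the posterior-weighted centroids reduces it to a linear system. A sketch, with names matching the earlier snippets:

```python
import numpy as np

def update_affine(scene, q, p):
    """M-step for the affine group: minimize sum_nm P_nm ||s_n - (B q_m + t)||^2.

    scene is (N, D), q is the untransformed model point-set (M, D), p is (N, M).
    Returns the matrix B and translation t of the closed-form minimizer;
    assumes the weighted covariance of the model point-set is nonsingular.
    """
    n_p = p.sum()
    mu_s = p.sum(axis=1) @ scene / n_p  # weighted scene centroid
    mu_q = p.sum(axis=0) @ q / n_p      # weighted model centroid
    s_hat, q_hat = scene - mu_s, q - mu_q
    # Weighted cross- and auto-covariances of the centered point-sets.
    cross = s_hat.T @ p @ q_hat
    auto = q_hat.T @ np.diag(p.sum(axis=0)) @ q_hat
    b = cross @ np.linalg.inv(auto)
    t = mu_s - b @ mu_q
    return b, t
```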
The solution of this problem for many subgroups of affine transformations can be found here.
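Assuming the sketch functions from the previous snippets are in scope, the whole algorithm is then a loop that alternates the E-step with the two M-step updates. The noise tolerance and iteration count below are illustrative choices, and the initial variance follows the common choice of the mean squared distance between the point-sets:

```python
import numpy as np

def register_affine(scene, q, w=0.1, iterations=50):
    """Coherent point drift for the affine group, built from the sketches above."""
    n, d = scene.shape
    b, t = np.eye(d), np.zeros(d)  # initial transformation parameter
    q_t = q @ b.T + t
    # Initial variance: mean squared distance between the point-sets.
    sigma2 = np.sum((scene[:, None, :] - q_t[None, :, :])**2) / (d * n * q.shape[0])
    for _ in range(iterations):
        p = posterior(scene, q_t, sigma2, w)   # E-step
        b, t = update_affine(scene, q, p)      # M-step: transformation
        q_t = q @ b.T + t
        sigma2 = update_sigma2(scene, q_t, p)  # M-step: variance
    return b, t
```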