8 Majorization Inequalities

8.1 Introduction

8.2 The AM/GM Inequality

8.2.2 Absolute Values

Suppose the problem we have to solve is to minimize f(x)=\sum_{i=1}^n w_i|h_i(x)| over x\in X. Here h_i(x) is supposed to be differentiable. In statistics it typically is a residual, for instance h_i(x)=y_i-z_i'x. Suppose, for the time being, that h_i(y)\not=0. Then we have the majorization \sum_{i=1}^n w_i|h_i(x)|\leq\frac{1}{2}\sum_{i=1}^n\frac{w_i}{|h_i(y)|}\left(h_i^2(x)+h_i^2(y)\right), and we must minimize g(x,y):=\sum_{i=1}^n\frac{w_i}{|h_i(y)|}h_i^2(x), which is a weighted least squares problem.
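For the regression case h_i(x)=y_i-z_i'x this is just iteratively reweighted least squares. Here is a minimal R sketch (the function ladMM and its arguments are our own illustration, not code from the text); it assumes that no residual becomes exactly zero during the iterations.

ladMM <- function (z, y, w = rep (1, length (y)), itmax = 100, eps = 1e-10) {
    # start from the weighted least squares solution
    x <- solve (crossprod (z, w * z), crossprod (z, w * y))
    for (k in 1:itmax) {
        r <- drop (y - z %*% x)
        u <- w / abs (r)                                           # majorization weights w_i / |h_i(y)|
        xNew <- solve (crossprod (z, u * z), crossprod (z, u * y)) # weighted least squares update
        if (max (abs (xNew - x)) < eps) break
        x <- xNew
    }
    drop (x)
}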

The simplest case of this is the one-dimensional example h_i(x)=y_i-x, which means we want to compute the weighted median. The algorithm is simply x^{(k+1)}=\frac{\sum_{i=1}^n u_i(x^{(k)})y_i}{\sum_{i=1}^n u_i(x^{(k)})}, where u_i(x)=\frac{w_i}{|y_i-x|}.

We have assumed, so far, in this example that h_i(y)\not=0. If h_i(y)=0 at some point in the iterative process then the majorization function does not exist, and we cannot compute the update. One easy way out of this problem is to minimize f_\epsilon(x)=\sum_{i=1}^n w_i\sqrt{h_i^2(x)+\epsilon^2} for small values of \epsilon. Clearly if \epsilon_1>\epsilon_2>0 then f_{\epsilon_1}(x)>f_{\epsilon_2}(x)>f(x) for all x. It follows that \min_{x\in X}f_{\epsilon_1}(x)\geq\min_{x\in X}f_{\epsilon_2}(x)\geq\min_{x\in X}f(x).

The function f_\epsilon is differentiable. We find \mathcal{D}f_\epsilon(x)=\sum_{i=1}^n w_i\frac{h_i(x)}{\sqrt{h_i^2(x)+\epsilon^2}}\mathcal{D}h_i(x), and \mathcal{D}^2f_\epsilon(x)=\sum_{i=1}^n w_i\frac{1}{\sqrt{h_i^2(x)+\epsilon^2}} \left\{\frac{\epsilon^2}{h_i^2(x)+\epsilon^2}\left(\mathcal{D}h_i(x)\right)^2+h_i(x)\mathcal{D}^2h_i(x)\right\}. With obvious modifications the same formulas apply if x is a vector of unknowns, for instance if h(x)=y-Zx.
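As a quick numerical check of the gradient formula, note that in the linear case h(x)=y-Zx we have \mathcal{D}h_i(x)=-z_i, so \mathcal{D}f_\epsilon(x)=-Z'u with u_i=w_ih_i(x)/\sqrt{h_i^2(x)+\epsilon^2}. The following R fragment (our own illustration, with made-up data) compares this with a central finite-difference approximation.

fEps <- function (x, z, y, w, eps) {
    h <- drop (y - z %*% x)
    sum (w * sqrt (h ^ 2 + eps ^ 2))
}
gEps <- function (x, z, y, w, eps) {
    h <- drop (y - z %*% x)
    -drop (crossprod (z, w * h / sqrt (h ^ 2 + eps ^ 2)))
}
set.seed (1)
z <- matrix (rnorm (20), 10, 2)
y <- rnorm (10)
w <- runif (10)
x0 <- c (1, -1)
numGrad <- sapply (1:2, function (j) {
    e <- replace (rep (0, 2), j, 1e-6)
    (fEps (x0 + e, z, y, w, 0.1) - fEps (x0 - e, z, y, w, 0.1)) / 2e-6
})
all.equal (numGrad, gEps (x0, z, y, w, 0.1), tolerance = 1e-6)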

By the implicit function theorem the function x(\epsilon) defined by \mathcal{D}f_\epsilon(x(\epsilon))=0 is differentiable, with derivative \mathcal{D}x(\epsilon)=\epsilon\frac{\sum_{i=1}^nw_i\left[h_i^2(x(\epsilon))+\epsilon^2\right]^{-\frac32}h_i(x(\epsilon))\mathcal{D}h_i(x(\epsilon))}{\mathcal{D}^2f_\epsilon(x(\epsilon))}.

For the weighted median the iterates are still the same weighted averages, but now with weights u_i(x,\epsilon)=\frac{w_i}{\sqrt{(y_i-x)^2+\epsilon^2}}. Differentiating the algorithmic map gives the convergence ratio \kappa_\epsilon(x)\mathop{=}\limits^{\Delta}\frac{\sum_{i=1}^n u_i(x,\epsilon)\frac{(y_i-x)^2}{(y_i-x)^2+\epsilon^2}}{\sum_{i=1}^n u_i(x,\epsilon)}. Clearly \min_i\frac{(y_i-x)^2}{(y_i-x)^2+\epsilon^2}\leq\kappa_\epsilon(x)\leq\max_i\frac{(y_i-x)^2}{(y_i-x)^2+\epsilon^2}, which implies \kappa_\epsilon(x)< 1. If y_i\not=x for all i, then \lim_{\epsilon\rightarrow 0}\kappa_\epsilon(x)=1 and convergence is asymptotically sublinear.
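To illustrate, here is a minimal R sketch of the smoothed weighted-median iteration (our own illustration, not the mediJan.R code referred to below); it returns the solution together with the ratio \kappa_\epsilon evaluated there. Rerunning it with smaller values of \epsilon shows the ratio creeping up towards one, i.e. slower and slower linear convergence.

medianEps <- function (y, w = rep (1, length (y)), eps = 1e-6, itmax = 1000, tol = 1e-12) {
    x <- sum (w * y) / sum (w)                     # weighted mean as a starting point
    for (k in 1:itmax) {
        u <- w / sqrt ((y - x) ^ 2 + eps ^ 2)      # weights u_i(x, eps)
        xNew <- sum (u * y) / sum (u)              # weighted average update
        if (abs (xNew - x) < tol) break
        x <- xNew
    }
    kappa <- sum (u * (y - x) ^ 2 / ((y - x) ^ 2 + eps ^ 2)) / sum (u)  # convergence ratio
    list (x = x, kappa = kappa, iterations = k)
}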


Insert mediJan.R Here

8.2.3 Gini Mean Difference

Alternatively, we can minimize the Gini Mean Difference of the f_i(\theta). Now

\begin{multline*} \sum_{i=1}^n\sum_{j=1}^n\mid f_i(\theta)-f_j(\theta)\mid \leq\\ \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n\frac{1}{\mid f_i(\xi)-f_j(\xi)\mid} (f_i(\theta)-f_j(\theta))^2 + \text{terms}, \end{multline*}

which can be rewritten as \cdots=\sum_{i=1}^n\sum_{j=1}^n w_{ij}(\xi)f_i(\theta)f_j(\theta)+\text{terms}, minimization of which is a weighted least squares problem.
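As an illustration, here is a rough R sketch for the special case in which f_i(\theta)=y_i-z_i'\theta is a regression residual (the setup and the function giniMM are our own assumptions, not from the text). Each iteration solves the weighted least squares problem, written here with the Laplacian-type matrix of the weights. Since the Gini mean difference is translation invariant, z should not contain an intercept column, and tied residuals are given zero weight to keep the update defined.

giniMM <- function (z, y, theta = rep (0, ncol (z)), itmax = 100, eps = 1e-8) {
    for (k in 1:itmax) {
        r <- drop (y - z %*% theta)
        a <- abs (outer (r, r, "-"))
        w <- ifelse (a > eps, 1 / a, 0)       # majorization weights 1 / |f_i - f_j| at the current point
        l <- diag (rowSums (w)) - w           # Laplacian form of the weight matrix
        thetaNew <- drop (solve (crossprod (z, l %*% z), crossprod (z, l %*% y)))
        if (max (abs (thetaNew - theta)) < eps) break
        theta <- thetaNew
    }
    theta
}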

8.2.4 Location Problems

The Fermat-Weber problem is to find a point x such that the weighted sum of the Euclidean distances to m given points y_1,\cdots,y_m is minimized. Thus our loss function is f(x)=\sum_{j=1}^m w_jd(x,y_j), where the w_j are positive weights. Other names are the single facility location problem or the spatial median problem.

An early iterative algorithm to solve the Fermat-Weber problem was proposed by Weiszfeld (1937). For a translation see Weiszfeld and Plastria (2009).

Here we show how to use the arithmetic mean-geometric mean inequality for majorization. Suppose our problem is to minimize f(X)=\sum_{i=1}^n\sum_{j=1}^n w_{ij}d_{ij}(X), where the w_{ij} are non-negative weights, and the d_{ij}(X) are again Euclidean distances. This is a location problem. To make it interesting, we suppose that some of the points (facilities) are fixed, others are the variables we have to minimize over. Observe that this is a convex, but non-differentiable, optimization problem.

We use the AM-GM inequality in the form d_{ij}(X)d_{ij}(Y)\leq\frac{1}{2}(d_{ij}^2(X)+d_{ij}^2(Y)). If d_{ij}(Y)>0 then d_{ij}(X)\leq\frac{1}{2}\frac{d_{ij}^2(X)+d_{ij}^2(Y)} {d_{ij}(Y)}. Using the notation from Example a.a we now find f(X)\leq\frac{1}{2}(\hbox{tr }X'B(Y)X+\hbox{tr }Y'B(Y)Y), which gives us a quadratic majorization.

If X is partitioned into X_1 and X_2, with rows that are fixed and rows that are to be determined (facilities that have to be located), and B is partitioned correspondingly, then the algorithm we find is X_2^{(k+1)}=-B_{22}(X^{(k)})^{-1}B_{21}(X^{(k)})X_1.
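Here is a minimal R sketch of the resulting algorithm (our own illustration). Following the Laplacian-type convention used for B(X) in the MDS section below, B(Y) is taken to have off-diagonal elements -w_{ij}/d_{ij}(Y) and zero row sums, which is where the minus sign in the update comes from. The rows of x1 are the fixed points, the rows of x2 the facilities to be located, w is the full symmetric weight matrix, and pairs at zero distance get zero weight to keep the update defined.

locationMM <- function (x1, x2, w, itmax = 100, eps = 1e-10) {
    free <- nrow (x1) + (1:nrow (x2))                  # row indices of the facilities to locate
    for (k in 1:itmax) {
        x <- rbind (x1, x2)
        d <- as.matrix (dist (x))
        cw <- ifelse (d > 0, w / d, 0)                 # elements w_ij / d_ij(X), zero on the diagonal
        b <- diag (rowSums (cw)) - cw                  # the matrix B(X)
        x2New <- -solve (b[free, free], b[free, -free] %*% x1)
        if (max (abs (x2New - x2)) < eps) break
        x2 <- x2New
    }
    x2
}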

8.2.5 The Lasso and the Bridge

8.3 Polar Norms and the Cauchy-Schwarz Inequality

8.3.1 Rayleigh Quotient

Rewrite for minimizing 02/22/15, by maximizing x'Bx over x'Ax=1.

We go back to maximizing the Rayleigh quotient \lambda(x)=\frac{x'Ax}{x'Bx}, where we now assume that both A and B are positive definite. Maximizing \lambda is equivalent to maximizing \sqrt{x'Ax} under the condition that \sqrt{x'Bx}=1. By Cauchy-Schwarz \sqrt{x'Ax}\geq\frac{1}{\sqrt{y'Ay}}x'Ay, and thus in the majorization step we maximize x'Ay over x'Bx=1, where y is the current iterate. The maximizer is proportional to B^{-1}Ay, i.e. we have shown global convergence of the power method for computing the largest generalized eigenvalue.
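A minimal R sketch of the resulting iteration (the function genPower and its arguments are ours): each step computes B^{-1}Ax and rescales it so that the constraint x'Bx=1 holds; since x'Bx=1 at the solution, the Rayleigh quotient there is just x'Ax.

genPower <- function (a, b, x, itmax = 100, eps = 1e-10) {
    for (k in 1:itmax) {
        z <- solve (b, a %*% x)                                  # direction B^{-1} A x
        xNew <- drop (z / sqrt (drop (crossprod (z, b %*% z))))  # rescale so that x'Bx = 1
        if (max (abs (xNew - x)) < eps) break
        x <- xNew
    }
    list (x = x, lambda = drop (crossprod (x, a %*% x)))
}

A small check with made-up positive definite matrices compares the result with the largest eigenvalue of B^{-1}A:

set.seed (2)
a <- crossprod (matrix (rnorm (100), 20, 5))
b <- crossprod (matrix (rnorm (100), 20, 5)) + diag (5)
genPower (a, b, rnorm (5))$lambda
max (Re (eigen (solve (b, a))$values))   # should be (nearly) the same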

We can also establish the linear convergence rate quite easily, using Ostrowski (1966). For definiteness we normalize in each iteration, and set \mathcal{A}(\omega)=\frac{B^{-1}A\omega}{\|B^{-1}A\omega\|}. At a point \omega with B^{-1}A\omega=\lambda_1\omega and \omega'\omega=1 the derivative of the algorithmic map is \mathcal{M}(\omega)=\frac{1}{\lambda_1}(I-\omega\omega')B^{-1}A. It follows that \mathcal{M}(\omega) has eigenvalues 0 and \frac{\lambda_s}{\lambda_1}, with \lambda_s the ``remaining'' eigenvalues of B^{-1}A. Thus if \lambda_1 is the largest eigenvalue, we find a linear rate of \rho=\frac{\lambda_2}{\lambda_1}.

There are several things in this analysis which may go wrong, and they are all quite instructive.

8.3.2 The Majorization Method for MDS

The first is an algorithm for multidimensional scaling, developed by De Leeuw (1977). We want to minimize \sigma(X)=\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m w_{ij}(\delta_{ij}-d_{ij}(X))^2, with d_{ij}(X) again Euclidean distance, i.e. d_{ij}(X)=\sqrt{(x_i-x_j)'(x_i-x_j)}. We suppose the weights w_{ij} and dissimilarities \delta_{ij} are symmetric and hollow (zero diagonal), and satisfy \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m w_{ij}\delta_{ij}^2=1.

We now define the following objects \begin{align} \eta^2(X)&\mathop{=}\limits^{\Delta}\sum_{i=1}^m\sum_{j=1}^m w_{ij}d_{ij}^2(X),\\ \rho(X)&\mathop{=}\limits^{\Delta}\sum_{i=1}^m\sum_{j=1}^m w_{ij}\delta_{ij}d_{ij}(X). \end{align} Thus \sigma(X)=1-\rho(X)+\frac{1}{2}\eta^2(X).

The next step is to use matrices. Let \begin{align} v_{ij}&= \begin{cases}-w_{ij}& \text{if $i\not= j$,}\\ \sum_{k\not=i}^m w_{ik}& \text{if $i=j$,} \end{cases}\\ b_{ij}(X)&= \begin{cases} -\frac{w_{ij}\delta_{ij}}{d_{ij}(X)}& \text{if $i\not= j$,}\\ \sum_{k\not=i}^m\frac{w_{ik}\delta_{ik}}{d_{ik}(X)}& \text{if $i=j$.} \end{cases} \end{align} Now \sigma(X)=1-\text{tr }X'B(X)X+\frac{1}{2}\text{tr }X'VX. By Cauchy-Schwarz, d_{ij}(X)\geq\frac{(x_i-x_j)'(y_i-y_j)}{d_{ij}(Y)}, which implies \text{tr }X'B(X)X\geq\text{tr }X'B(Y)Y.

Now let \overline X=V^+B(X)X. This is called the Guttman transform of a matrix X. Using this transform we see that for all pairs of configurations (X,Y) \begin{align} \sigma(X)&\leq 1-\text{tr }X'B(Y)Y+\frac{1}{2}\text{tr }X'VX=\\ &=1-\text{tr }X'V\overline Y+\frac{1}{2}\text{tr }X'VX=\\ &=1-\frac{1}{2}\text{tr }\overline Y'V\overline Y+ \frac{1}{2}\text{tr }(X-\overline Y)'V(X-\overline Y), \end{align}

while for all configurations X we have \sigma(X)=1-\frac{1}{2}\text{tr }\overline X'V\overline X+ \frac{1}{2}\text{tr }(X-\overline X)'V(X-\overline X).
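The resulting algorithm simply replaces X by its Guttman transform until convergence. Here is a minimal R sketch (our own illustration; delta and w are the symmetric hollow dissimilarity and weight matrices, x is a starting configuration, MASS::ginv is used for the Moore-Penrose inverse V^+, and zero distances get zero weight in B(X)).

smacofMM <- function (delta, w, x, itmax = 100, eps = 1e-10) {
    v <- diag (rowSums (w)) - w                     # the matrix V
    vinv <- MASS::ginv (v)                          # Moore-Penrose inverse V^+
    for (k in 1:itmax) {
        d <- as.matrix (dist (x))
        bw <- ifelse (d > 0, (w * delta) / d, 0)    # elements w_ij delta_ij / d_ij(X)
        bmat <- diag (rowSums (bw)) - bw            # the matrix B(X)
        xbar <- vinv %*% bmat %*% x                 # Guttman transform
        if (max (abs (xbar - x)) < eps) break
        x <- xbar
    }
    x
}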

8.4 Conjugates and Young’s Inequality

Example: Let 0<r<2, p=\frac{2}{r}, and q=\frac{2}{2-r}. Then the previous result, applied to x^r and y^{2-r} with x,y\geq 0, becomes x^ry^{2-r}\leq\frac{r}{2}x^2+\frac{2-r}{2}y^2, with equality if and only if x=y. For y>0, dividing by y^{2-r} gives the quadratic majorization x^r\leq\frac{1}{y^{2-r}}\left\{\frac{r}{2}x^2+\frac{2-r}{2}y^2\right\}, valid for all 0<r<2, with equality at x=y.

showMe <- function (b, r, up = 5) {
    # grid on [0, up] and the target function f(a) = a ^ r
    a <- seq (0, up, length.out = 100)
    ar <- a ^ r
    plot (a, ar, type = "l", col = "RED",
          ylab="f(a)", lwd = 2)
    # quadratic majorizer of a ^ r with support point b, from the
    # AM/GM (Young) inequality
    br <- (r * (a ^ 2)) / 2 + (((2 - r)  * (b ^ 2))/ 2)
    br <- br / (b ^ (2 - r))
    lines (a, br, col = "GREEN", lwd = 2)
    # the two curves touch at the support point a = b
    abline (v = b, lwd = 2)
}

Here is an example with y=2 and r=1.5, i.e. the call showMe (2, 1.5).

One important application of these results is majorization of powers of Euclidean distances (\sqrt{x'Ax})^r by quadratic forms. We find, for 0<r<2, (\sqrt{x'Ax})^r\leq\frac{1} {(\sqrt{y'Ay})^{2-r}}\left\{\frac{r}{2}x'Ax+\frac{2-r}{2}y'Ay\right\}.
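For instance, if f(X)=\sum_{i=1}^n\sum_{j=1}^n w_{ij}d_{ij}^r(X) with 0<r<2, then, provided d_{ij}(Y)>0 whenever w_{ij}>0, we get f(X)\leq\sum_{i=1}^n\sum_{j=1}^n\frac{w_{ij}}{(d_{ij}(Y))^{2-r}}\left\{\frac{r}{2}d_{ij}^2(X)+\frac{2-r}{2}d_{ij}^2(Y)\right\}, so that each majorization step is again a quadratic problem of the type treated in the location and MDS sections above.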

8.4.1 Support Vector Machines