8 Majorization Inequalities

8.1 Introduction

8.2 The AM/GM Inequality

8.2.2 Absolute Values

Suppose the problem we have to solve is to minimize f(x)=\sum_{i=1}^n w_i|h_i(x)| over x\in\mathcal{X}. Here h_i(x) is supposed to be differentiable. In statistics it typically is a residual, for instance h_i(x)=y_i-z_i'x. Suppose, for the time being, that h_i(y)\not=0. Then we have the majorization \sum_{i=1}^n w_i|h_i(x)|\leq\frac{1}{2}\sum_{i=1}^n\frac{w_i}{|h_i(y)|}\left(h_i^2(x)+h_i^2(y)\right), and we must minimize g(x,y):=\sum_{i=1}^n\frac{w_i}{|h_i(y)|}h_i^2(x), which is a weighted least squares problem.

The simplest case is the one-dimensional example h_i(x)=y_i-x, which means we want to compute the weighted median. The algorithm is simply x^{(k+1)}=\frac{\sum_{i=1}^n u_i(x^{(k)})y_i}{\sum_{i=1}^n u_i(x^{(k)})}, where u_i(x)=\frac{w_i}{|y_i-x|}.

We have assumed, so far, in this example that h_i(y)\not=0. If h_i(y)=0 at some point in the iterative process then the majorization function does not exist, and we cannot compute the update. One easy way out of this problem is to minimize f_\epsilon(x)=\sum_{i=1}^n w_i\sqrt{h_i^2(x)+\epsilon^2} for small values of \epsilon. Clearly if \epsilon_1>\epsilon_2 then \min_x f_{\epsilon_1}(x)=f_{\epsilon_1}(x_1)>f_{\epsilon_2}(x_1)\geq\min_x f_{\epsilon_2}(x). It follows that \min_x f_\epsilon(x) is an increasing function of \epsilon.

The function f_\epsilon is differentiable. We find Df_\epsilon(x)=\sum_{i=1}^n w_i\frac{h_i(x)}{\sqrt{h_i^2(x)+\epsilon^2}}Dh_i(x), and D^2f_\epsilon(x)=\sum_{i=1}^n w_i\frac{1}{\sqrt{h_i^2(x)+\epsilon^2}}\left\{\frac{\epsilon^2}{h_i^2(x)+\epsilon^2}(Dh_i(x))^2+h_i(x)D^2h_i(x)\right\}. With obvious modifications the same formulas apply if x is a vector of unknowns, for instance if h(x)=y-Zx.
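Purely as a sanity check, here is a small R sketch (the data and function names are made up for illustration) comparing this gradient with a central difference approximation in the one-dimensional case h_i(x)=y_i-x, so that Dh_i(x)=-1.

fEps <- function (x, y, w, eps) {
    sum (w * sqrt ((y - x) ^ 2 + eps ^ 2))
}
dfEps <- function (x, y, w, eps) {
    # Df_eps with h_i(x) = y_i - x and Dh_i(x) = -1
    -sum (w * (y - x) / sqrt ((y - x) ^ 2 + eps ^ 2))
}
y <- c (1, 3, 7)
w <- c (1, 2, 1)
c (analytic = dfEps (2, y, w, 1e-3),
   numeric = (fEps (2 + 1e-6, y, w, 1e-3) - fEps (2 - 1e-6, y, w, 1e-3)) / 2e-6)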

By the implicit function theorem the function x(\epsilon) defined by Df_\epsilon(x(\epsilon))=0 is differentiable, with derivative Dx(\epsilon)=\epsilon\frac{\sum_{i=1}^n w_i\left[h_i^2(x(\epsilon))+\epsilon^2\right]^{-\frac{3}{2}}h_i(x(\epsilon))Dh_i(x(\epsilon))}{D^2f_\epsilon(x(\epsilon))}.

For the weighted median the iterates are still the same weighted averages, but now with weights u_i(x,\epsilon)=\frac{w_i}{\sqrt{(y_i-x)^2+\epsilon^2}}. Differentiating the algorithmic map gives the convergence ratio \kappa_\epsilon(x)\mathop{=}\limits^{\Delta}\frac{\sum_{i=1}^n u_i(x,\epsilon)\frac{(y_i-x)^2}{(y_i-x)^2+\epsilon^2}}{\sum_{i=1}^n u_i(x,\epsilon)}. Clearly \min_i\frac{(y_i-x)^2}{(y_i-x)^2+\epsilon^2}\leq\kappa_\epsilon(x)\leq\max_i\frac{(y_i-x)^2}{(y_i-x)^2+\epsilon^2}, which implies \kappa_\epsilon(x)<1. If y_i\not=x for all i, then \lim_{\epsilon\rightarrow 0}\kappa_\epsilon(x)=1 and convergence is asymptotically sublinear.


Insert mediJan.R Here
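The code in mediJan.R is not reproduced here. The following is only a minimal sketch of the \epsilon-smoothed weighted median iteration described above; the name weightedMedian, the starting point, and the defaults are illustrative.

weightedMedian <- function (y, w = rep (1, length (y)), eps = 1e-6,
                            itmax = 1000, tol = 1e-10) {
    x <- sum (w * y) / sum (w)                # start at the weighted mean
    for (k in 1:itmax) {
        u <- w / sqrt ((y - x) ^ 2 + eps ^ 2) # weights u_i(x, eps)
        xnew <- sum (u * y) / sum (u)         # weighted average update
        if (abs (xnew - x) < tol) break
        x <- xnew
    }
    x
}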

8.2.3 Gini Mean Difference

Alternatively, we can minimize the Gini Mean Difference of the f_i(\theta). Now

\sum_{i=1}^n\sum_{j=1}^n|f_i(\theta)-f_j(\theta)|\leq\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n\frac{1}{|f_i(\xi)-f_j(\xi)|}(f_i(\theta)-f_j(\theta))^2+\text{terms},

where the terms do not depend on \theta. This can be rewritten as \sum_{i=1}^n\sum_{j=1}^n w_{ij}(\xi)(f_i(\theta)-f_j(\theta))^2+\text{terms}, minimization of which is a weighted least squares problem.
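As an illustration, here is a minimal R sketch of this iteration for the residual choice f_i(\theta)=y_i-z_i'\theta mentioned earlier; the name giniRegression, the guard on the weights, and the assumption that z contains no intercept column (pairwise differences would eliminate it) are ours, not from the text.

giniRegression <- function (y, z, theta = rep (0, ncol (z)),
                            itmax = 100, tol = 1e-8) {
    ij <- t (combn (length (y), 2))                 # all pairs i < j
    dy <- y[ij[, 1]] - y[ij[, 2]]
    dz <- z[ij[, 1], , drop = FALSE] - z[ij[, 2], , drop = FALSE]
    for (k in 1:itmax) {
        r <- drop (dy - dz %*% theta)               # pairwise differences f_i - f_j
        u <- 1 / pmax (abs (r), 1e-10)              # majorization weights 1 / |f_i(xi) - f_j(xi)|
        # the constant factor 1/2 does not change the weighted least squares minimizer
        tnew <- solve (crossprod (dz, u * dz), crossprod (dz, u * dy))
        if (max (abs (tnew - theta)) < tol) break
        theta <- tnew
    }
    drop (theta)
}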

8.2.4 Location Problems

The Fermat-Weber problem is to find a point x\in\mathbb{R}^m such that the sum of the Euclidean distances to m given points y_1,\cdots,y_m is minimized. Thus our loss function is f(x)=\sum_{j=1}^m w_jd(x,y_j), where the w_j are positive weights. Other names are the single facility location problem or the spatial median problem.

An early iterative algorithm to solve the Fermat-Weber problem was proposed by Weiszfeld (1937). For a translation see Weiszfeld and Plastria (2009).

Here we show how to use the arithmetic mean-geometric mean inequality for majorization. Suppose our problem is to minimize f(X)=\sum_{i=1}^n\sum_{j=1}^n w_{ij}d_{ij}(X), where the w_{ij} are non-negative weights, and the d_{ij}(X) are again Euclidean distances. This is a location problem. To make it interesting, we suppose that some of the points (facilities) are fixed, while the others are the variables we have to minimize over. Observe that this is a convex, but non-differentiable, optimization problem.

We use the AM-GM inequality in the form d_{ij}(X)d_{ij}(Y)\leq\frac{1}{2}\left(d_{ij}^2(X)+d_{ij}^2(Y)\right). If d_{ij}(Y)>0 then d_{ij}(X)\leq\frac{1}{2}\frac{d_{ij}^2(X)+d_{ij}^2(Y)}{d_{ij}(Y)}. Using the notation from Example a.a we now find \phi(X)\leq\frac{1}{2}\left(\text{tr }X'B(Y)X+\text{tr }Y'B(Y)Y\right), which gives us a quadratic majorization.

If X is partitioned into X_1 and X_2, with rows which are fixed and rows which are to be determined (facilities which have to be located), and B is partitioned correspondingly, then the algorithm we find is X_2^{(k+1)}=-B_{22}(X^{(k)})^{-1}B_{21}(X^{(k)})X_1.
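Here is a minimal R sketch of that update, under illustrative assumptions: x1 holds the fixed points, x2 a starting configuration for the free facilities, w is the symmetric, hollow weight matrix for all points ordered with the fixed points first, all off-diagonal distances stay positive, and B_{22} is invertible. The name locateFacilities is ours.

locateFacilities <- function (x1, x2, w, itmax = 100, eps = 1e-10) {
    n1 <- nrow (x1)
    n2 <- nrow (x2)
    for (k in 1:itmax) {
        d <- as.matrix (dist (rbind (x1, x2)))
        b <- -w / (d + diag (n1 + n2))   # off-diagonal of B(X); diag(n1 + n2) only guards the zero diagonal of d
        diag (b) <- 0
        diag (b) <- -rowSums (b)
        i2 <- n1 + (1:n2)
        # update X2 = -B22^{-1} B21 X1
        xnew <- -solve (b[i2, i2, drop = FALSE], b[i2, 1:n1, drop = FALSE] %*% x1)
        if (max (abs (xnew - x2)) < eps) break
        x2 <- xnew
    }
    x2
}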

8.2.5 The Lasso and the Bridge

8.3 Polar Norms and the Cauchy-Schwarz Inequality

8.3.1 Rayleigh Quotient

Rewrite for minimizing 02/22/15, by maximizing x’Bx over x’Ax=1.

We go back to maximizing the Rayleigh quotient \lambda(x)=\frac{x'Ax}{x'Bx}, where we now assume that both A and B are positive definite. Maximizing \lambda is equivalent to maximizing \sqrt{x'Ax} on the condition that x'Bx=1. By Cauchy-Schwarz \sqrt{x'Ax}\geq\frac{1}{\sqrt{y'Ay}}x'Ay, and thus in the majorization step we maximize x'Ay over x'Bx=1. This defines an algorithmic map which makes the update of x proportional to B^{-1}Ax, i.e. we have shown global convergence of the power method for computing the largest generalized eigenvalue.

We can also establish the linear convergence rate quite easily, using Ostrowski (1966). For definiteness we normalize in each iteration, and set \mathcal{A}(\omega)=\frac{B^{-1}A\omega}{\|B^{-1}A\omega\|}. At a point \omega which has B^{-1}A\omega=\lambda_1\omega and \omega'\omega=1 we have, for the derivative of the algorithmic map, \mathcal{M}(\omega)=\frac{1}{\lambda_1}(I-\omega\omega')B^{-1}A. It follows that \mathcal{M} has eigenvalues 0 and \frac{\lambda_s}{\lambda_1}, with \lambda_s the "remaining" eigenvalues of B^{-1}A. Thus if \lambda_1 is the largest eigenvalue, we find a linear rate of \rho=\frac{\lambda_2}{\lambda_1}.
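A minimal R sketch of this iteration, assuming A and B are positive definite; the name powerGEV, its defaults, and the random starting point are illustrative.

powerGEV <- function (a, b, x = rnorm (nrow (a)), itmax = 1000, tol = 1e-10) {
    for (k in 1:itmax) {
        xnew <- drop (solve (b, a %*% x))     # update proportional to B^{-1} A x
        xnew <- xnew / sqrt (sum (xnew ^ 2))  # normalize so that x'x = 1
        if (max (abs (xnew - x)) < tol) break
        x <- xnew
    }
    # return the solution and its Rayleigh quotient
    list (x = x, lambda = sum (x * (a %*% x)) / sum (x * (b %*% x)))
}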

There are several things in this analysis which may go wrong, and they are all quite instructive.

8.3.2 The Majorization Method for MDS

The first is an algorithm for multidimensional scaling, developed by De Leeuw (1977). We want to minimize \sigma(X)=\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m w_{ij}(\delta_{ij}-d_{ij}(X))^2, with d_{ij}(X) again Euclidean distance, i.e. d_{ij}(X)=\sqrt{(x_i-x_j)'(x_i-x_j)}. We suppose weights w_{ij} and dissimilarities \delta_{ij} are symmetric and hollow (zero diagonal), and satisfy \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m w_{ij}\delta_{ij}^2=1. We now define the following objects \begin{align} \eta^2(X)&\mathop{=}\limits^{\Delta}\sum_{i=1}^m\sum_{j=1}^m w_{ij}d_{ij}(X)^2,\\ \rho(X)&\mathop{=}\limits^{\Delta}\sum_{i=1}^m\sum_{j=1}^m w_{ij}\delta_{ij}d_{ij}(X). \end{align} Thus \sigma(X)=1-\rho(X)+\frac{1}{2}\eta^2(X).

The next step is to use matrices. Let \begin{align} v_{ij}&= \begin{cases}-w_{ij}& \text{if $i\not= j,$}\\ \sum_{k\not=i}^m w_{ik}& \text{if $i=j$,} \end{cases}\\ b_{ij}(X)&= \begin{cases} -\frac{w_{ij}\delta_{ij}}{d_{ij}(X)}& \text{if $i\not= j,$}\\ \sum_{k\not=i}^m\frac{w_{ik}\delta_{ik}}{d_{ik}(X)}& \text{if $i=j$.} \end{cases} \end{align} Now \sigma(X)=1-\text{tr }X'B(X)X+\frac{1}{2}\text{tr }X'VX.

By Cauchy-Schwarz, d_{ij}(X)\geq\frac{(x_i-x_j)'(y_i-y_j)}{d_{ij}(Y)}, which implies \text{tr }X'B(X)X\geq\text{tr }X'B(Y)Y. Now let \overline X=V^+B(X)X. This is called the Guttman transform of a matrix X. Using this transform we see that for all pairs of configurations (X,Y) \begin{align} \sigma(X)&\leq 1-\text{tr }X'B(Y)Y+\frac{1}{2}\text{tr }X'VX\\ &=1-\text{tr }X'V\overline Y+\frac{1}{2}\text{tr }X'VX\\ &=1-\frac{1}{2}\text{tr }\overline Y'V\overline Y+ \frac{1}{2}\text{tr }(X-\overline Y)'V(X-\overline Y), \end{align}

while for all configurations X we have \sigma(X)=1-\frac{1}{2}\text{tr }\overline X'V\overline X+ \frac{1}{2}\text{tr }(X-\overline X)'V(X-\overline X).
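The following is a minimal R sketch of the resulting iteration X^{(k+1)}=V^+B(X^{(k)})X^{(k)}; the name smacofSketch, the stopping rule, and the assumption that all off-diagonal distances stay positive are illustrative, not from the text. Here delta and w are the (symmetric, hollow) dissimilarity and weight matrices, and x an initial n by p configuration.

smacofSketch <- function (delta, w, x, itmax = 100, eps = 1e-8) {
    n <- nrow (x)
    v <- -w
    diag (v) <- rowSums (w)            # the matrix V
    vinv <- solve (v + 1 / n) - 1 / n  # V^+ = (V + ee'/n)^{-1} - ee'/n
    stress <- function (x) {
        d <- as.matrix (dist (x))
        sum (w * (delta - d) ^ 2) / 2
    }
    sold <- stress (x)
    for (k in 1:itmax) {
        d <- as.matrix (dist (x))
        b <- -w * delta / (d + diag (n)) # off-diagonal of B(X); diag(n) only guards the zero diagonal of d
        diag (b) <- 0
        diag (b) <- -rowSums (b)
        x <- vinv %*% b %*% x            # the Guttman transform
        snew <- stress (x)
        if (sold - snew < eps) break
        sold <- snew
    }
    list (x = x, stress = snew)
}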

8.4 Conjugates and Young’s Inequality

Example: Let 0<r<2, p=\frac{2}{r}, and q=\frac{2}{2-r}. Then the previous result, applied to x^r and y^{2-r}, becomes x^ry^{2-r}\leq\frac{r}{2}x^2+\frac{2-r}{2}y^2, so that for y>0 we have x^r\leq\frac{1}{y^{2-r}}\left\{\frac{r}{2}x^2+\frac{2-r}{2}y^2\right\}, which provides us with a quadratic majorization for x^r for all 0<r<2. We have equality if and only if x=y.

showMe <- function (b, r, up = 5) {
    # plot a^r on [0, up] together with its quadratic majorizer at the
    # support point b, obtained from Young's inequality as above
    a <- seq (0, up, length = 100)
    ar <- a ^ r
    plot (a, ar, type = "l", col = "RED",
          ylab = "f(a)", lwd = 2)
    br <- (r * (a ^ 2)) / 2 + (((2 - r)  * (b ^ 2))/ 2)
    br <- br / (b ^ (2 - r))
    lines (a, br, col = "GREEN", lwd = 2)   # the majorizer touches a^r at a = b
    abline (v = b, lwd = 2)                 # vertical line at the support point
}

Here is an example with y=2 and r=1.5.
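A call such as the following draws that example (the figure itself is not reproduced here); note that the first argument of showMe is the support point y and the second is the exponent r.

showMe (2, 1.5)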

One important application of these results is majorization of powers of Euclidean distances (\sqrt{x'Ax})^r by quadratic forms. We find, for 0<r<2, (\sqrt{x'Ax})^r\leq\frac{1}{(\sqrt{y'Ay})^{2-r}}\left\{\frac{r}{2}x'Ax+\frac{2-r}{2}y'Ay\right\}.

8.4.1 Support Vector Machines