Kernel Ridge Regression

This article aims to introduce and clarify kernel ridge regression. To do so, it is necessary to derive a simple formula: the orthogonal projection formula. The reason is that, to understand kernel ridge regression, we need to understand how linear regression works, and the whole theory of linear regression rests on the answer to the following question: what is the projection of a vector \boldsymbol{y} onto a vector \boldsymbol{x}?

We assume that the vectors live in a finite-dimensional Euclidean space and that vectors are always column vectors.

Projection of a vector onto a vector

First, let’s consider the simple case of projecting a vector onto another vector in n-dimensional Euclidean space. Figure 1 illustrates this scenario. The vector we want to find is \hat{\boldsymbol{y}}, the projection of \boldsymbol{y} onto \boldsymbol{x}.
Figure 1

First of all, notice that the projection vector \hat{\boldsymbol{y}} is just a “rescaled version” of the vector \boldsymbol{x}, that is, \hat{\boldsymbol{y}}= c \boldsymbol{x}. How much we should “rescale” depends on two things: (i) how long the vectors \boldsymbol{x} and \boldsymbol{y} are, and (ii), most importantly, the mutual arrangement of the vectors \boldsymbol{x} and \boldsymbol{y}! Both of these points can be formalized by the notion of the scalar product (the central object of this article). In Euclidean space the scalar product is defined as follows (you may recognize the law of cosines)

    \[\langle\boldsymbol{x}, \boldsymbol{y}\rangle=\|\boldsymbol{x}\|\|\boldsymbol{y}\| \cos \alpha\]

To see why the scalar product carries information about the mutual arrangement of the vectors, consider two extreme scenarios: (1) \boldsymbol{x} and \boldsymbol{y} are orthogonal to each other, or (2) they have the same direction. In the first case, the scalar product is zero, since \cos \left(\frac{\pi}{2}\right)=0. In the second case, the scalar product achieves its maximum value, which is the product of the lengths of the vectors.
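
As a concrete illustration (the vectors are chosen purely for the example), take \boldsymbol{u}=(1,0)^{T}, \boldsymbol{v}=(0,1)^{T} and \boldsymbol{w}=(2,0)^{T} in \mathbb{R}^{2}:

    \[\langle\boldsymbol{u}, \boldsymbol{v}\rangle=0 \quad\text{(orthogonal)}, \qquad \langle\boldsymbol{u}, \boldsymbol{w}\rangle=2=\|\boldsymbol{u}\|\|\boldsymbol{w}\| \quad\text{(same direction)}.\]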

Problem 1. How is this fact related to the Cauchy-Schwarz inequality?

Problem 2. Denoting by X and Y random vectors with realizations \boldsymbol{x} and \boldsymbol{y} respectively, what is the relationship between the sample correlation of \boldsymbol{x} and \boldsymbol{y} and the formula

    \[\frac{\langle\boldsymbol{x}, \boldsymbol{y}\rangle}{\|\boldsymbol{x}\|\|\boldsymbol{y}\|}=\cos \alpha .\]

Going back to the projection formula, we have to find the scale factor c in \hat{\boldsymbol{y}}=c \boldsymbol{x}. Notice that dividing \boldsymbol{x} by \|\boldsymbol{x}\| gives us the vector \boldsymbol{x}^{\prime} with the same direction and length 1 (this procedure is called normalization). If we replace \boldsymbol{x} with \boldsymbol{x}^{\prime} in \hat{\boldsymbol{y}}=c \boldsymbol{x}, then the problem of finding c is equivalent to the problem of finding the length of \hat{\boldsymbol{y}}, so \hat{\boldsymbol{y}}=\|\hat{\boldsymbol{y}}\| \boldsymbol{x}^{\prime}. By the definition of \cos \alpha we have \|\hat{\boldsymbol{y}}\|= \|\boldsymbol{y}\| \cos \alpha, and using the formula from Problem 2 gives us \|\hat{\boldsymbol{y}}\|=\frac{\langle \boldsymbol{x}, \boldsymbol{y}\rangle}{\|\boldsymbol{x}\|}. Therefore

    \[\hat{\boldsymbol{y}}=\frac{\langle \boldsymbol{x}, \boldsymbol{y}\rangle}{\|\boldsymbol{x}\|^{2}} \boldsymbol{x}=\frac{\langle \boldsymbol{x}, \boldsymbol{y}\rangle}{\langle \boldsymbol{x}, \boldsymbol{x}\rangle} \boldsymbol{x}\]
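
As a quick worked example (the numbers are chosen only for illustration), take \boldsymbol{x}=(1,1)^{T} and \boldsymbol{y}=(2,3)^{T}. Then \langle\boldsymbol{x}, \boldsymbol{y}\rangle=5 and \|\boldsymbol{x}\|^{2}=2, so

    \[\hat{\boldsymbol{y}}=\frac{5}{2}\begin{pmatrix}1\\1\end{pmatrix}=\begin{pmatrix}5/2\\5/2\end{pmatrix},\]

and the residual \boldsymbol{y}-\hat{\boldsymbol{y}}=(-1/2, 1/2)^{T} is indeed orthogonal to \boldsymbol{x}.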

If the vector \boldsymbol{x} has length one, this formula simplifies to

    \[\widehat{\boldsymbol{y}}=\langle\boldsymbol{x}, \boldsymbol{y}\rangle \boldsymbol{x} .\]

Let us denote the operation \langle\boldsymbol{x}, \boldsymbol{y}\rangle \boldsymbol{x} by \boldsymbol{x} * \boldsymbol{y}, so the * operation projects any vector onto the given unit-length vector. Now observe that the * operation is linear in the second argument

    \[\boldsymbol{x} *\left(a_{1} \boldsymbol{y}_{1}+a_{2} \boldsymbol{y}_{2}\right)=a_{1} \boldsymbol{x} * \boldsymbol{y}_{1}+a_{2} \boldsymbol{x} * \boldsymbol{y}_{2}\]

and maps vectors to vectors; therefore it can be represented as a matrix, that is, there exists a matrix P, called the projection matrix, such that

    \[\boldsymbol{x} * \boldsymbol{y}=P \boldsymbol{y}\]

Problem 3. Show that the matrix P is nothing more than \boldsymbol{x} \boldsymbol{x}^{T} . (Hint: \langle\boldsymbol{x}, \boldsymbol{y}\rangle=\boldsymbol{x}^{T} \boldsymbol{y}.)

Problem 4. Show that \min _{\boldsymbol{x}^{\prime} \in L}\left\|\boldsymbol{x}^{\prime}-\boldsymbol{y}\right\|=\min _{c}\|c \boldsymbol{x}-\boldsymbol{y}\|=\|P \boldsymbol{y}-\boldsymbol{y}\|, where L=\{c \boldsymbol{x}: c \in \mathbb{R}\} .

To summarize, the matrix P=\boldsymbol{x} \boldsymbol{x}^{T} acting on the vector \boldsymbol{y} projects it onto the unit-length vector \boldsymbol{x}. For a general \boldsymbol{x} we have to normalize by its squared length \|\boldsymbol{x}\|^{2}=\boldsymbol{x}^{T} \boldsymbol{x}, so

    \[P=\left(\boldsymbol{x}^{T} \boldsymbol{x}\right)^{-1} \boldsymbol{x} \boldsymbol{x}^{T}=\boldsymbol{x}\left(\boldsymbol{x}^{T} \boldsymbol{x}\right)^{-1} \boldsymbol{x}^{T} .\]
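
As a side note, this formula is easy to check numerically. Here is a minimal NumPy sketch (the vectors are arbitrary and serve only as an illustration):

    import numpy as np

    # an arbitrary direction x and an arbitrary vector y in R^3
    x = np.array([1.0, 2.0, 2.0])
    y = np.array([3.0, -1.0, 4.0])

    # projection matrix P = x (x^T x)^{-1} x^T
    P = np.outer(x, x) / (x @ x)

    y_hat = P @ y                 # projection of y onto span{x}
    print(x @ (y - y_hat))        # residual is orthogonal to x (prints ~0)
    print(np.allclose(P @ P, P))  # P is idempotent: projecting twice changes nothing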

The following observation will be important later. For a unit-length vector \boldsymbol{x},

    \[\left(P \boldsymbol{y}\right)_{k}=\sum_{i=1}^{n} y_{i}\, x_{i} x_{k}, \quad k=1, \ldots, n.\]

This allows us to write the projection as a linear combination of the products x_{i} x_{k}.

Projection of a vector onto a space

What about the more general scenario when we want to project \boldsymbol{y} not onto a single vector, but onto the space L spanned by a bunch of vectors \boldsymbol{x}_{1}, \dots, \boldsymbol{x}_{d}? Figure 2 illustrates this scenario when d=2 and the space is three-dimensional.
Figure 2
Surprisingly, the formula for the projection in the general case is very similar. All we have to do is replace the vector \boldsymbol{x} with the matrix X=\left(\boldsymbol{x}_{1} \ldots \boldsymbol{x}_{d}\right) whose columns are the vectors spanning the subspace we want to project onto

    \[P=X\left(X^{T} X\right)^{-1} X^{T}\]

When the vectors are orthonormal, that is when \left\langle\boldsymbol{x}_{\boldsymbol{i}}, \boldsymbol{x}_{\boldsymbol{j}}\right\rangle=\delta_{i j}, then X^{T} X=I and

    \[P \boldsymbol{y}=X X^{T} \boldsymbol{y}=\sum_{i=1}^{d} \boldsymbol{x}_{i} * \boldsymbol{y}\]

The equality above means that the projection onto the space spanned by an orthonormal system is just the sum of the projections onto the individual vectors.
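
Below is a small numerical check of this statement; the orthonormal columns are obtained from a QR decomposition of a random matrix, purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    X, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # X has orthonormal columns
    y = rng.standard_normal(5)

    proj_full = X @ X.T @ y                           # P y = X X^T y
    proj_sum = sum((x @ y) * x for x in X.T)          # sum of projections on each column
    print(np.allclose(proj_full, proj_sum))           # True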

Problem 5. What does this mean in terms of linear regression? (Hint: In linear regression the vectors \boldsymbol{x}_{1}, \dots, \boldsymbol{x}_{d} are the inputs (features), and orthogonality just means that they are uncorrelated. When the inputs are uncorrelated, they have no effect on each other’s parameter estimates in the model.)

As was discussed in the one-vector case, the projection is just a rescaled (reweighted) version of the vector onto which we are projecting, with the weight given by the scalar \omega=\left(\boldsymbol{x}^{T} \boldsymbol{x}\right)^{-1} \boldsymbol{x}^{T} \boldsymbol{y}=\frac{\langle\boldsymbol{x}, \boldsymbol{y}\rangle}{\|\boldsymbol{x}\|^{2}}. The same is true for the projection onto the space spanned by multiple vectors; what changes is that we have to reweight each \boldsymbol{x}_{i} according to the i^{th} component of the weighting vector \boldsymbol{\omega}=\left(\omega_{1}, \ldots, \omega_{d}\right)^{T}=\left(X^{T} X\right)^{-1} X^{T} \boldsymbol{y}. It means that

    \[\hat{\boldsymbol{y}}=P \boldsymbol{y}=\sum_{i=1}^{d} \omega_{i} \boldsymbol{x}_{\boldsymbol{i}} .\]
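
In code, a minimal sketch of the same computation (with an arbitrary random design matrix), checking that the weighted sum of the columns coincides with applying the projection matrix to \boldsymbol{y}:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((6, 3))          # columns x_1, x_2, x_3
    y = rng.standard_normal(6)

    w = np.linalg.solve(X.T @ X, X.T @ y)    # weighting vector (X^T X)^{-1} X^T y
    y_hat = X @ w                            # sum_i w_i x_i

    P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix
    print(np.allclose(y_hat, P @ y))         # True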

Notice that the projection of \boldsymbol{y} onto \boldsymbol{x} always exists whenever \|\boldsymbol{x}\| \neq 0. In the general case it can happen that \left\|\boldsymbol{x}_{i}\right\| \neq 0 for all i=1,\dots,d, and yet \left(X^{T} X\right)^{-1} is not defined.

Problem 6. When does this happen?

This problem can easily be solved by adding a positive constant to the diagonal of X^{T} X before inversion: even if X^{T} X is not invertible, \left(X^{T} X+\lambda I\right)^{-1} is well defined for \lambda>0. The procedure just described is known in the literature as Tikhonov regularization. This motivates us to define the regularized projection matrix

    \[P_{\lambda}=X\left(X^{T} X+\lambda I\right)^{-1} X^{T}\]

and the corresponding projection vector

(1)   \begin{equation*}     \hat{\boldsymbol{y}}=P_{\lambda} \boldsymbol{y}=\sum_{i=1}^{d} \omega_{i, \lambda} \boldsymbol{x}_{\boldsymbol{i}} \end{equation*}

where \boldsymbol{\omega}_{\lambda}=\left(\omega_{1, \lambda}, \ldots, \omega_{d, \lambda}\right)^{T}=\left(X^{T} X+\lambda I\right)^{-1} X^{T} \boldsymbol{y}.

Kernel ridge regression

Now it’s time to talk about kernel ridge regression. To do so, it is more natural to work with the rows (observations) of the matrix X rather than with its columns. Let’s denote them by \mathbf{x}_{1}, \ldots, \mathbf{x}_{n}, so that the rows of X are \mathbf{x}_{1}^{T}, \ldots, \mathbf{x}_{n}^{T}. First we need to rewrite identity (1) in terms of the row vectors of X. To do so we need the following identity

    \[\left(X^{T} X+\lambda I\right)^{-1} X^{T}=X^{T}\left(X X^{T}+\lambda I\right)^{-1}\]



Problem 7. Show that the identity above is true. Hint: \left(X^{T} X+\lambda I\right) X^{T}=X^{T}\left(X X^{T}+\lambda I\right).
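
Problem 7 aside, the identity is easy to verify numerically. The sketch below (with an arbitrary matrix and \lambda) also shows that the two expressions produce the same regularized weights:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((6, 3))
    y = rng.standard_normal(6)
    lam = 0.5

    n, d = X.shape
    left = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X^T X + lambda I)^{-1} X^T y
    right = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)  # X^T (X X^T + lambda I)^{-1} y
    print(np.allclose(left, right))                              # True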

Using this identity, we can write

    \[\left(P_{\lambda} \boldsymbol{y}\right)_{k}=\left(X X^{T}\left(X X^{T}+\lambda I\right)^{-1} \boldsymbol{y}\right)_{k}=\sum_{i=1}^{n} \alpha_{i}\left\langle\mathbf{x}_{i}, \mathbf{x}_{k}\right\rangle, \quad k=1, \ldots, n\]

where \boldsymbol{\alpha}=\left(X X^{T}+\lambda I\right)^{-1} \boldsymbol{y}. As one can see, to calculate the projection you only need to know the scalar products of the observations. In other words, all you need is a similarity measure between two input observations. In the derivation above we used the ordinary scalar product for that. The main idea behind kernel ridge regression (and other kernel-based methods such as SVM and Gaussian processes) is to replace the scalar product everywhere in the representation above by a generalized scalar product of the form

    \[K\left(\mathbf{x}_{i}, \mathbf{x}_{k}\right)\]

where K is some symmetric, positive definite function called a kernel. For instance, the kernel

    \[K\left(\mathbf{x}_{i}, \mathbf{x}_{k}\right)=\left\langle\mathbf{x}_{i}, \mathbf{x}_{k}\right\rangle\]

known as the linear kernel, gives us back linear (ridge) regression. In practice, non-linear kernels are more popular. An example of a non-linear kernel is the polynomial kernel

    \[K\left(\mathbf{x}_{i}, \mathbf{x}_{k}\right)=\left(\left\langle\mathbf{x}_{i}, \mathbf{x}_{k}\right\rangle+1\right)^{m}\]

Another popular kernel is the Gaussian kernel (also known as the squared exponential or RBF kernel)

    \[K\left(\mathbf{x}_{i}, \mathbf{x}_{k}\right)=e^{-\gamma\left\|\mathbf{x}_{i}-\mathbf{x}_{k}\right\|^{2}} .\]
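
To make the recipe concrete, here is a minimal sketch of kernel ridge regression with the Gaussian kernel above (the data, \gamma and \lambda are arbitrary choices for illustration, not tuned values):

    import numpy as np

    def gaussian_kernel(A, B, gamma=1.0):
        # K[i, j] = exp(-gamma * ||a_i - b_j||^2)
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)

    rng = np.random.default_rng(3)
    X_train = rng.uniform(-3, 3, size=(50, 1))
    y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(50)

    lam = 0.1
    K = gaussian_kernel(X_train, X_train)                       # Gram matrix K(x_i, x_k)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)  # alpha = (K + lambda I)^{-1} y

    X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
    y_pred = gaussian_kernel(X_test, X_train) @ alpha           # sum_i alpha_i K(x, x_i)
    print(y_pred)

With the linear kernel in place of the Gaussian one, this reduces exactly to the regularized projection derived earlier.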

Problem 8. Why do we require the kernel to be symmetric and positive definite? Do the examples of kernels defined above satisfy this requirement? Can you give other examples of kernels?

Davit Gogolashvili

Davit, born in Georgia, obtained both his BA in economics and his MA in statistics at Tbilisi State University, his main interest being nonparametric statistics. He is now based at Eurecom in France, under the supervision of Professor Maurizio Filippone. The topic of his project is Analysis and Optimisation of Collaborative Machine Learning.