# Curse of Dimensionality

2021-06-22 16:13:05

## 1 Curse of dimensionality

We know that the $$k$$-NN algorithm is simple and effective, and its core idea is local approximation. It works because it approximates the conditional expectation well: on the one hand it replaces the expectation with a sample mean, and on the other hand it replaces a single point with its neighborhood. Combined, it uses the sample mean over a neighborhood in place of the conditional expectation at that point.
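To make this concrete, here is a minimal sketch (Python/NumPy; the sine target, noise level, and sample size are illustrative assumptions, not from the original) of $$k$$-NN regression as a neighborhood sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: Y = sin(X) + noise, so E[Y | X = x] = sin(x)
N = 500
X = rng.uniform(0, 2 * np.pi, N)
Y = np.sin(X) + rng.normal(0, 0.3, N)

def knn_predict(x0, X, Y, k=15):
    """Estimate E[Y | X = x0] by the sample mean of Y over the
    k nearest neighbors of x0."""
    idx = np.argsort(np.abs(X - x0))[:k]
    return Y[idx].mean()

print(knn_predict(np.pi / 2, X, Y))  # should be close to sin(pi/2) = 1
```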

However, in high-dimensional problems, $$k$$-NN gradually becomes ineffective. Why? We have to start from some characteristics of high-dimensional spaces.

First, samples in high-dimensional space are distributed very sparsely. Consider a unit-volume hypercube, i.e., one whose side length in each dimension is $$1$$, so that its volume is also $$1$$, and suppose the samples are uniformly distributed inside it. If we want to capture a fraction $$r$$ of the samples, that is, take a sub-hypercube with a fraction $$r$$ of the volume, what fraction of each side must we take? Simple: each side must have length $$e_p(r)=r^{1/p}$$. In $$10$$ dimensions, to capture just $$10\%$$ of the volume we must take $$e_{10}(0.1)\approx 0.80$$ of each side, i.e., $$80\%$$ of the range of every coordinate.
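A quick numerical check of $$e_p(r)=r^{1/p}$$ (a sketch; the choices of $$p$$ beyond the text's example are arbitrary):

```python
def edge_length(r, p):
    """Fraction of each side needed to capture a fraction r of the
    volume of the unit hypercube in p dimensions."""
    return r ** (1 / p)

for p in (1, 2, 3, 10, 100):
    print(p, round(edge_length(0.1, p), 3))
# In 10 dimensions, ~0.794 of every side is needed for 10% of the volume.
```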

Second, samples in high-dimensional space are almost all distributed near the "edge." Consider $$N$$ samples in $$p$$-dimensional space, uniformly distributed in the unit ball centered at the origin. What is the median of the distance from the origin to its nearest sample? Let $$D=\min_i\{\Vert x_i \Vert\}$$ be the minimum distance from the samples to the origin, and compute its cumulative distribution function:

$$\begin{aligned} F(d) &= \Pr(D\leq d)\\ &= 1-\Pr(D\gt d)\\ &= 1- \prod_{i=1}^{N} \Pr(\Vert x_i \Vert \gt d)\\ &= 1- \prod_{i=1}^{N} \left[1-\Pr(\Vert x_i \Vert \leq d)\right]\\ &= 1- \left(1-d^p\right)^{N} \end{aligned}$$

where the last step uses $$\Pr(\Vert x_i \Vert \leq d)=d^p$$: the volume of a ball of radius $$d$$ in $$p$$ dimensions is $$d^p$$ times the volume of the unit ball.

To find the median distance, just set the cumulative distribution function equal to $$1/2$$. Solving gives the median nearest distance $$d(p,N)=\left[1-\left(1/2\right)^{1/N}\right]^{1/p}$$. For example, with $$p=10$$ and $$N=500$$, $$d(10,500)\approx 0.52$$; in other words, even the point closest to the origin is typically more than halfway to the boundary.
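A sketch verifying this formula by simulation. By the CDF above, the distance of a single uniform point in the unit ball from the origin is distributed as $$U^{1/p}$$ with $$U\sim\text{Uniform}(0,1)$$, so we only need to simulate radii (the replication count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def median_nearest(p, N):
    """Closed-form median of the nearest-to-origin distance."""
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

def median_nearest_mc(p, N, reps=5000):
    """Monte Carlo check: radii of uniform points in the unit ball
    are distributed as U**(1/p), since P(||x|| <= d) = d^p."""
    r = rng.uniform(size=(reps, N)) ** (1 / p)
    return np.median(r.min(axis=1))

print(median_nearest(10, 500))     # ~0.5178
print(median_nearest_mc(10, 500))  # should agree up to Monte Carlo noise
```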

Third, in high dimensions, sampling density is proportional to $$N^{1/p}$$. If we sample $$100$$ points in $$1$$ dimension, then in $$10$$ dimensions we need $$100^{10}$$ points to maintain the same sampling density.
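In code (a trivial sketch of the text's example):

```python
def required_samples(n1, p):
    """Samples needed in p dimensions to match the sampling density
    of n1 samples in 1 dimension (density scales as N**(1/p))."""
    return n1 ** p

print(required_samples(100, 10))  # 100**10 = 10**20 points
```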

## 2.1 $$1$$-NN in high dimensions

Setup: $$N=1000$$; $$X$$ is a $$p$$-dimensional random variable uniformly distributed on $$[-1,1]^p$$; and $$Y=f(X)=\exp(-8 \Vert X \Vert^2)$$ with no noise. Given a training set $$\mathcal{T}$$, we use $$1$$-NN to predict $$y_0$$ at $$x_0=0$$. Of course, we already know the answer: $$y_0=f(x_0)=1$$. (This example follows The Elements of Statistical Learning; see References.)

We can decompose the MSE (mean squared error):

$$\begin{aligned} \text{MSE}(x_0) &= E_{\mathcal{T}}[f(x_0)-\hat{y}_0]^2\\ &= [f(x_0)-E_{\mathcal{T}}(\hat{y}_0)]^2 +E_{\mathcal{T}}[E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0]^2 \end{aligned}$$

The last equality holds because the cross term $$E_{\mathcal{T}}\{[f(x_0)-E_{\mathcal{T}}(\hat{y}_0)][E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0]\}=0$$. The first part is the squared bias; the second part is the variance.

When $$p=1$$, the nearest point found by the $$1$$-NN algorithm is probably not exactly at $$0$$, so necessarily $$E_{\mathcal{T}}(\hat{y}_0)\lt 1$$ and the estimate is biased downward. But because $$N=1000$$ is fairly large, the nearest point is typically very close to $$x_0$$, so neither the bias nor the variance is large.

But in high dimensions, problems arise. Take $$p=10$$: as shown above, the nearest distance to the origin grows dramatically. In more than $$99\%$$ of training sets, the sample nearest to $$x_0=0$$ is farther than $$0.5$$ away. So with high probability the prediction $$\hat{y}_0$$ is close to $$0$$, the bias is large, and even if the variance is very small, the MSE still approaches $$1$$.

The MSE is not always dominated by the bias, however. If the true functional relationship involves only a few dimensions, e.g. $$f(X)=(X_1+1)^3/2$$, the bias stays small and the variance plays the decisive role in the MSE, as the simulation below illustrates.
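A minimal simulation sketch of this section's setup ($$N=1000$$, $$X$$ uniform on $$[-1,1]^p$$, as in the text; the replication count is an arbitrary choice). It estimates the squared bias, variance, and MSE of the $$1$$-NN prediction at the origin for both target functions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_bump(x):
    return np.exp(-8 * (x ** 2).sum())   # f(X) = exp(-8 ||X||^2)

def f_cube(x):
    return (x[0] + 1) ** 3 / 2           # f(X) = (X_1 + 1)^3 / 2

def one_nn_at_origin(f, p, N=1000, reps=200):
    """Estimate bias^2, variance, and MSE of the 1-NN prediction
    at x0 = 0 over random training sets."""
    preds = np.empty(reps)
    for i in range(reps):
        X = rng.uniform(-1, 1, size=(N, p))
        nearest = np.argmin((X ** 2).sum(axis=1))  # index of point closest to 0
        preds[i] = f(X[nearest])
    bias2 = (f(np.zeros(p)) - preds.mean()) ** 2
    var = preds.var()
    return bias2, var, bias2 + var

for p in (1, 2, 5, 10):
    print(p, one_nn_at_origin(f_bump, p))  # bias^2 blows up with p
    print(p, one_nn_at_origin(f_cube, p))  # variance dominates instead
```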

## 2.2 Least squares in high dimensions

Setup: the true relationship is $$y=X\beta+\epsilon$$, where $$\epsilon\sim N(0,\sigma^2 I_N)$$ is independent of $$X$$, and we again want to predict $$y_0$$ at $$x_0$$.

First, least squares gives $$\hat\beta=(X'X)^{-1}X'y=\beta+(X'X)^{-1}X'\epsilon$$ and hence $$\hat y_0=x_0'\hat\beta=x_0'\beta+x_0'(X'X)^{-1}X'\epsilon$$. Here we focus on the expected (squared) prediction error at $$x_0$$: $$\text{EPE}(x_0)=E_{y_0|x_0}E_{\mathcal{T}} (y_0-\hat{y}_0)^2$$.

Compared with Section 2.1, there is now an additional noise term $$\epsilon$$. We decompose $$y_0-\hat{y}_0$$ into two parts:

$$y_0-\hat{y}_0=(y_0-x_0'\beta)+(x_0'\beta -\hat{y}_0)$$

A simple calculation reduces the square of the first term to $$E_{y_0|x_0}E_{\mathcal{T}} (y_0-x_0'\beta)^2=\sigma^2$$, and turns the square of the second term into

$$E_{y_0|x_0}E_{\mathcal{T}}(x_0'\beta -\hat{y}_0) ^2 =[x_0'\beta-E_{\mathcal{T}}(\hat{y}_0)]^2 +E_{\mathcal{T}}[E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0]^2$$

Moreover, since $$E_{\mathcal{T}}(\hat{y}_0)=x_0'\beta+x_0'E_{\mathcal{T}} [(X'X)^{-1}X'\epsilon]$$ and $$E_{\mathcal{T}} [(X'X)^{-1}X'\epsilon]=E_{\mathcal{X}} E_{\mathcal{Y|X}} [(X'X)^{-1}X'\epsilon]=E_{\mathcal{X}} \left[ (X'X)^{-1}X' E_{\mathcal{Y|X}} (\epsilon)\right]=0$$, we have $$E_{\mathcal{T}}(\hat{y}_0)=x_0'\beta$$. The first term above, the squared bias, is therefore $$0$$, leaving only the variance, which can be further rewritten as

$$\begin{aligned} E_{y_0|x_0}E_{\mathcal{T}}(x_0'\beta -\hat{y}_0)^2 &= E_{\mathcal{T}}[E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0]^2\\ &= E_{\mathcal{T}}[x_0'(X'X)^{-1}X'\epsilon\epsilon' X (X'X)^{-1}x_0]\\ &= x_0'\, E_{\mathcal{X}} \left[ (X'X)^{-1}X'\, E_{\mathcal{Y|X}}(\epsilon\epsilon')\, X (X'X)^{-1} \right] x_0\\ &= x_0'\, E_{\mathcal{X}} \left[(X'X)^{-1}\right]x_0\, \sigma^2 \end{aligned}$$

Finally, using $$E_{\mathcal{T}}(\hat{y}_0)=x_0'\beta$$ once more, the cross term is

$$\begin{aligned} 2E_{y_0|x_0}E_{\mathcal{T}}[(y_0-x_0'\beta)(x_0'\beta -\hat{y}_0)] &= 2E_{y_0|x_0}\left[(y_0-x_0'\beta)\,E_{\mathcal{T}} (x_0'\beta -\hat{y}_0)\right]\\ &= 0 \end{aligned}$$

Combining the three results above:

$$\text{EPE}(x_0)=E_{y_0|x_0}E_{\mathcal{T}} (y_0-\hat{y}_0)^2=\sigma^2+x_0'\, E_{\mathcal{X}} \left[(X'X)^{-1}\right]x_0\, \sigma^2$$

If $$\mathcal{T}$$ is a random sample and we assume $$E(x)=0$$, then as $$N\to\infty$$, $$X'X\to N \text{Cov}(x)$$. Taking the expectation over $$x_0$$ as well, we have

$$\begin{aligned} E_{x_0}\text{EPE}(x_0) &\sim \sigma^2+\dfrac{\sigma^2}{N} E_{x_0} [x_0'\, \text{Cov}(x)^{-1} x_0]\\ &= \sigma^2+\dfrac{\sigma^2}{N} \text{tr} \left[ \text{Cov}(x)^{-1} E_{x_0} (x_0x_0' )\right]\\ &= \sigma^2+\dfrac{\sigma^2}{N} \text{tr} (I_p)\\ &= \sigma^2+\dfrac{p}{N}\sigma^2 \end{aligned}$$

We can see that the EPE grows linearly with $$p$$, with slope $$\sigma^2/N$$.
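A sketch checking this result numerically. It takes the $$\epsilon$$-expectations analytically, as in the derivation, so that $$\text{EPE}(x_0)=\sigma^2\left[1+x_0'(X'X)^{-1}x_0\right]$$ conditional on $$X$$, and averages over random draws of $$X$$ and $$x_0$$ (the Gaussian design with $$\text{Cov}(x)=I_p$$, plus the values of $$N$$ and the replication count, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def epe_mc(p, N=200, sigma=1.0, reps=2000):
    """Average sigma^2 * (1 + x0' (X'X)^{-1} x0) over random designs X
    and query points x0, with x ~ N(0, I_p) so Cov(x) = I_p, E(x) = 0."""
    total = 0.0
    for _ in range(reps):
        X = rng.normal(size=(N, p))
        x0 = rng.normal(size=p)
        A = np.linalg.inv(X.T @ X)
        total += sigma ** 2 * (1 + x0 @ A @ x0)
    return total / reps

for p in (1, 5, 10, 20):
    # Should be close to sigma^2 * (1 + p/N) when N >> p.
    print(p, round(epe_mc(p), 4), 1 + p / 200)
```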

## References

• Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics. New York: Springer, 2001.
