## 1 Curse of dimensionality

We know the \(k\)-NN algorithm is simple and effective; its core idea is local approximation. It works because it approximates the conditional expectation well: on the one hand it replaces the expectation with a sample mean, and on the other hand it replaces a single point with its neighborhood. Put together, it uses the sample mean over a neighborhood as a stand-in for the conditional expectation at that point.

However, in high-dimensional problems, \(k\)-NN gradually breaks down. Why? The answer lies in a few characteristics of high-dimensional spaces.

First, samples in a high-dimensional space are distributed very sparsely. Consider a unit-volume hypercube, i.e. a cube whose side length in every dimension is \(1\) and whose volume is therefore \(1\), with samples uniformly distributed inside it. If we want to capture a proportion \(r\) of the samples, i.e. a fraction \(r\) of the hypercube's volume, what fraction of each side do we need? Simple: each side must have length \(e_p(r)=r^{1/p}\). In \(10\) dimensions, to cover just \(10\%\) of the volume we must take \(e_{10}(0.1)\approx 0.80\) of each side, i.e. \(80\%\) of the range in every dimension.
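
As a quick numeric check (a minimal sketch added here for illustration, not part of the original derivation; the `edge_length` helper is a hypothetical name), the edge length \(e_p(r)=r^{1/p}\) can be tabulated for a few dimensions:

```python
# Edge length e_p(r) = r**(1/p) of the sub-cube needed to capture
# a fraction r of a unit hypercube's volume in p dimensions.
def edge_length(r: float, p: int) -> float:
    return r ** (1.0 / p)

if __name__ == "__main__":
    for p in (1, 2, 3, 10, 100):
        print(f"p = {p:>3}: e_p(0.1) = {edge_length(0.1, p):.3f}, "
              f"e_p(0.01) = {edge_length(0.01, p):.3f}")
```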

Second, samples in a high-dimensional space almost all lie near the "edge". Consider \(N\) samples in \(p\)-dimensional space, uniformly distributed in a unit ball centered at the origin. What is the median distance from the origin to the sample **closest** to it? Let \(D=\min_i\{\Vert x_i \Vert\}\) be the minimum distance from the samples to the origin; its cumulative distribution function is

\[P(D\le d)=1-P\left(\Vert x_i \Vert>d,\ \forall i\right)=1-\left(1-d^p\right)^N.\]

To find the median distance, simply set the cumulative distribution function equal to \(1/2\) and solve: the median nearest distance is \(d(N,p)=\left[1-\left(1/2 \right)^{1/N}\right]^{1/p}\). For example, with \(p=10\) and \(N=500\), \(d(10,500)\approx 0.52\); in other words, even the sample closest to the origin is, in the median case, more than half the radius away.
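
The median nearest distance can also be checked by simulation. The sketch below (my own illustration; the `median_nearest_distance` helper is a hypothetical name) draws \(N\) points uniformly from the unit ball and compares the simulated median with the formula \(d(N,p)\):

```python
import numpy as np

def median_nearest_distance(N: int, p: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the median distance from the origin to the
    nearest of N points drawn uniformly from the p-dimensional unit ball."""
    rng = np.random.default_rng(seed)
    mins = []
    for _ in range(trials):
        # Uniform sampling in the unit ball: Gaussian direction, radius ~ U^(1/p).
        g = rng.standard_normal((N, p))
        directions = g / np.linalg.norm(g, axis=1, keepdims=True)
        radii = rng.uniform(size=(N, 1)) ** (1.0 / p)
        x = directions * radii
        mins.append(np.linalg.norm(x, axis=1).min())
    return float(np.median(mins))

if __name__ == "__main__":
    N, p = 500, 10
    formula = (1 - 0.5 ** (1 / N)) ** (1 / p)
    print(f"simulated: {median_nearest_distance(N, p):.3f}, formula: {formula:.3f}")
```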

Third, in high dimensions the sampling density is proportional to \(N^{1/p}\). If we sample \(100\) points in \(1\) dimension, then in \(10\) dimensions we need \(100^{10}\) points to maintain the same sampling density.

## 2 Examples of high-dimensional problems

## 2.1 \(1\)-NN in high dimensions

Setup: \(N=1000\), \(X\) is a \(p\)-dimensional random variable uniformly distributed on \([-1,1]^p\), and \(Y=f(X)=\exp(-8 \Vert X \Vert^2)\). Given a training set \(\mathcal{T}\), we use \(1\)-NN to predict \(y_0\) at \(x_0=0\). Of course, we already know the answer: \(y_0=f(x_0)=1\).

We can decompose the MSE (mean squared error) at \(x_0\):

\[\begin{aligned}\text{MSE}(x_0)&=E_{\mathcal{T}}[f(x_0)-\hat{y}_0]^2\\&=E_{\mathcal{T}}\left[f(x_0)-E_{\mathcal{T}}(\hat{y}_0)+E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0\right]^2\\&=[f(x_0)-E_{\mathcal{T}}(\hat{y}_0)]^2+E_{\mathcal{T}}[\hat{y}_0-E_{\mathcal{T}}(\hat{y}_0)]^2.\end{aligned}\]

The last equality holds because the cross term \(E_{\mathcal{T}}\{[f(x_0)-E_{\mathcal{T}}(\hat{y}_0)](E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0)\}=0\). The first part is the square of the bias, and the second part is the variance.

When \(p=1\), the nearest point found by the \(1\)-NN algorithm is probably not exactly at \(0\), so necessarily \(E_{\mathcal{T}}(\hat{y}_0)\lt 1\). But because \(N=1000\) is fairly large, the nearest point is usually very close to \(x_0\), so neither the bias nor the variance is large.

In high dimensions, however, problems arise. Take \(p=10\): as discussed above, the nearest distance to the origin increases dramatically. With probability over \(99\%\), the sample point nearest to \(x_0=0\) lies at a distance greater than \(0.5\). The prediction \(\hat{y}_0\) is therefore very likely close to \(0\), the bias is close to \(1\), and even if the variance is small, the MSE is already close to \(1\).
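
As a concrete check of this setup (a sketch under the assumptions above; `one_nn_at_origin` is a hypothetical helper name), one can simulate the \(1\)-NN prediction at the origin over repeated training sets and estimate its bias, variance, and MSE:

```python
import numpy as np

def one_nn_at_origin(p, f, N=1000, trials=200, seed=0):
    """Simulate the 1-NN prediction at x0 = 0 with X ~ Uniform[-1,1]^p and
    Y = f(X); return (bias^2, variance, MSE) over repeated training sets."""
    rng = np.random.default_rng(seed)
    true_y0 = f(np.zeros(p))
    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(-1.0, 1.0, size=(N, p))
        nearest = X[np.argmin(np.linalg.norm(X, axis=1))]  # nearest neighbor of x0 = 0
        preds[t] = f(nearest)
    bias_sq = (preds.mean() - true_y0) ** 2
    var = preds.var()
    return bias_sq, var, bias_sq + var

if __name__ == "__main__":
    f = lambda x: np.exp(-8.0 * np.sum(x ** 2))
    for p in (1, 2, 5, 10):
        b2, v, mse = one_nn_at_origin(p, f)
        print(f"p = {p:>2}: bias^2 = {b2:.3f}, variance = {v:.3f}, MSE = {mse:.3f}")
```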

The bias does not always dominate the MSE, though. If the true function depends on only a few dimensions, for example \(f(X)=(X_1+1)^3/2\), then the bias stays small and the variance plays the decisive role in the MSE.
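
Rerunning the sketch above with this cubic function, which depends only on the first coordinate, illustrates the variance-dominated regime (again just an illustration, reusing the hypothetical `one_nn_at_origin` helper defined earlier):

```python
# f(X) = (X_1 + 1)^3 / 2 depends only on the first coordinate,
# so the squared bias stays small and the variance dominates the MSE.
b2, v, mse = one_nn_at_origin(p=10, f=lambda x: 0.5 * (x[0] + 1.0) ** 3)
print(f"p = 10: bias^2 = {b2:.3f}, variance = {v:.3f}, MSE = {mse:.3f}")
```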

## 2.2 Least squares in high dimensions

Setup: the true relationship is \(y=X\beta+\epsilon\), where \(\epsilon\sim N(0,\sigma^2 I_N)\) and is independent of \(X\). As before, we want to predict \(y_0\) at \(x_0\).

Applying least squares, we have \(\hat\beta=(X'X)^{-1}X'y=\beta+(X'X)^{-1}X'\epsilon\) and \(\hat y_0=x_0'\hat\beta=x_0'\beta+x_0'(X'X)^{-1}X'\epsilon\). Here we focus on the expected (squared) prediction error at \(x_0\), \(\text{EPE}(x_0)=E_{y_0|x_0}E_{\mathcal{T}} (y_0-\hat{y}_0)^2\).

Unlike Section 2.1, there is now an extra noise term \(\epsilon\). We split \(y_0-\hat{y}_0\) into two parts:

\[y_0-\hat{y}_0=(y_0-x_0'\beta)+(x_0'\beta-\hat{y}_0),\]

so that \(\text{EPE}(x_0)\) consists of the square of the first term, the square of the second term, and a cross term.

By a simple calculation, the square of the first term reduces to \(E_{y_0|x_0}E_{\mathcal{T}} (y_0-x_0'\beta)^2=\sigma^2\), while the square of the second term becomes

\[E_{\mathcal{T}}(x_0'\beta-\hat{y}_0)^2=[x_0'\beta-E_{\mathcal{T}}(\hat{y}_0)]^2+E_{\mathcal{T}}[\hat{y}_0-E_{\mathcal{T}}(\hat{y}_0)]^2.\]

Moreover, because \(E_{\mathcal{T}}(\hat{y}_0)=x_0'\beta+x_0'E_{\mathcal{T}} [(X'X)^{-1}X'\epsilon]\), and using \(E_{\mathcal{T}} [(X'X)^{-1}X'\epsilon]=E_{\mathcal{X}} E_{\mathcal{Y|X}} [(X'X)^{-1}X'\epsilon]=E_{\mathcal{X}} \left[ (X'X)^{-1}X' E_{\mathcal{Y|X}} (\epsilon)\right]=0\), we see that \(E_{\mathcal{T}}(\hat{y}_0)=x_0'\beta\). The first term in the formula above, the squared bias, is therefore \(0\); only the variance remains, and it can be further rewritten as

\[E_{\mathcal{T}}[\hat{y}_0-E_{\mathcal{T}}(\hat{y}_0)]^2=E_{\mathcal{T}}\left[x_0'(X'X)^{-1}X'\epsilon\,\epsilon'X(X'X)^{-1}x_0\right]=E_{\mathcal{X}}\left[x_0'(X'X)^{-1}x_0\right]\sigma^2.\]

Finally, using \(E_{\mathcal{T}}(\hat{y}_0)=x_0'\beta\) once more, the cross term is

\[2E_{y_0|x_0}E_{\mathcal{T}}\left[(y_0-x_0'\beta)(x_0'\beta-\hat{y}_0)\right]=2\,E_{y_0|x_0}(y_0-x_0'\beta)\,E_{\mathcal{T}}(x_0'\beta-\hat{y}_0)=0,\]

since the noise at \(x_0\) is independent of the training set.

Summing up the three results above:

\[\text{EPE}(x_0)=\sigma^2+E_{\mathcal{X}}\left[x_0'(X'X)^{-1}x_0\right]\sigma^2.\]

If \(\mathcal{T}\) is a random sample and we assume \(E(x)=0\), then as \(N\to\infty\), \(X'X\to N \,\text{Cov}(x)\). Taking the expectation over all \(x_0\) as well, we have

\[E_{x_0}\text{EPE}(x_0)\approx E_{x_0}\left[x_0'\,\text{Cov}(x)^{-1}x_0\right]\frac{\sigma^2}{N}+\sigma^2=\text{trace}\left[\text{Cov}(x)^{-1}\text{Cov}(x_0)\right]\frac{\sigma^2}{N}+\sigma^2=\sigma^2\frac{p}{N}+\sigma^2.\]

We can see that the EPE grows linearly with \(p\).
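
To make this linear growth visible, here is a small Monte Carlo sketch (my own illustration, assuming \(X\) and \(x_0\) are drawn from a standard normal, which satisfies \(E(x)=0\); `estimate_epe` is a hypothetical name). It estimates \(E_{x_0}\text{EPE}(x_0)\) and compares it with \(\sigma^2(1+p/N)\):

```python
import numpy as np

def estimate_epe(p: int, N: int = 500, sigma: float = 1.0,
                 trials: int = 400, seed: int = 0) -> float:
    """Monte Carlo estimate of E_{x0} EPE(x0) for OLS with y = X beta + eps,
    where X and x0 are drawn from N(0, I_p); should be close to sigma^2 * (1 + p / N)."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(p)
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((N, p))
        y = X @ beta + sigma * rng.standard_normal(N)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        x0 = rng.standard_normal(p)                      # new query point
        y0 = x0 @ beta + sigma * rng.standard_normal()   # its noisy response
        errs.append((y0 - x0 @ beta_hat) ** 2)
    return float(np.mean(errs))

if __name__ == "__main__":
    N, sigma = 500, 1.0
    for p in (1, 5, 20, 50):
        print(f"p = {p:>2}: EPE ~ {estimate_epe(p, N, sigma):.3f}, "
              f"theory ~ {sigma**2 * (1 + p / N):.3f}")
```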

## References

- Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. *The Elements of Statistical Learning*. New York: Springer Series in Statistics, 2001.