
Curse of Dimensionality

2021-06-22 16:13:05 analysis101

1 Curse of dimensionality

We know that the \(k\)-NN algorithm is very simple and effective, and its core idea is local approximation. The reason it works is that it approximates the conditional expectation well: on the one hand it replaces the expectation with a sample mean, and on the other hand it replaces a given point with a neighborhood of that point. Put together, it uses the sample mean over a neighborhood in place of the conditional expectation at that point.
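To make the idea of "sample mean over a neighborhood" concrete, here is a minimal \(k\)-NN regression sketch (the code and its toy data are illustrative additions, not part of the original discussion):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    """Predict y at x0 as the sample mean of y over the k nearest training points."""
    dist = np.linalg.norm(X_train - x0, axis=1)   # distance from x0 to every training sample
    nearest = np.argsort(dist)[:k]                # indices of the k nearest neighbors
    return y_train[nearest].mean()                # local sample mean, approximating E[Y | X = x0]

# Toy usage: one-dimensional inputs, y = sin(x) + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
print(knn_predict(X, y, np.array([1.0]), k=10))   # should be close to sin(1) ≈ 0.84
```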

However, in high-dimensional problems \(k\)-NN gradually becomes ineffective. Why? We have to start from some characteristics of high-dimensional spaces.

First, samples in high-dimensional space are distributed very sparsely. Suppose we have a unit hypercube, i.e. each side has length \(1\) and its volume is also \(1\), with samples uniformly distributed inside it. If we want to capture a fraction \(r\) of the samples, i.e. take a sub-cube with a fraction \(r\) of the volume, what fraction of each side do we need? Simple: each side must have length \(e_p(r)=r^{1/p}\). In \(10\) dimensions, to capture only \(10\%\) of the volume we must take \(e_{10}(0.1)\approx 0.80\) of each side, i.e. \(80\%\) of the range of every coordinate.
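A quick numerical illustration of \(e_p(r)=r^{1/p}\) (a small added sketch; the particular values of \(p\) are arbitrary):

```python
def edge_length(r, p):
    """Fraction of each side needed to capture a fraction r of a unit hypercube in p dimensions."""
    return r ** (1.0 / p)

for p in (1, 2, 3, 10, 100):
    print(f"p={p:3d}: to capture 10% of the volume, take {edge_length(0.10, p):.2f} of each side")
# At p=10 this prints about 0.79: a "local" neighborhood must span ~80% of every coordinate's range.
```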

Second, samples in high-dimensional space are almost all distributed near the "edge" (the boundary). Consider \(N\) samples in \(p\)-dimensional space, uniformly distributed in a unit ball centered at the origin. What is the median distance from the origin to the sample closest to it? Let \(D=\min_i\{\Vert x_i \Vert\}\) be the minimum distance from the samples to the origin, and compute its cumulative distribution function

\[\begin{aligned} &F(d)\\ =& \Pr(D\leq d)\\ =& 1-\Pr(D\gt d)\\ =& 1- \prod_{i=1}^{N} \Pr(\Vert x_i \Vert \gt d)\\ =& 1- \prod_{i=1}^{N} [1-\Pr(\Vert x_i \Vert \leq d)]\\ =& 1- \left[1-d^p\right]^{N} \end{aligned} \]

To find the median distance, just set the cumulative distribution function equal to \(1/2\). Solving gives the median of the nearest distance, \(d(N,p)=\left[1-\left(1/2 \right)^{1/N}\right]^{1/p}\). For example, with \(p=10\) and \(N=500\), \(d(10,500)\approx 0.52\); in other words, the point closest to the origin is, in the median case, already more than halfway to the boundary.
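The closed form \(d(N,p)=\left[1-\left(1/2\right)^{1/N}\right]^{1/p}\) can be checked by simulation. Below is a small Monte Carlo sketch (an added illustration; the sampling scheme for the unit ball and the number of trials are my own choices):

```python
import numpy as np

def median_nearest_distance(N, p, n_trials=1000, seed=0):
    """Empirical median of min_i ||x_i|| for N points drawn uniformly in the p-dimensional unit ball."""
    rng = np.random.default_rng(seed)
    mins = np.empty(n_trials)
    for t in range(n_trials):
        g = rng.normal(size=(N, p))
        directions = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform directions on the sphere
        radii = rng.uniform(size=(N, 1)) ** (1.0 / p)              # radii for a uniform ball
        x = directions * radii
        mins[t] = np.linalg.norm(x, axis=1).min()
    return np.median(mins)

p, N = 10, 500
closed_form = (1 - 0.5 ** (1.0 / N)) ** (1.0 / p)
print(f"closed form: {closed_form:.3f}, simulation: {median_nearest_distance(N, p):.3f}")
# Both are roughly 0.52: even the *closest* of 500 points is more than halfway to the boundary.
```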

Third, in high dimensions the sampling density is proportional to \(N^{1/p}\). If we sample \(100\) points in \(1\) dimension, then in \(10\) dimensions we need \(100^{10}\) points to maintain the same sampling density.

2 Examples of high-dimensional problems

2.1 \(1\)-NN in high dimensions

Setup: \(N=1000\), \(X\) is a \(p\)-dimensional random variable uniformly distributed on \([-1,1]^p\), \(Y=f(X)=\exp(-8 \Vert X \Vert^2)\), and the training set is \(\mathcal{T}\). We use \(1\)-NN to predict \(y_0\) at \(x_0=0\). Of course, we already know the answer: \(y_0=f(x_0)=1\).

The MSE (mean squared error) can be decomposed as follows:

\[\begin{aligned} &\text{MSE}(x_0)\\ =& E_{\mathcal{T}}[f(x_0)-\hat{y}_0]^2\\ =& [f(x_0)-E_{\mathcal{T}}(\hat{y}_0)]^2 +E_{\mathcal{T}}[E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0]^2 \end{aligned} \]

The last equality holds because the cross term \(E_{\mathcal{T}}\{[f(x_0)-E_{\mathcal{T}}(\hat{y}_0)](E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0)\}=0\): the factor \(f(x_0)-E_{\mathcal{T}}(\hat{y}_0)\) is a constant with respect to \(\mathcal{T}\), and \(E_{\mathcal{T}}[E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0]=0\). The first part is the squared bias, the second part is the variance.

When \(p=1\), the nearest point found by \(1\)-NN is probably not exactly at \(0\), so we must have \(E_{\mathcal{T}}(\hat{y}_0)\lt y_0=1\). But because \(N=1000\) is fairly large, the nearest point is almost always very close to \(x_0\), so neither the bias nor the variance is large.

But in high dimensions problems arise. Take \(p=10\): as discussed above, the distance to the nearest neighbor of the origin grows dramatically, and in more than \(99\%\) of training sets the point nearest to \(x_0=0\) lies at a distance greater than \(0.5\). The prediction \(\hat{y}_0\) is therefore very likely to be close to \(0\), the bias becomes large, and even with a very small variance the MSE is already close to \(1\).
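A simulation sketch of this example (the function, \(N\), and \(x_0\) are as above; the number of repetitions is my own choice): for each \(p\), draw many training sets, record the \(1\)-NN prediction at \(x_0=0\), and split the MSE into squared bias and variance.

```python
import numpy as np

def one_nn_at_origin(p, N=1000, n_trials=500, seed=0):
    """Monte Carlo estimate of bias^2 and variance of the 1-NN prediction at x0 = 0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_trials)
    for t in range(n_trials):
        X = rng.uniform(-1, 1, size=(N, p))        # training inputs, uniform on [-1,1]^p
        y = np.exp(-8 * np.sum(X**2, axis=1))      # noiseless responses f(X)
        nearest = np.argmin(np.linalg.norm(X, axis=1))
        preds[t] = y[nearest]                      # 1-NN prediction of y0 at x0 = 0
    bias_sq = (1.0 - preds.mean()) ** 2            # f(x0) = 1
    variance = preds.var()
    return bias_sq, variance

for p in (1, 2, 5, 10):
    b2, v = one_nn_at_origin(p)
    print(f"p={p:2d}: bias^2={b2:.3f}, variance={v:.4f}, MSE={b2 + v:.3f}")
# At p=1 the MSE is tiny; by p=10 the squared bias dominates and the MSE approaches 1.
```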

The bias does not always dominate the MSE, though. If the true functional relationship involves only a few dimensions, e.g. \(f(X)=(X_1+1)^3/2\), the bias stays modest and it is the variance that plays the decisive role in the MSE.

2.2 Least squares in high dimensions

Setup: the true relationship is \(y=X\beta+\epsilon\), where \(\epsilon\sim N(0,\sigma^2 I_N)\) and is independent of \(X\). We again want to predict \(y_0\) at \(x_0\).

Using least squares, we have \(\hat\beta=(X'X)^{-1}X'y=\beta+(X'X)^{-1}X'\epsilon\) and \(\hat y_0=x_0'\hat\beta=x_0'\beta+x_0'(X'X)^{-1}X'\epsilon\). Here we focus on the expected (squared) prediction error at \(x_0\): \(\text{EPE}(x_0)=E_{y_0|x_0}E_{\mathcal{T}} (y_0-\hat{y}_0)^2\).

Unlike Section 2.1, there is an additional noise term \(\epsilon\) here, so we decompose \(y_0-\hat{y}_0\) into two parts:

\[y_0-\hat{y}_0=(y_0-x_0'\beta)+(x_0'\beta -\hat{y}_0) \]

A simple calculation shows that the square of the first term reduces to \(E_{y_0|x_0}E_{\mathcal{T}} (y_0-x_0'\beta)^2=\sigma^2\), while the square of the second term becomes

\[E_{y_0|x_0}E_{\mathcal{T}}(x_0'\beta -\hat{y}_0) ^2 =[x_0'\beta-E_{\mathcal{T}}(\hat{y}_0)]^2 +E_{\mathcal{T}}[E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0]^2 \]

Moreover, since \(E_{\mathcal{T}}(\hat{y}_0)=x_0'\beta+x_0'E_{\mathcal{T}} [(X'X)^{-1}X'\epsilon]\), and \(E_{\mathcal{T}} [(X'X)^{-1}X'\epsilon]=E_{\mathcal{X}} E_{\mathcal{Y|X}} [(X'X)^{-1}X'\epsilon]=E_{\mathcal{X}} \left[ (X'X)^{-1}X' E_{\mathcal{Y|X}} (\epsilon)\right]=0\), we get \(E_{\mathcal{T}}(\hat{y}_0)=x_0'\beta\). The first term above, the squared bias, is therefore \(0\); only the variance remains, and it can be further rewritten as

\[\begin{aligned} &E_{y_0|x_0}E_{\mathcal{T}}(x_0'\beta -\hat{y}_0) ^2\\ =& E_{\mathcal{T}}[E_{\mathcal{T}}(\hat{y}_0)-\hat{y}_0]^2\\ =& E_{\mathcal{T}}[x_0'(X'X)^{-1}X'\epsilon\epsilon' X (X'X)^{-1}x_0]\\ =& x_0' E_{\mathcal{X}} \left[ (X'X)^{-1}X' [E_{\mathcal{Y|X}}(\epsilon\epsilon')]X (X'X)^{-1} \right] x_0\\ =&x_0' E_{\mathcal{X}} \left[(X'X)^{-1}\right]x_0 \sigma^2 \end{aligned} \]

Finally, using \(E_{\mathcal{T}}(\hat{y}_0)=x_0'\beta\) once more, the cross term is

\[\begin{aligned} &2E_{y_0|x_0}E_{\mathcal{T}}[(y_0-x_0'\beta)(x_0'\beta -\hat{y}_0)]\\ =& 2E_{y_0|x_0}\left[(y_0-x_0'\beta)E_{\mathcal{T}} (x_0'\beta -\hat{y}_0)\right]\\ =& 0 \end{aligned} \]

Combining the three results above, we have:

\[\text{EPE}(x_0)=E_{y_0|x_0}E_{\mathcal{T}} (y_0-\hat{y}_0)^2=\sigma^2+x_0' E_{\mathcal{X}} \left[(X'X)^{-1}\right]x_0 \sigma^2 \]

If \(\mathcal{T}\) is a random sample and we assume \(E(x)=0\), then as \(N\to\infty\), \(X'X\to N \text{Cov}(x)\). Taking the expectation over \(x_0\) as well, we get

\[\begin{aligned} &E_{x_0}\text{EPE}(x_0)\\ \sim& \sigma^2+\dfrac{\sigma^2}{N} E_{x_0} [x_0' \text{Cov}(x)^{-1} x_0]\\ =& \sigma^2+\dfrac{\sigma^2}{N} \text{tr} \left[ \text{Cov}(x)^{-1} E_{x_0} (x_0x_0' )\right]\\ =& \sigma^2+\dfrac{\sigma^2}{N} \text{tr} (I_p)\\ =& \sigma^2+\dfrac{p}{N}\sigma^2 \end{aligned} \]

We can see that the EPE grows linearly with \(p\).
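As a rough check of \(\text{EPE}\approx\sigma^2(1+p/N)\), here is a simulation sketch (an added illustration; the Gaussian design and the particular \(N\), \(\sigma\), and trial counts are my own choices):

```python
import numpy as np

def average_epe(p, N=100, sigma=1.0, n_trials=200, n_test=500, seed=0):
    """Monte Carlo estimate of E_{x0} EPE(x0) for least squares with p predictors and N samples."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=p)                             # fixed true coefficients
    epe = np.empty(n_trials)
    for t in range(n_trials):
        X = rng.normal(size=(N, p))                       # training design, E(x)=0, Cov(x)=I
        y = X @ beta + rng.normal(0, sigma, size=N)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares fit on this training set
        X0 = rng.normal(size=(n_test, p))                 # random target points x0
        y0 = X0 @ beta + rng.normal(0, sigma, size=n_test)
        epe[t] = np.mean((y0 - X0 @ beta_hat) ** 2)       # average squared prediction error
    return epe.mean()

N, sigma = 100, 1.0
for p in (2, 5, 10, 20):
    print(f"p={p:2d}: simulated EPE={average_epe(p, N, sigma):.3f}, "
          f"sigma^2*(1+p/N)={sigma**2 * (1 + p / N):.3f}")
# The simulated EPE grows roughly linearly in p and is close to sigma^2*(1+p/N) when N >> p.
```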

References

  • Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. New York: Springer Series in Statistics, 2001.
