
Hard Margin of the Support Vector Machine

2021-06-24 00:44:43 ZhiboZhao

A support vector machine (SVM) is a non-probabilistic binary linear classifier that separates given \(p\)-dimensional input vectors with a hyperplane.

1. The hyperplane

In a \(p\)-dimensional input space, a hyperplane is a subspace of dimension \(p-1\). For example, in a two-dimensional input space a hyperplane is one-dimensional, i.e. a straight line, with the equation:

\[b + w_{1}x_{11} + w_{2}x_{12}=0 \]

In a three-dimensional input space, a hyperplane is two-dimensional, i.e. a plane, with the equation:

\[b + w_{1}x_{11} + w_{2}x_{12} + w_{3}x_{13}=0 \]

When the input dimension \(p > 3\), the hyperplane is hard to visualize directly, but the key fact is the same: the hyperplane divides the input space into two parts. In general, the hyperplane is written as:

\[b + w_{1}x_{11} + w_{2}x_{12} +...+ w_{p}x_{1p} = 0 \]

where \(b\) and \(w_{1}, w_{2},...,w_{p}\) are the parameters of the hyperplane. The hyperplane therefore divides the \(p\)-dimensional space into two parts. Consider a \(p\)-dimensional input instance \(x_{1} = (x_{11}, x_{12},..., x_{1p})^{T}\). When the instance lies on the hyperplane, we have:

\[b + w_{1}x_{11} + w_{2}x_{12} +...+ w_{p}x_{1p} = 0 \]

When it does not lie on the hyperplane, we must have either:

\[b + w_{1}x_{11} + w_{2}x_{12} +...+ w_{p}x_{1p} > 0 \]

or:

\[b + w_{1}x_{11} + w_{2}x_{12} +...+ w_{p}x_{1p} < 0 \]
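As a quick toy check (the numbers here are chosen purely for illustration and are not from the original post): take \(p = 2\), \(b = -3\), \(w_{1} = w_{2} = 1\), i.e. the line \(x_{1} + x_{2} - 3 = 0\). The point \((1, 1)\) gives \(-3 + 1 + 1 = -1 < 0\) and therefore falls on one side of the line, while the point \((4, 2)\) gives \(-3 + 4 + 2 = 3 > 0\) and falls on the other side.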

2. Binary classification with a hyperplane

Consider a set of training inputs \(x_{1}, x_{2},...,x_{n}\), where \(x_{i} = (x_{i1}, x_{i2},...,x_{ip})^{T}\). The corresponding labels belong to two different classes, which we denote by \(+1\) and \(-1\), i.e. \(y = (y_{1}, y_{2},...,y_{n})\) with \(y_{i} \in \{-1, +1\}\). From the training data \(T = ( (x_{1},y_{1}), (x_{2},y_{2}),...,(x_{n},y_{n}) )\) we want to learn the SVM parameters \(w = (w_{1}, w_{2},...,w_{p})^{T}\) and \(b\), so that any new \(p\)-dimensional input \(x_{n+1}\) is classified correctly. During training, if one side of the hyperplane corresponds to \(-1\) and the other to \(+1\), the training criterion can be expressed as:

\[b + w_{1}x_{i1} + w_{2}x_{i2} +...+ w_{p}x_{ip} > 0,\quad if \quad y_{i}=1 \\ b + w_{1}x_{i1} + w_{2}x_{i2} +...+ w_{p}x_{ip} < 0,\quad if \quad y_{i}=-1 \]

Therefore, the two conditions can be unified as:

\[(b + w_{1}x_{i1} + w_{2}x_{i2} +...+ w_{p}x_{ip})y_{i}>0 \]

In vector form, the hyperplane decision function is written as \(f(x) = w^{T}x + b\), and the training criterion becomes \(y_{i}f(x_{i}) > 0\).
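This decision rule is easy to check numerically. Below is a minimal numpy sketch (the parameter values and data points are made up for illustration, not taken from the post) that evaluates \(f(x_{i}) = w^{T}x_{i} + b\) for a few points and verifies the unified criterion \(y_{i}f(x_{i}) > 0\):

```python
# A minimal numpy sketch of the decision rule f(x) = w^T x + b.
# The parameter values and data points are toy values for illustration.
import numpy as np

w = np.array([1.0, 1.0])            # hyperplane weights (assumed toy values)
b = -3.0                            # hyperplane intercept
X = np.array([[1.0, 1.0],           # a few toy input instances
              [4.0, 2.0],
              [0.5, 1.5]])
y = np.array([-1, 1, -1])           # labels in {-1, +1}

f = X @ w + b                       # f(x_i) = w^T x_i + b for every instance
print("f(x_i):", f)                 # [-1.  3. -1.]
print("predicted classes:", np.sign(f))
print("all correctly separated:", bool(np.all(y * f > 0)))   # unified criterion
```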

In theory, however, infinitely many classifiers satisfy this condition, since the hyperplane can be rotated, translated, and so on, as shown in the left part of the figure below. Taking two-dimensional input as an example, treat the blue points as one class and the red points as the other; the black lines are the learned classifiers. Which one should we choose?

Classifier 1 passes too close to both the blue and the red points; even a small perturbation is likely to cause misclassification, so it is not robust.

Classifiers 2 and 3 both look acceptable, but how do we define a criterion for choosing between them?

A natural idea is to choose the maximal margin hyperplane. That is, compute the distance from every point in the \(p\)-dimensional space to the hyperplane, and maximize the smallest of these distances (the margin). The points closest to the hyperplane are the hardest to separate, so maximizing this minimum distance gives the classifier a certain degree of robustness and leads to the optimal hyperplane. With this strategy, the optimal hyperplane for the figure above can be depicted as in the figure below:

From the figure above we can see that three training observations are equidistant from the maximum margin hyperplane and lie on the dashed lines marking the width of the margin; these three observations are called support vectors. Interestingly, the maximum margin hyperplane depends directly on the support vectors and not on the other observations: moving any other observation does not affect the separating hyperplane, provided the move does not make it cross the margin.

3. Formulating the maximum margin

Recall from high-school analytic geometry that for a plane \(\Phi: Ax+By+Cz+D=0\), the normal vector is \([A, B, C]\), and the distance from any point \((x_{0},y_{0},z_{0})\) in space to the plane is:

\[distance = \dfrac{|Ax_{0}+By_{0}+Cz_{0}+D|}{\sqrt {A^{2}+B^{2}+C^{2}}} \]

By the same reasoning, in \(p\)-dimensional space a \((p-1)\)-dimensional hyperplane is written as \(\Phi: w^{T}x + b = 0\), its normal vector is \(w\), and the distance from any point \(x_{i} = (x_{i1},x_{i2},...,x_{ip})^{T}\) to the hyperplane is:

\[distance = \dfrac{|w^{T}x_{i}+b|}{\sqrt{w^{T}w}} = \dfrac{|w^{T}x_{i}+b|}{||w||} \]
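As a small numpy sketch of this formula (the values of \(w\), \(b\) and the point are toy values chosen for illustration):

```python
# Distance from a point to the hyperplane w^T x + b = 0: |w^T x + b| / ||w||.
import numpy as np

w = np.array([1.0, 1.0])
b = -3.0
x_i = np.array([4.0, 2.0])

distance = abs(w @ x_i + b) / np.linalg.norm(w)
print(distance)   # 3 / sqrt(2) ≈ 2.12
```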

Assume the hyperplane classifies every training point correctly, and let the margin attained by the support vectors be \(\beta\). Then every training point satisfies:

\[(b + w_{1}x_{i1} + w_{2}x_{i2} +...+ w_{p}x_{ip})y_{i} \geq \beta \]

That is, \(y_{i}(w^{T}x_{i}+b) \geq \beta\). Observe that rescaling \(w\) and \(b\) does not change the position of the hyperplane, so no matter what value \(\beta\) takes, we can always rescale with \(w'=w/\beta,\quad b'=b/\beta\) and rewrite the constraint as \(y_{i}(w^{T}x_{i}+b) \geq 1\). The width of the margin (the distance between the two boundaries passing through the support vectors) then becomes \(2/||w||\), so the hard-margin SVM model is:

\[\mathop{argmax} \limits_{w,b} \dfrac{2}{||w||}\quad \quad s.t.\quad y_{i}(w^{T}x_{i}+b)\geq 1,\quad i = 1, 2,...,m \]

which is equivalent to:

\[\mathop{argmin} \limits_{w,b} \dfrac{1}{2}||w||^{2}\quad s.t.\quad y_{i}(w^{T}x_{i}+b)\geq 1,\quad i = 1,2,...,m \]

In words: shift the hyperplane by one unit to each side to obtain the boundaries \(\Phi_{1}\) and \(\Phi_{2}\); among all hyperplanes for which every point of class \(+1\) lies on one side of \(\Phi_{1}\) and every point of class \(-1\) lies on the other side of \(\Phi_{2}\), choose the one whose parameter vector \(w\) has the smallest sum of squares.
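To make the optimization concrete, here is a minimal sketch that solves this primal problem with the cvxpy modelling library (an assumed dependency; any quadratic-programming solver would do). The toy data below are generated purely for illustration and are not from the original post:

```python
# A sketch of the hard-margin primal problem, modelled with cvxpy.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),     # class +1 cluster
               rng.normal([-2, -2], 0.3, (20, 2))])  # class -1 cluster
y = np.hstack([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2   subject to   y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w.value))
```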

4. The dual problem and the KKT conditions

This is an optimization problem with inequality constraints. Applying the Lagrange multiplier method to the constrained problem above turns it into an unconstrained one, giving the Lagrangian:

\[L(w,b,\lambda) = \dfrac{1}{2}||w||^{2}+\sum_{i=1}^{m}\lambda_{i}(1-y_{i}(w^{T}x_{i}+b)) \]

Taking partial derivatives of \(L\) with respect to \(w\) and \(b\) and setting them to zero:

\[\dfrac{\partial{L(w,b,\lambda)}}{\partial w} = w-\sum_{i=1}^{m}\lambda_{i}y_{i}x_{i}=0\\ \dfrac{\partial{L(w,b,\lambda)}}{\partial b} = \sum_{i=1}^{m}\lambda_{i}y_{i} = 0 \]

Substituting \(w = \sum_{i=1}^{m}\lambda_{i}y_{i}x_{i}\) and \(\sum_{i=1}^{m}\lambda_{i}y_{i} = 0\) back into the Lagrangian (the term \(b\sum_{i=1}^{m}\lambda_{i}y_{i}\) vanishes) gives:

\[L(w,b,\lambda) = \dfrac{1}{2}(\sum_{i=1}^{m}\lambda_{i}y_{i}x_{i})^{T}\sum_{j=1}^{m}\lambda_{j}y_{j}x_{j}+\sum_{i=1}^{m}\lambda_{i} - \sum_{i=1}^{m}\lambda_{i}y_{i}(w^{T}x_{i})\\ =\dfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\lambda_{i}y_{i}\lambda_{j}y_{j}x_{i}^{T}x_{j}+\sum_{i=1}^{m}\lambda_{i} - \sum_{i=1}^{m}\sum_{j=1}^{m}\lambda_{i}y_{i}\lambda_{j}y_{j}x_{i}^{T}x_{j}\\ =\sum_{i=1}^{m}\lambda_{i} - \dfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\lambda_{i}y_{i}\lambda_{j}y_{j}x_{i}^{T}x_{j} \]

That is, the dual of the original optimization problem can be written as:

\[\mathop{argmax}\limits_{\lambda}\sum_{i=1}^{m}\lambda_{i} - \dfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\lambda_{i}y_{i}\lambda_{j}y_{j}x_{i}^{T}x_{j}, \quad s.t. \quad \sum_{i=1}^{m}\lambda_{i}y_{i} = 0, \quad \lambda_{i} \geq 0,\quad i = 1,2,...,m \]

Finally, after solving for \(\lambda\), we substitute it into \(w = \sum_{i=1}^{m}\lambda_{i}y_{i}x_{i}\) to obtain \(w\), and then \(b\) can be computed from the support vectors. A sketch of this procedure is given below.
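The sketch below solves the dual with cvxpy (an assumed dependency) and then recovers \(w\) and \(b\) from the optimal multipliers; the toy data follow the same construction as in the primal sketch above. Note that the quadratic term is written as \(\frac{1}{2}\|\sum_{i}\lambda_{i}y_{i}x_{i}\|^{2}\), which equals the double sum in the dual objective.

```python
# A sketch of the dual problem solved with cvxpy, plus recovery of w and b.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),
               rng.normal([-2, -2], 0.3, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
m = len(y)

lam = cp.Variable(m)                 # Lagrange multipliers λ_i
G = y[:, None] * X                   # row i is y_i * x_i

# maximize Σλ_i - (1/2)||Σ λ_i y_i x_i||^2   s.t.   Σ λ_i y_i = 0,  λ_i >= 0
# (the quadratic term equals ΣΣ λ_i λ_j y_i y_j x_i^T x_j)
objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(G.T @ lam))
constraints = [lam >= 0, y @ lam == 0]
cp.Problem(objective, constraints).solve()

lam_val = lam.value
w = (lam_val * y) @ X                # w = Σ λ_i y_i x_i
sv = lam_val > 1e-6                  # support vectors: λ_i > 0 (up to tolerance)
# On the support vectors y_i (w^T x_i + b) = 1 (see the KKT conditions below),
# so b can be read off any of them; averaging improves numerical stability.
b = float(np.mean(y[sv] - X[sv] @ w))
print("support vector indices:", np.where(sv)[0])
print("w =", w, " b =", b)
```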

When transforming the original problem into its dual, the solution of the original problem must satisfy the KKT conditions, namely:

\[\left\{ \begin{array}{l} \lambda_{i} \geq 0 \\ y_{i}f(x_{i})-1 \geq 0 \\ \lambda_{i}(y_{i}f(x_{i})-1) = 0 \end{array} \right. \]

These conditions seem to be the applicability premise of the Lagrange multiplier method covered in the Tongji University Advanced Mathematics textbook (7th edition)? Interested readers can check this for themselves; it has been many years since I studied the mathematics, so please point out anything I have missed.

This leads to a very interesting consequence, namely:

If \(\lambda_{i}=0\), then \(y_{i}f(x_{i}) \geq 1\).

If \(\lambda_{i}>0\), then \(y_{i}f(x_{i}) = 1\): the corresponding sample lies exactly on the margin boundary and is a support vector. Therefore, most of the training samples need not be retained after training, and the final model depends only on the support vectors, as the sketch below illustrates.
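Below is a hedged sketch of this sparsity property, using scikit-learn's SVC with a very large C as a stand-in for the hard-margin SVM (scikit-learn is an assumed dependency; the toy data are generated purely for illustration). Refitting on the support vectors alone reproduces essentially the same hyperplane:

```python
# "Only the support vectors matter": SVC with huge C approximates hard margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),
               rng.normal([-2, -2], 0.3, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e10).fit(X, y)
print("training points:", len(X))
print("support vectors:", len(clf.support_))      # usually only a handful

# Refitting on the support vectors alone gives (essentially) the same hyperplane,
# because the maximum-margin hyperplane depends only on them.
clf_sv = SVC(kernel="linear", C=1e10).fit(X[clf.support_], y[clf.support_])
print("w from all data:        ", clf.coef_)
print("w from support vectors: ", clf_sv.coef_)
```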

The analysis above covers the theory and the solution process of the SVM when the training samples are linearly separable, i.e. the hard-margin SVM. When the training samples are not linearly separable, a soft-margin SVM and kernel functions are needed; due to limited space, these will be explained in a follow-up article.

Copyright notice
This article was written by [ZhiboZhao]. If you repost it, please include a link to the original; thank you.
https://chowdera.com/2021/06/20210624004352819c.html
