# Hard threshold of support vector machine

2021-06-24 00:44:43

Support vector machine ( support vector machine, SVM ) It's using hyperplanes for a given p Non probabilistic binary linear classifier based on dimension vector .

### One 、 hyperplane ( hyperplane )

In a p Dimension input space , The hyperplane is $$p-1$$ The subspace of dimension . such as ： In a two-dimensional input space , A hyperplane is one dimension , It's a straight line . The formula is as follows ：

$b + w_{1}x_{11} + w_{2}x_{12}=0$

In a three-dimensional input space , A hyperplane is two-dimensional , It's a plane , The formula is as follows ：

$b + w_{1}x_{11} + w_{2}x_{12} + w_{3}x_{13}=0$

When you enter dimensions $$p ＞ 3$$ when , It's hard to understand the hyperplane intuitively , But it can be simply recorded as ： The hyperplane divides the input space into two parts , So the final mathematical expression of the hyperplane is ：

$b + w_{1}x_{11} + w_{2}x_{12} +...+ w_{p}x_{1p} = 0$

among ,$$\alpha_{i}$$ The parameter representing the hyperplane . therefore , The hyperplane will be a p The dimensional space is divided into two parts , Consider one $$p$$ Input instance of dimension $$x = (x_{1}, x_{2},..., x_{p})$$, When the instance is on a hyperplane , Yes ：

$b + w_{1}x_{11} + w_{2}x_{12} +...+ w_{p}x_{1p} = 0$

When it is not on the hyperplane , There must be ：

$b + w_{1}x_{11} + w_{2}x_{12} +...+ w_{p}x_{1p} > 0$

perhaps ：

$b + w_{1}x_{11} + w_{2}x_{12} +...+ w_{p}x_{1p} < 0$

### Two 、 Using hyperplanes for binary classification

Consider a series of inputs $$x = (x_{1}, x_{2},...,x_{n})$$, among $$x_{i} = (x_{i1}, x_{i2},...,x_{ip})^{T}$$ , So the corresponding label It should belong to different classes , Here we use $$(1, -1)$$ To express , namely ：$$y = (y_{1}, y_{2},...,y_{n}) \in (-1, 1)$$. So we need training data based on these observations $$T = ( (x_{1},y_{1}), (x_{2},y_{2}),...,(x_{n},y_{n})$$ To learn SVM Parameters of $$(\alpha_{1}, \alpha_{2},...,\alpha_{p})$$, Make it possible for any given $$p$$ Dimension input $$x_{n+1}$$ Make the right judgment . In the process of training , If on one side of the hyperplane is $$-1$$, So on the other side is $$+1$$, So the standard of training can be expressed as ：

$b + w_{1}x_{i1} + w_{2}x_{i2} +...+ w_{p}x_{ip} > 0,\quad if \quad y_{i}=1 \\ b + w_{1}x_{i1} + w_{2}x_{i2} +...+ w_{p}x_{ip} < 0,\quad if \quad y_{i}=-1$

therefore , The final training standard can be unified as ：

$(b + w_{1}x_{i1} + w_{2}x_{i2} +...+ w_{p}x_{ip})y_{i}>0$

The final hyperplane equation is written in matrix form as ：$$f(x) = w^{T}x + b$$

But in theory, by spinning , Translation and other operations , Meet this condition SVM There are countless classifiers , As shown in the left part of the figure below , Let's take two-dimensional input as an example , Think of the blue dots as the same kind , And the red dots are the second kind , The black line is the classifier learned , So which one should I choose ？ The first 1 One classifier is too close to the blue and pink dots , A little bit of disturbance is likely to cause the wrong classification , Poor robustness

The first 2,3 A classifier seems to be OK , But how to define the selection criteria ？

A natural idea is to choose the largest edge hyperplane $$(maximal \quad margin \quad hyperplane)$$. namely ： We can calculate $$p$$ The distance from all the points in a dimensional space to the hyperplane , And the smallest distance （$$margin$$） Maximum . Because the closer the point is to the hyperplane, the easier it is to distinguish , After maximizing these distances, we can have a certain degree of robustness , So we can find the optimal hyperplane . So through this strategy , The optimal hyperplane in the figure above can be described as the figure below ： From the figure above, we can see that the three training observations are equidistant from the maximum margin hyperplane , And along the dotted line indicating the width of the margin . So these three observations are called support vectors . Interestingly , The maximum edge hyperplane depends directly on the support vector , It doesn't depend on other observations , Any other observed movement does not affect the separation hyperplane , The premise is that the movement of observation does not cause it to span $$margin$$.

### 3、 ... and 、 Maximum interval threshold （$$maximum\quad margin$$） Express

Recall the relevant content of high school mathematics , For a plane $$\Phi:Ax+By+Cz+D=0$$, Then the normal vector is $$[A, B, C]$$, Any point in space $$(x,y,z)$$ The distance to the plane is ：

$distance = \dfrac{|Ax_{0}+By_{0}+Cz_{0}+D|}{\sqrt {A^{2}+B^{2}+C^{2}}}$

So in the same way , stay $$p$$ In dimensional space ,$$p-1$$ The dimensional hyperplane is expressed as ：$$\Phi: w^{T}x = 0$$, Then the normal vector is $$w^{T}$$, Any point in space $$x_{i}^{T} = (x_{i1},x_{i2},...,x_{ip})$$ The distance to the hyperplane is ：

$distance = \dfrac{|w^{T}x_{i}+b|}{w^{2}} = \dfrac{|w^{T}x_{i}+b|}{||w||}$

Adding hyperplanes enables correct classification , So let the distance from the support vector to the hyperplane be $$\beta$$, We can get the value of classification as ：

$(b + w_{1}x_{i1} + w_{2}x_{i2} +...+ w_{p}x_{ip})y_{i}>\beta$

namely ：$$w^{T}x > \beta$$, Through observation, we found that , The zoom $$w$$ The value of does not affect the position of the hyperplane , So it doesn't matter $$\beta$$ Take any value , In the end, they all pass $$w'=w/\beta,\quad b'=b/\beta$$ Turn the result into ：$$w^{T}x+b > 1$$, So the distance between the support vectors becomes ： $$2/||w||$$, So finally SVM And that's the model of ：

$\mathop{argmax} \limits_{w} \dfrac{2}{||w||}\quad \quad s.t.\quad y_{i}(w^{T}x_{i}+b)>1,\quad x = 1, 2,...,m$

Equivalent to ：

$\mathop{argmin} \limits_{w} \dfrac{1}{2}||w||^{2}\quad s.t.\quad y_{i}(w^{T}x_{i}+b)>1,\quad x = 1,2,...,m$

namely ： Move the hyperplane left and right 1 One unit gets the hyperplane $$\Phi_{1},\Phi_{2}$$, Find out all of them for +1 The point of class is $$\Phi_{1}$$ left , All for -1 The point of class is $$\Phi_{2}$$ On the right side , And satisfy all the parameters of the hyperplane at the same time $$w$$ The hyperplane with the smallest sum of squares of .

### Four 、 Dual problem and KKT Conditions

Solving inequalities with constraints , According to the Lagrange multiplier method, the formula $$(13)$$ It's an unconstrained inequality , obtain ：

$L(w,b,\lambda) = \dfrac{1}{2}||w||^{2}+\sum_{i=1}^{m}\lambda_{i}(1-y_{i}(w^{T}x_{i}+b))$

So we were right $$L$$ Finding partial derivatives ：

$\dfrac{\partial{L(w,b,\lambda)}}{\partial w} = w-\sum_{i=1}^{m}\lambda_{i}y_{i}x_{i}=0\\ \dfrac{\partial{L(w,b,\lambda)}}{\partial b} = \sum_{i=1}^{m}\lambda_{i}y_{i} = 0$

What will be obtained $$w = \sum_{i=1}^{m}\lambda_{i}y_{i}x_{i},\quad \sum_{i=1}^{m}\lambda_{i}y_{i} = 0$$ Substituting $$(14)$$ obtain ：

$L(w,b,\lambda) = \dfrac{1}{2}(\sum_{i=1}^{m}\lambda_{i}y_{i}x_{i})^{T}\sum_{j=1}^{m}\lambda_{j}y_{j}x_{j}+\sum_{i=1}^{m}\lambda_{i} - \sum_{i=1}^{m}\lambda_{i}y_{i}(w^{T}x_{i})\\ =\dfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\lambda_{i}y_{i}\lambda_{j}y_{j}x_{i}^{T}x_{j}+\sum_{i=1}^{m}\lambda_{i} - \sum_{i=1}^{m}\sum_{j=1}^{m}\lambda_{i}y_{i}\lambda_{j}y_{j}x_{i}^{T}x_{j}\\ =\sum_{i=1}^{m}\lambda_{i} - \dfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\lambda_{i}y_{i}\lambda_{j}y_{j}x_{i}^{T}x_{j}$

That is the final formula $$(14)$$ The dual problem of can be written as ：

$\mathop{argmax}\limits_{\lambda}\sum_{i=1}^{m}\lambda_{i} - \dfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\lambda_{i}y_{i}\lambda_{j}y_{j}x_{i}^{T}x_{j}, \quad s.t. \quad \sum_{i=1}^{m}\lambda_{i}y_{i} = 0$

Finally, by finding out $$\lambda$$ , Bring it to the formula $$(15)$$ And then we can find out $$w$$ , And then we can find out $$b$$.

In the process of transforming the original problem into a dual problem , The original problem needs to be met $$KKT$$ Conditions , namely ：

$\begin{equation} \left\{ \begin{array}{lr} \lambda_{i}>=0; & \\ y_{i}f(x_{i})-1 >=0; & \\ \lambda_{i}(y_{i}f(x_{i})-1) = 0 ; \end{array} \right. \end{equation}$

This condition is the seventh edition of Tongji 《 Advanced mathematics 》 On the application premise of Lagrange multiplier method ？ Interested friends can check it by themselves , After all, I haven't studied mathematics for many years , If there is any omission, please correct it .

So there's a very interesting phenomenon , namely ：

if $$\lambda_{i}=0$$, be $$y_{i}f(x_{i}) >= 1$$

if $$\lambda_{i}>0$$, be $$y_{i}f(x_{i}) = 1$$, The corresponding sample just falls on the maximum boundary ( Support vector ), Therefore, most of the training samples do not need to be retained after the training , The final model is only about support vectors .

The above main analysis is when the training samples are linearly separable ,SVM Some of the theory and solution process , That is hard threshold SVM. When the training samples are not linearly separable, it is necessary to design soft threshold and kernel function SVM, Limited to space , It will be explained in the following article .

https://chowdera.com/2021/06/20210624004352819c.html