A support vector machine (SVM) is a non-probabilistic binary linear classifier that separates \(p\)-dimensional input vectors with a hyperplane.

### 1. The hyperplane (hyperplane)

In a \(p\)-dimensional input space, a hyperplane is a subspace of dimension \(p-1\). For example, in a two-dimensional input space a hyperplane is one-dimensional, i.e. a straight line:

$$\alpha_{0} + \alpha_{1}x_{1} + \alpha_{2}x_{2} = 0$$
In a three-dimensional input space, a hyperplane is two-dimensional, i.e. a plane:

$$\alpha_{0} + \alpha_{1}x_{1} + \alpha_{2}x_{2} + \alpha_{3}x_{3} = 0$$
When the input dimension \(p > 3\), the hyperplane is hard to visualize intuitively, but it can simply be thought of as the set that divides the input space into two parts. The general equation of the hyperplane is:

$$\alpha_{0} + \alpha_{1}x_{1} + \alpha_{2}x_{2} + \cdots + \alpha_{p}x_{p} = 0$$
where the \(\alpha_{i}\) are the parameters of the hyperplane. The hyperplane thus divides \(p\)-dimensional space into two parts. Consider a \(p\)-dimensional input instance \(x = (x_{1}, x_{2}, \ldots, x_{p})\). When the instance lies on the hyperplane:

$$\alpha_{0} + \alpha_{1}x_{1} + \alpha_{2}x_{2} + \cdots + \alpha_{p}x_{p} = 0$$

When it does not lie on the hyperplane, then necessarily either:

$$\alpha_{0} + \alpha_{1}x_{1} + \alpha_{2}x_{2} + \cdots + \alpha_{p}x_{p} > 0$$

or:

$$\alpha_{0} + \alpha_{1}x_{1} + \alpha_{2}x_{2} + \cdots + \alpha_{p}x_{p} < 0$$
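This sign test can be sketched in a few lines of Python (the hyperplane and points here are made-up examples):

```python
def hyperplane_side(alpha, x):
    """Evaluate alpha_0 + alpha_1*x_1 + ... + alpha_p*x_p; the sign gives the side."""
    value = alpha[0] + sum(a * xi for a, xi in zip(alpha[1:], x))
    if value == 0:
        return 0          # on the hyperplane
    return 1 if value > 0 else -1

# The line x1 + x2 - 1 = 0 in 2-D, i.e. alpha = (-1, 1, 1)
alpha = (-1.0, 1.0, 1.0)
print(hyperplane_side(alpha, (0.5, 0.5)))  # on the line  -> 0
print(hyperplane_side(alpha, (1.0, 1.0)))  # one side     -> 1
print(hyperplane_side(alpha, (0.0, 0.0)))  # other side   -> -1
```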
### 2. Binary classification with a hyperplane

Consider a set of inputs \(x = (x_{1}, x_{2}, \ldots, x_{n})\), where \(x_{i} = (x_{i1}, x_{i2}, \ldots, x_{ip})^{T}\). Each input carries a label from one of two classes, written here as \(y_{i} \in \{-1, +1\}\). From the training data \(T = ((x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n}))\) we want to learn the SVM parameters \((\alpha_{0}, \alpha_{1}, \ldots, \alpha_{p})\), so that any new \(p\)-dimensional input \(x_{n+1}\) is classified correctly. If one side of the hyperplane is labelled \(-1\) and the other \(+1\), the training criterion can be expressed as:

$$\alpha_{0} + \alpha_{1}x_{i1} + \cdots + \alpha_{p}x_{ip} > 0 \quad \text{if } y_{i} = +1$$

$$\alpha_{0} + \alpha_{1}x_{i1} + \cdots + \alpha_{p}x_{ip} < 0 \quad \text{if } y_{i} = -1$$
These two cases can be unified into a single training criterion:

$$y_{i}\,(\alpha_{0} + \alpha_{1}x_{i1} + \cdots + \alpha_{p}x_{ip}) > 0$$
In matrix form, with \(w = (\alpha_{1}, \ldots, \alpha_{p})^{T}\) and \(b = \alpha_{0}\), the hyperplane equation is written as \(f(x) = w^{T}x + b\).
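The unified criterion \(y_{i}f(x_{i}) > 0\) is easy to check numerically. A minimal sketch, with made-up weights and samples:

```python
def f(w, b, x):
    """Decision function f(x) = w^T x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def correctly_classified(w, b, samples):
    """Every labelled sample (x, y) must satisfy y * f(x) > 0."""
    return all(y * f(w, b, x) > 0 for x, y in samples)

# Made-up weights and samples: the line x1 + x2 - 1 = 0 separates these points.
w, b = [1.0, 1.0], -1.0
samples = [((1.0, 1.0), +1), ((2.0, 0.0), +1), ((0.0, 0.0), -1)]
print(correctly_classified(w, b, samples))  # -> True
```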

In theory, however, by rotating and translating the hyperplane, infinitely many SVM classifiers satisfy this condition, as shown in the left part of the figure below. Taking two-dimensional input as an example, treat the blue dots as one class and the red dots as the other; the black lines are the learned classifiers. Which one should we choose?

The 1st classifier passes too close to both the blue and red points: even a small perturbation could cause a misclassification, so its robustness is poor.

The 2nd and 3rd classifiers both look acceptable, but how should the selection criterion be defined?

A natural idea is to choose the maximal margin hyperplane (\(maximal \quad margin \quad hyperplane\)): compute the distance from every point in the \(p\)-dimensional space to the hyperplane, and pick the hyperplane that maximizes the smallest such distance (the \(margin\)). Points close to the hyperplane are the easiest to misclassify, so maximizing this minimum distance gives the classifier a degree of robustness, and in this way we find the optimal hyperplane. Under this strategy, the optimal hyperplane in the figure above can be depicted as in the figure below:

From the figure above we can see that three training observations are equidistant from the maximum margin hyperplane and lie along the dashed lines marking the width of the margin. These three observations are called support vectors. Interestingly, the maximum margin hyperplane depends only on the support vectors, not on the other observations: moving any other observation does not affect the separating hyperplane, provided the move does not make it cross the \(margin\).
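The margin of a candidate hyperplane is simply the smallest point-to-hyperplane distance over the training set. A small sketch (data and hyperplanes invented for illustration) shows how two valid separators can have very different margins:

```python
import math

def margin(w, b, points):
    """Smallest distance from any sample to the hyperplane w^T x + b = 0."""
    norm = math.sqrt(sum(v * v for v in w))
    return min(abs(sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
               for x, _ in points)

pts = [((1.0, 1.0), +1), ((0.0, 0.0), -1)]
# Two separating lines for the same two points; the second leaves the wider margin.
print(margin([1.0, 0.0], -0.4, pts))  # line x1 = 0.4     -> 0.4
print(margin([1.0, 1.0], -1.0, pts))  # line x1 + x2 = 1  -> 1/sqrt(2)
```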

### 3. Formulating the maximum margin (\(maximum \quad margin\))

Recall from high-school mathematics that for a plane \(\Phi: Ax + By + Cz + D = 0\), the normal vector is \([A, B, C]\), and the distance from any point \((x, y, z)\) in space to the plane is:

$$d = \frac{|Ax + By + Cz + D|}{\sqrt{A^{2} + B^{2} + C^{2}}}$$
By the same reasoning, in \(p\)-dimensional space a \((p-1)\)-dimensional hyperplane is expressed as \(\Phi: w^{T}x + b = 0\), its normal vector is \(w\), and the distance from any point \(x_{i}^{T} = (x_{i1}, x_{i2}, \ldots, x_{ip})\) to the hyperplane is:

$$d_{i} = \frac{|w^{T}x_{i} + b|}{\lVert w \rVert}$$
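The distance formula generalizes directly to any dimension. A minimal sketch, using a 3-D plane whose distance can be verified by hand:

```python
import math

def distance_to_hyperplane(w, b, x):
    """Point-to-hyperplane distance |w^T x + b| / ||w||."""
    dot = sum(wj * xj for wj, xj in zip(w, x))
    return abs(dot + b) / math.sqrt(sum(wj * wj for wj in w))

# Plane x + y + z - 3 = 0 and the origin: distance is 3 / sqrt(3) = sqrt(3).
print(distance_to_hyperplane([1.0, 1.0, 1.0], -3.0, [0.0, 0.0, 0.0]))
```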
Now require the hyperplane to classify correctly, and let the support vectors satisfy \(|w^{T}x_{i} + b| = \beta\). The classification condition for every sample then becomes:

$$y_{i}\,(w^{T}x_{i} + b) \ge \beta$$
Observe that scaling \(w\) and \(b\) does not change the position of the hyperplane, so the value of \(\beta\) does not matter: rescaling via \(w' = w/\beta,\ b' = b/\beta\) always turns the condition into \(y_{i}(w^{T}x_{i} + b) \ge 1\). The distance between the two margin boundaries then becomes \(2/\lVert w \rVert\), so the SVM model is:

$$\max_{w, b} \frac{2}{\lVert w \rVert} \quad \text{s.t.} \quad y_{i}(w^{T}x_{i} + b) \ge 1, \quad i = 1, 2, \ldots, n$$

Equivalently:

$$\min_{w, b} \frac{1}{2}\lVert w \rVert^{2} \quad \text{s.t.} \quad y_{i}(w^{T}x_{i} + b) \ge 1, \quad i = 1, 2, \ldots, n$$
In words: shift the hyperplane by one unit to either side to obtain hyperplanes \(\Phi_{1}, \Phi_{2}\); find the hyperplane such that all \(+1\) points lie on or beyond \(\Phi_{1}\), all \(-1\) points lie on or beyond \(\Phi_{2}\), and the sum of squares of the hyperplane parameters \(w\) is as small as possible.
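To make the optimization concrete, here is a deliberately naive brute-force search over \((w, b)\) for a tiny made-up 1-D training set. Real implementations solve the quadratic program instead; this grid search is only for illustration:

```python
# Hard-margin SVM by brute force on a 1-D toy set (illustration only --
# real implementations solve the quadratic program instead).
points = [(-1.0, -1), (-2.0, -1), (1.0, +1), (3.0, +1)]  # (x, y) pairs

def feasible(w, b):
    """Are all constraints y_i * (w * x_i + b) >= 1 satisfied?"""
    return all(y * (w * x + b) >= 1 for x, y in points)

# Scan a grid of (w, b) values and keep the feasible pair with smallest w^2.
best = None
for wi in range(-200, 201):
    for bi in range(-200, 201):
        w, b = wi / 100, bi / 100
        if feasible(w, b) and (best is None or w * w < best[0] ** 2):
            best = (w, b)

print(best)  # the separating hyperplane with the widest margin -> (1.0, 0.0)
```

On this data the constraints for the closest positive and negative points (\(x = 1\) and \(x = -1\)) force \(w \ge 1\), so the minimizer is \(w = 1,\ b = 0\), which the grid search recovers.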

### 4. The dual problem and the KKT conditions

The primal problem above is a constrained optimization with inequality constraints. Applying the Lagrange multiplier method with multipliers \(\lambda_{i} \ge 0\) turns it into an unconstrained one:

$$L(w, b, \lambda) = \frac{1}{2}\lVert w \rVert^{2} + \sum_{i=1}^{n}\lambda_{i}\left(1 - y_{i}(w^{T}x_{i} + b)\right)$$
Setting the partial derivatives of \(L\) with respect to \(w\) and \(b\) to zero:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n}\lambda_{i}y_{i}x_{i} = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{n}\lambda_{i}y_{i} = 0$$
Substituting \(w = \sum_{i=1}^{n}\lambda_{i}y_{i}x_{i}\) and \(\sum_{i=1}^{n}\lambda_{i}y_{i} = 0\) back into \(L\) eliminates \(w\) and \(b\), giving the dual problem:

$$\max_{\lambda} \ \sum_{i=1}^{n}\lambda_{i} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_{i}\lambda_{j}y_{i}y_{j}x_{i}^{T}x_{j} \quad \text{s.t.} \quad \sum_{i=1}^{n}\lambda_{i}y_{i} = 0, \quad \lambda_{i} \ge 0$$
Finally, solving the dual for \(\lambda\) gives \(w = \sum_{i=1}^{n}\lambda_{i}y_{i}x_{i}\), and \(b\) then follows from any support vector \(x_{s}\) via \(y_{s}(w^{T}x_{s} + b) = 1\).
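For a made-up two-point problem the dual can be maximized by a simple scan over the single free multiplier, after which \(w\) and \(b\) are recovered as above. This is only an illustration, not a practical solver (real implementations use quadratic programming, e.g. SMO):

```python
# Dual of the hard-margin SVM for a two-point toy set: x1 = -1 (y = -1), x2 = +1 (y = +1).
xs = [-1.0, 1.0]
ys = [-1, 1]

def dual_objective(lam):
    """sum_i lam_i - 0.5 * sum_ij lam_i lam_j y_i y_j x_i x_j"""
    linear = sum(lam)
    quad = sum(lam[i] * lam[j] * ys[i] * ys[j] * xs[i] * xs[j]
               for i in range(2) for j in range(2))
    return linear - 0.5 * quad

# The equality constraint lam1*y1 + lam2*y2 = 0 forces lam1 == lam2 here,
# so only one degree of freedom is left to scan.
best_l, best_v = 0.0, float("-inf")
for k in range(1001):
    l = k / 1000
    v = dual_objective([l, l])
    if v > best_v:
        best_l, best_v = l, v

w = sum(best_l * ys[i] * xs[i] for i in range(2))  # w = sum_i lam_i y_i x_i
b = ys[1] - w * xs[1]                              # from y_s (w x_s + b) = 1
print(best_l, w, b)  # -> 0.5 1.0 0.0
```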

Transforming the primal problem into the dual requires the \(KKT\) conditions to hold, namely:

$$\lambda_{i} \ge 0, \qquad y_{i}f(x_{i}) - 1 \ge 0, \qquad \lambda_{i}\left(y_{i}f(x_{i}) - 1\right) = 0$$
This condition is the prerequisite for applying the Lagrange multiplier method stated in the 7th edition of Tongji University's *Advanced Mathematics*; interested readers can verify it for themselves. It has been many years since I studied the mathematics, so please point out anything I have missed.

The last (complementary slackness) condition leads to a very interesting phenomenon, namely:

If \(\lambda_{i} = 0\), then \(y_{i}f(x_{i}) \ge 1\), and the sample contributes nothing to \(w\);

If \(\lambda_{i} > 0\), then \(y_{i}f(x_{i}) = 1\), and the sample lies exactly on the maximum margin boundary (a support vector). Most training samples therefore need not be retained after training: the final model depends only on the support vectors.
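Given trained parameters \(w\) and \(b\) (made-up values here), the support vectors are exactly the samples whose functional margin \(y_{i}f(x_{i})\) equals 1:

```python
# With hypothetical trained parameters w = [1, 1], b = -1, the support vectors
# are exactly the samples whose functional margin y * f(x) equals 1.
w, b = [1.0, 1.0], -1.0
samples = [((1.0, 1.0), +1), ((2.0, 2.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

def f(x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

support = [x for x, y in samples if abs(y * f(x) - 1.0) < 1e-9]
print(support)  # -> [(1.0, 1.0), (0.0, 0.0)]
```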

**The analysis above covers the theory and solution procedure of the SVM when the training samples are linearly separable, i.e. the hard-margin SVM. When the training samples are not linearly separable, a soft margin and kernel functions are needed; for reasons of space, these will be explained in a following article.**