CNN notes: an accessible introduction to convolutional neural networks
2020-12-07 13:28:59 【osc_ kedi1mvz】
Convolutional neural networks (notes from cs231n and the May DL class)
In 2012 I organized eight sessions of a machine learning reading group in Beijing. At that time "machine learning" was very hot, and many people were passionate about it. When I came back to Beijing in 2013, one term seemed even hotter than "machine learning": "deep learning".
This blog has carried a number of articles on machine learning, but the most recent technical one, "LDA topic model", dates back to November 2014. Since I started an online education business in 2015, chores and trivia have piled up; I kept wanting to write more technical articles but could never find the time. Still, because the company keeps launching online courses on machine learning, deep learning and related topics, I am surrounded by the material and keep learning along the way.
I don't teach any of the courses myself (all online courses at my company, July Online, are taught by our team of more than 100 lecturers), but I can still write up things that look complicated at first sight in the most beginner-friendly way. That is the value of picking technical blogging back up.
In DL there is one very important concept, the convolutional neural network (CNN), which is basically a must-understand when getting started with DL. This article is based on the Stanford machine learning open course, cs231n, and Han Xiaoyang's May DL class at July Online; it is a set of course notes.
At first I only wanted to focus on how the convolution operation in a CNN is computed, but I kept adding material, including much of my own understanding, so it turned into an accessible introduction to convolutional neural networks. If you find any problems, corrections are welcome.
2 Artificial neural network
A neural network is made up of a large number of interconnected neurons. Each neuron receives a linear combination of its inputs. At first this was just a simple linear weighting; later a nonlinear activation function was added to each neuron, and its result is the neuron's output. Each connection between two neurons carries a weighting value, called a weight. Different weights and activation functions lead to different network outputs.
Take handwriting recognition as an example: given an unknown digit, let the neural network recognize which digit it is. The input of the network is a set of input neurons activated by the pixels of the input image. After a nonlinear transformation by the activation function, the neurons are activated and pass their values on to other neurons. The process repeats until the final output neuron is activated, which identifies the digit.
Each neuron of a neural network looks as follows
It has the basic form wx + b, where
- x1, x2 denote the input vector
- w1, w2 are the weights: there are as many weights as inputs, that is, each input is given its own weight
- b is the bias
- g(z) is the activation function
- a is the output
If that were all I said, nine out of ten people who have never met this before would be confused. In fact, this simple model can be traced back to the perceptron of the 1950s/60s. A perceptron can be understood as a model that makes a decision based on several factors and the importance of each factor.
For instance, the Strawberry Music Festival is on in Beijing this weekend. Do you go or not? Two factors determine whether you go, and they correspond to two inputs, denoted x1 and x2. The two factors also influence the decision to different degrees, and the degree of influence of each is expressed by the weights w1 and w2. Generally speaking, the line-up has a great influence on whether you go: if the performers are good you can put up with going alone, but if they are not, you might as well get on stage and sing yourself. So we can set things up as follows:
- x1: whether there is a singer you like. x1 = 1 means you like the performers, x1 = 0 means you don't. The weight of the performer factor is w1 = 7
- x2: whether someone will go with you. x2 = 1 means someone will accompany you, x2 = 0 means no one will. The weight of being accompanied is w2 = 3.
With that, our decision model is established: g(z) = g(w1*x1 + w2*x2 + b), where g is the activation function and b can be understood as a bias term adjusted to better achieve the goal.
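The festival decision can be sketched in a few lines of Python. The weights 7 and 3 come from the text; the bias b = -5 and the step-shaped activation ("go if z is positive") are assumptions made up for this illustration, since the text does not fix them.

```python
def weighted_input(x1, x2, w1=7, w2=3, b=-5):
    """z = w1*x1 + w2*x2 + b for the festival example (b = -5 is a made-up value)."""
    return w1 * x1 + w2 * x2 + b


def decide(x1, x2):
    """Step activation (an assumption): go when the weighted input is positive."""
    return 1 if weighted_input(x1, x2) > 0 else 0


# Enumerate the four combinations of "like the performers" / "someone comes along".
for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"x1={x1} x2={x2} z={weighted_input(x1, x2):+d} go={decide(x1, x2)}")
```

With these assumed values, liking the performers (weight 7) is enough to go even alone, while company alone (weight 3) is not, which mirrors the reasoning in the example.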
At first, for simplicity, people defined the activation function as a linear function, that is, a linear transformation of the result. A simple linear activation function is g(z) = z, so the output is just a linear transformation of the input. Later it was found that linear activation functions are too limited, so nonlinear activation functions were introduced.
2.2 Activation function
The commonly used nonlinear activation functions are sigmoid, tanh, ReLU and so on. The first two, sigmoid and tanh, are more common in fully connected layers, while ReLU is common in convolutional layers. Here is a brief introduction to the most basic one, the sigmoid function (btw, it was mentioned at the beginning of the SVM article on this blog).
The sigmoid function is expressed as g(z) = 1/(1 + e^(-z)),
where z is a linear combination, for example z = b + w1*x1 + w2*x2. Substituting a very large positive number or a very negative number into g(z) shows that the result tends to 1 or 0 respectively.
The graph of the sigmoid function g(z) is as follows (the horizontal axis is the domain z, the vertical axis is the range g(z)):
In other words, the sigmoid function squashes a real number into the interval between 0 and 1. When z is a very large positive number, g(z) approaches 1; when z is a very negative number, g(z) approaches 0.
What is the use of squashing to between 0 and 1? The output of the activation function can then be read as a "classification probability". For example, an output of 0.9 can be interpreted as a 90% probability that the sample is positive.
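As a small check of the squashing behaviour described above, here is a minimal sketch of the sigmoid in Python. The two-branch form is only a numerical precaution added here (it avoids overflow for very negative z) and is not something the text prescribes.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Large negative inputs land near 0, large positive inputs near 1.
for z in (-30, -10, 0, 10, 30):
    print(z, sigmoid(z))
```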
For instance, take the picture below (figure from the Stanford machine learning open course)
z = b + w1*x1 + w2*x2, where the bias term b is taken to be -30 and w1, w2 are both taken to be 20
- If x1 = 0 and x2 = 0, then z = -30 and g(z) = 1/(1 + e^(-z)) tends to 0. The graph of the sigmoid function above also shows that when z = -30, the value of g(z) is close to 0
- If x1 = 0 and x2 = 1, or x1 = 1 and x2 = 0, then z = b + w1*x1 + w2*x2 = -30 + 20 = -10, and again the value of g(z) is close to 0
- If x1 = 1 and x2 = 1, then z = b + w1*x1 + w2*x2 = -30 + 20*1 + 20*1 = 10, and here g(z) tends to 1.
In other words, only when x1 and x2 are both 1 does g(z)→1, a positive sample; when x1 or x2 is 0, g(z)→0, a negative sample. This achieves the purpose of classification.
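The three cases above can be reproduced directly. This is a minimal sketch using the values from the text (b = -30, w1 = w2 = 20); the 0.5 cut-off used to print "positive" or "negative" is an assumption added for readability.

```python
import math


def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x1, x2, w1=20, w2=20, b=-30):
    """Single neuron from the example: g(b + w1*x1 + w2*x2)."""
    return g(b + w1 * x1 + w2 * x2)


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p = neuron(x1, x2)
    # The 0.5 threshold is an illustrative choice, not part of the original example.
    print(x1, x2, round(p, 5), "positive" if p > 0.5 else "negative")
```

Only the input (1, 1) comes out positive, so with these weights the neuron behaves like a logical AND.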
2.3 Neural network
Take a single neuron like the one in the picture below.
Organize such neurons together and you get a neural network. Below is the structure of a three-layer neural network
The original input information on the far left of the figure above is called the input layer, the neurons on the far right are called the output layer (the output layer in the figure above has only one neuron), and the part in the middle is called the hidden layer.
What are the input layer, the output layer and the hidden layer?
- Input layer: many neurons accept a large number of non-linear input messages. The input message is called the input vector.
- Output layer: messages are transmitted, analyzed and weighed across the neuron links to form the output result. The output message is called the output vector.
- Hidden layer: the layers of neurons and links between the input layer and the output layer. Multiple hidden layers mean multiple activation functions.
Each layer may consist of one or more neurons, and the output of each layer is the input data of the next layer. For example, in the middle hidden layer of the figure below, the 3 hidden neurons a1, a2, a3 each receive inputs through several different weights (because there are three inputs x1, x2, x3, each of a1, a2, a3 receives x1, x2, x3 with their respective weights, that is, as many weights as inputs). Then a1, a2, a3, under their own different weights, serve as the input of the output layer, and finally the output layer produces the final result.
In the figure above (figure from the Stanford machine learning open course)
- a_i^(j) denotes the activation of unit i (a neuron) in layer j
- Θ^(j) denotes the weight matrix controlling the function that maps layer j to layer j+1
In addition, there is a bias unit for both the input layer and the hidden layer, so the figure above also includes the bias terms x0 and a0. For the figure above, the following formula holds
Everything above uses a single hidden layer, but in practice there can also be multiple hidden layers, that is, several hidden layers sandwiched between the input layer and the output layer. Adjacent layers are fully connected, and there are no connections between neurons in the same layer.
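To make the layer-by-layer computation concrete, here is a minimal forward pass for a network shaped like the one described: three inputs, three hidden neurons, one output, with a bias unit feeding each layer. All weight values are hypothetical numbers chosen for illustration, and sigmoid is assumed as the activation in both layers.

```python
import math


def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, inputs):
    """One fully connected layer: prepend the bias unit (1.0), then apply g to each row's weighted sum."""
    with_bias = [1.0] + inputs
    return [g(sum(w * v for w, v in zip(row, with_bias))) for row in theta]


# Hypothetical weights: each row is [bias weight, w1, w2, w3] for one neuron.
theta1 = [
    [-1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, -1.0, 0.0],
    [1.0, 0.0, 0.0, -2.0],
]
theta2 = [[-2.0, 1.5, 1.5, 1.5]]

x = [1.0, 0.0, 1.0]
hidden = layer(theta1, x)       # activations a1, a2, a3 of the hidden layer
output = layer(theta2, hidden)  # the single output neuron
print(hidden, output)
```

Each row of a weight matrix belongs to one neuron of the next layer, so "several inputs, several weights" shows up as the row length, and the output of one layer is literally the input list of the next.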
3 The hierarchical structure of convolutional neural networks
The cs231n course gives the hierarchical structure of a convolutional neural network, as in the picture below
What the CNN in the picture above has to do is this: given a picture, it is unknown whether it shows a car or a horse, or what kind of car it is, and we need a model to determine what is in the picture. In short, it outputs a result: if it is a car, what kind of car is it
- On the far left is the data input layer, which preprocesses the data, for example mean subtraction (centering every dimension of the input data at 0, so that an excessive offset in the data does not hurt training), normalization (scaling all the data to the same range), PCA/whitening and so on. A CNN only applies the mean-subtraction step, to the training set.
In the middle are
- CONV: the convolution layer, which takes linear products and sums them.
- RELU: the activation layer. As mentioned in section 2.2 above, ReLU is a kind of activation function.
- POOL: the pooling layer. In short, it takes the average or the maximum over a region.
On the far right is
- FC: the fully connected layer
Of these parts, the convolution layer is the heart of a CNN, and the rest of this article focuses on it.
4 The CNN convolution layer
4.1 How a CNN recognizes a pattern
In short, when we are given an "X" pattern, how can a computer recognize that the pattern is an "X"? One possible way is for the computer to store a standard "X" pattern and then compare the unknown pattern against it. If the two are the same, the unknown pattern is judged to be an "X".
Moreover, even if the unknown pattern is translated or slightly deformed, it should still be recognized as an X. To achieve this, a CNN compares the unknown pattern with the standard X pattern part by part, as shown in the figure below [figure from reference 25]
The process of comparing the unknown pattern with the standard X pattern part by part is the convolution operation. A convolution result of 1 represents a match; otherwise there is no match.
To be specific, suppose we need to determine whether an image contains an "X" or an "O", and it must be one of the two: if it is not an "X" it is an "O".
The ideal situation looks like this:
A standard "X" and "O": the letter sits in the center of the image, at a suitable scale, with no deformation
For a computer, as soon as the image changes a little it is no longer standard, and the problem is not so easy to solve:
One fairly naive way for the computer to solve the problem above is to save standard images of "X" and "O" (like the example given earlier), compare each new image with these two standard images, see which one matches better, and decide on that letter.
But doing so is very unreliable, because computers are rather rigid. In a computer's "vision", a picture looks like a two-dimensional array of pixels (think of a chessboard), where each position corresponds to a number. In our example, the pixel value "1" stands for white and the pixel value "-1" stands for black.
When two pictures are compared, if any pixel value does not match, the two pictures do not match, at least as far as the computer is concerned.
In this example, the computer finds that the white pixels of the two pictures above agree only in the middle 3*3 block of squares, and differ at the four corners:
Therefore, on the face of it, the computer judges that the picture on the right is not an "X" and that the two pictures are different, and comes to the conclusion:
But that seems unreasonable. Ideally, for images that have merely undergone simple transformations such as translation, scaling, rotation or slight deformation, we hope the computer can still recognize the "X" and "O". For cases like the following, we hope the computer can still recognize them quickly and accurately:
This is exactly the problem that CNNs set out to solve.
A CNN compares piece by piece. The "small pieces" it compares are called features. By looking for rough features at roughly the same positions in the two images and matching them, a CNN sees the similarity of the two pictures better than the traditional method of comparing the whole pictures pixel by pixel.
Each feature is like a small picture (a relatively small two-dimensional array of values). Different features match different characteristics of the image. In the case of the letter "X", features made up of diagonal lines and crosses capture most of the important characteristics of an "X".
These features are likely to match the four corners and the center of the letter X in any image that contains an "X". So how exactly does the matching work? As follows:
Starting to get the idea? But this is only the first step: you now know how the features are matched against the original image, but you don't yet know what mathematical calculation is going on, for example what exactly is done with the 3*3 patch below.
The mathematical operation here is what we usually call the "convolution" operation. Next, let's look at what the convolution operation is.
4.2 What is convolution
Taking the inner product (multiplying element by element and then summing) of the image (the data in different data windows) and a filter matrix (a set of fixed weights: because the weights of each neuron are fixed, it can be seen as a constant filter) is the so-called "convolution" operation, and it is where convolutional neural networks get their name.
Loosely speaking, the part in the red frame in the figure below can be understood as a filter, that is, a neuron with a set of fixed weights. Multiple filters stacked together form a convolution layer.
OK, take a concrete example. As shown in the picture below, the left part of the figure is the original input data, the middle part is the filter, and the right part is the new two-dimensional output data.
Breaking it down:
The numbers at corresponding positions are multiplied first and then summed.
The filter in the middle takes the inner product with the data window. The specific calculation is: 4*0 + 0*0 + 0*0 + 0*0 + 0*1 + 0*1 + 0*0 + 0*1 + -4*2 = -8
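The same multiply-then-sum can be written as a short function. The filter and window values below are read off the worked sum above, taking the first factor of each product as the filter weight and the second as the pixel in the data window (that reading of the products is an assumption; swapping the roles gives the same result).

```python
def window_filter_product(window, kernel):
    """Multiply values at corresponding positions and add everything up."""
    return sum(
        w * k
        for w_row, k_row in zip(window, kernel)
        for w, k in zip(w_row, k_row)
    )


# Values taken from the worked sum: 4*0 + 0*0 + 0*0 + 0*0 + 0*1 + 0*1 + 0*0 + 0*1 + -4*2
kernel = [[4, 0, 0], [0, 0, 0], [0, 0, -4]]
window = [[0, 0, 0], [0, 1, 1], [0, 1, 2]]

print(window_filter_product(window, kernel))  # -8
```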
4.3 Convolution on the image
In the calculation corresponding to the figure below, the input is data over a region of a certain size (width*height), and it takes the inner product with the filter (a neuron with a set of fixed weights) to obtain new 2D data.
Specifically, the left part is the image input and the middle part is the filter (a neuron with a set of fixed weights). Different filters produce different output data, for example color depth or contours. In other words, to extract different characteristics of an image, use different filters, each extracting the specific information you want from the image: color depth or contours.
As shown in the figure below
4.4 Animated GIF of the convolution
In a CNN, a filter (a neuron with a set of fixed weights) performs the convolution calculation on local input data. After the local data in one data window has been calculated, the data window keeps sliding until all the data has been covered. Several parameters are involved in this process:
a. depth: the number of neurons, which determines the depth (thickness) of the output. It is also the number of filters.
b. stride: determines how many steps it takes to slide to the edge.
c. zero-padding: add a few rings of zeros around the outer edge so that the window can slide from the initial position to the end position in whole strides. In plain terms, it makes the total length divisible by the stride.
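How stride and zero-padding interact with the window size can be summarized with the output-size formula used in the cs231n notes, (W - F + 2P)/S + 1, which the text above does not spell out. Below is a minimal sketch of it; the function name and the error raised for a non-integer result are choices made here for illustration.

```python
def output_size(input_size, filter_size, stride, padding=0):
    """Number of window positions along one axis: (W - F + 2P) / S + 1."""
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        # This is the situation zero-padding is meant to fix.
        raise ValueError("the window does not tile the padded input evenly")
    return span // stride + 1


print(output_size(7, 3, 2))     # 3: a 7-wide input, 3-wide filter, stride 2
print(output_size(5, 3, 2, 1))  # 3: a 5-wide input padded by 1 on each side
```

A 6-wide input with a 3-wide filter and stride 2 raises the error, while padding it to an effective width of 7 makes the strides fit.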
The cs231n course has an animated convolution diagram, apparently drawn with d3js and a util. I captured 18 frames from the cs231n animation in order and then used a gif tool to assemble them into an animated GIF of the convolution, shown below
You can see:
- Two neurons, i.e. depth=2, which means there are two filters.
- The data window takes a 3*3 patch of local data and moves two steps at a time, i.e. stride=2.
The two filters then each convolve over the sliding windows of the array, producing two different results.
If you look at the animation first you may not understand it right away, but combined with the explanation above it is not hard to follow:
- On the left is the input (in 7*7*3, 7*7 is the image's pixel width and height, and 3 stands for the three color channels R, G, B)
- The middle part is the two different filters, Filter w0 and Filter w1
- On the far right are the two different outputs
As the data window on the left slides across, the filters Filter w0 / Filter w1 perform the convolution calculation on different local data.
It is worth mentioning that:
- The data on the left keeps changing, and each filter convolves over one local data window at a time. This is the so-called local perception mechanism of CNNs.
- As an analogy, a filter is like a pair of eyes. The human field of view is limited: at a glance you can only see part of the world. If you had to take in the whole world at one glance you would be exhausted, and your brain could not absorb all that information at once. Of course, even when looking at a part, the human eye is biased and has preferences about the local information. For example, when looking at an attractive person, attention goes to the face, chest and legs, so the weights of these 3 inputs are relatively large.
Meanwhile, as the data window slides the input changes, but the weights of the filter Filter w0 in the middle (that is, the weights with which each neuron connects to the data window) stay fixed. This invariance of the weights is the so-called parameter (weight) sharing mechanism of CNNs.
- Here is another analogy: someone travels around the world, and the information they see keeps changing, but the eyes that gather the information do not change. btw, different people's eyes looking at the same local information perceive it differently (a thousand readers, a thousand Hamlets), so different filters are like different people's eyes, each giving different feedback.
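Local perception and weight sharing can both be seen in a plain sliding-window implementation: each output value depends only on one 3*3 window, and the same filter weights are reused at every window position. The sketch below uses a single channel, no padding, and a small made-up 0/1 image of an "X" with a diagonal filter; these inputs are assumptions for illustration, not the data from the cs231n animation.

```python
def conv2d(image, kernel, stride=1, bias=0):
    """Slide one square kernel over a 2D image (no padding) and return the output grid."""
    k = len(kernel)
    n_rows = (len(image) - k) // stride + 1
    n_cols = (len(image[0]) - k) // stride + 1
    out = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            top, left = r * stride, c * stride
            total = bias
            # The same kernel weights are reused for every window: weight sharing.
            for i in range(k):
                for j in range(k):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


# Made-up 5x5 image of an "X" (1 = stroke, 0 = background) and a diagonal filter.
image = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]
diagonal = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(conv2d(image, diagonal, stride=2))  # [[3, 1], [1, 3]]
```

The filter responds most strongly (3) in the top-left and bottom-right windows, where the stroke runs along the same diagonal as the filter, and weakly (1) in the other two corners.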
When I first saw the animation above I was simply dazzled. The calculation is said to be "multiply and add", but exactly how things are multiplied and added cannot be seen at a glance, and I could not find a clear walk-through online. This article goes through it in detail.
First, let's break the animation down, as in the picture below
Next, let's go through the detailed calculation in the figure above. How exactly is the output 1 in the figure computed? It works like wx + b: w corresponds to the filter Filter w0, x corresponds to the different data windows, and b corresponds to Bias b0. In other words, Filter w0 is multiplied element by element with the data window and summed, and then Bias b0 is added to give the output 1, as the following calculation shows:
1*0 + 1*0 + -1*0
-1*0 + 0*0 + 1*1
-1*0 + -1*0 + 0*1
-1*0 + 0*0 + -1*0
0*0 + 0*1 + -1*1
1*0 + -1*0 + 0*2
0*0 + 1*0 + 0*0
1*0 + 0*2 + 1*0
0*0 + -1*0 + 1*0
The nine rows above (three rows for each of the three channels) sum to 0; adding Bias b0, which is 1 here, gives the output 1.
Then, keeping the filter Filter w0 fixed, move the data window 2 steps to the right and continue with the inner product calculation, which gives an output of 0
Finally, switch to a different filter Filter w1 and a different bias Bias b1, convolve again with the data windows on the far left of the figure, and you get another, different output.
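The first output value of the walk-through can be checked mechanically. The three 3*3 slices of Filter w0 and of the data window below are read off the nine product rows listed above (first factor taken as the filter weight, second as the pixel, three rows per channel). The value Bias b0 = 1 is inferred from the stated output of 1 and should be treated as an assumption.

```python
# One 3x3 slice per color channel, read off the nine rows of products.
filter_w0 = [
    [[1, 1, -1], [-1, 0, 1], [-1, -1, 0]],
    [[-1, 0, -1], [0, 0, -1], [1, -1, 0]],
    [[0, 1, 0], [1, 0, 1], [0, -1, 1]],
]
window = [
    [[0, 0, 0], [0, 0, 1], [0, 0, 1]],
    [[0, 0, 0], [0, 1, 1], [0, 0, 2]],
    [[0, 0, 0], [0, 2, 0], [0, 0, 0]],
]
bias_b0 = 1  # assumed: inferred from the stated output, not given explicitly in the text


def conv_at(window, kernel, bias):
    """Sum of element-wise products over all channels, plus the bias (the wx + b of one window)."""
    total = bias
    for w_channel, k_channel in zip(window, kernel):
        for w_row, k_row in zip(w_channel, k_channel):
            for w, k in zip(w_row, k_row):
                total += w * k
    return total


print(conv_at(window, filter_w0, bias_b0))  # 1
```

The three channels contribute 1, -1 and 0 respectively, so the products cancel and the output equals the bias.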
5 CNN activation layer and pooling layer
5.1 ReLU layer
Section 2.2 introduced the sigmoid activation function. In practice, however, during gradient descent sigmoid saturates easily, which stops the gradient from propagating, and its output is not zero-centered. What to do? Try another activation function, ReLU, whose graph is as follows
The advantages of ReLU are fast convergence and a gradient that is easy to compute.
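For reference, ReLU is simply max(0, z). The sketch below also includes its gradient to show why it is "easy to compute"; taking the gradient at exactly z = 0 to be 0 is a convention chosen here.

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp the rest to 0."""
    return z if z > 0 else 0.0


def relu_grad(z):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at z = 0 by convention here)."""
    return 1.0 if z > 0 else 0.0


for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(z, relu(z), relu_grad(z))
```

Unlike sigmoid, the gradient does not shrink toward 0 for large positive inputs, which is the saturation problem mentioned above.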
5.2 Pooling (pool) layer
As said before, pooling, in short, takes the average or the maximum over a region, as shown in the figure below (figure from cs231n)
The figure above shows max pooling. In the left part of the figure, the maximum of the top-left 2x2 block is 6, the maximum of the top-right 2x2 block is 8, the maximum of the bottom-left 2x2 block is 3, and the maximum of the bottom-right 2x2 block is 4, which gives the result on the right side of the figure: 6 8 3 4. Simple, isn't it?
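Pooling over 2x2 blocks can be sketched as follows. The 4x4 grid is an assumed input chosen so that the block maxima are 6, 8, 3 and 4 as in the text; the other entries are illustrative. Passing a different `reduce` function gives average pooling instead.

```python
def pool2d(matrix, size=2, stride=2, reduce=max):
    """Apply `reduce` (max by default) to each size x size block of the matrix."""
    out = []
    for top in range(0, len(matrix) - size + 1, stride):
        row = []
        for left in range(0, len(matrix[0]) - size + 1, stride):
            block = [matrix[top + i][left + j] for i in range(size) for j in range(size)]
            row.append(reduce(block))
        out.append(row)
    return out


def mean(values):
    """Arithmetic mean, used for average pooling."""
    return sum(values) / len(values)


# Assumed input: block maxima match the 6 8 3 4 of the example.
grid = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

print(pool2d(grid))               # [[6, 8], [3, 4]]
print(pool2d(grid, reduce=mean))  # [[3.25, 5.25], [2.0, 2.0]]
```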
This article is basically a set of notes taken while watching Han's CNN video from the May DL class. I had read a lot of CNN material on and off before (including cs231n), but only after watching the video did I understand systematically what a CNN is; as a viewer I found it really satisfying and clear. Then, while writing about CNNs, I found that some prerequisite knowledge needed to be introduced (such as neurons and multi-layer neural networks), along with the other layers of a CNN (such as the activation layer). So this text set out only to introduce the convolution operation, but taking the surrounding knowledge into account it grew longer and longer, and became this article.
In the course of writing this article I consulted Han and Feng from our lecturer team; thanks to both of them. Thanks also for the retweets, and to all my colleagues at July Online.
Here is the modification log:
- July 5, 2016: fixed some typos and errors to make the full text more accessible and more accurate. If you find any problems or have any complaints, you are welcome to point them out.
- July 7, 2016: second round of revision finished. Captured 18 frames from the cs231n convolution animation in order, assembled them into an animated GIF of the convolution with a gif tool, and put it in section 4.3 of the text.
- July 16, 2016: third round of revision finished. This round mainly concerns the description of the sigmoid function: examples and unified notation make the meaning clearer.
- August 15, 2016: fourth round of revision finished, adding details, such as an explanation of the input part of the convolution GIF in section 4.3, namely the meaning of 7*7*3 (where 7*7 is the image's pixel width and height, and 3 stands for the three color channels R, G, B). The aim is always to make things easier to understand.
- August 22, 2016: fifth round of revision finished. This round mainly strengthens the explanation of the filter and introduces the common analogies for filters in CNNs.
July, last revised at noon on August 22, 2016, at the July Online office.
7 References and recommended reading
- Artificial neural network, Wikipedia
- Stanford machine learning open course
- Rain stone, Convolutional neural networks: http://blog.csdn.net/stdcoutzyx/article/details/41596663
- cs231n, Structure of neural networks and activation functions of neurons: http://cs231n.github.io/neural-networks-1/, and its Chinese translation
- cs231n, Convolutional neural networks: http://cs231n.github.io/convolutional-networks/
- Video of Han's 4th lesson of the July Online May DL class, "CNN and common frameworks"; the edited portion is on the July Online website: julyedu.com
- July Online May deep learning class, lesson 5, "CNN training notes", partial video: https://www.julyedu.com/video/play/42/207
- July Online May deep learning class: https://www.julyedu.com/course/getDetail/37
- July Online May deep learning class, course notes No.4, "CNN and common frameworks": http://blog.csdn.net/joycewyj/article/details/51792477
- July Online June data mining class, lesson 7 video: data classification and ranking
- Hands-on introduction to neural networks series (1), Exploring neural networks from the viewpoint of elementary mathematics: http://blog.csdn.net/han_xiaoyang/article/details/50100367
- Deep learning and computer vision series (6), Structure of neural networks and activation functions of neurons: http://blog.csdn.net/han_xiaoyang/article/details/50447834
- Deep learning and computer vision series (10), Convolutional neural networks: http://blog.csdn.net/han_xiaoyang/article/details/50542880
- zxy, Some knowledge points on image convolution and filtering: http://blog.csdn.net/zouxy09/article/details/49080029
- zxy, Deep learning CNN notes: http://blog.csdn.net/zouxy09/article/details/8781543/
- http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/, and its Chinese translation
- "Neural networks and deep learning", Chinese lecture notes: http://vdisk.weibo.com/s/A_pmE4iIPs9D
- The difference between ReLU and sigmoid/tanh: https://www.zhihu.com/question/29021768
- Differences in the internal network structure of CNN, RNN and DNN: https://www.zhihu.com/question/34681168
- Understanding convolution: https://www.zhihu.com/question/22298352
- A brief history of neural networks and deep learning: part 1, the perceptron and the BP algorithm; part 4, the great revival of deep learning
- Online gif maker: http://www.tuyitu.com/photoshop/gif.htm
- An accessible introduction to support vector machines (three levels of understanding SVM)
- How does a CNN work step by step? This blog already writes out the calculation process of the convolution operation clearly, but that article also explains clearly why convolution is needed, and its accompanying pictures are vivid. Very good.