Language models on the road to word vectors: from n-gram to NNLM and RNNLM
2020-12-08 14:47:25 【Thinkgamer_】
This topic will be covered in three parts; the theme of each part is:
 word2vec prelude: statistical language models
 word2vec explained in detail
 other xxx2vec papers and applications
Embedding-related articles will follow later; they may become a separate series, or be folded into the 《Feature Engineering: Embedding》 series. Welcome to keep following 「Search and Recommendation Wiki」.
1.1、Basics
a） Definition
A language model (Language model) is an important technique in natural language processing. The most common form of natural-language data is text, and a natural-language text can be viewed as a discrete time series: suppose a text of length $T$ contains the words $w_{1},w_{2},...,w_{T}$ in order; the language model computes its probability:
$P(w_{1},w_{2},...,w_{T})$
In other words, a language model models the probability distribution over sentences.
Language models fall into two categories: statistical language models and neural network language models.
b）Probabilistic representation
Suppose $S$ denotes a meaningful sentence, e.g. 「今天天气晴朗，适合户外爬山」 ("the weather is sunny today, good for outdoor mountain climbing"). This sentence can be written as $S=w_{1},w_{2},...,w_{n}$; for the example sentence: $w_{1}=今天$ (today), $w_{2}=天气$ (weather), $w_{3}=晴朗$ (sunny), $w_{4}=适合$ (suitable for), $w_{5}=户外$ (outdoor), $w_{6}=爬山$ (mountain climbing).
Let $P(S)$ denote the probability that this sentence occurs. We start with:
$P(S)=P(w_{1},w_{2},...,w_{n})$
By the chain rule of conditional probability, this expands to:
$P(S)=P(w_{1},w_{2},...,w_{n})=P(w_{1})P(w_{2}∣w_{1})P(w_{3}∣w_{1},w_{2})...P(w_{n}∣w_{1},w_{2},...,w_{n−1})$
Here $P(w_{1})$ is the probability of the first word, i.e. the probability of 「today」 appearing in the whole corpus; $P(w_{2}∣w_{1})$ is the probability of the second word given the first, i.e. the probability that 「weather」 appears given 「today」 in the corpus; and so on.
$P(w_{1})$ and $P(w_{2}∣w_{1})$ are easy to estimate, but from $P(w_{3}∣w_{1},w_{2})$ onward more and more variables are involved, and the computation becomes increasingly complex.
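To make the chain-rule factorization concrete, here is a minimal sketch (the three-sentence toy corpus is made up for illustration). Each conditional factor is estimated by counting how often a history prefix is extended by the next word, which is exactly the quantity that explodes for long histories:

```python
from collections import Counter

# Toy corpus (made up for illustration); each sentence is a list of tokens.
corpus = [
    ["today", "weather", "sunny"],
    ["today", "weather", "rainy"],
    ["today", "weather", "sunny"],
]

# Count every prefix w_1..w_i; estimating P(w_i | w_1..w_{i-1}) with full
# histories is what becomes intractable as sentences get longer.
prefix = Counter()
for sent in corpus:
    for i in range(1, len(sent) + 1):
        prefix[tuple(sent[:i])] += 1

def sentence_prob(sent):
    # P(S) = P(w1) * P(w2|w1) * ... * P(wn|w1..w_{n-1})
    prob = prefix[(sent[0],)] / len(corpus)
    for i in range(2, len(sent) + 1):
        prob *= prefix[tuple(sent[:i])] / prefix[tuple(sent[:i - 1])]
    return prob

print(sentence_prob(["today", "weather", "sunny"]))  # 2/3: two of three sentences
```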
1.2、Statistical language model —— the n-gram model
a）Markov assumption
The problem above is too complex to solve directly, so the Markov assumption is introduced. Its key idea is the limited-horizon assumption: each state depends only on the $n−1$ states preceding it; such a chain is called an $n$-th order Markov chain.
b）n-gram
Applied to language modeling, this means the probability of each word depends only on the $n−1$ words before it; such a model is called an $n$-gram language model. When $n=2$ it is called a bigram model, and the expansion above becomes:
$P(S)=P(w_{1},w_{2},...,w_{n})=P(w_{1})P(w_{2}∣w_{1})P(w_{3}∣w_{2})...P(w_{n}∣w_{n−1})$
After simplification with the Markov assumption, computing $P(S)$ becomes much easier. Of course, as $n$ increases the computational cost also increases, but the larger $n$ is, the closer the model gets to the true distribution of the data; $n$ usually takes the value 2, 3, 4, or 5.
c）Probability estimation
From the description above, it is clear that:
 each sentence can be decomposed into a sequence of words;
 each sentence can be assigned a plausibility probability via the conditional probability formula;
 introducing the Markov assumption simplifies the computation of that probability.
Taking the bigram model as an example, how do we compute $P(w_{i}∣w_{i−1})$? From probability theory:
$P(w_{i}∣w_{i−1})=\frac{P(w_{i−1},w_{i})}{P(w_{i−1})}$
When the corpus is large, by the law of large numbers, the number of occurrences of the word pair $<w_{i−1},w_{i}>$ divided by the number of occurrences of $w_{i−1}$ approximates $P(w_{i}∣w_{i−1})$, so:
$P(w_{i}∣w_{i−1})=\frac{P(w_{i−1},w_{i})}{P(w_{i−1})}≈\frac{N(w_{i−1},w_{i})}{N(w_{i−1})}$
So in general, a statistical language model needs a large enough corpus for the estimates to be accurate. But a problem remains: if $N(w_{i−1},w_{i})=N(w_{i−1})=1$, or either count equals 0, the resulting estimate is clearly unreasonable. Smoothing techniques are therefore introduced.
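As a sketch of the count-based estimate above (with a made-up toy corpus), the bigram probability is just a ratio of two counts, and an unseen word pair immediately produces the zero-probability problem that motivates smoothing:

```python
from collections import Counter

# Toy corpus (made up for illustration).
corpus = [
    ["I", "like", "tea"],
    ["I", "like", "coffee"],
    ["you", "like", "tea"],
]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(p for sent in corpus for p in zip(sent, sent[1:]))

# MLE estimate: P(w_i | w_{i-1}) = N(w_{i-1}, w_i) / N(w_{i-1})
def p_mle(prev, word):
    return bigram[(prev, word)] / unigram[prev]

print(p_mle("like", "tea"))    # 2/3: "like tea" occurs twice, "like" three times
print(p_mle("you", "coffee"))  # 0.0: unseen bigram -> unreasonable zero estimate
```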
d）Smoothing techniques in n-gram models
Smoothing addresses the unreasonable count ratios described in c）. Common smoothing techniques include (not expanded here; search for details if interested):
 additive smoothing
 Good-Turing estimation
 Katz smoothing
 Jelinek-Mercer smoothing
 Witten-Bell smoothing
 absolute discounting
 Kneser-Ney smoothing
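Of the techniques listed above, additive (add-one, or Laplace) smoothing is the simplest; here is a minimal sketch with a made-up toy corpus. The other methods differ mainly in how they redistribute probability mass to unseen events:

```python
from collections import Counter

# Toy corpus (made up for illustration).
corpus = [
    ["I", "like", "tea"],
    ["I", "like", "coffee"],
    ["you", "like", "tea"],
]

vocab_size = len({w for sent in corpus for w in sent})  # |V| = 5
unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(p for sent in corpus for p in zip(sent, sent[1:]))

# Add-one smoothing: P(w_i | w_{i-1}) = (N(w_{i-1}, w_i) + 1) / (N(w_{i-1}) + |V|)
def p_add_one(prev, word):
    return (bigram[(prev, word)] + 1) / (unigram[prev] + vocab_size)

print(p_add_one("like", "tea"))    # (2 + 1) / (3 + 5) = 0.375
print(p_add_one("you", "coffee"))  # unseen pair, but now nonzero: 1 / 6
```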
e）Advantages and disadvantages of the n-gram language model
Advantages:
 (1) uses maximum likelihood estimation, so the parameters are easy to train;
 (2) fully uses the information of the preceding n−1 words;
 (3) highly interpretable, intuitive and easy to understand.
Disadvantages:
 (1) lacks long-range dependencies: it can only condition on the preceding n−1 words;
 (2) as n increases, the parameter space grows exponentially;
 (3) data sparsity: the OOV (out-of-vocabulary) problem is unavoidable;
 (4) purely frequency-based statistics generalize poorly.
1.3、 Neural network language model ——NNLM
NNLM The paper ：《A Neural Probabilistic Language Model》
Download link ：http://www.ai.mit.edu/projects/jmlr/papers/volume3/bengio03a/bengio03a.pdf
The NNLM model has a three-layer network structure, as shown below:
The bottom layer holds the $n−1$ words preceding the output word; the goal of NNLM is to use these $n−1$ words to compute the probability of the $t$-th word $w_{t}$.
The training objective of $NNLM$ is:
$f(w_{t},...,w_{t−n+1})=P(w_{t}∣w_{1}^{t−1})$
where $w_{t}$ denotes the $t$-th word and $w_{1}^{t−1}$ denotes the subsequence from word 1 to word $t−1$. The model must satisfy two constraints:
 $f(w_{t},...,w_{t−n+1})>0$
 $\sum_{i=1}^{∣V∣}f(i,w_{t−1},...,w_{t−n+1})=1$
The model means: given a sequence, predict the probability distribution of the $t$-th word from the words before it.
 Constraint one: every probability value produced by the network must be greater than 0.
 Constraint two: the network takes the preceding words as input and predicts what the $t$-th word is, so its actual output is a vector whose components are the probabilities that the next word is each word in the dictionary. Among these $∣V∣$ probability values there is a maximum and the others are smaller, and together they must sum to 1.
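Both constraints are exactly what a softmax output layer guarantees; a minimal check with made-up unnormalized scores:

```python
import math

# Made-up unnormalized scores y_i for a 4-word vocabulary.
scores = [2.0, -1.0, 0.5, 0.0]

# Softmax turns arbitrary real scores into a probability distribution.
exps = [math.exp(s) for s in scores]
probs = [e / sum(exps) for e in exps]

print(all(p > 0 for p in probs))      # True  (constraint one)
print(abs(sum(probs) - 1.0) < 1e-12)  # True  (constraint two)
```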
The $NNLM$ model can be divided into two parts:
 1、Feature mapping: map the $i$-th word of the vocabulary $V$ to an $m$-dimensional vector $C(i)$ (the input word can be given as a one-hot index; the word vectors themselves are only initial values, because they are trained along with the model). The mapped vectors $C(w_{t−n+1}),...,C(w_{t−1})$ are then concatenated into an $m(n−1)$-dimensional vector. In other words, this is a mapping from the vocabulary $V$ into a real vector space $C$; through this mapping we obtain the vector representation of each word.
 2、Computing the conditional probability distribution: a function $g$ maps the input word-vector sequence $(C(w_{t−n+1}),...,C(w_{t−1}))$ to a probability distribution $y∈R^{∣V∣}$. The output is therefore $∣V∣$-dimensional, the same as the dictionary size; the $i$-th component of $y$ is the probability that the $n$-th word of the sequence is $V_{i}$, i.e.: $f(i,w_{t−1},...,w_{t−n+1})=g(i,C(w_{t−1}),...,C(w_{t−n+1}))$
The output of the $NNLM$ model is a $softmax$ function of the form:
$P(w_{t}∣w_{t−1},...,w_{t−n+1})=\frac{e^{y_{w_{t}}}}{\sum_{i}e^{y_{i}}}$
where $y_{i}$ denotes the unnormalized score of the $i$-th word, computed as:
$y=b+Wx+U\tanh(d+Hx)$
The model parameters are ：$θ=(b,d,W,U,H,C)$
 $x$ is the input vector of the network: $x=(C(w_{t−1}),...,C(w_{t−n+1}))$
 $h$ is the number of neurons in the hidden layer
 $d$ is the bias of the hidden layer
 $m$ is the vector dimension of each word
 $W$ is an optional parameter; if there is no direct connection between the input layer and the output layer, set $W=0$
 $H$ is the weight matrix from the input layer to the hidden layer
 $U$ is the weight matrix from the hidden layer to the output layer
An ordinary neural network does not train its input, but here the input $x$ consists of word vectors, which are themselves parameters to be trained. So the network weights and the word vectors are trained simultaneously; when training finishes we obtain both at once.
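The whole forward pass (feature mapping, hidden layer, softmax output) can be sketched in a few lines of NumPy. All sizes and the random initialization below are made up for illustration; this is a sketch, not the paper's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary |V|, word-vector dim m, context n-1, hidden h.
V, m, ctx, h = 10, 4, 2, 8

C = rng.normal(size=(V, m))        # word-vector lookup table (trained with the model)
H = rng.normal(size=(h, m * ctx))  # input -> hidden weights
d = np.zeros(h)                    # hidden bias
U = rng.normal(size=(V, h))        # hidden -> output weights
W = np.zeros((V, m * ctx))         # optional direct connection; W = 0 disables it
b = np.zeros(V)                    # output bias

def nnlm_forward(context_ids):
    # 1. Feature mapping: look up and concatenate the n-1 context word vectors.
    x = np.concatenate([C[i] for i in context_ids])
    # 2. y = b + Wx + U tanh(d + Hx), then softmax over the vocabulary.
    y = b + W @ x + U @ np.tanh(d + H @ x)
    e = np.exp(y - y.max())
    return e / e.sum()

probs = nnlm_forward([3, 7])  # P(w_t | w_{t-2}, w_{t-1}): a distribution over V words
print(probs.shape, round(float(probs.sum()), 6))
```

Because the lookup table `C` is a parameter, backpropagating the loss through this forward pass updates the word vectors together with the weights, which is the point made above.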
The training goal of $NNLM$ is to maximize the log-likelihood, i.e.:
$L=\frac{1}{T}\sum_{t}\log f(w_{t},w_{t−1},...,w_{t−n+1};θ)+R(θ)$
where $θ$ denotes all model parameters and $R(θ)$ is a regularization term (in the paper's experiments, $R$ is a weight-decay penalty applied only to the network weights and the word-vector matrix).
The parameters are then updated by stochastic gradient ascent on the log-likelihood:
$θ←θ+ϵ\frac{∂\log P(w_{t}∣w_{t−1},...,w_{t−n+1})}{∂θ}$
where $ϵ$ is the learning rate (step size).
For PyTorch and TensorFlow implementations, see: https://www.jianshu.com/p/be242ed3f314
1.4、 Neural network language model ——RNNLM
RNNLM The paper ：《Recurrent neural network based language model》
Download link ：https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf
The idea of the RNNLM model is relatively simple: its main improvement is replacing the feed-forward network of NNLM with a recurrent one. Its structure is shown below:
At first glance the reader may not know what this diagram describes. Don't worry; let's first cover some simple RNN background.
The structure of a simple RNN is shown in the following figure:
It contains an input layer, a hidden layer and an output layer: $X$ is the input vector; $U$ is the weight matrix from the input layer to the hidden layer; $S$ is the hidden-layer vector; $V$ is the weight matrix from the hidden layer to the output layer; $O$ is the output vector; and $W$ is the weight matrix applied to the previous hidden-layer value.
Unrolling the figure above gives:
Now it looks clearer: at time $t$, after receiving the input $x_{t}$, the hidden-layer value is $s_{t}$ and the output value is $o_{t}$. The key point is that $s_{t}$ depends not only on $x_{t}$ but also on $s_{t−1}$.
Looking back at the RNNLM structure diagram: $INPUT(t)$ is the input $x_{t}$ at time $t$; $CONTEXT(t)$ is the hidden layer at time $t$ ($s_{t}$); $CONTEXT(t−1)$ is the hidden-layer value at time $t−1$ ($s_{t−1}$); $OUTPUT(t)$ is the output at time $t$ ($o_{t}$).
The computations are:
$x(t)=w(t)+s(t−1)$
$s_{j}(t)=f\left(\sum_{i}x_{i}(t)u_{ji}\right)$
$y_{k}(t)=g\left(\sum_{j}s_{j}(t)v_{kj}\right)$
(here $w(t)$ is the one-hot vector of the current word, and the $+$ in the first line denotes concatenation with the previous hidden state)
where $f(z)$ is the $sigmoid$ function:
$f(z)=\frac{1}{1+e^{−z}}$
and $g(z)$ is the $softmax$ function:
$g(z_{m})=\frac{e^{z_{m}}}{\sum_{k}e^{z_{k}}}$
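One step of this recurrence can be sketched as follows (the vocabulary size, hidden size, and random weights are made up for illustration; the concatenation reading of $x(t)=w(t)+s(t−1)$ is assumed as described above):

```python
import numpy as np

rng = np.random.default_rng(1)
V, h = 6, 5  # hypothetical vocabulary size and number of hidden units

U = rng.normal(scale=0.1, size=(h, V + h))  # weights applied to x(t) = [w(t); s(t-1)]
Vw = rng.normal(scale=0.1, size=(V, h))     # hidden -> output weights

def step(word_id, s_prev):
    w = np.zeros(V)
    w[word_id] = 1.0                 # one-hot input w(t)
    x = np.concatenate([w, s_prev])  # x(t): current word plus previous hidden state
    s = 1.0 / (1.0 + np.exp(-(U @ x)))     # s_j(t) = f(sum_i x_i(t) u_ji), sigmoid
    z = Vw @ s
    y = np.exp(z - z.max())
    y /= y.sum()                           # y_k(t) = g(sum_j s_j(t) v_kj), softmax
    return s, y

s = np.full(h, 0.1)                  # s(0) initialized to small values, e.g. 0.1
for wid in [2, 4, 1]:                # feed a short word sequence through the network
    s, y = step(wid, s)
print(y.shape, round(float(y.sum()), 6))
```

Because `s` is carried from one call to the next, the prediction at each step depends on the whole history, not just the previous $n−1$ words as in the n-gram model.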
Some details worth noting:
 $s(0)$ is usually initialized to a vector of small values, such as 0.1
 the number of hidden units is generally 30–500, because the input vocabulary is very large
 the weights are initialized with small Gaussian noise
 the optimization algorithm is stochastic gradient descent
 the initial learning rate is 0.1; later, as iteration proceeds, if there is no significant improvement the learning rate is halved
 convergence usually begins after 10–20 epochs
After each epoch, the error vector is computed according to the cross-entropy criterion:
$error(t)=desired(t)−y(t)$
where $desired(t)$ is the target vector (a one-hot encoding of the true next word) and $y(t)$ is the network's actual output.
Finally, the probability of the next word is computed as:
$P(w_{i}(t+1)∣w(t),s(t−1))=\begin{cases}\frac{y_{rare}(t)}{C_{rare}} & \text{if } w_{i}(t+1)\text{ is rare}\\ y_{i}(t) & \text{otherwise}\end{cases}$
where $C_{rare}$ is the number of words in the vocabulary whose frequency falls below the threshold. In the paper's experiment on the $Brown$ $corpus$ dataset (about 800,000 words), the threshold is set to 5 and the number of hidden units to 100.
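The rare-word rule can be sketched as follows (the vocabulary, the rare set, and the output values are all made up; the idea is that all below-threshold words share one pooled output probability, divided uniformly among them):

```python
# Hypothetical 6-word vocabulary; words 4 and 5 fall below the frequency threshold.
rare = {4, 5}
C_rare = len(rare)  # number of words below the threshold

y = [0.30, 0.25, 0.20, 0.05, 0.12, 0.08]  # made-up softmax output y(t)

def word_prob(i):
    if i in rare:
        # Rare words share the pooled mass y_rare(t) uniformly: y_rare(t) / C_rare
        y_rare = sum(y[j] for j in rare)
        return y_rare / C_rare
    return y[i]  # frequent word: use its own output y_i(t)

print(word_prob(0))  # 0.3
print(word_prob(4))  # pooled rare mass divided by C_rare
```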
For an RNNLM code implementation, see: https://www.jianshu.com/p/f53f606944c6
Copyright notice
This article was written by [Thinkgamer_]; please include the original link when reposting. Thanks!
https://chowdera.com/2020/12/20201208144703280d.html