Word2vec: two models, two optimization and practical use

2020-12-08 14:47:23

This topic article will be divided into three parts , The theme of each part is ：

• word2vec Prelude to - Statistical language model
• word2vec Detailed explanation - The style does not diminish
• other xxx2vec Introduction to papers and Applications

It will be updated later Embedding Related articles , It could be a series of separate , It may also be put in 《 Feature Engineering -Embedding In the series 》, Welcome to continue to pay attention 「 Search and recommend Wiki

2.1、 Background introduction

word2vec yes Google 2013 A tool for calculating word vectors , In the paper Efficient Estimation of Word Representations in Vector Space in , The author puts forward Word2vec Calculation tools , And by contrast NNLM、RNNLM The language model validates word2vec The effectiveness of the .

word2vec There are two models in the tool ：CBOW and skip-gram. The introduction in the paper is relatively simple , As shown in the figure below ,CBOW It is the prediction of the head word by the word in the context ,Skip-gram The word in the context is predicted by inputting words . 2.2、CBOW and Skip-gram

The introduction of these two models is rough in the original paper , In the paper 《word2vec Parameter Learning Explained》 It is explained and explained in detail in , Let's take a closer look at CBOW and Skip-gram.

a）CBOW

One-word context

First, let's look at the situation where there is only one context word The total number of words is V V , The number of neurons in the hidden layer is N N , The weight matrix from input layer to hidden layer is W V ∗ N W_{V*N} , The weight matrix from hidden layer to output layer is W N ∗ V ′ W'_{N*V} .

The input layer is word One-hot code . From input layer to hidden layer ：
h = W T x : = v w I T h = W^T x := v^T_{w_I}
v w I v_{w_I} It means to input words w I w_I Vector representation of （ Attention and input layer x x Distinguish , The input vector is W W The vector in represents , The output vector is W ′ W' The vector in represents ）, Its dimension is [ N , 1 ] [N, 1] , After transposition, the dimension becomes [ 1 , N ] [1,N] , An input representation used to represent a vector , it is to be noted that It's not here [ 1 , N ] [1,N] , Otherwise, it is easy to confuse the dimensions when multiplying in the next formula , Symbol : = := It means Defined as .

From hidden layer to input layer ：
u j = ( v w j ′ ) T h u_j = (v'_{w_j})^T h
among v w j ′ v'_{w_j} It's a matrix W ′ W' Of the j j Column , Its dimension is [ N , 1 ] [N,1] , Calculated u j u_j For a specific value , It means No j j The input words are in the output layer j j Values corresponding to positions .

Finally using s o f t m a x softmax Normalize it （ Because the output layer is fixed V V Word , So it can be seen as classification , Therefore use s o f t m a x softmax Normalize ）, Get the input word w I w_I The probability of each word in its thesaurus ：
p ( w j ∣ w I ) = y j = e x p ( u j ) ∑ j ′ = 1 V e x p ( u j ′ ) p(w_j | w_I) = y_j = \frac{exp(u_j)}{ \sum_{j'=1}^{V}exp(u_j') }
among y j y_j Represents the output layer of the j j The number of neurons .

Combining the above three formulas, we can get ：
p ( w j ∣ w I ) = e x p ( ( v w j ′ ) T ∗ v w I T ) ∑ j ′ = 1 V e x p ( ( v w j ′ ′ ) T ∗ v w I T ) p(w_j | w_I) = \frac{ exp((v'_{w_j})^T * v^T_{w_I}) }{\sum_{j'=1}^{V}exp((v'_{w_{j'}})^T * v^T_{w_I})}

among v w v_w It can be understood as the input vector representation of a word , v w ′ v_w' For the output vector of a word .

The loss function in this case is ：
m a x    p ( w O ∣ w I ) = m a x   y j ∗ = m a x   l o g   y j ∗ = u j ∗ − l o g ∑ j ′ = 1 V e x p ( u j ′ ) : = − E max \, \, p(w_O|w_I) = max \, y_{j*} \\ = max \, log \, y_{j*} \\ = u_{j*} - log \sum_{j'=1}^{V} exp(u_{j'}) \\ := -E
The above formula can be transformed into ：
E = − u j ∗ + l o g ∑ j ′ = 1 V e x p ( u j ′ ) E = - u_{j*} + log \sum_{j'=1}^{V} exp(u_{j'})
among j ∗ j* Express The subscript corresponding to the output value of the actual output layer . because u j ∗ u^*_j Is constant , therefore m a x   l o g   y j ∗ max \,log \, y_{j*} when , Just ask for the denominator l o g log that will do

multi-word context

When there are multiple context words, the corresponding picture is ： At this time h h Expression for ：
h = 1 C W T ( x 1 + x 2 + . . . . + x C ) = 1 C ( v w 1 + v w 2 + . . . + v w C ) T h = \frac{1}{C} W^T (x_1 + x_2 + .... + x_C) \\ = \frac{1}{C} (v_{w_1} + v_{w_2}+ ... + v_{w_C})^T
among C C The number of context words , w 1 , w 2 , . . . , w C w_1, w_2, ..., w_C It means the context word , v w v_w An input vector that represents a word （ Attention and input layer x x difference ）.

The objective function is ：
E = − l o g   p ( w O ∣ w I 1 , w I 2 , . . . , w I C ) = − u j ∗ l o g ∑ j ′ = 1 V e x p ( u j ′ ) = − ( v w O ′ ) T ∗ h + l o g ∑ j ′ = 1 V e x p ( ( v w j ′ ) T ∗ h ) E = -log \, p(w_O | w_{I_1}, w_{I_2}, ..., w_{I_C}) \\ = - u_j * log \sum_{j'=1}^{V} exp(u_j') \\ = -(v'_{w_O})^T * h + log \sum_{j'=1}^{V} exp((v'_{w_j})^T * h)

b）Skip-gram From input layer to hidden layer ：
h = W k , . T : = v w I T h =W^T_{k,.} := v^T_{w_I}
From hidden layer to output layer ：
p ( w c , j = w O , c ∣ w I ) = y c , j = e x p ( u c , j ) ∑ j ′ = 1 V e x p ( u j ′ ) p(w_{c,j}= w_{O,c} | w_I) = y_{c, j} = \frac{exp(u_{c,j})} {\sum_{j'=1}^{V}exp(u_{j'})}
among ：

• w I w_I It means the input word
• w c , j w_{c,j} Represents the output layer c c The word actually fell in the first place j j Neurons
• w O , c w_{O,c} Represents the output layer c c The first word should fall on O O Neurons
• y c , j y_{c,j} Represents the output layer c c The word actually fell in the first place j j Normalized probabilities on neurons
• u c , j u_{c,j} Represents the output layer c c The word actually fell in the first place j j Normalized values on neurons

Because the output layer shares weights , therefore ：
u c , j = u j = ( v w j ′ ) T ∗ h , f o r   c = 1 , 2 , . . . , C u_{c,j} = u_j = (v'_{w_j})^T * h, for \, c=1,2,..., C
among v w j ′ v'_{w_j} It means the first one j j A vector of words , Its value is the output weight matrix W ′ W' Of the j j Column .

The loss function becomes ：
E = − l o g   p ( w O , 1 , w O , 2 , . . , w O , C ∣ w I ) = − l o g ∏ c = 1 C e x p ( u c , j c ∗ ) ∑ j ′ = 1 V e x p ( u j ′ ) = − ∑ c = 1 C u j c ∗ + C ∗ l o g ∑ j ′ = 1 V e x p ( u j ′ ) E = -log \, p(w_{O,1}, w_{O,2}, .., w_{O,C}|w_I) \\ = -log \prod_{c=1}^{C} \frac{exp(u_{c,j^*_c})}{ \sum_{j'=1}^{V} exp(u_{j'})} \\ = -\sum_{c=1}^{C} u_{j^*_c} + C * log \sum_{j'=1}^{V} exp(u_{j'})

Be careful ️

• Empirically, we usually choose to use skip-gram Model , Because the effect is better
• stay Word2vec In the model , If you choose to use CBOW when , The final product of word embedding by The output vector of the word （ W N ∗ V ′ W'_{N*V} ） Express , If you choose to use skip-gram when , The final product of word embedding Input vector for words （ W N ∗ V W_{N*V} ） Express , Because they prefer to choose the weight matrix close to the head word .

2.3、hierarchical softmax and negative sampling

Because it is based on word2vec Frame training requires a very large corpus , Only in this way can we ensure the accuracy of the results , But with the increase of the expected library , Then comes the computing time and resource consumption . Is there any room for optimization ？ For example, you can sacrifice some accuracy to speed up your training , The answer is hierarchical softmax and negative sampling.

In the paper 《Distributed Representations of Words and Phrases and their Compositionality》 Training is introduced in word2vec Two techniques of （ Also in the paper 《word2vec Parameter Learning Explained》 It is explained and explained in detail in ）, Let's take a look at it in detail .

a） Huffman tree and Huffman coding

At the level of understanding softmax（hierarchical softmax） Before , Let's first understand what Huffman trees and Huffman codes are .

Huffman tree is essentially an optimal binary tree , It refers to the binary tree with the shortest path length constructed by a group of leaf nodes with certain weights .

So for a set of weight values , How to construct a Hoffman tree ？ according to Nodes with large weights should be as close to the root as possible This principle , An algorithm with general law is given , be called Huffman algorithm , Its description is as follows ：

• 1、 According to the given n n Weight w 1 , w 2 , . . . , w n {w_1,w_2,...,w_n} constitute n n A collection of binary trees F = T 1 , T 2 , . . . , T n F={T_1,T_2,...,T_n} ; among , Every binary tree T i ( 1 < = i < = n ) T_i(1<=i<=n) There's only one with weights w i w_i Root node of , To the left 、 The right subtree is empty
• 2、 stay F F Two binary trees with the minimum weight of the root node are selected as the left 、 Right subtree to construct a new binary tree , And the weight of the new root node of the binary tree is the sum of the weights of the root nodes of its left and right subtrees
• 3、 stay F F Delete these two trees , At the same time, add a new binary tree to F F in
• 4、 repeat 2、3, until F F Only a binary tree is left to join F F in

For example, a group of data has a weight of ：[9,11,13,8,4,5], It's a Hoffman tree （ The picture comes from Baidu experience ）： Be careful ️：

• When constructing Huffman trees , There is no left or right leaf node , Just agree on a rule , Follow this rule from beginning to end . Traditionally, the left node is smaller than the right node .

So what is Huffman coding ？ Huffman coding is a coding method based on Huffman tree , It's a kind of variable length coding .

For a well constructed Huffman tree 0/1 code , The left sub tree is 0, The right subtree is 1, For the Huffman tree constructed in the above figure , The Huffman codes of each leaf node are as follows ：

• 9 -> 00
• 11 -> 01
• 13 -> 10
• 8 -> 110
• 4 -> 1110
• 5 -> 1111

Be careful ️：

• Similarly, the coding of Huffman tree does not specify that the left subtree is 1 Or the left subtree is 0
• stay word2vec in , The construction and coding of Huffman tree is contrary to the above statement , That is to say, the left subtree code is 1, The right subtree is coded as 0（ What the paper says is -1, Have the same meaning ）, At the same time, it is agreed that the weight of the left subtree is not less than that of the right subtree

b）hierarchical softmax The picture above shows a Huffman coding tree , The white node represents all the words in the thesaurus , Black nodes represent hidden nodes inside , word w 2 w_2 The corresponding path code is shown by the black line connection in the figure , The path length is 4, n ( w , j ) n(w,j) It's about words w w , The first one on its path j j Nodes .

Optimized based on Huffman tree word2vec, Removed weight matrix from hidden layer to output layer （ That's the output vector ）, Instead of using the hidden node coding in the Huffman tree （ Like the black node in the picture above ）, Then the output node is the input word w w The probability can be expressed as ：
p ( w = w O ) = ∏ j = 1 L ( w ) − 1 σ ( [ n ( w , j + 1 ) = c h ( n ( w , j ) ) ] ⋅ ( v n ( w , j ) ′ ) T ∗ h ) p(w=w_O) = \prod_{j=1}^{L(w)-1} \sigma ( \left[ n(w,j+1)=ch(n(w,j)) \right] \cdot(v'_{n(w,j)})^T * h )

among ：

• c h ( n ) ch(n) Represents the hidden left node in the path
• v n ( w , j ) ′ v'_{n(w,j)} Express Vector representation of hidden nodes （ The auxiliary vector in the whole optimization process of the algorithm ）
• n ( w , j ) n(w,j) Means the word w w On the path j j Nodes
• h h Represents the output of the hidden layer （skip-gram In the model, it is equal to v w I v_{w_I} ,cbow In the model, it is equal to 1 C ∑ c = 1 C v w c \frac{1}{C} \sum_{c=1}^{C}v_{w_c} , The input word vector is averaged ）
• L ( w ) L(w) Denotes that the leaf node is w w The length of the shortest path , reduce 1 Represents the hidden node before reaching the node
• [ x ] \left[ x \right ] Is defined as follows （ That is, if you go to the left subtree, the path is 1, The path of the right subtree is -1）
[ x ] = { 1 i f   x   i s   t r u e − 1 o t h e r w i s e \left [ x \right ] = \left\{\begin{matrix} 1 & if \, x \, is \, true \\ -1 & otherwise \end{matrix}\right.

In the diagram above , We define the probability of left node as ：
p ( n , l e f t ) = σ ( ( v n ′ ) T ⋅ h ) p(n,left) = \sigma((v'_n)^T \cdot h)
among σ \sigma It means s i g m o i d sigmoid function

The probability of the right node is ：
p ( n , r i g h t ) = 1 − σ ( ( v n ′ ) T ⋅ h ) = σ ( − ( v n ′ ) T ⋅ h ) p(n, right) = 1 - \sigma((v'_n)^T \cdot h) = \sigma(- (v'_n)^T \cdot h)

So for the words in the picture w 2 w_2 , The probability is ：
p ( w 2 = w O ) = p ( n ( w 2 , 1 ) , l e f t ) ⋅ p ( n ( w 2 , 2 ) , l e f t ) ⋅ p ( n ( w 2 , 3 ) , r i g h t ) = σ ( ( v n ( w 2 , 1 ) ′ ) T ⋅ h ) ⋅ σ ( ( v n ( w 2 , 2 ) ′ ) T ⋅ h ) ⋅ σ ( − ( v n ( w 2 , 3 ) ′ ) T ⋅ h ) p(w_2 = w_O) \\ = p(n(w_2,1),left) \cdot p(n(w_2,2), left) \cdot p(n(w_2,3),right) \\ = \sigma((v'_{n(w_2,1)})^T \cdot h) \cdot \sigma((v'_{n(w_2,2)})^T \cdot h) \cdot \sigma(- (v'_{n(w_2,3)})^T \cdot h)

For all the words there are ：
∑ i = 1 V p ( w i = w O ) = 1 \sum_{i=1}^{V} p(w_i = w_O) = 1

In this case, the loss function is ：
E = − l o g   p ( w = w O ∣ w I ) = − ∑ j = 1 L ( w ) − 1 l o g   σ ( [ n ( w , j + 1 ) = c h ( n ( w , j ) ) ] ⋅ ( v n ( w , j ) ′ ) T ∗ h ) E = -log \, p(w=w_O|w_I) = - \sum_{j=1}^{L(w)-1} log \, \sigma ( \left[ n(w,j+1)=ch(n(w,j)) \right] \cdot(v'_{n(w,j)})^T * h )

c）negative sampling

except hierarchical softmax, Another optimization method is Noise Contrasive Estimation（NCE）, In the paper 《Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics》 There are detailed explanations and explanations in , But because NCE It's a bit complicated , So here's a simplified version of , be called ：Negative Sampling.

Because the calculation of the total number of negative samples is relatively large , So negative sampling is done , After negative sampling, the corresponding loss function is ：
E = − l o g σ ( ( v w O ′ ) T h ) − ∑ w j ∈ W n e g l o g   σ ( − ( v w i ′ ) T h ) = − l o g σ ( ( v w O ′ ) T h ) + ∑ w j ∈ W n e g l o g   σ ( ( v w i ′ ) T h ) E = -log \sigma ((v'_{w_O})^T h) - \sum_{w_j \in W_{neg}} log \, \sigma(-(v'_{w_i})^T h) \\ = -log \sigma ((v'_{w_O})^T h) + \sum_{w_j \in W_{neg}} log \, \sigma((v'_{w_i})^T h)
among ：

• w O w_O The word for output
• v w O ′ v'_{w_O} Express w O w_O The output word vector of
• h h Represents the output of the hidden layer , When the model is CBOW When is 1 C ∑ c = 1 C v w c \frac{1}{C} \sum_{c=1}^{C}v_{w_c} , If the time is skip-gram When the model is ： v w I v_{w_I}
• W n e g W_{neg} It represents the number of negative samples

Be careful ️：

• Based on hierarchy softmax perhaps negative sampling Optimization of the cbow perhaps skip-gram Model , The output word vector should be the word vector between the input layer and the hidden layer （ It should be said that , It is because there is no specific explanation in the paper , And I don't see it in the public data , Maybe I didn't take it seriously ）
• guess ： Whether the leaf node can be represented according to the average vector of the shortest path node , The word vector ？

2.4、Gensim in Word2vec Use

Before using, we need to introduce the corresponding model class
f r o m g e n s i m . m o d e l s . w o r d 2 v e c i m p o r t W o r d 2 V e c from gensim.models.word2vec import Word2Vec
Create a model
m o d e l = W o r d 2 V e c ( s e n t e n c e s = t o p i c s l i s t , i t e r = 5 , s i z e = 128 , w i n d o w = 5 , m i n c o u n t = 0 , w o r k e r s = 10 , s g = 1 , h s = 1 , n e g a t i v e = 1 , s e e d = 128 , c o m p u t e l o s s = T r u e ) model = Word2Vec(sentences=topics_list, iter=5, size=128, window=5, min_count=0, workers=10, sg=1, hs=1, negative=1, seed=128, compute_loss=True)
The corresponding model parameters have many , The main ones are ：

• sentences： The corpus of the training model , It's an iterative sequence
• corpus_file： To load data from a file , and sentences Mutually exclusive
• size：word Dimensions , The default is 100, Usually take 64、128、256 etc.
• window： Size of sliding window , The default value is 5
• min_count：word Times less than this value are ignored , The default value is 5
• seed： For random number generators
• workers： How many threads are used for model training , The default is 3
• min_alpha=0.0001
• sg：1 Express Skip-gram 0 Express CBOW, The default is 0
• hs：1 Express hierarchical softmax 0 And negative Parameters for 0 Words negative sampling Will be enabled , The default is 0
• negative：0 It means not to use ,1 Said the , The recommended value is in 5-20 The number of noise words , The default is 5

For more parameters, refer to the notes in the model

Save the model
m o d e l . s a v e ( m o d e l p a t h ) model.save(model_path)

m o d e l = W o r d 2 V e c . l o a d ( m o d e l p a t h ) model = Word2Vec.load(model_path)

Output loss value
m o d e l . g e t l a t e s t t r a i n i n g l o s s ( ) model.get_latest_training_loss()

Calculate similarity
m o d e l . w v . s i m i l a r i t y ( w o r d 1 , w o r d 2 ) model.wv.similarity(word1, word2)

【 technical service 】, Click View for details ： https://mp.weixin.qq.com/s/PtX9ukKRBmazAWARprGIAg scan Pay attention to WeChat public number ！ Master Focus on search and recommendation systems , Try to use algorithms to better serve users , Including but not limited to machine learning , Deep learning , Reinforcement learning , natural language understanding , Knowledge map , Not yet regularly sharing technology , Information , Thinking and other articles ！

https://chowdera.com/2020/12/20201208144703278v.html