# Word2vec: two models, two optimizations, and practical use

2020-12-08 14:47:23

This article is divided into three parts, one theme per part:

• word2vec prelude: statistical language models
• word2vec in detail
• other xxx2vec papers and their applications

Embedding-related articles will follow later, either as a standalone series or as part of the "Feature Engineering: Embedding" series.

## 2.1 Background

word2vec is a tool for computing word vectors released by Google in 2013. In the paper Efficient Estimation of Word Representations in Vector Space, the authors propose the word2vec models and validate their effectiveness by comparing them against the NNLM and RNNLM language models.

The word2vec tool contains two models: CBOW and Skip-gram. The paper describes them only briefly; as shown in the figure below, CBOW predicts the center word from its context words, while Skip-gram predicts the context words from the input (center) word.

## 2.2 CBOW and Skip-gram

The original paper describes these two models only roughly; the paper word2vec Parameter Learning Explained works through them in detail. Let's take a closer look at CBOW and Skip-gram.

### a) CBOW

One-word context

First, consider the case where there is only one context word.

Let the vocabulary size be $V$, the number of hidden-layer neurons be $N$, the input-to-hidden weight matrix be $W_{V \times N}$, and the hidden-to-output weight matrix be $W'_{N \times V}$.

The input layer is the one-hot encoding $x$ of the word. From the input layer to the hidden layer:

$$h = W^T x := v^T_{w_I}$$

where $v_{w_I}$ denotes the vector representation of the input word $w_I$ (note the distinction from the input layer $x$: a word's input vector is its row in $W$, and its output vector is its column in $W'$). As a row of $W$, $v_{w_I}$ has dimension $[1, N]$, so its transpose $h = v^T_{w_I}$ has dimension $[N, 1]$; because $x$ is one-hot, the product $W^T x$ simply copies out that row. The symbol $:=$ means "defined as".

From the hidden layer to the output layer:

$$u_j = (v'_{w_j})^T h$$

where $v'_{w_j}$ is the $j$-th column of the matrix $W'$, with dimension $[N, 1]$. The computed $u_j$ is a scalar: the score assigned by the input word to position $j$ of the output layer.

Finally, softmax normalizes the scores (the output layer ranges over a fixed vocabulary of $V$ words, so this can be viewed as a $V$-way classification). This yields, for the input word $w_I$, a probability for every word in the vocabulary:

$$p(w_j \mid w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$$

where $y_j$ is the output of the $j$-th output-layer neuron.

Combining the three formulas above:

$$p(w_j \mid w_I) = \frac{\exp\left((v'_{w_j})^T v^T_{w_I}\right)}{\sum_{j'=1}^{V} \exp\left((v'_{w_{j'}})^T v^T_{w_I}\right)}$$

where $v_w$ is the input-vector representation of a word and $v'_w$ is its output vector.

The training objective in this case is:

$$\max \; p(w_O \mid w_I) = \max \; y_{j^*} = \max \; \log y_{j^*} = u_{j^*} - \log \sum_{j'=1}^{V} \exp(u_{j'}) := -E$$

which rearranges into the loss:

$$E = -u_{j^*} + \log \sum_{j'=1}^{V} \exp(u_{j'})$$

where $j^*$ is the index of the actual output word in the output layer, so minimizing $E$ maximizes the log-probability of the true output word.
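The one-word forward pass and loss above can be sketched in a few lines of NumPy. The sizes, indices, and random weights below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                       # toy vocabulary size and hidden size
W = rng.normal(size=(V, N))        # input->hidden weights; row i = input vector of word i
W_prime = rng.normal(size=(N, V))  # hidden->output weights; column j = output vector of word j

w_I, w_star = 3, 7                 # input word index, true output word index (arbitrary)

x = np.zeros(V)
x[w_I] = 1.0                       # one-hot input
h = W.T @ x                        # hidden layer: simply copies row w_I of W
assert np.allclose(h, W[w_I])

u = W_prime.T @ h                  # u_j = (v'_{w_j})^T h, one score per word
y = np.exp(u) / np.exp(u).sum()    # softmax over the vocabulary

# E = -u_{j*} + log sum_{j'} exp(u_{j'}) equals the cross-entropy -log y_{j*}
E = -u[w_star] + np.log(np.exp(u).sum())
assert np.isclose(E, -np.log(y[w_star]))
```

The first assertion confirms the remark above: with a one-hot input, the matrix product degenerates into a row lookup.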

Multi-word context

With multiple context words, the corresponding architecture is shown in the figure below:

The hidden layer $h$ becomes:

$$h = \frac{1}{C} W^T (x_1 + x_2 + \cdots + x_C) = \frac{1}{C} (v_{w_1} + v_{w_2} + \cdots + v_{w_C})^T$$

where $C$ is the number of context words, $w_1, w_2, \ldots, w_C$ are the context words, and $v_w$ is a word's input vector (again distinct from the input layer $x$).

The objective function is:

$$E = -\log p(w_O \mid w_{I_1}, w_{I_2}, \ldots, w_{I_C}) = -u_{j^*} + \log \sum_{j'=1}^{V} \exp(u_{j'}) = -(v'_{w_O})^T h + \log \sum_{j'=1}^{V} \exp\left((v'_{w_{j'}})^T h\right)$$
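The context-averaging step can be checked numerically; a minimal sketch with toy sizes and arbitrary context indices:

```python
import numpy as np

rng = np.random.default_rng(1)
V, N, C = 10, 4, 3
W = rng.normal(size=(V, N))        # input->hidden weights
context = [1, 4, 6]                # indices of the C context words (arbitrary)

# sum the C one-hot vectors, push through W, divide by C ...
X = np.zeros(V)
for c in context:
    X[c] += 1.0
h = (W.T @ X) / C

# ... which equals the plain average of the C input word vectors
assert np.allclose(h, W[context].mean(axis=0))
```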

### b) Skip-gram

From the input layer to the hidden layer (with $k$ the index of the input word):

$$h = W^T_{(k,\cdot)} := v^T_{w_I}$$

From the hidden layer to the output layer:

$$p(w_{c,j} = w_{O,c} \mid w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$
where:

• $w_I$ denotes the input word
• $w_{c,j}$ denotes the event that the $c$-th output word falls on the $j$-th neuron of the output layer
• $w_{O,c}$ denotes the word that should actually appear at the $c$-th output position
• $y_{c,j}$ is the normalized probability that the $c$-th output word falls on the $j$-th neuron
• $u_{c,j}$ is the un-normalized score of the $j$-th neuron for the $c$-th output position

Because the output layer shares its weights across the $C$ positions:

$$u_{c,j} = u_j = (v'_{w_j})^T h, \quad \text{for } c = 1, 2, \ldots, C$$

where $v'_{w_j}$ is the output vector of the $j$-th word, i.e. the $j$-th column of the output weight matrix $W'$.

The loss function becomes:

$$E = -\log p(w_{O,1}, w_{O,2}, \ldots, w_{O,C} \mid w_I) = -\log \prod_{c=1}^{C} \frac{\exp(u_{c,j^*_c})}{\sum_{j'=1}^{V} \exp(u_{j'})} = -\sum_{c=1}^{C} u_{j^*_c} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'})$$
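The shared-score trick and this loss can be verified numerically. Dimensions and word indices below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
V, N, C = 10, 4, 2
W = rng.normal(size=(V, N))
W_prime = rng.normal(size=(N, V))

w_I = 0                            # center (input) word, arbitrary
targets = [3, 5]                   # the C true context words j*_c, arbitrary

h = W[w_I]                         # hidden layer = input vector of the center word
u = W_prime.T @ h                  # shared scores u_j, identical for every position c

# E = -sum_c u_{j*_c} + C * log sum_{j'} exp(u_{j'})
E = -sum(u[j] for j in targets) + C * np.log(np.exp(u).sum())

# same value as summing the per-position softmax cross-entropies
y = np.exp(u) / np.exp(u).sum()
assert np.isclose(E, -sum(np.log(y[j]) for j in targets))
```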

Note:

• Empirically, the Skip-gram model is usually preferred because it tends to give better results.
• In word2vec, when using CBOW the final word embeddings are taken from the output vectors ($W'_{N \times V}$), and when using Skip-gram they are taken from the input vectors ($W_{V \times N}$): in both cases this picks the weight matrix closer to the center word.

## 2.3 hierarchical softmax and negative sampling

Training word2vec requires a very large corpus to obtain accurate results, but as the corpus grows, so do training time and resource consumption. Is there room for optimization, say by trading a little accuracy for faster training? Yes: hierarchical softmax and negative sampling.

The paper Distributed Representations of Words and Phrases and their Compositionality introduces these two techniques for training word2vec (they are also worked through in detail in word2vec Parameter Learning Explained). Let's look at each in detail.

### a) Huffman trees and Huffman coding

Before looking at hierarchical softmax, let's first review Huffman trees and Huffman codes.

A Huffman tree is an optimal binary tree: among all binary trees built from a given set of weighted leaf nodes, it is the one with the minimum weighted path length.

So, given a set of weights, how do we construct a Huffman tree? Following the principle that nodes with larger weights should sit as close to the root as possible, the Huffman algorithm proceeds as follows:

• 1. From the given $n$ weights $w_1, w_2, \ldots, w_n$, build a forest of $n$ binary trees $F = \{T_1, T_2, \ldots, T_n\}$, where each $T_i$ ($1 \le i \le n$) consists of a single root node with weight $w_i$ and empty left and right subtrees.
• 2. Select the two trees in $F$ whose roots have the smallest weights and make them the left and right subtrees of a new binary tree, whose root weight is the sum of those two weights.
• 3. Delete those two trees from $F$ and add the new binary tree to $F$.
• 4. Repeat steps 2 and 3 until only one binary tree remains in $F$.

For example, given the weights [9, 11, 13, 8, 4, 5], the resulting Huffman tree is shown below (figure from Baidu Jingyan):

Note:

• When constructing a Huffman tree, there is no mandated left/right placement for the two merged nodes; just pick a rule and follow it throughout. Conventionally, the smaller-weight node goes on the left.

So what is Huffman coding? Huffman coding is a variable-length encoding scheme derived from a Huffman tree.

Label the edges of the constructed Huffman tree with 0/1, say 0 for the left subtree and 1 for the right. For the Huffman tree constructed in the figure above, the leaf codes are:

• 9 -> 00
• 11 -> 01
• 13 -> 10
• 8 -> 110
• 4 -> 1110
• 5 -> 1111
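The four construction steps above can be sketched with Python's heapq. This is a generic Huffman builder, not word2vec's own implementation; heap tie-breaking may produce a tree different from the one in the figure, but every valid Huffman tree of these weights has the same weighted path length:

```python
import heapq
import itertools

def huffman_code_lengths(weights):
    """Build a Huffman tree; return {original_index: code length (leaf depth)}."""
    counter = itertools.count()          # tie-breaker so heapq never compares lists
    # each heap entry: (tree weight, tiebreak, [(original_index, depth), ...])
    heap = [(w, next(counter), [(i, 0)]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)   # the two smallest trees
        w2, _, leaves2 = heapq.heappop(heap)
        # merging pushes every leaf of both subtrees one level deeper
        merged = [(i, d + 1) for i, d in leaves1 + leaves2]
        heapq.heappush(heap, (w1 + w2, next(counter), merged))
    _, _, leaves = heap[0]
    return dict(leaves)

weights = [9, 11, 13, 8, 4, 5]
depths = huffman_code_lengths(weights)
# weighted path length is invariant across all optimal trees of these weights
cost = sum(weights[i] * d for i, d in depths.items())
print(cost)  # 126, the same total as the code table listed above
```

As a sanity check, the depths always satisfy the Kraft equality `sum(2**-d) == 1`, as they must for any full binary code tree.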

Note:

• Likewise, Huffman coding does not dictate whether the left subtree is labeled 0 or 1.
• In word2vec, the Huffman tree is built and encoded the other way around: the left subtree is coded 1 and the right subtree 0 (the paper writes −1, with the same meaning), and the left subtree's weight is constrained to be no smaller than the right subtree's.

### b) hierarchical softmax

The figure above shows a Huffman coding tree. The white nodes are the words of the vocabulary and the black nodes are the internal (inner) nodes. The path for word $w_2$ is drawn with the black line, and its length is 4; $n(w, j)$ denotes the $j$-th node on the path to word $w$.

The Huffman-tree-optimized word2vec removes the hidden-to-output weight matrix (i.e. the output vectors) and instead attaches vectors to the inner nodes of the Huffman tree (the black nodes above). The probability that the output word is $w$ can then be expressed as:

$$p(w = w_O) = \prod_{j=1}^{L(w)-1} \sigma\left( \left[ n(w,j+1) = \mathrm{ch}(n(w,j)) \right] \cdot (v'_{n(w,j)})^T h \right)$$

where:

• $\mathrm{ch}(n)$ denotes the left child of node $n$
• $v'_{n(w,j)}$ is the vector representation of inner node $n(w,j)$ (an auxiliary vector maintained throughout the optimization)
• $n(w, j)$ is the $j$-th node on the path to word $w$
• $h$ is the hidden-layer output (equal to $v_{w_I}$ in the Skip-gram model, and to $\frac{1}{C}\sum_{c=1}^{C} v_{w_c}$, the average of the input word vectors, in CBOW)
• $L(w)$ is the length of the shortest path to leaf $w$; subtracting 1 leaves only the inner nodes visited before reaching the leaf
• $\left[ x \right]$ is defined as follows (i.e. the sign is $1$ when the path goes left and $-1$ when it goes right):

$$\left[ x \right] = \begin{cases} 1 & \text{if } x \text{ is true} \\ -1 & \text{otherwise} \end{cases}$$

In the tree above, the probability of going to the left child of inner node $n$ is defined as:

$$p(n, \mathrm{left}) = \sigma\left((v'_n)^T \cdot h\right)$$

where $\sigma$ denotes the sigmoid function. The probability of going right is:

$$p(n, \mathrm{right}) = 1 - \sigma\left((v'_n)^T \cdot h\right) = \sigma\left(-(v'_n)^T \cdot h\right)$$

So for the word $w_2$ in the figure, the probability is:

$$p(w_2 = w_O) = p(n(w_2,1), \mathrm{left}) \cdot p(n(w_2,2), \mathrm{left}) \cdot p(n(w_2,3), \mathrm{right}) = \sigma\left((v'_{n(w_2,1)})^T \cdot h\right) \cdot \sigma\left((v'_{n(w_2,2)})^T \cdot h\right) \cdot \sigma\left(-(v'_{n(w_2,3)})^T \cdot h\right)$$

Summing over all words:

$$\sum_{i=1}^{V} p(w_i = w_O) = 1$$
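This sum-to-one property is easy to verify numerically on a toy complete binary tree over four words. The tree layout and random vectors below are illustrative assumptions, not word2vec's actual Huffman tree; the identity holds because $\sigma(x) + \sigma(-x) = 1$ at every inner node:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
N = 4
h = rng.normal(size=N)            # hidden-layer output
inner = rng.normal(size=(3, N))   # vectors v'_n for the 3 inner nodes of a
                                  # complete binary tree over 4 leaf words

# each leaf is encoded by its path: (inner node index, sign),
# where sign +1 means "go left" and -1 means "go right"
paths = {
    0: [(0, +1), (1, +1)],
    1: [(0, +1), (1, -1)],
    2: [(0, -1), (2, +1)],
    3: [(0, -1), (2, -1)],
}

p = {w: np.prod([sigmoid(sign * (inner[n] @ h)) for n, sign in path])
     for w, path in paths.items()}
total = sum(p.values())
assert np.isclose(total, 1.0)     # the leaf probabilities form a distribution
```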

The loss function in this case is:

$$E = -\log p(w = w_O \mid w_I) = -\sum_{j=1}^{L(w)-1} \log \sigma\left( \left[ n(w,j+1) = \mathrm{ch}(n(w,j)) \right] \cdot (v'_{n(w,j)})^T h \right)$$

### c) negative sampling

Besides hierarchical softmax, the other optimization is based on Noise Contrastive Estimation (NCE), explained in detail in the paper Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Because NCE is somewhat involved, word2vec adopts a simplified version of it called Negative Sampling.

Computing scores over all non-target words is expensive, so a small number of negative words are sampled instead. The corresponding loss function is:

$$E = -\log \sigma\left((v'_{w_O})^T h\right) - \sum_{w_j \in W_{neg}} \log \sigma\left(-(v'_{w_j})^T h\right)$$
where:

• $w_O$ is the output word
• $v'_{w_O}$ is the output vector of $w_O$
• $h$ is the hidden-layer output: $\frac{1}{C}\sum_{c=1}^{C} v_{w_c}$ for the CBOW model and $v_{w_I}$ for Skip-gram
• $W_{neg}$ is the set of sampled negative words
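A numerical sketch of this loss with toy dimensions. The word indices and the sampled negatives are arbitrary here; in real word2vec the negatives are drawn from a noise distribution (the unigram distribution raised to the 3/4 power):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
V, N = 10, 4
W_prime = rng.normal(size=(N, V))  # output vectors; column j = v'_{w_j}

h = rng.normal(size=N)             # hidden-layer output
w_O = 2                            # the positive (true output) word
negatives = [5, 7, 9]              # sampled noise words W_neg

# E = -log sigma(v'_{w_O}^T h) - sum_{w_j in W_neg} log sigma(-v'_{w_j}^T h)
E = -np.log(sigmoid(W_prime[:, w_O] @ h)) \
    - sum(np.log(sigmoid(-W_prime[:, j] @ h)) for j in negatives)
assert E > 0                       # each term is -log of a probability in (0, 1)
```

Only the one positive word and the few sampled negatives contribute to the loss, which is what makes this so much cheaper than the full softmax over $V$ words.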

Note:

• For CBOW or Skip-gram optimized with hierarchical softmax or negative sampling, the final word vectors should be the input-side vectors between the input layer and the hidden layer (this is my reading: the papers do not state it explicitly and I have not seen it in public materials, so perhaps I missed it).
• Speculation: could a leaf's word vector instead be represented by the average of the inner-node vectors along its shortest path?

## 2.4 Using Word2vec in Gensim

Before use, import the model class:

```python
from gensim.models.word2vec import Word2Vec
```

Create a model:

```python
model = Word2Vec(sentences=topics_list, iter=5, size=128, window=5, min_count=0,
                 workers=10, sg=1, hs=1, negative=1, seed=128, compute_loss=True)
```

(Parameter names here follow gensim versions before 4.0; in gensim 4.0+, `size` was renamed to `vector_size` and `iter` to `epochs`.)

The model has many parameters; the main ones are:

• sentences: the training corpus, an iterable of token sequences
• corpus_file: load the corpus from a file; mutually exclusive with sentences
• size: dimensionality of the word vectors; default 100, commonly 64, 128, 256, etc.
• window: size of the sliding context window; default 5
• min_count: words occurring fewer than this many times are ignored; default 5
• seed: seed for the random number generator
• workers: number of threads used for training; default 3
• min_alpha: the learning rate drops linearly to this minimum during training; default 0.0001
• sg: 1 for Skip-gram, 0 for CBOW; default 0
• hs: 1 enables hierarchical softmax; if 0 and negative is non-zero, negative sampling is used; default 0
• negative: if greater than 0, negative sampling is used with this many noise words (5-20 is the recommended range); 0 disables it; default 5

See the docstrings in the model for the full parameter list.

Save and load the model:

```python
model.save(model_path)

model = Word2Vec.load(model_path)
```

Output the loss value:

```python
model.get_latest_training_loss()
```

Compute the similarity between two words:

```python
model.wv.similarity(word1, word2)
```

