# Omnipotent Embedding 5 - the siblings of skip-thought [Trim / CNN-LSTM / Quick-thought]

2020-12-07 08:39:15

This chapter covers three siblings of skip-thought, each of which tries a different fix for the problems skip-thought left open 【Ref1～4】. The papers below may not offer the best solutions (for different NLP tasks there is really no best solution, only the most suitable one), but they do offer other ways of thinking and other possibilities. The skip-thought model from the previous chapter leaves several points worth further discussion:

• Q1 Low RNN computing efficiency: both the Encoder and the Decoder are RNNs, and because each RNN step depends on the output of the previous step, it is inherently at odds with parallel computing, so training is painfully slow.
• Q2 Decoder: the Decoder is never used at prediction time, yet it takes up a lot of training time. Can this be optimized?
• Q3 Sample construction for general text vectors: is it reasonable for skip-thought to predict only the previous/next sentence?
• Q4 Are two decoders necessary?
• Q5 Is a pre-trained word embedding worth considering?
• Q6 Besides the hidden_state, is there another way to extract the sentence vector?

The papers are covered below in order, from the smallest departure from skip-thought to the largest.

## Trim/Rethink skip-thought

【Ref1/2】 are two companion papers (a/b) by the same authors. They adjust some details of the skip-thought model and obtain benchmark results comparable to skip-thought. They mainly address Q4, Q5 and Q6 above.

The authors argue that two decoders are unnecessary: given the information in the middle sentence, the previous and the next sentence can be reconstructed with the same decoder. This assumption would not be acceptable for a translation-style language model, but it seems acceptable when training general text vectors, because we want the encoder to extract as much information as possible and to generalize to any context, so simplifying the Decoder is appropriate.

The authors compare initializing the word vectors with GloVe and with word2vec; in the evaluation both do better than random initialization. I feel that initializing with pre-trained word vectors has two advantages: it speeds up convergence, and the linear mapping used for vocabulary expansion may be more accurate. Initialization with pre-trained word vectors is by now a common practice.

For Q6: the original skip-thought outputs the Encoder's last hidden_state as the text vector. Could we use the hidden states of the whole sequence instead? The authors borrow avg+max pooling: apply avg and max pooling to the Encoder's hidden states along the sequence and concatenate the two results as the output text vector $$\left[\frac{\sum_{i=1}^T h_i}{T}, \max_{i=1}^T h_i\right]$$. The assumption behind this scheme is that the embedding is not treated as a whole. Instead, each unit of the embedding is treated as a separate feature (or class of features), the states output at different positions of the sequence may have extracted different information, and avg/max pooling picks out the most representative features as the sentence features. We will meet this question many times later: after training a language model, what is the best thing to use as the sentence vector? Consider this a preview.
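To make the pooling formula concrete, here is a minimal sketch in plain Python (no TensorFlow). The function name `avg_max_pool` and the toy hidden states are my own illustration, not code from the papers, and padding positions are assumed to have been removed already.

```python
def avg_max_pool(hidden_states):
    """Concatenate the per-unit mean and max over T hidden states.

    hidden_states: list of T vectors (lists of floats), each of size H.
    Returns a list of length 2 * H: [mean over time, max over time].
    """
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    T = len(hidden_states)
    # zip(*...) regroups the values per hidden unit: H tuples of length T
    per_unit = list(zip(*hidden_states))
    avg = [sum(unit) / T for unit in per_unit]
    mx = [max(unit) for unit in per_unit]
    return avg + mx


# Toy encoder output: T = 3 time steps, H = 2 hidden units
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
sentence_vector = avg_max_pool(states)  # length 2 * H = 4
```

The sentence vector is twice the hidden size, so anything consuming it downstream has to expect `2 * H` features rather than `H`.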

So it turns out that what I built in the previous chapter was really a trimmed skip-thought: I used word2vec for initialization and only 1 decoder to train on the sentence pairs... If you are interested, have a look at Github-Embedding-skip_thought

Trim puts skip-thought on a diet. Want more speed? Read on.

## CNN-LSTM

The solution 【Ref3】 gives for Q1 is to replace the RNN with a CNN as the Encoder that extracts sentence information, which removes the problem that RNN computation cannot be parallelized. The implementation has to solve two problems:

• How to compress sequences of variable length to the same length
• How a CNN can extract sequence features

The model is structured as follows. The tokens of the sequence are embedded and then used as input. Assume every sequence is padded to the same length N and the embedding dimension is K, so the input is N * K. Think of it as a 1-D image: N is the image length and K is the number of image channels.

The authors define 3 CNN cells with different kernel_size=3/4/5. This is close to the n-gram idea: each cell learns local sequence information with window_size=3/4/5. Because a CNN shares parameters, 1 filter can extract only 1 kind of token-combination feature, so each CNN cell has 800 filters. Taking kernel_size=3 as an example, the CNN weight tensor has dimension 3 * K * 800, and the output computed from the sequence embedding is (N-3+1) * 800.

To compress to the same length, a max_pooling layer is added after this output (in most uses of CNNs for NLP, max is said to work better than avg). Pooling along the sequence dimension compresses the output above to 1 * 800; put simply, each filter keeps only its single most significant feature over the sequence. The outputs of the 3 different kernel_size cells are concatenated into a vector with hidden_size=2400, which is also the final vector representation of the text.
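The shape bookkeeping above (N * K in, N-k+1 activations per filter, one value per filter after pooling) can be checked with a small plain-Python sketch. The helper names `conv1d_valid` and `cnn_sentence_vector` and the all-ones filters are hypothetical, the activation function is left out, and the sequence is assumed to be at least as long as the largest kernel.

```python
def conv1d_valid(seq, kernel):
    """1-D convolution without padding, for a single filter.

    seq:    N token embeddings, each of size K
    kernel: k weight rows, each of size K (one filter)
    Returns N - k + 1 activations.
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        out.append(sum(w * x
                       for w_row, x_row in zip(kernel, window)
                       for w, x in zip(w_row, x_row)))
    return out


def cnn_sentence_vector(seq, filter_banks):
    """Max-pool every filter over the sequence axis and concatenate.

    filter_banks: one list of filters per kernel size.
    """
    vector = []
    for filters in filter_banks:
        for kernel in filters:
            vector.append(max(conv1d_valid(seq, kernel)))
    return vector


# Toy input: N = 5 tokens, K = 2 embedding dimensions
seq = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.0, 0.0], [1.0, 3.0]]
# Kernel sizes 3 and 4 with one all-ones filter each (the paper uses 800)
bank3 = [[[1.0, 1.0]] * 3]
bank4 = [[[1.0, 1.0]] * 4]

feature_map = conv1d_valid(seq, bank3[0])          # length N - 3 + 1 = 3
vector = cnn_sentence_vector(seq, [bank3, bank4])  # one value per filter
```

With 3 kernel sizes and 800 filters each, the same concatenation yields the 2400-dimensional vector described above.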

Since only the encoder really differs, I simply put CNN-LSTM together with the skip-thought code from the previous chapter and distinguish only the choice of encoder/decoder cell. Only the implementation of the CNN Encoder is shown here; the bridge part follows Google's seq2seq. See Github-Embedding-skip_thought for the complete code.

    import tensorflow as tf  # TF 1.x API (tf.layers)

    # ENCODER_OUTPUT is defined elsewhere in the repository; it is built with the fields `output` and `state`.

    def cnn_encoder(input_emb, input_len, params):
        # input_len is unused: padded positions also take part in the pooling
        # batch_size * seq_len * emb_size -> batch_size * (seq_len - kernel_size + 1) * filters
        outputs = []
        params = params['encoder_cell_params']
        for i in range(len(params['filters'])):
            output = tf.layers.conv1d(inputs=input_emb,
                                      filters=params['filters'][i],
                                      kernel_size=params['kernel_size'][i],  # window size, similar to n-gram
                                      strides=params['strides'][i])
            output = params['activation'][i](output)
            # max pooling along the sequence axis:
            # batch_size * (seq_len - kernel_size + 1) * filters -> batch_size * filters
            outputs.append(tf.reduce_max(output, axis=1))
        # concatenate all kernel sizes: batch_size * sum(filters)
        output = tf.concat(outputs, axis=1)
        return ENCODER_OUTPUT(output=output, state=(output,))


I feel that compressing to the same length could also be handled with padding here. As for letting the CNN learn text information over different lengths, the authors concatenate different kernel sizes, but you could also try stacking CNNs. Two stacked CNN layers with kernel=3 cover a window of 5 tokens at stride 1, and a window of 9 tokens if the first layer uses stride 3.
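How long a window stacked convolutions actually see depends on the strides as well as the kernel sizes. The sketch below uses the standard receptive-field recurrence; the function name `receptive_field` is my own and nothing here comes from the paper.

```python
def receptive_field(layers):
    """Number of input tokens seen by one output unit of stacked 1-D convs.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    field = 1  # an input token sees only itself
    jump = 1   # distance in input tokens between neighbouring units
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


two_layers_stride1 = receptive_field([(3, 1), (3, 1)])  # 5 tokens
two_layers_stride3 = receptive_field([(3, 3), (3, 1)])  # 9 tokens
```

So two kernel=3 layers reach 9 tokens only when the first layer strides by 3; at stride 1 they reach 5.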

For the Decoder the authors use an LSTM. As mentioned in the skip-thought chapter, though, with teacher forcing the decoder does not feel very important, so I will not go into it here.

Another interesting point in the paper concerns Q3. The authors put a soul-searching question to the core hypothesis of skip-thought: why should the information in the middle sentence equal the information needed to reconstruct the previous and next sentences? (The Trim paper above made a similar attempt, so the two are discussed together here.)

The authors give several options:

• The autoencoder task: the middle sentence reconstructs the middle sentence
• The composite task: the middle sentence reconstructs the middle sentence as well as the previous/next sentence
• The hierarchical task: enlarge the time window and use the middle sentence to predict the next several sentences
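The three options differ only in which target sentences are paired with each input sentence. A minimal sketch of that sample construction follows; the function `build_samples`, its `task` names and the `window` parameter are hypothetical, and it only builds the training pairs, not the decoders the paper puts on top of them.

```python
def build_samples(sentences, task, window=1):
    """Return (input_sentence, target_sentences) pairs for one document.

    task: 'autoencoder'  -> reconstruct the sentence itself
          'composite'    -> reconstruct itself plus previous and next sentence
          'hierarchical' -> predict the next `window` sentences
    """
    samples = []
    for i, sent in enumerate(sentences):
        if task == 'autoencoder':
            targets = [sent]
        elif task == 'composite':
            prev = sentences[i - 1:i] if i > 0 else []
            targets = prev + [sent] + sentences[i + 1:i + 2]
        elif task == 'hierarchical':
            targets = sentences[i + 1:i + 1 + window]
        else:
            raise ValueError('unknown task: %s' % task)
        if targets:  # the last sentence has nothing left to predict
            samples.append((sent, targets))
    return samples


doc = ['s0', 's1', 's2', 's3']
ae = build_samples(doc, 'autoencoder')
comp = build_samples(doc, 'composite')
hier = build_samples(doc, 'hierarchical', window=2)
```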

My feeling is that the autoencoder captures more intra-sentence syntactic information, such as grammar and sentence structure, while the task of reconstructing the previous and next sentences learns more inter-sentence semantic information, such as context. Put another way, two texts with similar autoencoder-trained vectors may simply look alike on the surface, whereas similarity between vectors trained on the previous/next sentences reflects more similarity in semantics and context.

Setting intuition aside and looking at the metrics: in the Trim paper, adding AE to the model helps only on the question-type classification task (more syntactic) and hurts on the other, semantic classification tasks such as movie-review. In the CNN paper, however, the AE-only model and the model with AE added perform better on all classification tasks. I am a little confused by this too...

What kind of training samples should be used to train a general text vector? "General" here means that good results can be obtained on any downstream task. I will leave this as an open question for now; later we will see whether USE can solve it with multi-task joint learning ～

## Quick-thought

【Ref4】 finally steps outside the translation-style language-model framework and gives a new solution to Q2. Since the Decoder is slow and useless for the text vector representation, we simply drop it and replace the reconstruct task with a classification task. BERT later used this idea directly as its NSP pre-training task.

The idea behind the classification task is the same familiar recipe as training word vectors with negative sampling in word2vec: both come down to constructing positive and negative samples. For word2vec's skip-gram, the positive samples are the words within window_size, and the negative samples are drawn at random from the vocabulary. Quick-thought stays consistent with skip-thought here: the positive samples are the sentences within window_size, in other words the middle sentence is used to predict the previous and next sentences, and the negative samples are the other sentences in the batch apart from the previous and next sentences.
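As a sketch of this positive/negative construction, assume a batch holds consecutive sentences from one document, so neighbours in the batch are neighbours in the text. The function name `context_targets` is my own; `context_size` is borrowed from the parameter name used in the repository code, with half of it taken on each side.

```python
def context_targets(batch_size, context_size):
    """Row-normalised target matrix for a batch of consecutive sentences.

    Entry [i][j] is positive when sentence j lies within context_size // 2
    positions of sentence i (excluding i itself); every other sentence in
    the batch is a negative.
    """
    offset = context_size // 2
    targets = []
    for i in range(batch_size):
        row = [1.0 if i != j and abs(i - j) <= offset else 0.0
               for j in range(batch_size)]
        total = sum(row)
        # share the probability mass equally between the positives
        targets.append([v / total for v in row] if total else row)
    return targets


targets = context_targets(batch_size=4, context_size=2)
```

For the first and last sentence of the batch only one neighbour exists, so that neighbour receives the whole probability mass.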

Since we are talking about positive and negative samples, what are they in skip-thought? Given the use of teacher forcing, skip-thought predicts the T-th word from the middle sentence plus the first T-1 words of the previous/next sentence, and the negative samples are all the words in the vocabulary other than the T-th word (very much like skip-gram). The authors therefore point out that this kind of reconstruct task may learn overly superficial textual information and fail to learn more general semantic information. The classification task only requires the context sentence as a whole to be more similar to the input than the other sentences are, so it does not have this problem.

The model is structured as follows. The Encoder part can extract information in any way: it can be the GRU used in skip-thought, or the CNN above. As in skip-gram, two encoders with independent parameters extract information from the input and the target respectively, producing two fixed-length output states. So that the states themselves learn as much text information as possible, the classifier uses the simplest possible operation: take the inner product of the two states directly, and then do binary classification directly on the inner product.
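A plain-Python sketch of this scoring step, under my own assumptions: the vectors stand in for the outputs of the two encoders, candidates are the other sentences in the batch (at least 2 sentences), and the scores are turned into probabilities with a softmax over the candidates. The names `dot` and `candidate_probs` are hypothetical, and removing a sentence from its own candidate set entirely is a simplification.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def candidate_probs(input_vecs, target_vecs):
    """Softmax over in-batch candidates, scored by inner product.

    input_vecs[i] comes from the input encoder, target_vecs[j] from the
    target encoder; a sentence is never a candidate for itself.
    """
    probs = []
    for i, u in enumerate(input_vecs):
        scores = [dot(u, v) for v in target_vecs]
        others = [s for j, s in enumerate(scores) if j != i]
        top = max(others)  # subtract the max for numerical stability
        exp = [0.0 if j == i else math.exp(s - top)
               for j, s in enumerate(scores)]
        total = sum(exp)
        probs.append([e / total for e in exp])
    return probs


# Toy batch of 3 sentences with 2-dimensional states
input_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_vecs = [[1.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
probs = candidate_probs(input_vecs, target_vecs)
```

Because the score is nothing more than an inner product, all the modelling capacity has to live in the two encoders, which is the point of keeping the classifier this simple.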

At prediction time, the two encoders each extract a state from the input sentence, and the states are concatenated as the text vector extracted by the model.

Too lazy to start a new repo, I put quick-thought in with skip-thought as well; the Encoder part can be shared anyway. See Github-Embedding-skip_thought for the complete code.

    import numpy as np
    import tensorflow as tf  # TF 1.x API

    # ENCODER_FAMILY (encoder_type -> encoder function) is defined elsewhere in the repository.


    class EncoderBase(object):
        def __init__(self, params):
            self.params = params
            self.init()

        def init(self):
            # word embedding initialized from pre-trained vectors
            with tf.variable_scope('embedding', reuse=tf.AUTO_REUSE):
                self.embedding = tf.get_variable(dtype=self.params['dtype'],
                                                 initializer=tf.constant(self.params['pretrain_embedding']),
                                                 name='word_embedding')

        def general_encoder(self, features):
            encoder = ENCODER_FAMILY[self.params['encoder_type']]
            # batch_size * max_len * emb_size
            seq_emb_input = tf.nn.embedding_lookup(self.embedding, features['tokens'])
            encoder_output = encoder(seq_emb_input, features['seq_len'], self.params)
            return encoder_output

        def vectorize(self, state_list, features):
            with tf.variable_scope('inference'):
                result = {}
                # copy through input for checking
                result['input_tokenid'] = tf.identity(features['tokens'], name='input_id')
                token_table = tf.get_collection('token_table')[0]
                result['input_token'] = tf.identity(token_table.lookup(features['tokens']), name='input_token')
                # the sentence vector is the concatenation of all encoder states
                result['encoder_state'] = tf.concat(state_list, axis=1, name='sentence_vector')
            return result


    class QuickThought(EncoderBase):
        def __init__(self, params):
            super(QuickThought, self).__init__(params)

        def build_model(self, features, labels, mode):
            input_encode = self.input_encode(features)
            output_encode = self.output_encode(features, labels, mode)
            # [batch, batch] similarity score: inner product of every input/target pair
            sim_score = tf.matmul(input_encode.state[0], output_encode.state[0], transpose_b=True)
            loss = self.compute_loss(sim_score)
            return loss

        def input_encode(self, features):
            with tf.variable_scope('input_encoding', reuse=False):
                encoder_output = self.general_encoder(features)
            return encoder_output

        def output_encode(self, features, labels, mode):
            # second encoder with its own parameters; encodes the target sentence in training
            with tf.variable_scope('output_encoding', reuse=False):
                if mode == tf.estimator.ModeKeys.PREDICT:
                    encoder_output = self.general_encoder(features)
                else:
                    encoder_output = self.general_encoder(labels)
            return encoder_output

        def compute_loss(self, sim_score):
            with tf.variable_scope('compute_loss'):
                batch_size = sim_score.get_shape().as_list()[0]  # needs a static batch size
                # set the score of each sentence with itself to 0
                sim_score = tf.matrix_set_diag(sim_score, tf.zeros([batch_size], dtype=sim_score.dtype))

                # create targets: set elements within the diagonal offset to 1
                targets = np.zeros(shape=(batch_size, batch_size))
                offset = self.params['context_size'] // 2  # offset of the diagonal
                for i in range(1, offset + 1):
                    targets += np.eye(batch_size, k=i) + np.eye(batch_size, k=-i)
                # normalize each row so that the targets sum to 1
                targets = targets / np.sum(targets, axis=1, keepdims=True)
                targets = tf.constant(targets, dtype=self.params['dtype'])

                losses = tf.nn.softmax_cross_entropy_with_logits(labels=targets,
                                                                 logits=sim_score)
                losses = tf.reduce_mean(losses)
            return losses


Comments and complaints are welcome ～

【REF】

1. Rethinking Skip-thought: A Neighbourhood based Approach, Tang et al., 2017
2. Trimming and Improving Skip-thought Vectors, Tang et al., 2017
3. Learning Generic Sentence Representations Using Convolutional Neural Networks, Gan et al., 2017
4. An Efficient Framework for Learning Sentence Representations, Lajanugen Logeswaran et al., 2018
5. https://zhuanlan.zhihu.com/p/50443871

https://chowdera.com/2020/12/20201207083816904m.html