
Sentiment classification using CNN

2020-12-08 02:37:40 Michael Amin

Reference: Natural Language Processing Based on Deep Learning

1. Reading data

Data file: yelp_labelled.txt, 1,000 review sentences, each tagged with a 0/1 sentiment label.

import numpy as np
import pandas as pd

data = pd.read_csv("yelp_labelled.txt", sep='\t', names=['sentence', 'label'])

data.head() # 1,000 rows in total

# split into features X and labels y
sentence = data['sentence'].values
label = data['label'].values
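For reference, each line of yelp_labelled.txt is a sentence followed by a tab and a 0/1 sentiment label. The first lines look roughly like this (illustrative):

Wow... Loved this place.	1
Crust is not good.	0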

2. Splitting the dataset

# split into training and test sets (70% / 30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sentence, label, test_size=0.3, random_state=1)
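As a quick sanity check (my addition, not part of the original), a 30% split of 1,000 sentences should leave 700 examples for training and 300 for testing:

print(len(X_train), len(X_test)) # 700 300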

3. Text vectorization

  • Train a tokenizer and turn each text into a sequence of word ids (see the illustration after this list)
#  Text vectorization 
import keras
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=6000)
tokenizer.fit_on_texts(X_train) # fit the tokenizer on the training texts
X_train = tokenizer.texts_to_sequences(X_train) # convert texts to [[ids...],[ids...],...]
X_test = tokenizer.texts_to_sequences(X_test)
vocab_size = len(tokenizer.word_index)+1 # +1 because index 0 maps to no word; it is reserved for padding
  • Pad the id sequences so they all have the same length
maxlen = 100
# pad so that every sequence has the same length
from keras.preprocessing.sequence import pad_sequences
X_train = pad_sequences(X_train, maxlen=maxlen, padding='post')
# padding='post' appends zeros at the end; 'pre' prepends them
X_test = pad_sequences(X_test, maxlen=maxlen, padding='post')
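To make the two steps above concrete, here is a small illustration; the ids shown are hypothetical, since the actual values depend on the fitted tokenizer:

sample = tokenizer.texts_to_sequences(["the food was great"])
print(sample) # e.g. [[1, 23, 5, 38]]
padded = pad_sequences(sample, maxlen=maxlen, padding='post')
print(padded.shape) # (1, 100): zeros appended after the ids up to length 100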

4. Building the CNN model

from keras import layers
embeddings_dim = 150
filters = 64
kernel_size = 5
batch_size = 64


nn_model = keras.Sequential()
nn_model.add(layers.Embedding(input_dim=vocab_size, output_dim=embeddings_dim, input_length=maxlen))
nn_model.add(layers.Conv1D(filters=filters,kernel_size=kernel_size,activation='relu'))
nn_model.add(layers.GlobalMaxPool1D())
nn_model.add(layers.Dropout(0.3))
# GlobalMaxPool1D above removed the time dimension; the Lambda layer below adds a dimension back so a second Conv1D can be applied
nn_model.add(layers.Lambda(lambda x : keras.backend.expand_dims(x, axis=-1)))
nn_model.add(layers.Conv1D(filters=filters,kernel_size=kernel_size,activation='relu'))
nn_model.add(layers.GlobalMaxPool1D())
nn_model.add(layers.Dropout(0.3))
nn_model.add(layers.Dense(10, activation='relu'))
nn_model.add(layers.Dense(1, activation='sigmoid')) # sigmoid for binary classification; softmax for multi-class

Reference articles:
Embedding Layer details
Keras: GlobalMaxPooling vs. MaxPooling
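On the GlobalMaxPooling-vs-MaxPooling point: a possible simplification (a sketch of mine, not the author's model) is to use MaxPool1D for the intermediate pooling. Unlike GlobalMaxPool1D it keeps the time dimension, so the Lambda/expand_dims trick is no longer needed:

alt_model = keras.Sequential()
alt_model.add(layers.Embedding(input_dim=vocab_size, output_dim=embeddings_dim, input_length=maxlen))
alt_model.add(layers.Conv1D(filters=filters, kernel_size=kernel_size, activation='relu'))
alt_model.add(layers.MaxPool1D(pool_size=2)) # (None, 48, 64): time axis preserved
alt_model.add(layers.Conv1D(filters=filters, kernel_size=kernel_size, activation='relu'))
alt_model.add(layers.GlobalMaxPool1D())
alt_model.add(layers.Dense(10, activation='relu'))
alt_model.add(layers.Dense(1, activation='sigmoid'))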

  • Compile the model
nn_model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
nn_model.summary()
from keras.utils import plot_model
plot_model(nn_model, to_file='model.jpg') # draw the model structure to a file (requires pydot and graphviz)
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 100, 150)          251400    
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 96, 64)            48064     
_________________________________________________________________
global_max_pooling1d_7 (Glob (None, 64)                0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 64)                0         
_________________________________________________________________
lambda_4 (Lambda)            (None, 64, 1)             0         
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 60, 64)            384       
_________________________________________________________________
global_max_pooling1d_8 (Glob (None, 64)                0         
_________________________________________________________________
dropout_8 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 10)                650       
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 11        
=================================================================
Total params: 300,509
Trainable params: 300,509
Non-trainable params: 0
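A quick check of the parameter counts: the embedding layer has vocab_size × embeddings_dim = 1,676 × 150 = 251,400 parameters, so the fitted vocabulary contained 1,675 distinct words plus the reserved padding index 0. The first Conv1D has kernel_size × 150 input channels × 64 filters + 64 biases = 48,064 parameters, while the second convolves over a single channel, hence only 5 × 1 × 64 + 64 = 384.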

5. Training and testing

history = nn_model.fit(X_train,y_train,batch_size=batch_size,
             epochs=50,verbose=2,validation_data=(X_test,y_test))
# verbose controls logging: 0 = silent, 1 = progress bar, 2 = one line per epoch
loss, accuracy = nn_model.evaluate(X_train, y_train, verbose=1)
print(" Training set :loss {0:.3f},  Accuracy rate :{1:.3f}".format(loss, accuracy))
loss, accuracy = nn_model.evaluate(X_test, y_test, verbose=1)
print(" Test set :loss {0:.3f},  Accuracy rate :{1:.3f}".format(loss, accuracy))

# plot the training curves
from matplotlib import pyplot as plt
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
plt.show()

Output:

Epoch 1/50
11/11 - 1s - loss: 0.6933 - accuracy: 0.5014 - val_loss: 0.6933 - val_accuracy: 0.4633
Epoch 2/50
11/11 - 0s - loss: 0.6931 - accuracy: 0.5214 - val_loss: 0.6935 - val_accuracy: 0.4633
Epoch 3/50
11/11 - 1s - loss: 0.6930 - accuracy: 0.5257 - val_loss: 0.6936 - val_accuracy: 0.4633
... (intermediate epochs omitted)
Epoch 48/50
11/11 - 0s - loss: 0.0024 - accuracy: 1.0000 - val_loss: 0.7943 - val_accuracy: 0.7600
Epoch 49/50
11/11 - 1s - loss: 0.0016 - accuracy: 1.0000 - val_loss: 0.7970 - val_accuracy: 0.7600
Epoch 50/50
11/11 - 0s - loss: 0.0027 - accuracy: 1.0000 - val_loss: 0.7994 - val_accuracy: 0.7600
22/22 [==============================] - 0s 4ms/step - loss: 9.0586e-04 - accuracy: 1.0000
Training set: loss 0.001, accuracy: 1.000
10/10 [==============================] - 0s 5ms/step - loss: 0.7994 - accuracy: 0.7600
Test set: loss 0.799, accuracy: 0.760

The model is overfitting: training accuracy is perfect (1.000) while test accuracy is only 0.760.

[Figure: training and validation loss/accuracy curves plotted from history.history]
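One common mitigation (a sketch of mine, not part of the original post) is early stopping: halt training once validation loss stops improving and keep the best weights.

from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history = nn_model.fit(X_train, y_train, batch_size=batch_size, epochs=50,
                       verbose=2, validation_data=(X_test, y_test),
                       callbacks=[early_stop])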

  • Spot-checking on new sentences
text = ["i am not very good.", "i am very good."]
x = tokenizer.texts_to_sequences(text)
x = pad_sequences(x, maxlen=maxlen, padding='post')
pred = nn_model.predict(x)
print(" forecast {} The categories of are :".format(text[0]), 1 if pred[0][0]>=0.5 else 0)
print(" forecast {} The categories of are :".format(text[1]), 1 if pred[1][0]>=0.5 else 0)

Output:

Predicted class of "i am not very good.": 0
Predicted class of "i am very good.": 1

Copyright notice
This article was written by [Michael Amin]. Please include a link to the original when reposting. Thanks.
https://chowdera.com/2020/12/202012080237241996.html