
UrbanSound8K sound classification with deep learning

2020-11-13 04:19:30  mind_programmonkey

I'm back, and I've been meaning to blog for a while. I'm in a bad mood, though, and I don't feel like writing up a stock-market retrospective. Still, since the market just taught me a lesson, I feel I should take something away from it, so I'll jot down the two passages below as a wake-up call to myself. Once those two passages are out of the way, we'll get into the UrbanSound8K sound classification walkthrough.

The first passage:

Be greedy when others are fearful, and fearful when others are greedy!

The second passage:

I think this one is excellent, and I use it to encourage myself!!!

You can never earn money beyond your understanding,

except by luck.

And money earned by luck is usually lost again by "skill";

that is inevitable.

Every cent you earn

is a cash-out of your understanding of the world.

Every cent you lose

is a gap in your understanding of the world.

The greatest justice in this world

is that when a person's wealth exceeds their cognition,

the world has a thousand ways to harvest it back,

until wealth and cognition match again.

Of course, there has also been some good news lately: ever since I started writing in this more casual style, my follower count has been growing fast!!!

Okay, enough rambling; let's get straight into the UrbanSound8K sound classification walkthrough.

The plan is simple: use MFCC features plus a CNN-style Keras model to tackle the audio. Go go go!!!

I. The UrbanSound8K sound classification task

UrbanSound8K is a public dataset widely used in research on automatic classification of urban environmental sounds. It contains 8732 labeled sound clips (each at most about 4 s long) covering 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music.

Our task is to classify each clip into one of these 10 categories.

The clips are split across 10 folds (fold1 ... fold10), each stored in its own directory.


If you need the dataset, here is a download link you can fetch with wget:

# Download the dataset with wget
!wget  https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz
--2020-03-15 05:03:42--  https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz
Resolving zenodo.org (zenodo.org)... 188.184.95.95
Connecting to zenodo.org (zenodo.org)|188.184.95.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6023741708 (5.6G) [application/octet-stream]
Saving to: ‘UrbanSound8K.tar.gz’

UrbanSound8K.tar.gz 100%[===================>]   5.61G  39.3MB/s    in 2m 27s  

2020-03-15 05:06:09 (39.2 MB/s) - ‘UrbanSound8K.tar.gz’ saved [6023741708/6023741708]

Note: I ran this inside a Jupyter notebook, which is why the command is prefixed with `!`; if you run it directly on the command line, drop the `!`.
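Once the download finishes, the archive still needs to be unpacked before you can read the folds. Here is a minimal sketch using Python's standard tarfile module (equivalent to running tar -xzf UrbanSound8K.tar.gz on the command line); the target directory "." is just an assumption, adjust it to taste:

#  Sketch: unpack the downloaded archive (creates the UrbanSound8K/ directory)
import tarfile

with tarfile.open("UrbanSound8K.tar.gz", "r:gz") as tar:
    tar.extractall(".")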

II. A look at the UrbanSound8K dataset

Before training any model, let's take a look at the dataset.

1. View the first 5 rows of the metadata CSV

#  Load the metadata CSV and show the first 5 rows
#  (the path assumes the archive was extracted to ./UrbanSound8K; adjust it if yours differs)
import pandas as pd
data = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")
data.head()
slice_file_name fsID start end salience fold classID class
0 100032-3-0-0.wav 100032 0.0 0.317551 1 5 3 dog_bark
1 100263-2-0-117.wav 100263 58.5 62.500000 1 5 2 children_playing
2 100263-2-0-121.wav 100263 60.5 64.500000 1 5 2 children_playing
3 100263-2-0-126.wav 100263 63.0 67.000000 1 5 2 children_playing
4 100263-2-0-137.wav 100263 68.5 72.500000 1 5 2 children_playing

2. Class distribution in each fold

# Count the number of clips of each class in each fold
appended = []
for i in range(1,11):
    appended.append(data[data.fold == i]['class'].value_counts())
    
class_distribution = pd.DataFrame(appended)
class_distribution = class_distribution.reset_index()
class_distribution['index'] = ["fold"+str(x) for x in range(1,11)]
class_distribution
index jackhammer air_conditioner street_music children_playing drilling dog_bark engine_idling siren car_horn gun_shot
0 fold1 120 100 100 100 100 100 96 86 36 35
1 fold2 120 100 100 100 100 100 100 91 42 35
2 fold3 120 100 100 100 100 100 107 119 43 36
3 fold4 120 100 100 100 100 100 107 166 59 38
4 fold5 120 100 100 100 100 100 107 71 98 40
5 fold6 68 100 100 100 100 100 107 74 28 46
6 fold7 76 100 100 100 100 100 106 77 28 51
7 fold8 78 100 100 100 100 100 88 80 30 30
8 fold9 82 100 100 100 100 100 89 82 32 31
9 fold10 96 100 100 100 100 100 93 83 33 32
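To read the table above at a glance, the overall number of clips per class can also be plotted. This is a small sketch I'm adding for illustration (it assumes matplotlib is available), not part of the original run:

#  Sketch: bar chart of the total number of clips per class
import matplotlib.pyplot as plt

data['class'].value_counts().plot(kind='bar', figsize=(10, 4))
plt.ylabel('number of clips')
plt.tight_layout()
plt.show()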

3. Visualize a wav file

#  Helper functions: locate a wav file and plot it
import os
import struct
import scipy.io.wavfile as wav
import matplotlib.pyplot as plt
import IPython.display as ipd

#  Build the full path to a wav file and return it together with its class label
def path_class(filename):
    excerpt = data[data['slice_file_name'] == filename]
    path_name = os.path.join('UrbanSound8K/audio', 'fold'+str(excerpt.fold.values[0]), filename)
    return path_name, excerpt['class'].values[0]

#  Plot a wav file and print its basic properties
def wav_plotter(full_path, class_label):
    rate, wav_sample = wav.read(full_path)
    # Read the bit depth directly from the RIFF/fmt header (the last two of the first 36 bytes)
    wave_file = open(full_path, "rb")
    riff_fmt = wave_file.read(36)
    bit_depth_string = riff_fmt[-2:]
    bit_depth = struct.unpack("H", bit_depth_string)[0]
    print('sampling rate: ', rate, 'Hz')
    print('bit depth: ', bit_depth)
    print('number of channels: ', wav_sample.shape[1])   # assumes a stereo (2-D) sample array
    print('duration: ', wav_sample.shape[0]/rate, ' second')
    print('number of samples: ', len(wav_sample))
    print('class: ', class_label)
    plt.figure(figsize=(12, 4))
    plt.plot(wav_sample)
    return ipd.Audio(full_path)

#  Try it on an example clip
fullpath, label = path_class('100263-2-0-117.wav')
wav_plotter(fullpath, label)

Now that we understand the dataset, we can start playing with the wav files!!!

III. MFCC feature extraction

The next step is to extract MFCC features from each sound clip. If you want to learn more about how MFCCs work, refer to a dedicated write-up on MFCC.

Below, we extract the MFCC features and label of every sound file and save them to an npy file for convenient later processing.

Why do it this way? Because once you download the dataset you will find that the wav files are huge, about 6 GB!!!

The resulting npy file, by contrast, is only a few MB.
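To make the 40-dimensional feature concrete before running the full extraction loop, here is a minimal sketch on the single example clip used earlier. librosa.feature.mfcc returns an (n_mfcc, n_frames) matrix, and averaging over the time axis collapses it to one 40-dimensional vector per clip; this snippet is illustrative and not part of the original post:

#  Sketch: MFCC of one clip, before and after averaging over time
import librosa
import numpy as np

fullpath, label = path_class('100263-2-0-117.wav')   # helper defined in section II
signal, sample_rate = librosa.load(fullpath, res_type='kaiser_fast')
mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=40)
print(mfcc.shape)                        # (40, n_frames), n_frames depends on clip length
mfcc_mean = np.mean(mfcc.T, axis=0)
print(mfcc_mean.shape)                   # (40,) - the per-clip feature vector we store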

1. Extract MFCC features from the wav files

#  Read each wav file, extract its MFCC features and label, and store them
import numpy as np
import librosa
import progressbar

#  Pre-allocate an (N, 2) object array: column 0 = MFCC vector, column 1 = label
dataset = np.empty((data.shape[0], 2), dtype=object)

bar = progressbar.ProgressBar(maxval=data.shape[0], widgets=[progressbar.Bar('$', '||', '||'), ' ', progressbar.Percentage()])
bar.start()
for i in range(data.shape[0]):

    fullpath, class_id = path_class(data.slice_file_name[i])
    try:
        X, sample_rate = librosa.load(fullpath, res_type='kaiser_fast')
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception:
        print("Error encountered while parsing file: ", fullpath)
        mfccs, class_id = None, None
    feature = mfccs
    label = class_id
    dataset[i, 0], dataset[i, 1] = feature, label

    bar.update(i+1)
||$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$|| 100%

2. Save the MFCC features and labels

#  Save the MFCC features and labels as an npy file so they can be reloaded later
np.save("dataset",dataset,allow_pickle=True)

3. Inspect the saved npy file

l = np.load("dataset.npy",allow_pickle= True)
#  The npy array has 8732 rows and 2 columns: column 0 is the MFCC feature vector, column 1 is the label
l.shape
(8732, 2)
#  Look at the MFCC features of clip 8730
l[8730,0]
array([-3.44714210e+02,  1.26758143e+02, -5.61771663e+01,  3.60709288e+01,
       -2.06790388e+01,  8.23251959e+00,  1.27489714e+01,  9.64033889e+00,
       -8.98542590e+00,  1.84566301e+01, -1.04024313e+01,  2.07821493e-02,
       -6.83207553e+00,  1.16148172e+01, -3.84560777e+00,  1.42655549e+01,
       -5.70736889e-01,  5.26963822e+00, -4.74782564e+00,  3.52672016e+00,
       -7.85683552e+00,  3.22314076e+00, -1.02495424e+01,  4.20803645e+00,
        1.41565567e+00,  2.67714725e+00, -4.34362262e+00,  3.85769686e+00,
        1.73091054e+00, -2.37936884e+00, -8.23096181e+00,  2.16999653e+00,
        6.12071068e+00,  5.85898183e+00,  1.65499303e+00,  2.89231452e+00,
       -4.38354807e+00, -7.80225750e+00, -1.77907374e+00,  5.83541843e+00])
#  Look at the label of clip 8730
l[8730,1]
'car_horn'

Now that we've had our fun with MFCC, isn't it time to bring in the CNN?

IV. Deep learning classification with a CNN

1. Preprocess the data

import numpy as np
import pandas as pd

#  Load the saved features and labels
data = pd.DataFrame(np.load("dataset.npy", allow_pickle=True))
data.columns = ['feature', 'label']

#  Turn them into plain arrays
from sklearn.preprocessing import LabelEncoder

X = np.array(data.feature.tolist())
y = np.array(data.label.tolist())

#  Train/validation split
from sklearn.model_selection import train_test_split
X, val_x, y, val_y = train_test_split(X, y)

#  One-hot encode the labels (fit the encoder on the training labels, then reuse it for validation)
lb = LabelEncoder()
from keras.utils import np_utils
y = np_utils.to_categorical(lb.fit_transform(y))
val_y = np_utils.to_categorical(lb.transform(val_y))
Using TensorFlow backend.

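One small addition that helps later when reading the confusion matrix: print the LabelEncoder's class order once. sklearn's LabelEncoder sorts class names alphabetically, so index 0 should correspond to air_conditioner, index 1 to car_horn, and so on (a quick sketch, not in the original post):

#  Sketch: show which integer index maps to which class name
for idx, name in enumerate(lb.classes_):
    print(idx, name)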

2. Define a simple CNN model

#  Define the model
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 


num_labels = y.shape[1]
filter_size = 3

# Build the model.
# Note: despite the "CNN" framing, this is a small fully-connected network
# that takes the 40-dimensional time-averaged MFCC vector as input.
model = Sequential()
model.add(Dense(512, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_labels))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
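As a quick sanity check that is not in the original run, model.summary() can be printed after compiling; with a 40-dimensional input and 10 output classes, the three Dense layers should come to roughly 155K trainable parameters:

#  Sketch: inspect layer shapes and the parameter count
model.summary()
# Rough expectation (assuming num_labels == 10):
#   Dense(512): 40*512 + 512  = 20,992
#   Dense(256): 512*256 + 256 = 131,328
#   Dense(10):  256*10 + 10   = 2,570
#   total: about 154,890 trainable parameters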

3. Training process

#  Training models 
model.fit(X, y, batch_size=64, epochs=32, validation_data=(val_x, val_y))
Epoch 1/32
6549/6549 [==============================] - 10s 1ms/step - loss: 11.2826 - acc: 0.2003 - val_loss: 7.5983 - val_acc: 0.3417
Epoch 2/32
6549/6549 [==============================] - 1s 142us/step - loss: 6.0749 - acc: 0.2990 - val_loss: 2.1300 - val_acc: 0.2300
Epoch 3/32
6549/6549 [==============================] - 1s 142us/step - loss: 2.1298 - acc: 0.2886 - val_loss: 1.9270 - val_acc: 0.3601
Epoch 4/32
6549/6549 [==============================] - 1s 140us/step - loss: 1.9575 - acc: 0.3404 - val_loss: 1.8134 - val_acc: 0.3811
Epoch 5/32
6549/6549 [==============================] - 1s 152us/step - loss: 1.8316 - acc: 0.3758 - val_loss: 1.6505 - val_acc: 0.4530
Epoch 6/32
6549/6549 [==============================] - 1s 148us/step - loss: 1.7294 - acc: 0.4098 - val_loss: 1.5590 - val_acc: 0.5044
Epoch 7/32
6549/6549 [==============================] - 1s 149us/step - loss: 1.6061 - acc: 0.4463 - val_loss: 1.4071 - val_acc: 0.5479
Epoch 8/32
6549/6549 [==============================] - 1s 153us/step - loss: 1.5202 - acc: 0.4753 - val_loss: 1.2976 - val_acc: 0.5905
Epoch 9/32
6549/6549 [==============================] - 1s 155us/step - loss: 1.4394 - acc: 0.5065 - val_loss: 1.2583 - val_acc: 0.5868
Epoch 10/32
6549/6549 [==============================] - 1s 149us/step - loss: 1.3724 - acc: 0.5383 - val_loss: 1.1599 - val_acc: 0.6340
Epoch 11/32
6549/6549 [==============================] - 1s 130us/step - loss: 1.2737 - acc: 0.5593 - val_loss: 1.0785 - val_acc: 0.6583
Epoch 12/32
6549/6549 [==============================] - 1s 138us/step - loss: 1.2278 - acc: 0.5838 - val_loss: 1.0306 - val_acc: 0.6848
Epoch 13/32
6549/6549 [==============================] - 1s 141us/step - loss: 1.1638 - acc: 0.5989 - val_loss: 0.9763 - val_acc: 0.6958
Epoch 14/32
6549/6549 [==============================] - 1s 138us/step - loss: 1.1108 - acc: 0.6216 - val_loss: 0.9236 - val_acc: 0.7197
Epoch 15/32
6549/6549 [==============================] - 1s 133us/step - loss: 1.0715 - acc: 0.6254 - val_loss: 0.8937 - val_acc: 0.7320
Epoch 16/32
6549/6549 [==============================] - 1s 131us/step - loss: 1.0380 - acc: 0.6506 - val_loss: 0.8610 - val_acc: 0.7339
Epoch 17/32
6549/6549 [==============================] - 1s 135us/step - loss: 1.0015 - acc: 0.6642 - val_loss: 0.8241 - val_acc: 0.7426
Epoch 18/32
6549/6549 [==============================] - 1s 138us/step - loss: 0.9514 - acc: 0.6836 - val_loss: 0.7962 - val_acc: 0.7627
Epoch 19/32
6549/6549 [==============================] - 1s 134us/step - loss: 0.9312 - acc: 0.6903 - val_loss: 0.7593 - val_acc: 0.7787
Epoch 20/32
6549/6549 [==============================] - 1s 138us/step - loss: 0.9279 - acc: 0.6871 - val_loss: 0.7609 - val_acc: 0.7760
Epoch 21/32
6549/6549 [==============================] - 1s 139us/step - loss: 0.8756 - acc: 0.6974 - val_loss: 0.7506 - val_acc: 0.7755
Epoch 22/32
6549/6549 [==============================] - 1s 132us/step - loss: 0.8398 - acc: 0.7134 - val_loss: 0.7181 - val_acc: 0.7769
Epoch 23/32
6549/6549 [==============================] - 1s 133us/step - loss: 0.8275 - acc: 0.7204 - val_loss: 0.6903 - val_acc: 0.7952
Epoch 24/32
6549/6549 [==============================] - 1s 137us/step - loss: 0.8007 - acc: 0.7210 - val_loss: 0.6813 - val_acc: 0.8007
Epoch 25/32
6549/6549 [==============================] - 1s 132us/step - loss: 0.7845 - acc: 0.7377 - val_loss: 0.6573 - val_acc: 0.7971
Epoch 26/32
6549/6549 [==============================] - 1s 132us/step - loss: 0.7509 - acc: 0.7436 - val_loss: 0.6246 - val_acc: 0.8117
Epoch 27/32
6549/6549 [==============================] - 1s 134us/step - loss: 0.7419 - acc: 0.7424 - val_loss: 0.6113 - val_acc: 0.8145
Epoch 28/32
6549/6549 [==============================] - 1s 127us/step - loss: 0.7335 - acc: 0.7525 - val_loss: 0.6224 - val_acc: 0.8016
Epoch 29/32
6549/6549 [==============================] - 1s 131us/step - loss: 0.7146 - acc: 0.7563 - val_loss: 0.5810 - val_acc: 0.8278
Epoch 30/32
6549/6549 [==============================] - 1s 130us/step - loss: 0.6848 - acc: 0.7693 - val_loss: 0.5966 - val_acc: 0.8145
Epoch 31/32
6549/6549 [==============================] - 1s 121us/step - loss: 0.6806 - acc: 0.7652 - val_loss: 0.5640 - val_acc: 0.8360
Epoch 32/32
6549/6549 [==============================] - 1s 125us/step - loss: 0.6776 - acc: 0.7732 - val_loss: 0.5613 - val_acc: 0.8296





<keras.callbacks.History at 0x7f6d72f7da20>

Okay, a simple CNN model has been built and trained!

But do you think we're done?!! No. That was just a single network; we can train several of them and ensemble their predictions. Many hands make light work!

V. Multiple CNN models (an ensemble)

There is strength in numbers. Young man, do you crave power?

1. Define the model structure

#  Model structure 
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
from keras.callbacks import LearningRateScheduler

# Reuse the one-hot labels from the preprocessing step above under new names
y_en, val_y_en = y, val_y

num_labels = y_en.shape[1]
nets = 5

model = [0] *nets

# build model
for net in range(nets):
  model[net] = Sequential()


  model[net].add(Dense(512, input_shape=(40,)))
  model[net].add(Activation('relu'))
  model[net].add(Dropout(0.45))


  model[net].add(Dense(256))
  model[net].add(Activation('relu'))
  model[net].add(Dropout(0.45))


  model[net].add(Dense(num_labels))
  model[net].add(Activation('softmax'))



  model[net].compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='RMSprop')

2. Train the networks

#  Train each network in the ensemble
history = [0] * nets
epochs = 132
for j in range(nets):
    X_train2, X_val2, Y_train2, Y_val2 = X,val_x, y_en, val_y_en
    history[j] = model[j].fit(X,Y_train2, batch_size=256,
        epochs = epochs,   
        validation_data = (X_val2,Y_val2),  verbose=0)
    print("CNN {0:d}: Epochs={1:d}, Train accuracy={2:.5f}, Validation accuracy={3:.5f}".format(
        j+1,epochs,max(history[j].history['acc']),max(history[j].history['val_acc']) ))


CNN 1: Epochs=132, Train accuracy=0.92752, Validation accuracy=0.92023
CNN 2: Epochs=132, Train accuracy=0.92539, Validation accuracy=0.91870
CNN 3: Epochs=132, Train accuracy=0.92703, Validation accuracy=0.91947
CNN 4: Epochs=132, Train accuracy=0.92703, Validation accuracy=0.91450
CNN 5: Epochs=132, Train accuracy=0.92965, Validation accuracy=0.91794

3. Visualize the training process

#  Plot the loss and accuracy curves of the last network in the ensemble
net = -1
name_title = ['Loss','Accuracy']
fig=plt.figure(figsize=(64,64))
for i in range(0,2):
    ax=fig.add_subplot(8,8,i+1)
    plt.plot(history[net].history[list(history[net].history.keys())[i]], label = list(history[net].history.keys())[i] )
    plt.plot(history[net].history[list(history[net].history.keys())[i+2]],label = list(history[net].history.keys())[i+2] )
    plt.xlabel('Epochs', fontsize=18)
    plt.ylabel(name_title[i], fontsize=18)
    plt.legend()
    plt.show()

[Figure: training and validation loss curves (output_66_0.png)]

[Figure: training and validation accuracy curves (output_66_1.png)]

4. Take a look at the metrics

#  Define the evaluation helper
import seaborn as sns

def acc(y_test, prediction):
    # Print recall, precision, the classification report and the confusion matrix
    cm = confusion_matrix(y_test, prediction)
    recall = np.diag(cm) / np.sum(cm, axis=1)
    precision = np.diag(cm) / np.sum(cm, axis=0)
    
    print ('Recall:', recall)
    print ('Precision:', precision)
    print ('\n clasification report:\n', classification_report(y_test, prediction))
    print ('\n confussion matrix:\n', confusion_matrix(y_test, prediction))
    
    # Heatmap of the confusion matrix
    ax = sns.heatmap(confusion_matrix(y_test, prediction), linewidths=0.5, cmap="YlGnBu")
#  Look at the metrics and confusion matrix 
results = np.zeros( (val_x.shape[0],10) ) 
for j in range(nets):
  results = results  + model[j].predict(val_x)
results = np.argmax(results,axis = 1)
val_y_n = np.argmax(val_y_en,axis =1)
acc(val_y_n,results)
Recall: [0.98586572 0.92413793 0.92334495 0.81666667 0.91961415 0.96677741
 0.74576271 0.97377049 0.98201439 0.88356164]
Precision: [0.94897959 0.98529412 0.80792683 0.9141791  0.95016611 0.95409836
 0.93617021 0.94888179 0.95454545 0.87457627]

 clasification report:
               precision    recall  f1-score   support

           0       0.95      0.99      0.97       283
           1       0.99      0.92      0.95       145
           2       0.81      0.92      0.86       287
           3       0.91      0.82      0.86       300
           4       0.95      0.92      0.93       311
           5       0.95      0.97      0.96       301
           6       0.94      0.75      0.83       118
           7       0.95      0.97      0.96       305
           8       0.95      0.98      0.97       278
           9       0.87      0.88      0.88       292

    accuracy                           0.92      2620
   macro avg       0.93      0.91      0.92      2620
weighted avg       0.92      0.92      0.92      2620


 confussion matrix:
 [[279   0   1   0   0   0   0   0   0   3]
 [  1 134   1   1   0   2   0   0   2   4]
 [  2   0 265   2   1   2   2   0   1  12]
 [  6   1  24 245   4   2   2   0   6  10]
 [  1   0   4   1 286   2   1  12   0   4]
 [  0   0   4   4   0 291   1   0   0   1]
 [  2   0  11  10   0   0  88   3   3   1]
 [  1   0   0   0   5   0   0 297   0   2]
 [  0   0   3   2   0   0   0   0 273   0]
 [  2   1  15   3   5   6   0   1   1 258]]

[Figure: confusion matrix heatmap (output_69_1.png)]
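To actually use the trained ensemble on a recording, the same pipeline applies: load the wav, compute the time-averaged 40-dimensional MFCC vector, average the five networks' softmax outputs, and map the argmax back to a class name with the LabelEncoder. A minimal sketch (the file name is just the example clip from earlier, and predict_clip is a helper I'm introducing here, not something from the original post):

#  Sketch: classify one clip with the averaged ensemble prediction
import numpy as np
import librosa

def predict_clip(path, models, label_encoder):
    signal, sr = librosa.load(path, res_type='kaiser_fast')
    feat = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40).T, axis=0)
    feat = feat.reshape(1, -1)                              # shape (1, 40)
    probs = np.mean([m.predict(feat) for m in models], axis=0)
    return label_encoder.inverse_transform([np.argmax(probs)])[0]

fullpath, true_label = path_class('100263-2-0-117.wav')
print(predict_clip(fullpath, model, lb), '(true label:', true_label, ')')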

Okay, that completes the UrbanSound8K sound classification deep learning walkthrough; judging from the results, it actually works quite well.

I also tried training a traditional machine learning model, XGBoost, on the same features, and judging from those results, XGBoost may actually perform a bit better. As for why, I honestly don't know.
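For reference, here is a minimal sketch of what such an XGBoost baseline might look like on the same 40-dimensional MFCC features; the hyperparameters are illustrative placeholders, not the ones actually used in that experiment:

#  Sketch: XGBoost baseline on the same time-averaged MFCC features
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

y_int = np.argmax(y, axis=1)              # back from one-hot to integer class ids
val_y_int = np.argmax(val_y, axis=1)

xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
xgb.fit(X, y_int)
print('XGBoost validation accuracy:', accuracy_score(val_y_int, xgb.predict(val_x)))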

Okay, that wraps up the sound classification walkthrough; honestly, the difficulty isn't that high. Next time, when I get the chance, I'll post something harder, closer to fine-grained sound classification, where plain MFCC+CNN won't cut it. See you next week! I hope you'll keep urging me on so I can make it even better!!!



Copyright notice
This article was written by [mind_programmonkey]; please include a link to the original when reposting. Thank you.