Preface
What this post covers
- Using MinMaxScaler to normalize feature data
- Using StandardScaler to standardize feature data
Feature preprocessing
Definition
Feature preprocessing is the process of applying conversion functions to feature data so that it becomes better suited to the algorithm or model.
Feature preprocessing API
sklearn.preprocessing
Why normalize or standardize?
When features differ greatly in unit or scale, or when the variance of one feature is several orders of magnitude larger than that of the others, that feature tends to dominate the target result, and some algorithms can no longer learn from the other features.
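To make the scale problem concrete, here is a minimal sketch that measures how much each feature contributes to a Euclidean distance. It uses the milage and Liters values of the first two sample rows shown later in this post; the choice of Euclidean distance is an assumption made purely for illustration.

```python
import math

# (milage, Liters) of the first two sample rows from the Data section.
a = (40920.0, 8.326976)
b = (14488.0, 7.153469)

# Euclidean distance on the raw, unscaled features.
raw_distance = math.dist(a, b)

# Distance if we ignored Liters completely.
milage_only = abs(a[0] - b[0])

# Share of the squared distance contributed by each feature.
squared_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
shares = [d / sum(squared_diffs) for d in squared_diffs]

print(f"distance using both features: {raw_distance:.6f}")
print(f"distance using milage only:   {milage_only:.6f}")
print(f"share of squared distance per feature: {shares}")
```

The two distances are almost identical, so on unscaled data a distance-based algorithm would effectively see only the milage column.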
Normalization
Definition
Transform the original data so that it is mapped into a given interval ([0, 1] by default).
X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi
The transformation is applied to each column: max is the column's maximum, min is the column's minimum, and X'' is the final result. mx and mi are the bounds of the target interval; by default mx is 1 and mi is 0.
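The column-wise mapping can be sketched directly in NumPy. This is only an illustration of the arithmetic: `min_max_scale` is a hypothetical helper name, not a library function, and it assumes that no column is constant (otherwise max - min would be zero).

```python
import numpy as np


def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Column-wise min-max scaling (hypothetical helper for illustration)."""
    X = np.asarray(X, dtype=float)
    mi, mx = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # X' = (x - min) / (max - min); assumes max > min in every column.
    X_std = (X - col_min) / (col_max - col_min)
    # X'' = X' * (mx - mi) + mi
    return X_std * (mx - mi) + mi


# First three sample rows (milage, Liters, Consumtime).
X = np.array([
    [40920, 8.326976, 0.953952],
    [14488, 7.153469, 1.673904],
    [26052, 1.441871, 0.805124],
])

scaled = min_max_scale(X)
print(scaled)
```

After scaling, the smallest value in each column maps to mi and the largest to mx, whatever the original units were.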
API
- sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), …)
  - MinMaxScaler.fit_transform(X)
    - X: data as a NumPy array of shape [n_samples, n_features]
    - Return value: the transformed array, with the same shape as X
Data
milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
Code
from sklearn.preprocessing import MinMaxScaler
import pandas as pd


def minmax_demo():
    # Load the sample data (expects dating.txt in the working directory)
    data = pd.read_csv("dating.txt")
    print(data)
    # 1. Instantiate the transformer
    transfer = MinMaxScaler(feature_range=(2, 3))
    # 2. Call fit_transform on the three feature columns
    data = transfer.fit_transform(data[['milage', 'Liters', 'Consumtime']])
    print("Result of min-max normalization:\n", data)
    return None
Result
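A minimal sketch for regenerating the output without the file: it inlines only the five sample rows listed in the Data section. If dating.txt contains more rows than those five (an assumption, since the full file is not shown), the column minimums and maximums will differ, and so will the scaled values.

```python
import io

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# The five sample rows from the Data section, inlined so no file is needed.
csv_text = """milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
"""

data = pd.read_csv(io.StringIO(csv_text))
features = data[["milage", "Liters", "Consumtime"]]

# Same settings as minmax_demo: map every column into [2, 3].
transfer = MinMaxScaler(feature_range=(2, 3))
scaled = transfer.fit_transform(features)

print(scaled)
print("column minimums:", scaled.min(axis=0))
print("column maximums:", scaled.max(axis=0))
```

Every column should now span exactly the requested interval, with the row holding the column minimum mapped to 2 and the row holding the maximum mapped to 3.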
Standardization
Definition
Transform the original data so that each feature has a mean of 0 and a standard deviation of 1.
X' = (x - mean) / σ
The transformation is applied to each column: mean is the column's average and σ is the column's standard deviation.
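The same per-column arithmetic can be sketched in NumPy. `standardize` is a hypothetical helper name used only for illustration, and it assumes that no column is constant (otherwise σ would be zero). It uses the population standard deviation (`ddof=0`), which is also what StandardScaler uses.

```python
import numpy as np


def standardize(X):
    """Column-wise z-score scaling (hypothetical helper for illustration)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Population standard deviation (ddof=0); assumes sigma > 0 per column.
    sigma = X.std(axis=0)
    # X' = (x - mean) / sigma
    return (X - mean) / sigma


# First three sample rows (milage, Liters, Consumtime).
X = np.array([
    [40920, 8.326976, 0.953952],
    [14488, 7.153469, 1.673904],
    [26052, 1.441871, 0.805124],
])

z = standardize(X)
print(z)
print("column means (approximately 0):", z.mean(axis=0))
print("column standard deviations (approximately 1):", z.std(axis=0))
```

Unlike min-max scaling, the result is not confined to a fixed interval, and a single extreme value shifts the mean and standard deviation less than it shifts the minimum or maximum.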
API
- sklearn.preprocessing.StandardScaler()
  - After processing, the data in each column is centered around a mean of 0 with a standard deviation of 1
  - StandardScaler.fit_transform(X)
    - X: data as a NumPy array of shape [n_samples, n_features]
    - Return value: the transformed array, with the same shape as X
Data
The same data as in the normalization example.
Code
from sklearn.preprocessing import StandardScaler
import pandas as pd


def stand_demo():
    # Load the sample data (expects dating.txt in the working directory)
    data = pd.read_csv("dating.txt")
    print(data)
    # 1. Instantiate the transformer
    transfer = StandardScaler()
    # 2. Call fit_transform on the three feature columns
    data = transfer.fit_transform(data[['milage', 'Liters', 'Consumtime']])
    print("Result of standardization:\n", data)
    print("Mean of each feature column:\n", transfer.mean_)
    print("Variance of each feature column:\n", transfer.var_)
    return None
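
As a quick sanity check, the following sketch runs StandardScaler on the five sample rows, inlined here so that no file is needed, and confirms that the scaled columns end up with mean 0 and standard deviation 1. The values printed for the full dating.txt may differ if the file holds more rows than the five shown (an assumption, since the full file is not included in this post).

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler

# The five sample rows from the Data section, inlined so no file is needed.
csv_text = """milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
"""

data = pd.read_csv(io.StringIO(csv_text))
features = data[["milage", "Liters", "Consumtime"]]

transfer = StandardScaler()
scaled = transfer.fit_transform(features)

# Statistics learned from the original columns during fitting.
print("mean_: ", transfer.mean_)
print("var_:  ", transfer.var_)
print("scale_:", transfer.scale_)  # square root of var_

# Statistics of the transformed columns.
print("scaled means:", scaled.mean(axis=0))
print("scaled standard deviations:", scaled.std(axis=0))
```

Note that `mean_` and `var_` describe the original columns, not the transformed ones; the transformed columns are the ones with mean 0 and standard deviation 1.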