
6 steps to teach you financial data mining preprocessing

2022-06-23 18:01:31 | Chen Chen 135

There is no single standard process for data preprocessing; the steps usually differ by task and by the attributes of the dataset. Let's walk through six common steps that cover most preprocessing work.

Step 1: Import related modules

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

import yfinance as yf
yf.pdr_override()

Step 2: get data

symbol = 'TCEHY'
start = '2011-01-01'
end = '2021-03-31'
dataset = yf.download(symbol,start,end)

#  View the data 
dataset.head()

X = dataset[['Open', 'High', 'Low', 'Volume']].values
y = dataset['Adj Close'].values

#  Feature construction
dataset['Increase_Decrease'] = np.where(dataset['Volume'].shift(-1) > dataset['Volume'],1,0)
dataset['Buy_Sell_on_Open'] = np.where(dataset['Open'].shift(-1) > dataset['Open'],1,0)
dataset['Buy_Sell'] = np.where(dataset['Adj Close'].shift(-1) > dataset['Adj Close'],1,0)
dataset['Returns'] = dataset['Adj Close'].pct_change()
dataset = dataset.dropna()
dataset.head()
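As a sanity check on the labels built above, the toy series below (made-up volume numbers) shows how `shift(-1)` compares each value with the next one:

```python
import numpy as np
import pandas as pd

# Made-up volume series: the label is 1 when the NEXT value is higher
# than the current one, mirroring the Increase_Decrease column above.
vol = pd.Series([100, 120, 110, 130])
label = np.where(vol.shift(-1) > vol, 1, 0)
print(label)  # [1 0 1 0] -- the last row has no "next" value, so it gets 0
```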

Step 3: Handling missing values

from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
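A minimal sketch of mean imputation with the modern `SimpleImputer` API, on a made-up matrix with one missing entry:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing entry in column 0.
M = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imp.fit_transform(M)
print(filled)  # the NaN becomes the column mean (1 + 3) / 2 = 2
```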

Step 4: Classification data coding

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

  • Create dummy variables
# categorical_features was removed from OneHotEncoder;
# ColumnTransformer now selects the column(s) to encode
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

Step 5: Divide the training set and the test set

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed
X_train, X_test, y_train, y_test = train_test_split(
         X,
         y,
         test_size = 0.2,
         random_state = 0)
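On synthetic data, the same call produces an 80/20 split; a fixed `random_state` makes the split reproducible across runs:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 synthetic samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```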

Step 6: Feature Standardization

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  # reuse the statistics fitted on the training set
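A small sketch (toy numbers) of why the test set should be transformed with the scaler fitted on the training set: calling `fit_transform` on the test set would leak the test set's own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

sc = StandardScaler().fit(train)  # mean = 2, std = sqrt(2/3)
z = sc.transform(test)            # scored against TRAIN statistics
print(z)  # about 2.449 standard deviations above the train mean
```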

Ten data transformation techniques

A linear data transformation [1] multiplies each element of the dataset by a constant; in other words, it transforms each value x into y = a·x, where x, y, and a are all real numbers. Data transformation changes the distribution of the data and the position of the data points.
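A quick numeric check of the transformation y = a·x (a is an illustrative constant): both the mean and the spread of the data are scaled by a.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
a = 2.0          # illustrative constant
y = a * x
print(y.mean())  # equals a * x.mean() = 5.0
print(y.std())   # equals a * x.std()
```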

MinMaxScaler

>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler=MinMaxScaler(feature_range=(0,1))
>>> rescaledX=scaler.fit_transform(X)
>>> np.set_printoptions(precision=3) #  Set the accuracy of the output 
>>> rescaledX[0:5,:]
array([[0.009, 0.008, 0.009, 0.01 ],
       [0.01 , 0.009, 0.01 , 0.003],
       [0.01 , 0.009, 0.01 , 0.001],
       [0.01 , 0.009, 0.01 , 0.009],
       [0.01 , 0.009, 0.01 , 0.017]])
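MinMaxScaler implements column-wise (x - min) / (max - min); a hand computation on a toy column reproduces its output:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

col = np.array([[10.0], [15.0], [20.0]])  # toy column
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(col)
manual = (col - col.min()) / (col.max() - col.min())
print(scaled.ravel())  # [0.  0.5 1. ]
```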

Standardizing Data

Standardization [2] (sometimes called the z-score or standard score) rescales a variable to zero mean and a standard deviation of 1. For a standardized variable, each value indicates its difference from the mean of the original variable, measured in units of the original variable's standard deviation.

>>> from sklearn.preprocessing import StandardScaler
>>> scaler=StandardScaler().fit(X)
>>> rescaledX=scaler.transform(X)
>>> rescaledX[0:5,:]
array([[-1.107, -1.105, -1.109, -0.652],
       [-1.102, -1.102, -1.103, -0.745],
       [-1.103, -1.1  , -1.103, -0.764],
       [-1.099, -1.099, -1.102, -0.663],
       [-1.103, -1.101, -1.105, -0.564]])

Normalizing Data

Normalizing rescales each sample (row) to unit norm. Note that sklearn's Normalizer works row-wise, unlike the column-wise scalers above; it does not force values into the 0-to-1 range.

>>> from sklearn.preprocessing import Normalizer
>>> scaler=Normalizer().fit(X)
>>> normalizedX=scaler.transform(X)
>>> normalizedX[0:5,:]
array([[1.439e-05, 1.454e-05, 1.433e-05, 1.000e+00],
       [4.104e-05, 4.107e-05, 4.089e-05, 1.000e+00],
       [6.540e-05, 6.643e-05, 6.540e-05, 1.000e+00],
       [1.627e-05, 1.627e-05, 1.612e-05, 1.000e+00],
       [9.142e-06, 9.222e-06, 9.082e-06, 1.000e+00]])
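The key property of Normalizer, checked on a toy matrix: it rescales each row to unit L2 norm, so every row norm comes out as 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rows = np.array([[3.0, 4.0],
                 [1.0, 1.0]])
normed = Normalizer().fit_transform(rows)
print(np.linalg.norm(normed, axis=1))  # [1. 1.]
print(normed[0])                       # [0.6 0.8], i.e. [3, 4] / 5
```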

Binarizing Data

Binarizing [3] is the process of converting an entity's data features into binary vectors, which can make classifier algorithms more efficient. As a simple example, converting an image's grayscale values from the 0-255 spectrum to a 0-1 spectrum is binarization.

>>> from sklearn.preprocessing import Binarizer
>>> binarizer=Binarizer(threshold=0.0).fit(X)
>>> binaryX=binarizer.transform(X)
>>> binaryX[0:5,:]
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
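With share prices all positive, `threshold=0.0` maps every entry to 1, as in the output above; a threshold inside the data range (made-up values below) gives a mixed result:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

vals = np.array([[0.2, 1.5],
                 [3.0, 0.4]])
# Values strictly greater than the threshold become 1, the rest 0.
bin_out = Binarizer(threshold=1.0).fit_transform(vals)
print(bin_out)  # [[0. 1.]
                #  [1. 0.]]
```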

Mean Removal

Mean removal subtracts the mean from each column or feature, making it zero-centered.

>>> from sklearn.preprocessing import scale
>>> data_standardized=scale(dataset)
>>> data_standardized.mean(axis=0)
array([ 0.000e+00,  0.000e+00, -8.823e-17, -1.765e-16, -8.823e-17,
        8.823e-17,  6.617e-17,  1.792e-17, -2.654e-17, -7.065e-18])

>>> data_standardized.std(axis=0)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
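`scale` is the functional counterpart of `StandardScaler`; on a toy matrix the result has (floating-point) zero mean and unit standard deviation per column:

```python
import numpy as np
from sklearn.preprocessing import scale

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
centered = scale(data)
print(centered.mean(axis=0))  # ~[0. 0.]
print(centered.std(axis=0))   # [1. 1.]
```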

One Hot Encoding

One-hot encoding [4] is the process of converting a categorical variable into a form that can be fed to ML algorithms so they can predict better.

>>> from sklearn.preprocessing import OneHotEncoder
>>> encoder=OneHotEncoder()
>>> encoder.fit(X)
OneHotEncoder()
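One-hot encoding is meant for genuinely categorical columns; on a toy column of tickers (illustrative names), each distinct category gets its own 0/1 column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

tickers = np.array([['AAPL'], ['MSFT'], ['AAPL']])
enc = OneHotEncoder()
onehot = enc.fit_transform(tickers).toarray()
print(onehot)  # columns follow sorted category order: AAPL, MSFT
```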

Label Encoding

Label encoding is used on data with categorical variables, converting the categories into numbers.

  • fit
>>> from sklearn.preprocessing import LabelEncoder
>>> label_encoder=LabelEncoder()
>>> input_classes=['Apple','Intel','Microsoft','Google','Tesla']
>>> label_encoder.fit(input_classes)
LabelEncoder()

>>> for i,companies in enumerate(label_encoder.classes_):
...     print(companies,'-->',i)
Apple --> 0
Google --> 1
Intel --> 2
Microsoft --> 3
Tesla --> 4

  • transform
>>> labels=['Apple','Intel','Microsoft']
>>> label_encoder.transform(labels)
array([0, 2, 3], dtype=int64)
>>> label_encoder.inverse_transform(label_encoder.transform(labels))
array(['Apple', 'Intel', 'Microsoft'], dtype='<U9')

DictVectorizor

DictVectorizer is used on data that pairs labels with numbers: it converts lists of feature-name/value dicts into vectors. It can also be used for feature extraction.

>>> from sklearn.feature_extraction import DictVectorizer
>>> companies = [{'Apple':180.25,'Intel':45.30,
                  'Microsoft':30.26,'Google':203.75,
                  'Tesla':302.18}] 
>>> vec = DictVectorizer()
>>> vec.fit_transform(companies).toarray()
array([[180.25, 203.75,  45.3 ,  30.26, 302.18]])

  • Get feature names
>>> vec.get_feature_names_out() #  get_feature_names was removed in newer scikit-learn
array(['Apple', 'Google', 'Intel', 'Microsoft', 'Tesla'], dtype=object)
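When the dicts have different keys, DictVectorizer fills absent entries with 0, so ragged records line up into one matrix (prices below are made up):

```python
from sklearn.feature_extraction import DictVectorizer

records = [{'Apple': 180.0, 'Intel': 45.0},
           {'Intel': 46.0, 'Tesla': 300.0}]
vec2 = DictVectorizer(sparse=False)
mat = vec2.fit_transform(records)
print(mat)  # columns: Apple, Intel, Tesla; missing keys become 0
```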

Polynomial Features

Polynomial features are used to generate polynomial and interaction features. The result is a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(2) #  Quadratic interaction terms 
>>> poly.fit_transform(X)
array([[1.000e+00, 4.490e+00, 4.538e+00, ..., 2.000e+01, 1.395e+06,
        9.734e+10],
       ...,
       [1.000e+00, 7.857e+01, 7.941e+01, ..., 6.089e+03, 1.339e+08,
        2.944e+12]])
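For two features [a, b], a degree-2 expansion produces the columns [1, a, b, a², a·b, b²]; checking on a single toy sample:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

sample = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(2).fit_transform(sample)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```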

  • Interaction terms only
>>> poly = PolynomialFeatures(interaction_only=True) #  keep only interaction terms, no pure powers
>>> poly.fit_transform(X)
array([[1.000e+00, 4.490e+00, 4.538e+00, ..., 2.029e+01, 1.416e+06,
        1.395e+06],
       ...,
       [1.000e+00, 7.857e+01, 7.941e+01, ..., 6.196e+03, 1.363e+08,
        1.339e+08]])

Imputer

Imputation (here, filling missing values with the mean) replaces each missing value with the average of its column or feature.

>>> from sklearn.impute import SimpleImputer
>>> imputer = SimpleImputer()
>>> print(imputer.fit_transform(X, y))
[[4.490e+00 4.538e+00 4.472e+00 3.120e+05]
 ...
 [7.857e+01 7.941e+01 7.803e+01 1.716e+06]]

Copyright notice
This article was written by [Chen Chen 135]; please include the original link when reposting:
https://chowdera.com/2021/12/202112311949066484.html
