
Twelve cross validation techniques in graphical machine learning

2021-09-15 07:22:55 Data Studio

Hello everyone, I'm Cloud King!

Today I'll survey the cross validators used in machine learning, using intuitive diagrams to help you understand how each one works.


Data set description

The data set comes from the Kaggle competition M5 Forecasting - Accuracy[1].

The task is to predict as accurately as possible the unit sales (demand) of various products sold by Walmart in the United States. This article uses a subset of the data.

An example of the data is shown below.

Splitting the data must follow the basic principle of cross validation: first divide the whole dataset into a training set and a test set, then apply cross validation within the training set to produce training and validation folds, as shown in the figure below.

The test set and training set are first split by the date column, as shown in the figure below.

For this demonstration, some new features were created, and the following variables were finally selected for use.

Training columns: 
['sell_price', 'year', 'month', 'dayofweek', 'lag_7', 
'rmean_7_7', 'demand_month_mean', 'demand_month_max',
'demandmonth_max_to_min_diff', 'demand_dayofweek_mean', 
'demand_dayofweek_median', 'demand_dayofweek_max']

Set the following two global variables, along with a DataFrame for storing the score of each cross validation run:

SEED = 888   # for reproducibility
NFOLDS = 5   # number of folds for K-fold cross validation
stats = pd.DataFrame(columns=['K-Fold Variation','CV-RMSE','TEST-RMSE'])

Cross validation

Cross validation (Cross Validation) is a common method in machine learning for building models and validating model parameters. As the name suggests, it reuses the data: the samples are split and recombined into different training and test sets; the training set is used to train the model, and the test set to evaluate it.

Purpose of cross validation

  1. Extract as much effective information as possible from limited data.
  2. Cross validation learns from the samples from multiple directions, which helps avoid falling into local minima.
  3. It avoids overfitting to some extent.

Types of cross validation

Depending on how the data is split, there are three types of cross validation:

The first is simple cross validation.

First, randomly divide the sample data into two parts (for example, a 70% training set and a 30% test set), then train the model on the training set and validate the model and parameters on the test set. Next, shuffle the samples, reselect the training and test sets, and continue training and testing the model. Finally, use a loss function to select the optimal model and parameters.
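As a minimal runnable sketch of this idea (on toy data, not the article's Walmart dataset), a single random train/test split can be made with scikit-learn's train_test_split:

```python
# Simple cross validation: one random 70%/30% train/test split.
# Toy data only -- illustrative, not the article's dataset.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# shuffle=True (the default) shuffles the samples before splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=888)

print(len(X_train), len(X_test))  # 7 3
```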

The second is K-fold cross validation (K-Fold Cross Validation).

Unlike the first method, K-fold cross validation randomly divides the sample data into K parts. In each round, K-1 parts are chosen as the training set and the remaining part serves as the validation set. When a round finishes, another K-1 parts are randomly chosen as training data. After several rounds (no more than K), a loss function is used to select the optimal model and parameters.

The third is leave-one-out cross validation (Leave-one-out Cross Validation).

This is a special case of the second type in which K equals the number of samples N. For N samples, each round trains on N-1 samples and leaves one sample out to validate the model's prediction. The method is mainly used when the sample size is very small; for ordinary modeling problems, leave-one-out cross validation is typically applied when N is less than 50.
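A minimal sketch on toy data: with N samples, LeaveOneOut produces exactly N splits, each holding out a single sample for validation.

```python
# Leave-one-out cross validation on toy data (N = 5 samples).
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(5, 2)  # N = 5 samples
loo = LeaveOneOut()

n_iterations = loo.get_n_splits(X)
print(n_iterations)  # 5
for train_idx, val_idx in loo.split(X):
    assert len(val_idx) == 1             # one held-out sample
    assert len(train_idx) == len(X) - 1  # train on the other N-1
```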

Next, the 12 cross validation methods are introduced in detail with diagrams, mainly following the scikit-learn documentation[2].

Cross validators

01 K-fold cross validation -- without shuffling

The K-fold cross validator KFold provides train/validation indices to split the data into training and validation sets. It splits the dataset into K consecutive folds (without shuffling by default). Each fold is then used once for validation, while the remaining K-1 folds form the training set.

from sklearn.model_selection import KFold
KFold(n_splits=NFOLDS, shuffle=False, random_state=None)
CV mean score:  23.64240, std: 1.8744.
Out of sample (test) score: 20.455980

This type of cross validation is not recommended for time series data, because it ignores the temporal ordering of the data: the actual test data lies in a future period.

As shown in the figure below, the black part is the fold used for validation, and the yellow part is the K-1 folds used for training.

Also shown are the data distributions of each of the 5 validation folds (the black parts), and the combined distribution of the data actually used to validate the model.
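The split pattern described above can be reproduced on toy data (a sketch, not the article's dataset): with shuffle=False the validation folds are contiguous blocks, and every sample is validated exactly once.

```python
# Unshuffled KFold: folds are consecutive blocks of the data.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # 10 toy samples
kf = KFold(n_splits=5, shuffle=False)

val_blocks = [val.tolist() for _, val in kf.split(X)]
print(val_blocks)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```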

02 K-fold cross validation -- shuffled

The K-fold cross validator KFold with the parameter shuffle=True:

from sklearn.model_selection import KFold
KFold(n_splits=NFOLDS, random_state=SEED, shuffle=True)
CV mean score:  22.65849, std: 1.4224.
Out of sample (test) score: 20.508801

In each iteration, one fifth of the data is still the validation set, but this time it is randomly distributed across the whole dataset. As before, a sample used for validation in one iteration is never used for validation in another iteration.

As shown in the figure below, the black part is the validation data; clearly, the validation set is shuffled.

03 Random permutation cross validation

The random permutation cross validator ShuffleSplit generates indices to split the data into training and validation sets.

Note: contrary to other cross validation strategies, random splitting does not guarantee that all folds will be different, although this is still very likely for large datasets.

from sklearn.model_selection import ShuffleSplit
ShuffleSplit(n_splits=NFOLDS,
             random_state=SEED,
             train_size=0.7,
             test_size=0.2)
# the remaining 10% of the data is not used
CV mean score:  22.93248, std: 1.0090.
Out of sample (test) score: 20.539504

ShuffleSplit randomly samples the entire dataset during each iteration to generate one training set and one validation set. The test_size and train_size parameters control their sizes for each iteration. Because sampling is from the whole dataset in each iteration, a value chosen in one iteration can be chosen again in another.

Because some data is excluded from training, this method is faster than ordinary K-fold cross validation.

As shown in the figure below, the black part is the validation data, orange is the training data, and the white part is the data included in neither the training nor the validation set.
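A toy-data sketch of this behavior: with train_size=0.7 and test_size=0.2, 10% of the samples are left out of every split.

```python
# ShuffleSplit resamples the whole dataset each iteration.
# Toy data only -- illustrative, not the article's dataset.
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)
ss = ShuffleSplit(n_splits=5, train_size=0.7, test_size=0.2,
                  random_state=888)

sizes = [(len(tr), len(va)) for tr, va in ss.split(X)]
print(sizes)  # [(7, 2), (7, 2), (7, 2), (7, 2), (7, 2)]
```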

04 Stratified K-fold cross validation -- without shuffling

The stratified K-fold cross validator StratifiedKFold provides train/validation indices to split the data into training and validation sets. This cross validation object is a variant of KFold that returns stratified folds: the folds are made by preserving the percentage of samples of each class.

from sklearn.model_selection import StratifiedKFold
StratifiedKFold(n_splits=NFOLDS, shuffle=False)
CV mean score: 22.73248, std: 0.4955.
Out of sample (test) score: 20.599119

This is similar to ordinary K-fold cross validation, but each fold contains approximately the same percentage of samples of each target class. It is better suited to classification than to regression.

There are a few points to note:

  • The validation sets are generated so that the class distribution in each train/validation split is the same as, or as close as possible to, that of the whole dataset.
  • When shuffle=False, order dependencies in the dataset ordering are preserved; that is, in some validation sets, all samples from class k are contiguous in y.
  • The generated validation sets have consistent sizes: the smallest and largest validation sets differ by at most one sample.

As shown in the figure below, without shuffling, the validation sets (the black parts in the figure) are distributed regularly.

As can be seen from the data distribution diagram below, the density curves of the 5 validation folds essentially coincide: although each fold contains different samples, their distributions are basically the same.
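The stratification property can be checked on toy data (a sketch, not the article's dataset): with 8 samples of class 0 and 4 of class 1, every validation fold keeps the same 2:1 class ratio.

```python
# StratifiedKFold preserves class proportions in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros(12)                   # features are irrelevant here
y = np.array([0] * 8 + [1] * 4)    # 8 of class 0, 4 of class 1
skf = StratifiedKFold(n_splits=4, shuffle=False)

# count class occurrences in each validation fold
fold_counts = [np.bincount(y[val]).tolist() for _, val in skf.split(X, y)]
print(fold_counts)  # [[2, 1], [2, 1], [2, 1], [2, 1]]
```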

05 Stratified K-fold cross validation -- shuffled

For each target class, the folds contain approximately the same percentage of samples, but the data is shuffled first. Note that the splitting procedure is the same as before; the only difference is that the data is shuffled before the stratified K-fold split.

from sklearn.model_selection import StratifiedKFold
StratifiedKFold(n_splits=NFOLDS, random_state=SEED,
                shuffle=True)
CV mean score: 22.47692, std: 0.9594.
Out of sample (test) score: 20.618389

As shown in the figure below, the validation sets of shuffled stratified K-fold cross validation are irregularly and randomly distributed.

The data distribution of this cross validation is basically the same as that of the unshuffled stratified K-fold.

06 Group K-fold cross validation

GroupKFold is a K-fold iterator variant with non-overlapping groups. The same group does not appear in two different folds (the number of distinct groups must be at least equal to the number of folds). The folds are approximately balanced in the sense that the number of distinct groups is approximately the same in each fold.

Here the groups are defined by another column in the dataset (year), ensuring that the same group never appears in the training and validation sets at the same time.

The grouping is passed to the cross validator via the groups parameter of the split method.

from sklearn.model_selection import GroupKFold
groups = train['year'].tolist()
groupfolds = GroupKFold(n_splits=NFOLDS)
groupfolds.split(X_train, Y_train, groups=groups)
CV mean score: 23.21066, std: 2.7148.
Out of sample (test) score: 20.550477

As shown in the figure below, because the dataset does not cover exactly 5 whole years (groups), the 5 splits cannot all contain the same amount of validation data.

In the previous example, years were used as groups; in the next example, months are used. The difference can be seen clearly in the figures below.

from sklearn.model_selection import GroupKFold
groups = train['month'].tolist()
groupfolds = GroupKFold(n_splits=NFOLDS)
groupfolds.split(X_train, Y_train, groups=groups)
CV mean score: 22.32342, std: 3.9974.
Out of sample (test) score: 20.481986

As shown in the figure below, each iteration forms its validation set from whole month groups.
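The GroupKFold guarantee can be verified on toy data (a sketch with 4 unevenly sized, hypothetical groups): a group's samples never straddle the train/validation boundary.

```python
# GroupKFold: no group appears in both train and validation sets.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12)
groups = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4])
gkf = GroupKFold(n_splits=4)

# the intersection of train groups and validation groups is always empty
overlaps = [set(groups[tr]) & set(groups[va])
            for tr, va in gkf.split(X, groups=groups)]
print(overlaps)  # [set(), set(), set(), set()]
```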

07 Group K-fold cross validation -- leave one group out

The leave-one-group-out cross validator LeaveOneGroupOut holds out samples according to a third-party-provided array of integer groups. This group information can be used to encode arbitrary predefined domain-specific cross validation splits.

Each training set thus consists of all samples except those belonging to one specific group.

For example, the groups could be the year or month of sample collection, allowing cross validation against time-based splits.

from sklearn.model_selection import LeaveOneGroupOut
groups = train['month'].tolist()
n_folds = train['month'].nunique()
logroupfolds = LeaveOneGroupOut()
logroupfolds.split(X_train, Y_train, groups=groups)
CV mean score:  22.48503, std: 5.6201.
Out of sample (test) score: 20.468222

In each iteration, the model is trained on samples from all groups except one. If grouping by month, 12 iterations are executed.

The split pattern of this grouped K-fold cross validation can be seen in the figure below.
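A toy-data sketch of LeaveOneGroupOut: the number of iterations equals the number of distinct groups, and each validation set is exactly one whole group.

```python
# LeaveOneGroupOut: one split per distinct group.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(9)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])  # 3 distinct groups
logo = LeaveOneGroupOut()

n_iterations = logo.get_n_splits(X, groups=groups)
print(n_iterations)  # 3
for _, val_idx in logo.split(X, groups=groups):
    assert len(set(groups[val_idx])) == 1  # one whole group held out
```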

08 Group K-fold cross validation -- leave P groups out

LeavePGroupsOut leaves P groups out of the training set. For example, the groups could be the year of sample collection, allowing cross validation against time-based splits.

LeavePGroupsOut differs from LeaveOneGroupOut in that the former builds each validation set from all samples assigned to P distinct group values, while the latter uses all samples assigned to a single group.

The number of groups to exclude from each validation split is set via the n_groups parameter.

from sklearn.model_selection import LeavePGroupsOut
groups = train['year'].tolist()
lpgroupfolds = LeavePGroupsOut(n_groups=2)
lpgroupfolds.split(X_train, Y_train, groups=groups)
CV mean score: 23.92578, std: 1.2573.
Out of sample (test) score: 90.222850

As can be seen from the figure below, because there are 5 distinct groups and n_groups=2, the data is divided into 10 splits, each with a different validation set.
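The split count follows from combinatorics, as a toy-data sketch shows: with 5 distinct groups and n_groups=2, every pair of held-out groups forms one validation set, giving C(5, 2) = 10 splits.

```python
# LeavePGroupsOut: one split per combination of P held-out groups.
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

X = np.arange(10)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])  # 5 distinct groups
lpgo = LeavePGroupsOut(n_groups=2)

n_splits_total = lpgo.get_n_splits(X, groups=groups)
print(n_splits_total)  # 10, i.e. C(5, 2)
```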

09 Randomly permuted group K-fold cross validation

The Shuffle-Group(s)-Out cross validation iterator GroupShuffleSplit combines ShuffleSplit and LeavePGroupsOut: it generates a sequence of random partitions in which a subset of the groups is held out for each split.

For example, the groups could be the year of sample collection, allowing cross validation against time-based splits.

LeavePGroupsOut differs from GroupShuffleSplit in that the former generates splits using all subsets of P unique groups, whereas GroupShuffleSplit generates a user-determined number of random validation splits, each holding out a user-determined proportion of the unique groups.

For example, compared with LeavePGroupsOut(p=10), a less computationally expensive alternative is GroupShuffleSplit(test_size=10, n_splits=100).

Note: the test_size and train_size parameters refer to groups, not samples, unlike in ShuffleSplit.

After the groups are defined, the whole dataset is randomly sampled in each iteration to generate a training set and a validation set.

from sklearn.model_selection import GroupShuffleSplit
groups = train['month'].tolist()
rpgroupfolds = GroupShuffleSplit(n_splits=NFOLDS, train_size=0.7,
                                 test_size=0.2, random_state=SEED)
rpgroupfolds.split(X_train, Y_train, groups=groups)
CV mean score:  21.62334, std: 2.5657.
Out of sample (test) score: 20.354134

As can be seen from the figure, the white parts are data not used in that split. Within each row, every segment (bounded by white space) contains the validation set (black) in a consistent proportion and position, while the position of the validation set differs between rows.
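A toy-data sketch of the group semantics: train_size and test_size are fractions of groups, and whole groups land on one side of each split.

```python
# GroupShuffleSplit: whole groups are randomly assigned to train or
# validation; with 10 groups, 70%/20% leaves 1 group unused per split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20)
groups = np.repeat(np.arange(10), 2)  # 10 groups of 2 samples each
gss = GroupShuffleSplit(n_splits=4, train_size=0.7, test_size=0.2,
                        random_state=888)

# train groups and validation groups never overlap
disjoint = all(set(groups[tr]).isdisjoint(groups[va])
               for tr, va in gss.split(X, groups=groups))
print(disjoint)  # True
```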

10 Time series cross validation

Time series data is characterized by correlation between observations that are close in time (autocorrelation). However, classic cross validation techniques such as KFold and ShuffleSplit assume the samples are independent and identically distributed, and would create unreasonable correlations between the training and test instances of time series data (producing poor estimates of the generalization error).

It is therefore important to evaluate the model on "future" observations, those least similar to the observations used to train it. TimeSeriesSplit provides a solution for this.

TimeSeriesSplit is a variation of KFold: in the k-th split, it returns the first k folds as the training set and the (k+1)-th fold as the validation set. Note that, unlike standard cross validation methods, successive training sets are supersets of those that came before them. It also adds all surplus data to the first training partition, which is always used to train the model.

from sklearn.model_selection import TimeSeriesSplit
timeSeriesSplit = TimeSeriesSplit(n_splits=NFOLDS)
CV mean score: 24.32591, std: 2.0312.
Out of sample (test) score: 20.999613

This method is recommended for time series data. In each time series split, the data is divided into two parts: the earlier part is always the training set, and the later part is the validation set.

As can be seen from the figure below, the length of the validation set remains unchanged while the training set grows with each iteration.
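The growing-window behavior can be checked on toy data (a sketch, not the article's dataset): the validation window stays the same size and always lies strictly in the future of the training data.

```python
# TimeSeriesSplit: expanding training window, fixed validation window.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12)  # 12 time-ordered samples
tscv = TimeSeriesSplit(n_splits=5)

splits = [(tr.tolist(), va.tolist()) for tr, va in tscv.split(X)]
for train_idx, val_idx in splits:
    assert max(train_idx) < min(val_idx)  # validation is always later

train_sizes = [len(tr) for tr, _ in splits]
print(train_sizes)  # [2, 4, 6, 8, 10]
```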

11 Blocking time series cross validation

This is a custom cross validation method; its implementation is given in the appendix at the end of the article.

btscv = BlockingTimeSeriesSplit(n_splits=NFOLDS)
CV mean score: 22.57081, std: 6.0085.
Out of sample (test) score: 19.896889

As shown in the figure below, the training and validation sets are unique in each iteration; no value is used twice, and the training set always precedes the validation set. Because it trains on fewer samples, it is also faster than the other cross validation methods.

12 Purged K-fold cross validation

This is a cross validation method based on _BaseKFold. In each iteration, some samples are purged from before and after the training set.

cont = pd.Series(train.index)
purgedfolds = PurgedKFold(n_splits=NFOLDS,
                          t1=cont, pctEmbargo=0.0)
CV mean score: 23.64854, std: 1.9370.
Out of sample (test) score: 20.589597

As can be seen from the figure below, some samples are purged before and after the training set, while the way the training and validation sets are divided otherwise matches unshuffled KFold.

If embargo is set to a value greater than 0, additional samples are removed after the validation set.

cont = pd.Series(train.index)
purgedfolds = PurgedKFold(n_splits=NFOLDS, t1=cont, pctEmbargo=0.1)
CV mean score: 23.87267, std: 1.7693.
Out of sample (test) score: 20.414387

As can be seen from the figure below, besides the samples purged before and after the training set, some samples are also removed after the validation set; how many depends on the embargo parameter.

Comparison of cross validation results

cm = sns.light_palette("green", as_cmap=True, reverse=True)
stats.style.background_gradient(cmap=cm)

Appendix

Blocking time series cross validation class

import numpy as np

class BlockingTimeSeriesSplit():
    def __init__(self, n_splits):
        self.n_splits = n_splits

    def get_n_splits(self, X, y, groups):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        k_fold_size = n_samples // self.n_splits
        indices = np.arange(n_samples)

        margin = 0  # optional gap between the training and validation parts
        for i in range(self.n_splits):
            start = i * k_fold_size
            stop = start + k_fold_size
            # the first 90% of each block is for training, the rest for validation
            mid = int(0.9 * (stop - start)) + start
            yield indices[start: mid], indices[mid + margin: stop]

Purged K-fold cross validation class

import numpy as np
import pandas as pd
from sklearn.model_selection._split import _BaseKFold

class PurgedKFold(_BaseKFold):
    '''
    Extends the KFold class to handle labels that span intervals.
    Training samples whose label intervals overlap the test interval
    are purged from the training set.
    Assumes the test set is contiguous (shuffle=False), with no training
    samples in between.
    '''
    def __init__(self, n_splits=3, t1=None, pctEmbargo=0.1):
        if not isinstance(t1, pd.Series):
            raise ValueError('Label Through Dates must be a pd.Series')
        super(PurgedKFold, self).__init__(n_splits, shuffle=False, random_state=None)
        self.t1 = t1
        self.pctEmbargo = pctEmbargo

    def split(self, X, y=None, groups=None):
        X = pd.DataFrame(X)
        if (X.index == self.t1.index).sum() != len(self.t1):
            raise ValueError('X and ThruDateValues must have the same index')
        indices = np.arange(X.shape[0])
        mbrg = int(X.shape[0] * self.pctEmbargo)
        test_starts = [(i[0], i[-1] + 1) for i in
                       np.array_split(np.arange(X.shape[0]), self.n_splits)]
        for i, j in test_starts:
            t0 = self.t1.index[i]  # start of the test set
            test_indices = indices[i:j]
            maxT1Idx = self.t1.index.searchsorted(self.t1[test_indices].max())
            train_indices = self.t1.index.searchsorted(self.t1[self.t1 <= t0].index)
            if maxT1Idx < X.shape[0]:  # apply the embargo to the right-hand training set
                train_indices = np.concatenate((train_indices, indices[maxT1Idx + mbrg:]))
            yield train_indices, test_indices

References

[1] Dataset: https://www.kaggle.com/c/m5-forecasting-accuracy

[2] Cross validation: https://scikit-learn.org/stable/modules/classes.html

OK, that's all for today's sharing!

This article is from the WeChat official account Data STUDIO (jim_learning), author: Cloud King.

Originally published: 2021-09-07

Copyright notice: this article was created by Data Studio; please include a link to the original when reprinting. https://chowdera.com/2021/09/20210909125824227l.html