# Machine Learning: Five Methods for Customer Value Prediction

2020-12-06 12:00:07

Project goal: predict the value of a customer's transactions.

Data source: https://www.kaggle.com/c/santander-value-prediction-challenge

Data description: the transaction values of 4,459 customers and their attributes are known. The specific meaning of each attribute is not disclosed; they could be gender, age, income, taxes paid, and so on. Every customer has 4,993 attributes.

Steps:

• Data analysis
• Feature selection
• Modeling
• Tuning

### First, data analysis

There are 4,459 rows and 4,993 columns. Of these, 1,845 columns are float type, 3,147 are int type, and 1 is object type, which should be the user ID.

### The number of features is large

#### Preliminary processing: remove constant columns and duplicate columns

The number of columns drops from 4,993 to 4,732.

Because there are still so many features, plotting and analyzing each one is impractical.
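The preliminary processing can be sketched with pandas; `train` here is a toy DataFrame standing in for the competition data, not the post's actual code:

```python
import pandas as pd

def drop_constant_and_duplicate_columns(df):
    # Constant columns carry no information: at most one unique value.
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant_cols)
    # Duplicate columns: keep only the first of each group of identical columns.
    df = df.loc[:, ~df.T.duplicated()]
    return df

# Toy example: "b" is constant, "c" duplicates "a".
train = pd.DataFrame({"a": [1, 2, 3], "b": [5, 5, 5], "c": [1, 2, 3]})
cleaned = drop_constant_and_duplicate_columns(train)
```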

#### Use all features directly

Next, analyze the target values to be predicted. Observing the distribution of the data (figure below, left), most of the values are concentrated on the left; applying a log transform makes the data more Gaussian (figure below, right). In general, predictions on Gaussian-distributed targets are more accurate. The exact reason is not entirely clear to me; my personal understanding is that when a few very large values are present, even a slightly biased prediction on them changes the loss a lot, which is bad for fitting.
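The log transform can be done with NumPy's `log1p`; the post does not show the exact transform it used, but `log1p` is a common choice that also handles zero values:

```python
import numpy as np

# Hypothetical skewed target values (transaction amounts).
target = np.array([10.0, 100.0, 1000.0, 1_000_000.0])

log_target = np.log1p(target)    # log(1 + x): compresses the long right tail
restored = np.expm1(log_target)  # inverse transform, for final predictions
```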

### Method 1

There may be a problem here: with so few samples, the network may overfit. Let's look at the results first.

First, build a 4-layer DNN; see test_dnn.py.
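test_dnn.py is not reproduced in the post. As a stand-in, here is a minimal 4-layer fully connected regressor using scikit-learn's `MLPRegressor`; the layer widths and the synthetic data are assumptions, not taken from the original code:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                   # placeholder for the 4,732 features
y = 2.0 * X[:, 0] + rng.normal(size=200) * 0.1   # synthetic regression target

# Three hidden layers plus the output layer: a "4-layer" network.
dnn = MLPRegressor(hidden_layer_sizes=(128, 64, 32), max_iter=1000, random_state=0)
dnn.fit(X, y)
pred = dnn.predict(X)
```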

Prediction results on the test set.

The metric is root mean square error (RMSE), computed as sqrt(sum((predicted - actual)**2) / n).

RMSE = 1.84
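The RMSE computation can be written as a small helper; this mirrors the formula above, not the post's actual evaluation script:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error: sqrt(sum((pred - true)^2) / n).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # sqrt(4/3)
```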

The following figure shows the distribution of prediction error

Result analysis: the performance is not ideal. There is a large gap between predicted and true values, and some predictions deviate severely.

Cause analysis:

1. The model architecture may not be suitable
2. The hyperparameter settings may be poor
3. Too few samples: roughly 2,000,000 model parameters against only 4,000+ samples, so overfitting is severe; after 20 epochs the model had already overfit

### Method 2

#### Use lightgbm

Using the lightgbm library directly works well out of the box, but parameter tuning still has to be learned.

See test_lightgbm.py

Prediction results on the test set, again measured by RMSE:

RMSE = 1.35

Result analysis: the performance is still not ideal, but it is better than the DNN, and there are no extreme outliers.

Cause analysis:

1. Overfitting is still present
2. The model parameter settings could be improved

### Method 3

#### Use xgboost

The procedure is the same as above.

Prediction results:

RMSE = 1.38

Result analysis: the performance is still not ideal.

Cause analysis:

1. 2,000 iterations are not enough; the model has not yet converged
2. The model parameter settings could be improved

### Method 4

#### Use catboost

The procedure is the same as above.

Prediction results:

RMSE = 1.47

Result analysis: the performance is still not ideal.

### Method 5

Use the idea of ensemble learning and blend the methods above.

Sum the outputs of the 3 learners according to a set of weights to get the final result.
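The weighted blend can be sketched as follows; the weights and base-learner outputs here are placeholders, since the post does not state the actual values:

```python
import numpy as np

def weighted_blend(predictions, weights):
    # Normalize the weights, then take the weighted sum of the predictions.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * np.asarray(p, dtype=float) for wi, p in zip(w, predictions))

p_lgb = np.array([1.0, 2.0, 3.0])   # hypothetical base-learner outputs
p_xgb = np.array([1.2, 1.8, 3.1])
p_cat = np.array([0.9, 2.1, 2.9])
final = weighted_blend([p_lgb, p_xgb, p_cat], [0.4, 0.35, 0.25])
```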

RMSE = 1.36

Result analysis:

Four methods were used to model the prediction target. The DNN overfit very early because there is so little data.

XGBoost, LightGBM, and CatBoost performed much better than the DNN, but the value predictions are still biased. According to posts on the Kaggle forum, given that this dataset contains a leak, these are reasonably good predictions. Since tuning and refining the parameters would take a lot of time, it was not pursued further; this is only a verification. The conclusion is that XGBoost, LightGBM, and CatBoost perform very well in scenarios with little data.

https://chowdera.com/2020/12/20201206115530543m.html