
Machine Learning | Five Methods of Customer Value Prediction Based on Machine Learning

2020-12-06 12:00:07 osc_ igulbmxy

Project goal: predict the value of a customer's transactions.

Data source: https://www.kaggle.com/c/santander-value-prediction-challenge

Data content: transaction values and attributes for 4,459 customers are given. The attributes are anonymized, so their specific meaning is unknown; they could be gender, age, income, tax payments, and so on. Each user has 4,993 attributes.

Steps:

  • Data analysis
  • Feature selection
  • Modeling
  • Debugging

 

Step 1: data analysis

There are 4,459 rows and 4,993 columns. Of these, 1,845 columns are float type and 3,147 are int type; 1 column is object type, which should be the user id.

 

The number of features is large.

Preliminary processing: remove constant columns and remove duplicate columns.

The column count drops from 4,993 to 4,732.

Because there are still so many features, plotting and analyzing them one by one is impractical, so all features are used directly.

Next, analyze the prediction target. Observing its distribution (left plot below), most of the data is concentrated on the left; applying a log transform makes the data more Gaussian (right plot below). In general, predictions on Gaussian-distributed targets are more accurate. The reason is not entirely clear; my personal understanding is that when there are some very large values, even a slightly biased prediction changes the loss a lot, which is bad for fitting.
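This transform is a one-liner with NumPy: `log1p` compresses the long right tail, and `expm1` maps predictions back to the original scale. The example values below are illustrative, not from the competition data.

```python
import numpy as np

# Raw targets with a long right tail, e.g. transaction values.
y = np.array([120.0, 900.0, 35_000.0, 4_000_000.0])

# log1p = log(1 + x): compresses large values, keeps 0 mapped to 0.
y_log = np.log1p(y)

# Models are trained on y_log; expm1 inverts the transform exactly.
y_back = np.expm1(y_log)
```

Training on `y_log` and applying `expm1` to predictions is the standard way to use this trick for regression.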

 

Method 1

This approach may have problems: with so few samples, overfitting is likely. Let's look at the result first.

First, build a 4-layer DNN; see test_dnn.py.
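test_dnn.py itself is not shown in the post; the following is a minimal sketch of what a 4-layer regression DNN could look like in Keras. The layer sizes, optimizer, and training settings are assumptions, not the post's actual configuration.

```python
import numpy as np
from tensorflow import keras

n_features = 4732  # feature count after dropping constant/duplicate columns

# A 4-layer fully connected regressor; layer sizes are illustrative.
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),  # regression output: the log-transformed target
])
model.compile(optimizer="adam", loss="mse")

# Tiny random data just to demonstrate the training and prediction calls.
X = np.random.rand(32, n_features).astype("float32")
y = np.random.rand(32).astype("float32")
model.fit(X, y, epochs=1, batch_size=16, verbose=0)
pred = model.predict(X, verbose=0)
```

Note that with 4,732 input features the first dense layer alone has over two million weights, which illustrates why a 4,000-row dataset overfits so quickly.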

Analysis of prediction results

Evaluate on the test set.

The metric is root mean square error (RMSE).

Computing method: sqrt(sum((predicted - actual)**2) / n_samples)

RMSE = 1.84
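This metric can be computed directly with NumPy; the arrays below are illustrative.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error: sqrt of the mean squared residual."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Residuals are (2, 0), so RMSE = sqrt((4 + 0) / 2) = sqrt(2).
score = rmse([3.0, 5.0], [1.0, 5.0])
```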

The following figure shows the distribution of prediction error

Result analysis: the effect is not ideal. There is a large gap between predicted and true values, and one prediction deviates severely.

Cause analysis:

  1. The model structure is not ideal
  2. The hyperparameter settings
  3. Too few samples: the network has about 2,000,000 parameters but only 4,000+ samples, so overfitting is severe; after 20 iterations, overfitting occurred


 

Method 2

Use lightgbm.

Using the lightgbm library directly works well, but parameter tuning still needs to be learned.

See test_lightgbm.py.

Analysis of prediction results

Evaluate on the test set.

The metric is RMSE.

RMSE = 1.35


Result analysis: the effect is still not ideal, but it is better than the DNN, and there are no extreme outlier predictions.

Cause analysis:

  1. There is still overfitting
  2. The model parameter settings


 

Method 3

Use xgboost.

The method is the same as above.

Prediction result:

RMSE = 1.38


Result analysis: the effect is still not ideal.

Cause analysis:

  1. 2,000 iterations are not enough; the model has not yet converged
  2. The model parameter settings

 

Method 4

Use catboost.

The method is the same as above.

Prediction result:

RMSE = 1.47

Result analysis: the effect is still not ideal.

 

Method 5

Using the idea of ensemble learning, blend the methods above.

Sum the results of the 3 tree-based learners according to their weights to get the final result.
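The weighted sum described above is one line of NumPy. The prediction arrays and weights below are placeholders, not the post's actual values.

```python
import numpy as np

# Predictions from the three tree-based learners (illustrative values).
pred_lgb = np.array([1.2, 3.4, 2.2])
pred_xgb = np.array([1.0, 3.6, 2.0])
pred_cat = np.array([1.4, 3.2, 2.4])

# Weights should sum to 1; these are placeholders, not tuned values.
weights = np.array([0.4, 0.35, 0.25])

blend = weights[0] * pred_lgb + weights[1] * pred_xgb + weights[2] * pred_cat
```

Weights are typically chosen by the learners' validation scores, with stronger models weighted more heavily.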

RMSE = 1.36


Result analysis:

Four methods were used to model the prediction target. Among them, the DNN overfit very early because there is so little data.

Xgboost, lightgbm, and catboost performed much better than the DNN, but their value predictions still show bias. However, according to posts on the kaggle forum, this dataset has a leak in its features, and given that, these are fairly good predictions. Because parameter tuning and model modification would take a lot of time, they were not carried out; this is only a verification. The conclusion is that xgboost, lightgbm, and catboost work very well in scenarios with little data.


Copyright notice
This article was written by [osc_ igulbmxy]; please include a link to the original when reposting. Thanks.
https://chowdera.com/2020/12/20201206115530543m.html