# exercises for homework

2022-11-24 21:33:51

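1. Which of the following statements about the K-means clustering algorithm is wrong? (D)

A. It is efficient and scalable on large data sets

B. It is an unsupervised learning method

C. The value of K cannot be obtained automatically, and the initial cluster centers are chosen at random

D. The choice of initial cluster centers has little effect on the clustering result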

2. Research questions for clustering algorithms include (multiple choice): (ABC)

A. Whether the final cluster distribution is reasonable

B. Fast clustering

C. High accuracy

D. Automatically identifying the number of cluster centers

#### 3. Briefly describe the implementation process of the K-means clustering algorithm.

① Randomly initialize K center points;

② Compute the distance D from each unlabeled sample point to each of the K center points;

③ Assign each sample point to the category of the center with the smallest D;

④ Compute the mean of each of the K clusters and use it as the new center of that cluster;

⑤ Repeat steps ②-④ until the new centers coincide with the old ones, then stop iterating and take the last clustering as the final clustering result.
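
The steps above can be sketched in plain Python. This is only a minimal illustration, not a production implementation: the function name `kmeans`, the `max_iter` and `seed` parameters and the toy points are assumptions made for the example, and an empty cluster simply keeps its previous center.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: pick K distinct samples as the initial centers.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Steps 2-3: assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Step 4: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its old center
        # Step 5: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

On this toy data the two tight groups end up in different clusters; on harder data the result depends on the random initial centers, which is why the choice of initial centers matters.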

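#### 4. Briefly describe the advantages and disadvantages of the K-means algorithm.

Advantages:

① The principle is simple, it is easy to implement, and it converges quickly;

② The clustering quality is good;

③ The algorithm is highly interpretable;

④ The main parameter that needs tuning is just the number of cluster centers k.

Disadvantages:

① A good value of K is hard to choose;

② It has difficulty converging on non-convex data sets;

③ If the hidden classes are unbalanced, for example their sample counts are severely unbalanced or their variances differ, the clustering quality is poor;

④ Because it is iterative, the result may be only a local optimum;

⑤ It is sensitive to noise and outliers.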

#### 5. Briefly describe which evaluation metrics and methods can be used to evaluate a clustering algorithm.

① SSE: the sum of squared errors. After each clustering run, compute the sum of squared distances between every sample point and the center of its cluster; the smaller the value, the better the clustering. The method is simple and crude, and depending on the initial choice of centers it can get stuck in a local optimum.

② Elbow method: plot a line chart of the within-cluster sum of squared distances against K and use it to pick the best K, that is, the number of cluster centers into which the current sample set is best divided.
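
To make SSE and the elbow idea concrete, here is a small sketch. The partitions for K = 1, 2, 3 are written by hand purely for illustration (in practice they would come from running K-means for each K), and the helper name `sse` is an assumption of this example.

```python
def sse(clusters):
    """Sum of squared distances from each 1-D point to its cluster mean."""
    total = 0.0
    for members in clusters:
        center = sum(members) / len(members)
        total += sum((x - center) ** 2 for x in members)
    return total


# Hand-made partitions of the same six points for K = 1, 2, 3 (illustrative only).
partitions = {
    1: [[1.0, 2.0, 3.0, 10.0, 11.0, 12.0]],
    2: [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]],
    3: [[1.0, 2.0], [3.0], [10.0, 11.0, 12.0]],
}
sse_by_k = {k: sse(clusters) for k, clusters in partitions.items()}
print(sse_by_k)
```

The SSE drops sharply from K = 1 to K = 2 and only slightly from K = 2 to K = 3, so the "elbow" of the curve sits at K = 2.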

6. In the K-means algorithm, which of the following can be used to obtain the global optimum? (D)

① Run the algorithm with different centroid initializations

② Adjust the number of iterations

③ Find the optimal number of clusters

A. ② and ③

B. ① and ③

C. ① and ②

D. All of the above

Improving how the initial cluster centers are chosen improves the clustering quality of K-means and helps it reach the global optimum.
Trying different centroid initializations is in effect a search for the best initial cluster centers, with the aim of reaching the global optimum;
and with too few iterations the algorithm may not reach the global optimum, so the number of iterations also has to be adjusted.

7. Which of the following statements about the application scenarios of SVM are true (multiple choice)? (ABC)

A. SVM performs very well on binary classification problems

B. SVM can solve multi-class classification problems

C. SVM can solve regression problems

8. Which of the following statements about hard and soft margins in SVM is wrong? (B)

A. A hard margin performs well on linearly separable samples

B. A hard margin is not sensitive to outliers

C. A soft margin does better on linearly inseparable samples

D. A soft margin has to strike a balance between margin violations and model complexity

#### 9. Briefly describe why SVM introduces kernel functions for linearly inseparable samples, and the common kernel functions with their usage scenarios and effects.

① When the samples are not linearly separable, the usual approach is to map the sample features into a higher-dimensional space.
But if every linearly inseparable sample set is mapped explicitly into a high-dimensional space, the dimension can become prohibitively large.

A kernel function instead does the computation in the original low-dimensional space (using inner products), while the classification effectively takes place in the high-dimensional space.
This avoids the expensive computation directly in the high-dimensional space and is what really solves the linearly inseparable case for SVM.
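
A small numeric check may help here. The sketch below assumes a degree-2 polynomial kernel on 2-D inputs, chosen only because its explicit feature map is short enough to write down; the names `phi` and `poly_kernel` are made up for the example.

```python
import math


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


x, z = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # inner product in 3-D
implicit = poly_kernel(x, z)                           # same value from 2-D
print(explicit, implicit)
```

Both numbers agree up to floating-point rounding, which is the point of the kernel trick: the inner product in the mapped space is obtained without ever building the mapped vectors.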

1. Which of the following statements about the KNN algorithm is not correct? (D)

A. It finds the K samples closest to the sample under test

B. KNeighborsClassifier in sklearn uses the Euclidean distance metric by default

C. It is relatively simple to implement, but not very interpretable

D. It is very efficient

2. Which of the following descriptions is not correct? (B)

A. The built-in sklearn data sets are generally returned in a dictionary-like format

B. The small data sets can be obtained with functions of the form sklearn.datasets.fetch_*

C. The feature values and target values can be obtained through the data and target attributes of the data set

D. The feature names and target names can be obtained through the feature_names and target_names attributes of the data set

3. Which is the correct way to receive the data returned by train_test_split(data)? (B)

A. x_train, y_train, x_test, y_test

B. x_train, x_test, y_train, y_test

C. x_test, y_test, x_train, y_train

D. It does not matter; any order is fine

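#### 4. Briefly describe the advantages and disadvantages of the K-nearest neighbors algorithm.

Advantages:

① The theory is simple to understand and easy to implement;

② It supports multi-class classification and is fairly accurate;

③ It has high precision and is not affected by outliers;

Disadvantages:

① High computational complexity and a large amount of computation;

② It is sensitive to the value of K;

③ The predictions are not very interpretable.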
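
For reference, the core of KNN fits in a few lines, which also shows where the cost comes from: every prediction scans the whole training set. This is a minimal sketch with made-up data; `knn_predict` is a name chosen for this example and not an sklearn function.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training samples."""
    # Squared Euclidean distance gives the same neighbor ordering as the distance itself.
    dists = [
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    ]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
train_y = ["a", "a", "a", "b", "b", "b"]
pred_near_a = knn_predict(train_x, train_y, (1.2, 1.4))
pred_near_b = knn_predict(train_x, train_y, (6.4, 6.8))
print(pred_near_a, pred_near_b)
```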

#### 5. Briefly describe the commonly used preprocessing methods in feature engineering and the differences between them.

Commonly used feature preprocessing methods include normalization and standardization.
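
The difference between the two is easiest to see on a feature that contains an outlier. The sketch below uses plain Python and assumes min-max scaling to [0, 1] for normalization and zero-mean, unit-variance scaling (population standard deviation) for standardization; the function names are illustrative.

```python
def min_max_normalize(values, low=0.0, high=1.0):
    """Rescale values linearly so that min -> low and max -> high."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) / (hi - lo) * (high - low) for v in values]


def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


feature = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 plays the role of an outlier
normalized = min_max_normalize(feature)
standardized = standardize(feature)
print(normalized)
print(standardized)
```

With min-max normalization the single outlier fixes the upper bound and squeezes the other four values close to 0, whereas standardization relies on the mean and standard deviation and does not force the values into a fixed interval.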

#### 6. Briefly describe the application scenarios of the K-nearest neighbors algorithm.

② KNN is suitable for small and medium-sized data sets with numerical features;

③ KNN can be used for binary classification and also for multi-class classification.

## 8. Train a KNN classification model on the iris data set.

1. Use the iris data set built into sklearn;
2. Split the data set; the validation set proportion can be customized, and make sure every run of the program produces the same split;
3. Preprocess the raw data with an appropriate feature preprocessing method;
4. Tune the hyperparameters (including but not limited to K) with cross-validation and grid search;
5. Evaluate the trained model;
6. Report the accuracy of the best model on the test set;
7. Report the model that performed best in cross-validation and its parameters;
8. If possible, compare your trained model with those of your classmates, see whose model is the most accurate, and think about and discuss why.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the iris data
iris = load_iris()

# 2. Split the data set (a fixed random_state keeps the split reproducible)
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=666)

# 3. Feature preprocessing: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# 4. Instantiate an estimator
estimator = KNeighborsClassifier()

# 5. Cross-validation and grid search
param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "p": [1, 2, 10, 20]}
estimator = GridSearchCV(estimator, param_grid=param_grid, cv=5)

# 6. Train and tune the model
estimator.fit(x_train, y_train)

# 7. Predict on the test set
y_pre = estimator.predict(x_test)
print("Predicted values:\n", y_pre)

# 8. Compute the accuracy of the model
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)

# 9. Best cross-validation score and the corresponding model
print("Best score in cross-validation:\n", estimator.best_score_)
print("Best model in cross-validation:\n", estimator.best_estimator_)
```


9. Which of the following statements about generalization error is not correct? (C)

A. The generalization error is the error of the model on new samples

B. Error is the difference between the model's actual output and the true label of the sample

C. The goal of machine learning is to obtain a model with a large generalization error

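10. Which of the following methods can be used to alleviate overfitting? (B)

A. Add more features

B. Regularization

C. Increase the model complexity

D. All of the above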

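11. Which of the following statements about regularization is correct? (A)

A. L1 regularization gives sparser solutions

B. L2 regularization is also known as Lasso Regularization

C. L2 regularization gives sparser solutions

D. L2 regularization can prevent overfitting and improve the model's generalization ability, but L1 regularization cannot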

12. Regarding feature selection, which of the following statements about Ridge regression and Lasso regression is true? (B)

A. Ridge regression is suitable for feature selection

B. Lasso regression is suitable for feature selection

C. Both are suitable for feature selection

D. None of the above

13. In linear regression, we can use the normal equation (Normal Equation) to solve for the coefficients. Which of the following statements about the normal equation are true (multiple choice)? (ABC)

A. No learning rate needs to be chosen

B. When the number of features is large, the computational cost increases

C. No iterative training is required

A. Stochastic gradient descent uses one sample at a time to update the weights

B. The computational cost of gradient descent increases as the number of samples increases

A. Loading a model lets us connect an already trained model seamlessly with more data

B. A model can be saved with the externals.joblib.dump() method in sklearn

#### 16. Describe the commonly used model selection methods, their characteristics and application scenarios.

① Model selection methods include the hold-out method, cross-validation and the bootstrap method.

② The hold-out method splits the data set D directly into two mutually exclusive sets, used as the training set and the validation set. Cross-validation splits D into k mutually exclusive subsets of similar size, each time using one or more of them as the validation set and the rest as the training set. The bootstrap method randomly picks a sample from D, copies it into the training set, and then puts it back into D so that it can still be picked in the next draw; after repeating this sampling m times we obtain the training set used to train the model, and the samples that were never drawn can serve as the test set.

③ The hold-out method is generally suited to scenarios with a large amount of data; it is simple and fast but sacrifices a little accuracy. Cross-validation also works when data is scarce and lets us make full use of the limited data to select a more reliable model.
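
The two resampling schemes can be sketched on index lists alone. The helper names `k_fold_indices` and `bootstrap_indices` are invented for this example; sklearn offers its own splitters, which are not used here.

```python
import random


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k roughly equal, disjoint folds."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; the never-drawn ones form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [j for j in range(n) if j not in drawn]
    return train, test


splits = list(k_fold_indices(10, k=5))
rng = random.Random(0)
boot_train, boot_test = bootstrap_indices(1000, rng)
print(len(splits), len(boot_test) / 1000)
```

In cross-validation every sample is used for validation exactly once. With the bootstrap, roughly a third of the samples (about 1/e in the limit) are never drawn and remain available for testing.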

#### 17. Briefly describe what linear regression is and what kinds of problems it can solve.

① Linear regression is a statistical analysis method that uses regression analysis from mathematical statistics to determine the quantitative interdependence between two or more variables; it is very widely used.

② Form: the case with only one independent variable is called simple regression (of the form $y = wx + b$), and the case with more than one independent variable is called multiple regression (of the form $y = w_{1} x_{1} + w_{2} x_{2} + \cdots + b$).

③ In such a statistical model the output is a linear combination of one or more model parameters called regression coefficients. "Linear" here means that, given the independent variables $x$ and the dependent variable $y$, we solve for a set of regression coefficients $w_{1}, w_{2}, \cdots, w_{n}$ that fits the underlying relationship between $x$ and $y$, so that $y$ can be approximated by $w_{1} x_{1} + w_{2} x_{2} + \cdots + b$.

④ In machine learning, any regression problem that requires determining a quantitative relationship can be tackled with linear regression, including but not limited to house price prediction and stock return prediction.

#### 18. Describe how the loss of a linear regression model is measured and how the loss is optimized.

① If $X = (x_{0}, x_{1}, x_{2}, \cdots, x_{n})$ denotes the feature values, $W = (w_{0}, w_{1}, w_{2}, \cdots, w_{n})$ the weight coefficients, and $y$ the true target value, then the least squares method measures the error (loss) of the model over $m$ training samples as:

$$J(W) = \frac{1}{2}\sum_{i=1}^{m} \left(W^{\top} X^{(i)} - y^{(i)}\right)^2$$

where $X^{(i)}$ and $y^{(i)}$ are the features and the target of the $i$-th sample.

② The loss function can be optimized by solving for the optimal parameters directly with the normal equation, or iteratively with gradient descent.
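
Both optimization routes can be compared on a one-feature problem. The sketch below is an illustration under assumptions: the data are made up, the closed form is the normal equation specialized to a single feature plus intercept, and the gradient descent minimizes the squared error averaged over the samples with a hand-picked learning rate.

```python
def fit_normal_equation(xs, ys):
    """Closed-form least squares for y = w * x + b (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    return w, mean_y - w * mean_x


def fit_gradient_descent(xs, ys, lr=0.05, epochs=5000):
    """Minimize the averaged squared error by batch gradient descent."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
w_ne, b_ne = fit_normal_equation(xs, ys)
w_gd, b_gd = fit_gradient_descent(xs, ys)
print(w_ne, b_ne, w_gd, b_gd)
```

The normal equation needs no learning rate and no iterations, while gradient descent reaches essentially the same coefficients only after many small steps; the trade-off reverses when the number of features is large and the closed form becomes expensive.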

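#### 19. Briefly describe the causes of underfitting and overfitting and how to address them.

Causes of underfitting:

① Too few training iterations;

② The model is too simple.

Solutions for underfitting:

① Train for more iterations;

② Increase the model complexity, for example by adding polynomial features;

Causes of overfitting:

① The training samples have too many features;

② The model is too complex.

Solutions for overfitting:

① Clean the raw data more thoroughly;

② Increase the number of training samples until the sample size exceeds the number of features;

③ Use regularization;

④ Select features to reduce the feature dimension;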

#### 20. Briefly describe which methods can be used at the code level in actual development to alleviate overfitting.

① Use Ridge regression. It is a linear regression model with L2 regularization, which produces smooth weight coefficients: the weights of some features shrink, weakening the influence of those features on the model.

② Use Lasso regression. It is a linear regression model with L1 regularization, which produces sparse weight coefficients: the weights of some features become exactly zero, removing the influence of those features on the model and thereby performing feature selection.

③ Use an elastic net. It uses a linear combination of L1 and L2 regularization and inherits the advantages of both; adjusting the coefficients of the linear combination gives models with different behavior.

④ Use Early Stopping. Specify a threshold, and during training stop the model from training further as soon as the validation error drops below the threshold.
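
Why L1 produces exact zeros while L2 only shrinks can be seen in a deliberately simplified setting. The sketch assumes a single weight whose unregularized least-squares solution is `w_ols` and a loss of the form 0.5 * (w - w_ols)^2 plus the penalty; real Ridge and Lasso solvers handle many correlated features and do not reduce to these two formulas in general.

```python
def ridge_shrink(w_ols, lam):
    """Minimizer of 0.5 * (w - w_ols)**2 + 0.5 * lam * w**2."""
    return w_ols / (1.0 + lam)


def lasso_shrink(w_ols, lam):
    """Minimizer of 0.5 * (w - w_ols)**2 + lam * abs(w) (soft thresholding)."""
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0


ols_weights = [3.0, -0.4, 0.2, -2.0]
ridge_weights = [ridge_shrink(w, lam=0.5) for w in ols_weights]
lasso_weights = [lasso_shrink(w, lam=0.5) for w in ols_weights]
print(ridge_weights)
print(lasso_weights)
```

The L2 penalty scales every weight toward zero but never reaches it, while the L1 penalty sets the two small weights to exactly zero, which is the sparsity that makes Lasso usable for feature selection.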

## 21. Predict Boston house prices with linear regression models optimized by the normal equation and by stochastic gradient descent, and with a ridge regression model.

1. Use the Boston house price data set built into sklearn;
2. Split the data set; the validation set proportion can be customized, and make sure every run of the program produces the same split;
3. Preprocess the raw data with an appropriate feature preprocessing method;
4. Select a ridge regression model using a model selection method;
5. Evaluate each trained model, compare the results, and think about and discuss them.

```python
# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older version.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor, RidgeCV, Ridge
from sklearn.metrics import mean_squared_error


def linear_reg_model():
    # 1. Load the data
    boston = load_boston()

    # 2. Basic data handling
    # 2.1 Split the data
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=8)

    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    # Scale the test set with the training-set statistics instead of refitting
    x_test = transfer.transform(x_test)

    # Machine learning: linear regression solved by the normal equation
    estimator1 = LinearRegression()
    estimator1.fit(x_train, y_train)

    print("Normal equation, model bias:\n", estimator1.intercept_)
    print("Normal equation, model coefficients:\n", estimator1.coef_)

    # Model evaluation: normal equation
    y_pre = estimator1.predict(x_test)
    ret = mean_squared_error(y_test, y_pre)
    print("Normal equation, mean squared error (MSE):\n", ret)

    # Machine learning: linear regression solved by stochastic gradient descent
    estimator2 = SGDRegressor(max_iter=1000)
    estimator2.fit(x_train, y_train)

    print("Stochastic gradient descent, model bias:\n", estimator2.intercept_)
    print("Stochastic gradient descent, model coefficients:\n", estimator2.coef_)

    # Model evaluation: stochastic gradient descent
    y_pre = estimator2.predict(x_test)
    ret = mean_squared_error(y_test, y_pre)
    print("Stochastic gradient descent, mean squared error (MSE):\n", ret)

    # Machine learning: ridge regression
    # estimator = Ridge(alpha=1.0)
    estimator3 = RidgeCV(alphas=(0.001, 0.01, 0.1, 1, 10, 100))
    estimator3.fit(x_train, y_train)

    print("Ridge regression with cross-validation, model bias:\n", estimator3.intercept_)
    print("Ridge regression with cross-validation, model coefficients:\n", estimator3.coef_)

    # Model evaluation: ridge regression
    y_pre = estimator3.predict(x_test)
    ret = mean_squared_error(y_test, y_pre)
    print("Ridge regression with cross-validation, mean squared error (MSE):\n", ret)
    print("Best alpha found in cross-validation:\n", estimator3.alpha_)


if __name__ == "__main__":
    linear_reg_model()
```



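22. Suppose there are N samples, half used for training and half for testing. If N increases, how does the gap between the training error and the test error change? (B)

A. It increases

B. It decreases

Adding data can effectively alleviate overfitting and reduce the gap between the training error and the test error.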

23. Which of the following statements about logistic regression is not correct? (B)

A. Logistic regression is a classification algorithm

B. Logistic regression uses the idea of regression

C. Logistic regression is a classification model

D. Logistic regression uses the sigmoid function as an activation function to map the result of the regression

24. Which of the following descriptions of evaluation methods for classification models is wrong? (B)

A. We usually evaluate a classification model comprehensively through multiple evaluation metrics

B. Accuracy is the same thing as precision

C. Precision and recall are computed with respect to different sample populations

D. AUC is only suitable for evaluating classification models in binary classification scenarios

25. Which of the following descriptions of unbalanced sample classes is right? (A)

A. Unbalanced sample classes affect the results of a classification model

B. We have no good solution for unbalanced sample classes

C. Undersampling copies samples of the minority class to expand the sample set

D. Oversampling causes the data set to lose some information

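26. Regarding information gain and splitting nodes in a decision tree, which of the following statements are correct (multiple choice)? (BC)

A. Nodes with high purity need more information to be distinguished

B. Information gain can be obtained as "entropy(before) - entropy(after)"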

26. We want to train a decision tree model on a large data set. To spend less time, we can: (C)

A. Increase the depth of the tree

B. Increase the learning rate

C. Decrease the depth of the tree

D. Decrease the number of trees

27. Suppose the classes of the training samples are very unbalanced: the majority class makes up 99% of the training data, and your model reaches 99% accuracy on the training set. Which of the following statements are correct (multiple choice)? (AC)

A. Accuracy is not suitable for measuring problems with unbalanced classes

B. Accuracy is suitable for measuring problems with unbalanced classes

C. Precision and recall are suitable for measuring problems with unbalanced classes

D. Precision and recall are not suitable for measuring problems with unbalanced classes

28. In which of the following situations is the information gain ratio preferable to the information gain? (A)

A. When the attribute has many categories

B. When the attribute has few categories

C. It has nothing to do with the number of categories of the attribute

#### 29. Briefly describe the characteristics of logistic regression.

① It is a classification algorithm;
② It is a generalized linear model;
③ It uses the sigmoid function to map the result of the regression, so that the final value lies in the range (0, 1);
④ It is used in a great many binary classification scenarios and performs very well.


#### 30. Describe the loss function and the optimization method of logistic regression.

① The loss function of logistic regression is the log-likelihood loss; the loss is reduced by raising the output probability of the class a sample actually belongs to.

② Logistic regression is optimized much like linear regression: gradient descent can be used to find the optimal solution quickly.
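
The loss and its optimization can be written out for a single feature. This is a minimal sketch under assumptions: the data are made up and deliberately not perfectly separable (so the weights stay finite), and the learning rate and epoch count are arbitrary choices for the example.

```python
import math


def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, xs, ys):
    """Average negative log-likelihood of labels ys in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def fit(xs, ys, lr=0.1, epochs=2000):
    """Batch gradient descent on the log loss."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        # Gradient of the log loss: (prediction - label) times the input.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
loss_before = log_loss(0.0, 0.0, xs, ys)
w, b = fit(xs, ys)
loss_after = log_loss(w, b, xs, ys)
print(loss_before, loss_after)
```

With all-zero weights every sample gets probability 0.5, so the starting loss is ln 2; gradient descent then lowers the loss by pushing the predicted probability of each sample's true class upward.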


#### 31. Describe the model evaluation methods and sampling methods for unbalanced sample classes.

① When the sample classes are unbalanced, accuracy cannot measure how good the model is. Instead we can compute precision, recall, F1-score and similar metrics from the confusion matrix to evaluate the model comprehensively. In binary classification we can also plot the ROC curve and use the model's AUC value to judge its quality.

② Unbalanced sample classes have a large influence on both the results and the evaluation of a classification model. In this case we can rebuild the data set with undersampling or oversampling. We normally prefer oversampling; undersampling discards some data and may cause the model to underfit, so it is rarely used.
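
A tiny example shows why accuracy misleads here. It assumes 99 negative samples, 1 positive sample and a made-up "model" that always predicts the majority class; precision is defined as 0 when nothing is predicted positive, which is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# 99 negatives and 1 positive; the "model" always predicts the majority class.
y_true = [0] * 99 + [1]
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(accuracy, precision, recall, f1)
```

The accuracy looks excellent even though the model never finds the positive sample, which recall and F1 expose immediately.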


#### 32. Describe the principle of the decision tree and how it is built.

A decision tree uses the ID3 algorithm (information gain), the C4.5 algorithm (information gain ratio) or the CART algorithm (Gini index) to compute how important each feature is under different conditions. It has a tree structure in which each internal node represents a test condition, each branch represents an outcome of that test, and each leaf node represents a classification result; it is decision logic made up of multiple test nodes arranged as a tree.


#### 33. Briefly describe how the ID3, C4.5 and CART algorithms work and their advantages and disadvantages.

① ID3 uses information gain to judge feature importance: the larger the information gain, the more information the feature carries and the more important it is. However, when computing the information gain of a feature with many categories, the result is often inaccurate.

② C4.5 inherits the strengths of ID3 and reduces its weakness. When measuring feature importance it divides the information gain by the feature's intrinsic value (Intrinsic Value). The intrinsic value depends on the number of categories of the feature: the more categories, the larger the intrinsic value. This acts as a "penalty" on the information gain, so that a feature with too many categories does not get an inflated "information gain" and an inaccurate result.
The information gain ratio is computed as: Gain_ratio(D, a) = Gain(D, a) / IV(a)

C4.5 implements post-pruning internally, which also limits its use: post-pruning has to traverse every node after the decision tree has been fully built, compute each node's importance with a cost-complexity calculation, and then delete the unimportant nodes, which consumes a lot of memory.

③ CART uses the Gini index, which is simpler to compute, to measure feature importance: the smaller the Gini index, the more information the feature carries and the more important it is. It can solve both classification and regression problems. It involves no logarithm in the computation, which makes it cheaper to compute.

The Gini value is the probability that two samples drawn at random from the data set D belong to different classes (the probability that a sample is misclassified). It is computed as:

$$Gini(D) = 1 - \sum_{k=1}^{n} p_k^2$$

The Gini index, also called Gini impurity, is the probability that a sample is selected times the probability that it is misclassified. It is computed as:

$$Gini\_index(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v)$$

Another reason CART is efficient is that the tree it builds is a binary tree, which simplifies the tree structure.
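
The two Gini formulas translate almost directly into code. The sketch below uses made-up label lists; `gini` and `gini_index` are names chosen for this example.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 for a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_index(groups):
    """Weighted Gini of the subsets D^v produced by splitting on a feature."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)


pure = gini(["yes"] * 4)
mixed = gini(["yes", "yes", "no", "no"])
split = gini_index([["yes", "yes", "yes"], ["no", "no", "yes"]])
print(pure, mixed, split)
```

A pure node has Gini 0 and an evenly mixed binary node has Gini 0.5; the split in the last line lowers the weighted Gini below 0.5, so CART would consider it an improvement over not splitting.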

#### 34. Briefly describe the purpose of feature extraction in feature engineering and the commonly used methods.

① Feature extraction converts feature data that follows some pattern into numeric types that an algorithm can work with more easily. It is used especially in text processing, where the TF-IDF technique is common.

② Commonly used feature extraction methods are dictionary feature extraction and text feature extraction. Dictionary feature extraction requires the feature data to be in dictionary format, otherwise the features cannot be converted. Text feature extraction may first need word segmentation; the most widely used extraction technique is TF-IDF, which represents text features by computing how important each word is.


#### 35. The figure below shows 8 watermelons and their feature values. Using the ID3 decision tree algorithm, work out which feature of the existing watermelons should be used first to decide whether a melon is a good one.

[Figure: watermelon.jpg (watermelon samples and their feature values; image unavailable)]

The overall entropy is:

$$H(\text{good melon}) = -\frac{8}{17}\log_2\left(\frac{8}{17}\right) - \frac{9}{17}\log_2\left(\frac{9}{17}\right) = 0.998$$

Now suppose the feature "color" is known:

The category "green" accounts for 6/17, "light white" for 5/17, and "dark black" for 6/17.

$$H(\text{green}) = -\frac{3}{6}\log_2\left(\frac{3}{6}\right) - \frac{3}{6}\log_2\left(\frac{3}{6}\right) = 1$$

$$H(\text{light white}) = -\frac{1}{5}\log_2\left(\frac{1}{5}\right) - \frac{4}{5}\log_2\left(\frac{4}{5}\right) = 0.722$$

$$H(\text{dark black}) = -\frac{4}{6}\log_2\left(\frac{4}{6}\right) - \frac{2}{6}\log_2\left(\frac{2}{6}\right) = 0.918$$

$$G(\text{good melon} \mid \text{color}) = 0.998 - \frac{6}{17} \times 1 - \frac{5}{17} \times 0.722 - \frac{6}{17} \times 0.918 = 0.109$$

$$G(\text{good melon} \mid \text{root}) = 0.143$$

$$G(\text{good melon} \mid \text{knock sound}) = 0.141$$

$$G(\text{good melon} \mid \text{texture}) = 0.381$$

$$G(\text{good melon} \mid \text{navel}) = 0.289$$

$$G(\text{good melon} \mid \text{touch}) = 0.006$$

Texture has the largest information gain (0.381), so it is the feature to split on first.
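
The "color" calculation can be reproduced with a few lines of Python. The class counts are taken from the worked example (8 good and 9 bad melons overall; 3/3, 1/4 and 4/2 within the three colors); the function names are illustrative. Computed without rounding the intermediate entropies, the gain comes out at about 0.108, slightly below the 0.109 obtained from the rounded values.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class-count list such as [good, bad]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(parent_counts, child_counts):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    after = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - after


# [good, bad] counts: overall, then per color (green, light white, dark black).
overall = entropy([8, 9])
gain_color = information_gain([8, 9], [[3, 3], [1, 4], [4, 2]])
print(round(overall, 3), round(gain_color, 3))
```

Running the same function with the counts of the other features would reproduce the remaining gains and confirm which feature ID3 picks first.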

36. The symbols a, b, c and d are mutually independent, with probabilities 1/2, 1/4, 1/8 and 1/16 respectively. Which symbol carries the least information? (A)

A. a

B. b

C. c

D. d

According to the formula for the amount of information:

$$H(x_i) = -\log_2 p_i$$

$$H(a) = -\log_2\left(\frac{1}{2}\right) = 1\ \text{bit}$$

$$H(b) = -\log_2\left(\frac{1}{4}\right) = 2\ \text{bit}$$

$$H(c) = -\log_2\left(\frac{1}{8}\right) = 3\ \text{bit}$$

$$H(d) = -\log_2\left(\frac{1}{16}\right) = 4\ \text{bit}$$

The symbol carrying the least information is a.

#### 37. Describe the similarities and differences between logistic regression and linear regression.

Logistic regression and linear regression do have things in common. First, both can be seen as using maximum likelihood estimation to model the training samples. Linear regression uses the least squares method, which is in fact a simplification of maximum likelihood estimation under the assumption that, with the independent variable x and the parameter $\theta$ fixed, the dependent variable y follows a normal distribution; logistic regression obtains the best parameter $\theta$ by maximizing the likelihood function $L(\theta) = \prod_{i=1}^N P(y_i|x_i;\theta) = \prod_{i=1}^N (\pi (x_i))^{y_i} (1-\pi (x_i))^{1-y_i}$. In addition, both can use gradient descent to solve for the parameters, which is a common similarity among supervised learning methods.

38. You use a random forest to generate hundreds of trees (T1, T2, …, Tn) and then combine the predictions of these trees. Which of the following statements is correct? (D)

1. Each tree is built from a subset of all the data

2. The samples each tree learns from are drawn by random sampling with replacement

3. Each tree is built from a subset of the features and a subset of the data set

4. Each tree is built from all of the data

A. 1 and 2

B. 2 and 4

C. 1, 2 and 3

D. 2 and 3

#### 39. Briefly describe what ensemble learning is and what kind of problem it solves.

① Ensemble learning, as the name suggests, uses multiple weak learners to build a strong learner in order to achieve better generalization performance;
② Ensemble learning currently has two mature approaches: bagging and boosting. Bagging focuses on the overfitting problem and improves the generalization ability of the model; boosting focuses on the underfitting problem and improves both the generalization ability of the model and how well it fits the data set.


#### 40. Briefly describe the idea of the bagging algorithm.

Bagging is short for Bootstrap Aggregating and is based on the bootstrap method. It uses bootstrap sampling (random sampling with replacement), so the training data used by each base learner differs, yet all of it comes from the same population; the outputs of all base learners are then combined to determine the final output of the ensemble. Random forest (RandomForest), for example, is an ensemble learning algorithm built on the bagging idea, with decision trees as its base learners.
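
The bagging idea can be sketched with a deliberately simple base learner. Everything here is an assumption made for illustration: the base learner is a one-feature threshold "stump" rather than a decision tree, the data are made up, and bootstrap samples that miss one of the two classes are simply redrawn.

```python
import random
from collections import Counter


def bootstrap(xs, ys, rng):
    """Sample len(xs) items with replacement, redrawing if a class is missing."""
    n = len(xs)
    while True:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_y = [ys[i] for i in idx]
        if 0 in sample_y and 1 in sample_y:  # the stump needs both classes
            return [xs[i] for i in idx], sample_y


def train_stump(xs, ys):
    """Hypothetical base learner: threshold halfway between the class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    threshold = (mean0 + mean1) / 2

    def predict(x):
        above = x > threshold
        return int(above) if mean1 > mean0 else int(not above)

    return predict


def bagging_predict(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(42)
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5]
ys = [0] * 6 + [1] * 6
models = [train_stump(*bootstrap(xs, ys, rng)) for _ in range(11)]
pred_low = bagging_predict(models, 1.2)
pred_high = bagging_predict(models, 7.8)
print(pred_low, pred_high)
```

Each stump sees a different bootstrap sample and therefore learns a slightly different threshold; the majority vote averages out these differences, which is the variance-reduction effect bagging aims for.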


#### 41. Explain why the training samples of a random forest are sampled randomly.

Random sampling makes the data and features each decision tree learns from differ, which prevents the weak classifiers from being strongly correlated. If every decision tree learned from the same data and features, every tree built would be identical, which runs counter to the original purpose of bagging.


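#### 42. Outline the advantages and disadvantages of the random forest algorithm (RF).

Advantages:
① Training can be highly parallelized, which gives a speed advantage for the large-sample training of the big data era. This is the main advantage.

② Because the features for splitting decision tree nodes can be chosen at random, the model can still be trained efficiently when the feature dimension is very high.

③ After training, it can report the importance of each feature for the output.

④ Thanks to random sampling, the trained model has low variance and strong generalization ability.

⑤ It is not sensitive to partially missing features.

Disadvantages:

① On some sample sets with heavy noise, the RF model easily overfits.

② Features with many distinct split values tend to have a larger influence on RF decisions, which affects the quality of the fitted model.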


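43. Suppose you are working with categorical features and have not looked at the distribution of the categorical variable in the test set. You now want to apply One Hot Encoding (OHE) to the categorical features. What difficulties might you face when applying OHE to the categorical variable in the training set? (D)

A. Not all categories of the categorical variable are present in the test set

B. The frequency distribution of the categories differs between the training set and the test set

C. The training set and the test set usually have the same distribution

D. Both A and B are correct

Both A and B are correct. If a category appears in the test set but not in the training set, OHE cannot encode that category; this is the main difficulty in using OHE. Option B is also correct: when applying OHE, if the frequency distributions of the training set and the test set differ, we need to be extra careful, because this may bias the final result.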


#### 44. Explain what bias and variance are in machine learning.

① Bias is the deviation between the average output of all the models trained on all sampled training data sets of size m and the output of the true model.

② Variance is the variance of the outputs of all the models trained on all sampled training data sets of size m.

The definitions above are accurate but not very intuitive. To understand bias and variance more clearly, we use a target-shooting example to describe how the two differ and how they are related.

Suppose that one shot is one prediction made by a machine learning model for one sample. Hitting the bullseye means the prediction is accurate, and the farther a shot lands from the bullseye, the larger the prediction error. We draw n training sets of size m through n rounds of sampling and train n models.

[Figure: variance_bias.jpg (image unavailable)]

The best result we can hope for is the upper-left panel: the shots are both accurate and tightly grouped, which means the model's bias and variance are both small. In the upper-right panel the shots are centered around the bullseye but more widely scattered, which means small bias but large variance. Likewise, the lower-left panel shows a model with small variance but large bias, and the lower-right panel shows a model with large variance and large bias.

https://chowdera.com/2022/328/202211242127009787.html