Catalog

• Entropy and information gain
• 1. Entropy
• 2. Conditional Entropy
• 3. Joint Entropy
• 4. Relative Entropy
• 4.1 Properties of relative entropy
• 5. Cross Entropy
• 6. The relationship between relative entropy, cross entropy and entropy
• 7. Information Gain
• 8. Information Gain Ratio
• 9. Entropy and information gain in one picture


# 1. Entropy

Entropy measures the uncertainty of a random variable. Suppose a discrete random variable $$X$$ can take $$n$$ values, with probability distribution

$P(X=x_i)=p_i, \quad i = 1,2,\ldots,n$

Then the entropy of $$X$$ is defined as

$H(X) = -\sum_{i=1}^n p_i \log{p_i}$

Since entropy depends only on the distribution of $$X$$ and not on the actual values that $$X$$ takes, entropy can also be written as

$H(p) = -\sum_{i=1}^n p_i \log{p_i}$

The greater the entropy, the greater the uncertainty of the random variable, and $$0\leq{H(p)}\leq\log{n}$$.
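
As a quick illustration, here is a minimal sketch (assuming NumPy and base-2 logarithms, i.e. entropy measured in bits) of computing $$H(p)$$ for a discrete distribution:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i * log2(p_i), in bits; terms with p_i = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))                # 1.0 bit, the Bernoulli p = 0.5 case below
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits, the maximum log2(n) for n = 4
```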

When the random variable takes only the two values $$0$$ and $$1$$, the distribution of $$X$$ is

$P(X=1)=p, \quad P(X=0)=1-p, \quad 0\leq{p}\leq{1}$

The entropy is

$H(p) = -p\log_2 p-(1-p) \log_2(1-p)$

This random variable follows a Bernoulli distribution, and the curve of entropy versus probability is shown in the figure below.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Entropy of a Bernoulli(p) variable: H(p) = -p*log2(p) - (1-p)*log2(1-p)
p = np.arange(0.01, 1, 0.01)
entro = -p*np.log2(p) - (1-p)*np.log2(1-p)

plt.plot(p, entro)
plt.title('The relationship between entropy and probability in the Bernoulli distribution', fontproperties=font)
plt.xlabel('p')
plt.ylabel('H(p)')
plt.show()
```

When $$p=0$$ or $$p=1$$, the entropy is $$0$$ and the random variable has no uncertainty; when $$p=0.5$$, the entropy is largest and the random variable is most uncertain.

# 2. Conditional Entropy

Suppose there are random variables $$(X,Y)$$ with joint probability

$p(X=x_i,Y=y_j), \quad i=1,2,\ldots,n; \quad j=1,2,\ldots,m$

Conditional entropy $$H(Y|X)$$ represents the uncertainty of the random variable $$Y$$ given the random variable $$X$$, and is defined as

$H(Y|X) = \sum_{i=1}^n P(X=x_i) H(Y|X=x_i) = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(Y=y_j|X=x_i)}$

From this formula, conditional entropy can be understood as the amount of information still needed to describe one variable after another variable is already known.
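
A minimal sketch of this formula, assuming a small hypothetical joint distribution table for $$(X,Y)$$ (rows index the values of $$X$$, columns the values of $$Y$$) and base-2 logarithms:

```python
import numpy as np

# Hypothetical joint distribution p(X=x_i, Y=y_j); the numbers are made up for illustration.
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

p_x = p_xy.sum(axis=1)  # marginal distribution P(X=x_i)

# H(Y|X) = sum_i P(X=x_i) * H(Y | X=x_i)
h_y_given_x = 0.0
for i in range(p_xy.shape[0]):
    p_y_given_xi = p_xy[i] / p_x[i]          # conditional distribution P(Y | X=x_i)
    nonzero = p_y_given_xi[p_y_given_xi > 0]
    h_y_given_x += p_x[i] * -np.sum(nonzero * np.log2(nonzero))

print(h_y_given_x)  # ≈ 0.861 bits of remaining uncertainty about Y once X is known
```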

When the probabilities in the entropy and the conditional entropy are estimated from data, the corresponding quantities are called the empirical entropy (empirical entropy) and the empirical conditional entropy (empirical conditional entropy), respectively.

# 3. Joint Entropy

Suppose there are random variables $$(X,Y)$$ with joint probability

$p(X=x_i,Y=y_j)=p_{ij}, \quad i=1,2,\ldots,n; \quad j=1,2,\ldots,m$

Joint entropy measures the uncertainty of a stochastic system described by a joint distribution, and is defined as

$H(X,Y) = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(X=x_i,Y=y_j)}$

We can then decompose the joint entropy as follows:

\begin{align} H(X,Y) & = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(X=x_i,Y=y_j)} \\ & = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log\left[p(X=x_i)\,p(Y=y_j|X=x_i)\right] \\ & = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(X=x_i)} -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(Y=y_j|X=x_i)}\\ & = -\sum_{i=1}^n p(X=x_i) \log{p(X=x_i)} -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(Y=y_j|X=x_i)} \\ & = H(X) + H(Y|X) \end{align}

Similarly, $$H(X,Y)=H(Y)+H(X|Y)$$. In other words, for a stochastic system containing two random variables, we can first observe one random variable to obtain its information, and then, given that information, observe the second random variable; the total amount of information does not depend on which random variable is observed first.
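
To make the decomposition concrete, here is a minimal numerical check, reusing the hypothetical joint distribution from the conditional-entropy sketch above:

```python
import numpy as np

def H(p):
    """Entropy, in bits, of any array of probabilities (zero entries are ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])       # hypothetical joint distribution p(X, Y)
p_x = p_xy.sum(axis=1)                # marginal P(X)

h_y_given_x = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(H(p_xy))               # H(X,Y) computed directly from the joint distribution
print(H(p_x) + h_y_given_x)  # H(X) + H(Y|X): the two values agree (≈ 1.861 bits)
```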

In the same way, for a random system $$(X_1,X_2,\ldots,X_n)$$ containing $$n$$ mutually independent random variables, the joint entropy is

$H(X_1,X_2,\ldots,X_n) = \sum_{i=1}^n H(X_i)$

Again, even with $$n$$ random variables, the order in which they are observed has no effect on the total amount of information.

# 4. Relative Entropy

Relative entropy is sometimes called the KL divergence (Kullback–Leibler divergence).

Let $$p(x)$$ and $$q(x)$$ be two probability distributions over the values of a discrete random variable $$X$$. Then the relative entropy of $$p$$ with respect to $$q$$ is:

\begin{align} D_{KL}(p||q) & = \sum_{i=1}^n p(X=x_i)\log{\frac{p(X=x_i)}{q(X=x_i)}} \\ & = E_{p(x)}\left[\log{\frac{p(x)}{q(x)}}\right] \end{align}
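
A minimal sketch (assuming NumPy, base-2 logarithms, and that $$q(X=x_i)>0$$ wherever $$p(X=x_i)>0$$) of computing the relative entropy of two discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log2(p_i / q_i), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.25, 0.125, 0.125], [0.25, 0.25, 0.25, 0.25]))  # 0.25
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ≈ 0.531
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ≈ 0.737, so D_KL(p||q) != D_KL(q||p)
```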

## 4.1 Properties of relative entropy

1. If the two distributions $$p(x)$$ and $$q(x)$$ are identical, the relative entropy equals 0
2. $$D_{KL}(p||q) \neq D_{KL}(q||p)$$, i.e. relative entropy is not symmetric
3. $$D_{KL}(p||q) \geq 0$$ (this can be proved with Jensen's inequality):

\begin{align} D_{KL}(p||q) & = \sum_{i=1}^n p(X=x_i)\log{\frac{p(X=x_i)}{q(X=x_i)}} \\ & = - \sum_{i=1}^n p(X=x_i)\log{\frac{q(X=x_i)}{p(X=x_i)}} \\ & = - E_{p(x)}\left[\log{\frac{q(x)}{p(x)}}\right] \\ & \geq -\log{E_{p(x)}\left[\frac{q(x)}{p(x)}\right]} \\ & = - \log{\sum_{i=1}^n p(X=x_i)\frac{q(X=x_i)}{p(X=x_i)}} \\ & = - \log{\sum_{i=1}^n q(X=x_i)} \end{align}

Since $$\sum_{i=1}^n q(X=x_i)=1$$, the last line equals $$0$$, which proves $$D_{KL}(p||q)\geq 0$$.
4. Relative entropy can be used to measure the difference between two probability distributions; the formula above is the expectation, taken under $$p$$, of the logarithmic difference between $$p$$ and $$q$$

# 5、 ... and 、 Cross entropy （Cross Entropy）

Definition: for two probability distributions $$p(x)$$ and $$q(x)$$ over the same set of events, the cross entropy is the average number of bits needed to uniquely identify an event in that set when the encoding is based on an "unnatural" probability distribution $$q(x)$$ (unnatural relative to the "true" distribution $$p(x)$$); in other words, it is the amount of effort required to eliminate the system's uncertainty when using the strategy given by the non-true distribution $$q(x)$$.

Suppose a random variable $$X$$ can take $$n$$ values, and there are two probability distributions over the sample set, $$p(X=x_i)$$ and $$q(X=x_i)$$, where $$p(X=x_i)$$ is the true distribution and $$q(X=x_i)$$ is the non-true distribution. If we use the true distribution $$p(X=x_i)$$ to measure the expected code length needed to identify a sample (the average code length), it is:

\begin{align} H(p) & = \sum_{i=1}^n p(X=x_i)\log{\frac{1}{p(X=x_i)}} \\ & = - \sum_{i=1}^n p(X=x_i)\log{p(X=x_i)} \end{align}

If instead we use the non-true distribution $$q(X=x_i)$$ to compute the average code length for samples drawn from the true distribution $$p(X=x_i)$$, it is:

$H(p,q) = \sum_{i=1}^n p(X=x_i)\log{\frac{1}{q(X=x_i)}}$

Because $$q(X=x_i)$$ is used to encode samples that actually come from the distribution $$p(X=x_i)$$, the probabilities in $$H(p,q)$$ are still $$p(X=x_i)$$; $$H(p,q)$$ is called the cross entropy.

For example, consider a random variable $$X$$ with true distribution $$p(X)=({\frac{1}{2}},{\frac{1}{4}},{\frac{1}{8}},{\frac{1}{8}})$$ and non-true distribution $$q(X)=({\frac{1}{4}},{\frac{1}{4}},{\frac{1}{4}},{\frac{1}{4}})$$. Then $$H(p)=1.75$$ bits (the shortest average code length), and the cross entropy is

$H(p,q)={\frac{1}{2}}\log_2 4+{\frac{1}{4}}\log_2 4+{\frac{1}{8}}\log_2 4+{\frac{1}{8}}\log_2 4=2 \text{ bits}$
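
This example can be checked with a minimal sketch (assuming NumPy and base-2 logarithms):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i): the average code length when coding with q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = [1/2, 1/4, 1/8, 1/8]    # true distribution
q = [1/4, 1/4, 1/4, 1/4]    # non-true distribution
print(cross_entropy(p, p))  # 1.75 bits: coding with the true distribution gives H(p)
print(cross_entropy(p, q))  # 2.0 bits: coding with the non-true distribution costs more
```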

We can see that the average code length obtained from the non-true distribution $$q(X=x_i)$$ is larger than the one obtained from the true distribution $$p(X=x_i)$$. But is this specific to the example, or is it always the case?

# 6. The relationship between relative entropy, cross entropy and entropy

Let us first simplify the relative entropy formula,

\begin{align} D_{KL}(p||q) & = \sum_{i=1}^n p(X=x_i)\log{\frac{p(X=x_i)}{q(X=x_i)}} \\ & = \sum_{i=1}^n \left[ p(X=x_i)\log{p(X=x_i)} - p(X=x_i)\log{q(X=x_i)} \right] \end{align}

and compare it with the entropy formula and the cross entropy formula:

\begin{align} \text{entropy:}\quad H(p) & = -\sum_{i=1}^n p(X=x_i)\log{p(X=x_i)} \end{align}

\begin{align} \text{cross entropy:}\quad H(p,q) & = \sum_{i=1}^n p(X=x_i)\log{\frac{1}{q(X=x_i)}} \\ & = -\sum_{i=1}^n p(X=x_i)\log{q(X=x_i)} \end{align}

It follows that

$D_{KL}(p||q)=H(p,q)-H(p)$

From this formula it follows that the number of extra bits by which the average code length under the non-true distribution $$q(x)$$ exceeds the average code length under the true distribution $$p(x)$$ is exactly the relative entropy.

Furthermore, since $$D_{KL}(p||q)\geq 0$$, we have $$H(p,q)\geq H(p)$$, and when $$p(x)=q(x)$$ the cross entropy equals the entropy.
And when $$H(p)$$ is a constant (note: in machine learning, the distribution of the training data is fixed), minimizing the relative entropy $$D_{KL}(p||q)$$ is equivalent to minimizing the cross entropy $$H(p,q)$$, which is in turn equivalent to maximum likelihood estimation.
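
Using the example from the cross entropy section as a quick check: $$D_{KL}(p||q)=H(p,q)-H(p)=2-1.75=0.25$$ bits, which is exactly the extra average code length paid for coding with the non-true distribution $$q$$.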

# 7. Information Gain

Suppose there are random variables $$(X,Y)$$. Information gain represents the degree to which the information of feature $$X$$ reduces the uncertainty of the information of class $$Y$$.

The information gain of feature $$A$$ on training set $$D$$ is denoted $$g(D,A)$$, and is defined as the difference between the empirical entropy $$H(D)$$ of set $$D$$ and the empirical conditional entropy $$H(D|A)$$ of $$D$$ given feature $$A$$:

$g(D,A) = H(D) - H(D|A)$

where $$H(D)$$ represents the uncertainty in classifying the dataset $$D$$; $$H(D|A)$$ represents the uncertainty in classifying the dataset $$D$$ given feature $$A$$; and $$g(D,A)$$ represents the degree to which feature $$A$$ reduces the uncertainty in classifying $$D$$. So, for a given dataset $$D$$, the information gain depends on the feature: different features often have different information gains, and features with larger information gain have stronger classification ability.
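
As an illustration, here is a minimal sketch that computes $$g(D,A)$$ for one categorical feature; the dataset (a "weather" feature and a binary "play" label) is purely hypothetical:

```python
import numpy as np
from collections import Counter

def label_entropy(labels):
    """Empirical entropy H(D), in bits, of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A) for one categorical feature A."""
    n = len(labels)
    h_d_given_a = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        h_d_given_a += len(subset) / n * label_entropy(subset)
    return label_entropy(labels) - h_d_given_a

# Hypothetical toy data: does the feature "weather" help predict the label "play"?
weather = ['sunny', 'sunny', 'sunny', 'rainy', 'rainy', 'rainy']
play    = ['no',    'no',    'no',    'yes',   'yes',   'no']
print(information_gain(weather, play))  # ≈ 0.459 bits
```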

# 8. Information Gain Ratio

Suppose there are random variables $$(X,Y)$$. The information gain ratio of feature $$A$$ on dataset $$D$$ is denoted $$g_R(D,A)$$ and defined as

$g_R(D,A) = {\frac{g(D,A)}{H_A(D)}}$

where the feature entropy is $$H_A(D) = -\sum_{i=1}^n {\frac{|D_i|}{|D|}} \log_2 {\frac{|D_i|}{|D|}}$$ and $$n$$ is the number of values that feature $$A$$ can take.
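
Continuing the hypothetical example from the previous section (reusing `label_entropy` and `information_gain` from that sketch), the gain ratio simply normalises the gain by the entropy of the feature itself:

```python
def information_gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D), where H_A(D) is the entropy of the feature values."""
    h_a = label_entropy(feature_values)  # feature entropy H_A(D)
    return information_gain(feature_values, labels) / h_a

print(information_gain_ratio(weather, play))  # ≈ 0.459 / 1.0 = 0.459 for the toy data
```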

# 9. Entropy and information gain in one picture

Suppose there are random variables $$(X,Y)$$. $$H(X)$$ denotes the entropy of $$X$$, $$H(Y)$$ denotes the entropy of $$Y$$, $$H(X|Y)$$ denotes the conditional entropy of $$X$$ given $$Y$$, $$H(Y|X)$$ denotes the conditional entropy of $$Y$$ given $$X$$, $$I(X,Y)$$ denotes the information gain, and $$H(X,Y)$$ denotes the joint entropy.
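
Stated as formulas, the relationships that such a picture summarises follow directly from the earlier sections:

\begin{align} H(X,Y) & = H(X) + H(Y|X) = H(Y) + H(X|Y) \\ I(X,Y) & = H(X) - H(X|Y) = H(Y) - H(Y|X) \\ & = H(X) + H(Y) - H(X,Y) \end{align}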