Contents

  • Entropy and information gain
  • 1. Entropy
  • 2. Conditional Entropy
  • 3. Joint Entropy
  • 4. Relative Entropy
    • 4.1 Properties of relative entropy
  • 5. Cross Entropy
  • 6. The relationship between relative entropy, cross entropy, and entropy
  • 7. Information Gain
  • 8. Information Gain Ratio
  • 9. A graph that shows entropy and information gain




Entropy and information gain

1. Entropy

Entropy measures the uncertainty of a random variable. Suppose a discrete random variable \(X\) can take \(n\) values, with probability distribution


\[P(X=x_i)=p_i, \quad i = 1,2,\ldots,n \]


Then the entropy of \(X\) is defined as


\[H(X) = -\sum_{i=1}^n p_i \log{p_i} \]


Since entropy depends only on the distribution of \(X\) and not on the values of \(X\) themselves, it can also be written as


\[H(p) = -\sum_{i=1}^n p_i \log{p_i} \]


The larger the entropy, the greater the uncertainty of the random variable, and \(0\leq{H(p)}\leq\log{n}\).

When the random variable takes only the two values \(0\) and \(1\), the distribution of \(X\) is


\[P(X=1)=p, \quad P(X=0)=1-p, \quad 0\leq{p}\leq{1} \]


and its entropy is


\[H(p) = -p\log_2 p-(1-p) \log_2(1-p) \]


This random variable follows the Bernoulli distribution; the curve of entropy versus probability is shown in the figure below.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# probabilities strictly inside (0, 1); the endpoints are excluded to avoid log(0)
p = np.arange(0.01, 1, 0.01)
# entropy of a Bernoulli(p) random variable, in bits
entro = -p*np.log2(p) - (1-p)*np.log2(1-p)

plt.plot(p, entro)
plt.title('Entropy of the Bernoulli distribution as a function of p', fontproperties=font)
plt.xlabel('p')
plt.ylabel('H(p)')
plt.show()

(Figure: the entropy of the Bernoulli distribution plotted against p)

When \(p=0\) or \(p=1\), the entropy is \(0\) and the random variable has no uncertainty; when \(p=0.5\), the entropy is largest and the random variable is most uncertain.

2. Conditional Entropy

Suppose we have random variables \((X,Y)\) with joint probability


\[p(X=x_i,Y=y_j), \quad i=1,2,\ldots,n; \quad j=1,2,\ldots,m \]


The conditional entropy \(H(Y|X)\) measures the uncertainty of the random variable \(Y\) given the random variable \(X\), and is defined as


\[H(Y|X) = \sum_{i=1}^n P(X=x_i) H(Y|X=x_i) \]


From this formula, conditional entropy can be understood as the amount of information still needed to determine \(Y\) once \(X\) is already known.

When the probabilities in entropy and conditional entropy are estimated from data, the corresponding quantities are called empirical entropy and empirical conditional entropy, respectively.
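
To make this concrete, here is a minimal numpy sketch (not from the original post; the joint probabilities are made up for illustration) that computes \(H(Y|X)\) directly from a joint distribution table using the definition above.

import numpy as np

# Hypothetical joint distribution p(X=x_i, Y=y_j): rows index X, columns index Y
p_xy = np.array([[0.2, 0.1],
                 [0.1, 0.6]])

p_x = p_xy.sum(axis=1)  # marginal P(X=x_i)

# H(Y|X) = sum_i P(X=x_i) * H(Y | X=x_i)
h_y_given_x = 0.0
for i in range(p_xy.shape[0]):
    p_y_given_xi = p_xy[i] / p_x[i]                        # P(Y | X=x_i)
    h_yxi = -np.sum(p_y_given_xi * np.log2(p_y_given_xi))  # H(Y | X=x_i)
    h_y_given_x += p_x[i] * h_yxi

print(h_y_given_x)  # conditional entropy in bits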

3. Joint Entropy

Suppose we have random variables \((X,Y)\) with joint probability


\[p(X=x_i,Y=y_j)=p_{ij}, \quad i=1,2,\ldots,n; \quad j=1,2,\ldots,m \]


Joint entropy measures the uncertainty of a system of jointly distributed random variables and is defined as


\[H(X,Y) = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(X=x_i,Y=y_j)} \]


Using the factorization \(p(X=x_i,Y=y_j)=p(X=x_i)\,p(Y=y_j|X=x_i)\), the joint entropy can be expanded as follows


\[\begin{align} H(X,Y) & = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(X=x_i,Y=y_j)} \\ & = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log\left[p(X=x_i)\,p(Y=y_j|X=x_i)\right] \\ & = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(X=x_i)} -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(Y=y_j|X=x_i)}\\ & = -\sum_{i=1}^n p(X=x_i) \log{p(X=x_i)} -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i,Y=y_j) \log{p(Y=y_j|X=x_i)} \\ & = H(X) + H(Y|X) \end{align} \]


Similarly, \(H(X,Y)=H(Y)+H(X|Y)\). In other words, for a system of two random variables, the joint entropy is the information obtained by first observing one random variable and then observing the second on top of that, and which variable is observed first has no effect on the total amount of information.

In the same way, for a system of \(n\) independent random variables \((X_1,X_2,\ldots,X_n)\), the joint entropy is


\[H(X_1,X_2,\ldots,X_n) = \sum_{i=1}^n H(X_i) \]


Again, even with \(n\) random variables, the order in which they are observed has no effect on the total amount of information.
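
As a quick sanity check of this additivity (my own sketch, with made-up marginal distributions), the following builds the joint distribution of two independent variables as the outer product of their marginals and verifies that \(H(X,Y)=H(X)+H(Y)\).

import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

p_x = np.array([0.5, 0.3, 0.2])        # hypothetical marginal of X
p_y = np.array([0.6, 0.4])             # hypothetical marginal of Y

p_xy = np.outer(p_x, p_y)              # joint distribution when X and Y are independent

print(entropy(p_xy))                   # H(X, Y)
print(entropy(p_x) + entropy(p_y))     # H(X) + H(Y), matches H(X, Y)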

4. Relative Entropy

Relative entropy is also known as the Kullback–Leibler (KL) divergence.

Let \(p(x)\) and \(q(x)\) be two probability distributions over the values of a discrete random variable \(X\). The relative entropy of \(p\) with respect to \(q\) is:


\[\begin{align} D_{KL}(p\|q) & = \sum_{i=1}^n p(X=x_i)\log{\frac{p(X=x_i)}{q(X=x_i)}} \\ & = E_{p}\left[\log{\frac{p(X=x_i)}{q(X=x_i)}}\right] \end{align} \]


4.1 Properties of relative entropy

  1. If the two distributions \(p(x)\) and \(q(x)\) are identical, the relative entropy equals 0.
  2. \(D_{KL}(p\|q) \neq D_{KL}(q\|p)\); relative entropy is not symmetric.
  3. \(D_{KL}(p\|q) \geq 0\), which can be proved using Jensen's inequality:


\[\begin{align} D_{KL}(p\|q) & = \sum_{i=1}^n p(X=x_i)\log{\frac{p(X=x_i)}{q(X=x_i)}} \\ & = - \sum_{i=1}^n p(X=x_i)\log{\frac{q(X=x_i)}{p(X=x_i)}} \\ & = - E_{p}\left[\log{\frac{q(X=x_i)}{p(X=x_i)}}\right] \\ & \geq -\log{E_{p}\left[\frac{q(X=x_i)}{p(X=x_i)}\right]} \\ & = - \log\sum_{i=1}^n p(X=x_i)\frac{q(X=x_i)}{p(X=x_i)} \\ & = - \log\sum_{i=1}^n q(X=x_i) \end{align} \]


Since \(\sum_{i=1}^n q(X=x_i)=1\), the last line equals \(-\log 1 = 0\), which proves \(D_{KL}(p\|q)\geq 0\).
  4. Relative entropy measures the difference between two probability distributions: it is the expectation, under \(p\), of the log of the ratio between \(p\) and \(q\).
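
To see these properties numerically, here is a small sketch (the distributions are chosen arbitrarily for illustration) that computes \(D_{KL}(p\|q)\) and \(D_{KL}(q\|p)\): both are nonnegative, they are generally unequal, and the divergence of a distribution from itself is 0.

import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits for discrete distributions on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

p = np.array([0.8, 0.1, 0.1])   # arbitrary distribution
q = np.array([0.4, 0.4, 0.2])   # another arbitrary distribution

print(kl_divergence(p, q))      # 0.5  (>= 0)
print(kl_divergence(q, p))      # 0.6  (>= 0, not equal to the line above)
print(kl_divergence(p, p))      # 0.0  (identical distributions)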

5. Cross Entropy

Definition: the cross entropy of two probability distributions \(p(x)\) and \(q(x)\) defined over the same event space is the average number of bits needed to uniquely identify an event from that set when the coding scheme is based on an "unnatural" probability distribution \(q(x)\) (unnatural relative to the "true" distribution \(p(x)\)); in other words, the amount of effort required to eliminate the system's uncertainty using the strategy specified by the non-true distribution \(q(x)\).

Suppose the random variable \(X\) can take \(n\) values, and there are two probability distributions over the sample set, \(p(X=x_i)\) and \(q(X=x_i)\), where \(p(X=x_i)\) is the true distribution and \(q(X=x_i)\) is the non-true distribution. If we use the true distribution \(p(X=x_i)\), the expected code length needed to identify a sample (the average code length) is:


\[\begin{align} H(p) & = \sum_{i=1}^n p(X=x_i)\log{\frac{1}{p(X=x_i)}} \\ & = - \sum_{i=1}^n p(X=x_i)\log{p(X=x_i)} \end{align} \]


If instead the non-true distribution \(q(X=x_i)\) is used to encode samples from the true distribution \(p(X=x_i)\), the average code length is:


\[H(p,q) = \sum_{i=1}^n p(X=x_i)\log{\frac{1}{q(X=x_i)}} \]


Because the samples being encoded with \(q(X=x_i)\) actually come from the distribution \(p(X=x_i)\), the probabilities appearing in \(H(p,q)\) are \(p(X=x_i)\); \(H(p,q)\) is called the cross entropy.

For example, consider a random variable \(X\) with true distribution \(p(X)=({\frac{1}{2}},{\frac{1}{4}},{\frac{1}{8}},{\frac{1}{8}})\) and non-true distribution \(q(X)=({\frac{1}{4}},{\frac{1}{4}},{\frac{1}{4}},{\frac{1}{4}})\). Then \(H(p)=1.75\) bits is the shortest average code length, and the cross entropy is


\[H(p,q)={\frac{1}{2}}\log_2 4+{\frac{1}{4}}\log_2 4+{\frac{1}{8}}\log_2 4+{\frac{1}{8}}\log_2 4=2 \text{ bits} \]


This shows that the average code length obtained from the non-true distribution \(q(X=x_i)\) is larger than the average code length obtained from the true distribution \(p(X=x_i)\). But is this just a feature of the example, or is it always the case?
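
The numbers in this example are easy to reproduce; here is a short sketch (my own, using the distributions from the example above) that evaluates \(H(p)\) and \(H(p,q)\) directly from the definitions.

import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/8])   # true distribution from the example
q = np.array([1/4, 1/4, 1/4, 1/4])   # non-true distribution from the example

print(-np.sum(p * np.log2(p)))       # H(p)   = 1.75 bits
print(-np.sum(p * np.log2(q)))       # H(p,q) = 2.0 bits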

6. The relationship between relative entropy, cross entropy, and entropy

Let's expand the formula for relative entropy.


\[\begin{align} D_{KL}(p\|q) & = \sum_{i=1}^n p(X=x_i)\log{\frac{p(X=x_i)}{q(X=x_i)}} \\ & = \sum_{i=1}^n \left[ p(X=x_i)\log{p(X=x_i)} - p(X=x_i)\log{q(X=x_i)} \right] \end{align} \]


Now compare this with the formulas for entropy and cross entropy:


\[\begin{align} \text{entropy} & = H(p) \\ & = -\sum_{i=1}^n p(X=x_i)\log{p(X=x_i)} \end{align} \]



\[\begin{align} \text{cross entropy} & = H(p,q) \\ & = \sum_{i=1}^n p(X=x_i)\log{\frac{1}{q(X=x_i)}} \\ & = -\sum_{i=1}^n p(X=x_i)\log{q(X=x_i)} \end{align} \]


From these we can deduce


\[D_{KL}(p\|q)=H(p,q)-H(p) \]


It follows that the number of extra bits by which the average code length under the non-true distribution \(q(x)\) exceeds the average code length under the true distribution \(p(x)\) is exactly the relative entropy.

Moreover, since \(D_{KL}(p\|q)\geq 0\), we have \(H(p,q)\geq H(p)\); when \(p(x)=q(x)\), the cross entropy equals the entropy.
When \(H(p)\) is a constant (note: in machine learning, the distribution of the training data is fixed), minimizing the relative entropy \(D_{KL}(p\|q)\) is equivalent to minimizing the cross entropy \(H(p,q)\), which in turn is equivalent to maximum likelihood estimation.
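
Continuing the example from the previous section, a short check (my own sketch) confirms that \(D_{KL}(p\|q)=H(p,q)-H(p)\):

import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/8])   # true distribution from the cross entropy example
q = np.array([1/4, 1/4, 1/4, 1/4])   # non-true distribution

h_p  = -np.sum(p * np.log2(p))       # H(p)   = 1.75 bits
h_pq = -np.sum(p * np.log2(q))       # H(p,q) = 2.0 bits
d_kl = np.sum(p * np.log2(p / q))    # D_KL(p || q)

print(d_kl, h_pq - h_p)              # both equal 0.25 bits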

7. Information Gain

Suppose we have random variables \((X,Y)\). Information gain represents the degree to which knowing the feature \(X\) reduces the uncertainty about the class \(Y\).

The information gain of feature \(A\) on the training set \(D\), written \(g(D,A)\), is defined as the difference between the empirical entropy \(H(D)\) of the set \(D\) and the empirical conditional entropy \(H(D|A)\) of \(D\) given the feature \(A\):


\[g(D,A) = H(D) - H(D|A) \]


Here \(H(D)\) represents the uncertainty of classifying the dataset \(D\); \(H(D|A)\) represents the uncertainty of classifying \(D\) given the feature \(A\); and \(g(D,A)\) represents the degree to which the feature \(A\) reduces the uncertainty of classifying \(D\). For a given dataset \(D\), the information gain therefore depends on the feature: different features generally have different information gains, and features with larger information gain have stronger classification ability.
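
As an illustration (the tiny dataset below is made up, and the helper names are my own), here is a sketch that computes the empirical entropy \(H(D)\), the empirical conditional entropy \(H(D|A)\), and the information gain \(g(D,A)\) for a single categorical feature.

import numpy as np
from collections import Counter

def entropy_of_labels(labels):
    """Empirical entropy (in bits) of a sequence of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# Hypothetical dataset D: one categorical feature A and a binary class label
feature_A = ['young', 'young', 'middle', 'middle', 'old', 'old', 'old', 'young']
labels    = ['no',    'no',    'yes',    'yes',    'yes', 'no',  'yes', 'no']
n = len(labels)

h_d = entropy_of_labels(labels)                      # empirical entropy H(D)

h_d_given_a = 0.0                                    # empirical conditional entropy H(D|A)
for value in set(feature_A):
    subset = [y for x, y in zip(feature_A, labels) if x == value]
    h_d_given_a += len(subset) / n * entropy_of_labels(subset)

print(h_d - h_d_given_a)                             # information gain g(D, A)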

8. Information Gain Ratio

Suppose we have random variables \((X,Y)\). The information gain ratio of feature \(A\) on the dataset \(D\), written \(g_R(D,A)\), is defined as


\[g_R(D,A) = {\frac{g(D,A)}{H_A(D)}} \]


where the feature entropy is \(H_A(D) = -\sum_{i=1}^n {\frac{|D_i|}{|D|}} \log_2 {\frac{|D_i|}{|D|}}\), \(D_i\) is the subset of \(D\) on which feature \(A\) takes its \(i\)-th value, and \(n\) is the number of distinct values taken by the feature \(A\).
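
Continuing the sketch from the previous section (same made-up dataset), the information gain ratio simply divides the gain by the entropy of the feature's own value distribution:

import numpy as np
from collections import Counter

def entropy_from_counts(counts):
    """Entropy in bits, given a list of occurrence counts."""
    probs = np.array(counts, dtype=float) / sum(counts)
    return -np.sum(probs * np.log2(probs))

# Same hypothetical dataset as in the previous sketch
feature_A = ['young', 'young', 'middle', 'middle', 'old', 'old', 'old', 'young']
labels    = ['no',    'no',    'yes',    'yes',    'yes', 'no',  'yes', 'no']
n = len(labels)

h_d = entropy_from_counts(list(Counter(labels).values()))        # H(D)

h_d_given_a = 0.0                                                # H(D|A)
for value in set(feature_A):
    subset = [y for x, y in zip(feature_A, labels) if x == value]
    h_d_given_a += len(subset) / n * entropy_from_counts(list(Counter(subset).values()))

g   = h_d - h_d_given_a                                          # information gain g(D, A)
h_a = entropy_from_counts(list(Counter(feature_A).values()))     # feature entropy H_A(D)

print(g / h_a)                                                   # information gain ratio g_R(D, A)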

9. A graph that shows entropy and information gain

(Figure: a diagram of the relationships among entropy, conditional entropy, joint entropy, and information gain)

Suppose we have random variables \((X,Y)\). In the figure, \(H(X)\) is the entropy of \(X\), \(H(Y)\) is the entropy of \(Y\), \(H(X|Y)\) is the conditional entropy of \(X\) given \(Y\), \(H(Y|X)\) is the conditional entropy of \(Y\) given \(X\), \(I(X,Y)\) is the information gain (also known as mutual information), and \(H(X,Y)\) is the joint entropy.
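
As a closing check (my own sketch, with an arbitrary joint distribution), the following verifies the relationships the figure suggests: \(I(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)=H(X)+H(Y)-H(X,Y)\).

import numpy as np

def H(p):
    """Entropy in bits of a probability array (zeros are ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Arbitrary joint distribution p(X=x_i, Y=y_j): rows index X, columns index Y
p_xy = np.array([[0.2, 0.1],
                 [0.1, 0.6]])
p_x = p_xy.sum(axis=1)   # marginal of X
p_y = p_xy.sum(axis=0)   # marginal of Y

# Conditional entropies computed directly from the definitions
h_y_given_x = sum(p_x[i] * H(p_xy[i, :] / p_x[i]) for i in range(len(p_x)))
h_x_given_y = sum(p_y[j] * H(p_xy[:, j] / p_y[j]) for j in range(len(p_y)))

# The three expressions for the information gain I(X, Y) all agree
print(H(p_x) - h_x_given_y)
print(H(p_y) - h_y_given_x)
print(H(p_x) + H(p_y) - H(p_xy))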