# Data Analysis Models, Chapter 1

2020-12-20 09:16:35

# One. Basic Introduction

During my college years, my impression of the data analysis models course was that it is essentially foundational statistics and discrete mathematics. The course teaches the most basic statistical knowledge, much of it close to high-school mathematics, and leads on to advanced data analysis.
Data analysis models and advanced data analysis are the foundation courses for deep learning, i.e. artificial intelligence, and both fall under data science, which also happens to be my major. You can take a look at another blog post I wrote, the Data Analysis Model catalog, to get a more intuitive overview of this course.
Here is the reading material the teacher recommended for this course; students interested in self-study can refer to it:
Ross, S.M. (2014) Introduction to Probability and Statistics for Engineers and Scientists, 5th ed. Academic Press.

Students less comfortable with English can also find the Chinese translation online for self-study.

# Two. Models

What is a model? In data science, a model is a mathematical or logical expression or equation. Put bluntly, you can also think of it as something that takes an input and gives you the output you want.
A model is never absolutely right or wrong; for different purposes, it can only be called relatively useful or relatively useless.

• Example:

1. Model A (a mathematical model of an aircraft) can represent the relative dimensions of the wing and fuselage.
2. Model B (an alternative model) can more accurately describe how air flows around the aircraft. (An alternative model is simply a different model of the same object, built for a different purpose.)

If we care about the proportions or the shape of the aircraft, we use model A; if we study the aircraft's behavior in flight, we choose model B instead.

Let's look at some of the most basic models used in data science. The purpose here is a general introduction to the common models, not depth; I'll talk about the details later. For now, just get a rough feel for what each one does.

1. Classification models (classifier)

A classifier is a function that separates data into several classes so that the class of future data can be predicted. Classifiers include decision trees, logistic regression, and the neural networks of deep learning; we will cover all of them. Of course, the picture here contains only men and women, so we can also call this a binary classifier. From the independent variables X (for example hair length, Adam's apple, clothing), the model or mathematical-logical equation y = f(X) predicts the dependent variable y (for example y = 0 for female, y = 1 for male).
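The idea of y = f(X) can be sketched as a tiny hard-threshold classifier. Everything below — the hair-length feature, the 20 cm threshold, the labels — is an invented illustration, not a real model:

```python
# A minimal, hypothetical binary classifier: predict gender (0 = female,
# 1 = male) from a single made-up feature, hair length in cm. The feature,
# threshold, and labels are illustrative assumptions only.

def classify(hair_length_cm, threshold=20.0):
    """Return 1 ("male") if hair is shorter than the threshold, else 0 ("female")."""
    return 1 if hair_length_cm < threshold else 0

samples = [5.0, 35.0, 12.0, 50.0]
predictions = [classify(x) for x in samples]
print(predictions)  # [1, 0, 1, 0]
```

A real classifier learns its decision rule from data instead of hard-coding the threshold, but the input-to-label shape y = f(X) is the same.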

2. Probability classifiers

A probability classifier is similar to a traditional classifier; the difference is that it uses probability to identify which category an input belongs to, i.e. it computes the probability of each category.
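A minimal sketch of that difference: instead of returning one hard label, a probability classifier returns a probability per category. The toy training data below is invented, and the "model" is just frequency counting:

```python
from collections import Counter

# Toy training pairs (feature -> label), invented for illustration.
train = [("short", "male"), ("short", "male"), ("short", "female"),
         ("long", "female"), ("long", "female"), ("long", "male")]

def predict_proba(feature):
    """Estimate P(label | feature) by simple frequency counting."""
    labels = [y for x, y in train if x == feature]
    counts = Counter(labels)
    total = len(labels)
    return {label: c / total for label, c in counts.items()}

probs = predict_proba("short")
print(probs["male"])    # 2/3 ≈ 0.667
print(probs["female"])  # 1/3 ≈ 0.333
```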

3. Regression models

From a series of data (independent variables X and dependent variables Y) you obtain an equation (the model); plugging independent variables into that equation (model) gives you the predicted value of the dependent variable. As in the picture: if everyone has an hourly wage, with the highest being 100 yuan, you can use the equation plus a person's job information to predict other people's hourly pay. The equation here is called the regression equation (linear regression, logistic regression, and penalized regression (ridge, lasso) will be explained later).

Why keep stressing X values and Y values? Because the X and Y values serve as your data for finding a model that fits or explains the data. Classifiers, probability classifiers, and regression models all need both X values and Y values to pin down a specific model. For example, suppose a set of data with corresponding X and Y values follows the function 2x + 1 = y. We all know this is a linear equation in one variable; to find a = 2 and b = 1 for this equation, we need data with matching X and Y values to determine the model's parameters. So we use the X and Y values of our data to find the specific model suited to explaining the data we have. In deep learning such methods are called supervised learning: the computer is used to find the model with the smallest prediction error.
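The 2x + 1 = y example can be checked directly: with noise-free (X, Y) data, the closed-form least-squares formulas recover a = 2 and b = 1. This is a sketch of the supervised-learning idea, not any particular library's implementation:

```python
# Recover the parameters of y = a*x + b from (X, Y) data, matching the
# 2x + 1 = y example in the text. With noise-free data the closed-form
# least-squares solution returns a = 2, b = 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]          # the "observed" dependent variable

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)   # slope
b = mean_y - a * mean_x                    # intercept
print(a, b)  # 2.0 1.0
```

With real, noisy data the recovered a and b would only approximate the true parameters; minimizing that prediction error is exactly what supervised learning automates.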

4. Clustering models

Clustering data generally has no dependent variable y (in deep learning the dependent variable y is also called a label, or the ground truth). In other words, a clustering model groups the data using only the independent variables x. Compare the classifiers mentioned earlier: unlike clustering, finding a specific classifier requires the dependent variable y. In deep learning, finding a model from both dependent and independent variables is called supervised learning, while finding a model from independent variables alone is called unsupervised learning.

For the clustering model, following the earlier example, we can group people by their features (the independent variables x, for example the Adam's apple). Because we have no y values, we cannot say whether a point is a man (predicted y = 1) or a woman (predicted y = 0). But we can plot the data: male points cluster in one region A, and female points cluster in another region B (hence the name "clustering"). Then, when predicting from the independent variables, if the corresponding point falls in A we take it to be male, and otherwise female. In deep learning, training a clustering model — finding the parameters and the corresponding model — is difficult; unsupervised learning is genuinely hard. Let's leave it there: the point of this review of common models is just to know what each one does. The details will come later.
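The "points cluster in region A, other points in region B" idea can be sketched with a minimal 1-D k-means. The data points and starting centroids below are made up for illustration:

```python
# A minimal 1-D k-means sketch of clustering: only the independent variable x
# is given (no labels y), and points are grouped by their nearest centroid.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids = [0.0, 10.0]   # arbitrary starting guesses

for _ in range(10):  # a few iterations are enough on this tiny example
    # Assignment step: each point joins the cluster of its nearest centroid.
    clusters = [[], []]
    for p in points:
        idx = min((abs(p - c), i) for i, c in enumerate(centroids))[1]
        clusters[idx].append(p)
    # Update step: each centroid moves to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 8.5] — the two "regions" the text describes
```

Note that the algorithm never sees a label; the two groups emerge purely from the structure of the x values.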

5. Forecasting models (Forecasting)
A forecasting model is very similar to the previous models: it also predicts. The difference is that the models above predict a single value, while a forecasting model focuses on predicting how a value, or a series of values, changes. Take house prices: what you care about is whether the price will rise or fall tomorrow, by how much, or the long-term range of the change. Of course, you can also use a classifier as your forecasting model, for example to decide whether to buy or sell. But the key point is: given past information, you want to be as accurate as possible about a range of future data.
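A minimal forecasting sketch: predict the next value of a series as the average of the last few observations. The "house price" series below is invented, and a moving average is only one of many possible forecasting models:

```python
# Forecast the next point of a series as the mean of its last k points.
prices = [100, 102, 101, 105, 107, 110]   # invented daily house-price series

def moving_average_forecast(series, k=3):
    """Forecast the next value as the mean of the last k observations."""
    window = series[-k:]
    return sum(window) / len(window)

next_price = moving_average_forecast(prices)
print(next_price)  # (105 + 107 + 110) / 3 ≈ 107.33
```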

6. Anomaly detection models (Anomaly Detection)
Anomaly detection finds abnormal data against a background of normal data. For example, a person usually transfers about 100 yuan a day; one night he suddenly spends or transfers 10,000 yuan, and the model flags that behavior as anomalous. Another example: a step-counting app used by an elderly person records how many meters they walk outside each day. If one day the app records that the elderly person did not go out to walk at all, the system may suspect that they are ill or have fallen.
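The transfer example can be sketched with the simplest possible detector: flag an amount as anomalous when it lies far from the historical mean. The transfer history and the 3-standard-deviation threshold are illustrative assumptions:

```python
import statistics

# Invented history of a user's usual daily transfers (around 100 yuan).
history = [100, 98, 102, 101, 99, 100, 103, 97]
mean = statistics.mean(history)
std = statistics.stdev(history)

def is_anomalous(amount, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    return abs(amount - mean) > threshold * std

print(is_anomalous(100))    # False: an ordinary transfer
print(is_anomalous(10000))  # True: the sudden 10,000-yuan transfer
```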

7. Recommender system models (Recommender system)
For example, when you visit a shopping site such as Taobao, the model predicts your preferences from what you usually look at, and then shows you related products.

These are some of the most basic and common models, introduced just to get acquainted; more in-depth material will be discussed later.

Here are some common statistical terms:

1. Population (population): put plainly, for the thing you are studying you have a large amount of relevant, measurable data from which to find your model. For our purposes the amount of such data is effectively infinite. All of this data is called the population.
2. Sample (sample): the data picked out of the population (whose supply is inexhaustible) in order to find your model. Too little data can leave the model you find underfitted, i.e. not accurate enough to predict new data; a model that clings too closely to the particular data you selected is overfitted, i.e. it is limited to "predicting" the data it was fitted on and cannot generalize to new data. This is only a brief sketch of the two concepts; there are many causes of underfitting and overfitting, which we set aside for now. In short, the sample is the data selected from the population.
3. Model (model): one more wordy sentence here. Philosophically, a model is an interpretation of the data (though there are many models we cannot interpret); computationally, a model is a set of mathematical and logical expressions. We usually use data to find a model, and a model is neither right nor wrong, only relatively useful or useless. When you have a pile of data and want a model that can interpret and predict data of that kind, there are two problems to solve: 1. what the model looks like (its form), and 2. what its parameters are — in other words, finding, within that form, the specific model suited to your data. Once both problems are solved, you can say you have found the model. Take the earlier 2x + 1 = y with a = 2, b = 1: once you know the model is a linear equation in one variable, the two parameters a and b can be found from your data. How did we know the model should be a linear equation in the first place? Early on, mathematicians tried to interpret all types of data with generalized functions, approximately simulating them by varying the parameters or exponents of such functions — for example generalized functions, additive models, and polynomial functions. Using data to find a model is, broadly speaking, what the legendary machine learning does. In deep learning, the model is a neural network (you can think of it as a universal, generalized model that can interpret or simulate many different types of data — it can predict house prices, sort waste, and handle other quite different projects). When the data has both X values and Y values, finding a model from it is called supervised learning; when the data has only X values and no Y values, finding a model from it is called unsupervised learning.
People who do data mining generally know that collecting and recording data costs a great deal of labor and money, so data with only X values is very common. In that case you need to exploit relationships within the data, or transform the data, to get something more. One example is the male/female clustering model above. Another: suppose you are given many bird photos (X values) but are not told what kind of bird each one is (no Y values). We can rotate the photos to create new data — and thereby create Y values — so that the model learns to tell whether a photo has been transformed or flipped. That is transforming the data. This trick can also mitigate overfitting; why will be explained later.

Terms for the form of data:

1. Nominal categorical data (categorical nominal data): discrete, finite, unordered values. Examples: gender, nationality.
2. Ordinal categorical data (categorical ordinal data): discrete, finite, ordered values. Example: education level (primary, junior secondary, senior secondary, university).
3. Discrete numeric data (numeric discrete): data in numeric form, enumerable, countable, finite in number. Examples: some number of positive integers, some number of negative integers, how many employees in the company are under 18.
4. Continuous numeric data (numeric continuous): data in numeric form, uncountable and infinite. Examples: the real numbers greater than 0, the real numbers less than 0, height (1.789321 m), weight, length.

Categorical data are generally qualitative, while numeric data are generally quantitative.

# Three. Random Variables and Probability Distributions

Random sampling from the population gives us the sample — that is our data — and we use this sample (data) to find a suitable model.
1. Random variables (Random Variables):
Suppose we roll two dice and want the probability that their sum is 7. What we actually care about is whether the two dice sum to 7; which particular pair it is — (1,6), (2,5), (3,4), (5,2), (6,1), (4,3) — is not the purpose of the experiment. Those pairs (1,6), (2,5), (3,4), (5,2), (6,1), (4,3) are the outcomes of our random variable. Let's compute the probability that the two dice sum to 7: P{X=7} = P{(1,6),(2,5),(3,4),(5,2),(6,1),(4,3)} = 1/6 × 1/6 × 6 = 6/36.
X = 7 is the purpose of our experiment, and it is what determines that the relevant outcomes are (1,6), (2,5), (3,4), (5,2), (6,1), (4,3).
When you roll two dice and ask for the probability that the sum is 12, i.e. X = 12, then P{X=12} = P{(6,6)} = 1/36, and the only relevant outcome is (6,6).
It is the purpose of our experiment that determines the random variable, not the random variable that determines the purpose of the experiment. So why are these variables called random? We need to say what randomness is. The randomness of a variable comes from three sources: measurement error in the experiment (the measured values carry errors), unmeasured factors (some values vary not because of error but because some factor was overlooked), and random sampling (drawing at random from the population). A variable with this kind of randomness is a random variable.
In discrete and continuous mathematics, put simply, the random variables are our data sample, and these random variables may obey, or fit, some probability distribution (a model).
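The two hand calculations above can be verified by enumerating all 36 equally likely outcomes of two fair dice:

```python
from itertools import product
from fractions import Fraction

# All (die1, die2) pairs: 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# P(X = 7): outcomes whose sum is 7.
favourable = [o for o in outcomes if sum(o) == 7]
prob = Fraction(len(favourable), len(outcomes))
print(favourable)  # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
print(prob)        # 1/6  (i.e. 6/36)

# P(X = 12): the single favourable outcome (6, 6).
prob12 = Fraction(sum(1 for o in outcomes if sum(o) == 12), len(outcomes))
print(prob12)      # 1/36
```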

2. Probability distributions (Probability distribution)
As mentioned above, a probability distribution is a model that explains our data, i.e. our random variables. It is written P(X = x), x ∈ X, where X corresponds to the population and x to the sample.
For example, the probability of rolling a 1 with one die is 1/6, i.e. P(X=1) = 1/6. Its probability distribution is: along the x axis, x ∈ X = {1,2,3,4,5,6}; the corresponding value P(X) on the y axis is 1/6 for each — six discrete points.

Property 1:
P(X=x) ∈ [0,1] for all x ∈ X, satisfying:

$$\sum_{x\in X} P(X=x) = 1$$

Property 2:

$$P(X\in A_1\cup A_2)=P(X\in A_1)+P(X\in A_2)-P(X\in A_1\cap A_2)$$

∪ is union (union set) and ∩ is intersection (intersection set); sets are a high-school mathematics concept, so I won't repeat them here.

Property 3:
Joint probability (joint probability)
When we study 2 or more random variables, we compute their joint probability.
Suppose we have two random variables (RVs), X and Y, with
X = {1,2,3} and Y = {1,2}. Then
X × Y = {(1,1),(2,1),(3,1),(1,2),(2,2),(3,2)}, and we can define P(X=x, Y=y) ∈ [0,1].
For all x ∈ X and y ∈ Y, P(X=x, Y=y) ∈ [0,1] satisfies:

$$\sum_{x\in X,\;y\in Y} P(X=x,\,Y=y) = 1$$

If the two variables X and Y do not influence each other, then

$$P(X=x,\,Y=y) = P(X=x)\,P(Y=y)$$

This is the same calculation as our two dice summing to 7 (each throw produces a number, and the probabilities of the two events do not affect each other): the probability that X = x and Y = y happen together without interfering with each other.

If X and Y (two kinds of events) do interfere with each other, we must compute according to the problem at hand. For example, call die A the variable X and die B the variable Y. We still want the probability that the two dice sum to 7, but with one small condition: die B is only rolled when die A shows 1, 2, or 3. Then only (1,6), (2,5), (3,4) remain, so P(X=x, Y=y) = 3 × 1/6 × 1/6 = 3/36 = 1/12. Here P(X ∈ {1,2,3}) = 3 × 1/6, and each allowed value of X pairs with exactly one value of Y of probability 1/6 — no longer six favourable pairs as in the unconditional case.
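The restricted-dice calculation can be checked by enumeration as well: keep only outcomes where die A shows 1, 2, or 3 and the sum is 7.

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs

# Outcomes where die A is in {1, 2, 3} AND the sum is 7.
joint = [(a, b) for a, b in outcomes if a in {1, 2, 3} and a + b == 7]
prob = Fraction(len(joint), len(outcomes))
print(joint)  # [(1, 6), (2, 5), (3, 4)]
print(prob)   # 1/12  (i.e. 3/36)
```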

Property 4:
Marginal probability (marginal probability)

$$P(X=x)= \sum_{y\in Y} P(X=x,\,Y=y)$$

P(X=x) is called the marginal probability: the probability that X = x, summed over all values Y = y.

Property 5:
Conditional probability (conditional probability)

$$P(X=x \mid Y=y)= \frac{P(X=x,\,Y=y)}{P(Y=y)} = \frac{P(X=x,\,Y=y)}{\sum_{x\in X} P(X=x,\,Y=y)}$$

P(X=x | Y=y) is the conditional probability: given that Y = y, the probability that X = x.
Note that if the two random variables are also independent of each other, then

$$P(X=x \mid Y=y)= P(X=x).$$
For example, take a joint-probability table (shown as a figure in the original post):

Don't get hung up on why P(X=1, Y=1) is 0.05; we take these values as given, i.e. P(X=1,Y=1)=0.05, P(X=2,Y=1)=0.15, P(X=3,Y=1)=0.1, and so on.
Then:
P(Y=1) = 0.05 + 0.15 + 0.1 = 0.3 (marginal probability).
P(X=1|Y=1) = P(X=1,Y=1)/P(Y=1) = 0.05/0.3 ≈ 0.167 (conditional probability).
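The marginal and conditional probabilities can be reproduced from the joint table. Only the Y = 1 row is given in the text; the Y = 2 row below is an assumed completion, chosen only so that all entries sum to 1:

```python
# Joint-probability table P(X=x, Y=y).
joint = {
    (1, 1): 0.05, (2, 1): 0.15, (3, 1): 0.10,   # from the text
    (1, 2): 0.20, (2, 2): 0.30, (3, 2): 0.20,   # assumed for illustration
}

# Marginal probability: sum the joint over all x with y fixed at 1.
p_y1 = sum(p for (x, y), p in joint.items() if y == 1)

# Conditional probability: joint divided by the marginal.
p_x1_given_y1 = joint[(1, 1)] / p_y1

print(round(p_y1, 2))           # 0.3
print(round(p_x1_given_y1, 4))  # 0.1667
```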

When random variables follow the same distribution and do not interfere with each other, we call them independent and identically distributed (independent and identically distributed, abbreviated i.i.d.). Take the two dice above: the probability distribution of the first throw is the same as that of the second, and the random variables X and Y are independent of each other, so X and Y are i.i.d., with P(X=1) = P(Y=1) = 1/6.
That is, if X1 and X2 are i.i.d., then P(X1=x) = P(X2=x) for all x ∈ X. Put another way, they have the same marginal probability.

3. Continuous random variables (continuous random variables)
The random variables discussed so far were discrete; a die, for example, takes only the six positive-integer values 1,2,3,4,5,6. Now let the random variable take real values, i.e. X follows a probability density function (probability density function, abbreviated pdf) p(x).
Then the pdf satisfies, for all x ∈ X:

$$p(x) \ge 0, \quad\text{and}\quad \int_{X} p(x)\,\mathrm{d}x = 1$$

(Note that a density value p(x) may itself exceed 1; only probabilities, i.e. areas under p, are bounded by 1.)
If X lies in the interval (a,b), then:

$$P(a<X<b)=\int_{a}^{b}p(x)\,\mathrm{d}x$$
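Both pdf properties can be checked numerically for a simple density, p(x) = 2x on (0, 1): the total integral is 1, and P(a < X < b) is the area under p between a and b. A plain midpoint Riemann sum stands in for the integral here:

```python
def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

p = lambda x: 2 * x   # a valid pdf on (0, 1); note p(x) > 1 for x > 0.5

total = integrate(p, 0.0, 1.0)
prob = integrate(p, 0.0, 0.5)   # P(0 < X < 0.5), analytically 0.5**2 = 0.25
print(round(total, 6))  # 1.0
print(round(prob, 6))   # 0.25
```

This density also illustrates the note above: p(0.9) = 1.8 > 1, yet every probability computed from it stays in [0, 1].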

If x₀ − δ/2 < X < x₀ + δ/2, the probability is the shaded area in the figure:

This is just the calculus of a continuous function — compute the shaded area: the area accumulated up to x₀ + δ/2 minus the area accumulated up to x₀ − δ/2. We can write it more neatly:

$$A_{\delta}=(x_0-\delta/2,\;x_0+\delta/2)$$

$$P(X\in A_{\delta})=\int_{x_0-\delta/2}^{x_0+\delta/2}p(x)\,\mathrm{d}x=\Big[\int p(x)\,\mathrm{d}x\Big]_{x=x_0+\delta/2}-\Big[\int p(x)\,\mathrm{d}x\Big]_{x=x_0-\delta/2} \approx p(x_0)\,\delta$$

So as δ tends to 0:

1. A_δ shrinks to x₀.
2. P(X ∈ A_δ) tends to 0 — the shaded sliver becomes vanishingly thin, because dx is infinitesimally small. When A_δ = {x₀}, we get $P(X=x_0)=\int_{x_0}^{x_0}p(x)\,\mathrm{d}x=0$: the probability of any single value is 0.

Two continuous random variables X and Y:
1. Marginal probability
The marginal density of x is

$$p(x)=\int p(x,y)\,\mathrm{d}y$$

and $P(X\in A)= \int_{A}\int p(x,y)\,\mathrm{d}y\,\mathrm{d}x$. Once we have worked out p(x), we have the distribution of x; since x ∈ A, we still use the antiderivative to find the area corresponding to A.

2. Joint probability
The joint probability of two continuous variables X and Y:
if X and Y are independent, then

$$P(X\in A,\,Y\in B)=\int_{B}\int_{A}p(x,y)\,\mathrm{d}x\,\mathrm{d}y$$

Likewise, if X1, X2, …, Xn are mutually independent,

$$P(X_1\in A_1, X_2\in A_2, \dots, X_n\in A_n)=\int_{A_n}\int_{A_{n-1}}\cdots\int_{A_1}p(x_1,x_2,\dots,x_n)\,\mathrm{d}x_1\,\mathrm{d}x_2\cdots\mathrm{d}x_n$$

If X and Y are not independent, the calculation depends on the problem at hand, analogous to the joint probability of discrete variables.

3. Conditional probability

$$p(x \mid y)= \frac{p(x,y)}{p(y)}=\frac{p(x,y)}{\int p(x,y)\,\mathrm{d}x}$$

The denominator of the formula is the marginal density of y.
If X and Y are independent, then:

$$p(x \mid y)=p(x)= \int p(x,y)\,\mathrm{d}y$$

4. Cumulative distribution functions (Cumulative distribution functions, abbreviated cdf)
In fact we have already been dealing with accumulation: it is the antiderivative (the shaded area) of a continuous function, or the sum of a number of terms.

For a continuous variable the cdf is:

$$P(X\le x)=\int_{-\infty}^{x} p(t)\,\mathrm{d}t$$

For a discrete variable the cumulative function is:

$$P(X\le x)= \sum_{k\le x} P(X=k)$$

One more wordy sentence here:

$$P(X>x)=1-P(X\le x)$$

In statistics, the quantile function Q(p) = {x : P(X ≤ x) = p}: Q(1/2) is the median (median), Q(1/4) is the first quartile (first quartile), and Q(3/4) is the third quartile (third quartile).
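The cdf and quantiles can be read off directly for the fair-die example. For a discrete variable the practical convention below takes Q(p) as the smallest x with P(X ≤ x) ≥ p (since the cdf jumps, an exact equality may not exist):

```python
from fractions import Fraction

# pmf of a fair die: P(X = k) = 1/6 for k = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """P(X <= x): sum the pmf up to and including x."""
    return sum(p for k, p in pmf.items() if k <= x)

def Q(p):
    """Quantile: smallest x whose cumulative probability reaches p."""
    return min(x for x in pmf if cdf(x) >= p)

print(Q(Fraction(1, 4)))  # 2  (first quartile: cdf(2) = 2/6 >= 1/4)
print(Q(Fraction(1, 2)))  # 3  (median: cdf(3) = 1/2)
print(Q(Fraction(3, 4)))  # 5  (third quartile: cdf(5) = 5/6 >= 3/4)
```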

# Four. Conclusion

Students studying on their own can read Chapter 4 of Ross, S.M. (2014) Introduction to Probability and Statistics for Engineers and Scientists, 5th ed., Academic Press.

https://chowdera.com/2020/12/20201220091429820u.html