As a data person , Is it often surrounded by various nouns , Do you have a vague understanding of many of these concepts . Some words are only one word away , But they mean completely different , Today, let's learn about some common concepts and their relationships in data warehouse construction and data analysis .
The structure of this paper is shown in the figure below ：
One 、 Analysis of common concepts in data warehouse
Entity refers to the subject attached , Is an object we analyze , For example, we analyze the sales of goods , For example, what is the sales volume of Huawei's mobile phones in the past six months , Huawei mobile phone is an entity ; We analyze user activity , A user is an entity . Of course, entities can also exist in reality , For example, virtual business objects , Activities , Members, etc. can be regarded as an entity .
Entities exist for business analysis , As a filtered dimension of analysis , Have properties that describe yourself , It has analyzable value .
Dimension is the angle of view , Analyze business data , From what angle , What kind of dimension is established . Therefore, dimension is a quantity used to analyze data , For example, you need to analyze the sales of products , You can choose to analyze by commodity category , This constitutes a dimension , Put all commodity categories together , It constitutes a dimension table .
A metric is a numeric value on a business process node . Like sales , Price , Cost, etc. .
The measures in the fact table can be divided into three categories ： Completely additive , Half can add , Do not add .
Completely additive metrics are the most flexible , The most useful , For example, sales , Sales, etc , You can summarize any dimension ;
Semi additive measures can summarize some dimensions , But you can't aggregate all dimensions , Difference is a common semi additive measure , In addition to the time dimension , You can add across all dimensions ;
The other is completely non additive , for example ： ratio . For this kind of non additive metric , A good way is , Store as many fully additive components as possible that are not additive , And before calculating the final non additive fact , Summarize these components into the final result set .
4. Particle size
Granularity is the unit of measurement in a business process , For example, goods are measured by piece record , Or is it measured by batch records .
In the construction of digital warehouse , We say this is a fact table of user granularity , Then each row of data in the table is a user , No duplicate users ; For example, there is a table of sales granularity , Then each row in the table is a sales record .
Choosing an appropriate granularity level is an important key content for the construction of data warehouse , When designing data granularity , Generally, the following factors need to be considered ：
Type of analysis to accept 、 The minimum granularity of acceptable data and the amount of data that can be stored ;
The higher the level of granularity is defined , The less you can do a more detailed analysis in the warehouse ;
If storage resources are limited , You can only use higher data granularity ;
The data granularity partition strategy must ensure ： The granularity of data can really meet the needs of users' decision analysis , This is the most important criterion in the data granularity partition strategy .
Dimension is data access logic （ How to access data ）, such as Number to fetch yes 10 The average height of boys under the age of , This is the statistical caliber .
An indicator is a measure of caliber , That is the final result . For example, the order volume in the last seven days , Purchase conversion rate of a promotion, etc .
An indicator is specific to the calculation and implementation , It mainly consists of the following parts ：
Index processing logic , such as count ,sum, avg
dimension , For example, by Department 、 Regional index statistics , Corresponding sql Medium group by
Business limits / Modifiers , For example, different payment channels are used to calculate the corresponding indicators , Order refund rate paid by wechat , Order refund rate paid by Alipay . Corresponding sql Medium where.
besides , Indicators themselves can also be derived 、 Derive more indicators , Based on these characteristics , Indicators can be classified ：
- Atomic index ： Basic business facts , No business restrictions 、 There is no dimension . For example, the order quantity in the order table 、 The total amount of the order is an atomic indicator ;
Indicators more concerned by business parties , It has practical business meaning , You can take the indicators of data directly . For example, the store is near 1 The daily order payment amount is a derivative indicator , It will be displayed directly on the products to the merchants .
But this index can't be taken directly from the unified middle layer of the data warehouse （ Because there are no existing fact fields , Generally, they provide wide watches ）. There needs to be a bridge between the middle tier of the warehouse and the business side's index requirements , So there are derived indicators
- Derived indicators ： dimension + Modifiers + Atomic index . The shop is near 1 The store is the dimension of the payment amount of orders per day , near 1 Day is a modifier of time type , Payment amount is an atomic indicator ;
dimension ： Observe the angle of each index ;
Modifiers ： One or more values of the dimension , For example, under the dimension of gender , Men and women are 2 A modifier .
- Derivatives ： For example, the conversion rate of a promotion activity is a derivative indicator , Because of the need Promotion number index and Promotion order quantity index To calculate .
Labels are set artificially 、 According to business scenario requirements , The highly refined feature identification obtained by using a certain algorithm for the target object . It can be seen that the label is the result of artificial reprocessing , Ruwanghong 、 Bai Fu beauty 、 lolita . For ambiguous labels , We can distinguish labels internally , such as ： Apple , We can define apple as fruit , Apple phone means mobile phone .
8. Natural bond
A key consisting of attributes that already exist in reality , It is unique in the business concept , And has certain business implications , For example, commodities ID, staff ID.
From a warehouse Perspective , Identifiers from business systems are natural keys , For example, the employee number in the business library .
9. Persistent key
Remain permanent and will not change . Sometimes called a supernatural persistent bond . For example, ID number is persistent key. .
Difference between natural key and persistent key ： For example, I see , For example, the company's employees leave their jobs and rejoin them , His natural key, employee number, has changed , But his persistent key ID number is unchanged .
10. Surrogate key
Is a key that has no business meaning . Proxy keys have many other names ： Meaningless key 、 Integer key 、 Unnatural bond 、 Manual key 、 Synthetic bond, etc .
Proxy keys are simply expressed as integers produced in a sequential sequence . Page of product line 1 The row proxy key is 1, Then the proxy key in the next line is 2, Do so . The proxy key is only used to connect the dimension table and the fact table .
11. Degenerate dimension
Degenerate dimension , It's a dimension keyword that looks like a fact table , But in fact, there is no corresponding dimension table , That is, dimension attributes are stored in the fact table , This dimension column stored in the fact table is called degenerate dimension . Like other dimensions stored in dimension tables , The degenerate dimension can also be used to filter and query the fact table 、 Realize aggregation operation, etc .
So how to define the degenerate dimension ？ For example, orders id, This dimension is of great magnitude , There is no need to use a dimension table to store , When we query or filter data, we need , So this redundancy is in the fact table , This is called degenerate dimension ,citycode We will also be redundant in the fact table , however It has a corresponding dimension table , So it's not a degenerate dimension .
12. Run in
This is a common concept in data analysis , Drilling down can be understood as increasing the level of dimension , So that we can Observe data from coarse-grained to fine-grained , For example, when analyzing product sales , You can observe more fine-grained data from year to month to day along the time dimension . You can drill down from the year dimension to the month dimension 、 Dimension of day, etc .
13. Scroll up
I see. Run in , The first volume is easy to understand , The two are opposite operations , therefore Volume up can be understood as deleting some layers of dimension , The operation of observing data from fine-grained to coarse-grained or aggregating and summarizing data upward along the dimension level .
Two 、 Relationship between data warehouse concepts
1. Entity table , Fact table , Relationship between dimension tables
stay Kimball There are dimensions and facts in dimension modeling , stay Inmon There are entities and relationships in paradigm modeling , If we separate the two modeling methods, these concepts are easier to understand . However, there are many hybrid modeling methods , The two modeling methods are combined , These concepts are not easy to remember and confused , Especially the fact table and entity table , What are the differences and connections between them , Let's look at their respective concepts first ：
Dimension table ： The dimension table can be seen as a window for users to analyze a fact , The data in it should be a description of all aspects of the facts , For example, time dimension table , Regional dimension table , Dimension table is an analytical perspective of fact table .
Fact table ： In fact, a fact table determines a fact through the combination of various dimensions and some index values , For example, through the time dimension , Regional organization dimension , The fact that the index value can be used to determine the value of some indexes at a certain time and place . Each piece of data in the fact table is obtained by the intersection of the data and indicator values of several dimension tables .
Entity table ： An entity table is a table of actual objects , The data in the entity table must be the objective thing data , For example, all kinds of goods , It exists objectively , So you can design it as an entity table . Real time tables only describe things , There are no concrete facts , Therefore, some people call the entity table a fact table without facts .
for instance ： For example, there are apple phones in the mobile phone Mall , Huawei mobile phones and other mobile phones of various brands and models , These data can form a Mobile entity table , But there is no measurable data in the table . One day the Apple phone sold 15 platform , Huawei mobile phones sold 20 platform , These mobile phone sales data are facts , Form a Fact table . So you can use it Date dimension table and Regional dimension table Carry out various dimensional analysis of this fact table .
2. Difference between indicators and labels
- Different concepts
indicators Is used to define 、 A standard or way of evaluating and describing a particular thing . such as ： Number of new users 、 Cumulative users 、 User activity rate is an indicator to measure user development ;
label It's artificial 、 According to business scenario requirements , The highly refined feature identification obtained by using a certain algorithm for the target object . It can be seen that the label is the result of artificial reprocessing , Ruwanghong 、 Bai Fu beauty 、 lolita .
- Different composition
Index name It is the naming of the characteristics of matter matter matter and quantity ; The indicator value is the indicator at a specific time 、 regional 、 Quantitative performance under , Such as human weight , The indicator name is body weight , The value of the indicator is 120 Jin ;
Tag name Usually adjectives or adjectives + The structure of nouns , Labels are generally not quantifiable , Usually isolated , In addition to the basic class label , Labels processed by certain algorithms generally have no units and dimensions . If will exceed 200 Those who weigh five kilograms are called fat people .
- Different categories
Classification of indicators ：
Calculation logic according to indicators , Indicators can be divided into atomic indicators 、 Derived indicators 、 There are three types of derivative indicators ;
According to different descriptions of events , It is divided into process indicators and outcome indicators ;
Classification of labels ：
According to the variability of labels, they are divided into static labels and dynamic labels ;
According to the label reference and evaluation indicators , It can be divided into qualitative label and quantitative label ;
indicators The best application is monitoring 、 analysis 、 Evaluation and modeling .
label The best application is annotation 、 Characterization 、 Classification and feature extraction .
In particular, it should be noted that , Because the annotation of the result is also a label , So in the application scenario of algorithms related to natural language processing and machine learning , Tags are of great value for supervised learning , Just a simple indicator is difficult to achieve . The indicators are in task allocation 、 Role of performance management , The label can't do it .
3. Differences and relationships between dimensions and indicators
Dimension is the observation angle of data , That is, from which angle to analyze the problem , Look at problems .
The indicator is to calculate the value of this result on the basis of dimension .
A dimension is generally a discrete value , For example, each independent date or region in the time dimension , Therefore, in statistics , You can aggregate records with the same dimension , Apply aggregate function to accumulate 、 mean value 、 Maximum 、 Minimum value and other aggregate calculation .
Indicators are aggregated general calculations , That is, the result of aggregation operation , It is generally a continuous value .
4. The difference between the use of natural keys and proxy keys in the data warehouse
The data warehouse toolbox says The unique primary key of a dimension table should be a proxy key, not a natural key . Sometimes modelers are reluctant to give up using natural keys , Because they want to query the fact table with operational code , You don't want to connect with the dimension table . However , Multidimensional keys with business implications should be avoided , Because whatever assumptions we make may eventually become invalid , Because we can't control the change of Business Library .
Therefore, each connection between dimension table and fact table in data warehouse should be based on integer proxy key with no actual meaning . Avoid using natural keys as the primary keys of dimension tables .