当前位置:网站首页>Analysis and relationship of various nouns in data warehouse, such as indicator label, dimension measure, natural key agent key, etc

Analysis and relationship of various nouns in data warehouse, such as indicator label, dimension measure, natural key agent key, etc

2021-09-15 08:26:41 Learn big data in five minutes

As a data person , Is it often surrounded by various nouns , Do you have a vague understanding of many of these concepts . Some words are only one word away , But they mean completely different , Today, let's learn about some common concepts and their relationships in data warehouse construction and data analysis .

The structure of this paper is shown in the figure below :

One 、 Analysis of common concepts in data warehouse

1. Entity

Entity refers to the subject attached , Is an object we analyze , For example, we analyze the sales of goods , For example, what is the sales volume of Huawei's mobile phones in the past six months , Huawei mobile phone is an entity ; We analyze user activity , A user is an entity . Of course, entities can also exist in reality , For example, virtual business objects , Activities , Members, etc. can be regarded as an entity .

Entities exist for business analysis , As a filtered dimension of analysis , Have properties that describe yourself , It has analyzable value .

2. dimension

Dimension is the angle of view , Analyze business data , From what angle , What kind of dimension is established . Therefore, dimension is a quantity used to analyze data , For example, you need to analyze the sales of products , You can choose to analyze by commodity category , This constitutes a dimension , Put all commodity categories together , It constitutes a dimension table .

3. Measure

A metric is a numeric value on a business process node . Like sales , Price , Cost, etc. .

The measures in the fact table can be divided into three categories : Completely additive , Half can add , Do not add .

  1. Completely additive metrics are the most flexible , The most useful , For example, sales , Sales, etc , You can summarize any dimension ;

  2. Semi additive measures can summarize some dimensions , But you can't aggregate all dimensions , Difference is a common semi additive measure , In addition to the time dimension , You can add across all dimensions ;

  3. The other is completely non additive , for example : ratio . For this kind of non additive metric , A good way is , Store as many fully additive components as possible that are not additive , And before calculating the final non additive fact , Summarize these components into the final result set .

4. Particle size

Granularity is the unit of measurement in a business process , For example, goods are measured by piece record , Or is it measured by batch records .

In the construction of digital warehouse , We say this is a fact table of user granularity , Then each row of data in the table is a user , No duplicate users ; For example, there is a table of sales granularity , Then each row in the table is a sales record .

Choosing an appropriate granularity level is an important key content for the construction of data warehouse , When designing data granularity , Generally, the following factors need to be considered :

  1. Type of analysis to accept 、 The minimum granularity of acceptable data and the amount of data that can be stored ;

  2. The higher the level of granularity is defined , The less you can do a more detailed analysis in the warehouse ;

  3. If storage resources are limited , You can only use higher data granularity ;

  4. The data granularity partition strategy must ensure : The granularity of data can really meet the needs of users' decision analysis , This is the most important criterion in the data granularity partition strategy .

5. CAL

Dimension is data access logic ( How to access data ), such as Number to fetch yes 10 The average height of boys under the age of , This is the statistical caliber .

6. indicators

An indicator is a measure of caliber , That is the final result . For example, the order volume in the last seven days , Purchase conversion rate of a promotion, etc .

An indicator is specific to the calculation and implementation , It mainly consists of the following parts :

  • Index processing logic , such as count ,sum, avg

  • dimension , For example, by Department 、 Regional index statistics , Corresponding sql Medium group by

  • Business limits / Modifiers , For example, different payment channels are used to calculate the corresponding indicators , Order refund rate paid by wechat , Order refund rate paid by Alipay . Corresponding sql Medium where.

besides , Indicators themselves can also be derived 、 Derive more indicators , Based on these characteristics , Indicators can be classified :

  • Atomic index : Basic business facts , No business restrictions 、 There is no dimension . For example, the order quantity in the order table 、 The total amount of the order is an atomic indicator ;

Indicators more concerned by business parties , It has practical business meaning , You can take the indicators of data directly . For example, the store is near 1 The daily order payment amount is a derivative indicator , It will be displayed directly on the products to the merchants .

But this index can't be taken directly from the unified middle layer of the data warehouse ( Because there are no existing fact fields , Generally, they provide wide watches ). There needs to be a bridge between the middle tier of the warehouse and the business side's index requirements , So there are derived indicators

  • Derived indicators dimension + Modifiers + Atomic index . The shop is near 1 The store is the dimension of the payment amount of orders per day , near 1 Day is a modifier of time type , Payment amount is an atomic indicator ;

dimension : Observe the angle of each index ;

Modifiers : One or more values of the dimension , For example, under the dimension of gender , Men and women are 2 A modifier .

  • Derivatives : For example, the conversion rate of a promotion activity is a derivative indicator , Because of the need Promotion number index and Promotion order quantity index To calculate .

7. label

Labels are set artificially 、 According to business scenario requirements , The highly refined feature identification obtained by using a certain algorithm for the target object . It can be seen that the label is the result of artificial reprocessing , Ruwanghong 、 Bai Fu beauty 、 lolita . For ambiguous labels , We can distinguish labels internally , such as : Apple , We can define apple as fruit , Apple phone means mobile phone .

8. Natural bond

A key consisting of attributes that already exist in reality , It is unique in the business concept , And has certain business implications , For example, commodities ID, staff ID.

From a warehouse Perspective , Identifiers from business systems are natural keys , For example, the employee number in the business library .

9. Persistent key

Remain permanent and will not change . Sometimes called a supernatural persistent bond . For example, ID number is persistent key. .

Difference between natural key and persistent key : For example, I see , For example, the company's employees leave their jobs and rejoin them , His natural key, employee number, has changed , But his persistent key ID number is unchanged .

10. Surrogate key

Is a key that has no business meaning . Proxy keys have many other names : Meaningless key 、 Integer key 、 Unnatural bond 、 Manual key 、 Synthetic bond, etc .

Proxy keys are simply expressed as integers produced in a sequential sequence . Page of product line 1 The row proxy key is 1, Then the proxy key in the next line is 2, Do so . The proxy key is only used to connect the dimension table and the fact table .

11. Degenerate dimension

Degenerate dimension , It's a dimension keyword that looks like a fact table , But in fact, there is no corresponding dimension table , That is, dimension attributes are stored in the fact table , This dimension column stored in the fact table is called degenerate dimension . Like other dimensions stored in dimension tables , The degenerate dimension can also be used to filter and query the fact table 、 Realize aggregation operation, etc .

So how to define the degenerate dimension ? For example, orders id, This dimension is of great magnitude , There is no need to use a dimension table to store , When we query or filter data, we need , So this redundancy is in the fact table , This is called degenerate dimension ,citycode We will also be redundant in the fact table , however It has a corresponding dimension table , So it's not a degenerate dimension .

12. Run in

This is a common concept in data analysis , Drilling down can be understood as increasing the level of dimension , So that we can Observe data from coarse-grained to fine-grained , For example, when analyzing product sales , You can observe more fine-grained data from year to month to day along the time dimension . You can drill down from the year dimension to the month dimension 、 Dimension of day, etc .

13. Scroll up

I see. Run in , The first volume is easy to understand , The two are opposite operations , therefore Volume up can be understood as deleting some layers of dimension , The operation of observing data from fine-grained to coarse-grained or aggregating and summarizing data upward along the dimension level .

Two 、 Relationship between data warehouse concepts

1. Entity table , Fact table , Relationship between dimension tables

stay Kimball There are dimensions and facts in dimension modeling , stay Inmon There are entities and relationships in paradigm modeling , If we separate the two modeling methods, these concepts are easier to understand . However, there are many hybrid modeling methods , The two modeling methods are combined , These concepts are not easy to remember and confused , Especially the fact table and entity table , What are the differences and connections between them , Let's look at their respective concepts first :

  1. Dimension table : The dimension table can be seen as a window for users to analyze a fact , The data in it should be a description of all aspects of the facts , For example, time dimension table , Regional dimension table , Dimension table is an analytical perspective of fact table .

  2. Fact table : In fact, a fact table determines a fact through the combination of various dimensions and some index values , For example, through the time dimension , Regional organization dimension , The fact that the index value can be used to determine the value of some indexes at a certain time and place . Each piece of data in the fact table is obtained by the intersection of the data and indicator values of several dimension tables .

  3. Entity table : An entity table is a table of actual objects , The data in the entity table must be the objective thing data , For example, all kinds of goods , It exists objectively , So you can design it as an entity table . Real time tables only describe things , There are no concrete facts , Therefore, some people call the entity table a fact table without facts .

for instance : For example, there are apple phones in the mobile phone Mall , Huawei mobile phones and other mobile phones of various brands and models , These data can form a Mobile entity table , But there is no measurable data in the table . One day the Apple phone sold 15 platform , Huawei mobile phones sold 20 platform , These mobile phone sales data are facts , Form a Fact table . So you can use it Date dimension table and Regional dimension table Carry out various dimensional analysis of this fact table .

2. Difference between indicators and labels

  • Different concepts

indicators Is used to define 、 A standard or way of evaluating and describing a particular thing . such as : Number of new users 、 Cumulative users 、 User activity rate is an indicator to measure user development ;

label It's artificial 、 According to business scenario requirements , The highly refined feature identification obtained by using a certain algorithm for the target object . It can be seen that the label is the result of artificial reprocessing , Ruwanghong 、 Bai Fu beauty 、 lolita .

  • Different composition

Index name It is the naming of the characteristics of matter matter matter and quantity ; The indicator value is the indicator at a specific time 、 regional 、 Quantitative performance under , Such as human weight , The indicator name is body weight , The value of the indicator is 120 Jin ;

Tag name Usually adjectives or adjectives + The structure of nouns , Labels are generally not quantifiable , Usually isolated , In addition to the basic class label , Labels processed by certain algorithms generally have no units and dimensions . If will exceed 200 Those who weigh five kilograms are called fat people .

  • Different categories

Classification of indicators

Calculation logic according to indicators , Indicators can be divided into atomic indicators 、 Derived indicators 、 There are three types of derivative indicators ;

According to different descriptions of events , It is divided into process indicators and outcome indicators ;

Classification of labels

According to the variability of labels, they are divided into static labels and dynamic labels ;

According to the label reference and evaluation indicators , It can be divided into qualitative label and quantitative label ;

indicators The best application is monitoring 、 analysis 、 Evaluation and modeling .

label The best application is annotation 、 Characterization 、 Classification and feature extraction .

In particular, it should be noted that , Because the annotation of the result is also a label , So in the application scenario of algorithms related to natural language processing and machine learning , Tags are of great value for supervised learning , Just a simple indicator is difficult to achieve . The indicators are in task allocation 、 Role of performance management , The label can't do it .

3. Differences and relationships between dimensions and indicators

Dimension is the observation angle of data , That is, from which angle to analyze the problem , Look at problems .

The indicator is to calculate the value of this result on the basis of dimension .

A dimension is generally a discrete value , For example, each independent date or region in the time dimension , Therefore, in statistics , You can aggregate records with the same dimension , Apply aggregate function to accumulate 、 mean value 、 Maximum 、 Minimum value and other aggregate calculation .

Indicators are aggregated general calculations , That is, the result of aggregation operation , It is generally a continuous value .

4. The difference between the use of natural keys and proxy keys in the data warehouse

The data warehouse toolbox says The unique primary key of a dimension table should be a proxy key, not a natural key . Sometimes modelers are reluctant to give up using natural keys , Because they want to query the fact table with operational code , You don't want to connect with the dimension table . However , Multidimensional keys with business implications should be avoided , Because whatever assumptions we make may eventually become invalid , Because we can't control the change of Business Library .

Therefore, each connection between dimension table and fact table in data warehouse should be based on integer proxy key with no actual meaning . Avoid using natural keys as the primary keys of dimension tables .

版权声明
本文为[Learn big data in five minutes]所创,转载请带上原文链接,感谢
https://chowdera.com/2021/09/20210909140155414q.html

随机推荐