
Essential skills for programmers: ClickHouse

2020-12-17 16:22:33 HANGGE_ IT

 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

ClickHouse is an open-source columnar database that has attracted a lot of attention in recent years, used mainly in the data analysis (OLAP) field. The community in China is very active, and large companies have adopted it at scale:

Toutiao (ByteDance) uses ClickHouse internally for user behavior analysis, with several thousand ClickHouse nodes in total, the largest single cluster at 1,200 nodes, tens of PB of total data, and roughly 300 TB of raw data added per day.

Tencent uses ClickHouse internally for game data analysis and has built a monitoring and maintenance system around it.

Ctrip began a trial in July 2018; today about 80% of its business runs on ClickHouse, with more than one billion rows added and nearly a million query requests served every day.

Kuaishou also uses ClickHouse internally; total storage is about 10 PB, with 200 TB added daily, and 90% of queries finish in under 3 seconds.

Overseas, Yandex runs hundreds of nodes for user click-behavior analysis, and leading companies such as Cloudflare and Spotify use it as well.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In the community, the number of GitHub stars is growing at an astonishing rate.

(Figure: growth of ClickHouse's GitHub star count)

On the DB-Engines ranking, ClickHouse has only been open source for a short time, but its score is climbing quickly.

(Figure: ClickHouse's trend curve on the DB-Engines ranking)

Why has ClickHouse attracted so much attention, won the favor of the community, and been adopted by so many large companies?

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1 Introduction

ClickHouse is a distributed, real-time analytical columnar database service: high performance, usable out of the box, with enterprise-grade features. It is applied to business scenarios such as traffic analysis, advertising and marketing analysis, behavior analysis, audience segmentation, customer profiling, agile BI, data marts, network monitoring, and distributed-service and tracing monitoring.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

You are probably most interested in the following:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1 Extreme performance

ClickHouse takes full advantage of multi-core parallelism (SIMD instruction sets, a vectorized execution engine) and distributed technology to deliver real-time analysis. Open-source benchmarks show it to be 100 to 1000 times faster than traditional approaches, and it offers a high-throughput real-time import capability of 50 MB to 200 MB/s.

2 Simple and flexible

It provides comprehensive SQL support and is easy to get started with; flexible data types such as JSON, Map, and Array adapt to rapidly changing business requirements; it also supports approximate calculation and probabilistic data structures for processing massive data.

3 Low cost

With carefully designed columnar storage and efficient data compression algorithms, it provides up to a 10x compression ratio, greatly increasing how much data a single machine can store and compute over, and substantially reducing the cost of use. It is an excellent choice for building a massive data warehouse.

4 Flexible architecture

It supports single-node, single-replica, multi-node, and multi-replica deployments.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Here is a quick summary of ClickHouse's characteristics, so you can get a fast overview of what it offers.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  1. A true column-oriented DBMS
  2. Efficient data compression
  3. Data stored on disk
  4. Multi-core parallel processing
  5. Distributed processing across multiple servers
  6. SQL syntax support
  7. Vectorized execution engine
  8. Real-time data updates
  9. Indexes
  10. Suitable for online queries
  11. Support for approximate calculations
  12. Support for nested data structures
  13. Support for arrays as a data type
  14. Support for limiting query complexity and for quotas
  15. Data replication and support for data integrity

---------------------------------------------------

Limitations:

  1. No transaction support.
  2. No standard UPDATE/DELETE operations (only limited, asynchronous support through mutations).
  3. Limited operating-system support.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2 A deeper look

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2.1 Storage layer

Starting from the needs of OLAP scenarios, ClickHouse developed a new, efficient columnar storage engine and implemented a rich set of features: ordered data storage, primary keys, sparse indexes, data sharding, data partitioning, TTL, primary/replica replication, and more. Together these features lay the foundation for ClickHouse's extremely fast analytical performance.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Columnar storage

Unlike row-oriented storage, which stores the data of each row contiguously, columnar storage stores the data of each column contiguously, as illustrated below:

(Figure: row-oriented vs. column-oriented data layout)

Compared with row storage, columnar storage has many properties that suit analysis scenarios:

1) As mentioned earlier, analysis scenarios usually read many rows but only a few columns. In row storage, data is laid out row by row and all columns of a row sit in the same block, so columns not involved in the computation also get read during IO and reads are badly amplified. In columnar storage, only the columns involved in the computation need to be read, which greatly reduces IO cost and speeds up queries.

2) Data in the same column has the same type, so compression works very well. Columnar stores often reach compression ratios of ten times or more, saving a lot of storage space and reducing storage cost.

3) A higher compression ratio means smaller data, so reading the corresponding data from disk takes less time.

4) Freedom to choose the compression algorithm. Different columns hold different data types, and the best compression algorithm differs accordingly; the most suitable algorithm can be chosen per column type.

5) A high compression ratio also means the same amount of memory can hold more data, so the system cache works better.

Official figures show that in some analysis scenarios, columnar storage alone can deliver a speedup of 100x or more.

Ordered data storage

ClickHouse lets you specify, at table-creation time, that data be sorted by certain columns (the sort key).

After sorting, data with the same sort key is stored contiguously and in order on disk. For equality and range queries, the rows hit by the WHERE condition are packed into one or a few contiguous blocks instead of being scattered across many blocks, which dramatically reduces the number of blocks that need IO. In addition, contiguous IO makes full use of the operating system's page-cache prefetching and reduces page faults.
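As a minimal sketch (table and column names here are hypothetical, not from the original article), the sort key is declared with ORDER BY when the table is created, and range queries on that key then only touch a few contiguous blocks:

-- hypothetical table sorted by (site_id, ts) on disk
CREATE TABLE IF NOT EXISTS access_log
(
    site_id UInt32,
    user_id UInt64,
    ts      DateTime,
    url     String
)
ENGINE = MergeTree()
ORDER BY (site_id, ts);

-- the WHERE condition hits a contiguous range of the sort key,
-- so only a few blocks have to be read
SELECT count()
FROM access_log
WHERE site_id = 10030 AND ts >= '2020-07-01 00:00:00';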

Primary key index

ClickHouse supports a primary key index. It divides each column's data into granules of index granularity rows (8192 by default); the first row of each granule is called a mark row, and the primary key index stores the primary key values of these mark rows.

For queries whose WHERE condition involves the primary key, a binary search over the primary key index can locate the relevant granules directly, avoiding a full table scan and speeding up the query.

Note, however, that ClickHouse's primary key index is different from that of a database such as MySQL: it is not used for deduplication, so rows with the same primary key can coexist in the table. To get deduplication, you need to combine it with a specific table engine such as ReplacingMergeTree, CollapsingMergeTree, or VersionedCollapsingMergeTree; we will cover these in detail in a later article in this series.

Sparse indexes

ClickHouse supports creating any number of sparse (data-skipping) indexes on any column. The indexed value can be any legal SQL expression and is not limited to the column value itself. They are called sparse because each entry essentially holds statistics about a whole index granule (8192 rows by default) rather than recording the position of every row in the file. The currently supported sparse index types are listed here, with a small example after the list:

  • minmax: per granule, stores the min and max values of the specified expression; in equality and range queries it helps skip blocks that cannot match, reducing IO.
  • set(max_rows): per granule, stores the set of distinct values of the specified expression; used to quickly decide whether an equality query can hit a block, reducing IO.
  • ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): splits strings into n-grams and builds a bloom filter; can speed up equality, LIKE, and IN queries.
  • tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): like ngrambf_v1, except tokens are split on punctuation rather than n-grams.
  • bloom_filter([false_positive]): builds a bloom filter on the specified column, used to accelerate equality, LIKE, and IN queries.
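A small sketch of declaring skip indexes (table and index names are hypothetical); note that an index added to an existing table with ALTER only covers newly written parts unless it is rebuilt:

-- skip index declared at table creation
CREATE TABLE metrics
(
    dt    DateTime,
    code  UInt16,
    value Float64,
    INDEX idx_value value TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY dt;

-- or added to an existing table (affects newly written parts)
ALTER TABLE metrics ADD INDEX idx_code code TYPE set(100) GRANULARITY 4;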

Data sharding

ClickHouse supports both single-node mode and distributed cluster mode. In distributed mode, ClickHouse splits the data into multiple shards and distributes them across different nodes. Different sharding strategies have different strengths for different SQL patterns, and ClickHouse provides a rich set of them so the business can choose according to its actual needs:

1) random sharding: written data is distributed to a random node of the cluster.

2) constant sharding: written data is distributed to a fixed node.

3) column-value sharding: data is sharded by the hash of a column's value.

4) custom-expression sharding: any legal expression can be specified, and data is sharded by the hash of its computed value.

Data sharding lets ClickHouse exploit the massively parallel computing power of the whole cluster and return query results quickly.

More importantly, the variety of sharding options opens up room for business-level optimization. For example, with hash sharding, a JOIN can avoid a data shuffle and run as a local join; with custom sharding, the most suitable strategy can be tailored to each business and SQL pattern; and a well-chosen custom sharding expression can eliminate data skew between shards.

In addition, the sharding mechanism lets ClickHouse scale out horizontally and linearly into large distributed clusters, giving it the capacity to process massive amounts of data.

Data partitioning

ClickHouse supports the PARTITION BY clause: when creating a table you can partition the data by any legal expression, for example toYYYYMM() to partition by month, toMonday() to partition by week, or an Enum column with each value as its own partition.

Data partitioning has two main uses in ClickHouse (see the sketch after this list):

  • Partition pruning on the partition key, so that only the necessary data is queried. A flexible partition expression can be chosen for the SQL pattern at hand to best fit the business.
  • TTL management on partitions, dropping obsolete data partitions as a whole.
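A sketch of month-based partitioning and partition management (table and column names are hypothetical):

CREATE TABLE orders
(
    order_id   UInt64,
    user_id    UInt64,
    amount     Decimal(18, 2),
    order_date Date
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (user_id, order_date);

-- a query filtering on the partition key only scans the matching partitions
SELECT sum(amount) FROM orders WHERE order_date >= '2020-07-01';

-- an obsolete partition can be dropped as a whole
ALTER TABLE orders DROP PARTITION '202001';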

Data TTL

In analysis scenarios the value of data decreases over time, and for cost reasons most businesses keep only the last few months of data. ClickHouse provides data lifecycle management through TTL.

ClickHouse supports TTL at several granularities:

1) Column-level TTL: when part of a column's data expires, it is replaced with the default value; when the whole column has expired, the column is deleted.

2) Row-level TTL: when a row expires, the row is deleted.

3) Partition-level TTL: when a partition expires, the whole partition is deleted.
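A minimal sketch of column-level and table-level TTL (table and column names are hypothetical):

CREATE TABLE user_events
(
    event_date Date,
    user_id    UInt64,
    detail     String TTL event_date + INTERVAL 30 DAY   -- column-level TTL
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (user_id, event_date)
TTL event_date + INTERVAL 90 DAY;                        -- row-level TTL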

High-throughput write capability

ClickHouse uses an LSM-tree-like structure: data is written as immutable parts and merged (compacted) in the background on a regular basis. Because of this structure, imports are purely sequential appends, parts cannot be changed after they are written, and background compaction merge-sorts multiple parts and writes them back to disk sequentially. Sequential writes make full use of disk throughput, so write performance is excellent even on HDDs.

Officially published benchmarks show a write throughput of 50 MB to 200 MB/s; assuming roughly 100 bytes per row, that is about 500K to 2M rows per second.

Limited support for DELETE and UPDATE

In analysis scenarios, delete and update are not core requirements. ClickHouse does not support standard DELETE or UPDATE statements directly, but it supports them indirectly through mutations, with the syntax ALTER TABLE ... DELETE WHERE filter_expr and ALTER TABLE ... UPDATE col = val WHERE filter_expr.

The main limitation at present is that these delete and update operations are asynchronous and only take effect after background compaction.
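A sketch of the mutation syntax against a hypothetical table; both statements return immediately and are applied asynchronously in the background:

-- asynchronous delete by condition
ALTER TABLE user_events DELETE WHERE event_date < '2020-01-01';

-- asynchronous update by condition
ALTER TABLE user_events UPDATE detail = '' WHERE user_id = 12345;

-- mutation progress can be checked in the system.mutations table
SELECT command, is_done FROM system.mutations WHERE table = 'user_events';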

Primary/replica replication

ClickHouse provides high availability through primary/replica replication and supports seamless upgrades and other maintenance operations under this architecture. Compared with other systems, its implementation has some distinctive characteristics:

1) Under the default configuration, every replica is in active mode and can serve queries;

2) The number of replicas can be configured freely, from 0 up to any number;

3) Different shards can be configured with different numbers of replicas, which helps deal with query hot spots on individual shards.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2.2 Computing layer

ClickHouse has done very detailed work in the computing layer, squeezing as much performance out of the hardware as possible to increase query speed. It implements several important techniques: single-machine multi-core parallelism, distributed computing, vectorized execution with SIMD instructions, and runtime code generation.

Multi-core parallelism

ClickHouse divides the data into multiple partitions, each further divided into granules of index granularity rows, and processes them in parallel on multiple CPU cores.

With this design a single query can use the whole machine's CPU. This extreme degree of parallelism greatly reduces query latency.

Distributed computing

Besides excellent single-machine parallelism, ClickHouse also provides linearly scalable distributed computing. It automatically splits a query into multiple tasks, sends them to the cluster for parallel processing, and finally merges the results.

When there are multiple replicas, ClickHouse offers several query distribution strategies:

  • Random distribution: pick one of the replicas at random;
  • Nearest hostname: pick the node whose hostname is closest to the current machine's and send the query there. Under a suitable network topology this reduces network latency, and because queries go to a fixed replica it also makes the best use of the system cache.
  • in_order: distribute queries to replicas in a fixed order; when the current replica is unavailable, fall back to the next one.
  • first_or_random: in in_order mode, when the first replica is unavailable, all the workload piles onto the second replica, causing load imbalance. first_or_random solves this: when the first replica is unavailable, another replica is chosen at random, keeping the load balanced across the remaining replicas. In cross-region replication scenarios, setting the first replica to the in-region copy can also significantly reduce network latency.

Vectorized execution and SIMD

ClickHouse not only stores data by column, it also computes by column. Traditional OLTP databases usually compute row by row, because transaction processing is dominated by point lookups and each SQL statement involves little computation, so these techniques bring little benefit there. In analysis scenarios, however, a single SQL statement may involve an enormous amount of computation, and treating each row as the basic processing unit causes severe performance loss:

1) The corresponding functions are called for every single row, and the function-call overhead becomes high;

2) The storage layer stores data by column and organizes it in memory by column, but the compute layer processes it by row, so CPU cache prefetching cannot be used effectively and CPU cache misses become severe;

3) Row-by-row processing cannot use efficient SIMD instructions.

ClickHouse implements a vectorized execution engine: for columnar data in memory, one SIMD instruction call processes a whole batch (instead of one call per row). This reduces the number of function calls and the cache-miss rate, and fully exploits the parallelism of SIMD instructions, greatly cutting computation time. Vectorized execution typically brings a several-fold performance improvement.

Runtime code generation (codegen)

In a classic database implementation, expressions are usually evaluated with the volcano model: the query is turned into a chain of operators such as HashJoin, Scan, IndexScan, and Aggregation. To connect different operators, they share a unified interface such as open/next/close, implemented as virtual functions of a common base class inside each operator. In analysis scenarios a single SQL statement typically processes hundreds of millions of rows, so the virtual-function call overhead is no longer negligible. Moreover, each operator has to handle many variables such as column type, column size, and number of columns, and the resulting if-else branches cause CPU branch-prediction misses.

ClickHouse implements runtime code generation at the expression level: it generates code for the current SQL statement on the fly, then compiles and executes it. As illustrated below, generating code directly for an expression not only removes a large number of virtual-function calls (that is, calls through function pointers), but also, since the expression's parameter types and counts are known at runtime, eliminates unnecessary if-else branching.

(Figure: example of expression-level generated code)

Approximate calculation

Approximate computation trades some precision of the result for a large gain in query performance. When processing massive data, its value is even more obvious.

ClickHouse implements several kinds of approximate computation (see the sketch after this list):

  • Aggregate functions that approximately estimate distinct counts, medians, quantiles, and so on;
  • The table-creation DDL supports a SAMPLE BY clause, so queries can work on a data sample.
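A small sketch contrasting exact and approximate aggregates (table names are hypothetical; SAMPLE only works on tables whose DDL declares a SAMPLE BY expression):

SELECT
    uniqExact(user_id)      AS exact_users,   -- exact distinct count, more memory
    uniq(user_id)           AS approx_users,  -- approximate distinct count
    quantile(0.5)(duration) AS approx_median  -- approximate median
FROM access_log;

-- process roughly a 10% sample of the data
SELECT count() * 10 AS estimated_rows FROM hits SAMPLE 1 / 10;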

Complex data type support

ClickHouse also provides composite data types such as Array, JSON, Tuple, and Set, which support flexible changes to the business schema.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

3 A closer look

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

3.1 Installation

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

ClickHouse has a flexible architecture: it can be installed on a single node or on multiple nodes, and the procedure is very simple. For a multi-node setup, install ClickHouse on each machine separately and then configure the machines as a cluster.

Installation on virtual machine Linux01:

1) Install the curl tool
yum install -y curl
2) Add the ClickHouse yum repository
curl -s https://packagecloud.io/install/repositories/altinity/clickhouse/script.rpm.sh | sudo bash
3) Check the repository
[root@doit04 yum.repos.d]# yum list | grep clickhouse
clickhouse-client.x86_64                    20.5.4.40-1.el7            @Altinity_clickhouse
clickhouse-common-static.x86_64             20.5.4.40-1.el7            @Altinity_clickhouse
4) Install the ClickHouse server and client
yum install -y clickhouse-server clickhouse-client
5) Start the server
service clickhouse-server start
6) Start the interactive client
clickhouse-client -m

Repeat the same six steps on virtual machine Linux02, and likewise on any other nodes of the cluster.

 

[root@linux01 flink-1.11.2]# clickhouse-client  -m
ClickHouse client version 20.8.3.18.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 20.8.3 revision 54438.

linux01 :) SHOW DATABASES;

┌─name───────────────────────────┐
│ _temporary_and_external_tables │
│ db_ch_mysql                    │
│ db_doit19                      │
│ default                        │
│ system                         │
└────────────────────────────────┘

linux01 :) CREATE DATABASE db1;
Ok.
0 rows in set. Elapsed: 0.007 sec.

linux01 :) USE db1;

linux01 :) CREATE TABLE tb_t1 ... ;

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

3.2 Table engines

Table engines. If you are familiar with MySQL, you have probably heard of the InnoDB and MyISAM storage engines. Different storage engines provide different storage mechanisms, indexing methods, locking levels, and other capabilities; they can also be thought of as table types. ClickHouse likewise provides a rich set of table engines, and each engine represents a different table type: what features the table has, how its data is stored, and how it is loaded.

The table engine determines:

  • Where and how the table's data is stored
  • Which queries are supported, and how
  • Concurrent data access
  • Use of indexes
  • Whether multithreaded request execution is possible
  • Data replication
Engine categories and engine names:

  • MergeTree family: MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, GraphiteMergeTree
  • Log family: TinyLog, StripeLog, Log
  • Integration engines: Kafka, MySQL, ODBC, JDBC, HDFS
  • Special engines: Distributed, MaterializedView, Dictionary, Merge, File, Null, Set, Join, URL, View, Memory, Buffer

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

ClickHouse provides many engines; two typical ones are shown below as examples:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

3.2.1 MergeTree engine

The MergeTree family of table engines is the core of ClickHouse's data storage capability. These engines provide most of the features needed for resilient, high-performance data retrieval: columnar storage, custom partitioning, a sparse primary index, secondary data-skipping indexes, and more.

The basic MergeTree engine can be considered the default table engine for a single-node ClickHouse instance, because it is versatile and practical across a wide range of use cases.

For production use, ReplicatedMergeTree is the way to go, because it adds high availability on top of all the features of the regular MergeTree engine. A bonus is automatic deduplication on data ingestion, so the client can safely retry an insert after a network problem.

All other engines in the MergeTree family add extra functionality for specific use cases, usually implemented as additional data manipulation performed in the background.

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
)
ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]

VersionedCollapsingMergeTree

This engine uses uid as the primary key and sort key; with the sign and version columns, it lets us keep or delete a specified version of a row:

CREATE TABLE tb_vscmt
(
    uid UInt64,
    name String,
    age UInt8,
    sign Int8,
    version UInt8
)
ENGINE = VersionedCollapsingMergeTree(sign, version)
ORDER BY uid;
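A hedged usage sketch (the rows below are hypothetical): a "state" row is written with sign = 1 and later canceled by writing a row with the same sort key, the same version, and sign = -1; matching pairs are collapsed during background merges, so FINAL (or aggregating over sign) is needed to see the collapsed result immediately:

INSERT INTO tb_vscmt VALUES (1001, 'ADA', 12, 1, 1);   -- state row, version 1
INSERT INTO tb_vscmt VALUES (1001, 'ADA', 12, -1, 1),  -- cancel row for version 1
                            (1001, 'ADA', 13, 1, 2);   -- new state, version 2

-- FINAL forces collapsing at query time (at some extra query cost)
SELECT * FROM tb_vscmt FINAL;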

3.2.2 Distributed engines

We build our distributed tables with the ReplicatedMergeTree and Distributed engines.

-- Each machine needs its own local replica table, plus a Distributed table that acts as the router

create table dm.delphi_membership_properties_replica
(
  membership_id int, -- member id
  membership_uid String, -- member uid
  business_group_id int, -- merchant id
  business_group_uid String, -- merchant uid
  business_group_name String, -- merchant name
  business_id Nullable(int), -- store id
  business_uid Nullable(String), -- store uid
  business_name Nullable(String), -- store name
  membership_source String, -- membership source
  created_at DateTime,
  calendar_date Date,
  last_visited_date Date, -- last visit date
  membership_level int, -- membership level
  customer_type String, -- member type: new / loyal / regular / forgetful / lost, calculated from the last visit time and the merchant's configuration
  visit_count int, -- number of visits
  consumptions_count Nullable(int), -- number of purchases
  consumptions_original_amount Nullable(Decimal128(2)), -- total consumption: original amount
  consumptions_amount Nullable(Decimal128(2)), -- total consumption: amount actually paid
  average_consume Nullable(Decimal128(2)), -- average consumption: original amount / number of purchases
  account_id int, -- user id
  account_uid String, -- user uid
  account_phone String, -- user phone number
  age Nullable(int), -- age
  birthday Nullable(String), -- birthday
  birthday_month Nullable(int), -- birth month
  birthday_day Nullable(int), -- day of birth
  birthday_year Nullable(int), -- year of birth
  zodiac String, -- zodiac sign
  name Nullable(String), -- full name
  gender int, -- gender
  profession Nullable(String), -- occupation
  country Nullable(String), -- country
  province Nullable(String), -- province
  city Nullable(String), -- city
  region Nullable(String), -- business district
  head_img_url Nullable(String), -- avatar URL
  wechat_name Nullable(String), -- WeChat name
  wechat_city Nullable(String), -- WeChat city
  wechat_country Nullable(String), -- WeChat country
  wechat_province Nullable(String), -- WeChat province
  wechat_head_img_url Nullable(String), -- WeChat avatar URL
  wechat_groupid int, -- WeChat group
  wechat_remark Nullable(String), -- WeChat remark
  insert_time DateTime DEFAULT now(), -- insertion time
  insert_date Date DEFAULT toDate(now()) -- insertion date
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/delphi_membership_properties_replica', '{replica}')
order by (business_group_uid, calendar_date, created_at, membership_uid);

create table dm.delphi_membership_properties as dm.delphi_membership_properties_replica
ENGINE = Distributed(ck_cluster, dm, delphi_membership_properties_replica, rand());

delphi_membership_properties_replica is the local table on each machine, and delphi_membership_properties is the distributed table; compare the engine used by the two CREATE statements.

ReplicatedMergeTree

Adding the Replicated prefix to a table engine name makes it a replicated table. ReplicatedMergeTree takes two parameters:

  • zoo_path: the path of the table in ZooKeeper.
  • replica_name: the name of this table's replica in ZooKeeper.
 
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/delphi_membership_properties_replica', '{replica}')
order by (business_group_uid, calendar_date, created_at, membership_uid);

These parameters can contain placeholders for macro substitution, the parts in curly braces. They are replaced with the values from the 'macros' section of the configuration file:

 
<yandex>
    <macros>
        <replica>172.31.59.118</replica>
        <shard>01</shard>
        <layer>01</layer>
    </macros>
</yandex>

The "path to the table in ZooKeeper" must be unique for every replicated table. Tables on different shards have different paths. In this case, the path consists of the following parts:

  • /clickhouse/tables/ is the common prefix recommended by the official documentation.

  • {layer}-{shard} is the shard identification part.

  • table_name is the name of the table's node in ZooKeeper. It is a good idea to keep it the same as the ClickHouse table name. It is defined explicitly here and, unlike the ClickHouse table name, it is not changed by a RENAME statement.

Note:

  1. Replication works at the level of individual tables, not the whole server, so a server can hold both replicated and non-replicated tables at the same time.
  2. DDL statements are executed only on the single server they are sent to and are not replicated.

 

Distributed

Distributed(logs, default, hits[, sharding_key])

Distributed engine parameters: the cluster name from the server configuration file, the remote database name, the remote table name, and optionally a sharding key. The sharding key determines which shard each row is written to when data is inserted.

In the table definition above:

ENGINE = Distributed(ck_cluster, dm, delphi_membership_properties_replica, rand());

This means the table reads data from dm.delphi_membership_properties_replica on the ck_cluster cluster.

The cluster name is configured in the metrika.xml file when the cluster is set up; see the cluster-setup section for details. Any number of clusters can be configured.

To view the clusters, query the system.clusters table.

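For example (the exact output depends on your cluster configuration):

SELECT cluster, shard_num, replica_num, host_name, port, is_local
FROM system.clusters;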

Through the Distributed engine, the cluster can be used much like a local server. However, the cluster does not expand automatically: the cluster configuration must be written into the server configuration file.

In this way we can use ClickHouse to create distributed tables and implement distributed database functionality, and the data in these tables is also replicated for backup.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

3.3 Integration

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

ClickHouse implements many integration engines and table functions, which makes it very convenient to integrate ClickHouse with other data sources. The main table functions include:

  • file: creates a File-engine table.
  • merge: creates a Merge-engine table.
  • numbers: creates a table with a single column filled with integer numbers.
  • remote: allows access to remote servers without creating a Distributed-engine table.
  • url: creates a URL-engine table.
  • mysql: creates a MySQL-engine table.
  • jdbc: creates a JDBC-engine table.
  • odbc: creates an ODBC-engine table.
  • hdfs: creates an HDFS-engine table.

Take integrating MySQL as an example: with the following CREATE TABLE statement specifying the MySQL engine, ClickHouse can read the data stored in MySQL.

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
) ENGINE = MySQL('host:port', 'database', 'table', 'user', 'password'[, replace_query, 'on_duplicate_clause']);

CREATE TABLE hdfs_engine_table (name String, value UInt32) ENGINE=HDFS('hdfs://hdfs1:9000/other_storage', 'TSV')

The second statement shows ClickHouse reading structured data directly from HDFS.
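A hedged, concrete example with hypothetical connection details; besides the MySQL table engine, the mysql() table function can query MySQL ad hoc without creating a table first:

-- table wrapping a remote MySQL table (host, database, and credentials are placeholders)
CREATE TABLE mysql_orders
(
    id     UInt64,
    amount Decimal(18, 2)
)
ENGINE = MySQL('192.168.1.10:3306', 'shop', 'orders', 'ck_user', 'ck_pass');

-- ad-hoc query through the mysql() table function
SELECT count()
FROM mysql('192.168.1.10:3306', 'shop', 'orders', 'ck_user', 'ck_pass');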

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

3.4 Functions

ClickHouse provides a large number of built-in functions, as well as table functions, higher-order functions, join and set functions, and more, which bring great convenience to data analysis with ClickHouse!

3.4.1 Date functions

  • toYear(): extracts the year from a date or datetime. Example: toYear(toDateTime('2018-12-11 11:12:13')) → 2018; toYear(toDate('2018-12-11')) → 2018
  • toMonth(): extracts the month. Example: toMonth(toDateTime('2018-12-11 11:12:13')) → 12; toMonth(toDate('2018-12-11')) → 12
  • toDayOfMonth(): extracts the day of the month (1-31). Example: toDayOfMonth(toDateTime('2018-12-11 11:12:13')) → 11; toDayOfMonth(toDate('2018-12-11')) → 11
  • toDayOfWeek(): extracts the day of the week (Monday is 1, Sunday is 7). Example: toDayOfWeek(toDateTime('2018-12-11 11:12:13')) → 2; toDayOfWeek(toDate('2018-12-11')) → 2
  • toHour(): extracts the hour from a datetime. Example: toHour(toDateTime('2018-12-11 11:12:13')) → 11
  • toMinute(): extracts the minute from a datetime. Example: toMinute(toDateTime('2018-12-11 11:12:13')) → 12
  • toSecond(): extracts the second from a datetime. Example: toSecond(toDateTime('2018-12-11 11:12:13')) → 13
  • toMonday(): rounds a date or datetime down to the nearest Monday (returns a date). Example: toMonday(toDate('2018-12-11')) → 2018-12-10; toMonday(toDateTime('2018-12-11 11:12:13')) → 2018-12-10
  • toTime(): moves the date part of a datetime to a fixed day while keeping the time. Example: toTime(toDateTime('2018-12-11 11:12:13')) → 1970-01-02 11:12:13

  • toStartOfMonth(): first day of the month of a date or datetime (returns a date). Example: toStartOfMonth(toDateTime('2018-12-11 11:12:13')) → 2018-12-01; toStartOfMonth(toDate('2018-12-11')) → 2018-12-01
  • toStartOfQuarter(): first day of the quarter. Example: toStartOfQuarter(toDateTime('2018-12-11 11:12:13')) → 2018-10-01
  • toStartOfYear(): first day of the year. Example: toStartOfYear(toDateTime('2018-12-11 11:12:13')) → 2018-01-01
  • toStartOfMinute(): truncates a datetime to the minute. Example: toStartOfMinute(toDateTime('2018-12-11 11:12:13')) → 2018-12-11 11:12:00
  • toStartOfFiveMinute(): truncates a datetime down to the nearest 5-minute boundary. Example: toStartOfFiveMinute(toDateTime('2018-12-11 11:12:13')) → 2018-12-11 11:10:00
  • toStartOfFifteenMinutes(): truncates a datetime down to the nearest 15-minute boundary. Example: toStartOfFifteenMinutes(toDateTime('2018-12-11 11:12:13')) → 2018-12-11 11:00:00
  • toStartOfHour(): truncates a datetime to the hour. Example: toStartOfHour(toDateTime('2018-12-11 11:12:13')) → 2018-12-11 11:00:00
  • toStartOfDay(): truncates a datetime to the day. Example: toStartOfDay(toDateTime('2018-12-11 11:12:13')) → 2018-12-11 00:00:00
  • timeSlot(): rounds a datetime down to the half hour (minutes >= 30 become 30, minutes < 30 become 00). Example: timeSlot(toDateTime('2018-12-11 11:33:13')) → 2018-12-11 11:30:00
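Several of these functions can be combined in a single query; a small illustrative sketch:

SELECT
    toYear(now())                                     AS this_year,
    toStartOfMonth(today())                           AS month_start,
    toMonday(toDate('2018-12-11'))                    AS week_start,  -- 2018-12-10
    toStartOfHour(toDateTime('2018-12-11 11:12:13'))  AS hour_start;  -- 2018-12-11 11:00:00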

3.4.2 Date and time generation functions

  • now(): returns the current datetime. Example: now() → 2018-12-13 10:10:12
  • today(): returns today's date. Example: today() → 2018-12-13
  • yesterday(): returns yesterday's date. Example: yesterday() → 2018-12-12

3.4.3 Type conversion functions

  • toDecimal32(value, scale): converts a number or numeric string to Decimal32 with the given scale; non-numeric input becomes 0. Example: toDecimal32(23.12291, 3) → 23.122; toDecimal32('_23.12291', 3) → 0.000
  • toDecimal64(value, scale): same, producing Decimal64. Example: toDecimal64(23.12291, 3) → 23.122
  • toDecimal128(value, scale): same, producing Decimal128. Example: toDecimal128(23.12291, 3) → 23.122

  • toUInt8OrZero(): converts a string holding an unsigned integer to an integer, otherwise returns 0. Example: toUInt8OrZero('123') → 123; toUInt8OrZero('123.12') → 0
  • toInt8OrZero(): converts a string holding an integer to an integer, otherwise returns 0. Example: toInt8OrZero('123') → 123; toInt8OrZero('-123') → -123
  • toFloat32OrZero(): converts a numeric string to a floating-point number, otherwise returns 0 (note that there is no width-less toFloatOrZero variant). Example: toFloat32OrZero('-123') → -123; toFloat32OrZero('123.123') → 123.123

  • toDate(): converts a date string to the Date type. Example: toDate('2018-12-24') → 2018-12-24
  • toDateTime(): converts a datetime string to the DateTime type. Example: toDateTime('2018-12-24 10:10:00') → 2018-12-24 10:10:00

  • toString(): converts numeric, string, date, and similar values to String. Example: toString('2018-12-24') → 2018-12-24; toString(123) → 123

  • toTypeName(): returns the data type of its argument. Example: toTypeName(toString('123')) → String; toTypeName(toDate('2018-12-24')) → Date

3.4.4 String functions

  • empty(): returns 1 if the string is empty, otherwise 0. Example: empty('') → 1; empty('123a') → 0
  • notEmpty(): returns 1 if the string is not empty, otherwise 0. Example: notEmpty('') → 0; notEmpty('123a') → 1
  • length(): returns the length of the string. Example: length('') → 0; length('123a') → 4
  • lower(): converts a string to lowercase. Example: lower('aBc') → abc
  • upper(): converts a string to uppercase. Example: upper('aBc') → ABC
  • reverse(): reverses a string. Example: reverse('abc') → cba
  • substring(s, offset, length): extracts a substring. Example: substring('123abcABC', 2, 3) → 23a
  • appendTrailingCharIfAbsent(s, c): if s is not empty and its last character differs from c, appends c to s; otherwise leaves it unchanged. Example: appendTrailingCharIfAbsent('123abc', 'b') → 123abcb; appendTrailingCharIfAbsent('123abc', 'c') → 123abc

  • match(haystack, pattern): regular-expression match, returns 0 or 1. Example: match('avhsca', 'vh') → 1
  • extract(haystack, pattern): returns the first matching substring. Example: extract('iioomAj12123124OOBJB', '\d+') → 12123124
  • extractAll(haystack, pattern): returns all matching substrings as an array. Example: extractAll('iioomAj12123124OOBJ123B', '\d+') → ['12123124','123']
  • like(haystack, pattern): returns 1 if the pattern matches, otherwise 0. Example: like('avhsca', '%vh%') → 1; like('avhsca', '%vabjh%') → 0
  • notLike(haystack, pattern): the opposite of like(). Example: notLike('avhsca', '%vh%') → 0; notLike('avhsca', '%vabjh%') → 1

  • replaceOne(haystack, pattern, replacement): replaces the first occurrence of pattern. Example: replaceOne('asd123cbbj464sd', 'sd', '-') → a-123cbbj464sd
  • replaceAll(haystack, pattern, replacement): replaces all occurrences of pattern. Example: replaceAll('asd123cbbj464sd', 'sd', '-') → a-123cbbj464-
  • replaceRegexpOne(haystack, pattern, replacement): replaces the first regular-expression match. Example: replaceRegexpOne('Hello, World!', 'o', '- ') → Hell- , World!
  • replaceRegexpAll(haystack, pattern, replacement): replaces all regular-expression matches. Example: replaceRegexpAll('Hello, World!', '^', 'here: ') → here: Hello, World!; replaceRegexpAll('Hello, World!', 'o', '-- ') → Hell-- , W-- rld!

  • splitByChar(separator, s): splits a string by a single character. Example: splitByChar('-', 'qw-asaf-asfqw-2312-asd') → ['qw','asaf','asfqw','2312','asd']
  • splitByString(separator, s): splits a string by a single- or multi-character separator. Example: splitByString('-', 'qw-asaf-asfqw-2312-asd') → ['qw','asaf','asfqw','2312','asd']; splitByString('-a', 'qw-asaf-asfqw-2312-asd') → ['qw','saf','sfqw-2312','sd']

  • concat(s1, s2, ...): concatenates strings. Example: concat('123', 'abc', 'ABC') → 123abcABC

3.4.5 Conditional expressions

  • if(cond, then, else): conditional expression. Example: if(1 > 2, 'correct', 'error') → error
  • multiIf(cond_1, then_1, cond_2, then_2, ..., else): multi-branch conditional. Example: multiIf(1 > 2, 'correct', 2 < 0, 'correct', 'error') → error

3.4.6 Mathematical functions

  • e(): returns e. Example: e() → 2.718281828459045
  • pi(): returns pi. Example: pi() → 3.141592653589793
  • exp(x): returns e to the power x. Example: exp(1) → 2.718281828459045
  • exp2(x): returns 2 to the power x. Example: exp2(2) → 4
  • exp10(x): returns 10 to the power x. Example: exp10(1) → 10
  • log(x): natural (base-e) logarithm. Example: log(e()) → 1
  • log2(x): base-2 logarithm. Example: log2(2) → 1
  • log10(x): base-10 logarithm. Example: log10(100) → 2
  • sqrt(x): square root of x. Example: sqrt(4) → 2
  • cbrt(x): cube root of x. Example: cbrt(8) → 2
  • pow(x, y): x to the power y. Example: pow(2, 3) → 8

3.4.7 Rounding functions

  • floor(x[, N]): rounds down to N decimal places (a negative N rounds to the left of the decimal point). Example: floor(123.883, 1) → 123.8; floor(123.883, -1) → 120
  • ceil(x[, N]): rounds up. Example: ceil(123.883, 1) → 123.9; ceil(123.883, -1) → 130
  • round(x[, N]): rounds to the nearest value. Example: round(123.883, 1) → 123.9; round(123.883, -1) → 120

3.4.8 URL functions

  • protocol(): returns the protocol of a URL. Example: protocol('http://www.baidu.com.cn') → http
  • domain(): returns the domain of a URL. Example: domain('http://www.baidu.com.cn') → www.baidu.com.cn
  • domainWithoutWWW(): returns the domain without the leading www. Example: domainWithoutWWW('http://www.baidu.com.cn') → baidu.com.cn
  • topLevelDomain(): returns the top-level domain. Example: topLevelDomain('http://www.baidu.com.cn') → cn
  • firstSignificantSubdomain(): returns the "first significant subdomain". Example: firstSignificantSubdomain('http://www.baidu.com.cn') → baidu
  • cutToFirstSignificantSubdomain(): returns the part of the domain from the top-level domain up to the "first significant subdomain". Example: cutToFirstSignificantSubdomain('http://www.baidu.com.cn') → baidu.com.cn
  • path(): returns the path of a URL. Example: path('https://www.baidu.com/s?wd=SQL%E4%B8%AD%E7%9A%84split') → /s
  • pathFull(): returns the full path including the query string. Example: pathFull('https://www.baidu.com/s?wd=SQL%E4%B8%AD%E7%9A%84split') → /s?wd=SQL%E4%B8%AD%E7%9A%84split
  • queryString(): returns the query string of a URL. Example: queryString('https://www.baidu.com/s?wd=SQL%E4%B8%AD%E7%9A%84split') → wd=SQL%E4%B8%AD%E7%9A%84split
  • extractURLParameters(): returns the URL parameters as an array. Example: extractURLParameters('https://www.baidu.com/s?wd=SQL%E4%B8%AD%E7%9A%84split&ur=qwguq') → ['wd=SQL%E4%B8%AD%E7%9A%84split','ur=qwguq']
  • extractURLParameterNames(): returns the URL parameter names as an array. Example: extractURLParameterNames('https://www.baidu.com/s?wd=SQL%E4%B8%AD%E7%9A%84split&ur=qwguq') → ['wd','ur']
  • cutQueryString(): returns the URL up to (but not including) the '?'. Example: cutQueryString('https://www.baidu.com/s?wd=SQL%E4%B8%AD%E7%9A%84split&ur=qwguq') → https://www.baidu.com/s

3.4.9 IP address functions

  • IPv4StringToNum(s): converts an IPv4 string to a number; invalid IPv4 strings become 0. Example: IPv4StringToNum('23.217.198.69') → 400148037; IPv4StringToNum('adwh.124.qwfqw') → 0
  • IPv4NumToString(num): converts a number back to an IPv4 string. Example: IPv4NumToString(400148037) → 23.217.198.69
  • IPv4NumToStringClassC(num): converts a number to an IPv4 string with the last octet replaced by xxx. Example: IPv4NumToStringClassC(400148037) → 23.217.198.xxx

3.4.10 Table operations

  • INNER JOIN: A INNER JOIN B returns the rows that A and B have in common.
  • LEFT OUTER JOIN: A LEFT OUTER JOIN B returns all rows of A; where B has no match, B's columns are NULL.
  • RIGHT OUTER JOIN: A RIGHT OUTER JOIN B returns all rows of B; where A has no match, A's columns are NULL.
  • FULL OUTER JOIN: A FULL OUTER JOIN B returns all rows of both tables; unmatched columns are NULL.

  • LIMIT N: returns N rows, usually used together with ORDER BY. Example: ORDER BY hit DESC LIMIT 10 returns the top 10 rows by hit in descending order.
  • LIMIT N BY <columns>: returns the first N rows for each distinct value of the given columns, usually used together with ORDER BY. Example: SELECT date, domain, count(1) AS hit FROM db.tb WHERE ... GROUP BY date, domain ORDER BY hit DESC LIMIT 10 BY date returns the top 10 domains for each day.

3.4.11 Dictionary functions

  • dictGetString(dict_name, attr_name, id_expr): looks up a value in an external dictionary. Example: dictGetString('ck_abc_dic', 'ck_value', tuple(_abc))

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

4 User behavior analysis

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

ClickHouse provides a rich set of parametric aggregate functions and higher-order functions based on arrays and lambda expressions; used flexibly, they can produce remarkable results. In our system, ClickHouse serves as the clickstream data warehouse, so below are several practical examples of using it for user behavior (path) analysis, covering:

  • Path matching
  • Intelligent path detection
  • Ordered funnel transformation
  • User retention
  • Session Statistics

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

4.1  Path matching

ClickHouse ships with the sequenceMatch function, which checks whether an event chain matching the input pattern exists, and the sequenceCount function, which counts the event chains matching the input pattern. Example:

SELECT
  site_id,
  sequenceMatch('(?1)(?t<=15)(?2).*(?3)')(
    ts_date_time,
    event_type = 'shtLogon',
    event_type = 'shtKkclick' AND column_type = 'homePage',
    event_type = 'shtAddCart'
  ) AS is_match
FROM ods.analytics_access_log_all
WHERE ts_date >= '2020-07-01'
AND site_id IN (10266,10022,10339,10030)
GROUP BY site_id;

 

┌─site_id─┬─is_match─┐
│   10030 │        1 │
│   10339 │        1 │
│   10266 │        1 │
│   10022 │        1 │
└─────────┴──────────┘

SELECT
  site_id,
  sequenceCount('(?1)(?t<=15)(?2).*(?3)')(
    ts_date_time,
    event_type = 'shtLogon',
    event_type = 'shtKkclick' AND column_type = 'homePage',
    event_type = 'shtAddCart'
  ) AS seq_count
FROM ods.analytics_access_log_all
WHERE ts_date >= '2020-07-01'
AND site_id IN (10266,10022,10339,10030)
GROUP BY site_id;
┌─site_id─┬─seq_count─┐
│   10030 │     33611 │
│   10339 │     14045 │
│   10266 │     74542 │
│   10022 │     31534 │
└─────────┴───────────┘

4.2  Intelligent path detection

The built-in sequenceMatch and sequenceCount functions only cover part of the requirements. A more complex one is:

Given the final event of a path, a required waypoint, and a maximum interval between events, find the details of the matching paths and the number of users on each path (in descending order of users).

There is currently no ready-made function that gives this result directly, but we can work around it and implement it indirectly with a combination of arrays and higher-order functions. The complete SQL statement is as follows:

SELECT
  result_chain,
  uniqCombined(user_id) AS user_count
FROM (
  WITH
    toUInt32(maxIf(ts_date_time, event_type = 'shtOrderDone')) AS end_event_maxt,
    arrayCompact(arraySort(
      x -> x.1,
      arrayFilter(
        x -> x.1 <= end_event_maxt,
        groupArray((toUInt32(ts_date_time), (event_type, column_type)))
      )
    )) AS sorted_events,
    arrayEnumerate(sorted_events) AS event_idxs,
    arrayFilter(
      (x, y, z) -> z.1 <= end_event_maxt AND (z.2.1 = 'shtOrderDone' OR y > 600),
      event_idxs,
      arrayDifference(sorted_events.1),
      sorted_events
    ) AS gap_idxs,
    arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
    arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
    arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events
  SELECT
    user_id,
    arrayJoin(split_events) AS event_chain_,
    arrayCompact(event_chain_.2) AS event_chain,
    hasAll(event_chain, [('shtKkClick', 'homePage')]) AS has_midway_hit,
    arrayStringConcat(arrayMap(
      x -> concat(x.1, '#', x.2),
      event_chain
    ), ' -> ') AS result_chain
  FROM (
    SELECT ts_date,ts_date_time,event_type,column_type,user_id
    FROM ods.analytics_access_log_all
    WHERE ts_date >= '2020-06-30' AND ts_date <= '2020-07-02'
    AND site_id IN (10266,10022,10339,10030)
  )
  GROUP BY user_id
  HAVING length(event_chain) > 1
)
WHERE event_chain[length(event_chain)].1 = 'shtOrderDone'
AND has_midway_hit = 1
GROUP BY result_chain
ORDER BY user_count DESC LIMIT 20;

Main idea

  • Use groupArray to assemble each user's behavior into tuples of <time, <event name, page name>>, and sort them in ascending time order with arraySort;
  • Use arrayEnumerate to obtain the index array of the original behavior chain;
  • Use arrayFilter and arrayDifference to find the boundary indexes in the original behavior chain; the boundary condition is that the event is the end of a path or the time gap exceeds the maximum interval;
  • Use arrayMap and has to build a mask over the index array (a sequence of 0s and 1s) for the final split, where 1 marks a split point;
  • Call arraySplit to cut the original behavior chain into per-visit behavior chains at the split points. Note that this function treats each split point as the start of a new chain, so the split-point indexes must be incremented by 1;
  • Call arrayJoin and arrayCompact to flatten the array of event chains into multiple rows, one chain per row, and remove adjacent duplicates;
  • Call hasAll to require that all of the specified waypoints are present. If any single waypoint is enough, switch to hasAny. The WHERE predicate can also be modified to exclude a specified waypoint instead;
  • Assemble the final result into a readable string and count users per behavior chain. Done.

 

4.3  Ordered funnel transformation

ClickHouse provides the windowFunnel function to implement funnels. Within a sliding window of the specified length (in seconds), it matches the event chain in order and returns the number of steps reached inside the window. If there are multiple matches, the one with the most steps (the deepest conversion) wins.

Take a custom funnel model as an example (a plain-SQL version of the static funnel query follows the code):

@Mapper
public interface FunnelMapper {
    @Select("select\n" +
            "total ,\n" +
            "(o1+o2+o3+o4) as v1 ,\n" +
            "(o2+o3+o4) as v2 ,\n" +
            "(o3+o4) as v3 ,\n" +
            "(o4) as v4 \n" +
            "from\n" +
            "(select\n" +
            "count(1) as total ,\n" +
            "sum(if(cc=4 , 1 , 0)) as o4 ,\n" +
            "sum(if(cc=3 , 1 , 0)) as o3 ,\n" +
            "sum(if(cc=2 , 1 , 0)) as o2 ,\n" +
            "sum(if(cc=1 , 1 , 0)) as o1 \n" +
            "from\n" +
            "(select\n" +
            "deviceId ,\n" +
            "windowFunnel(3600)(\n" +
            "toDateTime(ctime) ,\n" +
            "eventId='adShow'  ,\n" +
            "eventId='adClick',\n" +
            "eventId='productView' ,\n" +
            "eventId='submitOrder'\n" +
            ") as cc\n" +
            "from\n" +
            "tb_log \n" +
            "group by deviceId))")
    public EventCountBean getStaticFunnel() ;

    @Insert("insert into tb_funnel values(#{id},#{name},#{eventChannel})")
    public  void  addFunnel(TbFunnel tbFunnel) ;

    @Select("select * from tb_funnel")
    public List<TbFunnel> getAllFunnel() ;

    /**
     * MyBatis SQL parameter binding:
     *     #{}  precompiled parameter binding, recommended
     *     ${}  plain string substitution
     * @param name
     * @return
     */
    @Select("select * from tb_funnel where  name = #{name}")
    public  TbFunnel getFunnelByName(String name) ;

    @SelectProvider(type = MySQlProvider.class ,method = "getdynamicSQL")
    public EventCountBean getStaticFunnelByEventChannel(String  eventChannel) ;

    class MySQlProvider{
        public String  getdynamicSQL(String eventChannel){
            // build the SQL fragments from the event-chain string
            String[] split = eventChannel.split("-");
            StringBuilder sb = new StringBuilder();
            StringBuilder sb2 =  new StringBuilder();
            StringBuilder sb3 =  new StringBuilder();
            for(int i = 0 ; i<split.length ; i++)  {
                sb.append("eventId=").append("'"+split[i]+"'").append(",");
                sb2.append("sum(if(cc=").append(split.length-i).append(",1,0)) as o").append(split.length-i+",") ;
                for(int j = split.length ; j > i ; j--){
                    sb3.append("o").append(j).append("+") ;
                }
                sb3.deleteCharAt(sb3.length()-1).append(" as v").append(i+1).append(",");
            }
            String s1 = sb.deleteCharAt(sb.length() - 1).toString();
            String s2 = sb2.deleteCharAt(sb2.length() - 1).toString();
            String s3 = sb3.deleteCharAt(sb3.length() - 1).toString();
            String sql = "select\n" +
                    "total ,\n" +
                    s3+
                    " from\n" +
                    "(select\n" +
                    "count(1) as total ,\n" +
                    s2+
                    " from\n" +
                    "(select\n" +
                    "deviceId ,\n" +
                    "windowFunnel(3600)(\n" +
                    "toDateTime(ctime) ,\n" +
                    s1+
                    ") as cc\n" +
                    "from\n" +
                    "tb_log \n" +
                    "group by deviceId));";
            return  sql ;
        }

    }


}
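For readers who only care about the ClickHouse side, here is the static funnel query embedded in the mapper above, written as plain SQL (it assumes a tb_log table with deviceId, ctime, and eventId columns, as in the Java code):

SELECT
    total,
    (o1 + o2 + o3 + o4) AS v1,   -- reached at least step 1
    (o2 + o3 + o4)      AS v2,   -- reached at least step 2
    (o3 + o4)           AS v3,   -- reached at least step 3
    o4                  AS v4    -- reached step 4
FROM
(
    SELECT
        count(1)              AS total,
        sum(if(cc = 4, 1, 0)) AS o4,
        sum(if(cc = 3, 1, 0)) AS o3,
        sum(if(cc = 2, 1, 0)) AS o2,
        sum(if(cc = 1, 1, 0)) AS o1
    FROM
    (
        SELECT
            deviceId,
            windowFunnel(3600)(
                toDateTime(ctime),
                eventId = 'adShow',
                eventId = 'adClick',
                eventId = 'productView',
                eventId = 'submitOrder'
            ) AS cc
        FROM tb_log
        GROUP BY deviceId
    )
);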

 

 

4.4  User retention

 

The retention function makes it easy to compute retention. It accepts multiple conditions: based on whether the first condition holds, it checks whether each subsequent condition also holds, producing 1 where satisfied and 0 where not, and finally returns an array of 0s and 1s. Summing the 1s then gives the retention counts, from which the retention rate can be calculated.

SELECT
  sum(ret[1]) AS original,
  sum(ret[2]) AS next_day_ret,
  round(next_day_ret / original * 100, 3) AS next_day_ratio,
  sum(ret[3]) AS seven_day_ret,
  round(seven_day_ret / original * 100, 3) AS seven_day_ratio
FROM (
  WITH toDate('2020-06-24') AS first_date
  SELECT
    user_id,
    retention(
      ts_date = first_date,
      ts_date = first_date + INTERVAL 1 DAY,
      ts_date = first_date + INTERVAL 7 DAY
    ) AS ret
  FROM ods.ms_order_done_all
  WHERE ts_date >= first_date AND ts_date <= first_date + INTERVAL 7 DAY
  GROUP BY user_id
);

 

SELECT 
    sum(r[1]) AS r1,
    sum(r[2]) AS r2,
    sum(r[3]) AS r3,
    sum(r[4]) AS r4
FROM 
(
    SELECT 
        uid,
        retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03', date = '2020-01-04') AS r
    FROM retention_test
    WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04')
    GROUP BY uid
)

┌─r1─┬─r2─┬─r3─┬─r4─┐
│  5 │  3 │  3 │  1 │
└────┴────┴────┴────┘

4.5 Session statistics

The query below approximates the number of sessions per day: events are grouped per user and day, sorted by time, and every gap of more than 1800 seconds (30 minutes) between adjacent events is counted as a session boundary.

SELECT
  ts_date,
  sum(length(session_gaps)) AS session_cnt
FROM (
  WITH
    arraySort(groupArray(toUInt32(ts_date_time))) AS times,
    arrayDifference(times) AS times_diff
  SELECT
    ts_date,
    arrayFilter(x -> x > 1800, times_diff) AS session_gaps
  FROM ods.analytics_access_log_all
  WHERE ts_date >= '2020-06-30'
  GROUP BY ts_date,user_id
)
GROUP BY ts_date;

 


Copyright notice
This article was created by [HANGGE_ IT]. Please include a link to the original when reposting. Thanks.
https://chowdera.com/2020/12/20201217162217564J.html
