
The Road of OLAP Practice at JD.com

2021-05-18 08:51:00 InfoQ

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" This paper mainly introduces the construction of OLAP From scratch, the key points to be considered in all aspects , Starting from demand scenarios , Analyze the current problems , And provide solutions , Finally, it introduces OLAP Development process of ."}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":" Demand scenarios "}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Jingdong data portal "}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b5\/b5c4ad6c042717ed21c8ecbf1294bfff.png","alt":" picture ","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"① Business data : Order "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" As an e-commerce enterprise, Jingdong , It has self operated commodity sales platform and full link logistics channel . First of all, the first data entry is orders , Multi dimensional analysis of different orders , such as : Analysis of orders 、 Analysis of stores 、 Analysis of commodity categories and so on ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" For the order dimension , Because we as e-commerce , Order is a very important dimension analysis and data support , So often combined with orders SKU, Calculate the conversion rate of category orders ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"② Behavioral data : Click and search "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The user's operation in Jingdong Mall , Like click and search , We will combine these behaviors of users with order information , To do some analysis , For example, analyze the degree of hot sale and unsalable sale of goods , And transformation , The most common is funnel analysis , To calculate the conversion ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"③ Advertising and recommendation "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Based on the user's order , We will push ads and recommendations to users , Then calculate , Analyze the situation of advertising touch ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"④ Monitoring indicators "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" For monitoring indicators , It's a huge scene for 
### 2. Data exit points at JD.com

![picture](https://static001.geekbang.org/infoq/aa/aa935014092a28eeffadefd33f908371.png)

Data leaves JD.com along two main paths: offline and real time.

**① Offline**

- Monthly and weekly reports

A typical offline scenario: financial statements usually require monthly and weekly reports.

- Machine learning

Model training consumes a lot of data, and OLAP is also used to analyze the data and generate that training data.

**② Real time**

- Interactive queries

Colleagues outside R&D, such as operations analysts, often need ad-hoc queries on business data, for example the summary and detail data of the last week's orders, to analyze the data and support decisions.

- Real-time dashboards

During promotions, or for day-to-day operational monitoring, resources are adjusted dynamically in real time according to the metrics shown on a large screen, for example marketing strategy, advertising spend, and supply-chain inventory. This matters most during big sales, when resources can be reallocated on the fly, for example scheduling delivery vehicles or deciding, based on the promotion's performance, whether a campaign needs to be adjusted.
## Problems and solutions

The above are some of JD.com's business scenarios. Unlike most descriptions of data architectures, we split the data-platform build-out into four aspects: how data is written in; how it is stored once written; how it is accessed once stored; and finally how the previous three links are managed.

### 1. Writing

![picture](https://static001.geekbang.org/infoq/57/578943eabb8b7d574d38f5c6363390d1.png)

**① Diverse data sources and data formats**

**Current situation:**

Data sources are diverse: file systems (local files and distributed file systems) and MQ.

- Local files, for example data sitting on a server, imported directly.
- When the data volume is too large to keep locally, the data is stored on HDFS.
- Kafka or other MQ data: upstream systems write the data they produce straight into a message queue. Inside JD.com there is more than one such MQ system, but we need unified access to all of them.
- Data formats are equally diverse; the most common are CSV, TSV, JSON, AVRO, PARQUET, BINLOG, and so on.
"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Multiple data sources and data types , For developers , The cost of data analysis is relatively high , We want analysts to , Just focus on the business logic itself , Go straight ahead SQL Inquire about , You don't have to care about the source of the data ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Establish a unified import data service , Encapsulate diverse data sources and data types , By configuring permissions for users , Users only need to operate directly in the visual interface , To complete the data import operation , The specific operation is also relatively simple , With MQ Examples of data sources :"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Select the topic"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Specify the source of the import "}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Choose the data format "}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Select the corresponding data field type and so on "}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"② Timeliness of data "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" present situation :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Real time data needs to be calculated and displayed in real time ; Offline data , It can be pushed and calculated regularly , The requirement for timeliness is relatively low ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" How to ensure the timeliness of real-time and offline data , And will not affect each other ?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Physical isolation of real-time and offline clusters , Prevent interference with each other , The benefits of this are 
:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Real time and offline demand different resources , Real time data writing is very frequent , The offline data is more at the specified time point , Separate, easy to manage ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"③ Data update and deletion "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" about OLAP In terms of Architecture , How to search and update ?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Data update , Using the overlay write scheme . Take the order for example , The user placed an order 1, At this time, the order status is 1, Later, the user carried out the payment operation , The order status is 2, So we're going to push a full amount of data again , It's just that the field value of the state changes ."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Data deletion , Delete the partition and re import the partition data , Or version management , Because the new version of the data will cover the original version , Play the role of deletion ."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"③ High throughput "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" With the current volume of Jingdong , The daily data is huge , How to solve the problem of high throughput ?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The computer room is equipped with 10 Gigabit network , In addition, for real-time scenes , Equipped with SSD; Offline scenarios , Equipped with HDD."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 
save "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8d\/8d86f16ac26fe3e513abd25fb4c115bc.png","alt":" picture ","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"① Huge amounts of data "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Jingdong's data ,TB Data volume is very common , Sometimes it will reach PB Level , Then, the stand-alone mode will definitely not work , So how to solve the problem of data storage ?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Using distributed technology to solve the problem of large-scale data ."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" In order to improve efficiency , Column storage is adopted , The calculation efficiency of follow-up indicators is also improved , Actually in OLAP In the framework of , Most of them are column storage ."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Using different types of compression , such as snappy etc. 
."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"② Fault tolerance "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Data is now a very important asset for technology companies , So how to ensure the security of data ?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" firstly : Fault tolerance of multiple copies , our OLAP In the form of three copies , So among them 1 Or 2 It's easier to deal with the expansion when a replica is damaged or migrated ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" second :RAID Redundant array of independent disks (RAID,redundant array of independent disks), Because disk metal machines often break down , In addition, there is a risk of damage to some machines , So we also pass raid Solve part of the problem of data fault tolerance , Prevent the whole machine from providing service after the disk is broken ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"③ Uniformity "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Solve the problem of data consistency , Distributed coordination and local transaction mechanism are needed ."}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Distributed coordination , such as zookeeper"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Local transaction mechanism : For local submission , Whether to ensure the consistency of data through transactions "}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" By combining the two , To achieve data consistency ."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 
read "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/04\/049231f4757bd9d19952b1251104a6ce.png","alt":" picture ","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"① Query speed "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" After the data is stored in the system , How to use it efficiently ?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" General way , Data is partitioned , In many scenarios , The analysis of data is divided according to time , After partition , Then slice or barrel ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" such as : Order information , We may check by day , Then you can partition according to the actual date ; such as , Storage 10 Years of data , It's impossible to query all , According to statistics, we divide it by month ."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Prepolymerization , Precomputing ahead of time , Reduce the amount of data for one-time calculation , Improve performance ."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Indexes , Most scenarios use indexes , For example, do detailed inquiry , It may come to hash Indexes 、betree、 Range query or inverted ."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Materialized view , It's actually similar to the function of prepolymerization , When data enters materialized view , Do some pre calculations ahead of time ."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"② Ease of use "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" How to make the system easy to use ?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution 
:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" need OLAP System compatible JDBC and ODBC, At the same time, it supports the standard SQL."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Provide interface operation , In this way, all operational analysts can avoid having Mysql And so on , You can directly log in to the graphical interface for query operation ."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"③ QPS"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" How to improve QPS?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" cache , Use doris when , Add, for example partitioncache, Or result level caching , Or partition caching mechanism to improve query efficiency ."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Multiple copies , How to set up multiple copies , By sacrificing part of the storage space , To enhance QPS, But the effect is limited ."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Scale up the machine ."}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. 
management "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8e\/8e0a3324d51f6f59ef03a4c3720a01c3.png","alt":" picture ","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" present situation :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" In the early days, we had a disk failure , In a hurry , It takes a long time to replace the disk or get off the machine , And this operation also involves rebalancing data or data migration , This process is very cumbersome ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" problem :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" How to reduce the operation and maintenance cost ?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" Solution :"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Build monitoring and alarm mechanism , Prevention ahead of time ."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Optimize the offline of nodes , In the Jingdong system , There are two ways :"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The way 1: By blacklisting , If monitoring a node is often unhealthy , We'll blacklist it , And kicked him off the line ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The way 2: Through script operation , But from the beginning we need to replace a node , It's estimated to take three hours from discovery to replacement , Now through something more automatic , Now we can replace a node in about ten minutes ."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":" development history "}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.0 Time "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/de\/de807a04152713c7de652d4c8a91f8ac.png","alt":" picture ","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.0 There are fewer scenes in the era , The amount of data is also relatively small , Mainly some data related to orders , Analysis can be done through a relational database , for instance oracle perhaps 
mysql."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" It can be synchronized to the backup database for analysis through data reserve , perhaps mysql One is slave library , The main library does some online business , Then prepare some analysis and query of the inventory ."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.0 Time "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1d\/1d4852bd7be95151d71b828e99e2b39f.png","alt":" picture ","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" here we are 2.0 scene , It's no longer a simple order issue , And deal with logistics 、 Supply chain 、 Customer service 、 Payment and so on ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" There's a big increase in the number of scenes , The amount of data has also exploded , From the original G To the present TB even to the extent that PB Level . Traditional relational databases , It can't meet the demand ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" At this time, we started to build an offline data warehouse , Analysis in data warehouse , It mainly uses Hive and Spark Calculate , The data are all T-1, The experience of temporarily querying data is very poor , It takes minutes and it's still yesterday's data ."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.0 Time "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ac\/ac5d1eb30d39506361822f84fbd46be2.png","alt":" picture ","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" In order to improve the speed of data query and the timeliness of data , Start doing real-time queries . 
We now run a unified OLAP service. It started with Kylin handling some offline workloads; today we combine Doris and ClickHouse to serve both real-time and offline workloads behind a unified service interface that is open to users and developers.

At the same time, different businesses are given different compute engines. We now offer a complete packaged solution: a business only needs to onboard onto the OLAP service to be deployed on the data technology platform, which greatly reduces development cost.

## Future planning

![picture](https://static001.geekbang.org/infoq/a1/a1319d3d23918b10a2ab87715d8cc72d.png)

**① Management platform optimization**

- Dynamic scaling of ClickHouse.
- Intelligent operations: automatically optimizing node onboarding/offboarding and data rebalancing. The main goal is to reduce manual intervention and wasted time; JD.com's system now runs on thousands of servers, and handling everything manually is extremely expensive.

**② Query speed optimization**

- Optimize caching for real-time computation. In Doris, the partition cache and SQL cache are more effective in offline scenarios; in the future we will consider bringing caching into real-time computation as well.
- Intelligent index management. There are many index types across engines, and expecting developers to understand all of them carries a high learning cost. Can we build an intelligent indexing engine instead? Then users would not need to care about how to build such indexes: we would analyze their query behavior and create indexes for them automatically, simplifying their work and leaving developers more room to focus on operations and analysis.
## Q & A

**Q1: Has your team used the Druid engine?**

A1: No. We found, first, that Druid's SQL support is not very friendly; second, Druid is good at scenarios where data does not change once written, whereas JD.com's orders change state frequently, which Druid does not handle easily, so we did not choose it.

**Q2: How do you choose between ClickHouse and Doris for a given business scenario, and what are the pitfalls?**

A2:

First, query performance: when few tables are involved, that is, there are not many joins, ClickHouse is relatively fast; but for joins across large tables, Doris performs better than ClickHouse.

Second, QPS: ClickHouse pushes CPU usage to the extreme, but Doris delivers better QPS than ClickHouse in some scenarios.

Third, operations cost: Doris costs less to operate than ClickHouse, at least for us today, because it can automatically take nodes online and offline and rebalance, which is very convenient, whereas this is troublesome with ClickHouse.

Fourth, data updates: ClickHouse has three kinds of data-update engines, which is cumbersome for users, while Doris simply overwrites the data to perform an update, which is easier to operate.

**Q3: Is the choice between Doris and ClickHouse automatic, or do users still choose themselves? If it is automatic, what is the selection mechanism?**

A3: Automatic selection has not been implemented yet.
For now we focus on providing the complete solution and giving users technical support; in the future we may think about how to unify the two, because their data models are created differently and their SQL dialects differ.

**Q4: What is JD.com's current plan for data access into ClickHouse?**

A4: There are currently two routes.

One is that users build their own access layer on the service side: for historical reasons some users have been on ClickHouse for a long time and already have a relatively mature ingestion system of their own.

The other is accessing ClickHouse through our platform, whether the data is offline or comes from Kafka: on the platform side we provide the unified OLAP service, and users operate on top of it. As described earlier, the user selects the data source, selects the target, clicks to run the import, and it is done.

**Speaker:**

**Li Yang**

JD.com | Senior R&D Engineer

Senior R&D engineer at JD.com with more than ten years of development experience, specializing in OLAP-related service development and distributed system design.

Reprinted from: DataFunTalk (ID: dataFunTalk)

Original article: [The Road of OLAP Practice at JD.com](https://mp.weixin.qq.com/s/y78sQPP9Fp2S3ZgPSCeK-Q)

Copyright notice
This article was created by [InfoQ]. Please include the original link when reprinting. Thanks.
https://chowdera.com/2021/05/20210518084932335q.html
