Building an Efficient Data Lake with Apache Hudi and Alluxio

Yang Hua and Zhang Yongxu of T3 Travel describe the evolution of their data lake architecture, which is built on several open-source technologies, including Apache Hudi and Alluxio. In this article, you will see how we used Hudi and Alluxio to cut data ingestion time in half, and how data analysts sped up their queries tenfold with Presto, Hudi, and Alluxio. We built the data lake with data orchestration applied to multiple stages of the data pipeline, including ingestion and analytics.

## 1. Overview of the T3 Travel data lake

T3 Travel is still in a period of business expansion. Before the data lake was built, each line of business chose its own storage systems, transport tools, and processing frameworks, which produced serious data silos and made it very costly to mine the value of the data. With the business growing rapidly, this inefficiency became our engineering bottleneck.

We turned to Alibaba Cloud OSS (an object store similar to AWS S3) as a unified data solution, adopting a multi-cluster, shared-data architecture that provides a centralized location for storing structured and unstructured data. In contrast to the former data silos, all applications access OSS as the single source of truth. This architecture lets us store data as-is, without having to structure it first, and run different types of analytics to guide better decisions: dashboards and visualizations built on big data processing, real-time analytics, and machine learning.

## 2. Efficient near-real-time analytics with Hudi

T3 Travel's smart-mobility business drives the demand for near-real-time data processing and analytics. With a traditional data warehouse, we faced the following challenges:

* Long-tail updates caused frequent, cascading updates of cold data
* Long business windows made backtracking over order analytics expensive
* Random updates and late-arriving data were unpredictable
* Data ingestion pipelines had no reliability guarantees
* Data lost in distributed pipelines could not be reconciled
* Data ingestion latency was very high

We therefore adopted Apache Hudi on OSS to solve these problems. The diagram below shows the Hudi architecture:

![](https://img2020.cnblogs.com/blog/616953/202012/616953-20201206211038346-445961078.png)

### 2.1 Enabling near-real-time ingestion and analytics

The T3 Travel data lake ingests Kafka messages, MySQL binlogs, GIS data, business logs, and other sources in near real time; more than 60% of the company's data is already stored in the data lake, and the share keeps growing. By introducing Hudi into the data pipeline, T3 Travel shortened data ingestion time to a few minutes, and combined with interactive big data query and analytics frameworks such as Presto and Spark SQL, this enables more real-time insight into the data.

### 2.2 Enabling incremental processing pipelines

T3 Travel relies on Hudi's incremental query capability for multi-layer data processing in frequently changing scenarios. Only the incremental changes are fed to the downstream derived tables, and the downstream tables simply apply those change records. This lets us quickly refresh the affected data across a multi-layer pipeline, greatly reducing the cost of data updates in frequently changing scenarios, and effectively avoids the full-partition, cold-data rewrites of a traditional Hive data warehouse.
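As a minimal sketch (not from the original article; the table path and the begin instant below are hypothetical placeholders), a Hudi incremental read of this kind looks roughly like the following in Spark:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-incremental-read")
  .getOrCreate()

// Pull only the records committed after a given instant, instead of
// rescanning full partitions of the upstream table.
val changes = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20201206000000")
  .load("oss://t3-bucket/warehouse/orders_hudi")

// Apply just these change records to the downstream derived table.
changes.createOrReplaceTempView("orders_changes")
spark.sql("SELECT COUNT(*) FROM orders_changes").show()
```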
### 2.3 Using Hudi as a unified data format

A traditional data warehouse often deploys Hadoop to store the data and serve batch analytics, while Kafka is used separately to distribute the data to other processing frameworks, resulting in duplicated copies of the data. Hudi effectively solves this problem: we always insert the latest updates into the Hudi table through a Spark-Kafka pipeline, and downstream consumers read the Hudi table's updates incrementally. In other words, Hudi unifies the storage.

## 3. Efficient data caching with Alluxio

Earlier versions of the data lake did not use Alluxio. Spark processed the data received from Kafka in real time and then wrote it to OSS with a Hudi DeltaStreamer job. In this setup, Spark wrote directly to OSS with high network latency, and since all data sat in OSS there was no data locality, so OLAP queries against the Hudi data were also very slow. To address the latency, we deployed Alluxio as a data orchestration layer co-located with the compute engines such as Spark and Presto, and used Alluxio to accelerate reads and writes on the data lake, as shown below:

![](https://img2020.cnblogs.com/blog/616953/202012/616953-20201206211137605-194330172.png)

Most of the data, in formats such as Hudi, Parquet, ORC, and JSON, is stored on OSS, accounting for 95% of the total. Compute engines such as Flink, Spark, Kylin, and Presto are deployed in isolated clusters. When each engine accesses OSS, Alluxio acts as a virtual distributed storage system that accelerates the data and coexists with each compute cluster. Here is how Alluxio is used in the T3 Travel data lake:

### 3.1 Data ingestion into the lake

We co-deploy Alluxio with the compute cluster. Before data is ingested, the corresponding OSS path is mounted into the Alluxio file system, and Hudi's "--target-base-path" argument is changed from oss://... to alluxio://... . During ingestion, we use the Spark engine to launch the Hudi job that continuously ingests data, and the data flows through Alluxio. Once the job is running, the data is asynchronously synchronized from the Alluxio cache to the remote OSS at roughly one-minute intervals. Spark thus switches from writing to the remote OSS to writing to local Alluxio, which shortens the time it takes for data to enter the lake.
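As a hedged sketch of such a Spark-Kafka ingestion pipeline (the broker, topic, schema, checkpoint path, and table location below are hypothetical, and T3's actual job uses Hudi DeltaStreamer rather than hand-written code), a Structured Streaming job can upsert each micro-batch into a Hudi table whose base path points at Alluxio:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("kafka-to-hudi-on-alluxio")
  .getOrCreate()
import spark.implicits._

// Expected layout of the JSON messages on the Kafka topic (assumed).
val schema = StructType(Seq(
  StructField("order_id", StringType),
  StructField("status", StringType),
  StructField("updated_at", TimestampType),
  StructField("dt", StringType)
))

// Consume the raw stream and parse each message.
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "orders")
  .load()
  .select(from_json($"value".cast("string"), schema).as("r"))
  .select("r.*")

// Upsert every micro-batch into the Hudi table. The base path uses the
// alluxio:// scheme; Alluxio persists the files to the mounted OSS path
// asynchronously, so Spark only pays the cost of a local write.
val query = parsed.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .format("hudi")
      .option("hoodie.table.name", "orders_hudi")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "order_id")
      .option("hoodie.datasource.write.precombine.field", "updated_at")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .mode("append")
      .save("alluxio://alluxio-master:19998/warehouse/orders_hudi")
  }
  .option("checkpointLocation", "alluxio://alluxio-master:19998/checkpoints/orders")
  .start()

query.awaitTermination()
```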
### 3.2 Data analytics on the lake

We use Presto as a self-service query engine to analyze the Hudi tables in the lake. An Alluxio worker is co-located on every Presto worker node. When Presto and Alluxio run side by side, Alluxio caches the input data locally on the Presto workers and serves subsequent retrievals at memory speed. Presto can then perform short-circuit reads against the local Alluxio worker, with no additional network transfer.

### 3.3 Concurrent access across multiple storage systems

To ensure the accuracy of training samples, our machine learning team often synchronizes desensitized production data to the offline machine learning environment. During synchronization, the data flows across multiple file systems: from the production OSS to the offline data lake cluster's HDFS, and finally on to the machine learning cluster's HDFS. For data modelers, this migration process is not only inefficient but also error-prone, because it involves multiple file systems with different configurations.

So we introduced Alluxio and mounted the multiple file systems under the same Alluxio namespace, unifying them. End to end, each application uses its own Alluxio path, which lets applications with different APIs access and transfer the data seamlessly. This data access layout also improves performance.
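As a hedged illustration (the mount points, host names, and paths are hypothetical, and the code assumes the Alluxio client jar is on the classpath so that the alluxio:// scheme resolves), once the stores are mounted under one namespace, a single Hadoop FileSystem client can move data between them with one path scheme:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

object UnifiedNamespaceSync {
  def main(args: Array[String]): Unit = {
    // Assumes mounts created beforehand with the Alluxio CLI, e.g.:
    //   alluxio fs mount /prod oss://prod-bucket/datasets
    //   alluxio fs mount /ml   hdfs://ml-nn:8020/training
    val conf = new Configuration()
    val fs = FileSystem.get(new URI("alluxio://alluxio-master:19998/"), conf)

    // One client and one scheme, regardless of the backing store: copy
    // desensitized production data to the ML cluster without juggling
    // separate OSS and HDFS configurations.
    FileUtil.copy(
      fs, new Path("/prod/orders_desensitized/dt=2020-12-01"),
      fs, new Path("/ml/orders_desensitized/dt=2020-12-01"),
      false /* deleteSource */, conf)
  }
}
```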

### 3.4 Benchmark results

Overall, we observed the following advantages of Alluxio:

* Alluxio supports hierarchical and transparent caching;
* Alluxio supports cache promotion on read (the CACHE_PROMOTE read type);
* Alluxio supports asynchronous write mode;
* Alluxio supports an LRU eviction policy;
* Alluxio provides pinning and TTL features.

For comparison and verification, we chose Spark SQL as the query engine and queried the Hudi tables over three different storage layers: Alluxio + OSS, plain OSS, and mixed-deployed HDFS. During the stress tests we found that once the data volume exceeded a certain magnitude (24 million rows), queries through Alluxio + OSS became faster than queries against the mixed-deployed HDFS, and beyond 100 million rows the speedup began to double. At 600 million rows, queries were roughly 12x faster than against native OSS and roughly 8x faster than against native HDFS. The larger the data size, the larger the benefit, with the exact multiplier depending on the machine configuration.

![](https://img2020.cnblogs.com/blog/616953/202012/616953-20201206211257977-1180076827.jpg)

## 4. Looking ahead

As the T3 Travel data lake ecosystem expands, we will continue to face the critical scenario of separated compute and storage. With T3's data processing demands growing, our team plans to deploy Alluxio at larger scale to strengthen the data lake's query capability. So in addition to the data lake compute engines (mainly Spark SQL), where Alluxio is already deployed, we will subsequently put Alluxio in front of the OLAP cluster (Apache Kylin) and the ad-hoc cluster (Presto). Alluxio will then cover all of these scenarios, with the Alluxio deployments of each scenario interconnected, improving read and write efficiency across the data lake and its surrounding ecosystem.

## 5. Conclusion

As described above, Alluxio covers all the scenarios of Hudi near-real-time ingestion, near-real-time analytics, incremental processing, and data distribution on the DFS, acting as a powerful accelerator both on the way into the lake and for analytics on the lake; the two truly join forces. In concrete scenarios, our engineers cut the time for data to enter the lake roughly in half, and data analysts querying the lake with Presto + Hudi + Alluxio saw their queries speed up by 10x or more. Alluxio has become an important part of T3 Travel's enterprise data lake solution, and we look forward to even deeper integration between the T3 Travel data lake ecosystem and Alluxio.