
Lakehouse: A New Open Platform Unifying Data Warehousing and Advanced Analytics

2021-01-24 02:13:31 itread01

### 1. Overview

The data warehouse architecture will gradually decline and be replaced by a new Lakehouse architecture with the following characteristics:

* based on open data formats, such as Parquet;
* machine learning and data science supported as first-class citizens;
* excellent performance.

The Lakehouse addresses several major challenges of the data warehouse, namely **data staleness**, **reliability**, **total cost**, **closed data formats**, and **limited use-case support**.

### 2. The evolution of data analytics platforms

Data warehouses collect data from operational databases into a centralized warehouse to give enterprise decision makers analytical insight, which is then used for decision support and business intelligence (BI). The warehouse loads data with schema-on-write, optimized for downstream consumers. This was the first generation of data analytics platforms.

![](https://img2020.cnblogs.com/blog/616953/202101/616953-20210123224342113-671702509.png)

Over time the first-generation systems began to face several challenges. First, the coupling of compute and storage made scaling expensive. Second, more and more datasets were unstructured, such as video, audio, and text files, which a data warehouse can neither store nor query.

To solve these problems, the second-generation data analytics platform was introduced, which collects all raw data into a data lake: a low-cost storage system with a file API that holds data in generic, usually open file formats such as Apache Parquet and ORC, initially built on HDFS for cheap storage. The data lake is a schema-on-read architecture that flexibly stores any data at low cost; a small portion of that data is later ETLed into a downstream data warehouse to serve the most important decision-support and BI applications.

![](https://img2020.cnblogs.com/blog/616953/202101/616953-20210123224437528-261151975.png)

From 2015 onward, cloud data lakes such as S3, ADLS, GCS, and OSS began to replace HDFS. The cloud architecture is essentially the same as the second-generation systems, with warehouses such as Redshift, Snowflake, and ADB layered on top. This two-tier data lake + data warehouse architecture dominates the industry today (nearly all Fortune 500 companies use it), but it faces challenges of its own. Although the separation of storage (e.g., S3) from compute (e.g., Redshift) makes the cloud lake-plus-warehouse architecture nominally cheap, the two-tier design is very complex for users. In first-generation platforms all data was ETLed directly from operational systems into the warehouse; here, data is first ETLed into the lake and then ELTed into the warehouse, introducing extra complexity, delay, and failure modes, while enterprise use cases such as machine learning are poorly supported by either the lake or the warehouse. Specifically, current data architectures commonly run into the following four problems:
* **Reliability**. Keeping the data lake and the warehouse consistent is difficult and costly. The ETL jobs between the two systems must be designed carefully, and every ETL step carries a risk of failing or introducing errors, for example data-quality degradation caused by subtle differences between the lake and warehouse engines.
* **Data staleness**. Compared with the data in the lake, the data in the warehouse is stale; it often takes days for new data to be loaded. This is a regression from the first-generation analytics systems, where new operational data was immediately available for query.
* **Limited support for advanced analytics**. Businesses want to use their data for predictions, but machine learning systems such as TensorFlow, PyTorch, and XGBoost do not work well on top of data warehouses. Unlike BI queries, which extract small amounts of data, these systems process large datasets with complex non-SQL code, and reading that data over ODBC/JDBC is inefficient; there is no way to access the warehouse's internal proprietary formats directly. For these cases vendors recommend exporting the data to files, which adds further complexity and staleness (a third ETL step!). Alternatively, users can run these systems against data in open formats, but then they lose the rich management features of the warehouse, such as ACID transactions, data versioning, and indexing.
* **Total cost**. Besides paying for ETL jobs, users pay double the storage cost for data copied into the warehouse, and the proprietary internal formats of commercial warehouses raise the cost of migrating data or workloads to other systems.

One widely adopted alternative is to forgo the data lake and store all data in a warehouse with built-in separation of compute and storage, but the feasibility of this solution is limited: it cannot manage video/audio/text data or serve ML and data-science workloads. As more and more business applications come to rely on operational data and advanced analytics, a Lakehouse architecture can eliminate the major challenges of the data warehouse. The Lakehouse's time has come.

![](https://img2020.cnblogs.com/blog/616953/202101/616953-20210123224525509-700024283.png)

The Lakehouse provides solutions to the following key problems:
* **Reliable data management on the data lake**: a Lakehouse needs to store raw data while supporting ETL/ELT processes that improve the quality of the data for analysis. Traditional data lakes manage semi-structured data as "just a bunch of files," so they struggle to offer the key management features that simplify ETL/ELT in data warehouses, such as transactions, version rollback, and zero-copy cloning. Newer data lake frameworks (such as Delta, Hudi, and Iceberg) provide a transactional view of the data lake along with these management features, requiring fewer ETL steps and letting analysts efficiently query the raw data tables, much as in the first-generation analytics platforms.
* **Support for machine learning and data science**: ML systems can already read data lake formats directly, and many adopt DataFrames as the abstraction for manipulating data. A declarative DataFrame API enables query optimization of the data-access portion of ML workloads, which can directly benefit from the many optimizations in a Lakehouse.
* **SQL performance**: a Lakehouse needs to provide state-of-the-art SQL performance over massive Parquet/ORC datasets, whereas classic data warehouses optimize SQL more thoroughly (including through proprietary storage formats). A Lakehouse can employ a variety of techniques to maintain auxiliary data about Parquet/ORC datasets and to optimize the data layout within these existing formats for better performance.

Current industry trends show that customers are dissatisfied with the two-tier lake + warehouse architecture. In recent years almost all data warehouses have added external-table support for the Parquet and ORC formats, which lets warehouse users query the data lake from the same SQL engine (through connectors). But this does not make lake data easier to manage, and it does not eliminate the ETL complexity, staleness, and advanced-analytics challenges of warehouse data. In practice the performance of these connectors is usually poor, because SQL engines are mostly optimized for their internal data formats. Such analytics engines alone can neither solve all the problems of the data lake nor replace the warehouse: the lake still lacks basic management features (e.g., ACID transactions) and efficient access methods (e.g., indexes that match data warehouse performance).

### 3. The Lakehouse architecture

A Lakehouse can be defined as **a data management system based on low-cost, directly accessible storage that also provides the management and performance features of a traditional analytical DBMS, such as ACID transactions, data versioning, auditing, indexing, caching, and query optimization**. The Lakehouse thus combines the main advantages of data lakes and data warehouses: from the former, low-cost storage in open formats accessible by a wide variety of systems; from the latter, powerful management and optimization features. The core question is whether these advantages can be combined effectively: in particular, the Lakehouse's support for direct access means giving up some aspects of data independence, which has been a cornerstone of relational DBMS design. Lakehouses are especially suitable for cloud environments with separate compute and storage: different computing applications can run on demand on completely independent compute nodes (e.g., a GPU cluster for ML) while directly accessing the same stored data. A Lakehouse can also be implemented over an on-premises storage system such as HDFS.

#### 3.1 Implementing a Lakehouse system

The first key idea for implementing a Lakehouse is to **store data in low-cost object storage (e.g., Amazon S3) using standard file formats (such as Apache Parquet), and to implement a transactional metadata layer on top of the object store that defines which objects are part of which table version**.
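To make the idea concrete, here is a minimal sketch of such a metadata layer, not any real system's design: each table version is a small manifest object listing the Parquet files in that snapshot, so every reader sees a consistent set of files. The `_manifests` layout and the `load_snapshot` helper are hypothetical names for illustration.

```python
import json
import fsspec
import pyarrow.dataset as ds

# Hypothetical layout: <table_root>/_manifests/v<N>.json lists the Parquet
# objects that make up version N of the table. Writers commit by writing a
# new manifest, so readers never observe a half-written table.

def load_snapshot(fs: fsspec.AbstractFileSystem, table_root: str, version: int):
    """Return the Parquet file paths belonging to one table version."""
    with fs.open(f"{table_root}/_manifests/v{version}.json") as f:
        manifest = json.load(f)  # e.g. {"files": ["part-000.parquet", ...]}
    return [f"{table_root}/{name}" for name in manifest["files"]]

# Any engine that understands Parquet can then scan exactly that snapshot:
fs = fsspec.filesystem("s3")
files = load_snapshot(fs, "my-bucket/events", version=42)  # illustrative path
table = ds.dataset(files, format="parquet", filesystem=fs).to_table()
```

Note that the objects themselves remain plain Parquet; only the answer to "which files are the table right now?" moves into the metadata layer.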
This metadata layer lets the system offer management features such as ACID transactions and versioning while keeping the bulk of the data in low-cost object storage, and it allows clients to read objects directly from the store using standard file formats. The metadata layer adds management features, but it is not by itself enough to achieve good SQL performance. Data warehouses use many techniques for performance, such as storing hot data on SSDs, maintaining statistics, building efficient access methods (e.g., indexes), and co-optimizing the data format with the compute engine. In a Lakehouse built on existing storage formats the format itself cannot be changed, but other optimizations can still be implemented while leaving the data files untouched, including caching, auxiliary data structures (such as indexes and statistics), and data layout optimization.

A Lakehouse can both accelerate advanced-analytics workloads and give them better data management features. Many ML libraries (e.g., TensorFlow and Spark MLlib) can already read file formats such as Parquet, so the easiest way to integrate them with a Lakehouse is to query the metadata layer to determine which Parquet files belong to a table and then pass those files to the ML library.

![](https://img2020.cnblogs.com/blog/616953/202101/616953-20210123224654948-1547321106.png)

#### 3.2 A metadata layer for data management

The first component of a Lakehouse is a metadata layer that provides ACID transactions and other management features. Storage systems such as S3 or HDFS offer only a low-level object-store or filesystem interface in which even simple operations (such as updating a table that spans multiple files) are not atomic. This problem led organizations to design richer data management layers, starting with Apache Hive ACID, which uses an OLTP DBMS to track which data files are part of a Hive table at a given table version and allows operations to update that set transactionally. In recent years, new systems have provided more features and better scalability. Delta Lake, developed by Databricks in 2016, stores the information about which objects belong to the table in the data lake itself, as a transaction log in Parquet format, enabling it to scale to billions of objects per table. Apache Iceberg, which originated at Netflix, uses a similar design and supports both Parquet and ORC storage. Apache Hudi, which originated at Uber, is similar as well, although it does not yet support concurrent writers (support is in progress); the system focuses on simplifying streaming ingestion into the data lake. Experience with these systems shows that they can offer performance similar to or better than raw Parquet/ORC while adding very useful management features such as transactions, zero-copy cloning, and time travel. The metadata layer also matters for data quality: for example, it can enforce schema checks so that bad data does not corrupt a table. It can likewise implement governance features such as access control and audit logging: before granting a client credentials to read a table's raw data from the cloud object store, the layer can check whether the client is allowed to access the table and record all access behavior.
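As a sketch of what querying such a metadata layer looks like in practice, here is the idea using the open-source `deltalake` Python package (the delta-rs bindings); the table path is illustrative.

```python
from deltalake import DeltaTable

# Open a Delta table at a specific version ("time travel"). The transaction
# log under <path>/_delta_log -- not a directory listing -- determines which
# Parquet objects belong to this snapshot.
dt = DeltaTable("s3://my-bucket/events", version=42)  # illustrative path

# The resolved snapshot: exactly the Parquet files of version 42,
# ready to hand to any engine or ML library that reads Parquet.
print(dt.file_uris())

# Or read the snapshot directly, e.g. into pandas.
df = dt.to_pandas()
```

Because the snapshot is resolved from the log rather than from listing the directory, a concurrent writer can never leave a reader with a half-committed view of the table.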
**Future directions and alternative designs.** Because data lake metadata layers are so new, there are many open questions and alternative designs. For example, Delta Lake stores its transaction log in the same object store it manages (e.g., S3) to simplify operations (no separate storage system to run) and to provide high availability and high read bandwidth, but the high latency of object storage limits the transaction rate it can support per second; designs that keep metadata in a faster storage system may be preferable in some cases. Similarly, Delta Lake, Iceberg, and Hudi support only single-table transactions today, but they could be extended to support cross-table transactions. Optimizing the format of the transaction log and the size of the managed objects are also open questions.

#### 3.3 SQL performance in the Lakehouse

Probably the biggest technical challenge for the Lakehouse approach is providing state-of-the-art SQL performance while giving up a significant degree of the data independence found in traditional DBMS designs. There are many possible approaches: for example, adding a caching layer on top of the object store, or asking whether the storage format of data objects can be changed rather than relying on existing standards such as Parquet and ORC (new designs keep emerging that improve on these formats). Whatever the design, the core challenge is that the data storage format becomes part of the system's public API in order to allow fast direct access, which differs from a traditional DBMS. We propose several techniques for optimizing SQL performance in a Lakehouse that are independent of the data format and can therefore be used with both existing and future formats (a sketch of how two of them combine into data skipping follows this list):

* **Caching**: when a metadata layer is used, a Lakehouse system can safely cache files from cloud object storage on faster storage (e.g., SSDs and RAM) on the processing nodes; a running transaction can determine whether a cached file is still valid before reading it. The cache can also hold data in a transcoded form that is more efficient for the query engine; for example, the cache in Databricks partially decompresses the Parquet data it loads.
* **Auxiliary data**: even though a Lakehouse must expose the base table storage format (such as Parquet) for direct I/O, it can maintain additional data that helps optimize queries, such as min-max statistics for each column of every Parquet file in a table, which enable data skipping, or Bloom-filter-based indexes. A wide variety of auxiliary data structures can be implemented, analogous to indexing the "raw" data.
* **Data layout**: data layout plays a large role in access performance. A Lakehouse system can optimize a number of layout decisions, the most obvious being record ordering: which records are clustered together and therefore easiest to read in batches. Delta uses Z-ordering, while Hudi clusters by user-chosen columns.
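Here is a minimal sketch of the min-max-statistics idea using only pyarrow; the file list and query predicate are illustrative, and a real system would keep these statistics in the metadata layer rather than re-reading Parquet footers on every query.

```python
import pyarrow.parquet as pq

def collect_minmax(files, column):
    """Build per-file (min, max) stats for one column from Parquet footers."""
    stats = []
    for path in files:
        md = pq.ParquetFile(path).metadata
        col = md.schema.names.index(column)
        lo, hi = None, None
        for rg in range(md.num_row_groups):
            s = md.row_group(rg).column(col).statistics
            if s is None or not s.has_min_max:
                lo, hi = None, None  # unknown range: this file can't be skipped
                break
            lo = s.min if lo is None else min(lo, s.min)
            hi = s.max if hi is None else max(hi, s.max)
        stats.append((path, lo, hi))
    return stats

def files_to_read(stats, value):
    """Data skipping: keep only files whose [min, max] range may contain value."""
    return [p for (p, lo, hi) in stats if lo is None or lo <= value <= hi]

# e.g. for: SELECT ... WHERE event_date = '2021-01-23'  (illustrative)
# stats = collect_minmax(files, "event_date")
# to_scan = files_to_read(stats, "2021-01-23")
```

Note how the layout optimization feeds the statistics: if the data is clustered on `event_date`, each file covers a narrow range and most files are skipped; with random layout the ranges overlap and skipping buys little.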
The three optimizations above work well together for the typical access patterns of analytics systems. In a typical workload most queries concentrate on a "hot" subset of the data, which the Lakehouse can cache using the same optimized data structures a data warehouse would use, providing the same query performance. For "cold" data in cloud object storage, performance is determined mainly by the amount of data read per query; there, the combination of data layout optimization (clustering co-accessed data together) and auxiliary data structures (such as zone maps, which let the engine quickly determine which ranges of data files to read) allows a Lakehouse system to minimize I/O the way a data warehouse does, despite using a standard open file format rather than a warehouse's built-in format.

#### 3.4 Efficient access for advanced analytics

Advanced analytics libraries are usually written with imperative code rather than SQL, and they need to read large amounts of data. An open question is how to design their data access layers so that the code running on top keeps its flexibility while still benefiting from the Lakehouse's optimizations. Machine learning APIs are evolving rapidly, but some data access APIs (e.g., TensorFlow's tf.data) make no attempt to push query semantics down into the underlying storage system, while others focus on CPU-to-GPU transfer and GPU computation, topics that have received little attention in data warehouses. We need standard ML interfaces that let data scientists take full advantage of the powerful data management features of a Lakehouse (or even a data warehouse), such as transactions, data versioning, and time travel.
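As an illustration of what pushing query semantics into the scan can look like, here is a sketch using pyarrow's dataset API; the table path and column names are illustrative, and this is of course not the standard ML interface the section calls for.

```python
import pyarrow.dataset as ds

# Declarative scan over a Parquet table: column projection and a predicate
# are pushed into the scan, so only the needed bytes leave storage --
# the kind of optimization a DataFrame API can hand to a Lakehouse.
dataset = ds.dataset("s3://my-bucket/events", format="parquet")  # illustrative
scanner = dataset.scanner(
    columns=["features", "label"],
    filter=ds.field("event_date") >= "2021-01-01",
)

# Stream record batches into a training loop instead of materializing
# the whole table in memory.
for batch in scanner.to_batches():
    pass  # e.g. convert each batch to tensors and feed the model
```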
### 4. Research questions and implications

The Lakehouse raises several other research questions, and the industry trend toward data lakes with increasingly rich functionality also has implications for other areas of data systems research.

* **Are there other ways to achieve the Lakehouse's goals?** One can imagine other paths to the main Lakehouse goals, such as building a massively parallel serving layer for a data warehouse that can support the parallel reads of advanced-analytics workloads. But this is more expensive than direct access to the object store, harder to manage, and often slower, and such serving layers have seen limited adoption, e.g., Hive LLAP. Beyond the challenges of performance, availability, cost, and lock-in, there are also important governance reasons why an enterprise may prefer to keep its data in an open format. With regulatory requirements on data management growing, organizations may need to search old datasets on short notice, delete various data, or change their data processing infrastructure, and standardizing on an open format means they will always have direct access to the data. The long-term trend in the software industry has been toward open data formats, and enterprise data is likely to continue that trend.
* **What are the right storage formats and access APIs?** A Lakehouse's access interfaces include the raw storage format, client libraries that read that format directly (for example when reading with TensorFlow), and higher-level SQL interfaces. There are many different ways to distribute rich functionality across these layers: for example, a system could offer more flexible storage by asking readers to execute more sophisticated, "programmable" decoding logic. It remains to be seen which combination of storage format, metadata layer design, and access API works best.
* **How does the Lakehouse affect other data management research and trends?** The popularity of data lakes and the growing use of rich management interfaces over them, whether metadata layers or full Lakehouse designs, have implications for other areas of data management research. Polystores aim to solve the problem of querying data across disparate storage engines; that problem persists in enterprises, but as a growing share of cloud data lake data becomes available in open formats, many polystore queries can be answered directly against cloud object storage, even when the underlying data files are logically part of separate Lakehouses. Data integration and cleaning tools can also be designed around the Lakehouse, with fast parallel access to all the data, opening up new algorithms such as large-scale joins and clustering. HTAP systems could be built as an "add-on" layer in front of a Lakehouse, using its transaction management APIs to archive data directly into the Lakehouse system, which could then query consistent snapshots of that data. Data management for ML would also become simpler and more powerful: today, organizations build ML-specific data versioning and feature-store systems that reimplement standard DBMS functionality, whereas implementing a feature store over a data lake with built-in DBMS-like management could be much easier. Serverless engines and similar DBMS designs will need to integrate with richer metadata layers instead of directly scanning the raw files in the data lake, which can improve query performance. Finally, the Lakehouse simplifies the design of distributed collaboration: because all datasets are directly accessible from the object store, it is easy to share data.

### 5. Conclusion

A unified data platform architecture that implements data warehouse functionality over open data lake file formats can provide performance competitive with today's data warehouse systems while helping to address many of the challenges data warehouse users face. Although restricting a data warehouse's storage layer to standard, directly accessible formats may seem like a major limitation, optimizations such as hot-data caching and cold-data layout optimization allow a Lakehouse to achieve very good performance. Given the massive volumes of data already sitting in data lakes and the opportunity to greatly simplify enterprise data architectures, the industry is likely to converge gradually on the Lakehouse architecture.

Copyright notice
This article was created by [itread01]. Please include a link to the original when reposting. Thanks.
https://chowdera.com/2021/01/20210124021047854M.html
