The data warehouse architecture will gradually fade away, replaced by a new kind of Lakehouse architecture with the following characteristics:
- Based on open data formats, such as Parquet;
- Machine learning and data science supported as first-class citizens;
- Excellent performance.
The Lakehouse can address several major challenges of the data warehouse, such as data staleness, reliability, total cost, closed data formats, and limited scenario support.
2. The evolution of data analytics platforms
The data warehouse collects data from operational databases into a centralized warehouse to help business leaders gain analytical insights, which are then used for decision support and business intelligence (BI). The warehouse ingests data with schema-on-write, optimizing it for downstream consumers. This was the first generation of data analytics platforms.
Gradually, first-generation systems began to face several challenges. First, the coupling of compute and storage made scaling expensive. Second, more and more datasets were unstructured, such as video, audio, and text documents, which data warehouses could neither store nor query.
To solve these problems, the second-generation data analytics platform was introduced. It imports all raw data into a data lake: a low-cost storage system with a file API that holds data in generic, usually open file formats such as Apache Parquet and ORC; low-cost storage could be built on HDFS. The data lake is a schema-on-read architecture, offering the flexibility to store any data at low cost.
A small subset of the data in this architecture is then ETL'd to a downstream data warehouse to serve the most important decision-support and BI applications.
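The schema-on-read pattern above can be illustrated with a small sketch. It uses JSON strings in place of real lake files, and the column names are made up for illustration: raw records land in the lake exactly as produced, and each consumer applies its own schema only at query time.

```python
import json

# Raw records land in the lake as-is (schema-on-read): nothing is
# enforced at write time, so heterogeneous records coexist.
raw_records = [
    '{"user": "a", "clicks": 3, "referrer": "news"}',
    '{"user": "b", "clicks": "7"}',          # clicks arrived as a string
    '{"user": "c", "purchase": 19.99}',      # a different shape entirely
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: project and coerce each record."""
    for line in lines:
        rec = json.loads(line)
        row = {}
        for col, typ in schema.items():
            value = rec.get(col)             # missing columns become None
            row[col] = typ(value) if value is not None else None
        yield row

# Two consumers can read the same raw data with different schemas.
clicks_schema = {"user": str, "clicks": int}
rows = list(read_with_schema(raw_records, clicks_schema))
print(rows[0])   # {'user': 'a', 'clicks': 3}
print(rows[1])   # {'user': 'b', 'clicks': 7} -- coerced at read time
```

The flexibility is also the weakness the rest of this article discusses: nothing stopped the malformed or mismatched records from being written in the first place.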
From 2015 onward, cloud data lakes such as S3, ADLS, GCS, and OSS began to replace HDFS. The architecture on the cloud is essentially the same as in second-generation systems, with cloud data warehouses such as Redshift, Snowflake, and ADB on top. This two-tier data lake + data warehouse architecture now dominates the industry (it is used by almost all Fortune 500 enterprises). But it also faces challenges. Although separating storage (e.g. S3) from compute (e.g. Redshift) makes the cloud lake-and-warehouse architecture nominally cheap, the two-tier architecture is very complex for users. In first-generation platforms all data was ETL'd directly from operational data systems into the warehouse; in this architecture, data is first ETL'd into the data lake and then ELT'd into the warehouse, introducing extra complexity, delay, and failure modes. Moreover, enterprise use cases now include advanced analytics such as machine learning, which neither data lakes nor warehouses support well. Concretely, current data architectures typically suffer from the following four problems:
- Reliability. Keeping the data lake and the data warehouse consistent is difficult and costly. The ETL jobs between the two systems must be designed carefully, and every ETL step carries a risk of failure or of introducing errors, for example degraded data quality caused by subtle differences between the lake and warehouse engines.
- Data staleness. The data in the warehouse is stale compared with that in the lake; new data typically takes days to load. This is a regression from first-generation systems, where new operational data was immediately available for queries.
- Limited support for advanced analytics. Enterprises want to use their data for prediction, but machine learning systems such as TensorFlow, PyTorch, and XGBoost do not work well on top of data warehouses. Unlike BI queries, which extract small amounts of data, these systems process large datasets with complex non-SQL code; reading that data over ODBC/JDBC is inefficient, and they cannot directly access the warehouse's proprietary internal format. For such cases the usual advice is to export the data to files, which adds yet more complexity and staleness (a third ETL step!). Alternatively, users can run these systems against data in open formats, but they then lose the rich management features of the warehouse, such as ACID transactions, data versioning, and indexing.
- Total cost. Besides paying for ETL jobs, users pay double the storage cost for data copied into the warehouse, and the proprietary internal formats of commercial warehouses raise the cost of migrating data or workloads to other systems.
One widely discussed alternative is to forgo the data lake entirely and store all data in a warehouse with built-in separation of compute and storage. But this has limited feasibility, because it does not support managing video/audio/text data or direct access from ML and data science workloads.
As more and more business applications come to rely on operational data and advanced analytics, the Lakehouse architecture, which can eliminate some of the main challenges of the data warehouse, has arrived at the right moment.
The Lakehouse offers solutions to the following key problems:
- Reliable data management on the data lake: a Lakehouse needs to store raw data while also supporting ETL/ELT processes that improve its quality for analysis. Traditional data lakes manage semi-structured data as "just a bunch of files" and struggle to provide the key management features that simplify ETL/ELT in a data warehouse, such as transactions, version rollback, and zero-copy cloning. Newer data lake frameworks (such as Delta, Hudi, and Iceberg) provide a transactional view of the data lake along with management features, requiring fewer ETL steps and letting analysts query raw tables efficiently, much as in first-generation analytics platforms.
- Support for machine learning and data science: ML systems can read data lake formats directly, and many ML systems use DataFrames as their abstraction for manipulating data. A declarative DataFrame API enables query optimization for data access in ML workloads, which can thus benefit directly from the many optimizations in a Lakehouse.
- SQL performance: a Lakehouse needs good SQL performance on massive Parquet/ORC datasets, whereas classic data warehouses optimize SQL more thoroughly (including through proprietary storage formats). A Lakehouse can use a variety of techniques to maintain auxiliary data for Parquet/ORC datasets and to optimize the data layout within these existing formats to achieve better performance.
Current industry trends show that customers are dissatisfied with the two-tier data lake + data warehouse architecture. First, in recent years almost all data warehouses have added support for external tables in Parquet and ORC formats, which lets warehouse users query the data lake with the same SQL engine (through connectors). But this does not make lake tables easier to manage, nor does it eliminate the ETL complexity, staleness, and advanced-analytics challenges of warehouse data. In practice these connectors usually perform poorly, because SQL engines are optimized mostly for their own internal formats. These analytics engines alone cannot solve all the problems of the data lake or replace the warehouse: the lake still lacks basic management features (e.g. ACID transactions) and efficient access methods (e.g. indexes that match data warehouse performance).
3. Lakehouse architecture
A Lakehouse can be defined as a data management system based on low-cost, directly accessible storage that also provides the management and performance features of a traditional analytical DBMS, such as ACID transactions, data versioning, auditing, indexing, caching, and query optimization. The Lakehouse thus combines the main advantages of the data lake and the data warehouse: the former's low-cost, open-format storage accessible by a variety of systems, and the latter's powerful management and optimization features. The core question is whether these advantages can be combined effectively: in particular, the Lakehouse's support for direct access means giving up some aspects of data independence, which has long been a cornerstone of relational DBMS design.
A Lakehouse is especially suited to cloud environments with separate compute and storage: different computing applications can run on demand on completely independent compute nodes (e.g. a GPU cluster for ML) while directly accessing the same stored data. A Lakehouse can also be implemented on an on-premises storage system such as HDFS.
3.1 Implementing a Lakehouse system
The first key idea for implementing a Lakehouse is to store data in low-cost object storage (e.g. Amazon S3) using a standard file format (such as Apache Parquet), and to implement a metadata layer on top of the object store that defines which objects belong to which table version. This lets the system offer management features such as ACID transactions and versioning while keeping the bulk of the data in low-cost object storage, and lets clients read objects directly from the store using standard file formats. The metadata layer adds management capability, but it is not enough on its own to achieve good SQL performance. Data warehouses use a variety of techniques for performance, such as keeping hot data on SSDs, maintaining statistics, building efficient access methods (e.g. indexes), and co-optimizing the data format with the compute engine. In a Lakehouse built on existing storage formats the format itself cannot be changed, but other optimizations can still be implemented while leaving the data files unchanged, including caching, auxiliary data structures (such as indexes and statistics), and data layout optimizations.
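The manifest idea above can be sketched in a few lines. This is a minimal, single-writer toy, with a local directory standing in for the object store and `os.rename` (atomic on POSIX when source and target share a filesystem) standing in for an atomic publish; real systems such as Delta Lake, Iceberg, and Hudi are far more elaborate, and all names here are made up.

```python
import json, os, tempfile

def manifest_path(table_dir, version):
    return os.path.join(table_dir, "_manifests", f"v{version}.json")

def current_version(table_dir):
    mdir = os.path.join(table_dir, "_manifests")
    names = os.listdir(mdir) if os.path.isdir(mdir) else []
    versions = [int(n[1:-5]) for n in names
                if n.startswith("v") and n.endswith(".json")]
    return max(versions, default=0)

def commit(table_dir, data_files):
    """Atomically publish a new table version listing `data_files`."""
    mdir = os.path.join(table_dir, "_manifests")
    os.makedirs(mdir, exist_ok=True)
    version = current_version(table_dir) + 1
    # Write to a temp file first, then rename: readers never observe a
    # half-written manifest, which is what makes the commit atomic.
    fd, tmp = tempfile.mkstemp(dir=mdir)
    with os.fdopen(fd, "w") as f:
        json.dump({"version": version, "files": data_files}, f)
    os.rename(tmp, manifest_path(table_dir, version))
    return version

def snapshot(table_dir):
    """Read the file list of the latest committed table version."""
    with open(manifest_path(table_dir, current_version(table_dir))) as f:
        return json.load(f)["files"]

table = tempfile.mkdtemp()
commit(table, ["part-000.parquet"])
commit(table, ["part-000.parquet", "part-001.parquet"])
print(snapshot(table))   # both files: the latest committed version
```

Note that the data files themselves are never rewritten; a "commit" only publishes a new list of immutable objects, which is what lets clients keep reading Parquet directly from the store.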
A Lakehouse can both speed up advanced-analytics workloads and give them better data management. Many ML libraries (e.g. TensorFlow and Spark MLlib) can already read data lake file formats such as Parquet. The easiest way to integrate them with a Lakehouse is therefore to query the metadata layer to determine which Parquet files belong to a table, and then pass those files to the ML library.
3.2 A metadata layer for data management
The first component of a Lakehouse is the metadata layer, which enables ACID transactions and other management features. Data lake storage systems such as S3 or HDFS provide only a low-level object-store or filesystem interface in which even simple operations (such as updating a table that spans multiple files) are not atomic. This problem led some organizations to design richer data management layers, starting with Apache Hive ACID, which uses an OLTP DBMS to track which data files are part of a Hive table at a given table version and allows operations to update that set transactionally. In recent years, new systems have provided more features and better scalability. In 2016 Databricks developed Delta Lake, which stores the information about which objects are part of a table in the data lake itself, as a transaction log in Parquet format, making it possible to scale to billions of objects per table. Netflix's Apache Iceberg uses a similar design and supports both Parquet and ORC storage. Apache Hudi, which began at Uber, is also similar, although it does not support concurrent writers (support is being added); the system focuses on simplifying streaming ingestion into the data lake.
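The log-structured approach these systems share can be sketched as follows. This is a toy, in-memory caricature of a Delta-style transaction log: each commit appends "add"/"remove" file actions, and any table version is reconstructed by replaying a prefix of the log, which is also what makes time travel cheap. Real logs carry much more metadata (schemas, statistics, checkpoints).

```python
log = []   # list of commits; each commit is a list of (action, path) pairs

def commit(actions):
    """Append one atomic commit to the table's transaction log."""
    log.append(actions)

def table_state(as_of=None):
    """Replay the log (optionally only up to version `as_of`) to get
    the set of data files that are live in that table version."""
    files = set()
    for commit_actions in log[:as_of]:
        for action, path in commit_actions:
            if action == "add":
                files.add(path)
            else:                      # "remove", e.g. from a compaction
                files.discard(path)
    return sorted(files)

commit([("add", "part-000.parquet")])
commit([("add", "part-001.parquet")])
# Compaction: two small files rewritten into one, in a single atomic commit.
commit([("remove", "part-000.parquet"), ("remove", "part-001.parquet"),
        ("add", "part-002.parquet")])

print(table_state())          # ['part-002.parquet']
print(table_state(as_of=2))   # time travel: the two original files
```

Because old data files are only logically removed, readers pinned to an earlier version keep working, and rollback is just reading at an older log prefix.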
The experience of these systems shows that they can provide performance similar to or better than raw Parquet/ORC data lakes while adding very useful management features such as transactions, zero-copy cloning, and time travel. The metadata layer also matters for data quality: for example, it can enforce schema checks so that bad data does not degrade quality. It can further implement governance features such as access control and audit logging: before granting a client credentials to read a table's raw data from the cloud object store, the metadata layer can check whether the client is allowed to access the table, and it can record all access.
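The schema-enforcement idea can be sketched as a pre-commit gate. This is an illustrative toy, not any real system's API: the table schema, column names, and exception type are all made up, and the check simply rejects a batch of rows whose columns or types do not match before the write is allowed to reach the table.

```python
# The schema the metadata layer records for the table (illustrative).
TABLE_SCHEMA = {"user_id": int, "amount": float, "country": str}

class SchemaViolation(Exception):
    pass

def validate(rows, schema=TABLE_SCHEMA):
    """Raise if any row is missing a column or has a wrong type."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise SchemaViolation(
                f"row {i}: columns {sorted(row)} != {sorted(schema)}")
        for col, typ in schema.items():
            if not isinstance(row[col], typ):
                raise SchemaViolation(
                    f"row {i}: {col} is {type(row[col]).__name__}, "
                    f"expected {typ.__name__}")
    return True

good = [{"user_id": 1, "amount": 9.5, "country": "DE"}]
bad  = [{"user_id": "1", "amount": 9.5, "country": "DE"}]  # user_id is a str

assert validate(good)
try:
    validate(bad)            # rejected before it ever reaches the table
except SchemaViolation as e:
    print("rejected:", e)
```

In a schema-on-read lake without this gate, the bad batch would land silently and only surface downstream as degraded query results.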
Future directions and alternative designs. Because data lake metadata layers are so new, there are many open questions and alternative designs. For example, Delta Lake is designed to store its transaction log in the same object store it manages (e.g. S3) to simplify operations (eliminating the need to run a separate storage system) and to provide high availability and high bandwidth; but the high latency of object storage limits the transaction rate it can support per second, and in some cases designs that keep metadata in a faster storage system may be preferable. Similarly, Delta Lake, Iceberg, and Hudi support only single-table transactions, but they could be extended to support cross-table transactions. Optimizing the format of the transaction log and the size of managed objects are also open problems.
3.3 SQL performance in a Lakehouse
Perhaps the biggest technical question for the Lakehouse approach is how to provide state-of-the-art SQL performance while giving up a large part of the data independence of traditional DBMS designs. There are many possible answers: for example, adding a caching layer on top of the object store, or departing from existing standard storage formats for data objects (e.g. Parquet and ORC, though new designs keep emerging to improve on these formats). Whatever the design, the core challenge is that the data storage format becomes part of the system's public API in order to allow fast direct access, which is unlike a traditional DBMS.
We propose several techniques for optimizing SQL performance in a Lakehouse that are independent of the data format, so they can be used with existing formats or future ones. These format-independent optimizations are roughly as follows:
- Caching: when a metadata layer is used, a Lakehouse system can safely cache files from cloud object storage on faster storage devices (e.g. SSDs and RAM) on the processing nodes, and a running transaction can determine whether a cached file is still valid to read. The cache can also transcode data into a form that is more efficient for the query engine; for example, the Databricks cache partially decompresses the Parquet data it loads.
- Auxiliary data: even though a Lakehouse must expose an open table storage format (such as Parquet) for direct I/O access, it can maintain additional data that helps optimize queries, such as min-max statistics for each column of each Parquet data file in the table, which enable data skipping, and Bloom-filter-based indexes. A variety of auxiliary data structures can be implemented, in effect indexing the "raw" data.
- Data layout: data layout plays an important role in access performance. A Lakehouse system can optimize several layout decisions, the most obvious being record ordering: which records are clustered together and therefore easiest to read in batches. Delta uses Z-ordering, and Hudi supports clustering by user-chosen columns.
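The auxiliary-data and layout ideas above compose naturally, as the following sketch shows: records are Z-ordered on two columns, split into "files", and per-file min/max statistics (zone maps) then let a query skip files on either column. The file size and column names are arbitrary choices for illustration.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y (a Morton / Z-order code)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# A toy table with two columns, written out in Z-order.
records = [(a, b) for a in range(16) for b in range(16)]
records.sort(key=lambda r: z_value(r[0], r[1]))

# Split the sorted records into fixed-size "files" and build a zone map
# (per-file min/max statistics) for each column.
FILE_SIZE = 32
files = [records[i:i + FILE_SIZE] for i in range(0, len(records), FILE_SIZE)]
zone_maps = [{"a": (min(r[0] for r in f), max(r[0] for r in f)),
              "b": (min(r[1] for r in f), max(r[1] for r in f))}
             for f in files]

def files_to_read(col, lo, hi):
    """Skip every file whose [min, max] range cannot intersect [lo, hi]."""
    return [i for i, zm in enumerate(zone_maps)
            if not (zm[col][1] < lo or zm[col][0] > hi)]

# Because the records are Z-ordered, range predicates on *either* column
# prune files; sorting on a single column would only help that column.
print(files_to_read("a", 3, 4))   # [0, 1, 4, 5] -- 4 of 8 files
print(files_to_read("b", 0, 3))   # [0, 2]      -- 2 of 8 files
```

Nothing here touches the file format itself: the zone maps live beside the data, which is why such optimizations remain available even when the stored files must stay standard Parquet.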
These three optimizations work well together for the access patterns typical of analytical systems. In a typical workload, most queries concentrate on a "hot" subset of the data, which a Lakehouse can cache using the same optimized data structures as a data warehouse, providing comparable query performance. For "cold" data in cloud object storage, performance is determined mainly by the amount of data read per query; there, the combination of data layout optimizations (clustering co-accessed data) and auxiliary data structures (such as zone maps, which let the engine quickly determine which ranges of data files to read) can let a Lakehouse system minimize I/O as a warehouse does, despite using a standard open file format rather than a warehouse's built-in format.
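The hot-data caching point can be sketched as follows. Because the metadata layer names the exact, immutable objects of each table version, files fetched from the object store can be cached locally and reused for as long as a version's manifest lists them. Here two dicts stand in for the object store and the local SSD/RAM cache, and all names are illustrative.

```python
object_store = {"part-000.parquet": b"aaaa",   # the remote source of truth
                "part-001.parquet": b"bbbb"}
local_cache = {}                               # the fast local copy
remote_fetches = 0                             # counts slow remote reads

def read_file(path):
    """Read an immutable data object through the local cache."""
    global remote_fetches
    if path not in local_cache:        # cold: fetch from the object store
        remote_fetches += 1
        local_cache[path] = object_store[path]
    return local_cache[path]           # hot: served locally

def scan(manifest_files):
    """Scan a table version: every listed file is read through the cache."""
    return b"".join(read_file(p) for p in manifest_files)

v1 = ["part-000.parquet", "part-001.parquet"]
scan(v1)
scan(v1)                   # the second scan is served entirely from cache
print(remote_fetches)      # 2: each object was fetched only once
```

Caching by immutable object name is what makes this safe: a new table version publishes new objects rather than mutating old ones, so stale entries are simply never referenced again.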
3.4 Efficient access for advanced analytics
Advanced analytics libraries are usually written not in SQL but in code that needs access to large amounts of data. The question is how to design the data access layer so that the code running on top keeps maximum flexibility while still benefiting from Lakehouse optimizations.
Machine learning APIs are evolving rapidly, but some data access APIs (e.g. TensorFlow's tf.data) make no attempt to push query semantics down into the underlying storage system, and some focus on CPU-to-GPU transfer and GPU computation, topics that have received little attention in data warehouses.
We need standard ML interfaces that let data scientists take full advantage of the powerful data management features of a Lakehouse (or even a data warehouse), such as transactions, data versioning, and time travel.
4. Research questions and implications
The Lakehouse also raises some other research questions, and the industry trend toward data lakes with increasingly rich functionality has implications for other areas of data systems research.
- Are there other ways to achieve the Lakehouse goals? One can imagine other routes to the main goals of the Lakehouse, such as building a massively parallel serving layer for a data warehouse that supports parallel reads from advanced-analytics workloads. But this would cost more than direct workload access to the object store, be harder to manage, and likely perform worse. Such serving layers, e.g. Hive LLAP, have not been widely adopted. Beyond the challenges of performance, availability, cost, and lock-in, there are also important governance reasons: enterprises may prefer to keep their data in an open format. With growing regulatory requirements around data management, organizations may need to search old datasets on short notice, delete various data, or change their data processing infrastructure, and standardizing on an open format means they will always have direct access to their data. The long-term trend of the software industry has been toward open data formats, and enterprise data should continue that trend.
- What are the right storage format and access APIs? A Lakehouse's access interfaces include the raw storage format, client libraries that read this format directly (for example when reading with TensorFlow), and the high-level SQL interface. There are many different ways to distribute rich functionality across these layers; for example, by asking readers to execute more sophisticated "programmable" decoding logic, a system could offer more flexible storage schemes. It remains to be seen which combination of storage format, metadata layer design, and access APIs works best.
- How does the Lakehouse affect other data management research and trends? The popularity of data lakes and the growing use of rich management interfaces over them, whether metadata layers or full Lakehouse designs, have implications for several other areas of data management research. Polystores aim to solve the problem of querying data across disparate storage engines; that problem persists in enterprises, but in cloud data lakes a growing fraction of data is available in open formats, so many polystore queries could also run directly against cloud object storage, even when the underlying data files logically belong to separate Lakehouses. Data integration and cleaning tools can also be built on top of a Lakehouse, with fast parallel access to all the data, opening up new algorithms such as large-scale joins and clustering. HTAP systems could be built as an "add-on" layer in front of a Lakehouse, using its transaction management APIs to archive data directly into the Lakehouse, which would then be able to query a consistent snapshot of the data. ML data management will also become simpler and more powerful: today organizations build ML-specific data versioning and feature-store systems that reimplement standard DBMS functionality, and it may be easier to implement feature stores on a data lake with built-in DBMS-like management features. Serverless and similar DBMS designs will need to integrate with richer metadata layers rather than directly scanning raw files in the data lake, which can improve query performance. Finally, the Lakehouse design suits distributed collaboration: since all datasets are directly accessible from the object store, it is easy to share data.
A unified data platform architecture that implements data warehouse functionality over open data lake file formats can offer performance competitive with today's data warehouse systems while helping to address many of the challenges data warehouse users face. Although restricting a warehouse's storage layer to directly accessible standard formats may seem a major limitation, optimizations such as hot-data caching and cold-data layout optimization allow a Lakehouse to achieve good performance. Given the vast amount of data already in data lakes and the opportunity to greatly simplify enterprise data architectures, the industry is likely to transition to the Lakehouse architecture.