Introduction | TBase It's Tencent. TEG The data platform team is open source PostgreSQL Based on the research and development of enterprise level distributed HTAP Database system , In the same database cluster, it can provide customers with strong consistent and high concurrency distributed online transaction ability and high performance data online analysis ability . This article is about Tencent TBase Expert engineer Wu Xin in cloud ＋ Community salon online Share and organize , Will bring you Tencent cloud TBase In distributed HTAP The exploration and practice of the field .
One 、TBase Introduction to distributed database
1. TBase development history
Tencent cloud from 2009 We started to try our internal business in 1997 , In the field of enterprise distributed database research process is more experienced . At that time, it was mainly to meet some small needs , For example, introduce PostgreSQL As TDW A supplement to , make up TDW The low performance of small data analysis , The demand for processing is also small .
But as the business grew bigger , Need to have a more efficient online transaction processing capacity , An extension to the database , So we continue to invest in the database development process .
2014 year TBase The first version released began to be used inside Tencent's big data platform ;2015 year TBase Wechat payment merchant cluster online , Supporting more than... Every day 6 Billion deals ;2018 In the year V2 Version to transaction 、 Query optimization and enterprise level functions have been greatly enhanced , Slowly began to face some external customers ;2019 year TBase Won the bid PICC The group's core business , Help them in the domestic insurance industry leading core system , And stable service for a long time .
in consideration of TBase The continuous development of overall capabilities , We hope to put TBase The ability to contribute to the open source community , This can more support database localization project . So in 2019 year 11 month , We open source the database , Hope to help the digital industry upgrade .
2. PostgreSQL Database introduction
TBase It's based on a single machine PostgreSQL A self-developed distributed database , In addition to the ability of perfect relational database , And a lot of enterprise level capabilities . At the same time, the distributed transaction ability is enhanced , And better support for online analysis business , Provide a one-stop solution .
On data security , We have something unique , Security system including separation of powers , Data desensitization 、 The ability to encrypt . Second, data liveliness 、 Flexible configuration of multiple places and centers , It also provides relatively perfect ability , Protect some highly available scenarios in online transactions in the financial industry . For financial insurance and other core business localization to lay a solid foundation .
PostgreSQL It's an open source RDBMS, The open source protocol is based on BSB style , So the source code can be more flexible for you to modify , It can also be commercialized on the basis of modification .
PostgreSQL By the Turing prize winner MichaelStonebraker Leading an open source project , As shown in the figure above , It's been iterating for a long time , It has been published to 12 edition , And it's been iterating all the time , The whole is at a relatively active level .
3. PostgreSQL Development trend
PostgreSQL In the last ten years, people began to pay attention to , First, because of its kernel function , Including the continued activity of the community , Continuous progress has been made in the past few years . The picture above is from DB-Engines The statistics of , According to the data, we can see that in the past year, we all had some retrogression and low growth PostgreSQL It's obvious that .
The Yellow curve in the figure below is PostgreSQL, We can intuitively see that its development trend is relatively good .
Currently open source TBase The version is based on PostgreSQL10, We are also constantly matching PostgreSQL More features , It will be fed back to the open source community , Hope and the whole PostgreSQL Ecology has a good combination and interaction .
Two 、 Open source TBase Positioning and overall architecture
1. Open source TBase The positioning of
Databases are mainly divided into ：OLAP、OLTP and HTAP.
OLAP The business characteristic of the company is the large amount of data , It's usually 10PB+, Sensitive to storage costs . Its concurrency is relative to OLTP It won't be too high , But it can provide better support for complex queries .
OLTP The amount of data is relatively small , Many small and medium-sized systems will not achieve TB The amount of data at level , However, the requirements for transaction and query request are relatively high , Throughput reaches million level TPS above . also OLTP The requirements for disaster recovery capacity are high .
Many of the domestic databases will be from OLAP To cut into the field , from OLTP Angle cutting is relatively difficult , This is still a piece of IBM perhaps Oracle Monopoly is more serious , We hope to realize localization in this area as soon as possible . and TBase Because I have been working in the insurance industry for a long time , stay OLTP There are strong core business capabilities .
The other is HTAP. In most of the previous business deployment , People will put TP and AP Separate , There may be ETL Or the technology of stream duplex will interact with each other . But it is more ideal to support two business types in one system at the same time .
Of course, this is going to be complicated . First of all, you can see that their business characteristics are quite different , The optimization direction of the kernel field is also completely different , In other words, there are big differences in technology .
TBase Introduction HTAP It's also about specific needs , actually TBase Is more inclined to TP, At the same time, the better AP Handling capacity of , Try to be compatible in a system . But if you want to do more extreme performance , It's still about HTAP In isolation , To provide users with a complete service capability .
TBase The angle of view is also derived from the demand . Tencent cloud is the earliest trading system , Later, I added AP The ability to analyze .
The main business scenario needs to be met , The first is that trading data may be larger than 1T, Analytical ability is greater than 5T, The concurrency capability is required to reach 2000 above , Trading peaks per second may reach 1000 ten thousand . In the case of the need to expand capabilities , Need to be able to deal with the original 、 Analytical ability , Or the impact of data redistribution is minimized . At the same time in the transaction level to achieve a complete distributed consistency database .
meanwhile TBase There has also been a lot of enterprise level enhancements , The security guarantee ability of the separation of powers 、 Data governance capabilities 、 The separation of hot and cold data and commercial data .
We introduced TBase Development process of , In the process, we also hope to contribute to the open source community .
In fact, it is difficult to replace the core business in the domestic environment , More from the analysis system , Only in recent years has the system begun to cut into the core transaction capabilities ,TBase I also hope to give back to the community through open source , Make sure you can pass TBase Of HTAP Ability to fill in some gaps , Expanding ecological development .
After open source, we also get more attention and use , It also includes the European Space Agency's Gaia Mission We are using our system to analyze the data of the stellar system in the galaxy , Hope to have more classmates or friends to join in TBase In the development process , I also hope that through this introduction , It's convenient for you to cut into TBase In the interaction of the open source community .
2. TBase Overall architecture
A cluster is made up of these parts ：GTM、Coordinator and Datanode. among GTM Mainly responsible for the control of global affairs , Is the cornerstone of providing distributed consistency protocols ;Coordinator It's the access to the user's business , Analyze and issue the user's request , The specific calculation and data storage are put into Datanode among .
3、 ... and 、HTAP Introduction of ability in aspects of
What we were talking about just now is HTAP, Now let's talk about OLTP,TBase In this part of the ability is more prominent .
If the user needs to have requirements on transaction or concurrent transaction volume , We need a better distributed transaction system . Specific requirements include high performance and low cost , In this part ,TBase Compared with traditional IBM Or foreign more expensive all-in-one machine has greater advantages .
Another requirement is scalability , In the case of node expansion, approximately linear expansion of transaction processing capacity . So how to achieve this goal ？
Briefly introduce the business of MVCC To deal with , stand-alone PostgreSQL It is mainly to maintain a list of current active transactions , It has a structure called Proc array, Equivalent to every user's session If there is a new transaction request , The current active transaction will be recorded in the list of things , When you need to judge tuple The visibility of , Will take one in the active transaction list Snapshot Go with the storage tuple header Recorded in the XID Information for a comparison , To do MVCC Access control .
If it's a distributed extension , A simpler way is to have a central node . According to the previous framework , stay GTM There will be a centralized list of active things , To uniformly allocate for each access request Snapshot.
But there is a big problem at this time , The central node becomes a bottleneck ,GTM There will also be some problems , For example, the size of the snapshot is too large or the network occupation is high .
GTM If you do a centralized node , There's actually a single point of bottleneck , Make sure that every request gets it snapshot The correctness of the , This requires locking the active transaction list , In the case of high concurrency, the lock conflict will be very large . How do we solve this problem ？
This is actually a common problem in the industry . At present, we see solutions for the Internet industry , It's from Google Spanner Derived from the direction of .
Google Spanner It's a global distributed database , The ability to provide consistent database services across continents . Its control concurrency technology features , One is to pass. KV Store multi version concurrency control based on global time , The other is that it is more expensive to use GPS And a globally consistent service timestamp mechanism to provide a TrueTime API, Make a submission protocol based on real time . Because of its global distribution , The average error is about 6 Millisecond or so , The overall transaction delay is relatively high .
in addition , There are more systems to learn from Google Spanner Do a transaction model ,CockRoachDB One of them .
in addition ,Percolator It's also Google For the search engine to provide a more efficient database , Use KV Storage , Based on the global logical timestamp MVCC Concurrent control . Its timestamp is provided by a dedicated timestamp service , The first phase of distributed transaction needs to lock the modified records , Commit phase end lock ; The transaction commit time complexity is O(N),N It's the record number , The performance of the resulting commit will have an impact , Of course, this design is also related to the system requirements .
So let's see TBase Capabilities on distributed transactions , In this part, we have made great improvement on the basis of the previous part .
First of all, we are right GTM The cluster is optimized , From the original overall situation XID Changed to assign global timestamps GlobalTimeStamp（GTS）,GTS It's monotonous , We are based on GTS Designed a new set of MVCC Visibility judgment protocol , Include vacuum Equal mechanism . Such a design can take the submission protocol from GTM The single point bottleneck is put down to each node , Reduce the pressure , At the same time, through the way of timestamp log replication GTM The node is highly available .
Under this Agreement GTM You just need to allocate the global GTS, In this way, the single point of pressure will be solved more obviously . According to our calculation , Teng Xu TS85 The server can probably handle 1200 ten thousand TPS, It can basically meet all distributed pressure and user scenarios .
We also mentioned , stay Percolator Under the realization of , Need to be right Tuple Lock and modify records , In this case, the performance is poor . In fact, we have made an optimization of the submission protocol , In the face of Tuple Header Of GTS The write is delayed , When the transaction is committed, there is no need for each Tuple modify GTS Information , But the GTS The information is stored in the corresponding GTS Store File among , As a safeguard for transaction recovery .
When the user scans the data for the first time , From GTS Store File Get the state of , Write it back to Tuple Header in , After the scan does not need to traverse the state file , In order to achieve accelerated access , Speed up transaction processing . So on the whole , Let the database at the transaction level to ensure a more efficient design .
On the classification of centralized data distribution , There are three kinds of .
The first is to copy the table . Each storage node in the replication table has a complete copy of the data , Suitable for small tables with less changes , It can speed up relational queries .
The second is HASH Distribution , This is a more classic way . In a nutshell, it's just to put the data in a distributed column hash, The data is scattered among the storage nodes , If hash key Wrong choice , Data skew may occur .
The third is based on RANGE The distribution of .RANGE Distribution breaks the data into small pieces according to segments , and hash Compared with the distribution, it's not very strict , Good support for the node elasticity of the upper layer . But when it's calculating , relative hash The effect will not be particularly good .
On the whole ,TBase Replication table and enhanced are selected hash Distribution .
Here's how to look at distributed queries ,PushQuery and PullData.
In the beginning, some early systems may choose a faster implementation , For example, the storage is divided into multiple DN, Then pull the data to CN Calculate .
In this case, the advantages and disadvantages are obvious , The advantage is more efficient and faster , The disadvantage is that CN It's a bottleneck , There's a lot of pressure on the Internet . So we prefer the way on the right in the picture above , Put some data and calculations down to DN On the node .
In the most basic case , I hope all the calculations can be put into DN Come up and do .DN When doing redistribution , Need to follow DN The ability to interact with each other , This is in TBase V2 After that, I did a lot of enhancement , at present TBase The calculation can be dispersed as much as possible DN The nodes come up to .
The picture above shows SQL Shipping and PlanShipping The difference between .
In fact, when dealing with a query Or a query plan , There will be two situations . One is that I put SQL Send to by analysis DN On the implementation ,CN Only responsible for the collection of results . This kind of optimization effect will be better , Because there is no need to establish a distributed and consistent commit protocol in multiple nodes , In addition, it will be more efficient in computing resources . We are OLTP Some optimizations of the domain will take this way .
The other is in OLAP field , More formal PLAN Distributed . stay CN Yes query Make a whole plan, According to the redistribution, the plan is divided into different calculation pieces , Spread to DN Top processing
I just talked about , That is, if you can put the right OLTP Push to a single DN If you come up and do it , The effect will be better . Let's give a simple example .
The distribution columns of the two tables are f1, The data column is f2, if query It can be written in terms of the correlation of distributed keys , And it USES the hash Distribution , You can put the query Push to different DN Come up and do it . Because it's different DN The data between is constrained by the distribution , There's no need to do cross calculations or redistribute data .
The second is equivalent links with distributed keys , At the same time, there is a specific fixed value of a distribution key .
under these circumstances ,CN Can pass f1 To determine which one to push to DN Do it in the middle .
There are more complex queries , For example, there are subqueries , But in a similar way .
A subquery can have a complex situation , If we can judge that there is the same single node distribution as the upper layer in the multi-level subquery ,query It can also be sent to DN in . In this case, to OLTP Performance will have a better impact , The ability of cluster will be better optimized .
For more complicated query, May need to involve the adjustment of optimized configuration .
There are two main ways ： Rule optimization （RBO） And cost optimization （CBO）.RBO Mainly through the rules to determine whether the query plan is in line with or not to optimize , This is some of the earlier implementation methods , Because the amount of calculation is relatively small , So it's more efficient for some scenarios , But the obvious drawback is the lack of flexibility , At the same time, it can't be used in more complex scenes .
In fact, more of these databases use CBO The way . In short ,CBO Will carry on the dynamic plan to all paths , Choose the one with the lowest cost as the execution plan . The advantage of this method is that it has better applicability , It can be optimized for complex scenes , The performance is stable . The disadvantage is the complexity of implementation , It requires certain preconditions , Including statistics 、 Cost calculation model construction, etc .
But it's not absolute either , Neither can “ win ” After the other side's statement , More is the need for a combination of the two .TBase Mainly in the CBO To optimize , For example, in the scenario of calculating some small tables , There is no need for redistribution, direct replication That's all right. .
About distributed Chinese distribution Some of the adjustments , Let's make a simple analogy . Two tables ,TBL_A and TBL_B.
If f1 Is the distribution column , If the distribution class is equivalent, it will become push down, In this case, it can be done in DN We can calculate directly on .
In the middle of the picture above ,TBL_A It's the distribution bond ,TBL_B It's a non distributed bond . under these circumstances , If TBL_B Small enough , You have to deal with TBL_B To redistribute , That is to say TBL_B Conduct replication, This will involve some cost estimates . And if the TBL_B If it's bigger , You may need to TBL_B Conduct redistribution.
We also talked about ,TBase stay OLAP We also have a strong ability in this aspect , This part of the optimization idea is mainly with the help of parallel computing . The full parallelism of computing is mainly reflected in several aspects .
The first is node level parallelism , Because we are distributed databases , So you can have multiple nodes or processes to compute ; The other layer is process level parallelism , at present TBase No change to thread model , So parallelism is mainly embodied in the process level model , be based on PostgreSQL The ability of process parallelism has been enhanced . There is also the parallelism of instruction sets , Optimize instructions , This part will be continuously enhanced later .
that Postgres Of the , Or how the ability of process parallelism is realized ？ In the early days we follow Yes. PG10, Parallelism is not very strong , Only the basic framework and optimization of some operators are provided , This is a TBase One point of optimization at that time .
In a distributed situation , Many single machines can be paralleled , But you can't parallelize in a distributed environment , So we want to enhance these capabilities , To ensure a larger range of parallel capabilities .
Parallel computing is actually a bottom-up approach , For example, a node at the bottom can be parallel , Then if recursion reaches a certain level, it cannot be parallelized , You can add one to all the places below that can be paralleled Gather Node, Collect results from multiple processes , Keep planning up . That's one of the things we want to enhance .
Here are some specific optimizations .
In the early PG Of HashJoin stay outer plan It can be done in parallel , But in inner structure hash table We can't do parallel .
Simply speaking ,hashjoin It can be divided into several steps . First of all build hash table, The second step is to get the part outer plan data , Calculate the hash value , And match . Here we will inner hash table The construction process is also parallelized , Guarantee Hashjoin Both the left and right subtrees of can be parallelized , And continue to push to the upper level node parallelization .
The other is AGG（Aggregation）.
In many cases, it's a two-stage process Agg, Need to be in DN Do some partial agg, Then go to the upper level and plan to do more in sections final agg. In fact, in this case , In the middle redistribute You need to be in DN Data integration , Then do it again final Of Agg.
In the case of multi-level subqueries , Computing at each level will result in a low level of overall parallel computing . So we are redistribution And did some parallel , That is to say Partial Agg In this case, you can follow hash The distribution is sent to the corresponding upper layer DN Parallel on nodes .
There are also some data transfer optimizations .
We mentioned redistributio Nodes can enhance the ability of parallelism , And in the reception and transmission of data also need to be optimized . Early versions were single item processing , The network latency will be higher , The performance will not be very good , So we did some optimization in this part , In order to achieve better parallel execution ability .
Four 、 Open source TBase Introduction to enterprise capability
actually TBase We have also done some enterprise level enhancement , We will continue to do some open source contribution and optimization .
Currently open source Tbase Enterprise level has been able to achieve multi location and multi center , Or the construction of survivability , Including safety 、 management 、 The ability to audit ,TBase There are higher requirements for safety , We will continue to contribute later .
There is also the ability to expand horizontally ,TBase Can be achieved in the user perception is relatively small in the case of expansion . Expansion is a common pain point in the case of large amount of data , We will also have a continuous enhancement in this area .
Besides ,TBase There are also self-developed analysis tables and separation of cold and hot data , Also have better effect , For the reduction of user cost , And the flexibility of data distribution will be improved .
TBase In this year 7 month 13 Open source has released v2.1.0 edition , We also continue to build on open source capabilities , Including the continuous enhancement of the ability to survive 、 Maintenance enhancement , Continuous performance and security upgrades , Including the problems found by some enterprise customers, they will continue to contribute . It also includes statistical information enhancement and optimization of small table redistribution , I hope you can continue to pay attention to TBase, More discussion and discussion with us .
5、 ... and 、Q&A
Q：TBase What are the requirements for the layout of ？
A： You can visit TBase Open source version of , There are specific usage methods and source based build , And the process of building , There are clear documents . If you want to try , You can use normal X86 Server or local Linux The server can do some simple building , On the enterprise level project, we can also make some attempts , Keep in touch .
Q： Why choose based on PostgreSQL Development ？
A： In fact, people will face MySQL and PostgreSQL A choice in two directions , I would like to introduce our choice of PostgreSQL Why .
One is PostgreSQL The agreement will be more friendly , The flexibility of the agreement will be better , You can make changes and complete release of its code at will .
In addition, its kernel implementation is more rigorous , It has its own uniqueness , And it's growing , Including its iteration speed is also relatively fast . We've been following up early on , Continuous merge PostgreSQL Some of feature, Along with it PostgreSQL Growth , Our kernel has also done a relatively fast iteration . At the same time, we are also learning about the kernel , Did some more in-depth tuning .
Q：DN The storage cluster of nodes is based on Raft Do you ？ many Leader Or just Leader Well ？
A： At present, our DN The node is not used Raft agreement , It's a master-slave replication . I know a lot of new businesses will be based on Raft Submission agreement for , Or use this replication protocol for consistency and high availability . But actually Raft There is still some performance impact on the commit protocol with multiple copies of , Compared with the traditional process, the whole process will have longer delay , amount to CAP principle ,C Improved ,A It's going to have a partial impact .
And we're more inclined to OLTP System , Therefore, the transaction requirements and delay response requirements are relatively high , So I made this choice .
Q： Can you elaborate on the implementation process of distributed transactions ？ How to ensure the distributed transaction between multiple computers , Two stage submission ？
A： Now we're basically follw Two-phase commit . One is a two-stage submission control process 、 Control agreement , The other is transaction isolation protocol . I just talked about MVCC, The enhanced version of the commit protocol is basically two-phase commit .
Due to the use of GTM, It is different from the traditional stand-alone mode , Did some unified coordination , Just now I also emphasized . One advantage of this is that it reduces GTM Pressure on , In addition to prepare There will be partial blocking in the phase , But after optimization, the impact is very small , But it can greatly reduce GTM The pressure of the .
Q： How can the underlying storage meet the needs of row storage and column storage at the same time ？ Or according to the block (Tile) Continuous storage ？
A： The underlying storage of our open source version is mainly line storage , It will be listed and HTAP Continue to enhance , Further improve HTAP The ability of . After gradual stabilization , We're going to think about iterative open source versions .
Q： How many servers do you need to build at least ？
A： In fact, single point building is OK . One DN、 One CN、 One GTM It's fine too . Actually, it's better to make it in two DN, You can experience more distributed architecture . In fact, we have more than 1000 nodes in the cluster of enterprise services , Including solving GTM On a single point of pressure , It can improve the overall scalability of the cluster , So from two nodes to multiple nodes, you can try .
Q：GTM It's time service , Yes batch perhaps pipeline Do you ？ And now Tbase Supported read consistency of slave Libraries ？
A： yes , we have .GTM We have also done more optimization for the time service of , In short, you can do some parallel GTS The monotonous timing of , According to the current scale or our estimation of customer scenarios , stay x86 The server can probably reach 1200 ten thousand QPS The ability of timing . In the case of enhanced servers , The overall ability is relatively strong , Basically, there's no bottleneck in this part .
Q：tbase What are the security mechanisms ?
A： From a business perspective , We talked about security isolation , And there are strong row level security rules 、 Column level access control and data encryption 、 Desensitization is enhanced , All of these can be applied to an enterprise database , Now a lot of enterprise service capabilities are also in TBase Inside , We will do further iterations according to the situation .