01 Challenges Facing Log Services
As Sino-US friction escalates and open source culture rises in China, major Internet companies and leading enterprises across industries are moving toward an open source, secure, independent, and controllable development path. Building logging infrastructure on the open source engines Kafka and ElasticSearch has become an industry consensus. Such infrastructure is expected to provide:
- Log collection: collection of server-side, client, Web, and database logs;
- Log ETL: real-time log ETL, ETL pipeline monitoring, and ETL pipeline quality measurement;
- Log retrieval: full-text search and log context restoration;
- Log analysis: ad hoc log OLAP.
As log traffic and the number of log tasks keep growing, the problems of log timeliness, operations friendliness, service stability, and data security become very tricky. For example:
1) Challenges in the log collection stage
- Need to support physical machines, virtual machines, and containerized scenarios, with log collection at service granularity and elastic scale-out and scale-in;
- Need to support monitoring, operations, and multi-version management for a massive fleet of hundreds of thousands of Agents;
- Need to support a shared, multi-tenant, hierarchical security model;
- Need to provide rich task-level metrics, plus fault diagnosis and self-healing capabilities.
2) Challenges in the log ETL stage
- ETL semantics should be simple and clear to express and decoupled from the underlying infrastructure; there is strong demand for SQL as the expression language;
- An ETL pipeline involves many stages, each with its own metrics system and inconsistent definitions, so locating and troubleshooting problems is very costly;
- An ETL pipeline involves both log storage and log computing, and end-to-end elastic scale-out and scale-in within a Quota is technically challenging.
3) Challenges in log storage
- Kafka disk I/O hot spots can cause an avalanche of production and consumption across the cluster;
- Topic resource isolation is poor: a sudden traffic surge or backtracking consumption affects cluster stability;
- With many Kafka clusters and topics, a platform is needed to fill the capability gaps of the community Kafka-Manager.
4) Challenges in log retrieval
- ElasticSearch is constrained by a metadata bottleneck: the number of shards in a cluster cannot grow beyond the hundreds-of-thousands level, so scalability needs to be addressed;
- ElasticSearch lacks a multi-tenant and query isolation system for cluster resources, which is the biggest threat to stability;
- ElasticSearch lacks an end-to-end, multi-dimensional monitoring system, and its operations support is insufficient, so operations friendliness needs to be addressed.
5) Challenges in log analysis
- Ad hoc query and analysis over hundreds of millions of detail records;
- Support for high-precision deduplication over dimensions with cardinality in the hundreds of millions;
- An end-to-end, multi-dimensional monitoring system is missing and operations support is insufficient, so operations friendliness needs to be addressed.
02 The Didi Logi Log Service Suite
With the digital transformation of enterprises, the migration of entire business processes to the cloud, and the rapid development of microservices, containerization, and related technologies, businesses have three pressing needs for a stable, easy-to-use logging infrastructure:
- Service assurance needs: full-link tracing is an important guarantee of stability;
- Business operations needs: A/B testing, campaign operation analysis, end-user behavior analysis, and precision marketing create strong demand for second-level log storage at MB/s rates and second-level retrieval over TB-scale logs;
- Business security needs: identifying attack sources and stopping asset losses, together with security auditing and traceability, require ad hoc analysis of TB-scale logs.
The Didi Logi log service suite has been refined over 7 years at Didi. For every stage of the pipeline (log collection, log storage, log computing, log retrieval, and log analysis), it builds component capabilities as PaaS and applies targeted optimizations to engine stability and scalability. The structure is as follows:
It has the following advantages:
- Open source, independent, and controllable: the PaaS suites Logi-Agent, Logi-LogX, Logi-KafkaManager, and Logi-ElasticSearchManager are all planned for open source release;
- Stable and reliable engines: Agent delivers 40 MB/s single-task collection performance with controllable resource isolation; LogX tasks achieve second-level real-time ETL latency with heavily optimized computing performance; Didi Kafka carries real-time traffic at the 100 GB/s level; Didi ElasticSearch runs tens of PB of index storage with 99.95% cluster stability;
- Accumulated service operations experience: hundreds of thousands of log service tasks rely on end-to-end, full-link guarantees of log data timeliness, integrity, observability, and operations friendliness; elastic resource scheduling and hierarchical support capabilities have been productized;
- Professional, easy-to-use platform: end-to-end self-service onboarding of the whole log pipeline completes within minutes; SQL templates plus UDFs support customized cleaning; retrieval over hundreds of TB of data returns within seconds.
Logi-Agent aims to be an enterprise-grade data collection platform, responsible for collecting the company's multi-terminal, multi-form data. The structure is as follows:
Didi Logi-Agent runs online at a scale of 100,000 deployed nodes, with 130 GB/s of log collection volume, 20,000+ log collection tasks, and a maximum single-task collection capability of 40 MB/s.
High-frequency scenarios from the perspectives of users and R&D are turned into PaaS capabilities, improving operations friendliness, engine observability, and user convenience. The project is open source at https://github.com/didi/kafka... and has 500+ free users. Demo address: http://220.127.116.11:8080/ , account and password: admin/admin
Didi Kafka runs at a cluster scale of 500+ with 60 GB/s of traffic, and brings experience from shared, multi-tenant, large-cluster scenarios (peak CPU utilization 30%, disk 50%) with an SLA commitment of 99.95%. Based on engine version 2.5, it adds 40+ feature enhancements. Disk overload protection, dynamic partition migration, and business thread isolation are Didi-specific features and the key to stability.
The LogX service uses MB/s as its Quota unit and StreamingSQL + UDF as its ETL expression vehicle. It supports dynamic scale-out and scale-in in Quota units and, on a per-task basis, builds an end-to-end metrics system for pipeline performance, timeliness, and integrity.
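Quota-based scaling of this kind can be sketched as follows. This is only an illustration under stated assumptions: the 10 MB/s unit size and the function names `quota_units_needed` and `scaling_delta` are hypothetical and are not taken from LogX.

```python
import math


def quota_units_needed(traffic_mb_s: float, unit_mb_s: float = 10.0) -> int:
    """Return how many quota units are needed to carry the given traffic.

    unit_mb_s is a hypothetical quota unit size; the article does not
    state the real unit size used by LogX.
    """
    if traffic_mb_s < 0 or unit_mb_s <= 0:
        raise ValueError("traffic must be >= 0 and unit size must be > 0")
    # Round up so that a partially used unit is still provisioned.
    return math.ceil(traffic_mb_s / unit_mb_s)


def scaling_delta(current_units: int, traffic_mb_s: float,
                  unit_mb_s: float = 10.0) -> int:
    """Positive result: units to add. Negative result: units to release."""
    return quota_units_needed(traffic_mb_s, unit_mb_s) - current_units
```

For example, under these assumptions a task observed at 45 MB/s that currently holds 3 units would need 2 more, while the same task holding 10 units could release 5.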
Didi runs 20,000+ StreamingSQL ETL tasks, with a maximum single-task traffic of 500 MB/s, a 90th-percentile end-to-end ETL delay of less than 2 minutes, and minute-level dynamic scale-out and scale-in.
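A 90th-percentile delay target like this one can be checked with a few lines of code. The sketch below uses the nearest-rank percentile definition, which is an assumption (the article does not say how the quantile is computed), and the function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest rank covering at least p percent of samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_delay_target(delays_s, target_s=120.0):
    """True if the 90th-percentile end-to-end delay is below target_s seconds."""
    return percentile_nearest_rank(delays_s, 90) < target_s
```

A single slow sample does not break the target here: with ten samples, only the ninth-smallest delay is compared against the 2-minute threshold.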
The most professional ElasticSearch-Manager in the industry: high-frequency scenarios from the perspectives of users and R&D are turned into PaaS capabilities, accumulating fully managed features for the index service.
It provides capacity planning based on index templates, raising cluster disk utilization from 30% to 65%. Open source release is in preparation.
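The article does not describe how the capacity planning works, so the following is only a rough sketch of what template-based planning could compute. The `IndexTemplate` fields, the 50 GB target shard size, and all function names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexTemplate:
    """Hypothetical description of the indexes created from one template."""
    name: str
    daily_primary_gb: float  # primary data written per day
    retention_days: int      # how long daily indexes are kept
    replicas: int = 1        # replica copies per primary shard


def required_disk_gb(t: IndexTemplate) -> float:
    """Disk needed at steady state: daily volume x retention x copies."""
    return t.daily_primary_gb * t.retention_days * (1 + t.replicas)


def primary_shards_per_day(t: IndexTemplate, target_shard_gb: float = 50.0) -> int:
    """Primary shards for one daily index, assuming a target shard size."""
    return max(1, math.ceil(t.daily_primary_gb / target_shard_gb))


def cluster_disk_utilization(templates, cluster_disk_gb: float) -> float:
    """Fraction of cluster disk the given templates would occupy."""
    return sum(required_disk_gb(t) for t in templates) / cluster_disk_gb
```

Planning from templates rather than from individual indexes lets the planner forecast disk and shard needs before the daily indexes exist, which is one plausible way to run clusters at higher utilization.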
The self-developed ElasticSearch-GateWay provides cross-cluster access, multi-version compatibility, tenant definition and security, DSL audit and analysis, and more. It supports 5 billion data reads per day and 12 million data writes per second at Didi, and is the cornerstone component for smooth ES engine upgrades (2.3.3 -> 6.6.1 -> 7.6.1).
Didi ElasticSearch runs at a cluster scale of 3500+ with 8 PB of storage, and brings experience from shared multi-tenant cluster scenarios (1000+ instances, 600,000 shards, peak CPU utilization 45%, disk 60%).
The SLA commitment is 99.95%. Based on engine version 7.6.1, it adds 150+ feature enhancements, and write performance is 2 times that of the community version.
FastIndex builds 50 TB of indexes in 1 hour and is open source (https://github.com/didi/ES-Fastloader).
The self-developed DCDR provides cross-cluster high availability for indexes and gives 50+ core online search scenarios multi-site active-active capability. In total, Didi has contributed 30+ PRs to the ES community.
03 Didi Logi Application Cases
Didi Logi serves many scenarios inside Didi, with in-depth practice in fault location, log analysis, log services, business operations, security audit, log assets, and log dashboards.
For reasons of space, the following focuses on the log service LogInsight and the business operations product Magic Mirror to show the business value that Didi Logi can generate.
LogInsight, built on Didi Logi, is primarily a cloud log storage solution. For log storage and analysis needs after moving to the cloud and to containers, it provides log cold backup, resource management, log retrieval, and other capabilities.
- Significantly reduced log usage and storage costs: fully managed, elastically scalable, and operations-free; cold backup storage costs about 0.02 yuan/GB/month, significantly reducing storage overhead; custom retention of 1 to 365 days is supported;
- Fast problem discovery and location, improving business stability: statistical analysis of interface performance and error logs based on big data stream computing, providing interface call relationships, topology, upstream and downstream traffic analysis, service error location, error clustering, and other functions;
- Safe and reliable: availability is no less than 99.9%; it handles hundreds of TB of logs with real-time data collection and data landing within minutes; logs are stored without loss to meet log audit needs.
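To make the cold backup price concrete, the sketch below estimates a steady-state monthly bill from the quoted rate of about 0.02 yuan/GB/month and the 1 to 365 day retention range. The billing model (price applied to the volume held at steady state) and the function name are assumptions, not LogInsight's actual billing rules.

```python
PRICE_YUAN_PER_GB_MONTH = 0.02  # approximate rate quoted in the article


def steady_state_monthly_cost(daily_gb: float, retention_days: int) -> float:
    """Estimate the monthly cold backup cost in yuan.

    Assumes the stored volume at steady state is daily_gb * retention_days
    and that the quoted price applies to that whole volume each month.
    """
    if not 1 <= retention_days <= 365:
        raise ValueError("retention_days must be between 1 and 365")
    if daily_gb < 0:
        raise ValueError("daily_gb must be >= 0")
    stored_gb = daily_gb * retention_days
    return stored_gb * PRICE_YUAN_PER_GB_MONTH
```

Under these assumptions, 100 GB of logs per day kept for 30 days holds 3,000 GB at steady state and costs about 60 yuan per month.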
Magic Mirror
Magic Mirror is a professional, scenario-based intelligent analysis platform for user behavior. It provides a full-process solution that runs from data collection, storage, computing, and analysis through to operations promotion.
- Scenario analysis models: user retention analysis, user trajectory analysis, and user profile analysis;
- Basic service capabilities: core metrics for the current day can be viewed in real time, with real-time computing and second-level data output; dashboards support integrated reports;
- Data analysis capabilities: non-R&D staff can build their own metrics; multiple types of visual reports, data export and analysis, and data reported through omega are supported;
- Multi-product satisfaction surveys: multi-organization and multi-product structures are supported, along with online automatic configuration and lottery draws to increase participation.
With the Didi Logi log service suite, enterprises can not only better meet general operations observability and application observability needs in log scenarios, but also better serve business operations, security audit, log analysis, log mining, and other scenarios.
The overall open source plan for Didi Logi is as follows. Stay tuned.
Enterprise users who run the open source version in production can join OCE, where we provide additional and better support, such as exclusive technical salons, one-on-one enterprise exchange opportunities, and exclusive Q&A groups. The OCE application entry is in the menu of the Obsuite official account; you can also apply directly by clicking [OCE authentication].