BiliBili: Application of a Flink-based machine learning workflow platform at Bilibili
2021-06-22 01:22:53 [Flink China]
Speaker: Zhang Yang, Senior Development Engineer at Bilibili
Introduction: The full machine learning process, from data reporting to feature computation, model training, online deployment, and finally effect evaluation, is very long. At Bilibili, multiple teams each build their own machine learning pipelines to serve their own needs, which makes engineering efficiency and data quality hard to guarantee. We therefore built a standardized machine learning workflow platform on top of the Flink community's aiflow project, accelerating the construction of machine learning pipelines and improving data freshness and accuracy in many scenarios. This talk introduces Ultron, Bilibili's machine learning workflow platform, and its application in several machine learning scenarios at Bilibili.
1. Real-time machine learning
2. How Flink is used for machine learning at Bilibili
3. Building the machine learning workflow platform
4. Future plans
1. Real-time machine learning
First, real-time machine learning, which breaks down into three parts:
The first is real-time samples. In traditional machine learning, all samples are t+1: today's model uses yesterday's training data, that is, every morning the model is trained on the full previous day's data;
The second is real-time features. Features used to be mostly t+1, which leads to stale recommendations. For example, I may have watched many new videos today, but the recommendations I get are still based on what I watched yesterday or earlier;
The third is real-time model training. Once we have real-time samples and real-time features, training can also move to real-time online training, which yields more timely recommendation results.
The traditional offline pipeline
This is a diagram of the traditional offline pipeline. First, logs are generated by the app or the server; the data flows through a data pipeline onto HDFS. Every day, on a t+1 schedule, feature generation and model training are run; generated features are written into a feature store, which may be Redis or some other KV storage, and are then consumed by the online inference service at the top.
Drawbacks of the traditional offline pipeline
So what is wrong with it?
First, t+1 features and models have very low freshness, and updating them at a higher frequency is very difficult;
Second, each run of model training or feature production consumes a full day of data, so the whole job takes a very long time and puts heavy demand on cluster compute.
The real-time pipeline
In the figure above we optimize the whole pipeline into a real-time one; the parts crossed out in red are removed. After reporting, data flows through the pipeline directly into real-time Kafka. We then run real-time feature generation and real-time sample generation; feature results are written into the feature store, and sample generation also reads some features back from the feature store.
Once samples are generated, we train on them directly in real time. The long pipeline on the right has been removed, but we still keep offline features, because a few special features still require offline computation, for example features that are very complex and hard to compute in real time, or that have no real-time requirement.
2. How Flink is used for machine learning at Bilibili
Next, how we do real-time samples, real-time features, and real-time effect evaluation.
The first is real-time samples. Flink currently powers the sample production of all recommendation businesses at Bilibili;
The second is real-time features. Quite a few features are now computed in real time with Flink, with very high freshness. Many features combine offline and real-time results: historical data is computed offline, recent data with Flink, and the two are stitched together when features are read.
However, these two sets of computation logic often cannot be reused, so we are also trying a unified batch-stream approach with Flink: all feature definitions are written once in Flink and, depending on business needs, run either in real time or offline, with Flink as the underlying computing engine in both cases;
The third is real-time effect evaluation. We use Flink plus OLAP to connect real-time computation with real-time analysis and evaluate the final effect of the model.
Real-time sample generation
The figure above shows how real-time samples are generated today; it serves the entire recommendation pipeline. After log data lands in Kafka, a first Flink job performs a label-join, matching clicks with impressions. The results land in Kafka again, and a second Flink job performs a feature join, stitching features together: some are public-domain features, others are private-domain features of the business team, and their sources are quite diverse, both offline and real-time. Once all features are assembled, instance sample data is produced into Kafka for the model training downstream.
Real-time feature generation
The figure above shows real-time feature generation, here for a fairly complex feature whose computation involves 5 tasks: the first is an offline task, and the following 4 are Flink jobs. The feature produced by this series of computations lands in Kafka, is then written into the feature store, and is used for online prediction or real-time training.
Real-time effect evaluation
The figure above shows real-time effect evaluation. A core metric that recommendation algorithms care about is CTR, the click-through rate. Once the label-join is done, CTR can be computed, so besides feeding the next step of sample generation, a copy of the data is also imported into ClickHouse; connected to the reporting system, this gives a near real-time view of the effect. The data itself carries experiment tags, so in ClickHouse it can be broken down by tag to see the effect of each experiment.
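The per-experiment breakdown amounts to a grouped CTR aggregation. A plain-Python sketch of the idea follows (in production this is a ClickHouse query; the `exp` and `label` field names are assumptions):

```python
from collections import defaultdict

def ctr_by_experiment(samples):
    """Aggregate click-through rate per experiment tag, mirroring the
    per-tag breakdown queried from ClickHouse. `samples` are labeled
    records as produced by the label-join (label is 0 or 1)."""
    shows = defaultdict(int)
    clicks = defaultdict(int)
    for s in samples:
        shows[s["exp"]] += 1
        clicks[s["exp"]] += s["label"]
    return {exp: clicks[exp] / shows[exp] for exp in shows}
```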
3. Building the machine learning workflow platform
- The full machine learning pipeline covers sample generation, feature generation, training, prediction, and effect evaluation. Each part requires configuring and developing many tasks, so bringing one model online ultimately spans many tasks, and the pipeline is very long.
- It is hard for new algorithm engineers to grasp the full picture of such a complex pipeline, so the learning cost is very high.
- Any change to the pipeline can affect the whole thing, so it breaks very easily.
- The computing layer uses multiple engines and mixes batch and streaming; keeping the semantics consistent is hard, since the same logic has to be developed twice, and keeping the two implementations free of gaps is also very difficult.
- The bar for going real-time is high: it requires strong real-time and offline engineering skills, and many small business teams struggle without platform support.
The figure above shows the rough process a model goes through from data preparation to training, involving seven or eight nodes. Can we complete all of these steps on a single platform? And why Flink? Because our team's real-time computing platform is built on Flink, and we saw Flink's potential in unified batch-stream processing, as well as its future path in real-time model training and deployment.
Aiflow is an open-source machine learning workflow platform from Alibaba's Flink ecosystem team, focused on standardizing the process and the entire machine learning pipeline. Around last August and September, after getting in touch with the team, we introduced the system, built and improved it together with them, and gradually began landing it at Bilibili. It abstracts the whole machine learning process into stages such as example, transform, train, validation, and inference. Architecturally, the core of its scheduling is support for mixed stream-batch dependencies, and its metadata layer supports model management, which makes iterative model updates very convenient. We built our machine learning workflow platform on top of it.
The platform's features:
- First, workflows are defined in Python. In the AI field Python is still the dominant choice, and we also looked at external references: Netflix, for example, also uses Python to define this kind of machine learning workflow.
- Second, it supports batch and streaming tasks with mixed dependencies. All the real-time and offline processes of a complete pipeline can be added to it, and batch and streaming tasks can depend on each other through signals.
- Third, it supports one-click cloning of a whole experiment. From the raw logs all the way to training, we want to be able to clone the entire pipeline with one click and quickly bring up a new experimental pipeline.
- Fourth, some performance optimizations, including resource sharing.
- Fifth, it supports feature backfill and unified batch-stream computation. Cold-starting many features requires computing over a long stretch of historical data; writing a dedicated offline feature computation just for cold start is very expensive, and hard to keep aligned with the real-time feature computation, so we support backfilling offline features directly on the real-time pipeline.
Here is the basic architecture: business at the top, engines at the bottom. Many engines are currently supported: Flink, Spark, Hive, Kafka, HBase, Redis; some are computing engines, others storage engines. With aiflow as the intermediate workflow manager and Flink as the core computing engine, we designed the entire workflow platform.
The whole workflow is described in Python: users only need to define computing nodes and resource nodes, plus the dependencies between them; the syntax is somewhat like the scheduling framework Airflow.
There are 4 main kinds of batch-stream dependencies: stream to batch, stream to stream, batch to stream, and batch to batch. These basically cover all of our current business needs.
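To make the node-and-dependency idea concrete, here is a toy Python sketch of such a workflow definition. The class and method names are hypothetical, not the real aiflow API; it only illustrates declaring stream/batch nodes and classifying each edge into one of the four dependency kinds.

```python
class Workflow:
    """A toy workflow DSL in the spirit of the platform's Python definitions
    (names are hypothetical, not the actual aiflow API)."""
    def __init__(self, name):
        self.name = name
        self.nodes = {}   # node name -> execution mode ("stream" | "batch")
        self.deps = []    # (upstream, downstream) pairs

    def node(self, name, mode):
        """Declare a computing node that runs as a stream or batch job."""
        self.nodes[name] = mode
        return name

    def depends(self, downstream, upstream):
        """Declare that `downstream` depends on `upstream`."""
        self.deps.append((upstream, downstream))

    def dependency_kinds(self):
        """Classify each edge as one of the four supported kinds:
        stream->batch, stream->stream, batch->stream, batch->batch."""
        return [f"{self.nodes[u]}->{self.nodes[d]}" for u, d in self.deps]

# Example: a simplified CTR training pipeline.
wf = Workflow("ctr-training")
feat = wf.node("feature_generation", "stream")
hist = wf.node("feature_backfill", "batch")
samp = wf.node("sample_generation", "stream")
train = wf.node("realtime_training", "stream")
wf.depends(samp, feat)   # stream -> stream
wf.depends(samp, hist)   # batch -> stream
wf.depends(train, samp)  # stream -> stream
```

In the real platform, the scheduler uses these declarations (plus signals between jobs) to decide when each node may start or advance.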
Resource sharing is mainly for performance, because a machine learning pipeline is often very long. Of the nodes just shown, for example, I may frequently change only five or six; when I restart the whole experiment by cloning the whole graph, only some of the nodes actually change, and the unchanged upstream nodes can share their data with the clone.
In the technical implementation, shared nodes are tracked after cloning.
Real-time training
This is the real-time training process. Feature traversal is a very common problem; it occurs when the progress of multiple computing tasks is out of sync. On the workflow platform, we can define dependencies between nodes, and once nodes depend on each other, their processing progress is synchronized: roughly speaking, the fast waits for the slow, which avoids feature traversal. Inside Flink, we use watermarks to define the processing progress.
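A minimal sketch of this "fast waits for slow" rule, in plain Python rather than the Flink API, assuming each upstream task exposes an event-time watermark:

```python
def aligned_progress(watermarks):
    """When several upstream tasks feed one consumer, the consumer may only
    advance to the minimum upstream watermark. `watermarks` maps task name
    -> event-time watermark (e.g. epoch seconds)."""
    return min(watermarks.values())

def can_emit(sample_ts, watermarks):
    """A sample may be emitted only once every upstream feature task has
    progressed past the sample's event time; otherwise the sample could see
    features 'from the future' (feature traversal)."""
    return sample_ts <= aligned_progress(watermarks)
```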
This is the feature backfill process: we use the real-time pipeline to directly backfill its historical data. Offline and real-time data do differ, and there are many problems to solve, so Spark is also used here; we will later migrate that part to Flink.
Problems with feature backfill
Feature backfill has several big problems:
- The first is how to guarantee data order. The implicit semantics of real-time data is that events arrive in order: each is processed as soon as it is produced, so there is a natural ordering. Offline HDFS data is different: HDFS is partitioned, and the data within a partition is completely unordered, yet many computation steps in real business depend on time order. Handling the disorder of offline data is therefore a big problem.
- The second is how to keep feature and sample versions consistent. For example, there are two pipelines, one producing features and one producing samples, and sample production depends on feature production. How do we keep their versions consistent, without crossing?
- The third is how to keep the computation logic consistent between the real-time pipeline and the backfill pipeline. This one we actually do not need to worry about, because we backfill offline data directly on the real-time pipeline.
- The fourth is performance: how to compute a large volume of historical data quickly.
Here are the solutions to the first and second problems:
For the first problem, data order, we "Kafka-ize" the offline data on HDFS. This does not mean pouring it into Kafka; rather, we simulate Kafka's data layout: partitioned, and ordered within each partition. We process the HDFS data into a similar layout with logical partitions that are internally ordered, and we developed a Flink HDFS source that can read this simulated layout. The simulation is currently done with Spark, and we will later migrate it to Flink.
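The "Kafka-ization" idea can be sketched in a few lines of Python. This is a toy in-memory version of what the Spark job does at scale: hash each record to a logical partition by key and sort within each partition by event time, so same-key records replay in order, as they would from Kafka. The record fields are assumptions for illustration.

```python
def kafka_ize(records, num_partitions, key_fn=lambda r: r["user"]):
    """Reshape unordered offline records into 'logical partitions' that mimic
    Kafka's semantics: records are hashed to a partition by key, and sorted
    by event time within each partition, so the backfill job can replay them
    with the ordering guarantees the real-time job relies on."""
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        # Same key always lands in the same partition, as in Kafka.
        partitions[hash(key_fn(r)) % num_partitions].append(r)
    for p in partitions:
        p.sort(key=lambda r: r["ts"])  # in-partition ordering
    return partitions
```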
The second problem is solved in two parts:
- For real-time features, the solution relies on HBase storage, since HBase supports queries by version. After a feature is computed, it is written to HBase under its version, and when samples are generated we query HBase with the corresponding version number; the version here is usually the data time.
- For offline features, no recomputation is needed because the data already sits in offline storage on HDFS; but HDFS does not support point queries, so here we KV-ize the data, and for performance we do asynchronous preloading.
The process of asynchronous preloading is shown in the figure.
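The version-consistent read can be illustrated with a small in-memory stand-in. HBase provides this natively through cell timestamps; this toy class only shows the read rule, namely returning the latest feature version that is not newer than the sample's data time:

```python
import bisect

class VersionedFeatureStore:
    """A minimal in-memory stand-in for version-aware feature reads.
    Features are written with a version (the data time); sample generation
    reads the latest version <= the sample's own data time, which keeps
    features and samples consistent and avoids crossing."""
    def __init__(self):
        self._versions = {}  # key -> sorted list of versions
        self._values = {}    # key -> values aligned with the versions list

    def put(self, key, version, value):
        vs = self._versions.setdefault(key, [])
        vals = self._values.setdefault(key, [])
        i = bisect.bisect_left(vs, version)
        vs.insert(i, version)
        vals.insert(i, value)

    def get(self, key, version):
        """Return the value of the newest version <= `version`, or None."""
        vs = self._versions.get(key, [])
        i = bisect.bisect_right(vs, version)
        return self._values[key][i - 1] if i else None
```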
4. Future plans
Finally, our plans.
- One is data quality assurance. The pipeline keeps getting longer, perhaps 10 or 20 nodes, so when something goes wrong, how do we quickly locate the problem? Here we want to do node-level DQC: for each node, custom data quality check rules can be defined, and the data is bypassed to a unified dqc-center that runs the rules.
- The second is full-pipeline exactly-once: how to guarantee exactly-once consistency across workflow nodes. This is not yet worked out.
- Third, we will add model training and deployment nodes to the workflow. Training and deployment may connect to other platforms, or to training and deployment services supported by Flink itself.
**Speaker bio:** Zhang Yang joined Bilibili in 2017 and works on big data.