
BiliBili: Application of a Flink-Based Machine Learning Workflow Platform at Bilibili

2021-06-22 · Flink China

Speaker: Zhang Yang, Senior Development Engineer, Bilibili

Introduction: A complete machine learning process — from data reporting to feature computation, model training, online deployment, and finally effect evaluation — is very long. At Bilibili, multiple teams each built their own machine learning pipelines to meet their own needs, which made engineering efficiency and data quality hard to guarantee. So, based on the Flink community's AIFlow project, we built a standardized machine learning workflow platform that accelerates pipeline construction and improves data freshness and accuracy across many scenarios. This talk introduces how Ultron, Bilibili's machine learning workflow platform, is applied in several machine learning scenarios at Bilibili.

Contents:

1. Real-time machine learning

2. Flink for machine learning at Bilibili

3. Building the machine learning workflow platform

4. Future plans

1. Real-time machine learning


First, let's talk about real-time machine learning. It is mainly divided into three parts:

  • The first is real-time sample generation. In traditional machine learning, samples are all t+1: today's model is trained each morning on yesterday's full-day data;

  • The second is real-time features. Features used to be t+1 as well, which leads to somewhat stale recommendations — for example, I watched many new videos today, but what is recommended to me is still based on what I watched yesterday or earlier;

  • The third is real-time model training. Once we have real-time samples and real-time features, model training can also move to online training, which yields more timely recommendation results.

The traditional offline pipeline


This figure shows the traditional offline pipeline. Logs are produced by the app or by servers, flow through a data pipeline onto HDFS, and every day, t+1, feature generation and model training are run. The generated features go into a feature store — perhaps Redis or some other KV store — and from there feed the online inference service at the top.

Drawbacks of the traditional offline pipeline


So what are its problems?

  • First, t+1 features and models have very poor freshness, and it is very hard to update them with low latency;

  • Second, model training and feature production consume a full day of data each run, so the whole training or feature-production job takes a very long time and demands a lot of cluster compute.

The real-time pipeline


In the figure above we optimized the whole pipeline for real time; the parts crossed out in red have been removed. After the data is reported, it goes through a pipeline directly into real-time Kafka, where we run real-time feature generation and real-time sample generation. Feature results are written into the feature store, and sample generation also reads some features back from the feature store.

After samples are generated we train directly in real time. The long pipeline on the right has been removed, but we still keep offline features, because some special features still need offline computation — for example, features that are very complex, hard to make real-time, or that have no real-time requirement.

2. Flink for machine learning at Bilibili


Let's talk about how we do real-time samples, real-time features, and real-time effect evaluation.

  • The first is real-time samples. Flink currently powers the sample data production for all of Bilibili's recommendation businesses;

  • The second is real-time features. Quite a few features are now computed in real time with Flink, with very high freshness. Many features combine offline and real-time results: historical data is computed offline, recent data with Flink, and the two are stitched together when the feature is read.

    However, the two sets of computing logic often cannot be reused, so we are also working on unified batch-stream processing with Flink: every feature is defined once in Flink and, depending on business needs, executed in real time or offline, with Flink as the underlying compute engine in both cases (a sketch of this pattern follows the list);

  • The third is real-time effect evaluation. We use Flink + OLAP to connect real-time computation with real-time analysis and evaluate the final effect of a model.
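As a rough, hedged sketch of what this unification can look like — not Bilibili's actual code; the table, topic, path, and field names are all assumptions — the same PyFlink feature definition can be bound to a Kafka source in streaming mode or an HDFS source in batch mode:

```python
# One feature definition, two execution modes. Only the source binding and the
# environment settings change; the feature SQL itself is shared.
from pyflink.table import EnvironmentSettings, TableEnvironment

FEATURE_SQL = """
    SELECT user_id,
           TUMBLE_END(event_time, INTERVAL '1' HOUR) AS window_end,
           COUNT(*) AS play_cnt_1h
    FROM play_events
    GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' HOUR)
"""

def build_feature_table(streaming: bool):
    settings = (EnvironmentSettings.in_streaming_mode() if streaming
                else EnvironmentSettings.in_batch_mode())
    t_env = TableEnvironment.create(settings)
    source_opts = (
        "'connector' = 'kafka', 'topic' = 'play-events', "
        "'properties.bootstrap.servers' = 'kafka:9092', "
        "'format' = 'json', 'scan.startup.mode' = 'earliest-offset'"
        if streaming else
        "'connector' = 'filesystem', "
        "'path' = 'hdfs:///logs/play_events', 'format' = 'json'"
    )
    t_env.execute_sql(f"""
        CREATE TABLE play_events (
            user_id BIGINT,
            event_time TIMESTAMP(3),
            WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
        ) WITH ({source_opts})
    """)
    return t_env.sql_query(FEATURE_SQL)  # sink to the feature store omitted
```

The point of the pattern is that the feature SQL has a single definition, so streaming and batch runs cannot drift apart semantically.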

Real-time sample generation


The picture above shows how real-time samples are currently generated, for the entire recommendation service pipeline. After log data lands in Kafka, we first run a Flink label-join job that joins clicks with impressions. The result flows into Kafka again, followed by another Flink job for the feature join, which stitches together multiple features: some are public-domain features, others are the business side's private-domain features, and their sources are quite diverse, both offline and real-time. Once all features are assembled, an instance-format sample stream is produced into Kafka for the downstream model training, as sketched below.
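A hedged sketch of the label-join step — joining each impression with any click on the same request within a bounded time window (a Flink SQL interval join). The table and column names and the 10-minute label window are assumptions, not the production configuration:

```python
# Flink SQL interval join: an impression gets label 1 if a click with the same
# request_id arrives within 10 minutes, else 0. Executed via
# t_env.execute_sql / sql_query against Kafka-backed tables (DDL omitted).
LABEL_JOIN_SQL = """
    SELECT i.request_id,
           i.user_id,
           i.item_id,
           i.event_time AS show_time,
           CASE WHEN c.request_id IS NOT NULL THEN 1 ELSE 0 END AS label
    FROM impressions AS i
    LEFT JOIN clicks AS c
      ON i.request_id = c.request_id
     AND c.event_time BETWEEN i.event_time
                          AND i.event_time + INTERVAL '10' MINUTE
"""
# The labeled stream would be written back to Kafka for the feature-join job.
```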

Real-time feature generation


The image above shows real-time feature generation, here for a fairly complex feature whose computation involves five tasks: the first is offline, and the following four are Flink jobs. The feature produced by this series of computations lands in Kafka and is then written into the feature store, where it is used for online prediction or real-time training. The chaining pattern is sketched below.
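A minimal sketch of one stage in such a chain: a Flink job reads the previous stage's Kafka topic and writes a derived feature to the next topic. All topic, table, and field names here are invented for illustration:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE stage_1_out (
        user_id BIGINT,
        raw_score DOUBLE,
        freshness_score DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka', 'topic' = 'stage-1-out',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json', 'scan.startup.mode' = 'earliest-offset'
    )
""")

t_env.execute_sql("""
    CREATE TABLE stage_2_out (
        user_id BIGINT,
        combined_feature DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka', 'topic' = 'stage-2-out',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

# The last stage of the chain would write to the feature store instead.
t_env.execute_sql("""
    INSERT INTO stage_2_out
    SELECT user_id, raw_score * 0.3 + freshness_score, event_time
    FROM stage_1_out
""")
```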

Real-time effect evaluation


The picture above shows real-time effect evaluation. A core metric that recommendation algorithms focus on is CTR (click-through rate). Once the label-join is done, CTR can be computed; besides feeding the next sample-generation step, a copy of the data is also imported into ClickHouse, and once the reporting system is hooked up, the effect can be observed with very low latency. The data itself carries experiment tags, so in ClickHouse it can be broken down by tag to see each experiment's effect.
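A hedged sketch of such a metric job: one-minute CTR per experiment tag over the label-joined stream, written to an OLAP-backed report table. The sink DDL for ClickHouse (e.g., via a JDBC sink) is omitted, and all names are assumptions:

```python
# Per-experiment CTR over 1-minute tumbling windows. labeled_impressions is
# the output of the label-join; ctr_report is a ClickHouse-backed sink table.
CTR_SQL = """
    INSERT INTO ctr_report
    SELECT exp_tag,
           TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end,
           SUM(label) * 1.0 / COUNT(*) AS ctr,
           COUNT(*) AS shows
    FROM labeled_impressions
    GROUP BY exp_tag, TUMBLE(event_time, INTERVAL '1' MINUTE)
"""
```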

3. Building the machine learning workflow platform

Pain points


  • The whole machine learning pipeline covers sample generation, feature generation, training, prediction, and effect evaluation; each part requires configuring and developing many tasks, so launching one model ultimately spans many tasks and a very long chain.
  • It is hard for a new algorithm engineer to grasp the full picture of such a complex pipeline, so the learning cost is very high.
  • Any change can ripple through the whole pipeline, which makes it very easy to break.
  • The compute layer uses multiple engines with batch and stream mixed, so keeping semantics consistent is hard: the same logic has to be developed twice, and keeping the two implementations free of gaps is also very difficult.
  • The bar for going real-time is high: it takes strong real-time and offline engineering capabilities, and many small business teams cannot manage it without platform support.


The figure above shows the typical process a model goes through from data preparation to training, involving seven or eight nodes. Can we complete all of these steps on a single platform? And why Flink? Because our team's real-time computing platform is built on Flink, and we saw Flink's potential in unified batch-stream processing as well as its future path toward real-time model training and deployment.

Introducing AIFlow


AIFlow is an open-source machine learning workflow platform from Alibaba's Flink ecosystem team, focused on standardizing the process and the entire machine learning pipeline. After getting in touch with the team around last August and September, we introduced the system, built and improved it together with them, and gradually began rolling it out at Bilibili. It abstracts a machine learning workflow into processes such as example, transform, train, validation, and inference. Architecturally, the core of its scheduling capability is support for mixed stream-batch dependencies, and its metadata layer supports model management, which makes iterative model updates very convenient. We built our machine learning workflow platform on this foundation.

Platform features


Here are the platform's features:

  • The first is defining workflows in Python. In the AI field Python is still the dominant choice, and external references such as Netflix also use Python to define this kind of machine learning workflow.
  • The second is support for batch and stream tasks with mixed dependencies. A complete pipeline can include all the real-time and offline processes involved, and tasks can depend on one another through signals.
  • The third is one-click cloning of a whole experiment. From the raw logs all the way to training, we want to clone the entire pipeline with one click and quickly spin up a new experimental pipeline.
  • The fourth is performance optimization, including resource sharing.
  • The fifth is support for feature backtracking (backfill) and unified batch-stream processing. Cold-starting many features requires computing over a long span of historical data; writing a dedicated offline feature computation just for cold start is expensive and hard to keep aligned with the real-time feature computation, so we support backfilling offline features directly through the real-time pipeline.

Basic architecture


The figure shows the basic architecture: business scenarios at the top, engines at the bottom. Quite a few engines are currently supported — Flink, Spark, Hive, Kafka, HBase, Redis — covering both compute and storage. With AIFlow as the intermediate workflow management layer and Flink as the core compute engine, we designed the entire workflow platform.

Workflow description


A workflow is described in Python: the user only defines compute nodes and resource nodes, plus the dependencies between them. The syntax is somewhat like the scheduling framework Airflow.
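The talk does not show the platform's actual API, so the following is a hypothetical, Airflow-flavored sketch of what such a Python definition could look like; every name in it is invented. The `trigger` argument also previews the four batch-stream dependency kinds listed in the next subsection:

```python
# Hypothetical workflow DSL: nodes plus dependencies, where a dependency's
# trigger models the batch/stream kinds ("on_finish" for a batch upstream,
# "on_event", i.e. a signal, for a stream upstream). Not the real API.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    name: str
    engine: str                                   # "flink-stream", "flink-batch", ...
    upstream: List[Tuple["Node", str]] = field(default_factory=list)

    def depends_on(self, other: "Node", trigger: str = "on_finish") -> "Node":
        self.upstream.append((other, trigger))
        return self

label_join = Node("label_join", "flink-stream")
backfill   = Node("feature_backfill", "flink-batch")
sample_gen = (Node("sample_gen", "flink-stream")
              .depends_on(label_join, trigger="on_event")   # stream -> stream
              .depends_on(backfill, trigger="on_finish"))   # batch  -> stream
train      = Node("train", "flink-stream").depends_on(sample_gen, "on_event")
```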

Dependency definition


There are four main kinds of batch-stream dependencies: stream-to-batch, stream-to-stream, batch-to-stream, and batch-to-batch. These basically cover all of our current business needs.

Resource sharing


Resource sharing is mainly for performance. A machine learning pipeline is usually very long; as mentioned above, I may frequently change only five or six nodes. When I want to rerun the whole experiment, I clone the entire graph, change only the nodes I need to, and the unchanged upstream nodes can share their data with the new run.


This is the technical implementation: after cloning, the shared nodes are reference-tracked.

Real-time training


This is the real-time training process. Feature crossing is a very common problem; it happens when the processing progress of multiple compute tasks is inconsistent. On the workflow platform we can define dependencies between nodes; once nodes depend on each other, their processing progress is kept in sync — roughly, the fast waits for the slow — which avoids feature crossing. In Flink we use watermarks to define processing progress; a minimal sketch of the idea follows.
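A minimal plain-Python sketch of the "fast waits for slow" idea. This is illustrative only — in production the progress signal is a Flink watermark, not hand-rolled buffering:

```python
class WatermarkGate:
    """Hold back samples until the feature job's watermark has passed their
    event time, so a sample never joins features that are still incomplete
    (the feature-crossing problem)."""

    def __init__(self, emit):
        self.pending = []            # event times waiting on features
        self.emit = emit             # downstream callback

    def on_sample(self, event_time: int, feature_watermark: int) -> None:
        if event_time <= feature_watermark:
            self.emit(event_time)    # features for this time are complete
        else:
            self.pending.append(event_time)

    def on_feature_watermark(self, watermark: int) -> None:
        # The feature job advanced; release every buffered sample it now covers.
        ready = sorted(t for t in self.pending if t <= watermark)
        self.pending = [t for t in self.pending if t > watermark]
        for t in ready:
            self.emit(t)
```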

Feature backtracking


This is the feature backtracking process: we use the real-time pipeline itself to backfill its historical data directly. Offline and real-time data differ in many ways, and a number of problems have to be solved, so part of the process currently uses Spark; we will migrate that part to Flink.

Problems with feature backtracking


There are several big problems with feature backtracking:

  • The first is how to guarantee data order. The implicit semantics of real-time data is that it arrives in order: it is processed as soon as it is produced, so it carries a natural ordering. Offline data on HDFS does not: HDFS data is partitioned, and the data within a partition is completely unordered. Since many computations in real business logic depend on time order, resolving the disorder of offline data is a big problem.
  • The second is how to guarantee consistency between feature and sample versions. There are two pipelines, feature production and sample production, and sample production depends on feature production; how do we keep their versions consistent, with no crossing?
  • The third is how to guarantee that the real-time pipeline and the backtracking pipeline use the same computation logic. This one we actually do not need to worry about, since we backfill offline data directly over the real-time pipeline itself.
  • The fourth is performance: how to compute a large volume of historical data quickly.

Solutions


Here are the solutions to the first and second problems:

  • For the first problem, data order: we "Kafka-ize" the offline data on HDFS. This does not mean pouring it into Kafka; rather, we simulate Kafka's data layout — partitioned, and ordered within each partition. We process the HDFS data into a similar layout, simulating logical partitions that are internally ordered, and we developed a Flink HDFS source that understands this simulated layout (a sketch of the partitioning idea follows this list). The simulation is currently done with Spark; later we will migrate it to Flink.

  • The second problem has two parts:

    • For the real-time feature part, the solution relies on HBase storage, which supports querying by version. After a feature is computed it is written straight to HBase with its version; when a sample is generated, we query HBase with the corresponding version number, where the version is usually the data time.
    • For the offline feature part, there is no need to recompute — the data already sits in offline storage on HDFS. But HDFS does not support point lookups, so here we KV-ize the data, and for performance we use asynchronous preloading.
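A toy sketch of the logical-partitioning idea from the first bullet. The production job runs on Spark today; this in-memory version only shows the contract — records bucketed by key, ordered by event time within each bucket, like Kafka partitions — and the field names are assumptions:

```python
from typing import Dict, Iterable, List

def to_logical_partitions(records: Iterable[Dict],
                          num_partitions: int) -> List[List[Dict]]:
    """Rearrange unordered offline records into Kafka-like logical partitions:
    the same key always lands in the same partition, and each partition is
    sorted by event time, so the real-time job can replay them unchanged."""
    partitions: List[List[Dict]] = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(rec["user_id"]) % num_partitions].append(rec)
    for part in partitions:
        part.sort(key=lambda r: r["event_time"])  # ordered within partition
    return partitions
```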


The asynchronous preloading process is shown in the figure.

4. Future plans

Finally, let's talk about our future plans.


  • The first is data quality assurance. Pipelines keep getting longer — possibly 10 or 20 nodes — so when something goes wrong, we need to locate the failing point quickly. Here we want to introduce DQC (data quality checking) for sets of nodes: custom data quality rules can be defined for each node, and the data is tapped off to a unified dqc-center that runs the rules.


  • The second is full-pipeline exactly-once: how to guarantee exactly-once consistency across workflow nodes. We do not have a clear answer to this yet.


  • Third, we will add model training and deployment nodes to the workflow. Training and deployment may connect to other platforms, or use training and serving supported by Flink itself.

**About the speaker:** Zhang Yang joined Bilibili in 2017 and works on big data.
