
How Apache Flink handles out-of-order data in real-time computing

2020-12-06 16:15:09 itread01

## 1. The future of stream computing

After Google published its three seminal papers on GFS, BigTable, and MapReduce, big data technology made its first real leap forward, and the Hadoop ecosystem gradually developed around those ideas. Hadoop is very good at processing large volumes of data, with the following characteristics:

1. Before a computation can begin, the data must be fully prepared in advance;
2. The final result is emitted only once the batch computation completes;
3. Latency is relatively high, so it is not suitable for real-time computation.

As businesses such as real-time recommendation and risk control developed, the demands on processing latency grew ever higher, and Flink began to stand out in the community. Apache Flink is a true stream-processing framework: it offers low latency, guarantees that messages are neither lost nor duplicated, delivers very high throughput, and supports native stream processing.

This article introduces Flink's notions of time, window computation, and how Flink handles out-of-order data within windows.

## 2. Notions of time in Flink

Flink has three main notions of time:

1. The time at which the event actually occurred, called **event time**;
2. The time at which the data enters Flink, called **ingestion time**;
3. The system time of the machine on which the data is processed, called **processing time**.

Processing time is the simplest notion of time: it requires no coordination between streams and machines, so it offers the best performance and lowest latency. But in a distributed environment, the processing-time clocks of different machines cannot be kept strictly consistent, so processing time offers no determinism guarantees.
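To make the distinction concrete, here is a small plain-Java sketch (not a Flink API; the timestamps are made up) showing that the order in which records are processed can differ from the order in which their events actually occurred:

```java
import java.util.*;

public class TimeNotionsDemo {
    public static void main(String[] args) {
        // Each value is the event time carried inside a record. The array
        // order models arrival (processing) order, which is skewed by
        // network and system delays.
        long[] arrivalOrderEventTimes = {100, 300, 200, 500, 400};

        long[] byEventTime = arrivalOrderEventTimes.clone();
        Arrays.sort(byEventTime);

        // Processing order and event-time order disagree:
        System.out.println(Arrays.equals(arrivalOrderEventTimes, byEventTime));
        System.out.println(Arrays.toString(byEventTime));
    }
}
```

Processing in arrival order would treat event 300 before event 200; only the embedded event timestamps let an engine reconstruct the true order.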
Event time, by contrast, is the time at which the event actually happened. It is already part of the record when the data enters Flink, so by extracting the event timestamp, processing can reflect the order in which events occurred.

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105641708-1416579722.png)

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105642367-2061862335.png)

## 3. Why Flink needs window computation

A streaming dataset has no boundary: data keeps flowing into our system. The ultimate goal of stream computing is to produce aggregated results from the data, but running a single global aggregation over an unbounded dataset is unrealistic. Only by delimiting windows of a certain size can we compute results that are finally aggregated downstream for analysis and presentation.

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105642639-1259834552.png)

When Flink performs window computation, it needs two core pieces of information:

* What is the event-time timestamp of each element? (It can be carried in the data record.)
* Given the incoming data, when can the computation be triggered? (That is, when has all the data for, say, the 11:00–11:10 window been received?)

#### Ordered events

Under ideal conditions, if the data arrives strictly in order, the stream-computing engine can compute every window correctly.

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105643182-511384768.png)

#### Out-of-order events

In reality, however, for various reasons (system delays, network delays, and so on) data does not arrive at the system strictly in order, and some records can be very late. Flink therefore needs a mechanism that tolerates out-of-order data within a certain bound. This mechanism is the watermark.
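The windowing idea can be sketched in plain Java (this is an illustration, not the Flink API): each event timestamp is mapped to the start of its fixed-size window, and events are counted per window.

```java
import java.util.*;

public class TumblingWindows {
    static final long WINDOW_SIZE_MS = 10 * 60 * 1000; // 10-minute tumbling windows

    // Start of the window that a given event timestamp falls into.
    static long windowStart(long ts) {
        return ts - (ts % WINDOW_SIZE_MS);
    }

    public static void main(String[] args) {
        // Made-up event timestamps in milliseconds.
        long[] eventTimes = {0, 120_000, 610_000, 650_000, 1_250_000};

        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : eventTimes) {
            counts.merge(windowStart(ts), 1, Integer::sum);
        }
        // Three windows: [0, 10min) has 2 events, [10min, 20min) has 2,
        // and [20min, 30min) has 1.
        System.out.println(counts);
    }
}
```

With strictly ordered input, a window could be finalized as soon as an event belonging to a later window shows up; the watermark mechanism below generalizes this to out-of-order input.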
![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105643375-128840522.png)

In the figure above there is a parameter, `MaxOutOfOrderness = 4`: the maximum out-of-orderness. It bounds how late data is allowed to be; it could be 4 minutes, 4 hours, and so on.

The watermark-generation strategy is: subtract the `MaxOutOfOrderness` value from the largest event timestamp seen so far in the window. As shown above, event 7 produces watermark w(3), and event 11 produces w(7). Event 9 is smaller than event 11, so it does not advance the watermark. Event 15 then produces w(11). In other words, the watermark reflects the overall progress of the event stream: it only moves forward, never backward. A watermark asserts that all events with timestamps less than its value have already reached the window.

> Whenever a new maximum timestamp appears, a new watermark is produced.

#### Late events

An event whose event time is less than the current watermark is called a late event. Late events are not included in the window computation. As shown below, after event 21 enters the system, watermark w(17) is produced. Event 16 then arrives; because it is less than the current watermark w(17), it is not counted.

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105643557-576187723.png)

#### When computation is triggered

Let's use a series of figures to show when window computation is triggered. The figure below shows a window from 11:50 to 12:00 with a maximum allowed delay of 5 minutes. A record arrives: `cat, 11:55`, with event time 11:55, and enters the window; the current watermark is therefore 11:50.

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105643781-203061738.png)

Another record arrives: `dog, 11:59`, with event time 11:59, and enters the window. Since this event's timestamp is larger than the previous one, the watermark advances to 11:54.
The watermark is still less than the window end time, so computation is still not triggered.

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105644028-1750686641.png)

Another record arrives: `cow, 12:06`. The watermark now advances to 12:01, which is past the window end time, so the window computation fires (suppose the computation counts the distinct elements in the window).

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105644550-93474008.png)

Suppose one more event arrives: `dog, 11:58`. Its timestamp is less than the watermark, and since the window was destroyed after its computation fired, this event does not trigger any computation. At this point, you can route the event to a side output for additional processing.

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105644774-1954287544.png)

## 4. Defining watermarks in Flink 1.11

In version 1.11, the watermark-generation interface was reworked. In the new version, watermarks are generated through the `WatermarkStrategy` class, which supports different generation strategies. The new interface provides many static methods as well as methods with default implementations. If you want to define your own generation strategy, you can do it like this:

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105645100-1466801694.png)

That is, you implement a `WatermarkGenerator`:

![file](https://img2020.cnblogs.com/other/611106/202012/611106-20201206105645662-2046312946.png)

The interface is simple:

* `onEvent`: if you want every element to generate a watermark and send it downstream, you can do it here;
* `onPeriodicEmit`: with a large volume of data, generating a watermark per record would hurt performance, so there is also a method for emitting watermarks periodically.
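The numeric example from the watermark section (events 7, 11, 9, 15 with `MaxOutOfOrderness = 4`) can be reproduced with a tiny standalone class that mirrors the `onEvent` / periodic-emit split of Flink's `WatermarkGenerator`. This is a plain-Java sketch of the idea, not Flink's actual interface:

```java
public class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrderness;
    private long maxTimestamp = Long.MIN_VALUE;

    public BoundedOutOfOrdernessSketch(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    // Mirrors WatermarkGenerator#onEvent: track the largest timestamp seen.
    public void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    // Mirrors WatermarkGenerator#onPeriodicEmit:
    // watermark = max timestamp seen - allowed out-of-orderness.
    public long currentWatermark() {
        return maxTimestamp - maxOutOfOrderness;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch gen = new BoundedOutOfOrdernessSketch(4);
        long[] events = {7, 11, 9, 15};
        for (long e : events) {
            gen.onEvent(e);
            System.out.println("event " + e + " -> watermark " + gen.currentWatermark());
        }
        // The out-of-order event 9 does not lower the watermark: it stays at 7.
    }
}
```

Because the watermark is derived from the running maximum timestamp, it is monotonic by construction, exactly as described above.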
For convenience, Flink also provides some built-in watermark-generation strategies:

* Fixed-delay watermarks

If we want watermarks with a fixed delay of 3 seconds, we can write:

```java
DataStream dataStream = ...... ;
dataStream.assignTimestampsAndWatermarks(
        WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(3)));
```

* Monotonically increasing watermarks

This is the fixed-delay strategy with the delay removed: the event timestamp itself serves as the watermark. It can be used like this:

```java
DataStream dataStream = ...... ;
dataStream.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
```

## 5. A small example: counting letters in a window

```java
import java.time.Duration;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class StreamTest1 {

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    @ToString
    public static class MyLog {
        private String msg;
        private Integer cnt;
        private long timestamp;
    }

    // The source article is truncated at this point; everything from here on
    // is a minimal reconstruction of the described example: a custom source
    // plus a tumbling event-time window that sums the count per letter.
    public static class MySourceFunction implements SourceFunction<MyLog> {
        @Override
        public void run(SourceContext<MyLog> ctx) {
            long base = System.currentTimeMillis();
            ctx.collect(new MyLog("a", 1, base));
            ctx.collect(new MyLog("b", 1, base + 1_000));
            ctx.collect(new MyLog("a", 1, base + 2_000));
        }

        @Override
        public void cancel() { }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new MySourceFunction())
           .assignTimestampsAndWatermarks(
               WatermarkStrategy.<MyLog>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                   .withTimestampAssigner((log, ts) -> log.getTimestamp()))
           .keyBy(MyLog::getMsg)
           .window(TumblingEventTimeWindows.of(Time.seconds(10)))
           .sum("cnt")
           .print();
        env.execute("StreamTest1");
    }
}
```

Copyright notice

This article was written by [itread01]. Please include a link to the original when reposting. Thanks.

https://chowdera.com/2020/12/202012061613372058.html