Solid learning data technology 05: data scheduling
2020-12-07 10:53:18 【osc_ qcm2mqmy】
Zhao Zhuangshi / A data man's private plot
Hello everyone, I'm Zhao Zhuangshi.
It's a pleasure to meet you again on Saturday morning. In the previous installment, "Solid learning data technology 04: ETL", we talked about data warehouse development. Today we pick up from there and talk about a link that is essential to both data processing and report production: data scheduling.
01 What is data scheduling?
In data development, "data scheduling" usually means "task scheduling" or "job scheduling". So let's start with two concepts: job and task.
The meaning of job and task differs from one context to another:
In Spark, a task is the smallest unit of execution that a job is split into. In general, an RDD with N partitions produces N tasks, because each task processes the data of exactly one partition. Tasks are grouped into batches called stages. Spark sets up dependencies between stages and between tasks to guarantee that the whole job runs correctly and completely; when the last ResultTask finishes, the job has run successfully.
In Hadoop, a unit of work is called a Job. A Job is divided into a Map Task phase and a Reduce Task phase. Every Task runs in its own process, and that process exits when the Task ends.
In the context of a scheduling product:
Task: a task definition.
TaskType: the type of a task, such as ETL, MR job, or Simple.
Job: one execution of a task at run time.
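The Task / TaskType / Job distinction above can be sketched as a small data model. This is only an illustration: the class fields, the `_job_ids` counter, and the `load_orders` task are hypothetical names, not taken from any particular scheduling product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from itertools import count


class TaskType(Enum):
    ETL = "ETL"
    MR_JOB = "MR job"
    SIMPLE = "Simple"


@dataclass(frozen=True)
class Task:
    """A task definition: what to run, independent of any single run."""
    name: str
    task_type: TaskType


_job_ids = count(1)  # hypothetical generator of run ids


@dataclass
class Job:
    """One execution (run instance) of a Task."""
    task: Task
    scheduled_for: datetime
    job_id: int = field(default_factory=lambda: next(_job_ids))
    status: str = "PENDING"


# One task definition, three runs of it on three consecutive days.
daily_etl = Task("load_orders", TaskType.ETL)
runs = [Job(daily_etl, datetime(2020, 12, day)) for day in (5, 6, 7)]
print(len(runs), "jobs for task", daily_etl.name)
```

The point of the split is that a Task is defined once, while every scheduled run creates a new Job with its own id and status.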
In short, the relationship between job and task changes with the context, so pay attention to how each data scheduling product defines them.
To summarize, data scheduling decides when a task starts, when it ends, and how the dependencies between tasks are handled correctly. The first priority is to start the right job at the right time and to make sure jobs run promptly and accurately in the correct dependency order.
02 What modules does a data scheduling product contain?
When designing a scheduling product, we need to answer a few questions:
1. Trigger mechanism: time, dependency, hybrid
· Time: tasks are scheduled by time (year / month / day / hour / minute / second / millisecond)
· Dependency: tasks are scheduled according to their dependency relationships
· Hybrid: time and dependency conditions are combined
2. Workflow: task status (interrupted & running), task management (type, change), task type, task sharding.
3. Scheduling strategy: ready & timeout; retry & retry count & retry timing.
4. Task isolation: the relationship between tasks and their execution.
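As a rough illustration of points 1 and 3 above, the sketch below models a hybrid trigger (a time condition plus a dependency condition) and a retry policy with a retry count and a retry delay. The `Trigger` class, `run_with_retry`, and all parameter names are hypothetical; timeouts are left out to keep the example short.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Optional, Set


@dataclass
class Trigger:
    """Hypothetical trigger: time-based, dependency-based, or both (hybrid)."""
    not_before: Optional[datetime] = None            # time condition
    upstream: Set[str] = field(default_factory=set)  # dependency condition

    def is_ready(self, now: datetime, finished: Set[str]) -> bool:
        time_ok = self.not_before is None or now >= self.not_before
        deps_ok = self.upstream <= finished  # every upstream task has finished
        return time_ok and deps_ok


def run_with_retry(action: Callable[[], None], max_retries: int,
                   retry_delay_s: float,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Run action; on failure retry up to max_retries times.

    Returns the number of attempts used, or re-raises the last error.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            action()
            return attempts
        except Exception:
            if attempts > max_retries:
                raise
            sleep(retry_delay_s)


# Hybrid trigger: not before 02:00 AND only after "extract" has finished.
hybrid = Trigger(not_before=datetime(2020, 12, 7, 2, 0), upstream={"extract"})
now = datetime(2020, 12, 7, 2, 5)
print(hybrid.is_ready(now, finished=set()))        # False: upstream not done
print(hybrid.is_ready(now, finished={"extract"}))  # True

calls = []

def flaky():
    """Fails on the first two calls, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")

print(run_with_retry(flaky, max_retries=3, retry_delay_s=0.0))  # 3
```

A pure time trigger is the same object with an empty `upstream`, and a pure dependency trigger is one with `not_before=None`.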
Task scheduling systems currently on the market include oozie, azkaban, and airflow, as well as Alibaba's TBSchedule, Tencent's Lhotse, and Dangdang's elastic-job.
They fall into two types, DAG workflow systems and timed sharding systems:
DAG workflow systems: oozie, azkaban, chronos, lhotse
Timed sharding systems: TBSchedule, elastic-job, saturn
A DAG (Directed Acyclic Graph) is a graph in which every edge has a direction and there are no cycles. For example, if task B runs after task A and task C runs after task B, then nothing downstream of A may lead back to A.
If we choose the DAG workflow approach, we need to pay attention to task timing and completion status, and make sure the trigger mechanism is rich and flexible.
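To make "directed acyclic" concrete, here is a minimal sketch that uses Python's standard `graphlib` module (Python 3.9+) to derive a valid run order from a dependency graph and to reject a graph that contains a cycle. The task names and the `has_cycle` helper are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each key maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# A topological order runs every task after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['extract', 'clean', 'aggregate', 'report']


def has_cycle(graph):
    """Return True if the dependency graph is not a valid DAG."""
    try:
        TopologicalSorter(graph).prepare()
        return False
    except CycleError:
        return True


print(has_cycle(workflow))                    # False
print(has_cycle({"a": {"b"}, "b": {"a"}}))    # True: a and b wait on each other
```

A scheduler built on this idea would refuse to register the second graph, because neither task could ever become ready.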
What is sharding? Here is an example: suppose we have 3 physical machines and 10 scheduled tasks that each run every 5 s, and every task lands on the first machine. To avoid one machine being overloaded while the others sit idle ("some die of drought, some die of flood"), we need to balance the tasks across all physical machines that are currently able to execute them. This is the so-called sharding mechanism. Common sharding methods include average allocation, hashing, and round-robin (polling); these algorithms keep the load on the physical machines even.
If we choose the sharding approach, we need to pay attention to precise, on-time triggering.
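The three sharding methods named above can be sketched for the same scenario of 10 tasks and 3 machines. The machine and task names are hypothetical, and real sharding frameworks also handle machines joining or leaving, which this sketch ignores.

```python
import zlib
from itertools import cycle

machines = ["machine-1", "machine-2", "machine-3"]
tasks = [f"task-{i}" for i in range(10)]


def average_allocation(tasks, machines):
    """Give each machine a contiguous block; block sizes differ by at most one."""
    base, extra = divmod(len(tasks), len(machines))
    plan, start = {}, 0
    for i, machine in enumerate(machines):
        size = base + (1 if i < extra else 0)
        plan[machine] = tasks[start:start + size]
        start += size
    return plan


def hash_allocation(tasks, machines):
    """Pick the machine from a stable hash of the task name."""
    plan = {machine: [] for machine in machines}
    for task in tasks:
        index = zlib.crc32(task.encode("utf-8")) % len(machines)
        plan[machines[index]].append(task)
    return plan


def round_robin(tasks, machines):
    """Hand tasks out one at a time, cycling through the machines."""
    plan = {machine: [] for machine in machines}
    for task, machine in zip(tasks, cycle(machines)):
        plan[machine].append(task)
    return plan


for strategy in (average_allocation, hash_allocation, round_robin):
    plan = strategy(tasks, machines)
    print(strategy.__name__, {m: len(ts) for m, ts in plan.items()})
```

Average allocation and round-robin both give a 4 / 3 / 3 split here. Hashing does not guarantee an even split for a small number of tasks, but a task always maps to the same machine as long as the machine list does not change.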
03 Introduction to data scheduling products
A simple offline data migration job is usually a shell script run on a schedule by crontab. But as the number of jobs and their complexity grow, coordinating and monitoring them becomes a hassle, so we turn to tools for scheduling and monitoring.
3.1 DAG workflow systems
oozie is an open source workflow scheduling engine for the Hadoop platform, used to manage Hadoop jobs. It is a web application made up of two components, the oozie client and the oozie server. Oozie can chain several MR (MapReduce) jobs and other actions into one workflow: it can run MR1, then a Java program, then a shell script, then a Hive script, then a Pig script, and finally MR2. With oozie, if an earlier task fails, the later tasks are not scheduled.
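The "stop scheduling after a failure" behaviour described above can be illustrated with a plain Python sketch. This is not Oozie's API (Oozie workflows are defined in XML); `run_chain` and the step functions are hypothetical stand-ins for the MR1 → Java → shell → Hive → Pig → MR2 chain.

```python
def run_chain(steps):
    """Run (name, action) pairs in order; skip everything after a failure."""
    results = {}
    failed = False
    for name, action in steps:
        if failed:
            results[name] = "SKIPPED"  # upstream failed, so never scheduled
            continue
        try:
            action()
            results[name] = "OK"
        except Exception:
            results[name] = "FAILED"
            failed = True
    return results


def ok():
    """Stand-in for a step that succeeds."""


def boom():
    """Stand-in for a step that fails."""
    raise RuntimeError("hive script failed")


steps = [("MR1", ok), ("java", ok), ("shell", ok),
         ("hive", boom), ("pig", ok), ("MR2", ok)]
results = run_chain(steps)
print(results)
```

In this run the first three steps succeed, the Hive step fails, and the Pig and MR2 steps are never started.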
azkaban is a batch workflow task scheduler open-sourced by LinkedIn. It runs a set of jobs and processes in a specific order within a workflow. Azkaban defines a KV (key-value) file format for declaring the dependencies between tasks.
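To show how a key-value file format can express task dependencies, the sketch below parses `key=value` job descriptions into a dependency map. The keys used here (`type`, `command`, `dependencies`) follow the style of Azkaban's job property files but should be treated as an assumption for illustration; check the Azkaban documentation for the exact format. The job names and the helper functions are hypothetical.

```python
def parse_job(text):
    """Parse key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def dependency_map(job_files):
    """Map each job name to the list of job names it depends on."""
    deps = {}
    for name, text in job_files.items():
        raw = parse_job(text).get("dependencies", "")
        deps[name] = [d.strip() for d in raw.split(",") if d.strip()]
    return deps


# Hypothetical job files, keyed by job name.
job_files = {
    "extract": "type=command\ncommand=echo extract\n",
    "load": "type=command\ncommand=echo load\ndependencies=extract\n",
    "report": "# final step\ntype=command\ncommand=echo report\n"
              "dependencies=extract, load\n",
}

print(dependency_map(job_files))
```

The resulting map is exactly the kind of dependency graph that a DAG scheduler orders before running anything.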
chronos is an open source product launched by Airbnb to replace crontab. Users can use it to orchestrate jobs; it supports Mesos as the job executor and can interact with Hadoop. chronos can also define triggers that fire after a job finishes, and it supports dependency chains of any length.
3.2 Sharding systems
TBSchedule: Taobao's open source distributed scheduling framework, implemented in Java on top of Zookeeper. It lets a batch of tasks, or a constantly changing set of tasks, be dynamically assigned to different thread groups in the JVMs of multiple hosts and run in parallel, so that all tasks are processed quickly, without duplication and without omission.
elastic-job: an elastic distributed task scheduling system developed by Dangdang. It uses zookeeper for distributed coordination, provides high availability and task sharding, and supports cloud development.
saturn: a distributed scheduling platform for timed tasks developed in-house by Vipshop. It is based on version 1 of Dangdang's elastic-job and can be deployed in docker containers.
All right, that was another brain-burning day.
I still remember when the lovely Wen Yu first told me about crontab, and I thought it was a big deal.
But seen against the whole landscape of scheduling products and technical frameworks, crontab is just the starter village in Zelda.
So you will go through the tears of hunting for a cold-resistant outfit, and you will also feel the joy of getting a flying machine.
To define is to limit.
That's my bit of sentiment for today. I'm Zhao Zhuangshi, and I'll see you next week in "Solid learning data technology 06"!
A data man's private plot is a big family that helps data people grow. It helps anyone interested in data find a learning direction and improve their skills in a targeted way. Follow me and explore the mysteries of data.
1. Reply "Data products" to get <Interview questions for big data products>
2. Reply "Data center" to get <Data middle office materials from big companies>
3. Reply "Business analysis" to get <Interview questions for business analysis at big companies>
4. Reply "Make a friend" to join the discussion group and get to know more data partners.