
Solid Learning Data Technology 05: Data Scheduling

2020-12-07 10:53:18 osc_ qcm2mqmy

Zhao Zhuangshi / A data man's private plot
Hello everyone, I'm Zhao Zhuangshi.

It's a pleasure to meet you again on Saturday morning. In the previous installment, "Solid Learning Data Technology 04: ETL", we talked about data warehouse development. Today we pick up from there and talk about a link that is essential to data processing and report production: data scheduling.

01 What is data scheduling?

In data development, when we say data scheduling we usually mean "task scheduling" or "job scheduling". So let's start with two concepts: job and task.

The relationship between job and task differs from one context to another:

Spark context

In Spark, a task is the smallest unit of execution that a job is cut into. In general, an RDD has as many tasks as it has partitions, because each task processes the data of just one partition. Tasks are grouped into batches called stages. Spark sets up dependencies between stages and between tasks to guarantee that the whole job runs correctly and completely; when the last ResultTask finishes, the job has run successfully.

job > stage > task

Hadoop context

In Hadoop, a unit of work is called a Job. A Job is divided into a Map Task stage and a Reduce Task stage. Every Task runs in its own process, and when the Task ends, the process ends with it.

job > task

Scheduling product context

Task: a task.

TaskType: the task type, such as ETL, MR job or Simple.

Job: a job, that is, one execution of a task at run time.

In summary, job and task relate to each other differently in different contexts, so pay attention to how each data scheduling product uses the two terms.
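The scheduling-product meaning of the two terms can be sketched in a few lines of Python. This is only an illustrative model, not any product's real data model: the class names, the `launch` helper, the in-memory id counter and the task name `daily_orders_etl` are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from itertools import count


class TaskType(Enum):
    ETL = "etl"
    MR_JOB = "mr_job"
    SIMPLE = "simple"


@dataclass(frozen=True)
class Task:
    """A static task definition: what should be run."""
    name: str
    task_type: TaskType


_job_ids = count(1)  # hypothetical in-memory id generator


@dataclass
class Job:
    """One execution of a task: a single run with its own id and start time."""
    task: Task
    started_at: datetime
    job_id: int = field(default_factory=lambda: next(_job_ids))


def launch(task: Task, now: datetime) -> Job:
    """Create a new Job (run instance) for the given Task."""
    return Job(task=task, started_at=now)


daily_etl = Task(name="daily_orders_etl", task_type=TaskType.ETL)
run_1 = launch(daily_etl, datetime(2020, 12, 5, 1, 0))
run_2 = launch(daily_etl, datetime(2020, 12, 6, 1, 0))
# run_1 and run_2 are two distinct jobs that share one task definition.
```

The point of the split is that one Task can have many Jobs over time, so run history, retries and status belong to the Job rather than to the Task.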

To sum up, data scheduling decides when a task runs and when it ends, and correctly handles the dependencies between tasks. The first thing to focus on is starting the right job at the right time, making sure jobs are executed promptly and accurately according to the correct dependencies.

02 What modules does a data scheduling product contain?

When designing a scheduling product, we need to be clear about a few questions:

1. Trigger mechanism: time, dependency, hybrid

· Time: tasks are scheduled by time (year / month / day / hour / minute / second / millisecond)

· Dependency: tasks are scheduled according to their dependency relationships

· Hybrid: the two are combined

2. Workflow: task status (interrupted & running), task management (type, change), task type, task sharding.

3. Scheduling strategy: readiness & timeout; retry, retry count & retry timing.

4. Task isolation: the relationship between a task and its execution.
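As a rough illustration of the hybrid trigger from item 1, the check below treats a task as ready only when its scheduled time has arrived and every upstream task has succeeded. It is a minimal sketch under assumptions: the `State` enum, the `is_ready` function and the task names are hypothetical, not taken from any scheduling product.

```python
from datetime import datetime
from enum import Enum


class State(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"


def is_ready(scheduled_at, upstream, states, now):
    """Hybrid trigger: the time has arrived AND every upstream task succeeded."""
    time_ok = now >= scheduled_at
    deps_ok = all(states.get(name) is State.SUCCESS for name in upstream)
    return time_ok and deps_ok


states = {"extract": State.SUCCESS, "clean": State.PENDING}
due = datetime(2020, 12, 7, 2, 0)
now = datetime(2020, 12, 7, 2, 5)

# No dependencies: behaves like a pure time trigger.
time_only = is_ready(due, [], states, now)
# Time has arrived, but "clean" has not finished yet.
blocked = is_ready(due, ["extract", "clean"], states, now)
states["clean"] = State.SUCCESS
# Same check once the upstream task has succeeded.
unblocked = is_ready(due, ["extract", "clean"], states, now)
```

A pure dependency trigger would be the same function with the time condition dropped.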

At present, the task scheduling systems on the market include oozie, azkaban, airflow and others, as well as Alibaba's TBSchedule, Tencent's Lhotse and Dangdang's elastic-job.

We can divide them into two types, DAG workflow systems and timed sharding systems:

One type is DAG workflow systems: oozie, azkaban, chronos, lhotse

The other is sharding systems: TBSchedule, elastic-job, saturn

Here, DAG (Directed Acyclic Graph) means a graph in which every edge has a direction and there are no cycles. From one "soul painter's" drawing, we can get a feel for what "directed acyclic" means.

[Figure: hand-drawn illustration of a directed acyclic graph]

If we choose the DAG workflow approach, we have to pay attention to timing and completion status, and ensure a rich and flexible trigger mechanism.
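To make "directed acyclic" concrete, here is a small sketch of how a scheduler could turn task dependencies into a run order and reject a graph that contains a cycle. It uses Kahn's algorithm, which is one common choice and not necessarily what any of the products above use; the function name and the task names are assumptions for the example.

```python
from collections import deque


def topological_order(deps):
    """Return tasks in an order that respects deps, or raise on a cycle.

    deps maps each task to the list of tasks it depends on.
    """
    tasks = set(deps)
    for upstream in deps.values():
        tasks.update(upstream)

    # Number of unfinished upstream tasks for every task.
    remaining = {t: len(set(deps.get(t, []))) for t in tasks}
    downstream = {t: [] for t in tasks}
    for task, upstream in deps.items():
        for u in set(upstream):
            downstream[u].append(task)

    # Start with tasks that depend on nothing (sorted for a stable order).
    ready = deque(sorted(t for t, n in remaining.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for d in sorted(downstream[task]):
            remaining[d] -= 1
            if remaining[d] == 0:
                ready.append(d)

    # Any task left unscheduled is part of (or behind) a cycle.
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected; not a DAG")
    return order


deps = {"report": ["aggregate"], "aggregate": ["clean"], "clean": ["extract"]}
order = topological_order(deps)
```

If two tasks depend on each other, no task in that loop ever reaches zero remaining dependencies, which is why the length check at the end catches the cycle.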

What is sharding? Here is an example: suppose we have 3 physical machines and 10 scheduled tasks that each run every 5s. We cannot let every task land on the first machine. To avoid "one field parched while another is flooded", we need to balance the tasks across all currently available physical machines. This is the so-called sharding mechanism. Common sharding mechanisms include average allocation, hashing and polling (round-robin); a variety of algorithms are used to keep the load on the physical machines even.
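The 3-machine, 10-task example can be sketched with the three mechanisms just named. This is a simplified illustration, not the algorithm of any specific framework: the function names and the machine and task names are assumptions, and the hash variant uses CRC32 only because it is stable across runs.

```python
import zlib


def round_robin(tasks, machines):
    """Polling: hand tasks to machines in turn."""
    plan = {m: [] for m in machines}
    for i, task in enumerate(tasks):
        plan[machines[i % len(machines)]].append(task)
    return plan


def average_allocation(tasks, machines):
    """Average allocation: contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(tasks), len(machines))
    plan, start = {}, 0
    for i, m in enumerate(machines):
        size = base + (1 if i < extra else 0)
        plan[m] = tasks[start:start + size]
        start += size
    return plan


def hash_allocation(tasks, machines):
    """Hash: a stable hash of the task name picks the machine."""
    plan = {m: [] for m in machines}
    for task in tasks:
        idx = zlib.crc32(task.encode("utf-8")) % len(machines)
        plan[machines[idx]].append(task)
    return plan


machines = ["machine-1", "machine-2", "machine-3"]
tasks = [f"task-{i}" for i in range(1, 11)]
rr = round_robin(tasks, machines)
avg = average_allocation(tasks, machines)
hashed = hash_allocation(tasks, machines)
```

With 10 tasks and 3 machines, round-robin and average allocation both give a 4 / 3 / 3 split. Hashing keeps a task on the same machine from run to run but does not guarantee an even split, so it trades balance for stability.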

If we choose the sharding approach, we should pay attention to accurate and punctual triggering.

03 Introduction to data scheduling products

For simple offline data migration jobs, we usually run shell scripts on a schedule via crontab. But as the number and complexity of jobs grow, coordinating work and monitoring tasks become a hassle, so we choose to use tools for scheduling and monitoring.

3.1 DAG workflow systems

oozie

Oozie is an open-source workflow scheduling engine for the Hadoop platform that manages Hadoop jobs. Oozie is a web application made up of two components, the Oozie client and the Oozie server. Oozie can configure multiple MR (MapReduce) jobs and other actions into a workflow: it can run MR1, then a Java program, then a shell script, then a Hive script, then a Pig script, and finally MR2. With Oozie, if an earlier task fails, the later tasks will not be scheduled.
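The "earlier failure stops later tasks" behavior can be modeled in plain Python. This is not Oozie's API (Oozie workflows are defined in XML); it is only a sketch of the control flow, with a hypothetical `run_chain` function and lambdas standing in for the real actions.

```python
def run_chain(steps):
    """Run (name, action) steps in order; stop at the first failure.

    Each action is a zero-argument callable returning True on success.
    Returns the names of the steps that ran and the names that were skipped.
    """
    ran, skipped = [], []
    failed = False
    for name, action in steps:
        if failed:
            skipped.append(name)  # never scheduled after a failure
            continue
        ran.append(name)
        try:
            ok = action()
        except Exception:
            ok = False  # an exception counts as a failed step
        if not ok:
            failed = True
    return ran, skipped


steps = [
    ("mr1", lambda: True),
    ("java", lambda: True),
    ("shell", lambda: False),  # simulated failure
    ("hive", lambda: True),
    ("pig", lambda: True),
    ("mr2", lambda: True),
]
ran, skipped = run_chain(steps)
```

Here the shell step fails, so the Hive, Pig and MR2 steps are reported as skipped rather than run.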

azkaban

Azkaban is a batch workflow task scheduler open-sourced by LinkedIn. It is used to run a set of jobs and processes in a specific order within a workflow. Azkaban defines a KV (key-value) file format for building the dependencies between tasks.
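To give a feel for how key-value files can describe dependencies, the sketch below parses a few job definitions modeled on Azkaban's classic `.job` properties files (keys such as `type`, `command` and `dependencies`). The parser is an illustration written for this article, not Azkaban's own loader, and the job names and commands are hypothetical.

```python
def parse_job(text):
    """Parse key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def dependency_map(job_files):
    """Map each job name to the jobs listed under its `dependencies` key."""
    deps = {}
    for name, text in job_files.items():
        raw = parse_job(text).get("dependencies", "")
        deps[name] = [d.strip() for d in raw.split(",") if d.strip()]
    return deps


# Job name -> file content, as if read from extract.job, clean.job, report.job.
job_files = {
    "extract": "type=command\ncommand=echo extract",
    "clean": "type=command\ncommand=echo clean\ndependencies=extract",
    "report": "type=command\ncommand=echo report\ndependencies=extract,clean",
}
deps = dependency_map(job_files)
```

The resulting mapping is exactly the kind of dependency graph a DAG scheduler needs in order to decide what to run next.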


chronos

Chronos is an open-source product launched by Airbnb to replace crontab. Users can use it to orchestrate jobs; it supports Mesos as the job executor and can interact with Hadoop. Chronos can also define triggers that fire after a job finishes, and it supports dependency chains of any length.


3.2 Sharding systems

TBSchedule: TBSchedule is Taobao's open-source distributed scheduling framework, implemented in Java on top of Zookeeper. It allows batch tasks, or tasks that keep changing, to be dynamically assigned to different thread groups in the JVMs of multiple hosts and run in parallel, so that all tasks are processed quickly, without duplication and without omission.

elastic-job: an elastic distributed task scheduling system developed by Dangdang. It uses zookeeper for distributed coordination, provides high availability and task sharding, and can support cloud development.

saturn: a distributed scheduling platform for timed tasks developed in-house by Vipshop. It is based on version 1 of Dangdang's elastic-job and can be deployed in docker containers.


All right, today was another brain-burning day.

I still remember when the lovely Wen Yu first told me about crontab; I thought it was a big deal.

But now, seen against the whole landscape of scheduling products and technical frameworks, crontab is just the beginner village in Zelda.

So you will experience the tears of hunting for cold-weather gear, and you will also experience the joy of getting an aircraft.

To define is to limit.


That's my bit of sentiment for today. I'm Zhao Zhuangshi, and we'll see each other next week in "Solid Learning Data Technology 06"!


A data man's private plot is a big family that helps data people grow, helping partners who are interested in data find their learning direction and sharpen their skills with precision. Follow me, and I'll take you to explore the mysteries of data.


Copyright notice
This article was created by [osc_ qcm2mqmy]. Please include the original link when reposting. Thank you.
https://chowdera.com/2020/12/20201207105113704b.html