Achieving a high SLO on large-scale Kubernetes clusters

2020-11-06 20:27:12 Alibaba cloud native

Authors | Yao Jinghua, technical expert at Ant Financial; Fan Kang, senior development engineer at Ant Financial

Introduction: As Kubernetes clusters grow in size and complexity, it becomes increasingly difficult for a cluster to deliver pods efficiently and with low latency. This article shares Ant Financial's approach to, and experience with, designing a high-SLO architecture and achieving a high SLO.

Why SLO?

1.png

Gartner defines the SLO as follows: within the framework of an SLA, the SLO is the target the system must meet, so that callers succeed as often as possible. SLI, SLO and SLA are easy to confuse, so let's first look at how the three relate:

  • An SLI defines an indicator that describes how well a service meets a "good" standard, for example "the pod is delivered within 1 min". SLIs are usually formulated in terms of latency, availability, throughput and success rate.

  • An SLO defines a target: the proportion of an SLI that meets the good standard over a period of time. For example, 99% of pods are delivered within 1 min. Once a service publishes its SLO, users have an expectation of the service's quality.

  • An SLA is an agreement derived from the SLO. It usually specifies how much the service provider must pay when the target ratio defined by the SLO is not met. An SLA is generally a legally binding written contract, typically between a service provider and external customers (for example, Alibaba Cloud and its users). When an SLO between internal services is broken, the consequence is usually not financial compensation but a matter of accountability.

Therefore, what we focus on inside the system is the SLO.
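To make the distinction concrete, here is a minimal sketch of how the example SLI ("a pod is delivered within 1 min") and the example SLO ("99% of pods are delivered within 1 min") could be evaluated. The record type, the function names and the sample data are assumptions made for illustration and are not taken from Ant's system.

```python
from dataclasses import dataclass


@dataclass
class PodDelivery:
    name: str
    seconds_to_deliver: float  # time from the creation request to delivery


def sli_good(delivery: PodDelivery, threshold_s: float = 60.0) -> bool:
    """SLI: did this single delivery meet the 'good' standard (within 1 min)?"""
    return delivery.seconds_to_deliver <= threshold_s


def evaluate_slo(deliveries, target: float = 0.99, threshold_s: float = 60.0):
    """SLO: does the share of good deliveries reach the target ratio?

    Returns (met, ratio). An empty window is treated as met, which is an
    assumption of this sketch.
    """
    if not deliveries:
        return True, 1.0
    good = sum(1 for d in deliveries if sli_good(d, threshold_s))
    ratio = good / len(deliveries)
    return ratio >= target, ratio


# Hypothetical window: 98 fast deliveries and 2 slow ones.
window = [PodDelivery(f"fast-{i}", 30.0) for i in range(98)]
window += [PodDelivery("slow-0", 120.0), PodDelivery("slow-1", 95.0)]
met, ratio = evaluate_slo(window)
print(f"ratio={ratio:.2f} met={met}")
```

An SLA would sit on top of this result as a contractual consequence and is not something the code itself can express.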

What do we care about in a large K8s cluster?

2.png

As production environments keep growing, K8s clusters become more complex and larger. How to guarantee the availability of K8s clusters at large scale is a problem facing many vendors. For a K8s cluster, we usually care about the following questions:

  • The first question is whether the cluster is healthy: are all components working properly, and how many pod creations in the cluster have failed? This is a question of overall indicators.

  • The second question is what is happening in the cluster: is there an anomaly, and what are users doing in the cluster? This is a question of tracing capability.

  • The third question is, when an anomaly occurs, which component caused the success rate to drop? This is a question of root-cause location.

So how do we answer these questions?

  • First, we need to define a set of SLOs that describe the availability of the cluster.

  • Next, we must be able to trace the pod lifecycle. For failed pods, we also need to analyze the cause of failure so that the abnormal component can be located quickly.

  • Finally, we need to optimize and eliminate the anomalies in the cluster.

SLIs on a large K8s cluster

Let's first look at some cluster-level indicators.

3.png

  • The first indicator is cluster health. It takes one of three values: Healthy, Warning or Fatal. Warning and Fatal are tied to the alerting system: for example, if a P2 alarm fires the cluster is Warning, and if a P0 alarm fires the cluster is Fatal and must be dealt with.

  • The second indicator is the success rate, meaning the pod creation success rate. This is a very important indicator: Ant creates pods on the order of millions per week, so a fluctuation in the success rate translates into a large number of failed pods, and a drop in the pod success rate is the most direct reflection of a cluster anomaly.

  • The third indicator is the number of residual Terminating pods. Why not use the deletion success rate? Because at the scale of millions, even a 99.9% deletion success rate leaves Terminating pods in the thousands. That many residual pods occupy application capacity, which is unacceptable in production.

  • The fourth indicator is the service online rate. It is measured with a probe; if the probe fails, the cluster is unavailable. The service online rate is designed around the master components.

  • The last indicator is the number of faulty machines, an indicator at the node dimension. A faulty machine is usually a physical machine that cannot deliver pods correctly, perhaps because its disk is full or its load is too high. Faulty machines must be "found quickly, isolated quickly and repaired promptly", since they reduce cluster capacity.
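Two of these indicators are easy to sketch. The first function maps active alarm levels to the health value using only the two examples given above (P2 means Warning, P0 means Fatal); how other alarm levels are treated is not stated in the article, so this sketch simply ignores them. The second function reproduces the arithmetic behind the residual-pod argument. All names are illustrative.

```python
def cluster_health(active_alarm_levels) -> str:
    """Derive Healthy/Warning/Fatal from the alarm levels currently firing.

    Only P0 and P2 are covered by the article; any other level is ignored
    here, which is an assumption of this sketch.
    """
    levels = set(active_alarm_levels)
    if "P0" in levels:
        return "Fatal"
    if "P2" in levels:
        return "Warning"
    return "Healthy"


def expected_residual_pods(deletions: int, deletion_success_rate: float) -> int:
    """Number of pods expected to stay in Terminating for a given success rate."""
    return round(deletions * (1.0 - deletion_success_rate))


print(cluster_health(["P2"]))                    # Warning
print(cluster_health(["P2", "P0"]))              # Fatal
print(expected_residual_pods(1_000_000, 0.999))  # 1000
```

With a million deletions, a 99.9% success rate still leaves about a thousand pods behind, which is why the absolute count is tracked rather than the rate.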

Success criteria and failure-reason classification

With cluster indicators in place, we need to refine them and define the criteria for success.

4.png

First, the pod creation success rate. We divide pods into ordinary pods and Job-class pods. An ordinary pod has RestartPolicy Always; a Job-class pod has RestartPolicy Never or OnFailure. Both have a delivery deadline, for example delivery within 1 min. The delivery standard for an ordinary pod is that the pod is Ready within 1 min; for a Job-class pod, it is that the pod has reached the Running, Succeeded or Failed state within 1 min. The creation time excludes the execution time of the PostStartHook.

For pod deletion, the success criterion is that the pod is deleted from etcd within the prescribed time. The deletion time excludes the PreStopHookPeriod.
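The creation and deletion criteria can be written down as two small predicates. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the inputs are plain values rather than Kubernetes API objects, and the deletion deadline is left as a required argument because the article only says "within the prescribed time".

```python
JOB_DELIVERED_PHASES = {"Running", "Succeeded", "Failed"}


def creation_delivered(restart_policy: str, phase: str, ready: bool,
                       elapsed_s: float, post_start_hook_s: float = 0.0,
                       deadline_s: float = 60.0) -> bool:
    """Apply the delivery standard for ordinary and Job-class pods."""
    # The PostStartHook execution time does not count against the deadline.
    effective_s = elapsed_s - post_start_hook_s
    if effective_s > deadline_s:
        return False
    if restart_policy == "Always":            # ordinary pod: must be Ready
        return ready
    if restart_policy in ("Never", "OnFailure"):  # Job-class pod
        return phase in JOB_DELIVERED_PHASES
    raise ValueError(f"unknown restart policy: {restart_policy}")


def deletion_succeeded(gone_from_etcd: bool, elapsed_s: float,
                       pre_stop_hook_s: float, deadline_s: float) -> bool:
    """The pod must be gone from etcd in time, excluding the pre-stop period."""
    return gone_from_etcd and (elapsed_s - pre_stop_hook_s) <= deadline_s


# An ordinary pod that took 70 s, 15 s of which were spent in PostStartHook.
print(creation_delivered("Always", "Running", True, 70.0, post_start_hook_s=15.0))
```

In this example the pod counts as delivered because only 55 s are attributed to the platform.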

Faulty machines must be found, isolated and degraded as quickly as possible. For example, if a physical machine's disk becomes read-only, tainting must be completed within 1 min. The recovery time of a faulty machine is set according to the cause of the fault. For example, if a system fault requires reinstalling the operating system, the recovery time will be longer.

With these criteria in place, we also sorted out the reasons why pods fail. Some failures are caused by the system, and those are the ones we need to care about; others are caused by users, and we do not need to care about those.

For example, RuntimeError is a system error: something went wrong in the underlying runtime. ImagePullFailed means the kubelet failed to download the image; because Ant has a webhook that validates image access, image download failures are generally caused by the system.

Failures with user causes cannot be solved on the system side. We only expose these failure reasons to users through a query interface and let users resolve them. ContainerCrashLoopBackOff, for example, is usually caused by the user's container exiting.
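A minimal sketch of this classification follows. The three reason names come from the text above; the lookup tables, the "unknown" bucket and the function names are assumptions made for illustration.

```python
from collections import Counter

# Reasons named in the article, grouped by who has to act on them.
SYSTEM_REASONS = {"RuntimeError", "ImagePullFailed"}
USER_REASONS = {"ContainerCrashLoopBackOff"}


def classify_failure(reason: str) -> str:
    """Return 'system', 'user', or 'unknown' for a pod failure reason."""
    if reason in SYSTEM_REASONS:
        return "system"
    if reason in USER_REASONS:
        return "user"
    return "unknown"  # not covered by the article; needs manual triage


def summarize(failure_reasons) -> dict:
    """Count failures per category, e.g. for a report."""
    return dict(Counter(classify_failure(r) for r in failure_reasons))


print(summarize(["RuntimeError", "ContainerCrashLoopBackOff", "ImagePullFailed"]))
```

Failures in the "system" bucket go to the platform team, while those in the "user" bucket are only surfaced through the query interface.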

The infrastructure

5.png

Around the SLO goals we built a complete system. On the one hand it shows end users and operations staff the current cluster indicators; on the other hand its components work together, analyzing the current cluster state to identify the factors affecting the SLO, which provides data to support improving the cluster's pod delivery success rate.

Looking from the top down, the top-level components are oriented toward indicator data such as cluster health; pod creation, deletion and upgrade success rates; the number of residual pods; and the number of unhealthy nodes. Among them, the Display Board is what we commonly call the monitoring dashboard.

We also built an alerting subsystem, Alert, which supports flexible configuration. For each indicator, alerts can be configured on the percentage drop or the absolute drop of the indicator, with multiple notification channels such as phone, SMS and email.
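The two trigger modes described here, percentage drop and absolute drop, might be modeled as follows. The rule structure, field names and thresholds are hypothetical; the article does not describe the actual configuration format of the Alert subsystem.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AlertRule:
    metric: str
    max_drop_pct: Optional[float] = None   # fire if the metric drops by this % or more
    max_drop_abs: Optional[float] = None   # fire if the metric drops by this much or more
    channels: Tuple[str, ...] = ("mail",)  # e.g. phone, sms, mail


def should_fire(rule: AlertRule, previous: float, current: float) -> bool:
    """Evaluate one rule against two consecutive samples of the metric."""
    drop = previous - current
    if drop <= 0:
        return False  # the metric did not fall
    if rule.max_drop_abs is not None and drop >= rule.max_drop_abs:
        return True
    if rule.max_drop_pct is not None and previous > 0:
        return drop / previous * 100.0 >= rule.max_drop_pct
    return False


rule = AlertRule("pod_create_success_rate", max_drop_pct=5.0, channels=("phone", "sms"))
print(should_fire(rule, previous=0.99, current=0.90))  # True: roughly a 9% drop
print(should_fire(rule, previous=0.99, current=0.98))  # False: roughly a 1% drop
```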

The Analysis System analyzes the historical indicator data together with the collected node metrics and master component metrics to produce a more detailed report on cluster operation. Within it:

  • The Weekly Report subsystem gives the cluster's pod creation, deletion and upgrade statistics for the current week, along with a summary of the reasons for failed cases.

  • Terminating Pods Number gives the list of pods newly added over a period of time that could not be deleted through the K8s mechanism, together with the reasons they remain.

  • Unhealthy Nodes gives, for a period, the total available time of all nodes in the cluster, the available time of each node, the operations records, and the list of nodes that could not recover automatically and need manual intervention.

To support these functions we developed the Trace System, which analyzes and displays the specific reasons why an individual pod failed to be created, deleted or upgraded. It consists of three modules: log and event collection, data analysis, and pod lifecycle display:

  • The log and event collection module collects the run logs of each master component and node component as well as pod and node events, and stores the logs and events indexed by pod and by node respectively.

  • The data analysis module analyzes and reconstructs the time spent in each stage of the pod lifecycle, and determines the reasons for pod failure and node unavailability.

  • Finally, the Report module exposes an interface and UI that show end users the pod lifecycle and the cause of errors.

The trace system

Next, taking a failed pod creation as an example, let's walk through the workflow of the tracing system.

6.png

After the user enters a pod UID, the tracing system uses the pod index to find the lifecycle analysis record for that pod and the verdict on whether delivery succeeded. The data in storage does more than provide basic data to end users. More importantly, the pod lifecycles in the cluster can be used to analyze the operating status of the cluster and of each node over a period. For example, if too many pods in the cluster are scheduled onto hot nodes, the delivery of different pods causes resource contention on the node, the node's load becomes too high and its delivery capability declines, which ultimately shows up as pod delivery timeouts on that node.

As another example, historical statistics yield a baseline for each stage of the pod lifecycle. Using that baseline as the yardstick, we compare the average time and time distribution of different versions of a component and make suggestions for improving the component. In addition, from the proportion of the overall pod lifecycle that each component is responsible for, we identify the steps that take the most time, providing data to support subsequent optimization of pod delivery time.
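The two analyses in this paragraph, time share per stage and comparison against a baseline, can be sketched as below. The stage names, the baseline values and the 1.5x tolerance are all assumptions chosen for illustration; the article does not list the actual stages or thresholds.

```python
def stage_shares(durations: dict) -> dict:
    """Fraction of the total lifecycle time spent in each stage."""
    total = sum(durations.values())
    if total == 0:
        return {stage: 0.0 for stage in durations}
    return {stage: seconds / total for stage, seconds in durations.items()}


def slower_than_baseline(durations: dict, baseline: dict,
                         tolerance: float = 1.5) -> list:
    """Stages whose duration exceeds the baseline by more than the tolerance factor."""
    return sorted(
        stage for stage, seconds in durations.items()
        if stage in baseline and seconds > baseline[stage] * tolerance
    )


# Hypothetical per-stage durations (seconds) for one pod and a historical baseline.
observed = {"scheduling": 2.0, "image_pull": 40.0, "container_start": 8.0}
baseline = {"scheduling": 2.0, "image_pull": 10.0, "container_start": 8.0}

shares = stage_shares(observed)
print(max(shares, key=shares.get))               # stage taking the largest share
print(slower_than_baseline(observed, baseline))  # stages regressing against the baseline
```

Running the same comparison per component version is one way to produce the improvement suggestions mentioned above.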

Node Metrics

7.png

A healthy cluster requires not only highly available master components; node stability must not be ignored either.

If pod creation is compared to an RPC call, each node is an RPC service provider, and the total capacity of the cluster equals the sum of the pod creation requests each node can handle. Every additional unavailable node represents both a decline in the cluster's delivery capability and a reduction in its available resources, so node availability in the cluster must be kept as high as possible. Every failed pod delivery, deletion or upgrade means higher cost and a worse experience for users, so cluster nodes must stay healthy for the pods scheduled onto them to be delivered successfully.

In other words, node anomalies must not only be detected as early as possible but also repaired as early as possible. By analyzing the role each component plays on the pod delivery path, we added metrics for the various components and converted host running state into metrics. Once these are collected into a database and combined with the pod delivery results on each node, a model can be built to predict node availability, determine whether a node has an unrecoverable anomaly, and adjust the node's weight in the scheduler accordingly, thereby raising the pod delivery success rate.

When pod creation or upgrade fails, the user can solve the problem by retrying. When pod deletion fails, however, even though K8s is built around the concept of a desired end state and components keep retrying, dirty data can still remain, for example a pod that has been deleted from etcd while leftovers remain on the node. We designed and implemented a patrol (inspection) system: it queries the apiserver for the pods scheduled to the current node, compares that against what is on the node, finds residual processes, containers, volume directories, cgroups, network devices and so on, and tries to release the residual resources by other means.
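The comparison step of such a patrol can be sketched as a set difference. This is an illustration only: the inputs are plain Python collections, whereas a real patrol would obtain the expected pods from the apiserver and the local resources from the container runtime and the filesystem. The function name and the sample UIDs are hypothetical.

```python
def find_leftovers(expected_pod_uids, resources_on_node: dict) -> dict:
    """Return, per resource kind, the pod UIDs that own resources on the node
    but are no longer scheduled to it according to the apiserver."""
    expected = set(expected_pod_uids)
    leftovers = {}
    for kind, owner_uids in resources_on_node.items():
        stale = sorted(set(owner_uids) - expected)
        if stale:
            leftovers[kind] = stale
    return leftovers


# Pods the apiserver reports for this node (hypothetical UIDs).
expected = {"uid-1", "uid-2"}
# What is actually found on the node, keyed by resource kind.
on_node = {
    "containers": {"uid-1", "uid-9"},
    "volumes": {"uid-2"},
    "cgroups": {"uid-7", "uid-9"},
}
print(find_leftovers(expected, on_node))
```

Everything in the result is a candidate for cleanup, since its owning pod no longer exists from the cluster's point of view.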

Unhealthy nodes

Next, we describe how faulty machines are handled.

8.png

The data for diagnosing a faulty machine comes from many sources, mainly:

  • Node monitoring metrics, for example a certain type of volume failing to mount

  • NPD (Node Problem Detector), a framework from the community

  • The Trace System, for example pod creation on a node persistently reporting image download failures

  • SLO data, for example a large number of residual pods on a single machine

We developed a number of controllers that inspect for these kinds of faults and produce a list of faulty machines. A faulty machine can have several faults. Different operations are applied depending on the fault; the main ones are: applying a taint to prevent pods from being scheduled onto the node, lowering the node's priority, and recovering it directly through automatic handling. Some special causes, such as a full disk, require manual intervention.

The faulty machine system produces a daily report showing what it did that day. Developers can keep adding controllers and handling rules to improve the whole fault-handling system.
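The rule-driven handling described here could look roughly like the sketch below. Only two mappings follow from the article (a read-only disk leads to a taint, a full disk needs manual intervention); the other fault names and their actions are assumptions added so that the table has more than two rows. Unknown faults fall back to manual intervention, which is also an assumption.

```python
# Fault -> handling action. Entries marked "assumed" are not from the article.
FAULT_ACTIONS = {
    "disk_read_only": "taint",                # article: taint within 1 min
    "disk_full": "manual_intervention",       # article: needs manual intervention
    "high_load": "lower_priority",            # assumed mapping
    "residual_pods": "auto_recover",          # assumed mapping
}


def plan_actions(faults_by_node: dict) -> dict:
    """Build the action plan for every node that has at least one fault.

    A node can have several faults, so each node maps to a sorted list of
    distinct actions.
    """
    plan = {}
    for node, faults in faults_by_node.items():
        if faults:
            actions = {FAULT_ACTIONS.get(f, "manual_intervention") for f in faults}
            plan[node] = sorted(actions)
    return plan


detected = {
    "node-a": ["disk_read_only", "high_load"],
    "node-b": [],
    "node-c": ["disk_full"],
}
print(plan_actions(detected))
```

Adding a new controller then amounts to emitting a new fault name and registering its action in the table.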

Tips for raising the SLO

Next, let's share some ways to reach a high SLO.

9.png

  • First, in the course of raising the success rate, the biggest problem we faced was image downloads. Pods must be delivered within the specified time, and image downloads usually take a lot of time. For this reason we measure the image download time and have a dedicated ImagePullCostTime error, meaning that the image download took so long that the pod could not be delivered on time.

Fortunately, Alibaba's image distribution platform Dragonfly supports image lazy loading (Image lazyload): remote images are supported, so the kubelet does not need to download the image when it creates a container. This greatly speeds up pod delivery. For more on Image lazyload, see the material shared by Alibaba's Dragonfly team.

  • Second, raising the success rate of a single pod gets harder and harder as the rate climbs, so you can introduce retries at the workload level. At Ant, the PaaS platform keeps retrying until the pod is delivered successfully or the operation times out. When retrying, the nodes that failed previously must be excluded.

  • Third, critical DaemonSets must be checked. If a critical DaemonSet pod is missing on a node but pods are still scheduled onto it, problems arise very easily and affect the creation and deletion paths. This check needs to be wired into the faulty machine system.

  • Fourth, many plugins, such as CSI plugins, need to register with the kubelet. Everything on the node may look fine while registration with the kubelet has failed; such a node cannot provide pod delivery either and needs to be wired into the faulty machine system.

  • Finally, because the cluster has a very large number of users, isolation is very important. On top of permission isolation, QPS isolation and capacity isolation are also needed, so that one user's pods cannot exhaust cluster capacity, thereby protecting other users.
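The workload-level retry from the second tip, including the rule that previously failed nodes are excluded, can be sketched as follows. The function and its parameters are hypothetical, the delivery attempt is passed in as a callback, and a fixed attempt limit stands in for the timeout mentioned in the text.

```python
def deliver_with_retry(candidate_nodes, try_deliver, max_attempts: int = 3):
    """Try to deliver a pod, never reusing a node that already failed.

    try_deliver(node) is a caller-supplied callable returning True on success.
    Returns (node, excluded): the node that succeeded (or None) and the set
    of nodes that failed along the way.
    """
    excluded = set()
    for _ in range(max_attempts):
        available = [n for n in candidate_nodes if n not in excluded]
        if not available:
            break  # every candidate has already failed
        node = available[0]
        if try_deliver(node):
            return node, excluded
        excluded.add(node)  # do not schedule onto this node again
    return None, excluded


# Hypothetical cluster where only node-c can deliver the pod.
nodes = ["node-a", "node-b", "node-c"]
winner, failed = deliver_with_retry(nodes, lambda n: n == "node-c")
print(winner, sorted(failed))
```

The set of excluded nodes is also a useful signal in its own right, since nodes that fail repeatedly are candidates for the faulty machine system.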

Copyright notice
This article was written by [Alibaba cloud native]. Please include a link to the original when reposting. Thanks.