
On AI and Cloud Native: TAL's AI Middle Platform Practice

2020-12-07 19:22:18 Aliyun yunqi

Introduction: At the 2020 Apsara Conference (Yunqi), Liu Dongdong, head of the AI middle platform at TAL Education Group (Good Future), shared his understanding of AI and cloud native and TAL's AI middle platform practice. This article is based on that talk.

As AI comes of age, it puts much greater demands on the richness and agility of an enterprise's underlying IT resources. Using Alibaba Cloud's stable and elastic GPU cloud servers, its leading GPU containerized sharing and isolation technology, and its K8s cluster management platform, TAL achieves flexible resource scheduling on a cloud native architecture, laying a solid technical foundation for its AI middle platform.


Hello everyone, I am Liu Dongdong, technical director of TAL's AI middle platform. The topic of my talk today is "A Brief Introduction to AI and Cloud Native at TAL". My sharing is divided into four parts:

First, the challenges AI services pose to cloud native.
Second, cloud native deployment of AI services.
Third, AI services and cloud native service governance.
Finally, the organic combination of K8s and Spring Cloud.

1. The Challenges AI Services Pose to Cloud Native

First, let's talk about the challenges AI services pose to cloud native. In the cloud native era, the defining features of AI services are that they need far more computing power and far stronger service stability.

 


Our services are no longer single instances; they have become cluster services. At the same time, the stability requirement has climbed from three nines (99.9%) toward the challenge of five nines (99.999%).

These problems are beyond what the traditional technology architecture can solve, so we need a new architecture.

What is that new architecture? Cloud native.

Let's look at the changes cloud native brings us. I summarize the biggest changes as four points and two aspects.

The four points are the four characteristics: DevOps, continuous delivery, microservices, and containers. The two aspects are service deployment and service governance. There is also, of course, the systematic summary given by the twelve-factor methodology.

 


Today's focus is on service deployment and service governance .

Under the cloud native wave, how do we handle service deployment and service governance?

First, cloud native deployment of AI services: using K8s together with resource virtualization and resource pooling, we meet AI services' ever-growing demand for all kinds of hardware resources.

Second, cloud native governance of AI services: with service governance technologies, including service discovery, HPA, and load balancing, we satisfy AI services' need for a five-nines SLA.

 


2. Cloud Native Deployment of AI Services

The first topic is how to combine AI services with cloud native deployment.

Let's start with the characteristics of service deployment in the AI era.

The first is the contradiction between hardware demand and cost growth: AI services' demand for hardware has grown by several orders of magnitude, but the hardware budget has not.

Second, AI services' hardware requirements are diverse: some are GPU-heavy, some CPU-heavy, some memory-heavy, and some need a mix of all of these.

Third, AI services need resource isolation: each AI service should use its resources independently, without interference from the others.

Fourth, AI services need resource pooling: a service should not have to care about a machine's specific configuration, and once all resources are pooled, fragmentation drops and utilization rises.

Finally, AI services need burst capacity: traffic is unpredictable, so the company must retain the ability to expand the resource pool at any time.

 


What is our solution?

First, we use Docker virtualization to achieve resource isolation.

Next, we use GPU sharing technology to pool GPU, memory, and CPU, and manage the whole pool in a unified way.

Finally, we use K8s resource primitives, including taints and tolerations, to configure services flexibly.

In addition, we recommend buying some high-spec machines, mainly to further reduce resource fragmentation.

Of course, we also monitor the hardware of the entire cluster, and make full use of the fact that ECS instances can be scheduled under complex time rules (cron-style, time-based job scheduling) to cope with traffic peaks.
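To make this concrete, here is a minimal sketch of the K8s mechanisms just mentioned: a pod that requests an exclusive GPU and tolerates a taint reserved for GPU nodes. All names and values (the ocr-inference pod, the gpu=true taint, the image path) are illustrative assumptions, not TAL's actual configuration.

```yaml
# Hypothetical pod spec; names, image, and sizes are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: ocr-inference
spec:
  containers:
  - name: ocr
    image: registry.example.com/ai/ocr:latest  # assumed image path
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: 1      # one whole GPU via the NVIDIA device plugin
  tolerations:
  - key: "gpu"                 # assumed taint placed on GPU nodes
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

Tainting GPU nodes and giving a matching toleration only to GPU workloads keeps CPU-only pods off the expensive hardware, which is one way the fragmentation mentioned above is reduced.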

 


Next, let's take a closer look at how TAL's AI middle platform solves these deployment problems.

This page is our node-level service management view. Through it we can clearly see the deployment state of every server, including resource usage, which pods are deployed, and on which nodes.

 


The second is the AI middle platform's service deployment page. Here we can precisely control each pod's memory, CPU, and GPU usage through configuration files, and use mechanisms such as taints to satisfy diverse deployment needs.

 


According to our comparative experiments, cloud native deployment saves roughly 65% in cost compared with user-managed deployment. And this advantage grows with the AI cluster, paying off even more in economics and in temporary traffic expansion.

3. AI Services and Cloud Native Service Governance

Now let's talk about AI services and cloud native service governance.

What is a microservice? A microservice is simply an architectural style: a single application is developed as a suite of small services, each running in its own process and communicating through lightweight mechanisms, for instance HTTP APIs.

 


These services are built around the business itself, can be managed centrally through automated deployment, and may be written in different languages and use different storage resources.

In summary, what are the characteristics of microservices?

First, a microservice is small enough that it may do only one thing.
Second, microservices are stateless.
Third, microservices are independent of each other and interface-oriented.
Finally, microservices are highly autonomous: each is responsible only for itself.

 

 

With these characteristics in mind, compare them with the features of AI services, and we find that AI services are inherently suited to microservices. Each AI service essentially does only one thing: an OCR service only does OCR; an ASR service mainly does ASR.

Moreover, every request to an AI service is independent. A simple example: one OCR request has essentially no connection with another OCR request.

AI services also have an inherent need for horizontal scaling. Why? Because AI services' demand for resources is enormous, so scaling out is indispensable.

Dependencies between AI services are also very light. Our OCR service, for example, probably makes no great demands on the NLP service or any other AI service.

All AI services can provide their capabilities declaratively over HTTP, or even as APIs.

Looking more closely at AI services, though, we find that not all of them are microservices out of the box. So what did we do?

First, we make AI services stateless. These stateless services are treated like cattle rather than pets: identical, stateless, and disposable, and they never use local disk or memory to persist request state. That way a service can be deployed on any node, anywhere.

Of course, not every service can be stateless. What about the stateful ones? We push their request state out into the configuration center, the log center, Redis, MQ, and SQL databases, and we make sure those components are highly reliable.

 


This is the overall architecture diagram of TAL's AI middle platform PaaS. The outermost layer is the service interface layer, which exposes AI capabilities to the outside.

The most important part of the platform layer is the service gateway, responsible for dynamic routing, flow control, load balancing, authentication, and so on. Below it sit service discovery, the registry, fault tolerance, configuration management, elastic scaling, and other functions.

Below that is the business layer: the AI inference services themselves.

At the bottom is the K8s cluster that Alibaba Cloud provides us.

In other words, the whole architecture is: K8s is responsible for service deployment, and Spring Cloud is responsible for service governance.

 


How do we implement this architecture with concrete technical means?

First, Eureka serves as the registry, providing service discovery and registration for the distributed system. The Apollo configuration center manages server configuration properties and supports dynamic updates. The Gateway isolates the inner layer from the outer. The Hystrix circuit breaker, which trips on timeouts or on request volume, protects our services from cascading blockage.

Load balancing together with Feign balances all the traffic, consuming the registration information from Eureka. The Kafka message bus is the component for asynchronous processing. Authentication is done with OAuth2 plus RBAC, covering user login and interface-level access management to ensure safety and reliability.
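As a hedged sketch of how such a stack is typically wired together in a Spring Cloud application.yml (the registry address and timeout are illustrative assumptions, not TAL's settings):

```yaml
# Illustrative Spring Cloud client configuration; values are assumptions.
eureka:
  client:
    service-url:
      defaultZone: http://eureka-server:8761/eureka/  # assumed registry address
  instance:
    prefer-ip-address: true

feign:
  hystrix:
    enabled: true  # wrap every Feign call in a Hystrix circuit breaker

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 3000  # trip the breaker on slow calls
```

With feign.hystrix.enabled set, each Feign client resolves instances from Eureka, load-balances across them, and falls back through Hystrix when a downstream AI service is slow or down.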

For distributed tracing we use SkyWalking. With this kind of APM architecture we can trace every request, making it easy to locate problems and raise alerts.

Finally, the log system is through Filebeat+ES, Distributed collection of logs for the entire cluster .

 


We have also developed some services of our own, such as the Deployment service and the Control service. These are mainly responsible for communicating with K8s, collecting service deployment status and hardware information across the whole K8s cluster.

The alerting system is built with Prometheus plus our Monitor service. It collects hardware data and is responsible for resource, business, and other alerts.
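As an illustration of what such an alert might look like, here is a sketch of a Prometheus alerting rule, assuming GPU metrics are exported by NVIDIA's dcgm-exporter (the threshold, duration, and labels are assumptions):

```yaml
# Hypothetical Prometheus rule; threshold and labels are assumptions.
groups:
- name: gpu-alerts
  rules:
  - alert: GpuUtilizationHigh
    expr: avg by (pod) (DCGM_FI_DEV_GPU_UTIL) > 90  # dcgm-exporter utilization metric
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU utilization above 90% for 5 minutes on pod {{ $labels.pod }}"
```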

The data service is mainly used for download, including data backflow, capturing the data that flows through our inference scenarios.

The rate limiting service limits each customer's requests, QPS, and related quotas.

HPA is actually the most important part. Our HPA supports not only memory-level and CPU-level scaling but also scaling on P99 latency, QPS, GPU, and more.
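A sketch of what a combined CPU-plus-QPS autoscaler could look like with the autoscaling/v2 HPA API, assuming a QPS metric is exposed through a custom metrics adapter; the deployment name and the metric name http_requests_per_second are hypothetical:

```yaml
# Hypothetical HPA mixing a resource metric with a custom QPS metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ocr-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ocr-inference            # assumed deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second  # assumed custom metric name
      target:
        type: AverageValue
        averageValue: "100"             # scale so each pod sees ~100 QPS
```

Scaling on P99 latency or GPU utilization works the same way once those series are available through the custom metrics API.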

Finally, statistical services , It is mainly used for statistics of relevant adjustment amount , Such as requests, etc .

 


Through a unified console we give AI developers a one-stop solution: one platform that handles all of service governance. It raises the level of operations automation, turning an AI service that once took several people to maintain into a dozen or more AI services maintainable by one person.

This page shows the configuration for service routing, load balancing, and rate limiting.

 


This page shows interface-level alerts as well as deployment-level hardware alerts.

 


This is log retrieval, including real-time log functions.

 


This is the operation page for manual and automatic scaling. Automatic scaling includes CPU- and memory-level HPA, as well as our self-developed HPA and scheduled (timed) HPA.
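Scheduled scaling of this kind is often implemented with Alibaba Cloud's open-source kubernetes-cronhpa-controller; the following is a minimal sketch assuming that controller is installed, with illustrative names, schedules, and sizes (note its cron expressions include a seconds field):

```yaml
# Hypothetical scheduled-scaling rule for kubernetes-cronhpa-controller;
# names, schedules, and sizes are assumptions.
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: ocr-inference-cronhpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ocr-inference
  jobs:
  - name: scale-up-for-evening-peak
    schedule: "0 0 18 * * *"  # 18:00 every day, ahead of peak traffic
    targetSize: 20
  - name: scale-down-overnight
    schedule: "0 0 2 * * *"   # 02:00 every day
    targetSize: 4
```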

 

 

4. The Organic Combination of K8s and Spring Cloud

Finally, let's talk about the organic combination of K8s and Spring Cloud.

 


Look at these two diagrams. The one on the left shows Spring Cloud's path from registry to routing; the one on the right shows how a K8s Service maps to its pods.

The two structures are strikingly similar. So how do we do it? We bind our Application to the K8s Service, so that the LB address ultimately registered inside Spring Cloud is in fact the K8s Service address. In this way K8s and Spring Cloud are combined at the routing level, and with this combination we get the final effect.
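One hedged way to realize this binding is to have each application register its K8s Service DNS name with Eureka instead of its own pod IP; the service and namespace names below are illustrative assumptions:

```yaml
# Hypothetical Eureka instance settings: register the K8s Service
# address rather than the pod IP.
eureka:
  instance:
    hostname: ocr-service.ai-platform.svc.cluster.local  # assumed Service DNS name
    non-secure-port: 80          # the Service port, not the container port
    prefer-ip-address: false     # register the hostname above, not the pod IP
```

Callers that resolve the service through Eureka then hit the K8s Service virtual IP, and kube-proxy spreads the traffic across the pods behind it.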

 


Spring Cloud is a Java technology stack, but AI services come in many languages: C++, Java, even PHP.

To work across languages, we introduced the sidecar pattern: the AI service and its sidecar communicate over RPC, shielding the platform from language-specific details.

The sidecar's main functions are service registration and discovery, routing, distributed tracing, and health checks.
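Spring Cloud Netflix ships a Sidecar module for exactly this scenario. A minimal sketch, assuming a non-Java AI service that listens locally on port 8000 and exposes a /health endpoint (names and ports are illustrative):

```yaml
# Hypothetical Spring Cloud Netflix Sidecar configuration.
server:
  port: 5678                 # port of the sidecar application itself

spring:
  application:
    name: ocr-cpp-service    # assumed name the C++ service registers under

sidecar:
  port: 8000                                 # port the non-Java service listens on
  health-uri: http://localhost:8000/health   # its local health-check endpoint
```

The sidecar itself is a small Java application annotated with @EnableSidecar; it registers the polyglot service in Eureka, relays health checks, and gives it the same service discovery and routing as native Java services.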

That is the end of my talk today. Thank you very much for listening.

Link to the original article
This article is original content from Alibaba Cloud and may not be reproduced without permission.

Copyright notice
This article was created by [Aliyun yunqi]. When reposting, please include a link to the original. Thank you.
https://chowdera.com/2020/11/20201121035940612i.html