The development and evolution of enterprise storage architecture is a long-term process, and IT planners may be tempted to treat artificial intelligence (AI) as a transformation project to invest in gradually over the next few years. The AI wave, however, is arriving faster than expected, and more and more industries are using AI to drive business change. At the same time, AI workloads are unlike any IT load handled before: they face enormous unstructured data sets and demand extremely high random-access performance, very low latency, and large storage capacity.
AI will not only create new industries; it will fundamentally change how existing organizations do business. IT planners need to start asking now whether their storage infrastructure is ready for the coming AI tide.
What does AI require from storage?
Before answering which storage solutions fit AI, we need to understand what AI data looks like and what kind of storage that data demands. By analyzing these characteristics layer by layer, we can distill AI workloads' overall requirements for storage.
Massive unstructured data storage
Apart from a few scenarios that mainly analyze structured data (risk control over consumption and transaction records, trend prediction, and the like), most AI workloads process unstructured data: image recognition, speech recognition, autonomous driving, and so on. These scenarios usually rely on deep-learning algorithms that must be fed massive volumes of images, audio, and video.
Shared data access
Multiple AI compute nodes need shared access to the same data. Because AI architectures rely on large compute clusters (GPU servers), the servers in the cluster should read from a single data source, that is, a shared storage space. Shared access guarantees that every server sees a consistent view of the data, and it avoids the redundancy of keeping copies on each server.
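The idea can be sketched in a few lines of Python: every node mounts the same shared path, sees the identical file listing, and takes a deterministic slice of it, so no per-node copies are needed. The file names and node counts below are stand-ins for illustration.

```python
def node_shard(all_files, node_rank, num_nodes):
    """Each GPU node takes its slice of one shared dataset directory.

    Because every node mounts the same shared path, the sorted file list
    is identical everywhere, so a simple round-robin split keeps the
    shards disjoint: no data is copied onto local disks.
    """
    return [f for i, f in enumerate(sorted(all_files)) if i % num_nodes == node_rank]

# Stand-in for a real listing such as os.listdir("/mnt/shared/train"):
files = [f"img_{i:04d}.jpg" for i in range(10)]
shard0 = node_shard(files, node_rank=0, num_nodes=4)
shard1 = node_shard(files, node_rank=1, num_nodes=4)
```

With a shared mount, this split can be computed independently on every node and still produce consistent, non-overlapping shards.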
Which storage interface, then, can provide shared access?
Block storage requires an upper-layer application (for example Oracle RAC) to implement coordination, locking, session switching, and similar mechanisms before a block device can be shared among multiple nodes, so it is not suitable for direct use by AI applications.
That leaves object storage and file storage, both of which can provide shared access at the interface level. To decide which interface is more convenient, we need a closer look at how AI application frameworks actually use storage. Take PyTorch, one of the most popular frameworks in the AI ecosystem: when loading image data, it is typically called like this:
from torchvision import datasets, transforms
dataset = datasets.ImageFolder('path/to/data', transform=transforms.ToTensor())
So how does torchvision's datasets.ImageFolder load images? Its constructor takes a loader argument that defaults to default_loader. What does default_loader do? It normally delegates to pil_loader. And how does pil_loader read data? It simply opens the file with Python's built-in open(), the most ordinary way to access a file-system file. So PyTorch accesses data through the file interface by default. To feed ImageFolder through any other storage interface, you would have to write a custom loader for it, which adds unnecessary development effort.
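The whole chain can be illustrated with a minimal, hypothetical clone of the ImageFolder contract: one subdirectory per class, and every sample read through a pluggable loader whose default is nothing more than a plain open() call. This is a sketch of the pattern, not torchvision's actual code, and the loader returns raw bytes instead of a decoded image to stay dependency-free.

```python
import os
import tempfile

IMG_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")

def raw_loader(path):
    # Stands in for torchvision's pil_loader: under the hood it is just
    # the built-in open(), an ordinary file-system call.
    with open(path, "rb") as f:
        return f.read()

class MiniImageFolder:
    """Hypothetical sketch of the ImageFolder contract: one subdirectory
    per class; samples are read through a pluggable loader whose default
    uses the plain file interface."""
    def __init__(self, root, loader=raw_loader):
        self.loader = loader
        classes = sorted(d for d in os.listdir(root)
                         if os.path.isdir(os.path.join(root, d)))
        self.class_to_idx = {c: i for i, c in enumerate(classes)}
        self.samples = [(os.path.join(root, c, f), self.class_to_idx[c])
                        for c in classes
                        for f in sorted(os.listdir(os.path.join(root, c)))
                        if f.lower().endswith(IMG_EXTENSIONS)]
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        path, target = self.samples[idx]
        return self.loader(path), target

# Tiny demo tree: two classes, one "image" each.
root = tempfile.mkdtemp()
for cls in ("cat", "dog"):
    os.makedirs(os.path.join(root, cls))
    with open(os.path.join(root, cls, "0.png"), "wb") as fh:
        fh.write(cls.encode())
ds = MiniImageFolder(root)
```

The point is that the default path goes straight through the OS file interface, so any file system mounted at `root` works with no framework changes; an object store would require a custom loader.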
From the perspective of AI application frameworks, therefore, the file interface is the friendliest way to access storage.
Read-heavy, write-light
AI data access is read-heavy and write-light, and it demands high throughput and low latency. Deep-learning training iterates over the data many times. Take visual recognition: training loads tens of millions, even hundreds of millions, of images and runs them through convolutional neural networks such as ResNet to produce a trained model. After each pass, the file order is shuffled and the data reloaded, so that correlations in input order do not bias the results; training repeats this for many passes (each called an epoch). That means every epoch reloads those tens or hundreds of millions of images in a new order, so image read speed and latency have a major impact on total training time.
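The per-epoch reshuffle described above can be sketched as follows; the seed handling is an illustrative choice, not a prescribed scheme. The key property is that every epoch re-reads the full dataset, but in a fresh random order, which is why storage sees massive random read traffic rather than one sequential pass.

```python
import random

def epoch_order(num_files, epoch, seed=42):
    """Return the shuffled read order for one epoch.

    Seeding with (seed + epoch) makes each epoch's order different but
    reproducible across the worker nodes of a cluster.
    """
    order = list(range(num_files))
    random.Random(seed + epoch).shuffle(order)
    return order

# Three epochs over the same (tiny) dataset, each a full random re-read:
orders = [epoch_order(8, e) for e in range(3)]
```

Every order is a permutation of the full index range, so each epoch touches every file exactly once, just in a different sequence.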
As noted above, both object storage and file storage can give GPU clusters shared access to data; which offers lower latency? Industry-leading high-performance object storage has a read latency of roughly 9 ms, while high-performance file systems typically deliver 2-3 ms. Multiplied across hundreds of millions of images loaded every epoch, that gap grows into a serious drag on AI training efficiency.
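A back-of-envelope calculation shows how a few milliseconds per read compound at this scale. The latencies are the rough figures quoted above; the dataset size and the 512-way read concurrency are hypothetical assumptions for illustration.

```python
# Rough figures from the text; concurrency is an assumed 512 parallel
# readers across the GPU cluster.
object_ms, file_ms = 9.0, 2.5
images_per_epoch = 100_000_000
concurrency = 512

# Extra wall-clock hours per epoch attributable to the latency gap alone:
delta_hours = (object_ms - file_ms) * images_per_epoch / concurrency / 1000 / 3600
```

Under these assumptions the gap costs roughly a third of an hour per epoch, so a training run of a hundred epochs loses on the order of tens of hours purely to read latency.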
From the standpoint of file loading, the low-latency characteristics of high-performance file systems make them the first choice for AI.
Complex I/O patterns
AI mixes large and small files, sequential and random reads. Data characteristics vary by workload: visual recognition usually deals with small files under 100 KB, while speech recognition mostly handles large files above 1 MB, and such individual files are read sequentially. Some algorithm engineers, meanwhile, aggregate hundreds of thousands or even millions of small files into single files of hundreds of GB, or even TB, and in every epoch read those large files at random offsets in an order generated by the framework.
Delivering high performance for such complex I/O, without being able to predict file sizes or I/O types in advance, is another demand AI places on storage.
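The aggregate-then-randomly-read pattern can be sketched in a few lines: many small records are packed into one large file with an offset index, and training reads them back via seek() in a shuffled order. This is a toy format in the spirit of packed-record files, not any real on-disk format.

```python
import io
import os
import random
import tempfile

def pack(small_files):
    """Aggregate many small records into one blob plus an offset index,
    similar in spirit to packed-record training formats (a sketch only)."""
    index, buf = [], io.BytesIO()
    for payload in small_files:
        index.append((buf.tell(), len(payload)))
        buf.write(payload)
    return buf.getvalue(), index

def read_random_order(path, index, seed=0):
    """Read records back in random order via seek(): this is exactly the
    large-file random-read pattern the storage system must sustain."""
    order = list(range(len(index)))
    random.Random(seed).shuffle(order)
    out = [None] * len(index)
    with open(path, "rb") as f:
        for i in order:
            offset, length = index[i]
            f.seek(offset)
            out[i] = f.read(length)
    return out

records = [f"sample-{i}".encode() for i in range(5)]
blob, idx = pack(records)
path = os.path.join(tempfile.mkdtemp(), "packed.bin")
with open(path, "wb") as f:
    f.write(blob)
restored = read_random_order(path, idx)
```

Packing turns millions of small-file opens into seeks within one large file, but the reads themselves remain random, which is why storage must handle both patterns well.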
Containerization of AI workloads
AI applications are steadily migrating to Kubernetes container platforms, so data access must naturally be as convenient as possible from containers. The historical pattern makes this easy to see. In the single-server era, data sat on disks attached directly to the server (the DAS model). When workloads moved to clusters of physical machines, data moved onto SAN arrays for unified management and convenient access. In the cloud era, data moved again, into distributed storage and object storage suited to cloud access. Data, in other words, always gravitates to the storage that the workload can access and manage most conveniently. In the age of containers and cloud native, data should likewise live on the storage that is most convenient for cloud-native applications to access and manage.
Workloads are moving to the public cloud
The public cloud is becoming the preferred platform for AI workloads, yet native public-cloud storage is designed for general-purpose applications and falls short of AI's demands for high throughput, low latency, and large capacity. Most AI workloads are bursty, and the elasticity and pay-as-you-go pricing of the public cloud, together with mature high-performance GPU server offerings, make public-cloud compute the first choice for cutting AI costs and improving efficiency. A matching public-cloud storage solution with the characteristics described above, however, is still missing. In recent years some storage vendors abroad (for example NetApp, Qumulo, and ElastiFile) have released versions of their products that run on the public cloud, confirming that native public-cloud storage does not satisfy these specific application demands. A storage solution fit for AI and deployable on the public cloud is therefore the last mile for AI's further adoption there.
Which AI storage solutions, then, can meet the requirements of large-scale AI applications described above?
DAS
In the DAS model, data lives directly on the GPU server's SSDs. This guarantees high read bandwidth and low latency, but its drawbacks are more obvious: capacity is very limited; the SSD or NVMe drives cannot be fully utilized (a high-performance NVMe drive often runs below 50% of its potential); and the SSDs on different servers form islands, with severe data redundancy between them. As a result, this approach is rarely used in real AI practice.
Scale-up storage arrays
Shared scale-up storage arrays are the most common of the available shared solutions, and perhaps the most familiar. Like DAS, they have drawbacks, and AI workloads expose them faster than traditional workloads do. The most obvious is total capacity: most traditional arrays can only grow to about 1 PB, while large AI workloads need tens of PB, so enterprises end up buying array after array, creating data islands. Even where capacity suffices, traditional arrays run into performance limits: they usually support only a small number of storage controllers, most commonly two, and highly parallel AI workloads easily overwhelm such small controllers.
General-purpose distributed file systems
Users typically turn to GlusterFS, CephFS, or Lustre. The first problem with these open-source distributed file systems is the complexity of deployment and operations. Second, GlusterFS and CephFS struggle to guarantee performance with massive numbers of small files and at large scale and capacity. Given how expensive GPUs are, storage that cannot keep them fed drastically lowers the return on that investment, which is the last thing an AI platform owner wants to see.
File gateways over object storage
Another option is a file-access gateway built on top of object storage. First, object storage is inherently weak at random or append writes, so it supports AI write operations poorly. Second, object storage's read-latency disadvantage is amplified further once requests pass through the gateway. Read-ahead and caching can stage part of the data onto front-end SSDs, but this brings its own problems: 1) the upper-layer AI framework must be adapted to the special underlying architecture, for example by running a prefetch program, which is intrusive to the framework; 2) data loading speed becomes uneven, and while data is being staged, or when the front-end SSD cache misses, GPU utilization drops by 50-70%.
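The cache-miss problem can be made concrete with a toy sketch of the gateway workaround: a prefetch cache holds what fits, hits are served locally, and every miss falls back to a synchronous object-store read. The class, capacity, and key names are all hypothetical.

```python
class PrefetchCache:
    """Toy sketch of the gateway workaround: prefetch objects into a
    local SSD cache; hits are fast, misses fall back to the slow
    object-store read path."""
    def __init__(self, backend_fetch, capacity=4):
        self.backend_fetch = backend_fetch
        self.capacity = capacity
        self.cache = {}
        self.hits = self.misses = 0

    def prefetch(self, keys):
        # Only `capacity` objects fit on the front-end SSD.
        for k in list(keys)[: self.capacity]:
            self.cache[k] = self.backend_fetch(k)

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1  # slow path: synchronous object-store read
        return self.backend_fetch(key)

# A dataset of 8 objects, but cache room for only 4:
store = {f"k{i}": f"v{i}" for i in range(8)}
c = PrefetchCache(store.__getitem__, capacity=4)
c.prefetch([f"k{i}" for i in range(8)])
results = [c.get(f"k{i}") for i in range(8)]
```

Half the reads miss the cache here, and each miss pays full object-store latency; this unevenness is exactly what stalls GPUs during training.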
Judged on data-scale scalability, access performance, and general applicability to AI platforms, none of these solutions is an ideal storage choice for AI.
YRCloudFile: a storage product built for AI scenarios
YRCloudFile has several characteristics that fit AI's combined requirements well.
First, it is distributed file storage with shared access, so a GPU cluster can access it concurrently. It exposes a file interface, the best fit for the upper AI platform stack.
It supports high-performance access to massive unstructured data. Through the YRCloudFile client, GPU servers can access different nodes of the storage cluster concurrently; IO500 testing and validation by leading AI enterprises place its performance at an industry-leading level, and it sustains stable performance even with massive file counts. Extensive optimization in the design of its metadata and data services ensures the performance AI workloads need across complex I/O types.
On Kubernetes, the storage capacity YRCloudFile provides can be scheduled and consumed seamlessly. Beyond the standard CSI interface, it offers enterprise features such as RWX access, PV quotas, PVC resize, and PVC QoS, giving strong support for the data-access needs of AI workloads running on Kubernetes.
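From the application's point of view, consuming such storage reduces to an ordinary PersistentVolumeClaim. The sketch below is hypothetical: the StorageClass name is illustrative and not an official identifier, and the sizes are arbitrary.

```yaml
# Hypothetical PVC requesting shared (RWX) access through a
# YRCloudFile-backed StorageClass; the class name is illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany        # RWX: many GPU pods share one dataset volume
  storageClassName: yrcloudfile-sc
  resources:
    requests:
      storage: 10Ti
```

Training pods then mount `training-data` like any other volume, so the AI framework keeps using the plain file interface discussed earlier.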
It supports public-cloud deployment. YRCloudFile can currently be deployed quickly on AWS, Alibaba Cloud, and Tencent Cloud, filling the gap between native public-cloud storage and the performance, scalability, and operational requirements of AI scenarios.
Through this analysis, we hope to give AI planners some observations and insight into what AI workloads actually need from storage, and to offer customers optimized AI storage products as they bring AI into production. After the information revolution, AI will be the next technology to change the world, and its tide has quietly arrived. It is time to consider storage built for AI.