
Quiver: Make Your Multi-GPU GNN Training Faster


Preface

Quiver is an open-source GNN framework. It not only improves single-GPU training performance, but also greatly improves the multi-GPU scalability of training, even achieving super-linear speedup on machines with NVLink. All of this costs only a few dozen lines of source-code changes (especially if you are a PyG user)!

import torch
import quiver

...

## Step 1: Replace PyG graph sampler
# train_loader = NeighborSampler(data.edge_index, ...) # Comment out PyG sampler
train_loader = torch.utils.data.DataLoader(train_idx) # Quiver: PyTorch Dataloader
quiver_sampler = quiver.pyg.GraphSageSampler(quiver.CSRTopo(data.edge_index), sizes=[25, 10]) # Quiver: Graph sampler

...

## Step 2: Replace PyG feature collectors
# feature = data.x.to(device) # Comment out PyG feature collector
quiver_feature = quiver.Feature(rank=0, device_list=[0]).from_cpu_tensor(data.x) # Quiver: Feature collector

...
  
## Step 3: Train PyG models with Quiver
# for batch_size, n_id, adjs in train_loader: # Comment out PyG training loop
for seeds in train_loader:
  n_id, batch_size, adjs = quiver_sampler.sample(seeds)  # Use Quiver graph sampler
  batch_feature = quiver_feature[n_id]  # Use Quiver feature collector

(Figure: https://pic3.zhimg.com/80/v2-7bde56cd249e756d8b34281027207c82_1440w.jpg)

Quiver's GitHub link is below. Fellow researchers working on GNN systems are welcome to star, fork, and open PRs!

https://github.com/quiver-team/torch-quiver

In this article we introduce part of Quiver's design and the motivation behind its implementation; the complete design will be presented in follow-up papers.

1. Overview

Every team working on graph machine-learning systems knows that the performance bottlenecks of sampling-based GNN training are graph sampling and feature aggregation. But what is the essence behind these two bottlenecks? Quiver's core idea is:

  • Graph sampling is a latency-critical problem: the key to high-performance sampling is hiding access latency with massive parallelism.
  • Feature aggregation is a bandwidth-critical problem: the key to high performance is optimizing aggregation bandwidth.

By default, graph sampling and feature aggregation are performed on the CPU. This not only limits single-GPU training performance; because sampling and feature aggregation are both CPU-intensive operations, multi-GPU training also scales poorly due to contention for CPU compute resources. Taking the ogbn-products dataset as an example, we benchmarked the multi-GPU scalability of PyG with CPU-based sampling and feature aggregation; see the Quiver project link for the benchmark scripts.

Below we describe how Quiver accelerates graph sampling and feature aggregation.

2. Quiver performance optimizations

2.1 Sampling scheme

To address the problems of CPU sampling, DGL and PyG both provide GPU implementations of some sampling algorithms. GPU sampling is certainly much faster than CPU sampling, but it has a serious limitation: the graph topology must fit in GPU memory. For example, the pure topology of the paper100M dataset is about 13 GB, and that of MAG240M is about 25 GB; many industrial graphs are even larger than MAG240M. Such graphs are hard to fit entirely into the memory of most GPUs on the market, not to mention that model training itself also needs GPU memory.

Quiver provides users with a UVA-based (Unified Virtual Addressing) graph sampling operator. It supports sampling with the graph data placed in GPU memory, and, when the topology is too large, it also supports keeping the graph in CPU memory while still sampling on the GPU. This gives us performance far beyond CPU sampling, while the maximum graph size we can handle grows from the GPU memory limit to the CPU memory limit (which is usually far larger).
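
As a minimal sketch of the two placements, building on the quick-start snippet above (the device argument and the exact mode values are assumptions based on recent torch-quiver releases):

import quiver

csr_topo = quiver.CSRTopo(data.edge_index)
# UVA mode: the topology stays in (pinned) host memory, sampling kernels run on GPU 0
quiver_sampler = quiver.pyg.GraphSageSampler(csr_topo, sizes=[25, 10], device=0, mode='UVA')
# GPU mode (see below): the topology is resident in GPU memory, only viable when it fits
# quiver_sampler = quiver.pyg.GraphSageSampler(csr_topo, sizes=[25, 10], device=0, mode='GPU')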

Our sampling benchmarks on the ogbn-products and reddit datasets show that UVA-based sampling is far faster than CPU sampling (PyG's sampling implementation is used as the CPU baseline). We measure sampling performance as the number of edges sampled per unit time (Sampled Edges Per Second, SEPS). Even without storing the graph in GPU memory, Quiver's sampling delivers roughly a 20x speedup on real datasets. (See the project link for the benchmark code.)
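
To reproduce a SEPS number on your own data, a rough measurement loop could look like the following sketch, reusing train_loader and quiver_sampler from the quick-start snippet and assuming each adj exposes an edge_index field like PyG's Adj tuple:

import time

total_edges, start = 0, time.time()
for seeds in train_loader:
    n_id, batch_size, adjs = quiver_sampler.sample(seeds)
    # count the edges returned across all hops of this mini-batch
    total_edges += sum(adj.edge_index.size(1) for adj in adjs)
print(f"sampled edges per second: {total_edges / (time.time() - start):.0f}")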

When your GPU has enough memory to hold the entire graph topology, you can set the sampling parameter mode=GPU to place the graph in GPU memory for even higher performance. In our benchmarks, mode=GPU brings another 30%-40% performance improvement over UVA mode.

The benefit of UVA sampling is not only faster sampling: it also relieves the demand on CPU compute resources and therefore greatly improves multi-GPU scalability. To avoid ambiguity, in the rest of this article "UVA sampling" always means the graph is stored in host memory. Next we focus on optimizing feature-aggregation throughput.

2.2 Graph feature aggregation (Feature Collection)

A single training batch in GNN training often carries hundreds of MB or even several GB of feature data, and both gathering these features in CPU memory and transferring them from CPU to GPU are time-consuming. The core of optimizing feature aggregation is therefore end-to-end throughput, i.e. the throughput of the whole path from gathering the features to delivering them to the GPU. Existing approaches generally fall into two categories:

  • Aggregate the features on the CPU, then transfer the aggregated features to the GPU for training.
  • When the feature data is small, keep all the features in GPU memory.

The CPU-based scheme suffers from low throughput; moreover, CPU feature collection is itself a CPU-intensive operation, so feature aggregation again scales poorly across GPUs because of CPU resource contention. The GPU-based scheme (2) is still limited by GPU memory, and in practice the features are far larger than the graph topology: for MAG240M the topology is about 25 GB while the features are about 300 GB, which is hard to fit into GPU memory.
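
For reference, the two baselines look roughly like this in plain PyG/PyTorch code (no Quiver involved; data, n_id and device follow the notation of the quick-start snippet):

# Scheme 1: gather features in CPU memory, then copy each mini-batch to the GPU
batch_feature = data.x[n_id].to(device)

# Scheme 2: keep the full feature matrix in GPU memory (only viable when it fits)
x_gpu = data.x.to(device)
batch_feature = x_gpu[n_id]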

Quiver provides the high-throughput quiver.Feature abstraction for feature aggregation. Its implementation is based on two observations:

  • Real-world graphs often follow a power-law degree distribution: a small fraction of nodes account for most of the edges in the graph. For most sampling algorithms based on the graph topology, the probability that a node is sampled during one epoch is positively correlated with its degree.
  • Within one AI server, the access/transfer bandwidths are ordered as GPU global memory > GPU P2P over NVLink > pinned host memory > pageable host memory, yet few existing systems exploit this access hierarchy to improve end-to-end data throughput.

Given this bandwidth hierarchy and the skewed access pattern across graph nodes, quiver.Feature automatically partitions the features according to user-configured parameters, placing hot data in GPU memory and cold data in CPU pinned memory. We implemented an efficient GPU kernel that performs the cross-device gather in a unified way and keeps the memory accesses of all threads in a warp coalesced, which is friendlier both to memory requests crossing PCIe and to GPU global-memory accesses. Because node accesses are highly skewed, most threads hit the data cached in GPU memory, so GPU memory is used to maximum effect. The resulting effective access bandwidth is, roughly:

    effective bandwidth ≈ hit_ratio × BW(GPU global memory) + (1 − hit_ratio) × BW(pinned host memory over PCIe)

where hit_ratio is the fraction of feature accesses served by the GPU cache.
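
A sketch of such a configuration (device_cache_size and csr_topo are argument names taken from recent torch-quiver releases; treat the exact names as assumptions). Passing csr_topo lets quiver.Feature order features by node degree so that hot, high-degree nodes land in the GPU cache:

csr_topo = quiver.CSRTopo(data.edge_index)
quiver_feature = quiver.Feature(rank=0,
                                device_list=[0],
                                device_cache_size="200M",        # hot features cached in GPU memory
                                cache_policy="device_replicate",
                                csr_topo=csr_topo).from_cpu_tensor(data.x)  # cold features stay in pinned host memory
batch_feature = quiver_feature[n_id]  # one call gathers from GPU cache and host memory transparently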

We benchmarked the feature-aggregation throughput of this scheme and found that, compared with CPU feature aggregation, Quiver achieves a 4-10x speedup. Again, GPU feature aggregation brings not only higher performance but also lower demand on CPU resources, which avoids CPU contention when scaling to multiple GPUs.

But where does the NVLink mentioned above come in? Don't worry, we introduce it next.

3. Multi-GPU training without NVLink

With quiver.Feature and quiver.pyg.GraphSageSampler in place, we can train GNN models. If you are a PyG user, only a few dozen lines of changes are needed to accelerate your training with Quiver. quiver.Feature and quiver.pyg.GraphSageSampler both have built-in IPC mechanisms, so you can simply pass them as arguments to child processes. For multi-GPU training we propose the device_replicate scheme: hot data is replicated on every GPU, while cold data is kept in host memory.
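
A minimal multi-process sketch under the device_replicate policy (the train function and the batch size are illustrative; quiver_feature would be built with device_list covering all GPUs, as in the quick-start snippet):

import torch
import torch.multiprocessing as mp

def train(rank, quiver_feature, quiver_sampler, train_idx):
    torch.cuda.set_device(rank)
    train_loader = torch.utils.data.DataLoader(train_idx, batch_size=1024, shuffle=True)
    for seeds in train_loader:
        n_id, batch_size, adjs = quiver_sampler.sample(seeds)  # sampling runs inside this process
        batch_feature = quiver_feature[n_id]                   # hits this GPU's cache or host memory
        ...  # forward/backward on this rank's model replica (e.g. wrapped in DistributedDataParallel)

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    # quiver_feature / quiver_sampler / train_idx are created once in the parent process
    # and handed to the workers through Quiver's built-in IPC support
    mp.spawn(train, args=(quiver_feature, quiver_sampler, train_idx), nprocs=world_size, join=True)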

We ran an end-to-end experiment on ogbn-products. With the GPU caching only 20% of the feature data, Quiver achieves about a 9x speedup on 4 GPUs over PyG using CPU sampling and CPU feature aggregation (with a sampling parallelism of 5 in each training process). Even when PyG places all of the feature data in GPU memory, Quiver is still about 3x faster on 4 GPUs. More importantly, the gap between Quiver and PyG keeps widening as the number of GPUs grows.

Running paper100M on 4 GPUs, we measured a scaling efficiency of about 85% (a speedup of 3.4 on 4 GPUs, i.e. 3.4 / 4 = 0.85). So where does the lost 15% go? We believe the core cause is still contention on the CPU memory bus. If we could offload some of that traffic from the CPU bus, multi-GPU scalability would be even better, so we turned to NVLink.

4. Multi-GPU training with NVLink

When data is accessed over NVLink via GPU P2P, the transfer does not go through the CPU bus at all, which significantly reduces the CPU bus load. At the same time, GPU P2P over NVLink offers much higher throughput than pinned memory, which further increases our feature-aggregation throughput.

With NVLink, the feature-aggregation throughput hierarchy becomes: local GPU memory > GPU P2P over NVLink > pinned memory. We want hot data to hit local GPU memory and other GPUs' memory as much as possible. We therefore propose the p2p_clique_replicate strategy: within one NVLink clique (a group of GPUs that can all reach each other via P2P), all GPUs share a single cache, while hot data is still replicated across cliques. (The cliques are required to be symmetric.)

Inside a clique, data access works as follows: the hot data is spread across the GPUs in the clique, and a single kernel still handles accesses to the different GPUs and to host memory. To keep data accesses load-balanced across all GPUs in the clique, we apply a shuffle as a preprocessing step.

This strategy brings us two benefits:

  • More cache space. Where a single GPU could previously cache only 20% of the data, the clique now shares a cache holding 40% of the data in total; combined with the faster NVLink access, overall data access achieves super-linear speedup.
  • Much lower CPU bus load. Under device_replicate, the CPU bus load grows linearly with the number of GPUs; under p2p_clique_replicate, the CPU bus load actually drops as the clique grows, because the skewed access pattern lets most feature accesses bypass the CPU bus via NVLink.

The end-to-end result of these two effects: training ogbn-products on a two-GPU machine with NVLink reaches a super-linear speedup. Quiver exposes this through a very simple API: users only need to change the cache_policy parameter to switch between the two multi-GPU caching strategies.
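
A sketch of that switch for a machine where GPU 0 and GPU 1 form one NVLink clique (init_p2p is the P2P-initialization helper of recent torch-quiver releases; treat the exact name as an assumption):

quiver.init_p2p(device_list=[0, 1])   # enable P2P access inside the NVLink clique
quiver_feature = quiver.Feature(rank=0,
                                device_list=[0, 1],
                                device_cache_size="200M",
                                cache_policy="p2p_clique_replicate",  # clique-shared cache instead of per-GPU replicas
                                csr_topo=csr_topo).from_cpu_tensor(data.x)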

5. Summary

The road ahead is long and winding, and we will keep searching high and low. So far we have only open-sourced part of the single-machine version of Quiver; more functionality and training-strategy optimizations will be presented in follow-up papers, and in the next release we will open-source the distributed version of Quiver, aiming to make large-graph GNN training faster and easier. Quiver is developed by researchers from the University of Edinburgh, Imperial College London, Tsinghua University, and the University of Waterloo, with support from Alibaba and Lambda Labs. Interested readers are welcome to follow our project!

This article is from the WeChat official account Graphs and Recommendations (GraphRec).


Originally published: 2021-11-02

