Kwai K266Dec

2021-12-21 15:16:54 Dillon2015

Introduction

K266Dec is a VVC software decoder designed by Kwai for ARM platforms and introduced in the JVET-V0070 proposal. The decoder was designed from scratch against the VVC standard and supports all coding tools of the VVC main profile.

On Android, on a Huawei P40, single-thread decoding of 2K 8-bit CTC bitstreams reaches 33 fps, which is 4.11 times the speed of the VTM11.0 decoder.

On iOS, on an iPhone 12 Pro Max, single-thread decoding of 2K 8-bit CTC bitstreams reaches 94 fps, which is 4.78 times the speed of the VTM11.0 decoder.
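
The two speed-up figures above also imply the VTM11.0 baseline speed on each device. A minimal sketch of that arithmetic, assuming the quoted speed-up is simply the ratio of the two decoders' frame rates (the function name is illustrative and does not come from the proposal):

```python
def implied_baseline_fps(optimized_fps: float, speedup: float) -> float:
    """Return the baseline frame rate implied by an optimized fps and a speed-up ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return optimized_fps / speedup


# Figures quoted in the text: 2K 8-bit CTC bitstreams, single thread.
p40_vtm_fps = implied_baseline_fps(33, 4.11)     # Huawei P40
iphone_vtm_fps = implied_baseline_fps(94, 4.78)  # iPhone 12 Pro Max

print(f"VTM11.0 on Huawei P40: about {p40_vtm_fps:.1f} fps")
print(f"VTM11.0 on iPhone 12 Pro Max: about {iphone_vtm_fps:.1f} fps")
```

Under that assumption, VTM11.0 would run at roughly 8 fps on the Huawei P40 and roughly 20 fps on the iPhone 12 Pro Max for these bitstreams.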

K266Dec is highly optimized for Android, because 80-90% of Kwai's video views come from Android phones.

To make VVC usable on more phone models, CPU usage and power consumption must be kept low. K266Dec therefore focuses on cache-friendly pipelining and high-throughput SIMD optimization for ARM. It also focuses on strengthening single-thread decoding, which has a significant effect on real-time CPU utilization.

For single-thread decoding, K266Dec mainly optimizes the following:

  • ARM/NEON assembly: K266Dec makes full use of the ARM and NEON instruction sets to optimize compute-intensive modules such as CABAC decoding, intra/inter prediction, inverse transform, the deblocking filter, and SAO. To make full use of the pipeline of a single CPU core, instruction-level pipelining is carefully designed for almost every assembly function.

  • Cache-friendly pipelining and low memory access: K266Dec uses multiple local buffers that share information between blocks, which avoids reading that information from frame-level memory and lowers the cache miss rate. Through fine-grained memory allocation and local operations, K266Dec minimizes the memory accesses needed for decoding.
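
The local-buffer idea in the second bullet can be illustrated with a line buffer. This is a minimal sketch, assuming that a block needs neighbor information (for example, one mode value per 4x4 unit) only from the unit directly above and the unit directly to the left. The class and all of its names are hypothetical and are not K266Dec's actual data structures.

```python
from typing import Tuple


class NeighborLineBuffer:
    """Keeps one value per column plus the last value written,
    instead of a frame-sized array of per-unit information."""

    def __init__(self, units_per_row: int, default: int = 0):
        self.above = [default] * units_per_row  # latest value stored per column
        self.left = default                      # value of the unit just decoded
        self.default = default

    def start_row(self) -> None:
        # The first unit of a row has no left neighbor.
        self.left = self.default

    def neighbors(self, x: int) -> Tuple[int, int]:
        """Return the (above, left) neighbor values for column x."""
        return self.above[x], self.left

    def store(self, x: int, value: int) -> None:
        # Overwrite in place: the old above[x] is not needed once unit x is done.
        self.above[x] = value
        self.left = value


# Walk a 1920x1080 frame in 4x4 units (an assumed granularity).
width_units, height_units = 1920 // 4, 1080 // 4
buf = NeighborLineBuffer(width_units)
for y in range(height_units):
    buf.start_row()
    for x in range(width_units):
        above, left = buf.neighbors(x)
        value = (above + left + 1) % 7  # stand-in for "decode using neighbors"
        buf.store(x, value)

frame_entries = width_units * height_units  # frame-level storage
local_entries = len(buf.above) + 1          # line buffer plus one left value
print(frame_entries, local_entries)
```

The working set for neighbor lookups stays at one row of entries rather than the whole frame, which is the kind of locality that helps keep such data in cache.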

For scenarios with higher decoding complexity, such as decoding 4K high-resolution video, K266Dec makes full use of multi-core CPU resources. On the Huawei P40, for class A sequences, 4-thread decoding achieves a 3.45x speed-up over single-thread decoding. Multithreaded decoding is optimized as follows:

  • Fine-grained parallelism: In addition to the traditional frame-level and wavefront parallelism, K266Dec also uses fine-grained parallelism. The decoding of one CTU is divided into several sub-stages: parsing, mode generation, reconstruction, filtering, and so on. In this way, K266Dec greatly reduces data dependencies and increases throughput.
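
To see why splitting CTU decoding into sub-stages can raise throughput, here is a minimal scheduling sketch. The stage names follow the text; the dependency model is an assumption made only for illustration: one worker per stage, each stage handles CTUs in order, and a CTU enters a stage only after it has left the previous one. Real VVC decoding has richer dependencies (spatial neighbors, reference frames), and the stage costs below are made-up units, not measurements.

```python
from typing import Sequence

STAGES = ("parsing", "mode_generation", "reconstruction", "filtering")


def serial_makespan(num_ctus: int, stage_cost: Sequence[float]) -> float:
    """Time for one thread to run every stage of every CTU back to back."""
    return num_ctus * sum(stage_cost)


def pipelined_makespan(num_ctus: int, stage_cost: Sequence[float]) -> float:
    """Finish time with one worker per stage under the assumed dependency model."""
    finish_prev = [0.0] * len(stage_cost)  # finish times of the previous CTU
    for _ in range(num_ctus):
        finish_cur = []
        ready = 0.0  # when this CTU left the previous stage
        for s, cost in enumerate(stage_cost):
            # Wait for this CTU's previous stage and for the worker to be free.
            start = max(ready, finish_prev[s])
            ready = start + cost
            finish_cur.append(ready)
        finish_prev = finish_cur
    return finish_prev[-1]


balanced = [1.0, 1.0, 1.0, 1.0]    # hypothetical equal stage costs
unbalanced = [2.0, 1.0, 3.0, 2.0]  # hypothetical uneven stage costs

for name, cost in (("balanced", balanced), ("unbalanced", unbalanced)):
    serial = serial_makespan(100, cost)
    piped = pipelined_makespan(100, cost)
    print(f"{name}: serial={serial}, pipelined={piped}, speed-up={serial / piped:.2f}")
```

In this toy model the speed-up approaches the number of stages only when the stages are balanced; the slowest stage bounds throughput otherwise. For comparison, the measured 3.45x on 4 threads quoted above corresponds to a parallel efficiency of about 86% (3.45 / 4).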

Experimental results

The bitstreams used in the experiments fall into two categories: JVET CTC sequences and UGC sequences from the Kwai app. The Kwai UGC sequences have a resolution of 720x1280 and all come from popular videos uploaded by Kwai app users, with view counts from 800k to 2M.

The JVET CTC bitstreams were encoded with VTM11.0, configured as follows:

  • Bit depth: 8 bit

  • ALF/CCALF disabled

  • All other settings follow the RA configuration file

The Kwai UGC bitstreams were generated with the following configuration:

  • Bit depth: 8 bit

  • Rate control with constant subjective quality, with an average bit rate below 2 Mbps

  • GEO / BDOF / SbTMVP / DMVR / LMCS / CIIP / MIP / JCCR / SBT / LFNST / MTS disabled

  • ALF/CCALF disabled

  • All other settings follow the RA configuration file

Android platform speed test

The following three Android devices were used to test the decoding speed of K266Dec:

High-end: Huawei P40

  • Chipset: Kirin 990 5G (7 nm)

  • CPU: Octa-core (2x2.86 GHz Cortex-A76 & 2x2.36 GHz Cortex-A76 & 4x1.95 GHz Cortex-A55)

  • Memory: 8GB RAM

Mid-range: Oppo R17

  • Chipset: Qualcomm Snapdragon 439 (10 nm)

  • CPU: Octa-core (4x1.95 GHz Cortex-A53 & 4x1.45 GHz Cortex-A53)

  • Memory: 4GB RAM

Low-end: Vivo Y93s

  • Chipset: MediaTek MT6762 Helio P22 (12 nm)

  • CPU: Octa-core 2.0 GHz Cortex-A53

  • Memory: 4GB RAM

Tables 1-3 compare the average decoding speed of K266Dec with that of VTM11.0. Compared with VTM11.0, K266Dec achieves a 3.89x speed-up for single-thread decoding on the Huawei P40.

Table 4 shows that, using a single ARM core on the Huawei P40, Oppo R17, and Vivo Y93, K266Dec decodes the UGC sequences at up to 97 fps, 41 fps, and 19 fps, respectively.
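
These single-core speeds can be turned into a rough estimate of how much of one core real-time playback would need. A minimal sketch, assuming a playback rate of 30 fps (the source does not state the frame rate of the UGC sequences) and assuming decoding time scales linearly with the number of frames:

```python
def core_share_for_realtime(decode_fps: float, playback_fps: float = 30.0) -> float:
    """Fraction of one core needed to decode at playback_fps (values above 1 mean
    a single core cannot keep up)."""
    if decode_fps <= 0:
        raise ValueError("decode_fps must be positive")
    return playback_fps / decode_fps


# Single-core UGC decoding speeds quoted in the text.
ugc_decode_fps = {"Huawei P40": 97, "Oppo R17": 41, "Vivo Y93": 19}

share = {device: core_share_for_realtime(fps) for device, fps in ugc_decode_fps.items()}
for device, value in share.items():
    print(f"{device}: {value:.0%} of one core at 30 fps")
```

Under the 30 fps assumption, the high-end and mid-range devices fit within a single core, while the low-end device would need more than one core or a lower frame rate.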

iOS platform speed test

The iOS test device is an iPhone 12 Pro Max, configured as follows:

  • Chipset: Apple A14 Bionic (5 nm)

  • CPU: Hexa-core (2x3.1 GHz Firestorm + 4x1.8 GHz Icestorm)

  • Memory: 6GB RAM

On the CTC bitstreams, K266Dec achieves a single-thread decoding speed-up of up to 4.66x.


Copyright notice
This article was written by [Dillon2015]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2021/12/20211201200513482x.html
