With the rise of the Internet, the form in which data is transmitted has kept evolving. The general trend looks like this:

Plain-text messages → pictures and text on QQ Space, Weibo, and WeChat Moments → WeChat voice messages → live-streaming apps → short-video platforms such as Douyin (TikTok) and Kuaishou.
Audio and video technology is expanding into all kinds of industries: distance teaching in education, face recognition in traffic, remote medical treatment, and so on. The field has come to occupy a very important position, yet there are few articles that genuinely introduce it, and a fresh graduate may find it hard to get started, because audio and video involve a lot of theory, and writing the code requires combining that theory. So understanding capture, encoding, decoding, and the other fundamentals really matters. I started working on audio and video projects as an intern and have read many people's articles; here I summarize them into one easy-to-understand piece, so that students preparing to learn audio and video can get started faster.
The theory in this article is distilled from many audio and video articles and from the basic principles of audio and video coding, along with some parts I have summarized and added myself. If you find mistakes, please leave a comment; they will be corrected after checking.
To keep things from feeling too abstract, the author spent three months writing demos covering the most commonly used and most important functions, combining theory with practice; they are best read alongside this article. Click the links below to study each chapter. Each link contains the article and its GitHub address; every demo has been personally tested and can be downloaded and run.
If you like it, please give it a thumbs-up; when reprinting, please attach a link to the original.
Whether on the iOS platform or the Android platform, we need the official APIs to implement the related features. First we need to understand what we want: to begin with we need a phone, and the camera is an indispensable part of a smartphone, so we use some system APIs to obtain the video data captured by the physical camera and the audio data captured by the microphone.

Raw audio and video data are essentially just large blocks of bytes. The system wraps them in its own structures and usually hands them to us through callback functions. Once we have the audio and video data, we can apply whatever processing our project needs: for video, rotation, scaling, filters, skin smoothing, cropping, and so on; for audio, mono noise reduction, echo cancellation, muting, and so on.

After this custom processing, the raw data can be transmitted. A feature like live streaming sends the captured video data to a server so that all the fans can watch it on the web. Since transmission depends on the network, the huge raw data must be compressed before it can be carried; think of it as packing all your belongings into suitcases before moving house.
The encoded audio and video data are usually transmitted with the RTMP protocol, a protocol designed specifically for carrying audio and video. Because the various video data formats cannot be unified, we need a standard to serve as the transmission rule, and the protocol plays that role.
After the server receives the encoded data we sent, it needs to decode it back into raw data, because encoded data sent straight to a physical hardware device cannot be played directly; only after decoding into raw data can it be used.
Audio and video synchronization

Every decoded audio and video frame carries the timestamp set when recording began, and we must play frames according to those timestamps. However, some data may be lost in network transmission or arrive late, so we need strategies to keep audio and video in sync. Roughly, they include buffering a certain amount of video data, syncing video to audio, and so on.
The push-stream and pull-stream process
- Pushing a stream: transmit the video data captured by the phone to a player on the other end for display; the playback end can be Windows, Linux, or the web. That is, the phone acts as the capture device: the video from the phone's camera and the audio from its microphone are combined, encoded, and transmitted to the player on the corresponding platform.
- Pulling a stream: play video data from a remote source on the phone, the inverse of pushing. The video data transmitted from the Windows, Linux, or web end is decoded and handed to the corresponding audio and video hardware; finally the video is rendered and played on the phone's screen.
The push-stream flow is as follows:

The pull-stream flow is as follows:

Pushing and pulling a stream are really inverse processes. Here we will start from capture.
Capture is the first step of pushing a stream; it is the source of the raw audio and video data. The captured raw audio is PCM data, and the raw video is YUV, RGB, or similar.
1.1. Audio capture

Audio capture sources:

- The built-in microphone
- External devices with microphone capability (cameras, microphones, ...)
- The system's photo album
Main audio parameters:

- Sampling rate (sample rate): the number of samples taken per second when digitizing the analog signal. The higher the sampling rate, the more data and the better the sound quality.
- Number of channels (channels): mono or stereo. (The iPhone cannot capture two channels directly, but stereo can be simulated by duplicating the captured mono data; some Android models can capture stereo.)
- Bit depth: the size of each sample. The more bits, the finer the representation and the better the sound quality; usually 8-bit or 16-bit.
- Data format: the raw data captured by iOS devices is linear PCM audio.
- Other: you can also configure the precision of sample values, how many frames each packet contains, how many bytes each frame occupies, and so on.
Audio differs from video: every video frame is a picture, while audio is a continuous stream with no inherent concept of a frame. In practice, for convenience, a span of 2.5 ms to 60 ms of data is treated as one audio frame.
Data rate (bytes/second) = (sampling rate (Hz) × bit depth (bits) × number of channels) / 8

Mono has 1 channel and stereo has 2. B stands for byte; 1 MB = 1024 KB = 1024 × 1024 B.
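As a quick check of the formula, here is a minimal sketch (the function name and sample values are only illustrative):

```python
def audio_data_rate(sample_rate_hz, bit_depth, channels):
    """Raw PCM data rate in bytes per second, per the formula above."""
    return sample_rate_hz * bit_depth * channels // 8

# 44.1 kHz 16-bit stereo (CD quality) vs. 48 kHz 16-bit stereo
cd_rate = audio_data_rate(44100, 16, 2)       # 176400 bytes per second
capture_rate = audio_data_rate(48000, 16, 2)  # 192000 bytes per second
```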
1.2. Video capture

Video capture sources:

- Screen recording
- External devices with camera capture capability (cameras, DJI drones, OSMO, ...)
- The system's photo album
Note: some external cameras can also be used. For example, you can capture with a dedicated camera and then use the phone to process, encode, and send the data. But you have to work out the data path yourself: the camera's HDMI output is converted to an Ethernet port, the Ethernet port is converted to USB, USB is converted to Apple's Lightning connector, and then FFmpeg can be used to obtain the data.
Main video parameters:

- Image format: YUV or RGB (red, green, and blue mixed to form the various colors).
- Resolution: up to the maximum resolution supported by the current device.
- Frame rate: the number of frames captured per second.
- Other: white balance, focus, exposure, flash, and so on.
Calculation (RGB):

Data per frame = resolution (width × height) × bytes per pixel (usually 3 bytes)

Note that this calculation is not the only one, because there are many video data formats. For YUV420, for example, the calculation is resolution (width × height) × 3/2.
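The two formulas can be compared directly with a small sketch (values are for a 1080p frame; the function names are illustrative):

```python
def rgb_frame_bytes(width, height, bytes_per_pixel=3):
    """Raw RGB frame: one sample of (usually) 3 bytes per pixel."""
    return width * height * bytes_per_pixel

def yuv420_frame_bytes(width, height):
    """YUV420: a full-resolution Y plane plus quarter-resolution U and V
    planes, i.e. 1.5 bytes per pixel on average."""
    return width * height * 3 // 2

rgb = rgb_frame_bytes(1920, 1080)     # 6220800 bytes per frame
yuv = yuv420_frame_bytes(1920, 1080)  # 3110400 bytes per frame
```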
1.3. Summary

Suppose the video to upload is 1080p at 30 fps (resolution 1920×1080) and the audio is 48 kHz. Then the data produced per second is:

video = 1920 * 1080 * 30 * 3 = 186624000 B ≈ 186.6 MB; audio = (48000 * 16 * 2) / 8 = 192000 B ≈ 0.19 MB

From this we can see that if the raw captured data were transmitted directly, a single movie would need more than 1000 GB of video. That would be terrible, which is why the later encoding step exists.
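These figures can be verified with a few lines of arithmetic (decimal megabytes/gigabytes are used here, matching the numbers above; the 90-minute movie length is just an assumed example):

```python
# Raw data rates for 1080p30 RGB video and 48 kHz 16-bit stereo audio.
video_bps = 1920 * 1080 * 3 * 30   # bytes per second of raw video
audio_bps = 48000 * 16 * 2 // 8    # bytes per second of raw audio

# A hypothetical 90-minute movie stored uncompressed, in decimal GB:
movie_gb = (video_bps + audio_bps) * 90 * 60 / 1e9
```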
Further reading (to be added)
From the previous step we have the captured raw audio and raw video. On mobile, these are usually obtained through each platform's APIs; the links above show the implementations.

Next, we can process the raw data. These operations can only be applied before encoding, because encoded data is only suitable for transmission. For example, we can apply:

- Skin smoothing (beautification) to video
- Echo cancellation to audio
- Noise reduction to audio
There are many popular frameworks dedicated to video and audio processing, such as OpenGL, OpenAL, and GPUImage, and open-source libraries exist for all of the processing above. The basic flow is: we take the raw audio or video frame data, feed it into the open-source library, and after processing we get the processed audio and video back and continue our own pipeline. Of course, many open-source libraries still need to be slightly modified and wrapped to fit a project's needs.
3.1. Why encode

As computed at the end of the capture step, raw video produces nearly 200 MB every second. If we transmitted the raw data directly, the network bandwidth and memory consumption would be enormous, so video must be encoded before transmission.

A similar example is moving house: if you carry things as they are, everything is scattered and takes many trips; if you fold your clothes and pack your things, a few suitcases do it all at once. When you arrive at the new home, you take things out and arrange them again. Encoding and decoding work on the same principle.
3.2. Lossy compression vs. lossless compression

Lossy compression:

- Video exploits the characteristics of human vision, trading a certain amount of objective distortion for data compression. For example, the human eye has thresholds for brightness recognition and visual perception, and different sensitivity to luminance and chrominance, so an appropriate amount of error can be introduced during encoding without being noticed.
- Audio exploits the fact that humans are insensitive to certain frequency components of sound, allowing information to be lost during compression; it works by removing redundant components from the sound. Redundant components are signals in the audio that the human ear cannot detect and that contribute nothing to the perception of timbre, pitch, and so on. The reconstructed data differ from the original data, but this does not affect how people understand the information the original expressed.

Lossy compression is suitable when the reconstructed signal does not have to be exactly the same as the original.

Lossless compression:

- Video has spatial redundancy, temporal redundancy, structural redundancy, entropy redundancy, and so on; that is, there is strong correlation between the pixels of an image, and eliminating these redundancies does not lose information.
- Lossless audio compression exploits the statistical redundancy of the data; the original data can be recovered completely without any distortion, but the compression ratio is limited by the theory of statistical redundancy, usually 2:1 to 5:1.

Thanks to these compression methods, the amount of video data can be greatly reduced, which benefits transmission and storage.
3.3. Video coding
Principle: how does encoding shrink such a large amount of data? The main ideas are:

- Spatial redundancy: adjacent pixels of an image are strongly correlated
- Temporal redundancy: adjacent images in a video sequence have similar content
- Coding redundancy: different pixel values occur with different probabilities
- Visual redundancy: the human visual system is insensitive to certain details
- Knowledge redundancy: regular structure can be inferred from prior and background knowledge
Compression coding methods

Transform coding (a general understanding is enough; search for details)

The image signal described in the spatial domain is transformed into the frequency domain, and the transform coefficients are then encoded. Generally speaking, images have strong spatial correlation, and transforming to the frequency domain removes that correlation and concentrates the energy. Commonly used orthogonal transforms are the discrete Fourier transform, the discrete cosine transform, and so on.
Entropy coding (a general understanding is enough; search for details)

Entropy coding is so named because the average code length after encoding approaches the entropy of the source. It is implemented with variable-length coding (VLC, Variable Length Coding). The basic principle is to assign short codes to source symbols with high probability and long codes to symbols with low probability, giving a statistically shorter average code length. Common variable-length codes include Huffman coding, arithmetic coding, and run-length coding.
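To make the idea concrete, here is a minimal Huffman-coding sketch in Python. It illustrates the principle only (frequent symbols get shorter codes); it is not the exact scheme any real codec uses:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code table: frequent symbols get shorter codes."""
    freq = Counter(data)
    # Heap entries: (frequency, tiebreaker, node). Leaves are symbols;
    # internal nodes are (left, right) tuples.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate: one distinct symbol
        return {heap[0][2]: "0"}
    i = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (left, right)))
        i += 1
    codes = {}
    def walk(node, prefix):                 # assign 0/1 along the tree paths
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

data = "aaaabbbcc d"                        # 'a' is most frequent
codes = huffman_codes(data)
encoded = "".join(codes[ch] for ch in data) # far fewer bits than 8 per symbol
```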
- Motion estimation and motion compensation (important)

Motion estimation and motion compensation are effective ways to eliminate the temporal correlation of an image sequence. The transform coding and entropy coding described above operate within a single frame and eliminate the spatial correlation between its pixels. But besides spatial correlation, image signals also have temporal correlation. For example, in digital video with a news-broadcast-like background, where the main subject of the picture moves little, the difference between successive pictures is very small and the correlation between them is large. In such cases there is no need to encode each frame independently; only the parts that change between adjacent video frames need to be encoded, further reducing the amount of data. This work is done by motion estimation and motion compensation.
a. Motion estimation

The current input image is divided into small, non-overlapping sub-blocks. For example, a 1280×720 image is first divided into 80×45 non-overlapping blocks of 16×16 pixels; then, for each block, the most similar image block is searched for within a search window of the previous (or the following) image. This search process is called motion estimation.

b. Motion compensation

By computing the positional offset between the most similar block and the current block, we obtain a motion vector. The block in the current image can then be subtracted from the most similar block that the motion vector points to in the reference image, giving a residual block. Because every pixel value in the residual block is very small, a higher compression ratio can be achieved when it is encoded.
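The search step can be sketched as a brute-force block match using the sum of absolute differences (SAD). This is a toy illustration on tiny frames, with assumed function names; real encoders use much faster search patterns:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
                          for a, b in zip(ra, rb))

def get_block(frame, y, x, size):
    return [row[x:x + size] for row in frame[y:y + size]]

def motion_search(ref, cur, y, x, size, radius):
    """Full search in a (2*radius+1)^2 window around (y, x) in the
    reference frame; returns the motion vector (dy, dx) with minimal SAD."""
    target = get_block(cur, y, x, size)
    best, best_cost = (0, 0), float("inf")
    h, w = len(ref), len(ref[0])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ry, rx = y + dy, x + dx
            if 0 <= ry <= h - size and 0 <= rx <= w - size:
                cost = sad(get_block(ref, ry, rx, size), target)
                if cost < best_cost:
                    best_cost, best = cost, (dy, dx)
    return best, best_cost

# Toy frames: the current frame is the reference shifted right by 2 pixels,
# so the best match for a block should be 2 pixels to the left.
ref = [[(r * 16 + c * 7) % 256 for c in range(16)] for r in range(16)]
cur = [[row[max(c - 2, 0)] for c in range(16)] for row in ref]
mv, cost = motion_search(ref, cur, 4, 4, 4, 3)
```

With a perfect match, the residual block is all zeros, which is exactly why the residual compresses so well.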
Compressed data types

Because of motion estimation and motion compensation, the encoder classifies each input frame into three types according to its reference images: I-frames, P-frames, and B-frames.

- I-frame: encoded using only the data within the frame itself; no motion estimation or motion compensation is needed.
- P-frame: encoded using motion compensation from a previous I-frame or P-frame as the reference image; in effect it encodes the difference between the current image and the reference image.
- B-frame: predicted from both a previous I-frame or P-frame and a following I-frame or P-frame. So a P-frame uses one image as its reference, while a B-frame needs two.
In practice, hybrid coding is used (transform coding + motion estimation and motion compensation + entropy coding).

After decades of development, encoders have become very powerful and come in many varieties. Here are some of the most popular ones.
- H.264/AVC: compared with older standards, it can deliver high-quality video at lower bandwidth (in other words, half the bandwidth of MPEG-2, H.263, or MPEG-4 Part 2, or less) without adding so much design complexity that implementation becomes impossible or too costly.
- H.265/HEVC: High Efficiency Video Coding (HEVC) is a video compression standard regarded as the successor to ITU-T H.264/MPEG-4 AVC. HEVC aims not only to improve video quality but also to double the compression ratio of H.264/MPEG-4 AVC (equivalently, a 50% lower bit rate at the same picture quality).
- VP8: an open video compression format, originally developed by On2 Technologies and later released by Google.
- VP9: in development since the third quarter of 2011, with the goal of cutting file size by 50% compared with VP8 at the same picture quality; another goal is to surpass HEVC in coding efficiency.
3.4. Audio encoding
Digital audio compression coding compresses the audio signal as much as possible while ensuring that, to the ear, it is not distorted. It works by removing redundant components from the sound. Redundant components are signals in the audio that the human ear cannot detect and that contribute nothing to the perception of timbre, pitch, and so on.

Redundant signals include audio outside the human hearing range and masked audio signals. For example, the human ear can detect sound frequencies of roughly 20 Hz to 20 kHz; anything outside this range is imperceptible and therefore redundant. In addition, according to the physiology and psychology of human hearing, when a strong tone and a weak tone occur at the same time, the weak one is masked by the strong one; the masked weak tone can be treated as a redundant signal that need not be transmitted. This is the masking effect of human hearing.
Compression coding methods

- Frequency-domain masking

When the energy of a sound at some frequency falls below a certain threshold, the human ear cannot hear it; this threshold is called the minimum audible threshold. When another, more energetic sound is present, the threshold near that sound's frequency rises considerably; this is the masking effect. The human ear is most sensitive to sounds between 2 kHz and 5 kHz and responds poorly to very low or very high frequencies; when a 0.2 kHz tone at an intensity of 60 dB is present, the threshold near it rises a great deal.
- Time-domain masking

When a strong signal and a weak signal occur close together in time, there is also a time-domain masking effect, which comes in three kinds: pre-masking, simultaneous masking, and post-masking.

- Pre-masking: a weak signal present during the short time before the ear hears a strong signal is masked and not heard.
- Simultaneous masking: when a strong signal and a weak signal are present at the same time, the weak one is masked by the strong one and cannot be heard.
- Post-masking: after the strong signal disappears, it takes some time before the weak signal can be heard again. All of these masked weak signals can be treated as redundant.
4. Encapsulate encoded data
Encapsulation (muxing) takes the audio and video produced by the encoder and combines them, synchronized, into something we can watch and listen to, with picture and sound in step. In other words, encapsulation produces a container that stores the audio and video streams along with other information (such as subtitles and metadata).
- AVI (.avi): its advantage is good image quality; because it can be lossless, AVI can preserve an alpha channel, so it is often used. Its drawbacks are many: files are too large and, worse, the compression standards used inside it are not unified.
- MOV (.mov): a video format developed by Apple; the default player is Apple's QuickTime.
- RealVideo (.rm, .rmvb): chooses different compression ratios according to the network transmission rate, enabling real-time transfer and playback of video over low-speed networks.
- Flash Video (.flv): a popular web video container format that grew out of Adobe Flash. With the rise of video websites, this format became very popular.
4.3. Muxing encoded data into a stream

On mobile we need the help of the FFmpeg framework. As mentioned above, FFmpeg can not only encode and decode but also mux video streams, such as the commonly used .flv and .asf streams.

Finally, the muxed data can be written to a file or sent out over the network.
Addendum: FFmpeg (a must-learn framework)

FFmpeg is an open-source framework that can record, convert, and stream audio and video in many formats. It includes libavcodec, a codec library for audio and video used by many projects, and libavformat, a library for audio and video container formats.

It currently supports the three mainstream platforms Linux, macOS, and Windows, and can also be compiled for Android or iOS. On macOS you can install it with: brew install ffmpeg --with-libvpx --with-libvorbis --with-ffplay
4.4. FLV stream overview

FLV (Flash Video) is a popular streaming media container format designed and developed by Adobe, and it is widely used for video on the Internet. Its lightweight file size and simple encapsulation make it well suited to the web, and it can be played with Flash Player, whose plug-in was installed in most browsers around the world, making FLV playback in web pages very easy. Mainstream video sites such as YouTube, Youku, Tudou, and LeTV all use the FLV format. The file suffix of the FLV container is usually ".flv".
An FLV file consists of a File Header and a File Body, and the body is made up of a sequence of Tags, so an FLV file has the structure shown in Figure 1. Each Tag is preceded by a Previous Tag Size field, which records the size of the previous Tag. A Tag's type can be video, audio, or script, and each Tag contains only one of these three kinds of data. Figure 2 shows the detailed structure of an FLV file.
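The header layout can be seen in a short parsing sketch (per the FLV specification: a 3-byte "FLV" signature, a version byte, a flags byte where bit 0 marks video and bit 2 marks audio, and a 32-bit big-endian header size, normally 9; the sample bytes below are handcrafted for illustration):

```python
import struct

def parse_flv_header(data):
    """Parse the 9-byte FLV file header plus the first PreviousTagSize field."""
    if data[:3] != b"FLV":
        raise ValueError("not an FLV file")
    version = data[3]
    flags = data[4]
    header_size = struct.unpack(">I", data[5:9])[0]
    # PreviousTagSize0 sits right after the header and is always 0,
    # since there is no Tag before the first one.
    prev_tag_size0 = struct.unpack(">I", data[9:13])[0]
    return {
        "version": version,
        "has_audio": bool(flags & 0x04),
        "has_video": bool(flags & 0x01),
        "header_size": header_size,
        "prev_tag_size0": prev_tag_size0,
    }

# A handcrafted header: FLV version 1, audio + video present (flags 0x05).
sample = b"FLV\x01\x05\x00\x00\x00\x09" + b"\x00\x00\x00\x00"
info = parse_flv_header(sample)
```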
5. Transmit the data with the RTMP protocol

Characteristics of RTMP:

- Good CDN support; mainstream CDN vendors support it
- A simple protocol, easy to implement on every platform
- Based on TCP, so transmission cost is high; the problem is significant under weak networks with high packet loss
- Publishing from a browser is not supported
- A proprietary Adobe protocol that Adobe no longer updates
The media stream we push out must be delivered to the audience, and the whole link in between is the transmission network.

RTMP is an application-layer protocol in the five-layer TCP/IP architecture of the Internet. The basic unit of data in RTMP is called a message (Message). When RTMP transmits data over the network, messages are split into smaller units called chunks (Chunk).

1. Messages

The message is the basic data unit of the RTMP protocol. Different kinds of messages carry different Message Type IDs, representing different functions. RTMP defines more than ten message types, each playing a different role.
- Message Type IDs 1-7 are protocol control messages, used by the RTMP protocol itself for management; users generally do not need to touch their data
- Messages with Message Type ID 8 or 9 carry audio data and video data respectively
- Messages with Message Type IDs 15-20 carry AMF-encoded commands responsible for the interaction between client and server, such as play and pause
- The message header (Message Header) has four parts: the Message Type ID indicating the message type, the Payload Length indicating the message length, the Timestamp, and the Stream ID identifying the media stream the message belongs to
2. Chunks

When transmitting data over the network, messages need to be split into smaller chunks suited to the network environment. The RTMP specification states that when a message is transmitted over the network, it is split into chunks (Chunk).

The chunk header (Chunk Header) has three parts:

- The Chunk Basic Header, which identifies the chunk
- The Chunk Message Header, which identifies the message this chunk's payload belongs to
- The Extended Timestamp, used when the timestamp overflows
3. Chunking a message

When a message is divided into chunks, the message payload (Message Body) is split into blocks of a fixed size (128 bytes by default; the last block may be smaller), and a chunk header (Chunk Header) is prepended to each, forming the chunks. The chunking process is shown in Figure 5, where a 307-byte message is split into 128-byte chunks (except for the last one).
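The payload-splitting step from the figure can be sketched in a few lines (the real protocol also prepends a chunk header to each piece, which is omitted here):

```python
def chunk_message(payload, chunk_size=128):
    """Split an RTMP message payload into pieces of at most chunk_size bytes.
    (Real RTMP prepends a Chunk Header to each piece; omitted here.)"""
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]

message = bytes(307)                 # a 307-byte payload, as in the example above
chunks = chunk_message(message)
sizes = [len(c) for c in chunks]     # [128, 128, 51]
```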
When RTMP transmits media data, the sender first encapsulates the media data into messages, then splits the messages into chunks, and finally sends the chunks out over TCP. After receiving the data over TCP, the receiver first reassembles the chunks into messages and then recovers the media data by unpacking the messages.
4. The logical structure of RTMP

The RTMP specification says that playing a media stream has two prerequisites:

- First, establish a network connection (NetConnection)
- Second, establish a network stream (NetStream)

The network connection represents the underlying connection between the server-side application and the client; the network stream represents the channel through which multimedia data is sent. Only one network connection can be established between server and client, but many network streams can be created on top of that connection. Their relationship is shown in the figure:
5. The connection process

Playing an RTMP media stream goes through the following steps:

- Handshake and establishing the connection
- Establishing the stream
- Playing

The RTMP connection begins with a handshake. The connection phase establishes the "network connection" between client and server; the stream phase establishes the "network stream"; the playback phase transmits the video and audio data.
6. Parse and decode the video stream
Further reading:

- iOS: complete pull-stream parsing and decoding of a file, with synchronized rendering of audio and video streams
- Parsing video data with FFmpeg
- iOS: video hard decoding with FFmpeg
- iOS: video hard decoding with VideoToolbox
- iOS: audio hard decoding with FFmpeg
- iOS: audio decoding with native APIs
So far, the complete push-stream process has been covered; what follows is the reverse process, pulling a stream.

The receiver obtains the encoded video stream and ultimately wants to render the video on the screen and play the audio through output devices such as the speaker. Following the steps above, the receiving end can obtain the video stream data over the RTMP protocol; then we need FFmpeg to parse the data, because the audio and video in it must be separated, and after separating them, each must be decoded. The decoded video is in a format such as YUV or RGB, and the decoded audio is linear PCM data.

Note that the decoded data cannot be used directly. If a mobile device wants to play it, the decoded data must be placed in a specific data structure: on iOS, video data needs to go into a CMSampleBufferRef, a structure composed of CMTime, CMVideoFormatDescription, and CMBlockBuffer, so we must supply the information it needs to assemble a format the system can play.
7. Audio and video synchronization and playback

Further reading
Once we have the decoded audio and video frames, the first thing to consider is how to synchronize them. Under normal network conditions no extra synchronization is required, because the parsed audio and video data already carry the timestamps set at capture time; as long as the frames arrive within a reasonable time, sending them to the screen and the speaker respectively yields synchronized playback. But given network fluctuations, some frames may be lost or arrive late, and when that happens sound and picture drift apart, so audio and video synchronization is necessary.

Think of it this way: an ant (the video) walks along a ruler following a pacesetter (the audio). The pacesetter moves at a constant speed, while the ant is sometimes fast and sometimes slow: if it falls behind, we prod it to run; if it gets ahead, we pull it back. That keeps audio and video in sync. The biggest difficulty here is that audio is uniform while video is nonlinear.
After obtaining the PTS of the audio and the video, we have three options: sync video to audio (compute the difference between the audio and video PTS to decide whether the video is late), sync audio to video (adjust the audio samples according to the PTS difference, i.e. change the size of the audio buffer), or sync both to an external clock. Because adjusting the audio over too wide a range produces harsh sounds that annoy users, we usually choose the first.

Our strategy is to predict the PTS of the next frame by comparing the previous PTS with the current one, while synchronizing the video to the audio. We maintain an audio clock as an internal variable tracking how far audio playback has progressed, and the video thread uses this value to decide whether the video is running fast or slow.

Now suppose we have a get_audio_clock function that returns the audio clock. With that value in hand, what do we do when audio and video are out of sync? Simply trying to jump to the correct packet is not a good solution. Instead we adjust the timing of the next refresh: if the video is behind we refresh sooner, and if it is ahead we refresh later. Having adjusted the refresh time, we compare it with the device clock via frame_timer, which accumulates all the delays we compute during playback; in other words, frame_timer is the point in time when the next frame should be shown. We simply add the newly computed delay to frame_timer, compare it with the system time, and use the result as the interval before the next refresh.
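The delay adjustment described above can be sketched as follows. This is a simplified model of the strategy (the function name, the threshold value, and the sample timestamps are assumptions for illustration), not a drop-in player implementation:

```python
def compute_refresh_delay(video_pts, last_pts, last_delay, audio_clock,
                          sync_threshold=0.01):
    """Estimate how long to wait before showing the next video frame,
    nudging video toward the audio clock. All times are in seconds."""
    delay = video_pts - last_pts      # predicted inter-frame delay
    if delay <= 0 or delay >= 1.0:
        delay = last_delay            # implausible PTS step: reuse previous delay
    diff = video_pts - audio_clock    # > 0: video is ahead; < 0: video is behind
    threshold = max(delay, sync_threshold)
    if diff <= -threshold:
        delay = 0.0                   # video is late: show the frame immediately
    elif diff >= threshold:
        delay = 2 * delay             # video is early: wait longer
    return delay

# Video frame due at t=1.00 s, previous frame at t=0.96 s (25 fps),
# while the audio clock already reads 1.20 s, so the video is late.
d = compute_refresh_delay(1.00, 0.96, 0.04, 1.20)
```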
This article is reproduced from https://juejin.cn/post/6844903889007820813; if there is any infringement, please get in touch and it will be removed.