A 5-Minute Introduction to the MP4 File Format

2020-12-08 08:51:42 Program ape card

Preface

This article covers what MP4 is, the basic structure of an MP4 file, the basic structure of a box, the common and important boxes, the differences between regular MP4 and fMP4, and how to parse an MP4 file in code.

Background: lately I have often been answering teammates' questions about live streaming and short video, such as "how does flv.js work", "why won't the browser play the mp4 file the designer gave me when it plays fine locally", and "MP4 has good compatibility, so can it be used for live streaming".

Answering these questions often involves explaining the MP4 format. I had studied it briefly before and taken notes, so I am tidying them up here to serve as a reference document for the team. If you spot any errors, please point them out.

What is MP4

First, a word about container formats. A multimedia container format (also called a packaging format) puts video data, audio data, and so on into a single file according to a set of rules. The common MKV and AVI, and the MP4 covered in this article, are all container formats.

MP4 is one of the most common container formats and is widely used because it works across platforms. MP4 files use the .mp4 extension, and nearly all mainstream players and browsers support the format.

The MP4 file format is defined mainly by two parts of the standard, MPEG-4 Part 12 and MPEG-4 Part 14. MPEG-4 Part 12 defines the ISO base media file format, which stores time-based media content. MPEG-4 Part 14 defines the MP4 file format itself as an extension of MPEG-4 Part 12.

Anyone working on live streaming or audio and video should understand the MP4 format, so here is a brief introduction.

MP4 File Format Overview

An MP4 file is made up of boxes. Each box stores different information, and the boxes form a tree structure, as shown in the figure below.

There are many box types. Here are 3 of the more important top-level boxes:

  • ftyp: File Type Box, describes the specification and version the MP4 file conforms to;
  • moov: Movie Box, the media metadata; there is exactly one;
  • mdat: Media Data Box, stores the actual media data; there can be more than one.

Although there are many box types, they all share the same basic structure. The next section starts with the box structure and then explains the common boxes in more detail.

The table below lists the common boxes. Skim it for a general impression, then move on to the next section.

MP4 Box Overview

A box consists of two parts: the box header and the box body.

  1. box header: the box's metadata, such as the box type and box size.
  2. box body: the box's data. What is actually stored depends on the box type; for example, the body of mdat stores media data.

In the box header, only type and size are required fields. When size==1, a largesize field is present. Some boxes also carry version and flags fields; such a box is called a Full Box. A box whose body nests other boxes is called a container box.

Box Header

The fields are defined as follows:

  • type: the box type, either a predefined type or a custom extension type; 4 bytes;
    • Predefined types: ftyp, moov, mdat, and so on;
    • Custom extension types: type==uuid marks a custom extension type. The 16 bytes after size (or largesize) hold the value of the custom type (extended_type);
  • size: the size of the whole box including the box header, in bytes. The values 0 and 1 need special handling:
    • size equal to 1: the box size is given by the largesize field that follows (generally only an mdat box holding media data uses largesize);
    • size equal to 0: the box extends to the end of the file, so it is the last box in the file; this usually applies to an mdat box;
  • largesize: the box size; 8 bytes;
  • extended_type: the custom extension type; 16 bytes;

The pseudocode for Box is as follows:

aligned(8) class Box (unsigned int(32) boxtype, optional unsigned int(8)[16] extended_type) {
    unsigned int(32) size;
    unsigned int(32) type = boxtype;
    if (size==1) {
        unsigned int(64) largesize;
    } else if (size==0) {
        // box extends to end of file
    }
    if (boxtype=='uuid') {
        unsigned int(8)[16] usertype = extended_type;
    } 
}
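
The header layout above can be read with a few lines of Python. This is a minimal sketch that assumes the whole box is already in memory; the function name read_box_header and the hand-built sample box are illustrative assumptions, not part of any library.

```python
import struct

def read_box_header(buf, offset=0):
    """Return (box_type, box_size, header_size) for the box at offset.

    A returned box_size of 0 means the box extends to the end of the file.
    """
    size, box_type = struct.unpack_from(">I4s", buf, offset)
    header_size = 8
    if size == 1:
        # size == 1: the real size is the 64-bit largesize after the type
        (size,) = struct.unpack_from(">Q", buf, offset + header_size)
        header_size += 8
    if box_type == b"uuid":
        # a 16-byte extended type follows size (or largesize)
        header_size += 16
    return box_type.decode("latin-1"), size, header_size

# Hand-built 16-byte 'free' box: 8-byte header plus 8 bytes of payload.
sample = struct.pack(">I4s", 16, b"free") + bytes(8)
print(read_box_header(sample))  # ('free', 16, 8)
```

The body of the box then spans from offset + header_size to offset + box_size.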

Box Body

The box data body. Its content differs from box to box, so you need to consult the definition of the specific box. Some box bodies are simple, such as ftyp. Others are more complex and may nest other boxes, such as moov.

Box vs FullBox

FullBox extends Box. Compared with Box, FullBox adds the version and flags fields.

  • version: the version of the current box, reserved for future extension; 1 byte;
  • flags: flag bits, 24 bits; their meaning is defined by each specific box;

The pseudocode for FullBox is as follows:

aligned(8) class FullBox(unsigned int(32) boxtype, unsigned int(8) v, bit(24) f) extends Box(boxtype) {
	unsigned int(8) version = v;
	bit(24) flags = f;
}

FullBox is used mainly by the boxes inside moov, such as moov.mvhd, which is introduced later.

aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
	// fields omitted...
}
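
As a small illustration, the two extra FullBox fields occupy the first 4 bytes of the box body and can be split apart as follows. The helper name read_fullbox_fields is an assumption made for this sketch.

```python
import struct

def read_fullbox_fields(body):
    """Split the first 4 bytes of a FullBox body into (version, flags)."""
    (word,) = struct.unpack_from(">I", body, 0)
    version = word >> 24      # top 8 bits
    flags = word & 0xFFFFFF   # low 24 bits
    return version, flags

# version 0 with flags 0x000007
print(read_fullbox_fields(bytes([0x00, 0x00, 0x00, 0x07])))  # (0, 7)
```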

ftyp(File Type Box)

ftyp indicates the specifications the current file conforms to. Before going into the details of ftyp, a quick primer on isom.

What is isom

isom (ISO Base Media file) is the base file format defined in MPEG-4 Part 12. Common container formats such as MP4, 3gp, and QT are all derived from it.

The specifications an MP4 file may conform to include mp41 and mp42, both of which are derived from isom.

3gp (3GPP): a container format used mainly on 3G mobile phones;
QT: short for QuickTime; a .qt file is an Apple QuickTime media file;

ftyp Definition

ftyp is defined as follows:

aligned(8) class FileTypeBox extends Box('ftyp') {
  unsigned int(32) major_brand;  
  unsigned int(32) minor_version;  
  unsigned int(32) compatible_brands[]; // to end of the box  
}  

First, a note on brand: it is the code that identifies a specific container format, expressed as a 4-byte code such as mp41.

A brand is a four-letter code representing a format or subformat. Each file has a major brand (or primary brand), and also a compatibility list of brands.

The fields of ftyp mean the following:

  • major_brand: for example the common isom, mp41, mp42, avc1, qt. It names the format that is the "best" basis for parsing the current file. For example, if major_brand is A and compatible_brands is A1, a decoder that supports both A and A1 should preferably decode the file according to A; a decoder that does not support A but does support A1 can decode it according to A1;
  • minor_version: supplementary information about major_brand, such as a version number. It must not be used to determine whether a media file conforms to a given standard or specification;
  • compatible_brands: the list of brands the file is compatible with. For example, mp41 lists isom as a compatible brand. A decoder that supports a brand in this list can decode part (or all) of the file;
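
A minimal Python sketch of reading these three fields from an ftyp body (the bytes after the 8-byte box header). The function name parse_ftyp and the sample bytes are assumptions for illustration.

```python
import struct

def parse_ftyp(body):
    """Parse the body of an ftyp box."""
    major_brand, minor_version = struct.unpack_from(">4sI", body, 0)
    # Everything after the first 8 bytes is a list of 4-byte brands.
    compatible = [body[i:i + 4].decode("ascii")
                  for i in range(8, len(body) - 3, 4)]
    return {
        "major_brand": major_brand.decode("ascii"),
        "minor_version": minor_version,
        "compatible_brands": compatible,
    }

# Hypothetical ftyp body: major brand mp42, three compatible brands.
body = b"mp42" + struct.pack(">I", 0) + b"mp42mp41isom"
info = parse_ftyp(body)
print(info["major_brand"], info["compatible_brands"])
```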

In practice, isom should not be used as the major_brand; a specific brand (such as mp41) should be used instead. For that reason, no file extension or MIME type is defined for isom.

Below are some common brands with their file extensions and MIME types; for more brands, you can refer to here.

Here is a screenshot of a real example; we won't go into detail.

About AVC/AVC1

In discussions of the MP4 specifications, AVC sometimes means the "AVC file format" and sometimes the "AVC compression standard (H.264)". Here is a quick distinction.

  • AVC file format: derived from the ISO base media file format and using the AVC compression standard. It can be regarded as an extension of MP4; its brand is usually avc1, and it is defined in MPEG-4 Part 15.
  • AVC compression standard (H.264): defined in MPEG-4 Part 10.
  • ISO base media file format (Base Media File Format): defined in MPEG-4 Part 12.

moov(Movie Box)

Movie Box stores the metadata of the mp4 file and is generally located at the beginning of the file.

aligned(8) class MovieBox extends Box('moov'){ }

Within moov, the two most important boxes are mvhd and trak:

  • mvhd: Movie Header Box, overall information about the mp4 file, such as creation time and file duration;
  • trak: Track Box. An mp4 file can contain one or more tracks (such as a video track and an audio track), and each track's information lives in a trak. trak is a container box that holds at least two boxes, tkhd and mdia;

mvhd describes the whole movie, tkhd a single track, mdhd the media, vmhd video, and smhd audio. Think of them as going from general to specific, with the former generally derived from the latter.

mvhd(Movie Header Box)

Overall information about the MP4 file that does not depend on any particular video or audio stream, such as creation time and file duration.

It is defined as follows:

aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
   if (version==1) {
      unsigned int(64)  creation_time;
      unsigned int(64)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(64)  duration;
   } else { // version==0
      unsigned int(32)  creation_time;
      unsigned int(32)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(32)  duration;
   }
   template int(32) rate = 0x00010000; // typically 1.0
   template int(16) volume = 0x0100; // typically, full volume
   const bit(16) reserved = 0;
   const unsigned int(32)[2] reserved = 0;
   template int(32)[9] matrix =
      { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 };
      // Unity matrix
   bit(32)[6]  pre_defined = 0;
   unsigned int(32)  next_track_ID;
}

The fields mean the following:

  • creation_time: when the file was created;
  • modification_time: when the file was last modified;
  • timescale: the number of time units in one second (an integer). For example, if timescale equals 1000, one second contains 1000 time units. Durations given later, such as a track's duration, are converted with this value: a track duration of 10,000 means an actual duration of 10,000/1000 = 10 s;
  • duration: the duration of the movie (an integer), derived from the tracks in the file; it equals the duration of the longest track;
  • rate: the recommended playback rate, a 32-bit integer whose high 16 bits and low 16 bits hold the integer part and the fractional part respectively ([16.16]). For example, 0x0001 0000 means 1.0, normal playback speed;
  • volume: the playback volume, a 16-bit integer whose high 8 bits and low 8 bits hold the integer part and the fractional part respectively ([8.8]). For example, 0x01 00 means 1.0, full volume;
  • matrix: the video transformation matrix; it can generally be ignored;
  • next_track_ID: a 32-bit integer, non-zero; it can generally be ignored. It gives a track ID usable when adding a new track to the movie, and must be larger than every track ID currently in use. In other words, when adding a new track you walk through all existing tracks to determine an available track ID;
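
The conversions described above (duration divided by timescale, 16.16 rate, 8.8 volume) can be sketched in Python as follows. The function name parse_mvhd and the hand-built version-0 body are assumptions for illustration; the byte offsets follow the mvhd definition given earlier.

```python
import struct

def parse_mvhd(body):
    """Parse an mvhd body (starting at the version/flags word)."""
    version = body[0]
    if version == 1:
        _, _, timescale, duration = struct.unpack_from(">QQIQ", body, 4)
        pos = 4 + 28
    else:
        _, _, timescale, duration = struct.unpack_from(">IIII", body, 4)
        pos = 4 + 16
    rate, volume = struct.unpack_from(">ih", body, pos)
    # skip rate(4) + volume(2) + reserved(2 + 8) + matrix(36) + pre_defined(24)
    (next_track_id,) = struct.unpack_from(">I", body, pos + 76)
    return {
        "timescale": timescale,
        "duration_seconds": duration / timescale,
        "rate": rate / 65536,    # 16.16 fixed point
        "volume": volume / 256,  # 8.8 fixed point
        "next_track_ID": next_track_id,
    }

# Hypothetical version-0 body: timescale 1000, duration 10000 units.
body = (struct.pack(">IIIII", 0, 0, 0, 1000, 10000)
        + struct.pack(">ih", 0x00010000, 0x0100) + bytes(10)
        + struct.pack(">9i", 0x10000, 0, 0, 0, 0x10000, 0, 0, 0, 0x40000000)
        + bytes(24) + struct.pack(">I", 3))
info = parse_mvhd(body)
print(info["duration_seconds"], info["rate"], info["volume"])  # 10.0 1.0 1.0
```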

tkhd(Track Header Box)

Metadata for a single track. It contains the following fields:

  • version: the version of the tkhd box;
  • flags: obtained by bitwise OR; the default value is 7 (0x000001 | 0x000002 | 0x000004), meaning the track is enabled, used in playback, and used in preview.
    • Track_enabled: value 0x000001, the track is enabled; when this bit is 0 (0x000000), the track is not enabled;
    • Track_in_movie: value 0x000002, the track is used during playback;
    • Track_in_preview: value 0x000004, the track is used in preview mode;
  • creation_time: when the track was created;
  • modification_time: when the track was last modified;
  • track_ID: the unique identifier of the track; it must not be 0 and must not repeat;
  • duration: the full duration of the track (divide by timescale to get the number of seconds);
  • layer: the stacking order of video tracks; smaller numbers are closer to the viewer, so 1 is above 2 and 0 is above 1;
  • alternate_group: the group ID of the track. Tracks with the same alternate_group value belong to the same group, and only one track in a group can be playing at any time. When alternate_group is 0, the track is not grouped with any other track. A group may also contain just one track;
  • volume: the volume of an audio track, between 0.0 and 1.0;
  • matrix: the video transformation matrix;
  • width, height: the width and height of the video;

It is defined as follows:

aligned(8) class TrackHeaderBox
  extends FullBox('tkhd', version, flags){
	if (version==1) {
	      unsigned int(64)  creation_time;
	      unsigned int(64)  modification_time;
	      unsigned int(32)  track_ID;
	      const unsigned int(32)  reserved = 0;
	      unsigned int(64)  duration;
	   } else { // version==0
	      unsigned int(32)  creation_time;
	      unsigned int(32)  modification_time;
	      unsigned int(32)  track_ID;
	      const unsigned int(32)  reserved = 0;
	      unsigned int(32)  duration;
	}
	const unsigned int(32)[2] reserved = 0;
	template int(16) layer = 0;
	template int(16) alternate_group = 0;
	template int(16) volume = {if track_is_audio 0x0100 else 0}; const unsigned int(16) reserved = 0;
	template int(32)[9] matrix= { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // unity matrix
	unsigned int(32) width;
	unsigned int(32) height;
}
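
A small Python sketch of interpreting two of these fields: the flags bitmask, and width/height, which the ISO base media file format stores as 16.16 fixed-point values. The names TRACK_FLAGS, decode_tkhd_flags, and fixed_16_16 are assumptions made for this example.

```python
TRACK_FLAGS = {
    0x000001: "track_enabled",
    0x000002: "track_in_movie",
    0x000004: "track_in_preview",
}

def decode_tkhd_flags(flags):
    """List the names of the flag bits that are set."""
    return [name for bit, name in TRACK_FLAGS.items() if flags & bit]

def fixed_16_16(value):
    """Convert a 16.16 fixed-point integer to a float."""
    return value / 65536

print(decode_tkhd_flags(7))     # all three flags set
print(fixed_16_16(0x07800000))  # 1920.0
```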

Examples are as follows :

hdlr(Handler Reference Box)

Declares the type of the current track and the corresponding handler.

The values of handler_type include:

  • vide (0x76 69 64 65): video track;
  • soun (0x73 6f 75 6e): audio track;
  • hint (0x68 69 6e 74): hint track;

name is a UTF-8 string describing the handler, for example L-SMASH Video Handler (see here).

aligned(8) class HandlerBox extends FullBox('hdlr', version = 0, 0) {
	unsigned int(32) pre_defined = 0;
	unsigned int(32) handler_type;
	const unsigned int(32)[3] reserved = 0;
   	string   name;
}

stbl(Sample Table Box)

The media data of an MP4 file lives in the mdat box, while stbl holds the index and timing information for that media data. Understanding stbl is essential for decoding and rendering MP4 files.

In an MP4 file, media data is divided into chunks. Each chunk can contain multiple samples, and each sample is made up of frames (usually 1 sample corresponds to 1 frame).

The key boxes in stbl are stsd, stco, stsc, stsz, stts, stss, and ctts. A brief overview comes first, followed by the details of each.

stco / stsc / stsz / stts / stss / ctts / stsd Overview

A brief introduction to these boxes:

  • stsd: the video and audio codec, width and height, volume, and similar information, plus how many frames each sample contains;
  • stco: the offset of each chunk in the file;
  • stsc: how many samples each chunk contains;
  • stsz: the size of each sample (in bytes);
  • stts: the duration of each sample;
  • stss: which samples are keyframes;
  • ctts: the time difference between decoding and rendering a frame, typically used when B-frames are present;

stsd(Sample Description Box)

stsd gives the description of the samples, including any initialization information needed at the decoding stage, such as the codec. The initialization information differs between video and audio; video is used as the example here.

The pseudocode is as follows:

aligned(8) abstract class SampleEntry (unsigned int(32) format) extends Box(format){
	const unsigned int(8)[6] reserved = 0;
	unsigned int(16) data_reference_index;
}

// Visual Sequences
class VisualSampleEntry(codingname) extends SampleEntry (codingname){ 
	unsigned int(16) pre_defined = 0;
	const unsigned int(16) reserved = 0;
	unsigned int(32)[3] pre_defined = 0;
	unsigned int(16) width;
	unsigned int(16) height;
	template unsigned int(32) horizresolution = 0x00480000; // 72 dpi 
	template unsigned int(32) vertresolution = 0x00480000; // 72 dpi 
	const unsigned int(32) reserved = 0;
	template unsigned int(16) frame_count = 1;
	string[32] compressorname;
	template unsigned int(16) depth = 0x0018;
	int(16) pre_defined = -1;
}

// Definitions of AudioSampleEntry and HintSampleEntry omitted


aligned(8) class SampleDescriptionBox (unsigned int(32) handler_type) extends FullBox('stsd', 0, 0){
	int i ;
	unsigned int(32) entry_count;
	for (i = 1 ; i <= entry_count ; i++) {
		switch (handler_type){
			case 'soun': // for audio tracks
				AudioSampleEntry();
				break;
			case 'vide': // for video tracks
				VisualSampleEntry();
				break;
			case 'hint': // Hint track
				HintSampleEntry();
				break;
		}
	}
}

In SampleDescriptionBox, the handler_type parameter is the type of the track (soun, vide, hint), and the entry_count variable is the number of sample description entries in the box.

In stsc, sample_description_index is an index into these sample descriptions.

Depending on handler_type, SampleDescriptionBox uses a different SampleEntry type; for a video track it is VisualSampleEntry.

VisualSampleEntry contains the following fields:

  • data_reference_index: the data of an MP4 file can be split into multiple segments, each with its own index and each fetched from its own URL; data_reference_index then points to the corresponding segment (rarely used);
  • width, height: the width and height of the video, in pixels;
  • horizresolution, vertresolution: horizontal and vertical resolution (pixels per inch), 16.16 fixed-point numbers; the default is 0x00480000 (72 dpi);
  • frame_count: how many frames one sample contains; for a video track the default is 1;
  • compressorname: an informative name, usually for display; it occupies 32 bytes, for example AVC Coding. The first byte gives the number of bytes N the name actually uses, bytes 2 through N+1 store the name, and bytes N+2 through 32 are padding. compressorname may be set to 0;
  • depth: the bit depth of the image; for example 0x0018 (24) means an image without an alpha channel;

In video tracks, the frame_count field must be 1 unless the specification for the media format explicitly documents this template field and permits larger values. That specification must document both how the individual frames of video are found (their size information) and their timing established. That timing might be as simple as dividing the sample duration by the frame count to establish the frame duration.

Examples are as follows :

stco(Chunk Offset Box)

The offset of each chunk in the file. There are two box types, stco for small files and co64 for large files. They have the same structure; only the field width differs.

chunk_offset is an offset within the file itself, not an offset inside any particular box.

When building an mp4 file, pay special attention to where moov is placed, because it affects the chunk_offset values. Some MP4 files have moov at the end of the file. To speed up the first frame, moov has to be moved to the front of the file, and chunk_offset must then be rewritten.

stco is defined as follows:

# Box Type: 'stco', 'co64'
# Container: Sample Table Box ('stbl')
# Mandatory: Yes
# Quantity: Exactly one variant must be present

aligned(8) class ChunkOffsetBox
	extends FullBox('stco', version = 0, 0) {
	unsigned int(32) entry_count;
	for (i=1; i <= entry_count; i++) {
		unsigned int(32)  chunk_offset;
	}
}

aligned(8) class ChunkLargeOffsetBox
	extends FullBox('co64', version = 0, 0) {
	unsigned int(32) entry_count;
	for (i=1; i <= entry_count; i++) {
		unsigned int(64)  chunk_offset;
	}
}

As the example below shows, the offset of the first chunk is 47564, the offset of the second chunk is 120579, and so on.
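
A minimal Python sketch of reading an stco body and of the offset rewrite needed when moov is moved in front of mdat. The function names and the 4238-byte shift (a hypothetical moov size) are assumptions for illustration.

```python
import struct

def parse_stco(body):
    """Parse an stco body: version/flags, entry_count, 32-bit offsets."""
    (entry_count,) = struct.unpack_from(">I", body, 4)
    return list(struct.unpack_from(f">{entry_count}I", body, 8))

def shift_offsets(offsets, delta):
    """Offsets are absolute file positions, so inserting delta bytes
    ahead of the media data moves every chunk by delta."""
    return [off + delta for off in offsets]

body = struct.pack(">II2I", 0, 2, 47564, 120579)
offsets = parse_stco(body)
print(offsets)                       # [47564, 120579]
print(shift_offsets(offsets, 4238))  # [51802, 124817]
```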

stsc(Sample To Chunk Box)

Samples are grouped into chunks. Chunks can differ in size, and the samples inside a chunk can differ in size.

  • entry_count: the number of table entries (each entry contains first_chunk, samples_per_chunk, and sample_description_index);
  • first_chunk: the index of the first chunk covered by the current entry;
  • samples_per_chunk: the number of samples in each chunk;
  • sample_description_index: the index of the sample description in stsd (see the stsd section);

aligned(8) class SampleToChunkBox
	extends FullBox('stsc', version = 0, 0) {
	unsigned int(32) entry_count;
	for (i=1; i <= entry_count; i++) {
		unsigned int(32) first_chunk;
		unsigned int(32) samples_per_chunk;
		unsigned int(32) sample_description_index;
	}
}

The description above is rather abstract, so here is an example. The table at the end of this section means:

  • chunks 1 to 15 each contain 15 samples;
  • chunk 16 contains 30 samples;
  • chunk 17 and all later chunks each contain 28 samples;
  • the samples in all of the chunks above use sample description index 1;

first_chunk  samples_per_chunk  sample_description_index
1            15                 1
16           30                 1
17           28                 1
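
The run-length table above can be expanded into a per-chunk sample count with a short Python sketch. The function name and the total of 18 chunks are assumptions; in a real file the chunk count would come from the entry_count of stco.

```python
def samples_per_chunk(entries, chunk_count):
    """Expand stsc entries into a list with one sample count per chunk.

    entries: (first_chunk, samples_per_chunk, sample_description_index)
    """
    counts = []
    for i, (first_chunk, per_chunk, _desc) in enumerate(entries):
        # A run lasts until the next entry's first_chunk, or the last chunk.
        if i + 1 < len(entries):
            next_first = entries[i + 1][0]
        else:
            next_first = chunk_count + 1
        counts.extend([per_chunk] * (next_first - first_chunk))
    return counts

entries = [(1, 15, 1), (16, 30, 1), (17, 28, 1)]
counts = samples_per_chunk(entries, chunk_count=18)
print(counts[14], counts[15], counts[16])  # 15 30 28
```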

stsz(Sample Size Boxes)

The size of each sample (in bytes). The sample_count field tells us how many samples (or frames) the current track contains.

There are two box types, stsz and stz2.

stsz:

  • sample_size: the default sample size (in bytes), usually 0. If sample_size is not 0, all samples have that same size. If sample_size is 0, the samples may differ in size;
  • sample_count: the number of samples in the track. If sample_size==0, sample_count equals the number of entries that follow;
  • entry_size: the size of a single sample (present when sample_size==0);

aligned(8) class SampleSizeBox extends FullBox('stsz', version = 0, 0) {
	unsigned int(32) sample_size;
	unsigned int(32) sample_count;
	if (sample_size==0) {
		for (i=1; i <= sample_count; i++) {
			unsigned int(32)  entry_size;
		}
	}
}

stz2:

  • field_size: the number of bits each entry_size in the entry table occupies; the allowed values are 4, 8, and 16. 4 is a special case: when field_size equals 4, one byte holds two entries, the high 4 bits being entry[i] and the low 4 bits entry[i+1];
  • sample_count: equals the number of entries that follow;
  • entry_size: the size of the sample.

aligned(8) class CompactSampleSizeBox extends FullBox('stz2', version = 0, 0) {
	unsigned int(24) reserved = 0;
	unsigned int(8) field_size;
	unsigned int(32) sample_count;
	for (i=1; i <= sample_count; i++) {
		unsigned int(field_size) entry_size;
	}
}
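
Taken together, stco, stsc, and stsz are enough to locate every sample in the file: start at a chunk's offset and add up the sizes of the samples that precede it in that chunk. A minimal Python sketch, with made-up numbers and an assumed function name:

```python
def sample_offsets(chunk_offsets, samples_in_chunk, sample_sizes):
    """Return the absolute file offset of every sample.

    chunk_offsets: from stco, one per chunk
    samples_in_chunk: per-chunk sample counts derived from stsc
    sample_sizes: from stsz, one per sample
    """
    offsets = []
    sample_index = 0
    for chunk_offset, count in zip(chunk_offsets, samples_in_chunk):
        pos = chunk_offset
        for _ in range(count):
            offsets.append(pos)
            pos += sample_sizes[sample_index]
            sample_index += 1
    return offsets

# Two chunks holding 2 and 1 samples respectively (hypothetical values).
print(sample_offsets([100, 400], [2, 1], [50, 70, 30]))  # [100, 150, 400]
```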

Examples are as follows :

stts(Decoding Time to Sample Box)

stts contains the mapping table from DTS to sample number and is used mainly to derive the duration of each frame.

aligned(8) class TimeToSampleBox extends FullBox('stts', version = 0, 0) {
	unsigned int(32)  entry_count;
	int i;
	for (i=0; i < entry_count; i++) {
		unsigned int(32)  sample_count;
		unsigned int(32)  sample_delta;
	}
}

  • entry_count: the number of entries in stts;
  • sample_count: within a single entry, the number of consecutive samples that share the same duration (sample_delta);
  • sample_delta: the duration of a sample (measured in timescale units)

For example, in the figure below entry_count is 3: the first 250 samples each have a duration of 1000, the 251st sample has a duration of 999, and samples 252 to 283 each have a duration of 1000.

Assuming timescale is 1000, the actual duration in seconds is the value divided by 1000.
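
The same example, worked through in Python: accumulating sample_delta gives the decoding time of every sample. The function name decode_times is an assumption for this sketch.

```python
def decode_times(entries, timescale):
    """Return the decoding time, in seconds, of every sample.

    entries: list of (sample_count, sample_delta) pairs from stts
    """
    times, t = [], 0
    for sample_count, sample_delta in entries:
        for _ in range(sample_count):
            times.append(t / timescale)
            t += sample_delta
    return times

entries = [(250, 1000), (1, 999), (32, 1000)]
times = decode_times(entries, timescale=1000)
print(len(times))   # 283
print(times[250])   # 250.0  (the 251st sample)
print(times[251])   # 250.999
```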

stss(Sync Sample Box)

The sample numbers of the keyframes in the mp4 file. If stss is absent, every sample is a keyframe.

  • entry_count: the number of entries, which can be regarded as the number of keyframes;
  • sample_number: the sample number of a keyframe (counting from 1);

aligned(8) class SyncSampleBox
   extends FullBox('stss', version = 0, 0) {
   unsigned int(32)  entry_count;
   int i;
   for (i=0; i < entry_count; i++) {
      unsigned int(32)  sample_number;
   }
}

In the example below, samples 1, 31, 61, 91, 121 ... 271 are keyframes.
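
One common use of this table is seeking: given a target sample, find the last keyframe at or before it. A minimal sketch using the keyframe list from the example; the function name nearest_sync_sample is an assumption.

```python
import bisect

def nearest_sync_sample(sync_samples, sample_number):
    """Return the last sync sample at or before sample_number (1-based),
    or None if there is none. sync_samples must be sorted ascending."""
    i = bisect.bisect_right(sync_samples, sample_number)
    return sync_samples[i - 1] if i else None

sync_samples = list(range(1, 272, 30))  # 1, 31, 61, ..., 271
print(nearest_sync_sample(sync_samples, 100))  # 91
print(nearest_sync_sample(sync_samples, 31))   # 31
```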

ctts(Composition Time to Sample Box)

The difference between decoding time (dts) and rendering time (pts).

For video with only I-frames and P-frames, the decoding order and rendering order are the same, so ctts does not need to exist.

For video with B-frames, ctts must be present. It is needed whenever PTS and DTS differ; the formula is CT(n) = DT(n) + CTTS(n).

aligned(8) class CompositionOffsetBox extends FullBox('ctts', version = 0, 0) {
   unsigned int(32) entry_count;
   int i;
   for (i=0; i < entry_count; i++) {
      unsigned int(32)  sample_count;
      unsigned int(32)  sample_offset;
   }
}
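
The formula CT(n) = DT(n) + CTTS(n) can be illustrated with a short Python sketch. The four-frame stream below (decode order I, P, B, B at 1000 ticks per frame) and the function name are hypothetical.

```python
def composition_times(dts, ctts_entries):
    """Apply CT(n) = DT(n) + CTTS(n) to every sample.

    dts: decoding time per sample
    ctts_entries: list of (sample_count, sample_offset) pairs from ctts
    """
    offsets = []
    for sample_count, sample_offset in ctts_entries:
        offsets.extend([sample_offset] * sample_count)
    return [d + o for d, o in zip(dts, offsets)]

dts = [0, 1000, 2000, 3000]                    # decode order: I P B B
ctts_entries = [(1, 1000), (1, 3000), (2, 0)]
pts = composition_times(dts, ctts_entries)
print(pts)  # [1000, 4000, 2000, 3000]

# Sorting sample indices by PTS gives the display order: I B B P.
print(sorted(range(len(pts)), key=pts.__getitem__))  # [0, 2, 3, 1]
```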

An example is shown below; we won't go into detail:

fMP4(Fragmented mp4)

fMP4 has the same basic file structure as regular mp4. Regular mp4 is used for on-demand scenarios, while fmp4 is usually used for live streaming.

They differ in the following ways:

  • a regular mp4 usually has a fixed duration and fixed content, whereas the duration and content of an fMP4 are usually not fixed, so it can be played while it is being generated;
  • in a regular mp4 the complete metadata is in moov, and the moov box must be loaded before the media data in mdat can be decoded and rendered;
  • in fMP4 the metadata of the media data is in moof boxes, and moof and mdat (usually) appear in pairs. moof contains the sample duration, sample size, and similar information, which is why fMP4 can be played while it is being generated;

For example, the top-level box structure of a regular mp4 and an fMP4 might look like the following. The output was printed by a small MP4 parsing tool the author wrote; the code is given at the end of the article.

// regular mp4
ftyp size=32(8+24) curTotalSize=32
moov size=4238(8+4230) curTotalSize=4270
mdat size=1124105(8+1124097) curTotalSize=1128375

// fmp4
ftyp size=36(8+28) curTotalSize=36
moov size=1227(8+1219) curTotalSize=1263
moof size=1252(8+1244) curTotalSize=2515
mdat size=65895(8+65887) curTotalSize=68410
moof size=612(8+604) curTotalSize=69022
mdat size=100386(8+100378) curTotalSize=169408

How do you tell whether an mp4 file is a regular mp4 or an fMP4? Generally you can check whether mvex (Movie Extends Box) is present.

mvex(Movie Extends Box)

The presence of mvex indicates that the file is fmp4 (though this is not a rigorous test). In that case the sample-related metadata is not in moov and has to be obtained by parsing the moof boxes.

The pseudocode is as follows:

aligned(8) class MovieExtendsBox extends Box('mvex'){ }
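
The mvex check can be sketched in Python by walking the top-level boxes and then the children of moov. This is not the author's parsing tool; iter_boxes, is_fragmented, and the tiny hand-built files are assumptions for illustration, and the check inherits the caveat above that it is not rigorous.

```python
import struct

def iter_boxes(buf, start=0, end=None):
    """Yield (type, body_start, body_end) for each box in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:            # 64-bit largesize follows the type
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif size == 0:          # box runs to the end of the data
            size = end - pos
        if size < header:
            break                # malformed box, stop walking
        yield box_type.decode("latin-1"), pos + header, pos + size
        pos += size

def is_fragmented(buf):
    """True if the moov box contains an mvex child."""
    for box_type, body_start, body_end in iter_boxes(buf):
        if box_type == "moov":
            return any(t == "mvex"
                       for t, _, _ in iter_boxes(buf, body_start, body_end))
    return False

def box(box_type, payload=b""):
    """Build a box with a 32-bit size field (test helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

ftyp = box(b"ftyp", b"mp42" + bytes(4))
plain = ftyp + box(b"moov", box(b"mvhd")) + box(b"mdat")
fragmented = (ftyp + box(b"moov", box(b"mvhd") + box(b"mvex"))
              + box(b"moof") + box(b"mdat"))
print(is_fragmented(plain), is_fragmented(fragmented))  # False True
```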

mehd(Movie Extends Header Box)

mehd is optional and declares the full duration of the movie (fragment_duration). If it is absent, all fragments must be traversed to obtain the full duration. In fmp4 live scenarios, fragment_duration cannot be known in advance.

aligned(8) class MovieExtendsHeaderBox extends FullBox('mehd', version, 0) {
	if (version==1) {
		unsigned int(64)  fragment_duration;
	} else { // version==0
		unsigned int(32)  fragment_duration;
	}
}

trex(Track Extends Box)

Sets the default values for fMP4 samples, such as duration and size.

aligned(8) class TrackExtendsBox extends FullBox('trex', 0, 0){
	unsigned int(32) track_ID;
	unsigned int(32) default_sample_description_index;
	unsigned int(32) default_sample_duration;
	unsigned int(32) default_sample_size;
	unsigned int(32) default_sample_flags;
}

The fields mean the following:

  • track_ID: the ID of the corresponding track, such as the ID of the video track or audio track;
  • default_sample_description_index: the default sample description index (pointing into stsd);
  • default_sample_duration: the default sample duration, usually 0;
  • default_sample_size: the default sample size, usually 0;
  • default_sample_flags: the default sample flags, usually 0;

default_sample_flags occupies 4 bytes and is more complex; its structure is as follows:

In older versions of the specification the first 6 bits were reserved; in the newer specification only the first 4 bits are reserved. The meaning of is_leading is not very intuitive, so a later section covers it in detail.

  • reserved: 4 bits, reserved;
  • is_leading: 2 bits, whether the sample is a leading sample. Possible values:
    • 0: it is unknown whether the sample is a leading sample (this is the value usually set);
    • 1: the sample is a leading sample that depends on a sample before the referenced I-frame, so it cannot be decoded;
    • 2: the sample is not a leading sample;
    • 3: the sample is a leading sample that does not depend on any sample before the referenced I-frame, so it can be decoded;
  • sample_depends_on: 2 bits, whether the sample depends on other samples. Possible values:
    • 0: it is unknown whether the sample depends on other samples;
    • 1: the sample depends on other samples (not an I-frame);
    • 2: the sample does not depend on other samples (an I-frame);
    • 3: reserved;
  • sample_is_depended_on: 2 bits, whether other samples depend on this sample. Possible values:
    • 0: it is unknown whether other samples depend on this sample;
    • 1: other samples may depend on this sample;
    • 2: no other sample depends on this sample;
    • 3: reserved;
  • sample_has_redundancy: 2 bits, whether redundant coding is present. Possible values:
    • 0: it is unknown whether redundant coding is present;
    • 1: redundant coding is present;
    • 2: no redundant coding is present;
    • 3: reserved;
  • sample_padding_value: 3 bits, the padding value;
  • sample_is_non_sync_sample: 1 bit, set when the sample is not a keyframe;
  • sample_degradation_priority: 16 bits, the degradation priority (generally for handling problems during transmission);
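
The bit layout listed above (4 + 2 + 2 + 2 + 2 + 3 + 1 + 16 = 32 bits, most significant bit first) can be unpacked with shifts and masks. A minimal Python sketch; the function name and the two sample values are assumptions for illustration.

```python
def decode_sample_flags(flags):
    """Split a 32-bit sample flags word into its fields."""
    return {
        "is_leading": (flags >> 26) & 0x3,
        "sample_depends_on": (flags >> 24) & 0x3,
        "sample_is_depended_on": (flags >> 22) & 0x3,
        "sample_has_redundancy": (flags >> 20) & 0x3,
        "sample_padding_value": (flags >> 17) & 0x7,
        "sample_is_non_sync_sample": (flags >> 16) & 0x1,
        "sample_degradation_priority": flags & 0xFFFF,
    }

keyframe = decode_sample_flags(0x02000000)
other = decode_sample_flags(0x01010000)
print(keyframe["sample_depends_on"], keyframe["sample_is_non_sync_sample"])  # 2 0
print(other["sample_depends_on"], other["sample_is_non_sync_sample"])        # 1 1
```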

Examples are as follows :

About is_leading

is_leading is not easy to explain, so the original text is quoted here to help with understanding.

A leading sample (usually a picture in video) is defined relative to a reference sample, which is the immediately prior sample that is marked as “sample_depends_on” having no dependency (an I picture). A leading sample has both a composition time before the reference sample, and possibly also a decoding dependency on a sample before the reference sample. Therefore if, for example, playback and decoding were to start at the reference sample, those samples marked as leading would not be needed and might not be decodable. A leading sample itself must therefore not be marked as having no dependency.

For convenience, in what follows leading frame corresponds to leading sample and referenced frame to referenced sample.

Take the H264 codec as an example. H264 has I-frames, P-frames, and B-frames. Because of B-frames, the decoding order and rendering order of video frames may differ.

One feature of mp4 files is support for playback from an arbitrary position. On a video site, for example, you can drag the progress bar to fast-forward.

Often the moment the progress bar lands on is not an I-frame. To be able to play, we look back for the nearest preceding I-frame and, if possible, start decoding and playing from it (that is, playback does not necessarily start from the nearest preceding I-frame).

The frame located at that moment is called the leading frame. The nearest I-frame before the leading frame is called the referenced frame.

Going back to is_leading values 1 and 3: both are leading frames, so when is one decodable and when is it not decodable?

1: this sample is a leading sample that has a dependency before the referenced I‐picture (and is therefore not decodable);
3: this sample is a leading sample that has no dependency before the referenced I‐picture (and is therefore decodable);

1. Example of is_leading = 1: as shown below, decoding frame 2 (the leading frame) depends on frame 1 and frame 3 (the referenced frame). Searching backward from frame 2 in the video stream, the nearest I-frame is frame 3. Even after frame 3 is decoded, frame 2 still cannot be decoded, because frame 1 is not available.

2. Example of is_leading = 3: as shown below, frame 2 (the leading frame) can be decoded.

moof(Movie Fragment Box)

moof is a container box. The related metadata lives in its child boxes, such as mfhd, tfhd and trun.

The pseudocode is as follows:

aligned(8) class MovieFragmentBox extends Box('moof'){ }

mfhd(Movie Fragment Header Box)

The structure is simple: sequence_number is the sequence number of the movie fragment. It is assigned in the order in which the movie fragments are produced, starting from 1 and increasing.

aligned(8) class MovieFragmentHeaderBox extends FullBox('mfhd', 0, 0){
	unsigned int(32)  sequence_number;
}

traf(Track Fragment Box)

aligned(8) class TrackFragmentBox extends Box('traf'){ }

In fMP4, the data is spread over multiple movie fragments. A movie fragment can contain multiple track fragments (each track has zero or more track fragments), and each track fragment can contain multiple samples of its track.

Each track fragment contains multiple track runs, and each track run represents a contiguous set of samples.

tfhd(Track Fragment Header Box)

tfhd sets the default values of the metadata for the samples in a track fragment.

The pseudocode is as follows. Apart from track_ID, all fields are optional.

aligned(8) class TrackFragmentHeaderBox extends FullBox('tfhd', 0, tf_flags){
	unsigned int(32) track_ID;
	// all the following are optional fields 
	unsigned int(64) base_data_offset; 
	unsigned int(32) sample_description_index; 
	unsigned int(32) default_sample_duration; 
	unsigned int(32) default_sample_size; 
	unsigned int(32) default_sample_flags
}

sample_description_index, default_sample_duration and default_sample_size are self-explanatory, so only tf_flags and base_data_offset are covered here.

First, tf_flags. The individual flag values are as follows (as before, they are combined with bitwise OR):

  • 0x000001 base‐data‐offset‐present: the base_data_offset field is present; it gives the base offset of the data location relative to the whole file;
  • 0x000002 sample‐description‐index‐present: the sample_description_index field is present;
  • 0x000008 default‐sample‐duration‐present: the default_sample_duration field is present;
  • 0x000010 default‐sample‐size‐present: the default_sample_size field is present;
  • 0x000020 default‐sample‐flags‐present: the default_sample_flags field is present;
  • 0x010000 duration‐is‐empty: there are no samples for the current time period; default_sample_duration, if present, is 0;
  • 0x020000 default‐base‐is‐moof: ignored if base‐data‐offset‐present is 1. If base‐data‐offset‐present is 0, the base_data_offset of the current track fragment is counted from the first byte of the enclosing moof;

The position of the sample data is calculated as base_data_offset + data_offset, where data_offset is defined in each trun (one value per track run); the samples of a run are then stored one after another from that position. If base_data_offset is not explicitly provided, sample positions are usually relative to the moof.
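
As a rough sketch of this calculation: assuming the samples of a track run are stored contiguously starting at base_data_offset + data_offset, the file position of each sample follows from the running sum of the sample sizes. The function name sampleOffsets and its parameters are hypothetical, and the numbers in the usage line are made up for illustration.

```javascript
// Hypothetical helper: absolute file offset of every sample in one track run.
// Assumes the samples of the run are laid out back to back.
function sampleOffsets(baseDataOffset, dataOffset, sampleSizes) {
  const offsets = [];
  let position = baseDataOffset + dataOffset; // where the first sample starts

  for (const size of sampleSizes) {
    offsets.push(position);
    position += size; // the next sample starts right after this one
  }

  return offsets;
}

// Illustrative values only: base 1263, data_offset 100, three samples
console.log(sampleOffsets(1263, 100, [10, 20, 30]));
```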

For example, tf_flags equal to 57 (0x39 = 0x01 + 0x08 + 0x10 + 0x20) means that base_data_offset, default_sample_duration, default_sample_size and default_sample_flags are present.

In this example, base_data_offset is 1263 (the sizes of ftyp and moov add up to 1263).
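
The flag test itself is a plain bitwise AND. A minimal sketch, using the flag values from the list above; the table name TF_FLAGS, the camel-case keys and the function decodeTfFlags are assumptions made for this illustration:

```javascript
// Flag values as listed above; the key names are hypothetical.
const TF_FLAGS = {
  baseDataOffsetPresent: 0x000001,
  sampleDescriptionIndexPresent: 0x000002,
  defaultSampleDurationPresent: 0x000008,
  defaultSampleSizePresent: 0x000010,
  defaultSampleFlagsPresent: 0x000020,
  durationIsEmpty: 0x010000,
  defaultBaseIsMoof: 0x020000,
};

// Return the names of all flags that are set in tfFlags.
function decodeTfFlags(tfFlags) {
  return Object.keys(TF_FLAGS).filter(name => (tfFlags & TF_FLAGS[name]) !== 0);
}

// 57 = 0x39 = 0x01 | 0x08 | 0x10 | 0x20
console.log(decodeTfFlags(57));
```

A parser would read the optional tfhd fields in the order they appear in the pseudocode, consuming each one only when its flag is set.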

trun(Track Fragment Run Box)

The pseudocode for trun is as follows:

aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
   unsigned int(32)  sample_count;
   // the following are optional fields
   signed int(32) data_offset;
   unsigned int(32)  first_sample_flags;
   // all fields in the following array are optional
   {
      unsigned int(32)  sample_duration;
      unsigned int(32)  sample_size;
      unsigned int(32)  sample_flags
      if (version == 0)
         { unsigned int(32) sample_composition_time_offset; }
      else
         { signed int(32) sample_composition_time_offset; }
   }[ sample_count ]
}

As mentioned earlier, a track run represents a contiguous set of samples, where:

  • sample_count: the number of samples;
  • data_offset: the offset of the data;
  • first_sample_flags: optional; flags that apply only to the first sample of the current track run;

tr_flags works in much the same way as tf_flags:

  • 0x000001 data‐offset‐present: the data_offset field is present;

  • 0x000004 first‐sample‐flags‐present: the first_sample_flags field is present. Its value overrides the flags of the first sample only, so the first frame of a group of samples can be marked as a keyframe while the others are marked as non-keyframes. When first_sample_flags is present, sample_flags is not;

  • 0x000100 sample‐duration‐present: every sample has its own sample_duration; otherwise the default value is used;

  • 0x000200 sample‐size‐present: every sample has its own sample_size; otherwise the default value is used;

  • 0x000400 sample‐flags‐present: every sample has its own sample_flags; otherwise the default value is used;

  • 0x000800 sample‐composition‐time‐offsets‐present: every sample has its own sample_composition_time_offset;

An example follows, with tr_flags equal to 2565 (0xA05 = 0x001 + 0x004 + 0x200 + 0x800). In this case data_offset, first_sample_flags, sample_size and sample_composition_time_offset are present.
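
To make the optional fields concrete, here is a minimal sketch of reading a trun payload according to the pseudocode and flag values above. It assumes the buffer starts right after the 8-byte box header (at the version byte); the function name parseTrunPayload and the property names are hypothetical, and this is not the author's parser.

```javascript
// Hypothetical sketch: parse a trun payload (version/flags onward) from a Node.js Buffer.
function parseTrunPayload(buf) {
  const version = buf.readUInt8(0);    // 1 byte
  const flags = buf.readUIntBE(1, 3);  // 3 bytes, tr_flags
  let offset = 4;

  const run = { version, flags, samples: [] };
  run.sampleCount = buf.readUInt32BE(offset); offset += 4;

  // Optional run-level fields, present only when the matching flag is set
  if (flags & 0x000001) { run.dataOffset = buf.readInt32BE(offset); offset += 4; }
  if (flags & 0x000004) { run.firstSampleFlags = buf.readUInt32BE(offset); offset += 4; }

  // Optional per-sample fields, in the order given by the pseudocode
  for (let i = 0; i < run.sampleCount; i++) {
    const sample = {};
    if (flags & 0x000100) { sample.duration = buf.readUInt32BE(offset); offset += 4; }
    if (flags & 0x000200) { sample.size = buf.readUInt32BE(offset); offset += 4; }
    if (flags & 0x000400) { sample.flags = buf.readUInt32BE(offset); offset += 4; }
    if (flags & 0x000800) {
      // unsigned in version 0, signed otherwise
      sample.compositionTimeOffset =
        version === 0 ? buf.readUInt32BE(offset) : buf.readInt32BE(offset);
      offset += 4;
    }
    run.samples.push(sample);
  }

  return run;
}
```

With tr_flags = 2565, each sample entry therefore takes 8 bytes: sample_size followed by sample_composition_time_offset.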

Programming practice: parsing the MP4 file structure

What is learned on paper always feels shallow; to really understand something you have to code it yourself. Following the MP4 specification, you can write a simple MP4 parsing tool. For example, the comparison of the box structures of ordinary MP4 and fMP4 above was produced by an analysis script the author wrote.

The core code is as follows. The complete code is a bit long and can be found on the author's GitHub.

class Box {
	constructor(boxType, extendedType, buffer) {
		this.type = boxType; // required, string, 4 bytes: box type
		this.size = 0; // required, integer, 4 bytes: box size in bytes
		this.headerSize = 8; // size (4 bytes) + type (4 bytes)
		this.boxes = [];

		// this.largeSize = 0; // optional, 8 bytes
		// this.extendedType = extendedType || boxType; // optional, 16 bytes
		this._initialize(buffer);
	}

	_initialize(buffer) {
		this.size = buffer.readUInt32BE(0); // 4 bytes
		this.type = buffer.slice(4, 8).toString(); // 4 bytes

		let offset = 8;

		if (this.size === 1) {
			// 8 bytes, largesize; readUIntBE() handles at most 6 bytes, so read a BigInt
			this.size = Number(buffer.readBigUInt64BE(8));
			this.headerSize += 8;
			offset = 16;
		} else if (this.size === 0) {
			// last box: it extends to the end of the file
			this.size = buffer.byteLength;
		}

		if (this.type === 'uuid') {
			this.type = buffer.slice(offset, offset + 16); // 16 bytes, extended type
			this.headerSize += 16;
		}
	}

	setInnerBoxes(buffer, offset = 0) {
		const innerBoxes = getInnerBoxes(buffer.slice(this.headerSize + offset, this.size));

		innerBoxes.forEach(item => {
			let { type, buffer } = item;

			type = type.trim(); // note: some box types are padded with a space, such as 'url ' and 'urn '

			if (this[type]) {
				const box = this[type](buffer);
				this.boxes.push(box);
			} else {
				this.boxes.push('TODO: not implemented');
				// console.log(`unknown type: ${type}`);
			}
		});
	}
}

class FullBox extends Box {
	constructor(boxType, buffer) {
		super(boxType, '', buffer);

		const headerSize = this.headerSize;

		this.version = buffer.readUInt8(headerSize); // required, 1 byte
		this.flags = buffer.readUIntBE(headerSize + 1, 3); // required, 3 bytes

		this.headerSize = headerSize + 4;
	}
}

// The code for FileTypeBox, MovieBox, MediaDataBox and MovieFragmentBox is fairly long, so it is omitted here
class Movie {
	constructor(buffer) {

		this.boxes = [];
		this.bytesConsumed = 0;

		const innerBoxes = getInnerBoxes(buffer);

		innerBoxes.forEach(item => {
			const { type, buffer, size } = item;
			if (this[type]) {
				const box = this[type](buffer);
				this.boxes.push(box);
			} else {
				// custom box type
			}
			this.bytesConsumed += size;
		});
	}

	ftyp(buffer) {
		return new FileTypeBox(buffer);
	}

	moov(buffer) {
		return new MovieBox(buffer);
	}

	mdat(buffer) {
		return new MediaDataBox(buffer);
	}

	moof(buffer) {
		return new MovieFragmentBox(buffer);
	}
}

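function getInnerBoxes(buffer) {
	const boxes = [];
	const totalByteLen = buffer.byteLength;
	let offset = 0;

	// a plain while loop also copes with containers that have no child boxes
	while (offset + 8 <= totalByteLen) {
		const box = getBox(buffer, offset);
		boxes.push(box);

		offset += box.size;
	}

	return boxes;
}

function getBox(buffer, offset = 0) {
	let size = buffer.readUInt32BE(offset); // 4 bytes
	const type = buffer.slice(offset + 4, offset + 8).toString(); // 4 bytes

	if (size === 1) {
		// 8 bytes, largesize; readUIntBE() handles at most 6 bytes, so read a BigInt
		size = Number(buffer.readBigUInt64BE(offset + 8));
	} else if (size === 0) {
		// last box: it extends to the end of the file
		size = buffer.byteLength - offset;
	}

	if (size < 8) {
		// a box can never be smaller than its own header; stop instead of looping forever
		throw new Error(`invalid box size ${size} at offset ${offset}`);
	}

	const boxBuffer = buffer.slice(offset, offset + size);

	return {
		size,
		type,
		buffer: boxBuffer
	};
}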
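
For comparing box structures, a recursive walk over the container boxes is often enough. The following is a self-contained sketch, not the author's script: the function name listBoxes and the CONTAINERS set are assumptions for illustration, and only a handful of plain container box types are descended into.

```javascript
// Plain container boxes whose payload is just a sequence of child boxes (partial list).
const CONTAINERS = new Set(['moov', 'trak', 'mdia', 'minf', 'stbl', 'mvex', 'moof', 'traf']);

// Hypothetical helper: return one indented line per box, e.g. "  mfhd (16)".
function listBoxes(buffer, start = 0, end = buffer.length, depth = 0, out = []) {
  let offset = start;

  while (offset + 8 <= end) {
    let size = buffer.readUInt32BE(offset);
    const type = buffer.toString('latin1', offset + 4, offset + 8);
    let headerSize = 8;

    if (size === 1) {
      // 64-bit largesize follows the type field
      if (offset + 16 > end) break;
      size = Number(buffer.readBigUInt64BE(offset + 8));
      headerSize = 16;
    } else if (size === 0) {
      size = end - offset; // box runs to the end of the enclosing range
    }

    if (size < headerSize || offset + size > end) break; // malformed or truncated

    out.push(`${'  '.repeat(depth)}${type} (${size})`);
    if (CONTAINERS.has(type)) {
      listBoxes(buffer, offset + headerSize, offset + size, depth + 1, out);
    }
    offset += size;
  }

  return out;
}

// Usage (the file name is a placeholder):
// console.log(listBoxes(require('fs').readFileSync('example.mp4')).join('\n'));
```

Running this on an ordinary MP4 and on an fMP4 file would show the difference discussed earlier: a single moov and mdat in the former, and repeated moof and mdat pairs in the latter.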

Closing remarks

Because time was limited, and to keep the explanations simple, some of the content may not be fully rigorous. If you find any errors, please point them out. Questions and discussion are also welcome at any time.

ISO/IEC 14496-12:2015 Information technology — Coding of audio-visual objects — Part 12: ISO base media file format
https://www.iso.org/standard/68960.html

Introduction to QuickTime File Format Specification
https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFPreface/qtffPreface.html#//apple_ref/doc/uid/TP40000939-CH202-TPXREF101

AVC_(file_format)
http://fileformats.archiveteam.org/wiki/AVC_(file_format)

AV1 Codec ISO Media File Format Binding
https://aomediacodec.github.io/av1-isobmff/

Copyright notice
This article was written by [Program ape card]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2020/12/20201208085115168a.html