
A 5-Minute Introduction to the MP4 File Format

2020-12-08 10:54:03 itread01

## Preface

This article covers what MP4 is, the basic structure of an MP4 file, the basic structure of a box, the common and important boxes, the difference between ordinary MP4 and fMP4, and how to parse an MP4 file in code.

Background: I have recently been answering a lot of questions from teammates about live streaming and short video, such as "how does flv.js work", "why does the mp4 file from our designer fail to play in the browser but play fine locally", and "MP4 has good compatibility, so can it be used for live streaming". Answering them often requires some introduction to the MP4 format. I had studied this area briefly before and taken notes, so I have tidied them up here to serve as a reference document for the team. If you spot mistakes or omissions, please point them out.

## What is MP4

First, a word about container formats. A multimedia container format (also called a packaging format) puts video data, audio data and so on into a single file according to certain rules. The common MKV and AVI, as well as the MP4 covered in this article, are all container formats.

MP4 is one of the most common container formats and is widely used because it works across platforms. MP4 files use the .mp4 extension, and practically all mainstream players and browsers support the format.

>The MP4 file format is mainly defined by two parts, MPEG-4 Part 12 and MPEG-4 Part 14. MPEG-4 Part 12 defines the ISO base media file format, which stores time-based media content. MPEG-4 Part 14 defines the MP4 file format proper, as an extension of MPEG-4 Part 12.

Anyone working on live streaming or audio/video should understand the MP4 format, so here is a brief introduction.
## MP4 File Format Overview

An MP4 file is made up of multiple boxes. Each box stores different information, and the boxes form a tree structure, as shown in the figure below.

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/be249051d03f45748b3055c842d75793~tplv-k3u1fbpfcp-watermark.image)

There are many types of box. Here are 3 of the more important top-level boxes:

* ftyp: File Type Box, describes the MP4 specification and version the file conforms to;
* moov: Movie Box, the media metadata; there is exactly one;
* mdat: Media Data Box, stores the actual media data; there are usually several;

![](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/79de26cf6b5145228a9f6c71c8ca572d~tplv-k3u1fbpfcp-watermark.image)

Although there are many types of box, their basic structure is the same. The next section starts with the structure of a box and then goes on to explain the common boxes.

The table below lists the common boxes. Just skim it for a general impression, then go straight to the next section.

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/41ee4c90565a49f6aca5c350934ebb3d~tplv-k3u1fbpfcp-watermark.image)

## MP4 Box Introduction

A box consists of two parts: the box header and the box body.

1. box header: the metadata of the box, such as the box type and box size.
2. box body: the data part of the box. What is actually stored depends on the box type; for example, the body of mdat stores the media data.

In the box header, only type and size are required fields. When size==1, a largesize field is present. Some boxes also have version and flags fields; such a box is called a Full Box. When the box body nests other boxes, the box is called a container box.
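
The header layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's parser: the function names `parse_box_header` and `iter_top_level_boxes` and the `sample` bytes are made up for this example, and the handling of `size` follows the Box definition in ISO/IEC 14496-12 (size 1 means a 64-bit largesize follows, size 0 means the box runs to the end of the file).

```python
import struct


def parse_box_header(buf: bytes, offset: int = 0):
    """Parse the box header at `offset`; return (type, size, header_size).

    A returned size of 0 means the box extends to the end of the file.
    """
    size, box_type = struct.unpack_from(">I4s", buf, offset)
    header_size = 8
    if size == 1:
        # size == 1: the real size is in the 64-bit largesize field.
        (size,) = struct.unpack_from(">Q", buf, offset + 8)
        header_size = 16
    if box_type == b"uuid":
        # A 16-byte extended_type follows for custom box types.
        header_size += 16
    return box_type.decode("latin-1"), size, header_size


def iter_top_level_boxes(buf: bytes):
    """Yield (type, offset, size) for each top-level box in `buf`."""
    offset = 0
    while offset + 8 <= len(buf):
        box_type, size, header_size = parse_box_header(buf, offset)
        if size == 0:
            # size == 0: the box runs to the end of the file.
            size = len(buf) - offset
        if size < header_size:
            raise ValueError(f"invalid size for box {box_type!r}")
        yield box_type, offset, size
        offset += size


# Hypothetical sample: a 16-byte ftyp box followed by an mdat with size == 0.
sample = struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 512)
sample += struct.pack(">I4s", 0, b"mdat") + b"\x00" * 4
for box in iter_top_level_boxes(sample):
    print(box)
```

To walk a real file you would pass its bytes (for example from `open(path, "rb").read()`) instead of `sample`; for large files a streaming reader that seeks past each box body would be the more sensible design.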
![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/f6b3386f5de345cc9688a6f5e968ae28~tplv-k3u1fbpfcp-watermark.image)

### Box Header

The fields are defined as follows:

* type: the box type, either a predefined type or a custom extension type; occupies 4 bytes;
  * Predefined types: ftyp, moov, mdat and other predefined types;
  * Custom extension type: if type==uuid, the box is a custom extension type. The 16 bytes that follow the size and type fields (or largesize, if present) hold the value of the custom type (extended_type);
* size: the size of the whole box including the box header, in bytes. When size is 0 or 1, special handling is needed:
  * size equal to 1: the box size is given by the largesize field that follows (generally only the mdat box carrying media data uses largesize);
  * size equal to 0: the current box is the last box in the file and extends to the end of the file; this usually occurs on an mdat box;
* largesize: the box size; occupies 8 bytes;
* extended_type: the custom extension type; occupies 16 bytes;

The pseudo-code for Box is as follows:

```
aligned(8) class Box (unsigned int(32) boxtype, optional unsigned int(8)[16] extended_type) {
    unsigned int(32) size;
    unsigned int(32) type = boxtype;
    if (size==1) {
        unsigned int(64) largesize;
    } else if (size==0) {
        // box extends to end of file
    }
    if (boxtype=='uuid') {
        unsigned int(8)[16] usertype = extended_type;
    }
}
```

### Box Body

The box body is the data of the box. Its content differs from box to box, so you need to refer to the definition of the specific box. Some box bodies are simple, such as ftyp. Others are more complex and may nest other boxes, such as moov.

### Box vs FullBox

FullBox is an extension of Box. Compared with Box, FullBox has two more fields, version and flags.
* version: the version of the current box, there to allow for future extension; occupies 1 byte;
* flags: flag bits; occupies 24 bits; the meaning is defined by each specific box;

The pseudo-code for FullBox is as follows:

```
aligned(8) class FullBox(unsigned int(32) boxtype, unsigned int(8) v, bit(24) f) extends Box(boxtype) {
    unsigned int(8) version = v;
    bit(24) flags = f;
}
```

FullBox is mainly used by the boxes inside moov, such as `moov.mvhd`, which is introduced later.

```
aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
    // fields omitted ...
}
```

## ftyp(File Type Box)

ftyp indicates the specifications the current file follows. Before going into the details of ftyp, let's first introduce isom.

### What is isom

isom (ISO Base Media file) is a basic file format defined in MPEG-4 Part 12. MP4, 3gp, QT and other common container formats are all derived from this basic file format.

The specifications an MP4 file may follow include mp41 and mp42, both of which are derived from isom.

>3gp (3GPP): a container format, mainly used on 3G mobile phones;

>QT: short for QuickTime; a .qt file is an Apple QuickTime media file;

### ftyp definition

ftyp is defined as follows:

```
aligned(8) class FileTypeBox extends Box('ftyp') {
    unsigned int(32) major_brand;
    unsigned int(32) minor_version;
    unsigned int(32) compatible_brands[]; // to end of the box
}
```

A brand, as used here, is the code for a specific container format, expressed as a 4-byte code such as mp41.

>A brand is a four-letter code representing a format or subformat. Each file has a major brand (or primary brand), and also a compatibility list of brands.

The fields of ftyp mean the following:

* major_brand: for example the common isom, mp41, mp42, avc1, qt and so on. It indicates the format that is "best" used to parse the current file.
  For example, if major_brand is A and compatible_brands is A1, then a decoder that supports both A and A1 should preferably decode the file according to A; a decoder that does not support A but does support A1 can decode it according to A1;
* minor_version: provides version information for major_brand, such as a version number. It must not be used to determine whether a media file conforms to a particular standard or specification;
* compatible_brands: the list of brands the file is compatible with. For example, mp41 is compatible with the brand isom. Using a brand from the compatibility list, part (or all) of the file can be decoded;

> In practice, isom cannot be used as major_brand; a specific brand (such as mp41) has to be used. For that reason no file extension or mime type is defined for isom.

Below are some common brands with their file extensions and mime types. For more brands, see [here](http://fileformats.archiveteam.org/wiki/Boxes/atoms_format#Brands).

![](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/9290f510d66a45699e0c6c9d3b32562e~tplv-k3u1fbpfcp-watermark.image)

Below is a screenshot of a real example; it is not explained further.

![](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/7af192a4acc141c69ee75b932874f418~tplv-k3u1fbpfcp-watermark.image)

### About AVC/AVC1

When discussing MP4 specifications, "AVC" sometimes means the "AVC file format" and sometimes means the "AVC compression standard (H.264)". Here is a simple distinction.

* AVC file format: derived from the ISO base media file format and using the AVC compression standard. It can be regarded as an extension of the MP4 format. The corresponding brand is usually avc1, and it is defined in MPEG-4 Part 15.
* AVC compression standard (H.264): defined in MPEG-4 Part 10.
* ISO base media file format (Base Media File Format): defined in MPEG-4 Part 12.

## moov(Movie Box)

The Movie Box stores the metadata of the mp4 file and is usually located at the beginning of the file.
```
aligned(8) class MovieBox extends Box('moov'){ }
```

In moov, the two most important boxes are mvhd and trak:

* mvhd: Movie Header Box, overall information about the mp4 file, such as creation time and file duration;
* trak: Track Box. An mp4 file can contain one or more tracks (such as a video track and an audio track), and the track information lives in trak. trak is a container box that contains at least two boxes, tkhd and mdia;

>mvhd describes the whole movie, tkhd a single track, mdhd the media, vmhd video, and smhd audio. You can think of this as going from broad to specific; values in the former are generally derived from the latter.

### mvhd(Movie Header Box)

Overall information about the MP4 file, independent of the specific video and audio streams, such as creation time and file duration.

It is defined as follows:

```
aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) timescale;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) timescale;
        unsigned int(32) duration;
    }
    template int(32) rate = 0x00010000; // typically 1.0
    template int(16) volume = 0x0100; // typically, full volume
    const bit(16) reserved = 0;
    const unsigned int(32)[2] reserved = 0;
    template int(32)[9] matrix = { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // Unity matrix
    bit(32)[6] pre_defined = 0;
    unsigned int(32) next_track_ID;
}
```

The fields mean the following:

* creation_time: file creation time;
* modification_time: file modification time;
* timescale: the number of time units in one second (integer).
  For example, if timescale is 1000, one second contains 1000 time units (the track durations that follow are converted with it; for instance, if a track's duration is 10,000, its actual duration is 10,000/1000 = 10s);
* duration: the duration of the movie (integer). It can be derived from the tracks in the file and equals the duration of the longest track;
* rate: the recommended playback rate, a 32-bit integer whose high 16 bits and low 16 bits are the integer and fractional parts respectively ([16.16]). For example, 0x0001 0000 means 1.0, normal playback speed;
* volume: the playback volume, a 16-bit integer whose high 8 bits and low 8 bits are the integer and fractional parts respectively ([8.8]). For example, 0x01 00 means 1.0, full volume;
* matrix: the video transformation matrix; can generally be ignored;
* next_track_ID: a 32-bit integer, non-zero; can generally be ignored. When a new track is added to the movie, the track id it uses must be greater than every track id currently in use. In other words, when adding a new track you need to traverse all tracks to confirm an available track id;

### tkhd(Track Header Box)

The metadata of a single track. It includes the following fields:

* version: the version of the tkhd box;
* flags: obtained by a bitwise OR; the default value is 7 (0x000001 | 0x000002 | 0x000004), meaning the track is enabled, used in playback and used in preview.
  * Track_enabled: value 0x000001, meaning the track is enabled; if the bit is not set (0x000000), the track is not enabled;
  * Track_in_movie: value 0x000002, meaning the track is used during playback;
  * Track_in_preview: value 0x000004, meaning the track is used in preview mode;
* creation_time: the creation time of the current track;
* modification_time: the last modification time of the current track;
* track_ID: the unique identifier of the current track; it cannot be 0 and cannot be repeated;
* duration: the full duration of the current track (divide by timescale to get the actual number of seconds);
* layer: the stacking order of video tracks. The smaller the number, the closer to the viewer; for example, 1 is above 2, and 0 is above 1;
* alternate_group: the ID of the group the current track belongs to. Tracks with the same alternate_group value are in the same group, and within a group only one track can be playing at a time. When alternate_group is 0, the current track is not grouped with any other track.
  A group may also contain just one track;
* volume: the volume of an audio track, between 0.0 and 1.0;
* matrix: the video transformation matrix;
* width, height: the width and height of the video;

It is defined as follows:

```
aligned(8) class TrackHeaderBox extends FullBox('tkhd', version, flags){
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) track_ID;
        const unsigned int(32) reserved = 0;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) track_ID;
        const unsigned int(32) reserved = 0;
        unsigned int(32) duration;
    }
    const unsigned int(32)[2] reserved = 0;
    template int(16) layer = 0;
    template int(16) alternate_group = 0;
    template int(16) volume = {if track_is_audio 0x0100 else 0};
    const unsigned int(16) reserved = 0;
    template int(32)[9] matrix = { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // unity matrix
    unsigned int(32) width;
    unsigned int(32) height;
}
```

An example:

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/e980ad2bc583418f8bdd7ce122e9d035~tplv-k3u1fbpfcp-watermark.image)

### hdlr(Handler Reference Box)

Declares the type of the current track and the corresponding handler.

The values of handler_type include:

* vide (0x76 69 64 65), video track;
* soun (0x73 6f 75 6e), audio track;
* hint (0x68 69 6e 74), hint track;

name is a UTF-8 string describing the handler, for example L-SMASH Video Handler (see [here](http://avisynth.nl/index.php/LSMASHSource)).
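
The hexadecimal values listed for handler_type are simply the ASCII bytes of the four-character code read as a big-endian 32-bit integer. A small sketch of that conversion follows; the helper names `fourcc_to_int` and `int_to_fourcc` are made up for this illustration.

```python
def fourcc_to_int(code: str) -> int:
    """Interpret a four-character code as a big-endian 32-bit integer."""
    raw = code.encode("ascii")
    if len(raw) != 4:
        raise ValueError("a four-character code must be exactly 4 bytes")
    return int.from_bytes(raw, "big")


def int_to_fourcc(value: int) -> str:
    """Turn a 32-bit integer read from the file back into its 4 characters."""
    return value.to_bytes(4, "big").decode("ascii")


# The three handler types listed above.
for code in ("vide", "soun", "hint"):
    print(code, hex(fourcc_to_int(code)))
```

The same conversion applies to box types and brands, which are also stored as four ASCII bytes.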
```
aligned(8) class HandlerBox extends FullBox('hdlr', version = 0, 0) {
    unsigned int(32) pre_defined = 0;
    unsigned int(32) handler_type;
    const unsigned int(32)[3] reserved = 0;
    string name;
}
```

![](https://p1-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/18839d47f31841e69ab1f902f854131b~tplv-k3u1fbpfcp-watermark.image)

## stbl(Sample Table Box)

The media data of an MP4 file lives in the mdat box, while stbl contains the index and timing information for that media data. Understanding stbl is critical to decoding and rendering MP4 files.

In an MP4 file, the media data is divided into chunks. Each chunk can contain multiple samples, and a sample is made up of frames (usually 1 sample corresponds to 1 frame). The relationship is as follows:

![Alt text](./1607019210268.png)

The key boxes in stbl include stsd, stco, stsc, stsz, stts, stss and ctts. Here is a brief overview first, followed by the details one by one.

### stco / stsc / stsz / stts / stss / ctts / stsd overview

A brief introduction to these boxes:

* stsd: gives the encoding, width and height, volume and other information for the video and audio, and how many frames each sample contains;
* stco: the offset of each chunk in the file;
* stsc: how many samples each chunk contains;
* stsz: the size of each sample (in bytes);
* stts: the duration of each sample;
* stss: which samples are keyframes;
* ctts: the time difference between decoding and rendering of a frame, usually used when B-frames are present;

### stsd(Sample Description Box)

stsd gives the description of the samples, including any initialization information needed during decoding, such as the codec. The initialization information required differs between video and audio; video is used as the example here.
The pseudo-code is as follows:

```
aligned(8) abstract class SampleEntry (unsigned int(32) format) extends Box(format){
    const unsigned int(8)[6] reserved = 0;
    unsigned int(16) data_reference_index;
}

// Visual Sequences
class VisualSampleEntry(codingname) extends SampleEntry (codingname){
    unsigned int(16) pre_defined = 0;
    const unsigned int(16) reserved = 0;
    unsigned int(32)[3] pre_defined = 0;
    unsigned int(16) width;
    unsigned int(16) height;
    template unsigned int(32) horizresolution = 0x00480000; // 72 dpi
    template unsigned int(32) vertresolution = 0x00480000; // 72 dpi
    const unsigned int(32) reserved = 0;
    template unsigned int(16) frame_count = 1;
    string[32] compressorname;
    template unsigned int(16) depth = 0x0018;
    int(16) pre_defined = -1;
}

// AudioSampleEntry and HintSampleEntry definitions omitted

aligned(8) class SampleDescriptionBox (unsigned int(32) handler_type) extends FullBox('stsd', 0, 0){
    int i;
    unsigned int(32) entry_count;
    for (i = 1; i <= entry_count; i++) {
        switch (handler_type){
            case 'soun': // for audio tracks
                AudioSampleEntry();
                break;
            case 'vide': // for video tracks
                VisualSampleEntry();
                break;
            case 'hint': // Hint track
                HintSampleEntry();
                break;
        }
    }
}
```

In SampleDescriptionBox, the handler_type parameter is the type of the track (soun, vide, hint), and entry_count is the number of sample description entries in the box.

>In stsc, sample_description_index is the index pointing to these sample descriptions.

Different handler_type values use different SampleEntry types in SampleDescriptionBox; for example, a video track uses VisualSampleEntry.
VisualSampleEntry includes the following fields:

* data_reference_index: the data part of an MP4 file can be split into multiple segments, each with its own index and fetched through a URL; in that case data_reference_index points to the corresponding segment (rarely used);
* width, height: the width and height of the video, in pixels;
* horizresolution, vertresolution: the horizontal and vertical resolution (pixels per inch), as 16.16 fixed-point numbers; the default is 0x00480000 (72dpi);
* frame_count: how many frames a sample contains; for a video track the default is 1;
* compressorname: a name for reference only, usually used for display; occupies 32 bytes, for example AVC Coding. The first byte gives the number of bytes N that the name actually occupies. Bytes 2 to N+1 store the name. Bytes N+2 to 32 are padding. compressorname may be set to 0;
* depth: the bit depth of the image; for example 0x0018 (24) means an image without an alpha channel;

>In video tracks, the frame_count field must be 1 unless the specification for the media format explicitly documents this template field and permits larger values. That specification must document both how the individual frames of video are found (their size information) and their timing established. That timing might be as simple as dividing the sample duration by the frame count to establish the frame duration.

An example:

![](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/58a74d77e6114719aee48b31eaeb781b~tplv-k3u1fbpfcp-watermark.image)

### stco(Chunk Offset Box)

The offset of each chunk in the file. There are two box types, one for small files and one for large files: stco and co64. They have the same structure; only the width of the offset field differs.

chunk_offset is the offset within the file itself, not an offset inside some box.
When building an mp4 file, pay special attention to where moov is placed, because it affects the chunk_offset values. Some MP4 files have moov at the end of the file; to speed up display of the first frame, moov needs to be moved to the front of the file, and the chunk_offset values then have to be rewritten.

stco is defined as follows:

```
# Box Type: 'stco', 'co64'
# Container: Sample Table Box ('stbl')
# Mandatory: Yes
# Quantity: Exactly one variant must be present
aligned(8) class ChunkOffsetBox extends FullBox('stco', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(32) chunk_offset;
    }
}

aligned(8) class ChunkLargeOffsetBox extends FullBox('co64', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(64) chunk_offset;
    }
}
```

As the example below shows, the offset of the first chunk is 47564, the offset of the second chunk is 120579, and so on.

![](https://p6-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/e481c07776c7476cbfeca5a55d4aff65~tplv-k3u1fbpfcp-watermark.image)

### stsc(Sample To Chunk Box)

Samples are grouped in units of chunks. Chunks can differ in size, and the samples inside a chunk can differ in size.
* entry_count: the number of entries (each entry contains first_chunk, samples_per_chunk and sample_description_index);
* first_chunk: the number of the first chunk the current entry applies to;
* samples_per_chunk: the number of samples in each chunk;
* sample_description_index: the index pointing to the sample description in stsd (see the stsd section);

```
aligned(8) class SampleToChunkBox extends FullBox('stsc', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(32) first_chunk;
        unsigned int(32) samples_per_chunk;
        unsigned int(32) sample_description_index;
    }
}
```

The description above is fairly abstract, so here is an example. The table below means:

* Chunks 1~15 each contain 15 samples;
* Chunk 16 contains 30 samples;
* Chunk 17 and every chunk after it contain 28 samples each;
* All samples in the chunks above use sample description index 1;

| first_chunk | samples_per_chunk | sample_description_index |
| :--------: | :--------: | :------: |
| 1 | 15 | 1 |
| 16 | 30 | 1 |
| 17 | 28 | 1 |

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/09390dea426f49f68cf0cebb64547afc~tplv-k3u1fbpfcp-watermark.image)

### stsz(Sample Size Boxes)

The size of each sample (in bytes). From the sample_count field you can tell how many samples (or frames) the current track contains.

There are two box types, stsz and stz2.

stsz:

* sample_size: the default sample size (in bytes), usually 0. If sample_size is not 0, all samples have the same size. If sample_size is 0, the samples may differ in size.
* sample_count: the number of samples in the current track.
  If sample_size==0, sample_count is also the number of entries that follow;
* entry_size: the size of a single sample (only present when sample_size==0);

```
aligned(8) class SampleSizeBox extends FullBox('stsz', version = 0, 0) {
    unsigned int(32) sample_size;
    unsigned int(32) sample_count;
    if (sample_size==0) {
        for (i=1; i <= sample_count; i++) {
            unsigned int(32) entry_size;
        }
    }
}
```

stz2:

* field_size: the number of bits each entry_size occupies in the entry table; the allowed values are 4, 8 and 16. 4 is a special case: when field_size is 4, one byte holds two entries, with the high 4 bits being entry[i] and the low 4 bits being entry[i+1];
* sample_count: the number of entries that follow;
* entry_size: the sample size.

```
aligned(8) class CompactSampleSizeBox extends FullBox('stz2', version = 0, 0) {
    unsigned int(24) reserved = 0;
    unsigned int(8) field_size;
    unsigned int(32) sample_count;
    for (i=1; i <= sample_count; i++) {
        unsigned int(field_size) entry_size;
    }
}
```

An example:

![](https://p6-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/838cc3afb5a24faf97917a3f3482a1dc~tplv-k3u1fbpfcp-watermark.image)

### stts(Decoding Time to Sample Box)

stts contains the mapping table from DTS to sample number. It is mainly used to derive the duration of each frame.

```
aligned(8) class TimeToSampleBox extends FullBox('stts', version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i=0; i < entry_count; i++) {
        unsigned int(32) sample_count;
        unsigned int(32) sample_delta;
    }
}
```

* entry_count: the number of entries in stts;
* sample_count: within a single entry, the number of consecutive samples that share the same duration (duration or sample_delta);
* sample_delta: the duration of the sample (measured in timescale units);

Take the example in the figure below: entry_count is 3, the first 250 samples each have a duration of 1000, sample 251 has a duration of 999, and samples 252~283 each have a duration of 1000.

> Assuming timescale is 1000, the actual duration in seconds is obtained by dividing by 1000.
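
The run-length table described above can be expanded into one decode timestamp per sample. The sketch below assumes the three entries from the example and a timescale of 1000; the function name `decode_times` is made up for this illustration and is not part of any library.

```python
def decode_times(entries):
    """Expand stts entries into per-sample decode timestamps.

    `entries` is a list of (sample_count, sample_delta) pairs.
    Returns (list of DTS values in timescale units, total duration).
    """
    dts = 0
    times = []
    for sample_count, sample_delta in entries:
        for _ in range(sample_count):
            times.append(dts)      # this sample is decoded at the running total
            dts += sample_delta    # the next sample starts one delta later
    return times, dts


# Entries from the example: 250 x 1000, 1 x 999, 32 x 1000.
entries = [(250, 1000), (1, 999), (32, 1000)]
timescale = 1000  # assumed, as in the note above

times, total = decode_times(entries)
print(len(times), "samples")
print("DTS of sample 252:", times[251] / timescale, "s")
print("track duration:", total / timescale, "s")
```

Expanding every entry is the simplest approach; a player that only needs to seek could instead walk the entries and stop at the run containing the target time.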
![](https://p6-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/9d4eb5af45ab42b68662f08267c2cdef~tplv-k3u1fbpfcp-watermark.image)

### stss(Sync Sample Box)

The sample numbers of the keyframes in the mp4 file. If stss is absent, all samples are keyframes.

* entry_count: the number of entries, which can be regarded as the number of keyframes;
* sample_number: the sample number of the keyframe (counting from 1);

```
aligned(8) class SyncSampleBox extends FullBox('stss', version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i=0; i < entry_count; i++) {
        unsigned int(32) sample_number;
    }
}
```

In the example below, samples 1, 31, 61, 91, 121...271 are keyframes.

![](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/0900b5bd5b8046c7ae5781e8f85feb6d~tplv-k3u1fbpfcp-watermark.image)

### ctts(Composition Time to Sample Box)

The difference between the decoding time (dts) and the rendering time (pts).

For video with only I-frames and P-frames, the decoding order and the rendering order are the same, so ctts does not need to exist. For video with B-frames, ctts must exist. Whenever PTS and DTS differ, ctts is needed; the formula is CT(n) = DT(n) + CTTS(n).

```
aligned(8) class CompositionOffsetBox extends FullBox('ctts', version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i=0; i < entry_count; i++) {
        unsigned int(32) sample_count;
        unsigned int(32) sample_offset;
    }
}
```

An example follows; it is not explained further:

![](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/d6a54659994b47339fbffddd8f5397ae~tplv-k3u1fbpfcp-watermark.image)

## fMP4(Fragmented mp4)

fMP4 has the same basic file structure as ordinary mp4. Ordinary mp4 is used for on-demand scenarios, while fMP4 is usually used for live streaming.
They differ in the following ways:

* The duration and content of an ordinary mp4 are usually fixed. The duration and content of an fMP4 are usually not fixed, and it can be played while it is being generated;
* In an ordinary mp4, the complete metadata is all in moov; the moov box has to be loaded before the media data in mdat can be decoded and rendered;
* In fMP4, the metadata of the media data is in moof boxes, and moof and mdat (usually) appear in pairs. moof contains the sample duration, sample size and other information, which is why fMP4 can be played while it is being generated;

For example, the top-level box structure of an ordinary mp4 and of an fMP4 might look like the following. The listing was printed by a small MP4 parsing tool written by the author; the code is given at the end of the article.

```
// Ordinary mp4
ftyp size=32(8+24) curTotalSize=32
moov size=4238(8+4230) curTotalSize=4270
mdat size=1124105(8+1124097) curTotalSize=1128375

// fmp4
ftyp size=36(8+28) curTotalSize=36
moov size=1227(8+1219) curTotalSize=1263
moof size=1252(8+1244) curTotalSize=2515
mdat size=65895(8+65887) curTotalSize=68410
moof size=612(8+604) curTotalSize=69022
mdat size=100386(8+100378) curTotalSize=169408
```

How do you tell whether an mp4 file is an ordinary mp4 or an fMP4? Generally you can check whether mvex (Movie Extends Box) is present.

![](https://p1-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/1e3c4d17857f464ea0afe8da5d2e677c~tplv-k3u1fbpfcp-watermark.image)

## mvex(Movie Extends Box)

When mvex is present, the current file is an fmp4 (this test is not rigorous). In that case the sample metadata is not in moov and has to be obtained by parsing the moof boxes.

The pseudo-code is as follows:

```
aligned(8) class MovieExtendsBox extends Box('mvex'){ }
```

### mehd(Movie Extends Header Box)

mehd is optional and declares the full duration of the movie (fragment_duration). If it is absent, all fragments have to be traversed to obtain the full duration.
In fmp4 scenarios, fragment_duration generally cannot be known in advance.

```
aligned(8) class MovieExtendsHeaderBox extends FullBox('mehd', version, 0) {
    if (version==1) {
        unsigned int(64) fragment_duration;
    } else { // version==0
        unsigned int(32) fragment_duration;
    }
}
```

### trex(Track Extends Box)

Sets various default values for the samples of an fMP4, such as duration and size.

```
aligned(8) class TrackExtendsBox extends FullBox('trex', 0, 0){
    unsigned int(32) track_ID;
    unsigned int(32) default_sample_description_index;
    unsigned int(32) default_sample_duration;
    unsigned int(32) default_sample_size;
    unsigned int(32) default_sample_flags;
}
```

The fields mean the following:

* track_ID: the ID of the corresponding track, such as the ID of the video track or the audio track;
* default_sample_description_index: the default sample description index (pointing into stsd);
* default_sample_duration: the default sample duration, generally 0;
* default_sample_size: the default sample size, generally 0;
* default_sample_flags: the default sample flags, generally 0;

default_sample_flags occupies 4 bytes and is more complicated. Its structure is as follows:

> In older versions of the specification the first 6 bits are reserved; in the newer specification only the first 4 bits are reserved. The meaning of is_leading is not very intuitive and is covered in the next section.
* reserved: 4 bits, reserved;
* is_leading: 2 bits, whether the sample is a leading sample. Possible values:
  * 0: it is unknown whether the current sample is a leading sample (this is the value generally used);
  * 1: the current sample is a leading sample and depends on a sample before the referenced I-frame, so it cannot be decoded;
  * 2: the current sample is not a leading sample;
  * 3: the current sample is a leading sample and does not depend on any sample before the referenced I-frame, so it can be decoded;
* sample_depends_on: 2 bits, whether the sample depends on other samples. Possible values:
  * 0: it is unknown whether it depends on other samples;
  * 1: it depends on other samples (not an I-frame);
  * 2: it does not depend on other samples (an I-frame);
  * 3: reserved;
* sample_is_depended_on: 2 bits, whether other samples depend on this one. Possible values:
  * 0: it is unknown whether other samples depend on the current sample;
  * 1: other samples may depend on the current sample;
  * 2: no other sample depends on the current sample;
  * 3: reserved;
* sample_has_redundancy: 2 bits, whether there is redundant coding. Possible values:
  * 0: it is unknown whether there is redundant coding;
  * 1: there is redundant coding;
  * 2: there is no redundant coding;
  * 3: reserved;
* sample_padding_value: 3 bits, the padding value;
* sample_is_non_sync_sample: 1 bit, set when the sample is not a keyframe;
* sample_degradation_priority: 16 bits, the degradation priority (generally relevant when problems occur during transmission);

An example:

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/7c9729a775a84c0b9b82fc476372a427~tplv-k3u1fbpfcp-watermark.image)

### About is_leading

is_leading is not particularly easy to explain, so the original text of the specification is pasted here to help.

>A leading sample (usually a picture in video) is defined relative to a reference sample, which is the immediately prior sample that is marked as “sample_depends_on” having no dependency (an I picture).
A leading sample has both a composition time before the reference sample, and possibly also a decoding dependency on a sample before the reference sample. Therefore if, for example, playback and decoding were to start at the reference sample, those samples marked as leading would not be needed and might not be decodable. A leading sample itself must therefore not be marked as having no dependency. For the convenience of explanation , Below leading frame Correspondence leading sample,referenced frame Correspondence referenced samle. With H264 Code For example ,H264 in I Frame 、P Frame 、B Frame . Because of B Frame The existence of , Video frame Decoding order 、 Rendering order It may not be the same . mp4 One of the characteristics of Archives , That is to support random position playback . such as , On the video website , You can drag the progress bar to fast forward . Many times , The moment the progress bar is positioned , Not necessarily I Frame . In order to be able to play , You need to look forward to the nearest one I Frame , If possible , From the latest I Frame Start decoding and playing ( That is to say , Not necessarily the closest from the front I Frame play ). Locate the frame described above at this moment , It's called leading frame.leading frame The nearest one ahead I Frame , be called referenced frame. Look back on is_leading For 1 or 3 Situation of , All the same leading frame, When can decode (decodable), When can't decode (not decodable)? >1: this sample is a leading sample that has a dependency before the referenced I‐picture (and is therefore not decodable); >3: this sample is a leading sample that has no dependency before the referenced I‐picture (and is therefore decodable); 1、is_leading For 1 Example : As shown below , Frame 2(leading frame) Decoding depends on Frame 1、 Frame 3(referenced frame). In the video stream , From Frame 2 Look forward to , Current I Frame yes Frame 3. Even if it's decoded Frame 3, Frame 2 I can't figure it out . 
![](https://p1-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/f8e86ab8545e4d98a302239ec4e6646f~tplv-k3u1fbpfcp-watermark.image)

2. Example with is_leading equal to 3: as shown below, in this case Frame 2 (the leading frame) can be decoded.

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/e268b132dfc5473ba225e74da3a186d9~tplv-k3u1fbpfcp-watermark.image)

## moof(Movie Fragment Box)

moof is a container box. The related metadata lives in the boxes nested inside it, such as mfhd, tfhd, and trun.

The pseudocode is as follows:

```
aligned(8) class MovieFragmentBox extends Box('moof'){
}
```

![](https://p1-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/a12d514de0264993bc7a585e4a771ca3~tplv-k3u1fbpfcp-watermark.image)

### mfhd(Movie Fragment Header Box)

The structure is fairly simple. sequence_number is the sequence number of the movie fragment. It starts at 1 and increases in the order in which the movie fragments are produced.

```
aligned(8) class MovieFragmentHeaderBox extends FullBox('mfhd', 0, 0){
    unsigned int(32) sequence_number;
}
```

### traf(Track Fragment Box)

```
aligned(8) class TrackFragmentBox extends Box('traf'){
}
```

In fMP4, the media data is split into multiple movie fragments. One movie fragment can contain several track fragments (each track has 0 or more track fragments). Each track fragment can contain multiple samples of its track.

> Each track fragment contains multiple track runs, and each track run represents a contiguous set of samples.

![](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/108fa9e7b05a4be99541472e3f38cdf2~tplv-k3u1fbpfcp-watermark.image)

### tfhd(Track Fragment Header Box)

tfhd sets the default values for the metadata of the samples in a track fragment.

The pseudocode is as follows. Apart from track_ID, all fields are optional.
```
aligned(8) class TrackFragmentHeaderBox extends FullBox('tfhd', 0, tf_flags){
    unsigned int(32) track_ID;
    // all the following are optional fields
    unsigned int(64) base_data_offset;
    unsigned int(32) sample_description_index;
    unsigned int(32) default_sample_duration;
    unsigned int(32) default_sample_size;
    unsigned int(32) default_sample_flags;
}
```

sample_description_index, default_sample_duration, and default_sample_size need no further explanation, so only tf_flags and base_data_offset are discussed here.

First, tf_flags. The individual flag values are listed below (they are combined with a bitwise OR):

* 0x000001 base-data-offset-present: the base_data_offset field is present. It gives the base offset of the data position relative to the whole file.
* 0x000002 sample-description-index-present: the sample_description_index field is present;
* 0x000008 default-sample-duration-present: the default_sample_duration field is present;
* 0x000010 default-sample-size-present: the default_sample_size field is present;
* 0x000020 default-sample-flags-present: the default_sample_flags field is present;
* 0x010000 duration-is-empty: there are no samples for the current time interval; default_sample_duration, if present, is 0;
* 0x020000 default-base-is-moof: if base-data-offset-present is 1, this flag is ignored. If base-data-offset-present is 0, the base_data_offset of the current track fragment is calculated from the first byte of the moof;

The position of a sample is calculated as base_data_offset + data_offset, where data_offset is defined by each track run (see trun) rather than by tfhd. If base_data_offset is not provided explicitly, the sample position is usually relative to the moof.

For example, tf_flags equal to 57 (0x39 = 0x01 + 0x08 + 0x10 + 0x20) means that base_data_offset, default_sample_duration, default_sample_size, and default_sample_flags are present.
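As a minimal sketch of how the tf_flags bits drive tfhd parsing, the snippet below decodes a flag value and then reads only the optional fields whose bits are set. The names TF_FLAGS, decodeTfFlags, and parseTfhdBody are hypothetical helpers for illustration, not part of the author's tool, and the buffer is assumed to start right after the FullBox header (size, type, version, flags).

```javascript
// Bit masks taken from the tf_flags list; the property names are made up for this sketch.
const TF_FLAGS = {
  baseDataOffsetPresent: 0x000001,
  sampleDescriptionIndexPresent: 0x000002,
  defaultSampleDurationPresent: 0x000008,
  defaultSampleSizePresent: 0x000010,
  defaultSampleFlagsPresent: 0x000020,
  durationIsEmpty: 0x010000,
  defaultBaseIsMoof: 0x020000,
};

// Return the names of all flags whose bit is set in tfFlags.
function decodeTfFlags(tfFlags) {
  return Object.keys(TF_FLAGS).filter((name) => (tfFlags & TF_FLAGS[name]) !== 0);
}

// Read the tfhd fields that follow the FullBox header.
// Assumption: `body` begins at track_ID.
function parseTfhdBody(body, tfFlags) {
  let offset = 0;
  const fields = { trackId: body.readUInt32BE(offset) };
  offset += 4;
  if (tfFlags & TF_FLAGS.baseDataOffsetPresent) {
    // 64-bit field; Number() is exact only below 2^53, which is enough for a sketch.
    fields.baseDataOffset = Number(body.readBigUInt64BE(offset));
    offset += 8;
  }
  if (tfFlags & TF_FLAGS.sampleDescriptionIndexPresent) {
    fields.sampleDescriptionIndex = body.readUInt32BE(offset);
    offset += 4;
  }
  if (tfFlags & TF_FLAGS.defaultSampleDurationPresent) {
    fields.defaultSampleDuration = body.readUInt32BE(offset);
    offset += 4;
  }
  if (tfFlags & TF_FLAGS.defaultSampleSizePresent) {
    fields.defaultSampleSize = body.readUInt32BE(offset);
    offset += 4;
  }
  if (tfFlags & TF_FLAGS.defaultSampleFlagsPresent) {
    fields.defaultSampleFlags = body.readUInt32BE(offset);
    offset += 4;
  }
  return fields;
}

// Usage: build a synthetic tfhd body for tf_flags = 57 and parse it back.
const body = Buffer.alloc(24);
body.writeUInt32BE(1, 0);           // track_ID
body.writeBigUInt64BE(1263n, 4);    // base_data_offset
body.writeUInt32BE(1024, 12);       // default_sample_duration
body.writeUInt32BE(0, 16);          // default_sample_size
body.writeUInt32BE(0x01010000, 20); // default_sample_flags
console.log(decodeTfFlags(57));
console.log(parseTfhdBody(body, 57));
```

With tf_flags = 57, decodeTfFlags reports the base_data_offset, default_sample_duration, default_sample_size, and default_sample_flags bits as set, and parseTfhdBody reads exactly those fields in the order the pseudocode declares them.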
![](https://p6-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/f34233d518c9489a9b624112356a22e6~tplv-k3u1fbpfcp-watermark.image)

base_data_offset is 1263 (the sizes of ftyp and moov add up to 1263).

![](https://p6-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/7c5d1f7ada804204a194e47e85a87b5f~tplv-k3u1fbpfcp-watermark.image)

### trun(Track Fragment Run Box)

The pseudocode of trun is as follows:

```
aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
    unsigned int(32) sample_count;
    // the following are optional fields
    signed int(32) data_offset;
    unsigned int(32) first_sample_flags;
    // all fields in the following array are optional
    {
        unsigned int(32) sample_duration;
        unsigned int(32) sample_size;
        unsigned int(32) sample_flags;
        if (version == 0) {
            unsigned int(32) sample_composition_time_offset;
        } else {
            signed int(32) sample_composition_time_offset;
        }
    }[ sample_count ]
}
```

As mentioned earlier, a track run represents a contiguous set of samples, where:

* sample_count: the number of samples;
* data_offset: the offset of the data part;
* first_sample_flags: optional, the flags for the first sample in the current track run;

The tr_flags values are listed below. They are also combined with a bitwise OR:

* 0x000001 data-offset-present: the data_offset field is present;
* 0x000004 first-sample-flags-present: the first_sample_flags field is present. Its value overrides the flags of the first sample only, so the first frame of a group of samples can be marked as a keyframe while the others are marked as non-keyframes. When first_sample_flags is present, sample_flags is not;
* 0x000100 sample-duration-present: each sample has its own sample_duration; otherwise the default value is used;
* 0x000200 sample-size-present: each sample has its own sample_size; otherwise the default value is used;
* 0x000400 sample-flags-present: each sample has its own sample_flags; otherwise the default value is used;
* 0x000800 sample-composition-time-offsets-present: each sample has its own sample_composition_time_offset;

Here is an example where tr_flags is 2565 (0xA05 = 0x800 + 0x200 + 0x004 + 0x001). In this case data_offset, first_sample_flags, sample_size, and sample_composition_time_offset are present.

![](https://p1-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/544c682c84034ccc8f020046418a5da4~tplv-k3u1fbpfcp-watermark.image)

![](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/380fd57d872848148bd51b9fd82d45bb~tplv-k3u1fbpfcp-watermark.image)

## Programming practice: parsing the MP4 file structure

Reading alone only goes so far; real understanding comes from writing the code yourself.

Following the MP4 specification, you can write a simple MP4 file analysis tool. For example, the earlier comparison of the box structure of ordinary MP4 and fMP4 was produced by an analysis script written by the author.

The core code is shown below. The complete code is a bit long and can be found on [the author's GitHub](https://github.com/chyingp/blog/tree/master/demo/2020.08.24-mp4-analyze).
```javascript
class Box {
  constructor(boxType, extendedType, buffer) {
    this.type = boxType; // required, string, 4 bytes, box type
    this.size = 0; // required, integer, 4 bytes, box size in bytes
    this.headerSize = 8;
    this.boxes = [];

    // this.largeSize = 0; // optional, 8 bytes
    // this.extendedType = extendedType || boxType; // optional, 16 bytes
    this._initialize(buffer);
  }

  _initialize(buffer) {
    this.size = buffer.readUInt32BE(0); // 4 bytes
    this.type = buffer.slice(4, 8).toString(); // 4 bytes

    let offset = 8;

    if (this.size === 1) {
      // size == 1: the real size is stored in the 8-byte largesize field.
      // readUIntBE() handles at most 6 bytes, so read a 64-bit BigInt instead.
      this.size = Number(buffer.readBigUInt64BE(8));
      this.headerSize += 8;
      offset = 16;
    } else if (this.size === 0) {
      // size == 0: last box, it extends to the end of the buffer
      this.size = buffer.byteLength;
    }

    if (this.type === 'uuid') {
      this.type = buffer.slice(offset, offset + 16); // 16 bytes, extended type
      this.headerSize += 16;
    }
  }

  setInnerBoxes(buffer, offset = 0) {
    const innerBoxes = getInnerBoxes(buffer.slice(this.headerSize + offset, this.size));

    innerBoxes.forEach(item => {
      let { type, buffer } = item;

      // Note: some box types are shorter than four letters, such as url and urn
      type = type.trim();

      if (this[type]) {
        const box = this[type](buffer);
        this.boxes.push(box);
      } else {
        this.boxes.push('TODO: not implemented yet');
        // console.log(`unknown type: ${type}`);
      }
    });
  }
}

class FullBox extends Box {
  constructor(boxType, buffer) {
    super(boxType, '', buffer);

    const headerSize = this.headerSize;

    this.version = buffer.readUInt8(headerSize); // required, 1 byte
    this.flags = buffer.readUIntBE(headerSize + 1, 3); // required, 3 bytes

    this.headerSize = headerSize + 4;
  }
}

// FileTypeBox, MovieBox, MediaDataBox, and MovieFragmentBox are a bit long, so they are not pasted here
class Movie {
  constructor(buffer) {
    this.boxes = [];
    this.bytesConsumed = 0;

    const innerBoxes = getInnerBoxes(buffer);

    innerBoxes.forEach(item => {
      const { type, buffer, size } = item;
      if (this[type]) {
        const box = this[type](buffer);
        this.boxes.push(box);
      } else {
        // custom box type
      }
      this.bytesConsumed += size;
    });
  }

  ftyp(buffer) {
    return new FileTypeBox(buffer);
  }

  moov(buffer) {
    return new MovieBox(buffer);
  }

  mdat(buffer) {
    return new MediaDataBox(buffer);
  }

  moof(buffer) {
    return new MovieFragmentBox(buffer);
  }
}

function getInnerBoxes(buffer) {
  const boxes = [];
  const totalByteLen = buffer.byteLength;
  let offset = 0;

  // A box header needs at least 8 bytes; stop when fewer remain
  while (offset + 8 <= totalByteLen) {
    const box = getBox(buffer, offset);
    boxes.push(box);
    if (box.size < 8) break; // malformed size, avoid an endless loop
    offset += box.size;
  }

  return boxes;
}

function getBox(buffer, offset = 0) {
  let size = buffer.readUInt32BE(offset); // 4 bytes
  const type = buffer.slice(offset + 4, offset + 8).toString(); // 4 bytes

  if (size === 1) {
    size = Number(buffer.readBigUInt64BE(offset + 8)); // 8 bytes, largesize
  } else if (size === 0) {
    size = totalSizeFrom(buffer, offset); // last box, extends to the end
  }

  const boxBuffer = buffer.slice(offset, offset + size);

  return { size, type, buffer: boxBuffer };
}

// Number of bytes from offset to the end of the buffer
function totalSizeFrom(buffer, offset) {
  return buffer.byteLength - offset;
}
```

## Closing remarks

Because of limited time, and to keep the explanation simple, some of the content may not be fully rigorous. If you find mistakes or omissions, please point them out. Questions are also welcome at any time.

## Related links

ISO/IEC 14496-12:2015 Information technology - Coding of audio-visual objects - Part 12: ISO base media file format
https://www.iso.org/standard/68960.html

Introduction to QuickTime File Format Specification
https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFPreface/qtffPreface.html#//apple_ref/doc/uid/TP40000939-CH202-TPXREF101

AVC_(file_format)
http://fileformats.archiveteam.org/wiki/AVC_(file_format)

AV1 Codec ISO Media File Format Binding
https://aomediacodec.github.io/av1-isobmff/

Copyright notice
This article was created by [itread01]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2020/12/20201208105334079s.html