Understanding Unicode, byte order, and UTF-* in one go

2020-11-14 20:36:32 Tomson

Background reading:

What is a character set

As the name suggests, a character set is a set of characters.

What is ASCII

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet. It is mainly used to represent modern English and other Western European languages.

Inside a computer, all data is stored and processed as binary numbers (because the hardware represents 1 and 0 with high and low voltage levels). That includes the 52 letters a, b, c, d and so on (counting uppercase), the digits 0, 1 and so on, and common symbols such as *, #, and @.
Since binary numbers are used to represent characters, someone has to decide which number stands for which symbol. Anyone can define their own mapping (this is called an encoding), but to communicate without confusion everyone has to use the same rules. For that reason the US standards body published ASCII, which specifies which binary numbers represent the common symbols above.

ASCII was developed by the American National Standards Institute (ANSI). It is a standard single-byte character encoding scheme for text data.
It began as a US national standard, a common Western character encoding that different computers follow in order to communicate with each other. It was later adopted by the International Organization for Standardization (ISO) as an international standard, ISO 646.

Put simply, the ASCII table is the most basic code table, or character set. It contains letters, digits, symbols, and some control characters.

What is Unicode

Unicode is a character encoding scheme developed by international organizations that can accommodate all the characters and symbols in the world. The set of characters produced by the Unicode rules is called the Unicode character set.

Computers can only process numbers (0 and 1), so text must be converted to numbers before it can be processed. Early computers were designed with 8 bits to a byte.
The largest integer one byte can represent is 255 (2^8 - 1 = 255). ASCII occupies 0 to 127 and represents uppercase and lowercase letters, digits, and symbols. For example, the code of uppercase A is 65 and the code of lowercase z is 122.
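The codes quoted above can be checked directly in JavaScript. This is a small sketch; the helper name isAscii is an illustrative assumption, not something from the original article.

```javascript
// Look up the numeric code of a character, and map a code back to a character.
const codeOfA = "A".charCodeAt(0);
const codeOfZ = "z".charCodeAt(0);
const fromCode = String.fromCharCode(65);

// A character is within the ASCII range when its code fits in 7 bits (0 to 127).
function isAscii(text) {
  for (const ch of text) {
    if (ch.codePointAt(0) > 0x7f) return false;
  }
  return true;
}

console.log(codeOfA, codeOfZ, fromCode); // 65 122 A
console.log(isAscii("Hello"), isAscii("\u6C49")); // true false
```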

Representing Chinese obviously takes more than one byte. At least two bytes are needed, and they must not conflict with ASCII, so China created the GB2312 encoding for Chinese text.
Other languages such as Japanese and Korean have the same problem. Unicode emerged to unify the encoding of all scripts: it brings every language into a single code set, so the confusion goes away.

Unicode uses multiple bytes to represent a character. To convert existing single-byte English text to a double-byte encoding, simply fill the high byte with 0. A Unicode character is usually written as "U+" followed by a group of hexadecimal digits.

Unicode currently arranges its characters in 17 groups covering 0x0000 to 0x10FFFF. Each group is called a plane, and each plane has 65,536 (2^16) code points, for 1,114,112 in total. So far only a few planes are in use.

Plane | Code point range | Name | Abbreviation
Plane 0 | U+0000 - U+FFFF | Basic Multilingual Plane | BMP
Plane 1 | U+10000 - U+1FFFF | Supplementary Multilingual Plane | SMP
Plane 2 | U+20000 - U+2FFFF | Supplementary Ideographic Plane | SIP
Plane 3 | U+30000 - U+3FFFF | Tertiary Ideographic Plane | TIP
Planes 4 to 13 | U+40000 - U+DFFFF | (not yet used) |
Plane 14 | U+E0000 - U+EFFFF | Supplementary Special-purpose Plane | SSP
Plane 15 | U+F0000 - U+FFFFF | Private Use Area-A (reserved for private use) | PUA-A
Plane 16 | U+100000 - U+10FFFF | Private Use Area-B (reserved for private use) | PUA-B
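Because every plane holds exactly 0x10000 code points, the plane number of a code point is just its value shifted right by 16 bits. A minimal sketch, where the function name planeOf is an assumption made for illustration:

```javascript
// Each plane holds 0x10000 (65,536) code points, so the plane number
// is the code point divided by 0x10000, rounded down.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a Unicode code point");
  }
  return codePoint >> 16;
}

console.log(planeOf(0x6c49));   // 0  (BMP)
console.log(planeOf(0x20bb7));  // 2  (SIP)
console.log(planeOf(0x10ffff)); // 16 (PUA-B)
console.log(17 * 0x10000);      // 1114112 code points in total
```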

Planes 15 and 16 each define a single Private Use Area of 65,534 code points: 0xF0000 - 0xFFFFD and 0x100000 - 0x10FFFD. A Private Use Area is a range reserved for custom characters; it is abbreviated PUA.

Plane 0 also contains a special range:

  • 0xD800 - 0xDFFF, 2,048 code points in total, is a special range called the surrogate area. Its purpose is to let two UTF-16 code units represent a character outside the BMP.

Unicode only specifies the code point of each character. Which byte sequence is used to represent that code point is a matter of encoding.

The Unicode character set can be encoded in several ways, namely UTF-8, UTF-16, UTF-32, and others.

What is UTF-32

Any Unicode code point fits in 4 bytes. If every code point is represented by four bytes, the byte content corresponds one-to-one with the code point. This encoding is called UTF-32. For example, code point 0 is represented by four zero bytes, and code point U+597D is 59 7D with two zero bytes in front: 00 00 59 7D.

The advantage of UTF-32 is that the conversion rule is simple and intuitive, and lookups are efficient. The disadvantage is wasted space: the same English text is four times larger than in ASCII. This drawback is serious enough that UTF-32 is rarely used for storage or transmission, and the HTML5 standard explicitly states that web pages must not be encoded as UTF-32.
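The one-to-one mapping can be sketched with a DataView. The helper name utf32beBytes is hypothetical, and the sketch assumes big-endian output (UTF-32BE) so that the bytes match the 00 00 59 7D example above.

```javascript
// Encode one code point as four big-endian bytes (UTF-32BE)
// and format them as hexadecimal.
function utf32beBytes(codePoint) {
  const view = new DataView(new ArrayBuffer(4));
  view.setUint32(0, codePoint, false); // false = big-endian
  return Array.from(new Uint8Array(view.buffer), (b) =>
    b.toString(16).padStart(2, "0").toUpperCase()
  ).join(" ");
}

console.log(utf32beBytes(0x0000)); // 00 00 00 00
console.log(utf32beBytes(0x597d)); // 00 00 59 7D
```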

What people really need is a space-efficient encoding, and this led to the birth of UTF-8.

What is UTF-8

UTF-8 is a variable-length encoding. A character takes anywhere from 1 to 4 bytes.

The more commonly used a character is, the fewer bytes it takes. The first 128 characters use only 1 byte each, the same as in ASCII.

UTF-8 encodes Unicode in units of bytes. The encoding rules are as follows:

  • For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, therefore, UTF-8 and ASCII are identical.
  • For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits hold the symbol's Unicode code point.

Unicode code point range (hexadecimal) | UTF-8 byte sequence (binary) | Number of x bits
0000 0000 - 0000 007F | 0xxxxxxx | 7
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx | 11
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 16
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21

The rules show that UTF-8 is not simply the hexadecimal Unicode code point written out in binary; the code point is transformed by the encoding. The largest code point, 0x10FFFF, fits in 3 bytes as a plain number, but after conversion to UTF-8 it needs up to 4 bytes.

The advantage of this rule is that a parser can quickly determine how many bytes the current character needs. The maximum length of a UTF-8 code is 4 bytes.
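How a parser reads the length from the first byte can be sketched as follows. The function name utf8SequenceLength and the convention of returning 0 for a byte that cannot start a sequence are assumptions made for this example.

```javascript
// Determine the length of a UTF-8 sequence from its first byte
// by checking the leading bits against each template.
function utf8SequenceLength(leadByte) {
  if ((leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((leadByte & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((leadByte & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((leadByte & 0xf8) === 0xf0) return 4; // 11110xxx
  return 0; // continuation byte (10xxxxxx) or invalid lead byte
}

console.log(utf8SequenceLength(0x31)); // 1
console.log(utf8SequenceLength(0xe6)); // 3
console.log(utf8SequenceLength(0xf0)); // 4
console.log(utf8SequenceLength(0xb1)); // 0 (a continuation byte cannot start a character)
```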

UTF-8 is characterized by using codes of different lengths for different ranges of characters. For characters between 0x00 and 0x7F, the UTF-8 code is exactly the same as the ASCII code.

As the table above shows, the 4-byte template has 21 x bits, so it can hold a 21-bit binary number. The largest Unicode code point, 0x10FFFF, has only 21 bits. Encoding examples:

  • The Unicode code point of "汉" is 0x6C49. 0x6C49 lies between 0x0800 and 0xFFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 0x6C49 is 0110 1100 0100 1001. Substituting this bit stream for the x bits in the template gives 11100110 10110001 10001001, that is, E6 B1 89. So E6 B1 89 is the UTF-8 code of "汉". Note that a code point that fits in two bytes occupies 3 bytes once it is encoded as UTF-8;
  • The Unicode code point 0x20C30 lies between 0x010000 and 0x10FFFF, so it uses the 4-byte template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Written as a 21-bit binary number (padded with leading zeros), 0x20C30 is 0 0010 0000 1100 0011 0000. Substituting this bit stream for the x bits in the template gives 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.
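The two worked examples can be reproduced by filling the templates with bit operations. This is a simplified sketch: utf8Encode and hex are names invented here, and the function does not reject surrogate code points or values above 0x10FFFF the way a production encoder should.

```javascript
// Encode a single code point to UTF-8 by filling the templates from the table.
function utf8Encode(cp) {
  if (cp <= 0x7f) return [cp];                                   // 0xxxxxxx
  if (cp <= 0x7ff) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  if (cp <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Format a byte array as uppercase hexadecimal pairs.
const hex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(utf8Encode(0x6c49)));  // E6 B1 89
console.log(hex(utf8Encode(0x20c30))); // F0 A0 B0 B0
```

In real code the built-in TextEncoder does the same job: new TextEncoder().encode("汉") yields the bytes E6 B1 89.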

What is UTF-16

UTF-16 sits between UTF-32 and UTF-8, combining characteristics of fixed-length and variable-length encodings.

Its rules are simple: a character in the basic plane takes 2 bytes, and a character in a supplementary plane takes 4 bytes. In other words, a UTF-16 code is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).

This raises a question: when we encounter two bytes, how do we know whether they are a character on their own or must be read together with the following two bytes?

As mentioned in the Unicode section, the range U+D800 to U+DFFF in the basic plane is a gap: these code points do not correspond to any character. The gap can therefore be used to map the characters of the supplementary planes.

Specifically, the supplementary planes contain 2^20 code points, so these characters need 20 binary bits (the value encoded is the code point minus 0x10000). UTF-16 splits those 20 bits in half. The first 10 bits are mapped into U+D800 to U+DBFF (a range of size 2^10) and are called the high surrogate (H). The last 10 bits are mapped into U+DC00 to U+DFFF (also of size 2^10) and are called the low surrogate (L). In other words, a supplementary-plane character is split into two basic-plane code units.

So in UTF-16, when we encounter two bytes whose value lies between U+D800 and U+DBFF, we can conclude that the value of the next two bytes must lie between U+DC00 and U+DFFF, and that the four bytes must be read together.
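The split into high and low surrogates can be sketched as below. The function names toSurrogatePair and fromSurrogatePair are illustrative assumptions; the arithmetic follows the 20-bit split described above.

```javascript
// Split a supplementary-plane code point into a UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10ffff) {
    throw new RangeError("not a supplementary-plane code point");
  }
  const offset = cp - 0x10000;            // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits -> U+D800..U+DBFF
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits -> U+DC00..U+DFFF
  return [high, low];
}

// Reverse the mapping: combine a surrogate pair back into one code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const [h, l] = toSurrogatePair(0x20bb7);
console.log(h.toString(16), l.toString(16));      // d842 dfb7
console.log(fromSurrogatePair(h, l).toString(16)); // 20bb7
```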

What is byte order

Let's start with the definition of byte order, quoted from Wikipedia:

Endianness is the sequential order in which bytes are arranged into larger numerical values when stored in memory or when transmitted over digital links.

Put simply, byte order is the order of bytes. When a value larger than 1 byte is transmitted or stored, the order of its bytes has to be specified. In English, byte order is also called endianness; the marker that signals it at the start of a data stream is the byte order mark, abbreviated BOM.

The byte-order problem only arises when, under a given encoding, the unit the computer reads at a time is larger than one byte.

Byte order is generally either big-endian or little-endian, and the coexistence of the two is purely historical. Which order is used is mostly a matter of how numeric values are handled: different operations on values perform differently under each byte order. If a value is no larger than one byte, the order is of course not a concern.

Big-endian is the dominant order in network protocols. In the Internet protocol suite, for example, it is called network order, and the most significant byte is transmitted first. Conversely, little-endian is the dominant order for processor architectures (x86, most ARM implementations, base RISC-V implementations) and their associated memory.

File formats can use either order, and some formats mix the two. How does a computer know which byte order a file in the file system uses?

The Unicode specification defines a character that can be placed at the very start of a file to indicate the byte order. The character is named "zero width no-break space" and its code is FEFF.

If the first two bytes of a text file are FE FF, the file is big-endian; if the first two bytes are FF FE, the file is little-endian.

UTF byte order and the BOM

A passage from "How to teach endian":

One of the major disciplines in computer science is parsing/formatting. This is the process of converting the external format of data (file formats, network protocols, hardware registers) into the internal format (the data structures that software operates on).

An external format must be well-defined. What the first byte means must be written down somewhere, then what the second byte means, and so on. For Internet protocols, these formats are written in RFCs, such as RFC 791 for the "Internet Protocol". For file formats, these are written in documents, such as those describing GIF files, JPEG files, MPEG files, and so forth.

The way the Unicode specification recommends marking byte order is the BOM. This BOM is not the "Bill of Materials" BOM; it stands for Byte Order Mark.

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in an actual transmission.

The UCS specification suggests transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.

If the recipient then receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also known as the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to signal the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a recipient that receives a byte stream starting with EF BB BF knows it is UTF-8 encoded.
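A BOM check along these lines can be sketched in a few comparisons. The function name detectBom is an assumption, and the sketch deliberately covers only the three marks described above (UTF-8, UTF-16 BE, UTF-16 LE); it returns null when no BOM is present.

```javascript
// Inspect the first bytes of a buffer and report which BOM, if any, is present.
function detectBom(bytes) {
  const [a, b, c] = bytes;
  if (a === 0xef && b === 0xbb && c === 0xbf) return "UTF-8";
  if (a === 0xfe && b === 0xff) return "UTF-16BE";
  if (a === 0xff && b === 0xfe) return "UTF-16LE";
  return null; // no BOM: the encoding must be known some other way
}

console.log(detectBom(Uint8Array.of(0xef, 0xbb, 0xbf, 0x31))); // UTF-8
console.log(detectBom(Uint8Array.of(0xfe, 0xff, 0x62, 0x11))); // UTF-16BE
console.log(detectBom(Uint8Array.of(0xff, 0xfe, 0x11, 0x62))); // UTF-16LE
console.log(detectBom(Uint8Array.of(0x31)));                   // null
```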

An example (assuming each memory address holds one byte of data):

Encoding | Text | Hexadecimal bytes | Size in memory | Memory address
UTF-8 | 1 | 31 | 1 byte | 01
UTF-8 with BOM | 1 | EF BB BF 31 | 4 bytes | 01~04

As the table shows, if we save a text file whose only content is "1" (Unicode code point 0x31), the bytes written differ depending on whether a BOM is included. In the first row the text is stored explicitly as UTF-8. To load it again, the bytes are read back from storage (here a single byte; the file size and location come from the file's metadata), and how those bytes are displayed depends on the encoding we choose when reading. Because the data was written in a specific format (UTF-8), it must be read as UTF-8 to display the text "1" correctly. If we view the raw data in hexadecimal instead, we see 31.

Most data loaders today have a default encoding; text tools, for example, generally default to UTF-8. So that tools without a default can still identify the encoding, we can write a BOM before the data. In the table above, the text "1" with a BOM occupies 3 more bytes. With those 3 BOM bytes, a loader can tell immediately that what follows should be parsed as UTF-8.

Encoding | Text | Hexadecimal bytes | Size in memory | Memory address
UTF-8 | 我 | E6 88 91 | 3 bytes | 01~03
UTF-8 with BOM | 我 | EF BB BF E6 88 91 | 6 bytes | 01~06

The Unicode code point of the Chinese character "我" is 0x6211, which is 0110 0010 0001 0001 in binary. Its UTF-8 code follows from the UTF-8 encoding rules above.

0x6211 lies in the range 0000 0800 - 0000 FFFF, so the template is 1110xxxx 10xxxxxx 10xxxxxx. Filling in the template gives the binary 11100110 10001000 10010001, which is E6 88 91 in hexadecimal.

With UTF-16 encoding:

A UTF-16 code is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).

Encoding | Text | Hexadecimal bytes | Size in memory | Memory address
UTF-16 LE | 我 | 11 62 | 2 bytes | 01~02
UTF-16 LE with BOM | 我 | FF FE 11 62 | 4 bytes | 01~04
UTF-16 BE | 我 | 62 11 | 2 bytes | 01~02
UTF-16 BE with BOM | 我 | FE FF 62 11 | 4 bytes | 01~04

The Unicode code point of "我" is 6211. For characters in the basic plane, UTF-16 applies no special conversion: data is read and written in units of 2 bytes, and "我" occupies exactly 2 bytes.

As the table shows, with big-endian order (UTF-16 BE) the bytes stored in memory appear in the same order a person reads the number, whereas little-endian order looks less intuitive.
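The four UTF-16 rows can be reproduced with a DataView, whose setUint16 method takes an explicit endianness flag. The helper name utf16Bytes is hypothetical; the BOM is written simply by prepending the character U+FEFF to the text.

```javascript
// Serialize a string's UTF-16 code units in the requested byte order.
function utf16Bytes(text, littleEndian) {
  const view = new DataView(new ArrayBuffer(text.length * 2));
  for (let i = 0; i < text.length; i++) {
    view.setUint16(i * 2, text.charCodeAt(i), littleEndian);
  }
  return Array.from(new Uint8Array(view.buffer), (b) =>
    b.toString(16).padStart(2, "0").toUpperCase()
  ).join(" ");
}

const wo = "\u6211"; // the character used in the table above
console.log(utf16Bytes(wo, true));             // 11 62
console.log(utf16Bytes("\uFEFF" + wo, true));  // FF FE 11 62
console.log(utf16Bytes(wo, false));            // 62 11
console.log(utf16Bytes("\uFEFF" + wo, false)); // FE FF 62 11
```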

Host byte order

Network byte order and host byte order are two concepts that are often confused. Network byte order is fixed; the variety of host byte orders is usually the source of the confusion.

CPUs of the x86 family are all little-endian.

An example:

Assume each memory address holds one byte.

Example: how the double word (DWORD) 0x01020304 is stored in memory

Memory address | 4000 | 4001 | 4002 | 4003
LE (little-endian) | 04 | 03 | 02 | 01
BE (big-endian) | 01 | 02 | 03 | 04

Or:

Memory address | LE (little-endian) | BE (big-endian)
4000 | 04 | 01
4001 | 03 | 02
4002 | 02 | 03
4003 | 01 | 04

Example: if we write 0x1234abcd to memory starting at address 0x0000, the result is

Memory address | 0x0000 | 0x0001 | 0x0002 | 0x0003
LE (little-endian) | 0xcd | 0xab | 0x34 | 0x12
BE (big-endian) | 0x12 | 0x34 | 0xab | 0xcd

Or:

Memory address | LE (little-endian) | BE (big-endian)
0x0000 | 0xcd | 0x12
0x0001 | 0xab | 0x34
0x0002 | 0x34 | 0xab
0x0003 | 0x12 | 0xcd

In short:

  • From the byte's point of view: if the high-order byte sits at the high memory address, the order is little-endian; if the high-order byte sits at the low memory address, the order is big-endian;
  • From the memory's point of view: if the low memory address holds the low-order byte, the order is little-endian; if the low memory address holds the high-order byte, the order is big-endian;
  • Summary: high byte at high address, or low byte at low address, is little-endian; otherwise it is big-endian.
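Both layouts, and the byte order of the machine running the code, can be observed from JavaScript with typed arrays. This is a sketch; the names layout and hostIsLittleEndian are assumptions made for the example. A Uint32Array stores values in host byte order, while a DataView lets the caller choose.

```javascript
// Show how a 32-bit value is laid out in memory for a given byte order.
function layout(value, littleEndian) {
  const view = new DataView(new ArrayBuffer(4));
  view.setUint32(0, value, littleEndian);
  return Array.from(new Uint8Array(view.buffer), (b) =>
    b.toString(16).padStart(2, "0")
  ).join(" ");
}

// Typed arrays use the host's byte order, so the first byte of a
// Uint32Array element reveals whether the host is little-endian.
function hostIsLittleEndian() {
  const bytes = new Uint8Array(new Uint32Array([0x01020304]).buffer);
  return bytes[0] === 0x04;
}

console.log(layout(0x01020304, true));  // 04 03 02 01
console.log(layout(0x01020304, false)); // 01 02 03 04
console.log(layout(0x1234abcd, true));  // cd ab 34 12
console.log(layout(0x1234abcd, false)); // 12 34 ab cd
console.log(hostIsLittleEndian());      // depends on the machine
```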

Normally we don't need to care about host byte order, because how data is stored is the concern of the storage hardware or its driver. We just write and read data normally and the byte order is handled for us. Network byte order deserves more of our attention.

Network byte order

Network byte order is a data representation format specified by the TCP/IP protocols. It is independent of the specific CPU type and operating system, which ensures that data is interpreted correctly when transmitted between different hosts. Network byte order is big-endian.

Because network protocols use big-endian by default and everyone follows this rule, we can parse network data as big-endian and get the right information without any explicit declaration that the packet is big-endian. The premise, of course, is that the data you send is big-endian as well (if you send little-endian data, the receiver has to parse it as little-endian on purpose).
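As an illustration, the sketch below writes and reads a small header in network byte order. The 6-byte header layout (a 16-bit message type followed by a 32-bit payload length) and the names writeHeader and readHeader are hypothetical, invented for this example rather than taken from any real protocol. DataView methods are big-endian unless told otherwise, which matches network order.

```javascript
// Hypothetical 6-byte header: 16-bit message type, then 32-bit payload length,
// both in network (big-endian) byte order.
function writeHeader(type, length) {
  const view = new DataView(new ArrayBuffer(6));
  view.setUint16(0, type);   // DataView defaults to big-endian
  view.setUint32(2, length);
  return new Uint8Array(view.buffer);
}

function readHeader(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return { type: view.getUint16(0), length: view.getUint32(2) };
}

const header = writeHeader(0x0102, 0x0a0b0c0d);
console.log(Array.from(header).join(" ")); // 1 2 10 11 12 13
const parsed = readHeader(header);
console.log(parsed.type, parsed.length);   // 258 168496141
```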

File byte order

Besides host byte order and network byte order, the byte order can also be declared explicitly when a file is saved.

For example, when saving a text file we can choose the format: UTF-8 / UTF-8 with BOM / UTF-16 LE / UTF-16 LE with BOM / UTF-16 BE / UTF-16 BE with BOM / UTF-32 LE / UTF-32 LE with BOM / UTF-32 BE / UTF-32 BE with BOM / UCS….

When to care about byte order

In low-level languages such as C, we convert the memory we read into a specific numeric type by hand, so we need to know the byte order of that memory. If it is big-endian, the bytes can be taken out and put together directly to form the value; if it is little-endian, they have to be reordered first. High-level languages generally don't require us to manipulate memory directly; the runtime handles it for us.

For variables declared in everyday development, the runtime takes care of how a numeric value is represented in memory, so there is little to worry about. A value in memory may be stored big-endian or little-endian depending on the system. The same is true of character types: the value you declare is the value you use, and its in-memory representation is not your concern.

When communicating across hosts over a network, you must confirm the byte order of the data you send, and the receiver must parse it with the same byte order. The TCP/IP protocols specify big-endian.

Hashbang comments

A hashbang comment is a specialized comment syntax.

For JavaScript, hashbang comments are being standardized in ECMAScript (see the Hashbang Grammar proposal [https://github.com/tc39/propo...]).

A hashbang comment behaves exactly like a single-line (//) comment, but it begins with #! and is only valid at the absolute start of a script or module. Note also that no whitespace of any kind is permitted before the #!.

The comment consists of all the characters after #! up to the end of the first line; only one such comment is allowed. A hashbang comment specifies the path to the JavaScript interpreter that should execute the script. For example:

#!/usr/bin/env node
console.log("Hello world");

Note: hashbang comments in JavaScript mimic shebangs in Unix, which are used to run a file with the appropriate interpreter.

Although a BOM before the hashbang comment works in browsers, using a BOM in a script with a hashbang is not recommended. A BOM will not work when you try to run the script on Unix/Linux. So if you want to run scripts directly from the shell, use UTF-8 without a BOM. Use the #! comment style only to specify a JavaScript interpreter; in all other cases just use a // comment (or a multi-line comment).
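If a script has already been read into a string, a leading BOM shows up as the character U+FEFF and can be removed before the text is processed further. A minimal sketch; the function name stripBom and the sample source string are assumptions made for illustration.

```javascript
// Remove a leading BOM (U+FEFF) from decoded text, if one is present.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// A script whose first character is a BOM: the hashbang is no longer
// at the absolute start of the text.
const source = '\uFEFF#!/usr/bin/env node\nconsole.log("Hello world");';

console.log(source.startsWith("#!"));           // false
console.log(stripBom(source).startsWith("#!")); // true
```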

Using Unicode in JavaScript

Hexadecimal

> 0x00A9 // a hexadecimal number literal
169  // output is shown in decimal by default

Hexadecimal escape sequences

> '\xA9' // '\x' followed by exactly two hexadecimal digits denotes one character
"©"
> '\xA9' === String.fromCodePoint(0xA9)
true

A Unicode escape sequence requires exactly four hexadecimal digits after \u (four hexadecimal digits correspond to two bytes)

> '\u00A9' // \u followed by 4 hexadecimal digits denotes one character
"©"

Unicode code point escapes are new in ECMAScript 6. With a code point escape, any character can be written using its hexadecimal code point, up to 0x10FFFF.

With simple Unicode escapes, you usually have to write the two surrogate halves separately to get the same result. See also String.fromCodePoint() and String.prototype.codePointAt().

'\u{2F804}' // characters beyond two bytes need \u{} around the hexadecimal code point, which must not exceed 10FFFF

// using simple Unicode escapes
'\uD87E\uDC04' // two UTF-16 code units combine into one 4-byte character

Character escapes for different byte widths

'\xA9'             // a character that fits in one byte
'\u00A9'         // a character that fits in two bytes
'\u{2F804}'        // a character that needs more than two bytes

Two \x escapes cannot be combined into one double-byte character

> '\u01ff' // a double-byte character
"ǿ"

> '\x01\xff' // \x01 is a non-printable character; this is simply a string of two characters
"\x01ÿ"

Two \u escapes can be combined into one four-byte character

// "𠮷" is a 4-byte character in UTF-16; its code point is U+20BB7
> "𠮷".charCodeAt(0) // the first two bytes (high surrogate), in decimal
55362
> "𠮷".charCodeAt(1) // the next two bytes (low surrogate), in decimal
57271

> (55362).toString(16)  // the same value in hexadecimal
"d842"
> (57271).toString(16)    // the same value in hexadecimal
"dfb7"

> '\ud842\udfb7'
"𠮷"

> "𠮷".codePointAt(0) // the full code point, in decimal
134071

> (134071).toString(16) // the same value in hexadecimal
"20bb7"

Related APIs:

  • str.charCodeAt
  • str.codePointAt
  • str.charAt
  • String.fromCodePoint()
  • String.fromCharCode
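The difference between the code-unit APIs (charCodeAt, fromCharCode, length) and the code-point APIs (codePointAt, fromCodePoint, string iteration) shows up as soon as a string contains a character outside the BMP. A short sketch using the same character, U+20BB7; the variable names are arbitrary.

```javascript
const text = "a\u{20BB7}b"; // "a", U+20BB7, "b"

// length counts UTF-16 code units; iteration walks code points.
console.log(text.length);      // 4
console.log([...text].length); // 3

// List every code point in U+XXXX notation.
const codePoints = Array.from(text, (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(codePoints.join(" ")); // U+0061 U+20BB7 U+0062

// fromCodePoint builds the surrogate pair; fromCharCode keeps only the low 16 bits.
console.log(String.fromCodePoint(0x20bb7) === "\ud842\udfb7"); // true
console.log(String.fromCharCode(0x20bb7) === "\u0bb7");        // true
```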

JavaScript has 6 ways to represent a character: the literal itself plus the five escape forms below.

'\z' === 'z'  // true
'\172' === 'z' // true
'\x7A' === 'z' // true
'\u007A' === 'z' // true
'\u{7A}' === 'z' // true

ASCII table reference

Bin (binary) | Oct (octal) | Dec (decimal) | Hex (hexadecimal) | Abbreviation / character | Description
0000 0000 | 00 | 0 | 0x00 | NUL (null) | Null character
0000 0001 | 01 | 1 | 0x01 | SOH (start of heading) | Start of heading
0000 0010 | 02 | 2 | 0x02 | STX (start of text) | Start of text
0000 0011 | 03 | 3 | 0x03 | ETX (end of text) | End of text
0000 0100 | 04 | 4 | 0x04 | EOT (end of transmission) | End of transmission
0000 0101 | 05 | 5 | 0x05 | ENQ (enquiry) | Enquiry
0000 0110 | 06 | 6 | 0x06 | ACK (acknowledge) | Acknowledge
0000 0111 | 07 | 7 | 0x07 | BEL (bell) | Bell
0000 1000 | 010 | 8 | 0x08 | BS (backspace) | Backspace
0000 1001 | 011 | 9 | 0x09 | HT (horizontal tab) | Horizontal tab
0000 1010 | 012 | 10 | 0x0A | LF (NL line feed, new line) | Line feed
0000 1011 | 013 | 11 | 0x0B | VT (vertical tab) | Vertical tab
0000 1100 | 014 | 12 | 0x0C | FF (NP form feed, new page) | Form feed
0000 1101 | 015 | 13 | 0x0D | CR (carriage return) | Carriage return
0000 1110 | 016 | 14 | 0x0E | SO (shift out) | Shift out
0000 1111 | 017 | 15 | 0x0F | SI (shift in) | Shift in
0001 0000 | 020 | 16 | 0x10 | DLE (data link escape) | Data link escape
0001 0001 | 021 | 17 | 0x11 | DC1 (device control 1) | Device control 1
0001 0010 | 022 | 18 | 0x12 | DC2 (device control 2) | Device control 2
0001 0011 | 023 | 19 | 0x13 | DC3 (device control 3) | Device control 3
0001 0100 | 024 | 20 | 0x14 | DC4 (device control 4) | Device control 4
0001 0101 | 025 | 21 | 0x15 | NAK (negative acknowledge) | Negative acknowledge
0001 0110 | 026 | 22 | 0x16 | SYN (synchronous idle) | Synchronous idle
0001 0111 | 027 | 23 | 0x17 | ETB (end of trans. block) | End of transmission block
0001 1000 | 030 | 24 | 0x18 | CAN (cancel) | Cancel
0001 1001 | 031 | 25 | 0x19 | EM (end of medium) | End of medium
0001 1010 | 032 | 26 | 0x1A | SUB (substitute) | Substitute
0001 1011 | 033 | 27 | 0x1B | ESC (escape) | Escape
0001 1100 | 034 | 28 | 0x1C | FS (file separator) | File separator
0001 1101 | 035 | 29 | 0x1D | GS (group separator) | Group separator
0001 1110 | 036 | 30 | 0x1E | RS (record separator) | Record separator
0001 1111 | 037 | 31 | 0x1F | US (unit separator) | Unit separator
0010 0000 | 040 | 32 | 0x20 | (space) | Space
0010 0001 | 041 | 33 | 0x21 | ! | Exclamation mark
0010 0010 | 042 | 34 | 0x22 | " | Double quote
0010 0011 | 043 | 35 | 0x23 | # | Number sign (hash)
0010 0100 | 044 | 36 | 0x24 | $ | Dollar sign
0010 0101 | 045 | 37 | 0x25 | % | Percent sign
0010 0110 | 046 | 38 | 0x26 | & | Ampersand
0010 0111 | 047 | 39 | 0x27 | ' | Single quote (apostrophe)
0010 1000 | 050 | 40 | 0x28 | ( | Opening parenthesis
0010 1001 | 051 | 41 | 0x29 | ) | Closing parenthesis
0010 1010 | 052 | 42 | 0x2A | * | Asterisk
0010 1011 | 053 | 43 | 0x2B | + | Plus sign
0010 1100 | 054 | 44 | 0x2C | , | Comma
0010 1101 | 055 | 45 | 0x2D | - | Minus sign / hyphen
0010 1110 | 056 | 46 | 0x2E | . | Period (full stop)
0010 1111 | 057 | 47 | 0x2F | / | Slash
0011 0000 | 060 | 48 | 0x30 | 0 | Digit 0
0011 0001 | 061 | 49 | 0x31 | 1 | Digit 1
0011 0010 | 062 | 50 | 0x32 | 2 | Digit 2
0011 0011 | 063 | 51 | 0x33 | 3 | Digit 3
0011 0100 | 064 | 52 | 0x34 | 4 | Digit 4
0011 0101 | 065 | 53 | 0x35 | 5 | Digit 5
0011 0110 | 066 | 54 | 0x36 | 6 | Digit 6
0011 0111 | 067 | 55 | 0x37 | 7 | Digit 7
0011 1000 | 070 | 56 | 0x38 | 8 | Digit 8
0011 1001 | 071 | 57 | 0x39 | 9 | Digit 9
0011 1010 | 072 | 58 | 0x3A | : | Colon
0011 1011 | 073 | 59 | 0x3B | ; | Semicolon
0011 1100 | 074 | 60 | 0x3C | < | Less-than sign
0011 1101 | 075 | 61 | 0x3D | = | Equals sign
0011 1110 | 076 | 62 | 0x3E | > | Greater-than sign
0011 1111 | 077 | 63 | 0x3F | ? | Question mark
0100 0000 | 0100 | 64 | 0x40 | @ | At sign
0100 0001 | 0101 | 65 | 0x41 | A | Uppercase A
0100 0010 | 0102 | 66 | 0x42 | B | Uppercase B
0100 0011 | 0103 | 67 | 0x43 | C | Uppercase C
0100 0100 | 0104 | 68 | 0x44 | D | Uppercase D
0100 0101 | 0105 | 69 | 0x45 | E | Uppercase E
0100 0110 | 0106 | 70 | 0x46 | F | Uppercase F
0100 0111 | 0107 | 71 | 0x47 | G | Uppercase G
0100 1000 | 0110 | 72 | 0x48 | H | Uppercase H
0100 1001 | 0111 | 73 | 0x49 | I | Uppercase I
0100 1010 | 0112 | 74 | 0x4A | J | Uppercase J
0100 1011 | 0113 | 75 | 0x4B | K | Uppercase K
0100 1100 | 0114 | 76 | 0x4C | L | Uppercase L
0100 1101 | 0115 | 77 | 0x4D | M | Uppercase M
0100 1110 | 0116 | 78 | 0x4E | N | Uppercase N
0100 1111 | 0117 | 79 | 0x4F | O | Uppercase O
0101 0000 | 0120 | 80 | 0x50 | P | Uppercase P
0101 0001 | 0121 | 81 | 0x51 | Q | Uppercase Q
0101 0010 | 0122 | 82 | 0x52 | R | Uppercase R
0101 0011 | 0123 | 83 | 0x53 | S | Uppercase S
0101 0100 | 0124 | 84 | 0x54 | T | Uppercase T
0101 0101 | 0125 | 85 | 0x55 | U | Uppercase U
0101 0110 | 0126 | 86 | 0x56 | V | Uppercase V
0101 0111 | 0127 | 87 | 0x57 | W | Uppercase W
0101 1000 | 0130 | 88 | 0x58 | X | Uppercase X
0101 1001 | 0131 | 89 | 0x59 | Y | Uppercase Y
0101 1010 | 0132 | 90 | 0x5A | Z | Uppercase Z
0101 1011 | 0133 | 91 | 0x5B | [ | Opening square bracket
0101 1100 | 0134 | 92 | 0x5C | \ | Backslash
0101 1101 | 0135 | 93 | 0x5D | ] | Closing square bracket
0101 1110 | 0136 | 94 | 0x5E | ^ | Caret
0101 1111 | 0137 | 95 | 0x5F | _ | Underscore
0110 0000 | 0140 | 96 | 0x60 | ` | Grave accent (backtick)
0110 0001 | 0141 | 97 | 0x61 | a | Lowercase a
0110 0010 | 0142 | 98 | 0x62 | b | Lowercase b
0110 0011 | 0143 | 99 | 0x63 | c | Lowercase c
0110 0100 | 0144 | 100 | 0x64 | d | Lowercase d
0110 0101 | 0145 | 101 | 0x65 | e | Lowercase e
0110 0110 | 0146 | 102 | 0x66 | f | Lowercase f
0110 0111 | 0147 | 103 | 0x67 | g | Lowercase g
0110 1000 | 0150 | 104 | 0x68 | h | Lowercase h
0110 1001 | 0151 | 105 | 0x69 | i | Lowercase i
0110 1010 | 0152 | 106 | 0x6A | j | Lowercase j
0110 1011 | 0153 | 107 | 0x6B | k | Lowercase k
0110 1100 | 0154 | 108 | 0x6C | l | Lowercase l
0110 1101 | 0155 | 109 | 0x6D | m | Lowercase m
0110 1110 | 0156 | 110 | 0x6E | n | Lowercase n
0110 1111 | 0157 | 111 | 0x6F | o | Lowercase o
0111 0000 | 0160 | 112 | 0x70 | p | Lowercase p
0111 0001 | 0161 | 113 | 0x71 | q | Lowercase q
0111 0010 | 0162 | 114 | 0x72 | r | Lowercase r
0111 0011 | 0163 | 115 | 0x73 | s | Lowercase s
0111 0100 | 0164 | 116 | 0x74 | t | Lowercase t
0111 0101 | 0165 | 117 | 0x75 | u | Lowercase u
0111 0110 | 0166 | 118 | 0x76 | v | Lowercase v
0111 0111 | 0167 | 119 | 0x77 | w | Lowercase w
0111 1000 | 0170 | 120 | 0x78 | x | Lowercase x
0111 1001 | 0171 | 121 | 0x79 | y | Lowercase y
0111 1010 | 0172 | 122 | 0x7A | z | Lowercase z
0111 1011 | 0173 | 123 | 0x7B | { | Opening curly brace
0111 1100 | 0174 | 124 | 0x7C | \| | Vertical bar
0111 1101 | 0175 | 125 | 0x7D | } | Closing curly brace
0111 1110 | 0176 | 126 | 0x7E | ~ | Tilde
0111 1111 | 0177 | 127 | 0x7F | DEL (delete) | Delete

Copyright notice
This article was written by [Tomson]. Please include a link to the original when reposting. Thank you.