Interviewer: do you understand bitmap? In what scenarios have you used it? What problems have you encountered?

2020-12-06 07:03:09

Click on the above “ Zhu Xiaosi's blog ”, choice “ Set to star ”

The background to reply " book ", obtain Bit-map The basic idea is to use one bit Bit to mark the corresponding Value, and Key That's the element . Because of the adoption of Bit Store data in units , So in terms of storage space , Can save a lot .（PS： Focus on   Save storage space

Suppose there is such a need ： stay 20 Find a number out of 100 million random integers m Whether there is , And suppose 32 Bit operating system ,4G Memory

stay Java in ,int Occupy 4 byte ,1 byte =8 position （1 byte = 8 bit）

If you use int Storage , That's it 20 One hundred million int, So the space occupied is about   (2000000000*4/1024/1024/1024)≈7.45G

If bit by bit storage is different ,20 One hundred million is 20 Billion , The occupied space is about   (2000000000/8/1024/1024/1024)≈0.23G

Degree awarded , Needless to say

that , The problem is coming. , How to express a number ？

Said just now , Each bit represents a number ,0 Does not exist ,1 Indicates presence , That's exactly what binary

In this way, we can easily express {1,2,4,6} These numbers ： The minimum unit of computer memory allocation is bytes , That is to say 8 position , If it means {12,13,15} What shall I do? ？

In the other, of course 8 It means ： In this case , It's like a two-dimensional array

1 individual int Occupy 32 position , Then we only need to apply for one int The array length is int tmp[1+N/32] You can store , among N Represents the maximum of these numbers to store , So ：

tmp： Can be said 0~31

tmp： Can be said 32~63

tmp： Can be said 64~95

...

In this way , Given any integer M, that M/32 You get the subscript ,M%32 I know where it is in this subscript

Here's the problem , How can we put a number in it ？ for example , Want to put 5 Put this number in , How to do it? ？

First ,5/32=0,5%32=5, It should be in tmp Of the 5 A place , Then we put the 1 Move to the left 5 position , Then press the position or To binary is to This is equivalent to 86 | 32 = 118

86 | (1<<5) = 118

b = b | (1<<5)

in other words , To insert a number , take 1 The left shift band represents the digit , Then with the original number by bit or operation

Simplify , Namely 86 + (5/8) | (1<<(5%8))

therefore , The formula can be summed up as ：p + (i/8)|(1<<(i%8)) among ,p It means the present value ,i Indicates the number to be inserted

eliminate

Above is the addition of , So what to do if you want to clean it up ？

Or the example above , Suppose we want to 6 remove , How do you do that ？ See from the graph , Just place the number in 0 that will do

1 Move left 6 position , Just get there 6 The digit represented by this number , And then reverse by bit , Finally, it is bitwise with the original number , In this way, the position is 0 了

b = b & (~(1<<6))

b = b & (~(1<<(i%8)))

lookup

We said that before , Each represents a number ,1 Express （ Or existence ）,0 It means nothing （ Or it doesn't exist ）. By setting it to 1 perhaps 0 To add and remove guys , So to judge whether a data store exists or not is to judge the bit where the number is 0 still 1

hypothesis , We want to know 3 Is here or not , Then just judge  b & (1<<3)  If the value is 0, There is no , If it is 1, It means being

Bitmap What's the usage?

Fast sorting of large amount of data 、 lookup 、 duplicate removal

Quick sort

Suppose we want to be right 0-7 Internal 5 Elements (4,7,2,5,3) Sort （ Let's assume that these elements are not repeated ）, We can use Bit-map To achieve the purpose of sorting .

To express 8 Number , We just need 8 individual Bit（1Bytes）, First we open up 1Byte Space , Put all of these spaces Bit All bits are set to 0, Then the corresponding position is 1.

Last , Go through it Bit Area , Output the number of the one bit （2,3,4,5,7）, In this way, the purpose of sorting is achieved , Time complexity O(n).

• It's very efficient , There is no need to compare and shift ;

• Less memory , such as N=10000000; Just use... Memory N/8=1250000Byte=1.25M

shortcoming :

• All the data can't be repeated . That is, duplicate data cannot be sorted and searched .

• It's only when the data is dense that there are advantages

Get rid of the weight quickly

20 The number of non repeated integers found in billion integers , There's not enough memory for this 20 Million integers .

First , according to “ There's not enough memory for this 05 Million integers ” We can quickly associate with Bit-map. The next key question is how to design our Bit-map To show that 20 The state of a billion numbers . In fact, the problem is very simple , There are only three states of a number , There is no such thing as , only one , Repeat . therefore , We just need 2bits You can store the status of a number , Suppose we set a number that doesn't exist as 00, Once in a while 01, There are two or more times 11. Then we need storage space 2G about .

The next task is to put this 20 Put in a hundred million figures （ Storage ）, If the corresponding status bit is 00, Turn it into 01, It means that there is once ; If the corresponding status bit is 01, Turn it into 11, Indicates that there is already one , That is, many times ; If 11, Then the corresponding status bit remains the same , Still means there are many times .

Last , Statistics status bit is 01 The number of , And you get the number of numbers that don't repeat , The time complexity is O(n).

Quick search

This is what we said earlier ,int One element in the array is 4 Bytes take up 32 position , Then divide by 32 We know the subscript of the element , Yes 32 Mod （%32） Just know who it is , If that bit is 1, It means that there is .

Summary & review

Bitmap It is mainly used to quickly retrieve keyword status , The keyword is usually required to be a continuous sequence （ Or keywords are most of a sequential sequence ）, The most basic situation , Use 1bit Indicates the status of a keyword （ Two states can be marked ）, But you can also use 2bit（ Express 4 States ）,3bit（ Express 8 States ）.

Bitmap The main applications of ： Means continuous （ Or close to continuous , That is, most of them will appear ） The status of the keyword sequence （ Number of States / Number of keywords   The smaller the better. ）.

32 On the bit machine , For an integer number , such as int a=1 In memory 32bit position , This is for the convenience of computer operation . But for some application scenarios , It's a huge waste , Because we can use the corresponding 32bit Bits correspond to decimal bits 0-31 Number , And this is Bit-map The basic idea of .Bit-map The algorithm uses this idea to process a large number of data sorting 、 Inquiry and de duplication .

Without overflowing the numbers , For positive and negative numbers , Moving one bit to the left is equivalent to multiplying by 2 Of 1 Power , Move left n Bit is equal to times 2 Of n Power , To shift one bit to the right is to divide 2, Move right n Bits are equal to dividing by 2 Of n Power .

<< Move left , Equivalent to times 2 Of n Power , for example ：1<<6    amount to 1×64=64,3<<4 amount to 3×16=48

>> Move right , Equivalent to divided by 2 Of n Power , for example ：64>>3 amount to 64÷8=8

^  Exclusive or , It's equivalent to finding the remainder , for example ：48^32 amount to 48%32=16

Don't use third-party variables , Exchange the values of two variables

//  Mode one
a = a + b;
b = a - b;
a = a - b;

//  Mode two
a = a ^ b;
b = a ^ b;
a = a ^ b;

BitSet

BitSet It implements a bit vector , It can grow as needed . Every bit has a Boolean value . One BitSet The bits of can be indexed by nonnegative integers （PS： It means that every bit can represent a non negative integer ）. Can find 、 Set up 、 Clear a person . You can modify another... With logical operators BitSet The content of . By default , All bits have a default value false.     You can see , It's similar to what we thought before

Use one long Array to store , Initial length 64,set Move right first when it's worth 6 position （ Equivalent to divided by 64） Calculate where in the array , Then change the status bit

It doesn't matter if you don't understand anything else , It's enough to understand these two sentences ：

int wordIndex = wordIndex(bitIndex);
words[wordIndex] |= (1L << bitIndex);

Bloom Filters Bloom filter It's a data structure , It can be used to determine whether an element is in a collection , With fast operation , Small memory footprint .

The cost of efficient insertion and query is ,Bloom Filter It's a probability based data structure ： It can only tell us that an element is absolutely not or may be in a collection .

Bloom filter The basic data structure of is a Bit vector （ It can be understood as an array ）.

It is mainly used in large-scale data without precise filtering , Such as checking spam address , Reptiles URL Address to heavy , Solve the problem of cache penetration

If you want to determine whether an element is in a set , The general idea is to save all the elements in the collection , And then through comparison to determine . Linked list 、 Trees 、 Hash table （ Hashtable ） And so on, the data structure is this way of thinking , But as the number of elements in the collection increases , Need more and more storage space ; At the same time, the retrieval speed is getting slower and slower , The complexity of retrieval time is O(n)、O(log n)、O(1).

The principle of Bloom filter is , When an element is added to a collection , adopt K Hash functions map this element to a set of bits （Bit array） Medium K A little bit , Set them as 1 . Search time , Just to see if all these points are 1 To know if the element is in the collection ; If any of these points 0, The inspected element must not be in ; If it's all 1, Then the inspected element is likely to be （ Reason why “ Probably ” It's the error ）.

BloomFilter technological process

1、 The first thing you need to k individual hash function , Every function can put key Hash to 1 It's an integer ;

2、 On initialization , Need a length of n Array of bits , Each bit is initialized to 0;

3、 Some key When you join the assembly , use k individual hash Function calculated k Hash values , And the corresponding bit position in the array is 1;

4、 Judge a certain key Is it in the assembly , use k individual hash Function calculated k Hash values , And query the corresponding bits in the array , If all bits are 1, Think in a collection . <dependency>
<artifactId>guava</artifactId>
<version>28.1-jre</version>
</dependency>

source | https://www.cnblogs.com/cjsblog/p/11613708.html Want to know more ？ sweep Trace the QR code below and follow me The background to reply " technology ", Join the technology group

【 Highlights 】

• Starbucks doesn't use two-phase commit

Point a praise + Looking at , Less bug ????

https://chowdera.com/2020/12/20201206070037504h.html