# Big data algorithm - bloon filter

2020-11-07 20:15:20

Today's article with you to learn a big data field often used algorithm —— The bloon filter . If seen 《 The beauty of Mathematics 》 Our classmates should be familiar with it , It is often used in the judgment of sets , It is used to quickly judge whether an element is in a large set in the massive data scenario . Its principle is not difficult , But the design is very clever , To be honest, I'm looking at 《 The beauty of Mathematics 》 Before , I've never heard of this data structure , So this article is also my own study notes .

## principle

In my previous understanding , If you want to determine whether an element is in a collection , The classic structure should be a balanced tree and hash table. But either way , Can't escape a little , All need to store the original value .

For example, in the reptile scene , We need to record the websites we've crawled before . We're going to store all the previous URLs in containers , And then in the face of a new website to determine whether it has climbed . In this question , We don't care what websites we've crawled before , We only care if the current website has appeared before . That is to say, it doesn't matter what happened before , It's important to see if there's ever been one .

We use balance trees or Trie Or is it AC Automata and other data structures and algorithms can achieve efficient search , But they can't do without storing all the strings . Imagine , A web address is about a hundred characters , about 0.1KB, If it's 100 million websites , Need 10GB 了 , If it's ten billion and one hundred billion ？ Obviously, such a large scale would be troublesome , The Blum filter introduced today can solve this problem , And there's no need to store the original value , This is a very clever way , Let's take a look at how it works .

The structure of the bloon filter itself is very simple , It's a one-dimensional bool Array of type , That is to say, everyone has only 0 perhaps 1, It's a bit, The length of this array is m. For each new item , We use K Different species hash The algorithm calculates hash value . So we can get K individual hash value , We use it hash It's worth it m modulus , The assumption is x. At the beginning, the array was full of 0, We put all x The corresponding position is marked 1.

for instance , Let's say we start with m yes 10,K yes 3. The first value we encounter to insert is ” linear algebra “, We treat it hash And then get 1,3,5, So we mark the corresponding position as 1. Then we came across another value that was ” Advanced mathematics “,hash And then get 1,8,9, We still assign the corresponding position to 1, Will find 1 The value for this position is already 1 了 , We just ignore it . If at this time we want to judge ” probability statistics ” Have you ever seen , What do I do ？ It's simple , We are right. “ probability statistics ” Calculate again hash value . Suppose we get 1,4,5, Let's go through the corresponding positions , Find out 4 This position is 0, The description has not been added before “ probability statistics ”, obviously “ probability statistics ” There is no such thing as .

But if “ probability statistics ”hash The result is 1,3,8 Well ？ We are wrong to judge that it has appeared , The answer is simple , Because though 1,3,8 This hash The combination never appeared before , But the corresponding positions have appeared in other elements , So there's an error . So we can know , The bloon filter is accurate for non-existent , But the judgment of existence may be wrong .

## Code

The principle of the bloon filter is very simple , When I understand , It's easy for us to write code ：

#  Insert elements
def BloomFilter(filter, value, hash_functions):
m = len(filter)
for func in hash_functions:
idx = func(value) % m
filter[idx] = True
return filter

#  Element of judgement
def MemberInFilter(filter, value, hash_functions):
m = len(filter)
for func in hash_functions:
idx = func(value) % m
if not filter[idx]:
return False
return True


## Error rate calculation

The previous examples should show clearly , Although the bloom filter is easy to use , But there will be bad case, That is to say, misjudgment . that , What's the probability of this kind of misjudgment ？

The calculation of this probability is not difficult ： Because the array length is $$m$$, So insert a bit It's set to 1 Is the probability that $$\frac{1}{m}$$, Inserting an element requires inserting k individual hash value , So insert an element , Someone is not set to 1 Is the probability that $$(1-\frac{1}{m})^k$$. Insert n After elements , Someone is still 0 Is the probability that $$(1-\frac{1}{m})^{nk}$$, It becomes 1 Is the probability that $$1-(1-\frac{1}{m})^{nk}$$.

If in some judgment , There's an element that hasn't been seen in the collection , So that means it hash All positions obtained have been previously set to 1 了 , The probability of this time is ：

$\displaystyle\left[1-(1-\frac{1}{m})^{nk}\right]^k \approx (1-e^{-\frac{kn}{m}})^k$

There's a limit here ：

$\displaystyle\lim_{x \to -\infty}(1-\frac{1}{x})^{-x}=e$

Let's find out when the conflict rate is the lowest k The value of , For the sake of calculation , We make $$b=e^{\frac{n}{m}}$$, Plug in ：

$f(k) = (1-b^{-k})^k \\ \ln f(k) = k\ln(1-b^{-k})$

Seek guidance on both sides ：

\begin{aligned} \frac{1}{f(k)}f'(k)&= ln(1-b^{-k}) + \frac{kb^{-k}\ln b}{1-b^{-k}} \end{aligned}

We make the derivative equal to 0, To find its extreme value ：

\begin{aligned} \ln(1-b^{-k})(1-b^{-k})&=-kb^{-k}\ln b\\ \ln(1-b^{-k})(1-b^{-k})&=b^{-k}\ln b^{-k}\\ 1-b^{-k} &=b^{-k}\\ b^{-k} &= \frac{1}{2} \end{aligned}

We will $$b^{-k}=\frac{1}{2}$$ Plug in , When we can find the maximum value $$k=\ln2\cdot\frac{m}{n} \approx 0.7\frac{m}{n}$$

Empathy , We can also preset the set elements n And miscarriage of justice p, To solve the corresponding n, Also use the above formula to calculate , You can get $$m=-\frac{n\ln p}{(\ln2)^2}$$

If we allow some tolerance , And can roughly estimate the number of elements that will appear , So we can use the bloom filter instead of the traditional method of container weight determination . This is not only very efficient , And the storage requirements are very small .

## Soul torture

The principle is clear , The code also understands , Now let's think about a question ： Can the bloom filter delete elements ？

unfortunately , The bloon filter does not support deletion .

Because every one of the bloon filters bit It's not exclusive , It's possible that multiple elements share a certain bit . If we delete this one directly , Will affect other elements .

Use the example above ： We delete linear algebra , The corresponding position of linear algebra is 1,3,5, Although we did not delete Advanced Mathematics , But because we've removed the places that advanced mathematics also uses 1, If we judge the existence of higher mathematics, we will get the wrong result , Although we didn't delete it .

Of course , In some scenarios where deletion is necessary , There is also a way . And it's very simple , Is to modify the data structure , Will be original each one bit Change to a int, When we insert elements , No longer will bit Set to true, But let the corresponding position increase , When it is deleted, it is the corresponding bit minus one . such , We can delete a single result without affecting other elements .

This method is not perfect , Due to the misjudgment of the bloon filter , It's possible that we'll delete values that don't exist , It also affects other elements .

Bloom filter is a data structure with obvious advantages and disadvantages , The advantages are excellent ： Fast enough , Low memory consumption , Simple code implementation . But the disadvantages are obvious ： Deleting elements... Is not supported , There will be miscalculations . This distinctive data structure is really attractive .

Today's article is all about , If you think you've got something , Please pay attention to it , Your help is very important to me . 