Implementation of the Bloom Filter Data Structure
In this article, I will go through the implementation of the Bloom filter data structure. The Bloom filter is an in-memory data structure used to determine membership in a set. It essentially answers the question: Is this item a member of a given group? A "NO" from the filter is always 100% accurate, while a "YES" may be wrong and needs to be rechecked against the actual data. A wrong "YES" is known as a false positive. You should watch the animated video linked below to gain a solid understanding of what a Bloom filter is and how it works.
https://youtu.be/kfFacplFY4Y?si=DEZ6LcHbdfs6GCBp
In this implementation, we will use the Bloom filter data structure to keep track of a list of strings (spam URLs). You can check the article on implementing a URL shortening service like tinyurl.com, with spam detection using a Bloom filter in the link below.
https://unyimeudohdsa.hashnode.dev/scalable-url-shortening-service
When an item to be tracked is provided, let's say a string, we pass it through a hash function that produces a number within the range of the bit array size (the bit array is fundamentally a record of which items are present). We then use this number as an index into the bit array and set the value at that location to 1, indicating that the string producing this index is present.
Later, when a string is provided for checking, we hash it again and use the result as an index into the bit array. If the value there is 0, the item is definitely not present; if it is 1, the item is considered present, although a different string that hashes to the same index could also have set it. Now, let’s go through each component of the algorithm.
Bit Array
First, we need a data structure to record which positions have been set. In our case, this will be a very large array of long values with all initial values set to 0. Each index in this array is a position that a hash function can map an item to, so it does not stand for one specific item. When an item is added, the positions it hashes to are set to 1; positions that no item has hashed to remain 0.
    private final long[] bitArray;

    public BloomFilter() {
        int SIZE = 958506;
        this.bitArray = new long[SIZE];
        HashHouse.setBitArraySize(SIZE);
    }
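Each position only ever holds 0 or 1, so a long per position is more than it needs. As a rough sketch of an alternative, the class below stores one position per bit using java.util.BitSet from the standard library. The class name BitArraySketch and its methods are my own, hypothetical names and are not part of the implementation in this article.

```java
import java.util.BitSet;

// Hypothetical sketch: a bit-packed stand-in for the long[] used in this article.
class BitArraySketch {
    private final BitSet bits;

    BitArraySketch(int size) {
        this.bits = new BitSet(size); // all bits start cleared (0)
    }

    // Mark a position as set.
    void set(int index) {
        bits.set(index);
    }

    // True if the position was set earlier.
    boolean isSet(int index) {
        return bits.get(index);
    }

    // Payload size in bytes when each position takes a single bit.
    static long packedBytes(int size) {
        return (size + 7L) / 8;
    }

    // Payload size in bytes when each position takes one long (8 bytes).
    static long longArrayBytes(int size) {
        return (long) size * Long.BYTES;
    }

    public static void main(String[] args) {
        int size = 958506;
        BitArraySketch array = new BitArraySketch(size);
        array.set(42);
        System.out.println(array.isSet(42));      // true
        System.out.println(array.isSet(43));      // false
        System.out.println(packedBytes(size));    // 119814
        System.out.println(longArrayBytes(size)); // 7668048
    }
}
```

The two byte counts show the gap between a bit-packed array (about 117 KB) and a long[] of the same length (about 7.3 MB).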
Please note that the number of items that can be tracked by the algorithm is not equal to the size of the array but depends on the acceptable false positive rate. Under the standard approximation p ≈ (1 − e^(−kn/m))^k, which assumes independent and uniformly distributed hash functions, an array of m = 958,506 positions with k = 2 hash functions can track roughly 50,000 items at a false positive rate of 1% (which is quite low). With a false positive rate of 10% (which is unreasonable), it can track roughly 182,000 items, which is still small compared to the array size. This means that to reduce the false positive rate while maintaining the number of items to track, we need to increase the size of the bit array.
At this point, one might ask why we don’t just use a hashmap where the key is the item and the value is either 1 or 0, depending on whether the item is present or not. With this approach, every entry in the hashmap would correspond to a real item.
However, this approach is not as efficient as one might think due to memory usage. An array of 958,506 positions needs only about 117 KB (0.114 MB) of memory when each position is stored as a single bit, for example with java.util.BitSet. The long[] used in this article spends 8 bytes per position, which is about 7.3 MB, so the 117 KB figure applies only to a bit-packed array. At this point, we still have the luxury of increasing the number of items without any increase in memory usage (if we can tolerate a higher false positive rate).
But if we were to use a hashmap to track the same items, it would need to store the actual strings, as well as the additional memory required for the hashmap’s internal structure and overhead. In Java this typically comes to tens of bytes per entry even when each string is only of length 10, so tens of thousands of items take several megabytes.
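To sanity-check capacity figures for a given array size, the standard Bloom filter approximation p ≈ (1 − e^(−kn/m))^k can be evaluated directly. The sketch below does that for the array size used in this article. It assumes independent, uniformly distributed hash functions, which real hash functions only approximate, and the class and method names (BloomMath, falsePositiveRate, capacity) are hypothetical.

```java
// Hypothetical helper: standard Bloom filter estimates, assuming independent,
// uniformly distributed hash functions.
class BloomMath {
    // Expected false positive rate for m positions, k hash functions and n items.
    static double falsePositiveRate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // Approximate number of items that keeps the false positive rate at or below p.
    static long capacity(long m, int k, double p) {
        double n = -((double) m / k) * Math.log(1.0 - Math.pow(p, 1.0 / k));
        return (long) Math.floor(n);
    }

    public static void main(String[] args) {
        long m = 958506;
        System.out.println(capacity(m, 2, 0.01)); // roughly 50,000
        System.out.println(capacity(m, 2, 0.10)); // roughly 182,000
        System.out.println(falsePositiveRate(m, 2, 100_000));
    }
}
```

Treat the results as estimates: the actual rate depends on how well the chosen hash functions spread the inputs.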
Hash Functions
The primary role of the hash function(s) is to convert the data we need to track into a numerical value. This number should always be within the range of the bit array size.
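    import java.util.Arrays;
    import java.util.List;
    import java.util.Objects;
    import java.util.function.ToIntFunction;

    public class HashHouse {
        // Shared by all hash functions; must be set before any of them is used.
        private static int BIT_ARRAY_SIZE;

        public static void setBitArraySize(int size) {
            BIT_ARRAY_SIZE = size;
        }

        private static class HashFunc_1 implements ToIntFunction<String> {
            @Override
            public int applyAsInt(String value) {
                int hash = Objects.hash(value);
                // % can return a negative remainder, so Math.abs keeps the index in range.
                return Math.abs(hash % BIT_ARRAY_SIZE);
            }
        }

        private static class HashFunc_2 implements ToIntFunction<String> {
            @Override
            public int applyAsInt(String value) {
                int hash = Objects.hash(value);
                // Dividing by 31 first usually gives a different index from HashFunc_1.
                return Math.abs((hash / 31) % BIT_ARRAY_SIZE);
            }
        }

        // Returns every hash function so callers can simply loop over them.
        public List<ToIntFunction<String>> build() {
            return Arrays.asList(new HashFunc_1(), new HashFunc_2());
        }
    }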
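Here, I am using two hash functions to reduce false positives. When two different items produce the same index under one hash function, they are likely to produce different indexes under the other because the hashing logic differs. A lookup reports an item as present only when every index is set, so an item that was never added has to collide on both indexes to become a false positive. Note that both functions here start from the same Objects.hash value, so two strings with identical hash codes collide under both; fully independent hash functions avoid this.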
Recall that another method to reduce false positives or collisions is to use a larger bit array. However, it's important to consider the trade-offs: increasing the bit array size increases space complexity, while adding more hash functions increases the work done per operation and sets more positions per item, so the array fills up faster.
Note that these hash functions should be deterministic, meaning they always return the same value for a given input. Additionally, to ensure that the returned number is less than the bit array size, we use the modulus operation. All the functions are returned as a single list so that we can simply loop through them and generate the indexes. Now, let’s look at the main algorithm class.
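One detail worth seeing in isolation is how a hash code becomes a valid index. In Java, % keeps the sign of the left operand, so a negative hash code gives a negative remainder, which is why the hash functions above wrap the result in Math.abs. The sketch below compares that with Math.floorMod from the standard library, which is one possible alternative. The class IndexDemo, its method names and the example URL are hypothetical.

```java
import java.util.Objects;

// Hypothetical demo: two ways to turn a hash code into a valid array index.
class IndexDemo {
    // The approach used in this article: take the remainder, then drop the sign.
    static int absIndex(String value, int size) {
        return Math.abs(Objects.hash(value) % size);
    }

    // Alternative: Math.floorMod returns a value in [0, size) for any size > 0.
    static int floorModIndex(String value, int size) {
        return Math.floorMod(Objects.hash(value), size);
    }

    public static void main(String[] args) {
        int size = 958506;
        String url = "http://spam.example/offer";
        // Deterministic: the same input always yields the same index.
        System.out.println(absIndex(url, size) == absIndex(url, size)); // true
        int index = floorModIndex(url, size);
        System.out.println(index >= 0 && index < size); // true
        System.out.println(-7 % 5);               // -2
        System.out.println(Math.floorMod(-7, 5)); // 3
    }
}
```

Both approaches give an index inside the array, although they can map the same negative hash code to different positions, so a filter should stick to one of them.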
Bloom Filter Class
After gaining a solid understanding of how the algorithm works, its implementation is relatively straightforward.
    import java.util.List;
    import java.util.function.ToIntFunction;

    public class BloomFilter {
        private final long[] bitArray;
        private final List<ToIntFunction<String>> hashHouse = new HashHouse().build();

        public BloomFilter() {
            int SIZE = 958506;
            this.bitArray = new long[SIZE];
            HashHouse.setBitArraySize(SIZE);
        }

        // Set the position chosen by each hash function to 1.
        public void add(String spamURL) {
            for (ToIntFunction<String> function : hashHouse) {
                int hashPosition = function.applyAsInt(spamURL);
                bitArray[hashPosition] = 1;
            }
        }

        // Returns false if the URL was definitely never added, true if it may have been.
        public boolean mightContain(String unIdentifiedURL) {
            // One slot per hash function instead of a hard-coded 2.
            long[] bitPositions = new long[hashHouse.size()];
            int i = 0;
            for (ToIntFunction<String> function : hashHouse) {
                int hashPosition = function.applyAsInt(unIdentifiedURL);
                bitPositions[i] = bitArray[hashPosition];
                i++;
            }
            // Present only if every checked position holds 1.
            for (long bit : bitPositions) {
                if (bit != 1) {
                    return false;
                }
            }
            return true;
        }
    }
First, we initialize the array with the desired size. To add or track the presence of an item, we loop through all the hash functions in the hash function class and generate numbers that serve as indexes. We then set the values at these indexes in the bit array to 1, indicating that the item is now present. A previous value of 0 meant the item was not present.
To check if an item exists, we loop through the hash functions again to generate their respective values. For each value obtained, we check the corresponding position in the bit array and store the result in an array called bitPositions. Since these hash functions are deterministic, we will get the same values we had when adding the item to the list.
Finally, we check if all elements in the bitPositions array are 1. If they are, the item is considered present; if any element is 0, the item is not present.
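To see the add and check steps working together, here is a small self-contained sketch that puts a compact two-hash filter in front of an authoritative store, so that a "maybe" from the filter is rechecked before a URL is treated as spam. Everything in it is illustrative: the class SpamCheckDemo, the boolean[] standing in for the bit array, the HashSet standing in for the real data store, and the spam.example and safe.example URLs are all assumptions of mine.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical demo: a compact two-hash filter in front of an authoritative set.
class SpamCheckDemo {
    private final int size;
    private final boolean[] bits;                           // stand-in bit array
    private final Set<String> spamStore = new HashSet<>();  // stand-in for the real data store

    SpamCheckDemo(int size) {
        this.size = size;
        this.bits = new boolean[size];
    }

    // Same hashing logic as the two hash functions in this article.
    private int h1(String s) { return Math.abs(Objects.hash(s) % size); }
    private int h2(String s) { return Math.abs((Objects.hash(s) / 31) % size); }

    void add(String url) {
        bits[h1(url)] = true;
        bits[h2(url)] = true;
        spamStore.add(url);
    }

    boolean mightContain(String url) {
        return bits[h1(url)] && bits[h2(url)];
    }

    // A "no" from the filter is final; a "yes" is confirmed against the store.
    boolean isSpam(String url) {
        return mightContain(url) && spamStore.contains(url);
    }

    public static void main(String[] args) {
        SpamCheckDemo demo = new SpamCheckDemo(958506);
        for (int i = 0; i < 50_000; i++) {
            demo.add("http://spam.example/" + i);
        }
        System.out.println(demo.isSpam("http://spam.example/123")); // true
        System.out.println(demo.isSpam("http://safe.example/123")); // false

        // Count how often the filter alone says "maybe" for URLs never added.
        int falsePositives = 0;
        for (int i = 0; i < 10_000; i++) {
            if (demo.mightContain("http://safe.example/" + i)) {
                falsePositives++;
            }
        }
        System.out.println("false positives: " + falsePositives);
    }
}
```

The false positive count depends on how the hash functions spread these particular URLs, so I have not listed an expected value for it. The filter is useful here because most lookups for clean URLs stop at the cheap in-memory check and never reach the store.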