@InterfaceAudience.Private public class ByteBloomFilter extends Object implements BloomFilter, BloomFilterWriter
The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
Originally inspired by European Commission One-Lab Project 034819. Bloom filters are very sensitive to the number of elements inserted into them. For HBase, the number of entries depends on the size of the data stored in the column. Currently the default region size is 256MB, so entry count ~= 256MB / (average value size for column). Despite this rule of thumb, there is no efficient way to calculate the entry count after compactions. Therefore, it is often easier to use a dynamic bloom filter that will add extra space instead of allowing the error rate to grow. ( http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey .pdf ) m denotes the number of bits in the Bloom filter (bitSize) n denotes the number of elements inserted into the Bloom filter (maxKeys) k represents the number of hash functions used (nbHash) e represents the desired false positive rate for the bloom (err) If we fix the error rate (e) and know the number of entries, then the optimal bloom size m = -(n * ln(err) / (ln(2)^2) ~= n * ln(err) / ln(0.6185) The probability of false positives is minimized when k = m/n ln(2).
Modifier and Type | Field and Description |
---|---|
protected ByteBuffer |
bloom
Bloom bits
|
protected long |
byteSize
Bytes (B) in the array.
|
protected Hash |
hash
Hash Function
|
protected int |
hashCount
Number of hash functions
|
protected int |
hashType
Hash type
|
protected int |
keyCount
Keys currently in the bloom
|
static double |
LOG2_SQUARED
Used in computing the optimal Bloom filter size.
|
protected int |
maxKeys
Max Keys expected for the bloom
|
static String |
STATS_RECORD_SEP
Record separator for the Bloom filter statistics human-readable string
|
static int |
VERSION
Current file format version
|
Constructor and Description |
---|
ByteBloomFilter(DataInput meta)
Loads bloom filter meta data from file input.
|
ByteBloomFilter(int maxKeys,
double errorRate,
int hashType,
int foldFactor)
Determines & initializes bloom filter meta data from user config.
|
Modifier and Type | Method and Description |
---|---|
double |
actualErrorRate()
Computes the error rate for this Bloom filter, taking into account the
actual number of hash functions and keys inserted.
|
static double |
actualErrorRate(long maxKeys,
long bitSize,
int functionCount)
Computes the actual error rate for the given number of elements, number
of bits, and number of hash functions.
|
void |
add(byte[] buf) |
void |
add(byte[] buf,
int offset,
int len)
Add the specified binary to the bloom filter.
|
void |
allocBloom()
Allocate memory for the bloom filter data.
|
void |
compactBloom()
Compact the Bloom filter before writing metadata & data to disk.
|
static long |
computeBitSize(long maxKeys,
double errorRate) |
static int |
computeFoldableByteSize(long bitSize,
int foldFactor)
Increases the given byte size of a Bloom filter until it can be folded by
the given factor.
|
static long |
computeMaxKeys(long bitSize,
double errorRate,
int hashCount)
The maximum number of keys we can put into a Bloom filter of a certain
size to get the given error rate, with the given number of hash functions.
|
boolean |
contains(byte[] buf,
int offset,
int length,
ByteBuffer theBloom)
Check if the specified key is contained in the bloom filter.
|
static boolean |
contains(byte[] buf,
int offset,
int length,
ByteBuffer bloomBuf,
int bloomOffset,
int bloomSize,
Hash hash,
int hashCount) |
ByteBloomFilter |
createAnother()
Creates another similar Bloom filter.
|
byte[] |
createBloomKey(byte[] rowBuf,
int rowOffset,
int rowLen,
byte[] qualBuf,
int qualOffset,
int qualLen)
Create a key for a row-column Bloom filter.
|
static ByteBloomFilter |
createBySize(int byteSizeHint,
double errorRate,
int hashType,
int foldFactor)
Creates a Bloom filter of the given size.
|
static String |
formatStats(BloomFilterBase bloomFilter)
A human-readable string with statistics for the given Bloom filter.
|
long |
getByteSize() |
KeyValue.KVComparator |
getComparator() |
org.apache.hadoop.io.Writable |
getDataWriter()
Get a writable interface into bloom filter data (the actual Bloom bits).
|
int |
getHashCount() |
int |
getHashType() |
long |
getKeyCount() |
long |
getMaxKeys() |
org.apache.hadoop.io.Writable |
getMetaWriter()
Get a writable interface into bloom filter meta data.
|
static long |
idealMaxKeys(long bitSize,
double errorRate)
The maximum number of keys we can put into a Bloom filter of a certain
size to maintain the given error rate, assuming the number of hash
functions is chosen optimally and does not even have to be an integer
(hence the "ideal" in the function name).
|
static void |
setRandomGeneratorForTest(Random random)
Sets a random generator to be used for look-ups instead of computing hashes.
|
boolean |
supportsAutoLoading() |
String |
toString() |
void |
writeBloom(DataOutput out)
Writes just the bloom filter to the output array
|
public static final int VERSION
protected long byteSize
protected int hashCount
protected final int hashType
protected final Hash hash
protected int keyCount
protected int maxKeys
protected ByteBuffer bloom
public static final String STATS_RECORD_SEP
public static final double LOG2_SQUARED
public ByteBloomFilter(DataInput meta) throws IOException, IllegalArgumentException
meta
- stored bloom meta dataIllegalArgumentException
- meta data is invalidIOException
public ByteBloomFilter(int maxKeys, double errorRate, int hashType, int foldFactor) throws IllegalArgumentException
allocBloom()
to allocate bloom filter data.maxKeys
- Maximum expected number of keys that will be stored in this
bloomerrorRate
- Desired false positive error rate. Lower rate = more
storage requiredhashType
- Type of hash function to usefoldFactor
- When finished adding entries, you may be able to 'fold'
this bloom to save space. Tradeoff potentially excess bytes in
bloom for ability to fold if keyCount is exponentially greater
than maxKeys.IllegalArgumentException
public static long computeBitSize(long maxKeys, double errorRate)
maxKeys
- errorRate
- public static long idealMaxKeys(long bitSize, double errorRate)
bitSize
- errorRate
- for a more precise estimate
public static long computeMaxKeys(long bitSize, double errorRate, int hashCount)
bitSize
- errorRate
- hashCount
- public double actualErrorRate()
public static double actualErrorRate(long maxKeys, long bitSize, int functionCount)
maxKeys
- bitSize
- functionCount
- public static int computeFoldableByteSize(long bitSize, int foldFactor)
bitSize
- foldFactor
- public static ByteBloomFilter createBySize(int byteSizeHint, double errorRate, int hashType, int foldFactor)
byteSizeHint
- the desired number of bytes for the Bloom filter bit
array. Will be increased so that folding is possible.errorRate
- target false positive rate of the Bloom filterhashType
- Bloom filter hash function typefoldFactor
- public ByteBloomFilter createAnother()
public void allocBloom()
BloomFilterWriter
allocBloom
in interface BloomFilterWriter
public void add(byte[] buf)
public void add(byte[] buf, int offset, int len)
BloomFilterWriter
add
in interface BloomFilterWriter
buf
- data to be added to the bloomoffset
- offset into the data to be addedlen
- length of the data to be addedpublic boolean contains(byte[] buf, int offset, int length, ByteBuffer theBloom)
BloomFilter
contains
in interface BloomFilter
buf
- data to check for existence ofoffset
- offset into the datalength
- length of the datatheBloom
- bloom filter data to search. This can be null if auto-loading
is supported.public static boolean contains(byte[] buf, int offset, int length, ByteBuffer bloomBuf, int bloomOffset, int bloomSize, Hash hash, int hashCount)
public long getKeyCount()
getKeyCount
in interface BloomFilterBase
public long getMaxKeys()
getMaxKeys
in interface BloomFilterBase
public long getByteSize()
getByteSize
in interface BloomFilterBase
public int getHashType()
public void compactBloom()
BloomFilterWriter
compactBloom
in interface BloomFilterWriter
public void writeBloom(DataOutput out) throws IOException
out
- OutputStream to place bloomIOException
- Error writing bloom arraypublic org.apache.hadoop.io.Writable getMetaWriter()
BloomFilterWriter
getMetaWriter
in interface BloomFilterWriter
public org.apache.hadoop.io.Writable getDataWriter()
BloomFilterWriter
getDataWriter
in interface BloomFilterWriter
public int getHashCount()
public boolean supportsAutoLoading()
supportsAutoLoading
in interface BloomFilter
public static void setRandomGeneratorForTest(Random random)
This gets used in contains(byte[], int, int, ByteBuffer, int, int, Hash, int)
random
- The random number source to use, or null to compute actual hashespublic byte[] createBloomKey(byte[] rowBuf, int rowOffset, int rowLen, byte[] qualBuf, int qualOffset, int qualLen)
createBloomKey
in interface BloomFilterBase
public KeyValue.KVComparator getComparator()
getComparator
in interface BloomFilterBase
public static String formatStats(BloomFilterBase bloomFilter)
bloomFilter
- the Bloom filter to output statistics for;STATS_RECORD_SEP
.Copyright © 2007–2019 The Apache Software Foundation. All rights reserved.