ByteBloomFilter (Apache HBase 1.4.11 API)

java.lang.Object
- org.apache.hadoop.hbase.util.ByteBloomFilter

All Implemented Interfaces:

BloomFilter, BloomFilterBase, BloomFilterWriter
```
@InterfaceAudience.Private
public class ByteBloomFilter
extends Object
implements BloomFilter, BloomFilterWriter
```
Implements a Bloom filter, as defined by Bloom in 1970.
The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
Originally inspired by European Commission One-Lab Project 034819.
Bloom filters are very sensitive to the number of elements inserted into them. For HBase, the number of entries depends on the size of the data stored in the column. Currently the default region size is 256MB, so entry count ~= 256MB / (average value size for column). Despite this rule of thumb, there is no efficient way to calculate the entry count after compactions. Therefore, it is often easier to use a dynamic bloom filter that will add extra space instead of allowing the error rate to grow. (http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf)
- m denotes the number of bits in the Bloom filter (bitSize)
- n denotes the number of elements inserted into the Bloom filter (maxKeys)
- k represents the number of hash functions used (nbHash)
- e represents the desired false positive rate for the bloom (err)

If we fix the error rate (e) and know the number of entries, then the optimal bloom size is `m = -(n * ln(err)) / (ln(2))^2 ~= n * ln(err) / ln(0.6185)`.
The probability of false positives is minimized when `k = (m/n) * ln(2)`.
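
The two sizing formulas above can be tried out with a small standalone sketch. The class and method names below (`BloomSizing`, `optimalBits`, `optimalHashCount`) are hypothetical and are not part of the HBase API; rounding both results up is also an assumption made here for illustration.

```java
final class BloomSizing {
    private BloomSizing() {}

    /** Optimal bit count: m = -(n * ln(err)) / (ln 2)^2, rounded up to a whole bit. */
    static long optimalBits(long n, double err) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(err) / (ln2 * ln2));
    }

    /** Optimal hash count: k = (m / n) * ln 2, rounded up to a whole hash function. */
    static int optimalHashCount(long m, long n) {
        return (int) Math.ceil((double) m / n * Math.log(2));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;  // expected number of keys
        double err = 0.01;    // target false positive rate
        long m = optimalBits(n, err);
        int k = optimalHashCount(m, n);
        System.out.println("bits=" + m + " hashes=" + k);
    }
}
```

For one million keys at a 1% target rate this works out to roughly 9.6 bits per key and 7 hash functions.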

See Also:
The general behavior of a filter, Space/Time Trade-Offs in Hash Coding with Allowable Errors

Field Summary

Fields
Modifier and Type	Field and Description
`protected ByteBuffer`	`bloom` Bloom bits
`protected long`	`byteSize` Bytes (B) in the array.
`protected Hash`	`hash` Hash Function
`protected int`	`hashCount` Number of hash functions
`protected int`	`hashType` Hash type
`protected int`	`keyCount` Keys currently in the bloom
`static double`	`LOG2_SQUARED` Used in computing the optimal Bloom filter size.
`protected int`	`maxKeys` Max Keys expected for the bloom
`static String`	`STATS_RECORD_SEP` Record separator for the Bloom filter statistics human-readable string
`static int`	`VERSION` Current file format version

Constructor Summary

Constructors
Constructor and Description
`ByteBloomFilter(DataInput meta)` Loads bloom filter meta data from file input.
`ByteBloomFilter(int maxKeys, double errorRate, int hashType, int foldFactor)` Determines & initializes bloom filter meta data from user config.

Method Summary

Methods
Modifier and Type	Method and Description
`double`	`actualErrorRate()` Computes the error rate for this Bloom filter, taking into account the actual number of hash functions and keys inserted.
`static double`	`actualErrorRate(long maxKeys, long bitSize, int functionCount)` Computes the actual error rate for the given number of elements, number of bits, and number of hash functions.
`void`	`add(byte[] buf)`
`void`	`add(byte[] buf, int offset, int len)` Add the specified binary to the bloom filter.
`void`	`allocBloom()` Allocate memory for the bloom filter data.
`void`	`compactBloom()` Compact the Bloom filter before writing metadata & data to disk.
`static long`	`computeBitSize(long maxKeys, double errorRate)`
`static int`	`computeFoldableByteSize(long bitSize, int foldFactor)` Increases the given byte size of a Bloom filter until it can be folded by the given factor.
`static long`	`computeMaxKeys(long bitSize, double errorRate, int hashCount)` The maximum number of keys we can put into a Bloom filter of a certain size to get the given error rate, with the given number of hash functions.
`boolean`	`contains(byte[] buf, int offset, int length, ByteBuffer theBloom)` Check if the specified key is contained in the bloom filter.
`static boolean`	`contains(byte[] buf, int offset, int length, ByteBuffer bloomBuf, int bloomOffset, int bloomSize, Hash hash, int hashCount)`
`ByteBloomFilter`	`createAnother()` Creates another similar Bloom filter.
`byte[]`	`createBloomKey(byte[] rowBuf, int rowOffset, int rowLen, byte[] qualBuf, int qualOffset, int qualLen)` Create a key for a row-column Bloom filter.
`static ByteBloomFilter`	`createBySize(int byteSizeHint, double errorRate, int hashType, int foldFactor)` Creates a Bloom filter of the given size.
`static String`	`formatStats(BloomFilterBase bloomFilter)` A human-readable string with statistics for the given Bloom filter.
`long`	`getByteSize()`
`KeyValue.KVComparator`	`getComparator()`
`org.apache.hadoop.io.Writable`	`getDataWriter()` Get a writable interface into bloom filter data (the actual Bloom bits).
`int`	`getHashCount()`
`int`	`getHashType()`
`long`	`getKeyCount()`
`long`	`getMaxKeys()`
`org.apache.hadoop.io.Writable`	`getMetaWriter()` Get a writable interface into bloom filter meta data.
`static long`	`idealMaxKeys(long bitSize, double errorRate)` The maximum number of keys we can put into a Bloom filter of a certain size to maintain the given error rate, assuming the number of hash functions is chosen optimally and does not even have to be an integer (hence the "ideal" in the function name).
`static void`	`setRandomGeneratorForTest(Random random)` Sets a random generator to be used for look-ups instead of computing hashes.
`boolean`	`supportsAutoLoading()`
`String`	`toString()`
`void`	`writeBloom(DataOutput out)` Writes just the bloom filter to the output array

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Field Detail
  - VERSION
```
public static final int VERSION
```
    Current file format version
    
    See Also:
    Constant Field Values
  - byteSize
```
protected long byteSize
```
    Bytes (B) in the array. This actually has to fit into an int.
  - hashCount
```
protected int hashCount
```
    Number of hash functions
  - hashType
```
protected final int hashType
```
    Hash type
  - hash
```
protected final Hash hash
```
    Hash Function
  - keyCount
```
protected int keyCount
```
    Keys currently in the bloom
  - maxKeys
```
protected int maxKeys
```
    Max Keys expected for the bloom
  - bloom
```
protected ByteBuffer bloom
```
    Bloom bits
  - STATS_RECORD_SEP
```
public static final String STATS_RECORD_SEP
```
    Record separator for the Bloom filter statistics human-readable string
    
    See Also:
    Constant Field Values
  - LOG2_SQUARED
```
public static final double LOG2_SQUARED
```
    Used in computing the optimal Bloom filter size. This approximately equals 0.480453.
- Constructor Detail
  - ByteBloomFilter
```
public ByteBloomFilter(DataInput meta)
                throws IOException,
                       IllegalArgumentException
```
    Loads bloom filter meta data from file input.
    
    Parameters:
    meta - stored bloom meta data
    
    Throws:
    
    IllegalArgumentException - meta data is invalid
    
    IOException
  - ByteBloomFilter
```
public ByteBloomFilter(int maxKeys,
               double errorRate,
               int hashType,
               int foldFactor)
                throws IllegalArgumentException
```
    Determines & initializes bloom filter meta data from user config. Call allocBloom() to allocate bloom filter data.
    
    Parameters:
    maxKeys - Maximum expected number of keys that will be stored in this bloom
    errorRate - Desired false positive error rate. Lower rate = more storage required
    hashType - Type of hash function to use
    foldFactor - When finished adding entries, you may be able to 'fold' this bloom to save space. Trades potentially excess bytes in the bloom for the ability to fold if keyCount turns out to be exponentially smaller than maxKeys.
    
    Throws:
    
    IllegalArgumentException
- Method Detail
  - computeBitSize
```
public static long computeBitSize(long maxKeys,
                  double errorRate)
```
    Parameters:
    maxKeys -
    errorRate -
    
    Returns:
    the number of bits for a Bloom filter that can hold the given number of keys and provide the given error rate, assuming that the optimal number of hash functions is used and it does not have to be an integer.
  - idealMaxKeys
```
public static long idealMaxKeys(long bitSize,
                double errorRate)
```
    The maximum number of keys we can put into a Bloom filter of a certain size to maintain the given error rate, assuming the number of hash functions is chosen optimally and does not even have to be an integer (hence the "ideal" in the function name).
    
    Parameters:
    bitSize -
    errorRate -
    
    Returns:
    maximum number of keys that can be inserted into the Bloom filter
    See Also:
    computeMaxKeys(long, double, int) for a more precise estimate
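
    The "ideal" estimate is the sizing formula from the class description solved for n: n = m * (ln 2)^2 / -ln(err). A minimal sketch under that assumption follows; the class name `IdealKeys` is hypothetical, and truncating the result is a choice made here so the estimate never overshoots the target rate.

```java
final class IdealKeys {
    /** (ln 2)^2, approximately 0.480453. */
    static final double LOG2_SQUARED = Math.log(2) * Math.log(2);

    private IdealKeys() {}

    /** Keys that fit in bitSize bits at the given rate, assuming a non-integer optimal k. */
    static long idealMaxKeys(long bitSize, double errorRate) {
        // Math.log(errorRate) is negative for rates below 1, so negate it.
        // The cast truncates, which keeps the estimate on the conservative side.
        return (long) (bitSize * (LOG2_SQUARED / -Math.log(errorRate)));
    }

    public static void main(String[] args) {
        long bitSize = 8L * 1024 * 1024;  // a 1 MiB bit array
        System.out.println(idealMaxKeys(bitSize, 0.01));
    }
}
```

    With a 1 MiB bit array and a 1% target rate, the estimate is roughly 875,000 keys.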
  - computeMaxKeys
```
public static long computeMaxKeys(long bitSize,
                  double errorRate,
                  int hashCount)
```
    The maximum number of keys we can put into a Bloom filter of a certain size to get the given error rate, with the given number of hash functions.
    
    Parameters:
    bitSize -
    errorRate -
    hashCount -
    
    Returns:
    the maximum number of keys that can be inserted in a Bloom filter to maintain the target error rate, if the number of hash functions is provided.
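
    When the hash count k is fixed, the standard false positive approximation p = (1 - e^(-k*n/m))^k can be solved for n, giving n = -(m/k) * ln(1 - p^(1/k)). The sketch below illustrates that derivation; `MaxKeysForHashCount` and `maxKeys` are hypothetical names, not the HBase implementation.

```java
final class MaxKeysForHashCount {
    private MaxKeysForHashCount() {}

    /** Solves p = (1 - e^(-k*n/m))^k for n, given m bits, target rate p and k hashes. */
    static long maxKeys(long bitSize, double errorRate, int hashCount) {
        // p^(1/k) is the fraction of set bits at which the target rate is reached.
        double root = Math.pow(errorRate, 1.0 / hashCount);
        // The cast truncates, so the result does not exceed the target rate.
        return (long) (-1.0 * bitSize / hashCount * Math.log(1.0 - root));
    }

    public static void main(String[] args) {
        long bitSize = 8L * 1024 * 1024;
        System.out.println("k=7: " + maxKeys(bitSize, 0.01, 7));
        System.out.println("k=3: " + maxKeys(bitSize, 0.01, 3));
    }
}
```

    A hash count far from the optimum lowers the capacity: for the same bit array and target rate, k = 3 admits noticeably fewer keys than k = 7.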
  - actualErrorRate
```
public double actualErrorRate()
```
    Computes the error rate for this Bloom filter, taking into account the actual number of hash functions and keys inserted. The return value of this function changes as a Bloom filter is being populated. Used for reporting the actual error rate of compound Bloom filters when writing them out.
    
    Returns:
    error rate for this particular Bloom filter
  - actualErrorRate
```
public static double actualErrorRate(long maxKeys,
                     long bitSize,
                     int functionCount)
```
    Computes the actual error rate for the given number of elements, number of bits, and number of hash functions. Taken directly from the Wikipedia Bloom filter article.
    
    Parameters:
    maxKeys -
    bitSize -
    functionCount -
    
    Returns:
    the actual error rate
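
    The usual approximation for the false positive rate of a Bloom filter with n keys, m bits, and k hash functions is p = (1 - e^(-k*n/m))^k. The sketch below evaluates it for a growing key count; `BloomErrorRate` and `falsePositiveRate` are hypothetical names, and the sketch assumes this is the formula meant by the reference to the Wikipedia article.

```java
final class BloomErrorRate {
    private BloomErrorRate() {}

    /** Approximate false positive rate: (1 - e^(-k*n/m))^k. */
    static double falsePositiveRate(long keys, long bitSize, int hashCount) {
        // Expected fraction of bits set after inserting the given number of keys.
        double fractionSet = 1.0 - Math.exp(-1.0 * hashCount * keys / bitSize);
        return Math.pow(fractionSet, hashCount);
    }

    public static void main(String[] args) {
        long bitSize = 9_585_059L;  // sized for about one million keys at 1%
        for (long keys = 250_000L; keys <= 2_000_000L; keys *= 2) {
            System.out.println(keys + " keys -> " + falsePositiveRate(keys, bitSize, 7));
        }
    }
}
```

    The rate starts at zero for an empty filter and rises as keys are added, which is why the instance method's return value changes while a filter is being populated.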
  - computeFoldableByteSize
```
public static int computeFoldableByteSize(long bitSize,
                          int foldFactor)
```
    Increases the given byte size of a Bloom filter until it can be folded by the given factor.
    
    Parameters:
    bitSize -
    foldFactor -
    
    Returns:
    Foldable byte size
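
    One way to read "can be folded by the given factor" is that the byte size must be a multiple of 2^foldFactor, so that it can be halved foldFactor times. The sketch below rounds up under that assumption; `FoldableSize` and `foldableByteSize` are hypothetical names.

```java
final class FoldableSize {
    private FoldableSize() {}

    /** Rounds a bit count up to a byte count that is a multiple of 2^foldFactor. */
    static int foldableByteSize(long bitSize, int foldFactor) {
        long bytes = (bitSize + 7) / 8;  // round bits up to whole bytes
        long unit = 1L << foldFactor;    // required granularity in bytes
        long rounded = (bytes + unit - 1) / unit * unit;
        if (rounded > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("byte size " + rounded + " does not fit in an int");
        }
        return (int) rounded;
    }

    public static void main(String[] args) {
        // 1000 bits need 125 bytes; rounding to a multiple of 2^3 gives 128.
        System.out.println(foldableByteSize(1000, 3));
    }
}
```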
  - createBySize
```
public static ByteBloomFilter createBySize(int byteSizeHint,
                           double errorRate,
                           int hashType,
                           int foldFactor)
```
    Creates a Bloom filter of the given size.
    
    Parameters:
    byteSizeHint - the desired number of bytes for the Bloom filter bit array. Will be increased so that folding is possible.
    errorRate - target false positive rate of the Bloom filter
    hashType - Bloom filter hash function type
    foldFactor -
    
    Returns:
    the new Bloom filter of the desired size
  - createAnother
```
public ByteBloomFilter createAnother()
```
    Creates another similar Bloom filter. Does not copy the actual bits, and sets the new filter's key count to zero.
    
    Returns:
    a Bloom filter with the same configuration as this
  - allocBloom
```
public void allocBloom()
```
    Description copied from interface: BloomFilterWriter
    
    Allocate memory for the bloom filter data.
    
    Specified by:
    
    allocBloom in interface BloomFilterWriter
  - add
```
public void add(byte[] buf)
```
  - add
```
public void add(byte[] buf,
       int offset,
       int len)
```
    Description copied from interface: BloomFilterWriter
    
    Add the specified binary to the bloom filter.
    
    Specified by:
    
    add in interface BloomFilterWriter
    
    Parameters:
    buf - data to be added to the bloom
    offset - offset into the data to be added
    len - length of the data to be added
  - contains
```
public boolean contains(byte[] buf,
               int offset,
               int length,
               ByteBuffer theBloom)
```
    Description copied from interface: BloomFilter
    
    Check if the specified key is contained in the bloom filter.
    
    Specified by:
    
    contains in interface BloomFilter
    
    Parameters:
    buf - data to check for existence of
    offset - offset into the data
    length - length of the data
    theBloom - bloom filter data to search. This can be null if auto-loading is supported.
    
    Returns:
    true if matched by bloom, false if not
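
    To make the add/contains contract concrete, here is a minimal standalone Bloom filter. It is a sketch, not the HBase implementation: `TinyBloomFilter`, its FNV-1a-style hash, and the double-hashing scheme (bit index derived from h1 + i * h2) are all assumptions chosen for illustration.

```java
final class TinyBloomFilter {
    private final byte[] bits;
    private final int hashCount;

    TinyBloomFilter(int byteSize, int hashCount) {
        if (byteSize <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("byteSize and hashCount must be positive");
        }
        this.bits = new byte[byteSize];
        this.hashCount = hashCount;
    }

    /** FNV-1a-style hash over buf[offset, offset + len), mixed with a seed. */
    private static int fnv1a(byte[] buf, int offset, int len, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (int i = offset; i < offset + len; i++) {
            h ^= buf[i] & 0xFF;
            h *= 0x01000193;
        }
        return h;
    }

    /** Double hashing: the i-th bit index comes from h1 + i * h2. */
    private long bitIndex(int h1, int h2, int i) {
        return Math.floorMod((long) h1 + (long) i * h2, bits.length * 8L);
    }

    void add(byte[] buf, int offset, int len) {
        int h1 = fnv1a(buf, offset, len, 0);
        int h2 = fnv1a(buf, offset, len, h1);
        for (int i = 0; i < hashCount; i++) {
            long idx = bitIndex(h1, h2, i);
            bits[(int) (idx / 8)] |= (byte) (1 << (int) (idx % 8));
        }
    }

    /** Returns false only if the key was definitely never added. */
    boolean contains(byte[] buf, int offset, int len) {
        int h1 = fnv1a(buf, offset, len, 0);
        int h2 = fnv1a(buf, offset, len, h1);
        for (int i = 0; i < hashCount; i++) {
            long idx = bitIndex(h1, h2, i);
            if ((bits[(int) (idx / 8)] & (1 << (int) (idx % 8))) == 0) {
                return false;
            }
        }
        return true;
    }
}
```

    A key that was added always tests positive, since all of its bits were set. A key that was never added usually tests negative but may occasionally collide with bits set by other keys.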
  - contains
```
public static boolean contains(byte[] buf,
               int offset,
               int length,
               ByteBuffer bloomBuf,
               int bloomOffset,
               int bloomSize,
               Hash hash,
               int hashCount)
```
  - getKeyCount
```
public long getKeyCount()
```
    Specified by:
    
    getKeyCount in interface BloomFilterBase
    
    Returns:
    The number of keys added to the bloom
  - getMaxKeys
```
public long getMaxKeys()
```
    Specified by:
    
    getMaxKeys in interface BloomFilterBase
    
    Returns:
    The max number of keys that can be inserted to maintain the desired error rate
  - getByteSize
```
public long getByteSize()
```
    Specified by:
    
    getByteSize in interface BloomFilterBase
    
    Returns:
    Size of the bloom, in bytes
  - getHashType
```
public int getHashType()
```
  - compactBloom
```
public void compactBloom()
```
    Description copied from interface: BloomFilterWriter
    
    Compact the Bloom filter before writing metadata & data to disk.
    
    Specified by:
    
    compactBloom in interface BloomFilterWriter
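
    A common way to shrink an under-filled Bloom filter is folding: OR the upper half of the bit array into the lower half. This loses no keys when bit positions are computed as a hash modulo the bit count and the byte size is even. The sketch below shows a single fold; `BloomFolding` and `foldOnce` are hypothetical names, and whether compactBloom() works exactly this way is not stated on this page.

```java
final class BloomFolding {
    private BloomFolding() {}

    /** Halves the bit array by OR-ing the upper half into the lower half. */
    static byte[] foldOnce(byte[] bits) {
        if (bits.length == 0 || bits.length % 2 != 0) {
            throw new IllegalArgumentException("byte size must be even and non-zero");
        }
        int half = bits.length / 2;
        byte[] folded = new byte[half];
        for (int i = 0; i < half; i++) {
            // A bit at position p in the upper half lands at p mod (new bit count).
            folded[i] = (byte) (bits[i] | bits[i + half]);
        }
        return folded;
    }

    public static void main(String[] args) {
        byte[] folded = foldOnce(new byte[] {0x01, 0x00, 0x10, (byte) 0x80});
        System.out.println(java.util.Arrays.toString(folded));
    }
}
```

    Each fold halves the storage but also halves the number of keys the filter can hold at the target error rate, so it only pays off when far fewer keys were added than expected.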
  - writeBloom
```
public void writeBloom(DataOutput out)
                throws IOException
```
    Writes just the bloom filter to the output array
    
    Parameters:
    out - OutputStream to place bloom
    
    Throws:
    
    IOException - Error writing bloom array
  - getMetaWriter
```
public org.apache.hadoop.io.Writable getMetaWriter()
```
    Description copied from interface: BloomFilterWriter
    
    Get a writable interface into bloom filter meta data.
    
    Specified by:
    
    getMetaWriter in interface BloomFilterWriter
    
    Returns:
    a writable instance that can be later written to a stream
  - getDataWriter
```
public org.apache.hadoop.io.Writable getDataWriter()
```
    Description copied from interface: BloomFilterWriter
    
    Get a writable interface into bloom filter data (the actual Bloom bits). Not used for compound Bloom filters.
    
    Specified by:
    
    getDataWriter in interface BloomFilterWriter
    
    Returns:
    a writable instance that can be later written to a stream
  - getHashCount
```
public int getHashCount()
```
  - supportsAutoLoading
```
public boolean supportsAutoLoading()
```
    Specified by:
    
    supportsAutoLoading in interface BloomFilter
    
    Returns:
    true if this Bloom filter can automatically load its data and thus allows a null byte buffer to be passed to contains()
  - setRandomGeneratorForTest
```
public static void setRandomGeneratorForTest(Random random)
```
    Sets a random generator to be used for look-ups instead of computing hashes. Can be used to simulate uniformity of accesses better in a test environment. Should not be set in a real environment where correctness matters!
    This gets used in contains(byte[], int, int, ByteBuffer, int, int, Hash, int)
    
    Parameters:
    random - The random number source to use, or null to compute actual hashes
  - createBloomKey
```
public byte[] createBloomKey(byte[] rowBuf,
                    int rowOffset,
                    int rowLen,
                    byte[] qualBuf,
                    int qualOffset,
                    int qualLen)
```
    Create a key for a row-column Bloom filter. Just concatenate row and column by default. May return the original row buffer if the column qualifier is empty.
    
    Specified by:
    
    createBloomKey in interface BloomFilterBase
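
    The concatenation described above can be sketched as follows. `BloomKeys` and `rowColKey` are hypothetical names; the exact condition under which the original row buffer is returned is an assumption (here: an empty qualifier and a row that spans the whole buffer).

```java
final class BloomKeys {
    private BloomKeys() {}

    /** Builds a row-column key by appending the qualifier bytes to the row bytes. */
    static byte[] rowColKey(byte[] rowBuf, int rowOffset, int rowLen,
                            byte[] qualBuf, int qualOffset, int qualLen) {
        // Shortcut: with no qualifier and a row covering the whole buffer, reuse it.
        if (qualLen <= 0 && rowOffset == 0 && rowLen == rowBuf.length) {
            return rowBuf;
        }
        byte[] key = new byte[rowLen + Math.max(qualLen, 0)];
        System.arraycopy(rowBuf, rowOffset, key, 0, rowLen);
        if (qualLen > 0) {
            System.arraycopy(qualBuf, qualOffset, key, rowLen, qualLen);
        }
        return key;
    }

    public static void main(String[] args) {
        byte[] key = rowColKey(new byte[] {1, 2, 3}, 0, 3, new byte[] {9, 8}, 0, 2);
        System.out.println(java.util.Arrays.toString(key));
    }
}
```

    Because the shortcut returns the caller's own array, the result should be treated as read-only.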
  - getComparator
```
public KeyValue.KVComparator getComparator()
```
    Specified by:
    
    getComparator in interface BloomFilterBase
    
    Returns:
    Bloom key comparator
  - formatStats
```
public static String formatStats(BloomFilterBase bloomFilter)
```
    A human-readable string with statistics for the given Bloom filter.
    
    Parameters:
    bloomFilter - the Bloom filter to output statistics for
    
    Returns:
    a string consisting of "<key>: <value>" parts separated by STATS_RECORD_SEP.
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
