/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.hbase.util;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.nio.ByteBuff;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.yetus.audience.InterfaceAudience;

/**
 * Implements a <i>Bloom filter</i>, as defined by Bloom in 1970.
 * <p>
 * The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the
 * networking research community in the past decade thanks to the bandwidth efficiencies that it
 * offers for the transmission of set membership information between networked hosts. A sender
 * encodes the information into a bit vector, the Bloom filter, that is more compact than a
 * conventional representation. Computation and space costs for construction are linear in the
 * number of elements. The receiver uses the filter to test whether various elements are members of
 * the set. Though the filter will occasionally return a false positive, it will never return a
 * false negative.
 * When creating the filter, the sender can choose its desired point in a trade-off between the
 * false positive rate and the size.
 * <p>
 * Originally inspired by <a href="http://www.one-lab.org/">European Commission One-Lab Project
 * 034819</a>. Bloom filters are very sensitive to the number of elements inserted into them. For
 * HBase, the number of entries depends on the size of the data stored in the column. Currently the
 * default region size is 256MB, so entry count ~= 256MB / (average value size for column). Despite
 * this rule of thumb, there is no efficient way to calculate the entry count after compactions.
 * Therefore, it is often easier to use a dynamic bloom filter that will add extra space instead of
 * allowing the error rate to grow.
 * (http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf)
 * <ul>
 * <li>m denotes the number of bits in the Bloom filter (bitSize)</li>
 * <li>n denotes the number of elements inserted into the Bloom filter (maxKeys)</li>
 * <li>k represents the number of hash functions used (nbHash)</li>
 * <li>e represents the desired false positive rate for the bloom (err)</li>
 * </ul>
 * If we fix the error rate (e) and know the number of entries, then the optimal bloom size is
 * m = -(n * ln(err)) / (ln(2)^2) ~= n * ln(err) / ln(0.6185). The probability of false positives
 * is minimized when k = (m / n) * ln(2).
 * @see BloomFilter The general behavior of a filter
 * @see <a href="http://portal.acm.org/citation.cfm?id=362692&dl=ACM&coll=portal"> Space/Time
 *      Trade-Offs in Hash Coding with Allowable Errors</a>
 * @see BloomFilterWriter for the ability to add elements to a Bloom filter
 */
@InterfaceAudience.Private
public interface BloomFilter extends BloomFilterBase {

  /**
   * Check if the specified key is contained in the bloom filter.
   * @param keyCell the key to check for the existence of
   * @param bloom   bloom filter data to search. This can be null if auto-loading is supported.
   * @param type    The type of Bloom ROW/ROW_COL
   * @return true if matched by bloom, false if not
   */
  boolean contains(Cell keyCell, ByteBuff bloom, BloomType type);

  /**
   * Check if the specified key is contained in the bloom filter.
   * @param buf    data to check for existence of
   * @param offset offset into the data
   * @param length length of the data
   * @param bloom  bloom filter data to search. This can be null if auto-loading is supported.
   * @return true if matched by bloom, false if not
   */
  boolean contains(byte[] buf, int offset, int length, ByteBuff bloom);

  /**
   * @return true if this Bloom filter can automatically load its data and thus allows a null byte
   *         buffer to be passed to contains()
   */
  boolean supportsAutoLoading();
}
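
// ---------------------------------------------------------------------------------------------
// Illustrative sketch only; NOT part of the HBase API. The class name BloomSizingSketch and its
// method names are hypothetical and were chosen for this example. It works through the sizing
// formulas quoted in the BloomFilter class comment above, m = -(n * ln(err)) / (ln(2)^2) and
// k = (m / n) * ln(2), using only java.lang.Math. Rounding both results up is an assumption of
// this sketch and may differ from what HBase does internally.
//
// Under those assumptions, maxKeys = 1000 and err = 0.01 give 9586 bits and 7 hash functions,
// i.e. a little under 10 bits per key.
// ---------------------------------------------------------------------------------------------
final class BloomSizingSketch {

  private BloomSizingSketch() {
    // static helpers only
  }

  /** Bit count m = -(n * ln(err)) / (ln(2)^2), rounded up to a whole bit. */
  static long optimalBits(long maxKeys, double err) {
    double ln2 = Math.log(2);
    return (long) Math.ceil(-maxKeys * Math.log(err) / (ln2 * ln2));
  }

  /** Hash function count k = (m / n) * ln(2), rounded up and never below 1. */
  static int optimalHashCount(long bitSize, long maxKeys) {
    return Math.max(1, (int) Math.ceil(Math.log(2) * bitSize / maxKeys));
  }

  /** Standard false positive estimate (1 - e^(-k * n / m))^k for a filter holding n keys. */
  static double expectedErrorRate(long bitSize, long maxKeys, int nbHash) {
    return Math.pow(1 - Math.exp(-(double) nbHash * maxKeys / bitSize), nbHash);
  }
}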