All Superinterfaces:: BloomFilterBase

All Known Implementing Classes:: CompoundBloomFilter

@Private public interface BloomFilter extends BloomFilterBase

Implements a Bloom filter, as defined by Bloom in 1970.

The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.

Originally inspired by European Commission One-Lab Project 034819. Bloom filters are very sensitive to the number of elements inserted into them. For HBase, the number of entries depends on the size of the data stored in the column. Currently the default region size is 256MB, so entry count ~= 256MB / (average value size for column). Despite this rule of thumb, there is no efficient way to calculate the entry count after compactions. Therefore, it is often easier to use a dynamic bloom filter that will add extra space instead of allowing the error rate to grow. ( http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey .pdf ) m denotes the number of bits in the Bloom filter (bitSize) n denotes the number of elements inserted into the Bloom filter (maxKeys) k represents the number of hash functions used (nbHash) e represents the desired false positive rate for the bloom (err) If we fix the error rate (e) and know the number of entries, then the optimal bloom size m = -(n * ln(err) / (ln(2)^2) ~= ln(err) / ln(0.6185) The probability of false positives is minimized when k = m/n ln(2).

See Also:

Method Summary

Modifier and Type

Method

Description

boolean

contains(byte[] buf, int offset, int length, ByteBuff bloom)

Check if the specified key is contained in the bloom filter.

boolean

contains(Cell keyCell, ByteBuff bloom, BloomType type)

Check if the specified key is contained in the bloom filter.

boolean

supportsAutoLoading()

Methods inherited from interface org.apache.hadoop.hbase.util.BloomFilterBase
getByteSize, getKeyCount, getMaxKeys

Method Details
- contains
  
  boolean contains(Cell keyCell, ByteBuff bloom, BloomType type)
  
  Check if the specified key is contained in the bloom filter.
  
  Parameters:
  
  keyCell - the key to check for the existence of
  
  bloom - bloom filter data to search. This can be null if auto-loading is supported.
  
  type - The type of Bloom ROW/ ROW_COL
  
  Returns:
  
  true if matched by bloom, false if not
- contains
  
  boolean contains(byte[] buf, int offset, int length, ByteBuff bloom)
  
  Check if the specified key is contained in the bloom filter.
  
  Parameters:
  
  buf - data to check for existence of
  
  offset - offset into the data
  
  length - length of the data
  
  bloom - bloom filter data to search. This can be null if auto-loading is supported.
  
  Returns:
  
  true if matched by bloom, false if not
- supportsAutoLoading
  
  boolean supportsAutoLoading()
  
  Returns:
  
  true if this Bloom filter can automatically load its data and thus allows a null byte buffer to be passed to contains()

Interface BloomFilter

Method Summary

Methods inherited from interface org.apache.hadoop.hbase.util.BloomFilterBase

Method Details

contains

contains

supportsAutoLoading