See Section 6.3.2, “Try to minimize row and column sizes”. See also Section 18.104.22.168, “However...” for compression caveats.
The regionsize can be set on a per-table basis via HTableDescriptor, in the event that certain tables require a regionsize different from the configured default.
See Section 11.4.1, “Number of Regions” for more information.
Bloom Filters can be enabled per-ColumnFamily. Use HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per ColumnFamily. The default is NONE, for no bloom filters. If ROW, the hash of the row will be added to the bloom on each insert. If ROWCOL, the hash of the row + column family + column qualifier will be added to the bloom on each key insert.
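The difference between the ROW and ROWCOL settings can be illustrated with a toy bloom filter. This is a minimal sketch in plain Java, not HBase's implementation: the class BloomKeySketch, the helper bloomKey, the ToyBloom class, the NUL separator, and the hash mixing are all hypothetical names and choices made for illustration. It only shows which bytes are hashed under each setting, as described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

public class BloomKeySketch {
    // Mirrors the three settings named in the text; not the HBase enum itself.
    enum BloomType { NONE, ROW, ROWCOL }

    // Builds the bytes that would be hashed into the bloom for one cell.
    // Returns null when blooms are disabled (NONE).
    static byte[] bloomKey(BloomType type, String row, String family, String qualifier) {
        switch (type) {
            case ROW:
                // Only the row takes part, so every column of a row shares one key.
                return row.getBytes(StandardCharsets.UTF_8);
            case ROWCOL:
                // Row + column family + qualifier. The NUL separator is an
                // illustrative choice that keeps ("ab","c") distinct from ("a","bc").
                return (row + '\0' + family + '\0' + qualifier).getBytes(StandardCharsets.UTF_8);
            default:
                return null;
        }
    }

    // Toy bloom filter: a fixed number of hash probes into a fixed-size bit set.
    static final class ToyBloom {
        private final BitSet bits;
        private final int size;
        private final int probes;

        ToyBloom(int size, int probes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.probes = probes;
        }

        // Derives the i-th bit position for a key (simple mixing, illustrative only).
        private int index(byte[] key, int i) {
            int h = Arrays.hashCode(key) * 31 + i * 0x9E3779B9;
            return Math.floorMod(h, size);
        }

        void add(byte[] key) {
            for (int i = 0; i < probes; i++) {
                bits.set(index(key, i));
            }
        }

        // false means "definitely absent"; true means "possibly present".
        boolean mightContain(byte[] key) {
            for (int i = 0; i < probes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        ToyBloom bloom = new ToyBloom(1024, 3);
        bloom.add(bloomKey(BloomType.ROWCOL, "row1", "cf", "a"));
        // A key that was added is always reported as possibly present.
        System.out.println(bloom.mightContain(bloomKey(BloomType.ROWCOL, "row1", "cf", "a")));
    }
}
```

Under ROW, a lookup for any column of "row1" hashes to the same key, so the bloom can only rule out whole rows. Under ROWCOL, each row and column combination gets its own key, at the cost of more entries in the bloom.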
The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting indexes should be roughly halved).
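The inverse relationship can be checked with simple arithmetic. The sketch below assumes, as a simplification, one index entry per data block and ignores per-entry key sizes; the class name BlockIndexEstimate, the method indexEntries, and the 1 GiB StoreFile size are hypothetical values chosen for illustration.

```java
public class BlockIndexEstimate {
    // Rough number of block index entries, assuming one entry per data block.
    static long indexEntries(long storeFileBytes, long blockSizeBytes) {
        // Ceiling division: a partially filled last block still needs an entry.
        return (storeFileBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long storeFileBytes = 1L << 30;   // hypothetical 1 GiB StoreFile
        long defaultBlock = 64 * 1024;    // the 64k default
        long doubledBlock = 128 * 1024;   // blocksize doubled

        System.out.println(indexEntries(storeFileBytes, defaultBlock)); // 16384
        System.out.println(indexEntries(storeFileBytes, doubledBlock)); // 8192
    }
}
```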
ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the block cache (see Section 9.6.4, “Block Cache”), but there is no guarantee that the entire table will be in memory.
See HColumnDescriptor for more information.
Production systems should use compression with their ColumnFamily definitions. See Appendix C, Compression In HBase for more information.
Compression deflates data on disk. When it's in-memory (e.g., in the MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated. So while using ColumnFamily compression is a best practice, it's not going to completely eliminate the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names.
See Section 6.3.2, “Try to minimize row and column sizes” for schema design tips, and Section 22.214.171.124, “KeyValue” for more information on how HBase stores data internally.
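To see why compression alone does not remove the cost of long names, it helps to estimate the uncompressed size of a single cell, which is the form it takes in the MemStore and on the wire. The sketch below assumes the classic KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type, with no tags). The class name CellSizeEstimate and the example lengths are hypothetical.

```java
public class CellSizeEstimate {
    // Fixed per-cell overhead in bytes under the assumed layout:
    // key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1).
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    // Uncompressed size of one cell: fixed overhead plus the row, family,
    // qualifier, and value bytes, all of which are repeated in every cell.
    static long cellBytes(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        // Hypothetical schema with verbose names, storing an 8-byte value.
        long verbose = cellBytes(32, 12, 20, 8);
        // Same value with a shorter row key and terse family/qualifier names.
        long compact = cellBytes(16, 1, 2, 8);

        System.out.println(verbose); // 92
        System.out.println(compact); // 47
    }
}
```

In this example the 8-byte value is identical in both cases, yet the verbose schema carries roughly twice as many bytes per cell whenever the data is inflated.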