There are several considerations when planning the capacity for an HBase cluster and performing the initial configuration. Start with a solid understanding of how HBase handles data internally.
Physical data size on disk is distinct from the logical size of your data and is affected by the following:
Increased by HBase overhead
See Section 126.96.36.199, “KeyValue” and Section 6.3.3, “Try to minimize row and column sizes”. Overhead is at least 24 bytes per key-value (cell), and can be more. Small keys/values mean more relative overhead.
KeyValue instances are aggregated into blocks, which are indexed. Indexes also have to be stored. Blocksize is configurable on a per-ColumnFamily basis. See Section 9.7, “Regions”.
Increased by size of region server WAL (usually fixed and negligible - less than half of RS memory size, per RS).
Increased by HDFS replication - usually x3.
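The disk-size factors above can be combined into a back-of-envelope estimate. A minimal sketch follows; the per-cell overhead (at least 24 bytes) and the x3 replication factor come from the text, while the function name and example numbers are illustrative assumptions (indexes, WAL, and compression are ignored here):

```python
# Rough physical-size estimate from the factors above.
# Assumes: >= 24 bytes of KeyValue overhead per cell, HDFS replication x3.
# Ignores block indexes, WAL, and compression for simplicity.

def physical_size_bytes(num_cells, avg_key_bytes, avg_value_bytes,
                        cell_overhead=24, hdfs_replication=3):
    logical = num_cells * (avg_key_bytes + avg_value_bytes)
    with_overhead = logical + num_cells * cell_overhead
    return with_overhead * hdfs_replication

# 1 billion cells with 40-byte keys and 100-byte values:
size = physical_size_bytes(1_000_000_000, 40, 100)
print(size / 1024**3, "GiB stored on disk")
```

Note how, for small cells, the fixed 24-byte overhead becomes a large fraction of the total, matching the point about small keys/values above.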
Aside from the disk space necessary to store the data, one RS may not be able to serve arbitrarily large amounts of data due to some practical limits on region count and size (see below).
Number of nodes can also be driven by required throughput for reads and/or writes. The throughput one can get per node depends a lot on data (especially key/value sizes) and request patterns, as well as node and system configuration. If load is likely to be the main driver of node count growth, plan for peak load. The PerformanceEvaluation and YCSB tools can be used to test a single node or a test cluster.
For write, usually 5-15Mb/s per RS can be expected, since every region server has only one active WAL. There's no good estimate for reads, as it depends vastly on data, requests, and cache hit rate. Section 14.14, “Case Studies” might be helpful.
RS cannot currently utilize very large heap due to cost of GC. There's also no good way of running multiple RSes per server (other than running several VMs per machine). Thus, ~20-24Gb or less memory dedicated to one RS is recommended. GC tuning is required for large heap sizes. See Section 188.8.131.52, “Long GC pauses” and Section 15.2.3, “JVM Garbage Collection Logs”.
Generally fewer regions makes for a smoother running cluster (you can always manually split the big regions later (if necessary) to spread the data, or request load, over the cluster); 20-200 regions per RS is a reasonable range. The number of regions cannot be configured directly (unless you go for fully manual splitting); adjust the region size to achieve the target region count given the table size.
When configuring regions for multiple tables, note that most region settings can be set
on a per-table basis via HTableDescriptor,
as well as via shell commands. These settings will override the ones in
hbase-site.xml. That is useful if your tables have different workloads or use cases.
Also note that in the discussion of region sizes here, HDFS replication factor is not (and should not be) taken into account, whereas other factors above should be. So, if your data is compressed and replicated 3 ways by HDFS, "9 Gb region" means 9 Gb of compressed data. HDFS replication factor only affects your disk usage and is invisible to most HBase code.
In production scenarios, where you have a lot of data, you are normally concerned with
the maximum number of regions you can have per server. Section 184.108.40.206, “Why cannot I have too many regions?” has technical discussion on the subject; in short, maximum
number of regions is mostly determined by memstore memory usage. Each region has its own
memstores; these grow up to a configurable size, usually in the 128-256Mb range; see
hbase.hregion.memstore.flush.size. There is one memstore per column family
(so there is only one per region if the table has one CF). The RS dedicates some
fraction of total memory (see
hbase.regionserver.global.memstore.size) to region memstores. If this
memory is exceeded (too much memstore usage), undesirable consequences such as
unresponsive server, or later compaction storms, can result. Thus, a good starting point
for the number of regions per RS (assuming one table) is:
(RS memory)*(total memstore fraction)/((memstore size)*(# column families))
E.g., if an RS has 16Gb RAM, then with default settings the starting point is 16384*0.4/128 ~ 51 regions per RS. The formula can be extended to multiple tables; if they all have the same configuration, just use the total number of families.
This number can be adjusted; the formula above assumes all your regions are filled at approximately the same rate. If only a fraction of your regions are going to be actively written to, you can divide the result by that fraction to get a larger region count. Even when all regions are written to, memstores are not filled evenly; and even if they were, the limited number of concurrent flushes introduces jitter. Thus, one can have as many as 2-3 times more regions than the starting point; however, increased numbers carry increased risk.
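The starting-point formula and the active-write-fraction adjustment above can be sketched as a small calculation. The function name and parameters are illustrative; the defaults (0.4 global memstore fraction, 128Mb flush size) match the values discussed above:

```python
# Starting-point estimate for regions per RS, per the formula:
#   (RS memory) * (total memstore fraction) / ((memstore size) * (# column families))
# Defaults mirror hbase.regionserver.global.memstore.size (0.4) and
# hbase.hregion.memstore.flush.size (128 Mb) as described in the text.

def regions_per_rs(rs_memory_mb, memstore_fraction=0.4,
                   memstore_flush_mb=128, num_column_families=1,
                   active_write_fraction=1.0):
    base = (rs_memory_mb * memstore_fraction) / (memstore_flush_mb * num_column_families)
    # If only a fraction of regions takes writes, divide by that fraction
    # to get a larger region count.
    return base / active_write_fraction

print(regions_per_rs(16384))                              # ~51, the worked example
print(regions_per_rs(16384, active_write_fraction=0.5))   # ~102 if half the regions are hot
```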
For write-heavy workload, memstore fraction can be increased in configuration at the expense of block cache; this will also allow one to have more regions.
HBase scales by having regions across many servers. Thus if you have 2 regions for 16Gb of data on a 20 node cluster, your data will be concentrated on just a few machines - nearly the entire cluster will be idle. This really can't be stressed enough, since a common problem is loading 200Mb of data into HBase and then wondering why your awesome 10 node cluster isn't doing anything.
On the other hand, if you have a very large amount of data, you may also want to go for a larger number of regions to avoid having regions that are too large.
For large tables in production scenarios, maximum region size is mostly limited by compactions - very large compactions, especially major compactions, can degrade cluster performance. Currently, the recommended maximum region size is 10-20Gb, and 5-10Gb is optimal. For the older 0.90.x codebase, the upper bound of region size is about 4Gb, with a default of 256Mb.
If you cannot estimate the size of your tables well, when starting off, it's probably best to stick to the default region size, perhaps going smaller for hot tables (or manually split hot regions to spread the load over the cluster), or go with larger region sizes if your cell sizes tend to be largish (100k and up).
In HBase 0.98, an experimental stripe compactions feature was added that allows for larger regions, especially for log data. See Section 220.127.116.11.3, “Experimental: Stripe Compactions”.
According to above numbers for region size and number of regions per region server, in an optimistic estimate 10 GB x 100 regions per RS will give up to 1TB served per region server, which is in line with some of the reported multi-PB use cases. However, it is important to think about the data vs cache size ratio at the RS level. With 1TB of data per server and 10 GB block cache, only 1% of the data will be cached, which may barely cover all block indices.
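The data-to-cache ratio mentioned above is easy to sanity-check. A minimal sketch, assuming the 1Tb-per-RS and 10Gb block cache figures from the text (the function name is illustrative):

```python
# Data-to-cache ratio at the RS level: with 1 Tb of data per server and
# a 10 Gb block cache, only ~1% of the data can be cached at once.

def cached_fraction(data_per_rs_gb, block_cache_gb):
    return block_cache_gb / data_per_rs_gb

print(f"{cached_fraction(1024, 10):.1%}")  # ~1.0% of data fits in cache
```

A ratio this low means most reads go to disk, and the cache may barely cover block indices, as noted above.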
First, see Section 2.6, “The Important Configurations”. Note that some configurations, more than others, depend on specific scenarios. Pay special attention to:
hbase.regionserver.handler.count - request handler thread count, vital
for high-throughput workloads.
Section 18.104.22.168, “Configuring the size and number of WAL files” - the blocking number of WAL files depends on your memstore configuration and should be set accordingly to prevent potential blocking when doing a high volume of writes.
Then, there are some considerations when setting up your cluster and tables.
Depending on read/write volume and latency requirements, optimal compaction settings may be different. See Section 22.214.171.124, “Compaction” for some details.
When provisioning for large data sizes, however, it's good to keep in mind that
compactions can affect write throughput. Thus, for write-intensive workloads, you may opt
for less frequent compactions and more store files per region. The minimum number of files
for compactions (hbase.hstore.compaction.min) can be set higher;
hbase.hstore.blockingStoreFiles should also be increased, as more files
might accumulate in such a case. You may also consider managing compactions manually: see Section 126.96.36.199, “Managed Compactions”.
Based on the target number of the regions per RS (see above) and number of RSes, one can pre-split the table at creation time. This would both avoid some costly splitting as the table starts to fill up, and ensure that the table starts out already distributed across many servers.
If the table is expected to grow large enough to justify that, at least one region per RS should be created. It is not recommended to split immediately into the full target number of regions (e.g. 50 * number of RSes); choose a low intermediate value instead. For multiple tables, be conservative with presplitting (e.g. pre-split at most 1 region per RS), especially if you don't know how much each table will grow. If you split too much, you may end up with too many small regions on some tables.
For pre-splitting howto, see Section 14.8.2, “ Table Creation: Pre-Creating Regions ”.
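One way to choose split points is to space them evenly over the keyspace. A minimal sketch, under the assumption that row keys are uniformly distributed fixed-width hex strings (your actual row key design may require a different keyspace); the function name is illustrative, and the resulting keys could be supplied to the table-creation command's split option:

```python
# Compute evenly spaced split points for pre-splitting a table whose row
# keys are uniform fixed-width hex strings (an assumption; adapt the
# keyspace to your own row key design).

def hex_split_keys(num_regions, key_width=8):
    keyspace = 16 ** key_width
    step = keyspace // num_regions
    # num_regions regions need num_regions - 1 split points.
    return [format(i * step, f"0{key_width}x") for i in range(1, num_regions)]

print(hex_split_keys(4))  # ['40000000', '80000000', 'c0000000']
```

Starting with a small number of split points per RS (per the advice above) and letting HBase split further as data arrives is usually safer than pre-splitting to the full target count.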