9.7. Regions

Regions are the basic element of availability and distribution for tables, and consist of a Store per Column Family. The hierarchy of objects is as follows:

Table       (HBase table)
    Region       (Regions for the table)
         Store          (Store per ColumnFamily for each Region for the table)
              MemStore           (MemStore for each Store for each Region for the table)
              StoreFile          (StoreFiles for each Store for each Region for the table)
                    Block             (Blocks within a StoreFile within a Store for each Region for the table)
 

For a description of what HBase files look like when written to HDFS, see Section 15.7.2, “Browsing HDFS for HBase Objects”.

9.7.1. Considerations for Number of Regions

In general, HBase is designed to run with a small (20-200) number of relatively large (5-20Gb) regions per server. The considerations for this are as follows:

9.7.1.1. Why can't I have too many regions?

Typically you want to keep your region count low in HBase, for a number of reasons. Usually around 100 regions per RegionServer has yielded the best results. Here are some of the reasons to keep the region count low:

  1. MSLAB requires 2MB per MemStore (that is, 2MB per family per region). 1000 regions with 2 families each use about 3.9GB of heap, and they are not even storing data yet. NB: the 2MB value is configurable.

  2. If you fill all the regions at roughly the same rate, the global memstore limit forces tiny flushes when there are too many regions, which in turn generates compactions. Rewriting the same data tens of times is the last thing you want. For example, fill 1000 regions (with one family) equally, and assume a lower bound for global memstore usage of 5GB (the region server would have a big heap). Once usage reaches 5GB, the biggest region is force-flushed; at that point almost all regions hold about 5MB of data, so only about 5MB is flushed. After another 5MB of inserts, another region with a bit over 5MB of data is flushed, and so on. This is currently the main limiting factor for the number of regions; see Section 17.9.2.1, “Number of regions per RS - upper bound” for the detailed formula.

  3. The master, as it is, is allergic to tons of regions, and will take a lot of time assigning them and moving them around in batches. The reason is that it is heavy on ZooKeeper usage and not very asynchronous at the moment (this could really be improved, and has been improved a bunch in HBase 0.96).

  4. In older versions of HBase (pre-HFile v2, 0.90 and earlier), tons of regions on a few RegionServers can cause the store file index to grow, increasing heap usage and potentially creating memory pressure or OOMEs on the RegionServers.

Another issue is the effect of the number of regions on MapReduce jobs; it is typical to have one mapper per HBase region. Thus, hosting only 5 regions per RegionServer may not be enough to get a sufficient number of tasks for a MapReduce job, while 1000 regions will generate far too many tasks.
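
For illustration, here is a minimal sketch of a map-only job over an HBase table; TableInputFormat creates one input split, and therefore one map task, per region of the table. The table name 'mytable' and the no-op mapper are placeholders, not part of any particular deployment.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class PerRegionMapperJob {

  // Each mapper sees only the rows of the region backing its input split.
  static class NoOpMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // no-op; a real job would emit something here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "per-region-scan");
    job.setJarByClass(PerRegionMapperJob.class);
    Scan scan = new Scan();
    scan.setCaching(500);          // fewer RPCs per mapper
    scan.setCacheBlocks(false);    // recommended for MapReduce scans
    TableMapReduceUtil.initTableMapperJob(
        "mytable",                 // hypothetical table name
        scan, NoOpMapper.class, NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);      // map-only
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}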

See Section 17.9.2, “Determining region count and size” for configuration guidelines.

9.7.2. Region-RegionServer Assignment

This section describes how Regions are assigned to RegionServers.

9.7.2.1. Startup

When HBase starts, regions are assigned as follows (short version):

  1. The Master invokes the AssignmentManager upon startup.

  2. The AssignmentManager looks at the existing region assignments in META.

  3. If the region assignment is still valid (i.e., if the RegionServer is still online) then the assignment is kept.

  4. If the assignment is invalid, then the LoadBalancerFactory is invoked to assign the region. The DefaultLoadBalancer will randomly assign the region to a RegionServer.

  5. META is updated with the RegionServer assignment (if needed) and the RegionServer start codes (start time of the RegionServer process) upon region opening by the RegionServer.

9.7.2.2. Failover

When a RegionServer fails:

  1. The regions immediately become unavailable because the RegionServer is down.

  2. The Master will detect that the RegionServer has failed.

  3. The region assignments will be considered invalid and will be re-assigned just like the startup sequence.

  4. In-flight queries are re-tried, and not lost.

  5. Operations are switched to a new RegionServer within the following amount of time:

    ZooKeeper session timeout + split time + assignment/replay time

9.7.2.3. Region Load Balancing

Regions can be periodically moved by the LoadBalancer (see Section 9.5.4.1, “LoadBalancer”).

9.7.3. Region-RegionServer Locality

Over time, Region-RegionServer locality is achieved via HDFS block replication. The HDFS client does the following by default when choosing locations to write replicas:

  1. First replica is written to local node

  2. Second replica is written to a random node on another rack

  3. Third replica is written on the same rack as the second, but on a different node chosen randomly

  4. Subsequent replicas are written on random nodes on the cluster [25]

Thus, HBase eventually achieves locality for a region after a flush or a compaction. In a RegionServer failover situation a RegionServer may be assigned regions with non-local StoreFiles (because none of the replicas are local); however, as new data is written in the region, or the table is compacted and StoreFiles are re-written, they will become "local" to the RegionServer.

For more information, see Replica Placement: The First Baby Steps on this page: HDFS Architecture and also Lars George's blog on HBase and HDFS locality.

9.7.4. Region Splits

Regions split when they reach a configured threshold. Below we treat the topic in short. For a longer exposition, see Apache HBase Region Splitting and Merging by our Enis Soztutar.

Splits run unaided on the RegionServer; i.e., the Master does not participate. The RegionServer splits a region, offlines the split region, adds the daughter regions to META, opens the daughters on the parent's hosting RegionServer, and then reports the split to the Master. See Section 2.5.2.7, “Managed Splitting” for how to manually manage splits (and for why you might do this).

9.7.4.1. Custom Split Policies

The default split policy can be overridden with a custom RegionSplitPolicy (HBase 0.94+). Typically a custom split policy should extend HBase's default split policy: ConstantSizeRegionSplitPolicy.

The policy can be set globally through the HBaseConfiguration used, or on a per-table basis:

HTableDescriptor myHtd = ...;
myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName());
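
For illustration, a minimal sketch of the global variant, assuming the hbase.regionserver.region.split.policy configuration key and reusing the MyCustomSplitPolicy class from the example above:

Configuration conf = HBaseConfiguration.create();
// Assumes the hbase.regionserver.region.split.policy key; MyCustomSplitPolicy is the custom policy from above.
conf.set("hbase.regionserver.region.split.policy", MyCustomSplitPolicy.class.getName());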

9.7.5. Online Region Merges

Both the Master and the RegionServer participate in online region merges. The client sends a merge RPC to the Master; the Master then moves the regions together onto the RegionServer where the more heavily loaded of the two regions resides, and finally sends the merge request to that RegionServer, which runs the merge. Similar to region splits, a region merge runs as a local transaction on the RegionServer: it offlines the regions, merges the two regions on the file system, atomically deletes the merging regions from META and adds the merged region to META, opens the merged region on the RegionServer, and finally reports the merge to the Master.

An example of region merges in the HBase shell:

$ hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'
$ hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME', true

This is an asynchronous operation; the call returns immediately, without waiting for the merge to complete. Passing 'true' as the optional third parameter forces the merge; otherwise, the merge will fail unless the regions are adjacent. The 'force' option is for expert use only.
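
For illustration, the same merge issued from Java; this is a hedged sketch that assumes HBaseAdmin#mergeRegions is available (it arrived with online merge support) and uses placeholder encoded region names.

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
try {
  admin.mergeRegions(
      Bytes.toBytes("ENCODED_REGIONNAME_A"),   // placeholder encoded name of the first region
      Bytes.toBytes("ENCODED_REGIONNAME_B"),   // placeholder encoded name of the second region
      false);                                  // true would force a merge of non-adjacent regions (expert use only)
} finally {
  admin.close();
}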

9.7.6. Store

A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.

9.7.6.1. MemStore

The MemStore holds in-memory modifications to the Store. Modifications are Cells/KeyValues. When a flush is requested, the current memstore is moved to a snapshot and is cleared. HBase continues to serve edits from the new memstore and backing snapshot until the flusher reports that the flush succeeded. At this point, the snapshot is discarded. Note that when the flush happens, Memstores that belong to the same region will all be flushed.

9.7.6.2. MemStoreFlush

A MemStore flush can be triggered under any of the conditions listed below. The minimum flush unit is per region, not at the individual MemStore level.

  1. When a MemStore reaches the value specified by hbase.hregion.memstore.flush.size, all MemStores that belong to its region will be flushed out to disk.

  2. When overall memstore usage reaches the value specified by hbase.regionserver.global.memstore.upperLimit, MemStores from various regions will be flushed out to disk to reduce overall MemStore usage in a Region Server. The flush order is based on the descending order of a region's MemStore usage. Regions will have their MemStores flushed until the overall MemStore usage drops to or slightly below hbase.regionserver.global.memstore.lowerLimit.

  3. When the number of HLogs per RegionServer reaches the value specified in hbase.regionserver.max.logs, MemStores from various regions will be flushed out to disk to reduce the HLog count. The flush order is based on time: regions with the oldest MemStores are flushed first, until the HLog count drops below hbase.regionserver.max.logs.
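
For illustration, a minimal sketch of the first two thresholds set programmatically; in practice these usually live in hbase-site.xml, and the values shown are only illustrative.

Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.hregion.memstore.flush.size", 128 * 1024 * 1024L);   // per-region MemStore flush threshold, in bytes
conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);    // global MemStore usage that forces flushes
conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.35f);   // flushing continues until usage drops to about here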

9.7.6.3. Scans

  • When a client issues a scan against a table, HBase generates RegionScanner objects, one per region, to serve the scan request.

  • The RegionScanner object contains a list of StoreScanner objects, one per column family.

  • Each StoreScanner object further contains a list of StoreFileScanner objects, corresponding to each StoreFile and HFile of the corresponding column family, and a list of KeyValueScanner objects for the MemStore.

  • The two lists are merged into one, which is sorted in ascending order with the scan object for the MemStore at the end of the list.

  • When a StoreFileScanner object is constructed, it is associated with a MultiVersionConsistencyControl read point, which is the current memstoreTS, filtering out any new updates beyond the read point.
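
For illustration, a minimal client-side sketch of a scan that exercises the scanner hierarchy above; the table name 'mytable' and family 'cf' are placeholders.

HTable table = new HTable(HBaseConfiguration.create(), "mytable");
try {
  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes("cf"));              // one StoreScanner per requested family in each region
  ResultScanner scanner = table.getScanner(scan);   // on the server side, a RegionScanner is opened per region scanned
  try {
    for (Result result : scanner) {
      // each Result is a row assembled from the merged StoreFile/MemStore scanners
      System.out.println(Bytes.toString(result.getRow()));
    }
  } finally {
    scanner.close();
  }
} finally {
  table.close();
}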

9.7.6.4. StoreFile (HFile)

StoreFiles are where your data lives.

9.7.6.4.1. HFile Format

The hfile file format is based on the SSTable file described in the BigTable [2006] paper and on Hadoop's tfile (The unit test suite and the compression harness were taken directly from tfile). Schubert Zhang's blog post on HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs makes for a thorough introduction to HBase's hfile. Matteo Bertozzi has also put up a helpful description, HBase I/O: HFile.

For more information, see the HFile source code. Also see Appendix E, HFile format version 2 for information about the HFile v2 format that was included in 0.92.

9.7.6.4.2. HFile Tool

To view a textualized version of hfile content, you can use the org.apache.hadoop.hbase.io.hfile.HFile tool. Type the following to see usage:

$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile  

For example, to view the content of the file hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475, type the following:

 $ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475  

If you leave off the option -v, you see just a summary of the hfile. See usage for other things to do with the HFile tool.

9.7.6.4.3. StoreFile Directory Structure on HDFS

For more information of what StoreFiles look like on HDFS with respect to the directory structure, see Section 15.7.2, “Browsing HDFS for HBase Objects”.

9.7.6.5. Blocks

StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis.
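
For illustration, a sketch of setting the block size on a column family at table-creation time; the table and family names are placeholders, and the 64 KB value is the usual default, shown only as an example.

HTableDescriptor desc = new HTableDescriptor("mytable");   // hypothetical table
HColumnDescriptor cf = new HColumnDescriptor("cf");        // hypothetical column family
cf.setBlocksize(64 * 1024);                                // HFile block size for this family, in bytes
desc.addFamily(cf);
new HBaseAdmin(HBaseConfiguration.create()).createTable(desc);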

Compression happens at the block level within StoreFiles. For more information on compression, see Appendix C, Compression In HBase.

For more information on blocks, see the HFileBlock source code.

9.7.6.6. KeyValue

The KeyValue class is the heart of data storage in HBase. KeyValue wraps a byte array and takes offsets and lengths into the passed array, which specify where to start interpreting the content as a KeyValue.

The KeyValue format inside a byte array is:

  • keylength

  • valuelength

  • key

  • value

The Key is further decomposed as:

  • rowlength

  • row (i.e., the rowkey)

  • columnfamilylength

  • columnfamily

  • columnqualifier

  • timestamp

  • keytype (e.g., Put, Delete, DeleteColumn, DeleteFamily)

KeyValue instances are not split across blocks. For example, if there is an 8 MB KeyValue, even if the block-size is 64kb this KeyValue will be read in as a coherent block. For more information, see the KeyValue source code.

9.7.6.6.1. Example

To emphasize the points above, examine what happens with two Puts for two different columns for the same row:

  • Put #1: rowkey=row1, cf:attr1=value1

  • Put #2: rowkey=row1, cf:attr2=value2

Even though these are for the same row, a KeyValue is created for each column:

Key portion for Put #1:

  • rowlength ------------> 4

  • row -----------------> row1

  • columnfamilylength ---> 2

  • columnfamily --------> cf

  • columnqualifier ------> attr1

  • timestamp -----------> server time of Put

  • keytype -------------> Put

Key portion for Put #2:

  • rowlength ------------> 4

  • row -----------------> row1

  • columnfamilylength ---> 2

  • columnfamily --------> cf

  • columnqualifier ------> attr2

  • timestamp -----------> server time of Put

  • keytype -------------> Put

It is critical to understand that the rowkey, ColumnFamily, and column (aka columnqualifier) are embedded within the KeyValue instance. The longer these identifiers are, the bigger the KeyValue is.
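
For illustration, the two Puts from the example above as client code; the table name 'mytable' is a placeholder.

HTable table = new HTable(HBaseConfiguration.create(), "mytable");   // placeholder table name
Put put1 = new Put(Bytes.toBytes("row1"));
put1.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes("value1"));   // stored as one KeyValue
Put put2 = new Put(Bytes.toBytes("row1"));
put2.add(Bytes.toBytes("cf"), Bytes.toBytes("attr2"), Bytes.toBytes("value2"));   // stored as a second, separate KeyValue
table.put(put1);
table.put(put2);
table.close();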

9.7.6.7. Compaction

Compaction is an operation which reduces the number of StoreFiles, by merging them together, in order to increase performance on read operations. Compactions can be resource-intensive to perform, and can either help or hinder performance depending on many factors.

Compactions fall into two categories: minor and major.

Minor compactions usually pick up a small number of small, adjacent StoreFiles and rewrite them as a single StoreFile. Minor compactions do not drop deletes or expired cells. If a minor compaction picks up all the StoreFiles in a Store, it promotes itself from a minor to a major compaction. If there are a lot of small files to be compacted, the algorithm tends to favor minor compactions to "clean up" those small files.

The goal of a major compaction is to end up with a single StoreFile per store. Major compactions also process delete markers and max versions. Attempting to process these during a minor compaction could cause side effects.

Compaction and Deletions.  When an explicit deletion occurs in HBase, the data is not actually deleted. Instead, a tombstone marker is written. The tombstone marker prevents the data from being returned with queries. During a major compaction, the data is actually deleted, and the tombstone marker is removed from the StoreFile. If the deletion happens because of an expired TTL, no tombstone is created. Instead, the expired data is filtered out and is not written back to the compacted StoreFile.

Compaction and Versions.  When you create a column family, you can specify the maximum number of versions to keep, by specifying HColumnDescriptor.setMaxVersions(int versions). The default value is 3. If more versions than the specified maximum exist, the excess versions are filtered out and not written back to the compacted StoreFile.
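
For illustration, a sketch of setting the maximum number of versions on a column family at table-creation time; the names are placeholders.

HColumnDescriptor cf = new HColumnDescriptor("cf");        // hypothetical column family
cf.setMaxVersions(1);                                      // keep only the newest version; excess versions are dropped at major compaction
HTableDescriptor desc = new HTableDescriptor("mytable");   // hypothetical table
desc.addFamily(cf);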

Major Compactions Can Impact Query Results

In some situations, older versions can be inadvertently resurrected if a newer version is explicitly deleted. See Section 5.9.2.2, “Major compactions change query results” for a more in-depth explanation. This situation is only possible before the compaction finishes.

In theory, major compactions improve performance. However, on a highly loaded system, major compactions can require an inappropriate number of resources and adversely affect performance. In a default configuration, major compactions are scheduled automatically to run once in a 7-day period. This is usually inappropriate for systems in production. You can manage major compactions manually. See Section 2.5.2.8, “Managed Compactions”.

Compactions do not perform region merges. See Section 17.2.2, “Merge” for more information on region merging.

9.7.6.7.1. Algorithm for Compaction File Selection - HBase 0.96.x and newer

The compaction algorithms used by HBase have evolved over time. HBase 0.96 introduced new algorithms for compaction file selection. To find out about the old algorithms, see Section 9.7.6.7, “Compaction”. The rest of this section describes the new algorithm. File selection happens in several phases and is controlled by several configurable parameters. These parameters will be explained in context, and then will be given in a table which shows their descriptions, defaults, and implications of changing them.

The ExploringCompactionPolicy (HBASE-7842) was introduced in HBase 0.96 and represents a major change in the algorithms for file selection for compactions. Its goal is to do the most impactful compaction with the lowest cost, in situations where a lot of files need compaction. In such a situation, the list of all eligible files is "explored", and files are grouped by size before any ratio-based algorithms are run. This favors clean-up of large numbers of small files before larger files are considered. For more details, refer to the link to the JIRA. Most of the code for this change can be reviewed in hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.java.

Algorithms for Determining File List and Compaction Type

Create a list of all files which can possibly be compacted, ordered by sequence ID.

The first phase is to create a list of all candidates for compaction. A list is created of all StoreFiles not already in the compaction queue, and all files newer than the newest file that is currently being compacted. This list of files is ordered by the sequence ID. The sequence ID is generated when a Put is appended to the write-ahead log (WAL), and is stored in the metadata of the StoreFile.

Check to see if major compaction is required because there are too many StoreFiles and the memstore is too large.

A store can have at most hbase.hstore.blockingStoreFiles store files. If the store has too many files, you cannot flush data. In addition, you cannot perform an insert if the memstore is over hbase.hregion.memstore.flush.size. Normally, minor compactions will alleviate this situation. However, if the normal compaction algorithm does not find any normally-eligible StoreFiles, a major compaction is the only way out of this situation, and is forced.

If you are using the ExploringCompactionPolicy, the set of files to compact is always selected, and will not trigger a major compaction. See the description of the ExploringCompactionPolicy above.

If this compaction was user-requested, perform the requested type of compaction.

Compactions can run on a schedule or can be initiated manually. If a compaction is requested manually, HBase always runs that type of compaction. If the user requests a major compaction, the major compaction still runs even if there are more than hbase.hstore.compaction.max files that need compaction.

Exclude files which are too large.

The purpose of compaction is to merge small files together, and it is counterproductive to compact files which are too large. Files larger than hbase.hstore.compaction.max.size are excluded from consideration.

If configured, exclude bulk-loaded files.

You may decide to exclude bulk-loaded files from compaction, in the bulk load operation, by specifying the hbase.mapreduce.hfileoutputformat.compaction.exclude parameter. If a bulk-loaded file was excluded, it is removed from consideration at this point.

If there are too many files to compact, do a minor compaction.

The maximum number of files allowed in a major compaction is controlled by the hbase.hstore.compaction.max parameter. If the list contains more than this number of files, a compaction that would otherwise be a major compaction is downgraded to a minor compaction. However, a user-requested major compaction still occurs even if there are more than hbase.hstore.compaction.max files to compact.

Only run the compaction if enough files need to be compacted.

If the list contains fewer than hbase.hstore.compaction.min files to compact, compaction is aborted.

If this is a minor compaction, determine which files are eligible, based upon the hbase.store.compaction.ratio.

The value of the hbase.store.compaction.ratio parameter is multiplied by the sum of files smaller than a given file, to determine whether that file is selected for compaction during a minor compaction. For instance, if hbase.store.compaction.ratio is 1.2, FileX is 5 mb, FileY is 2 mb, and FileZ is 3 mb:

5 <= 1.2 x (2 + 3)            or          5 <= 6

In this scenario, FileX is eligible for minor compaction. If FileX were 7 mb, it would not be eligible for minor compaction. This ratio favors smaller files. You can configure a different ratio for use in off-peak hours, using the parameter hbase.hstore.compaction.ratio.offpeak, if you also configure hbase.offpeak.start.hour and hbase.offpeak.end.hour.
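
For illustration, a small sketch of this ratio test as a standalone check; the numbers simply reproduce the example above.

public class CompactionRatioCheck {
  // Mirrors the rule: a file is eligible if fileSize <= ratio * sum(sizes of the smaller files).
  static boolean eligible(long fileSize, long sumOfSmallerFiles, double ratio) {
    return fileSize <= ratio * sumOfSmallerFiles;
  }

  public static void main(String[] args) {
    long mb = 1024 * 1024;
    System.out.println(eligible(5 * mb, (2 + 3) * mb, 1.2));   // true:  5 <= 1.2 x (2 + 3)
    System.out.println(eligible(7 * mb, (2 + 3) * mb, 1.2));   // false: 7 >  1.2 x (2 + 3)
  }
}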

If major compactions are not managed manually, and it has been too long since the last major compaction, run a major compaction anyway.

If the last major compaction was too long ago and there is more than one file to be compacted, a major compaction is run, even if it would otherwise have been minor. By default, the maximum time between major compactions is 7 days, plus or minus a 4.8 hour period, and determined randomly within those parameters. Prior to HBase 0.96, the major compaction period was 24 hours. This is also referred to as a time-based or time-triggered major compaction. See hbase.hregion.majorcompaction in the table below to tune or disable time-based major compactions.

Table 9.1. Parameters Used by Compaction Algorithm

hbase.hstore.compaction.min
    The minimum number of files which must be eligible for compaction before compaction can run. In previous versions, the parameter hbase.hstore.compaction.min was called hbase.hstore.compactionThreshold.
    Default: 3

hbase.hstore.compaction.max
    The maximum number of files which will be selected for a single minor compaction, regardless of the number of eligible files.
    Default: 10

hbase.hstore.compaction.min.size
    A StoreFile smaller than this size (in bytes) will always be eligible for minor compaction.
    Default: 128 MB

hbase.hstore.compaction.max.size
    A StoreFile larger than this size (in bytes) will be excluded from minor compaction.
    Default: Long.MAX_VALUE

hbase.store.compaction.ratio
    For minor compaction, this ratio is used to determine whether a given file is eligible for compaction. Its effect is to limit compaction of large files. Expressed as a floating-point decimal.
    Default: 1.2F

hbase.hstore.compaction.ratio.offpeak
    The compaction ratio used during off-peak compactions, if off-peak is enabled. Expressed as a floating-point decimal. This allows for more aggressive compaction, because in theory, the cluster is under less load. Ignored if off-peak is disabled (default).
    Default: 5.0F

hbase.offpeak.start.hour
    The start of off-peak hours, expressed as an integer between 0 and 23, inclusive. Set to -1 to disable off-peak.
    Default: -1 (disabled)

hbase.offpeak.end.hour
    The end of off-peak hours, expressed as an integer between 0 and 23, inclusive. Set to -1 to disable off-peak.
    Default: -1 (disabled)

hbase.regionserver.thread.compaction.throttle
    Throttles compaction if too much of a backlog of compaction work exists.
    Default: 2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size (which defaults to 128)

hbase.hregion.majorcompaction
    Time between major compactions, expressed in milliseconds. Set to 0 to disable time-based automatic major compactions. User-requested and size-based major compactions will still run.
    Default: 7 days (604800000 milliseconds)

hbase.hregion.majorcompaction.jitter
    A multiplier applied to majorCompactionPeriod to cause compaction to occur a given amount of time either side of majorCompactionPeriod. The smaller the number, the closer the compactions will happen to the hbase.hregion.majorcompaction interval. Expressed as a floating-point decimal.
    Default: .50F

9.7.6.7.2. Compaction File Selection

To understand the core algorithm for StoreFile selection, there is some ASCII-art in the Store source code that will serve as useful reference. It has been copied below:

/* normal skew:
 *
 *         older ----> newer
 *     _
 *    | |   _
 *    | |  | |   _
 *  --|-|- |-|- |-|---_-------_-------  minCompactSize
 *    | |  | |  | |  | |  _  | |
 *    | |  | |  | |  | | | | | |
 *    | |  | |  | |  | | | | | |
 */

Important knobs:

  • hbase.store.compaction.ratio Ratio used in compaction file selection algorithm (default 1.2f).

  • hbase.hstore.compaction.min (.90 hbase.hstore.compactionThreshold) (files) Minimum number of StoreFiles per Store to be selected for a compaction to occur (default 2).

  • hbase.hstore.compaction.max (files) Maximum number of StoreFiles to compact per minor compaction (default 10).

  • hbase.hstore.compaction.min.size (bytes) Any StoreFile smaller than this setting will automatically be a candidate for compaction. Defaults to hbase.hregion.memstore.flush.size (128 MB).

  • hbase.hstore.compaction.max.size (.92) (bytes) Any StoreFile larger than this setting will automatically be excluded from compaction (default Long.MAX_VALUE).

The minor compaction StoreFile selection logic is size based, and selects a file for compaction when the file <= sum(smaller_files) * hbase.hstore.compaction.ratio.
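
For illustration, a simplified sketch of this selection rule (not the actual HBase implementation): it walks the candidates from oldest to newest, pulls in newer files once a file qualifies, and then applies the min/max file-count limits. The min.size/max.size handling is omitted for brevity.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MinorCompactionSelectionSketch {

  static List<Long> select(List<Long> sizesOldestFirst, double ratio, int minFiles, int maxFiles) {
    List<Long> selected = new ArrayList<Long>();
    for (int i = 0; i < sizesOldestFirst.size(); i++) {
      long size = sizesOldestFirst.get(i);
      long sumOfNewer = 0;
      for (int j = i + 1; j < sizesOldestFirst.size(); j++) {
        sumOfNewer += sizesOldestFirst.get(j);
      }
      // Once the first file qualifies, newer (smaller) files are pulled in too, up to maxFiles.
      if ((!selected.isEmpty() || size <= sumOfNewer * ratio) && selected.size() < maxFiles) {
        selected.add(size);
      }
    }
    // Too few files selected: no compaction at all.
    return selected.size() < minFiles ? new ArrayList<Long>() : selected;
  }

  public static void main(String[] args) {
    List<Long> files = Arrays.asList(100L, 50L, 23L, 12L, 12L);   // sizes from Example #1 below, oldest first
    System.out.println(select(files, 1.0, 3, 5));                 // prints [23, 12, 12]
  }
}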

9.7.6.7.3. Minor Compaction File Selection - Example #1 (Basic Example)

This example mirrors an example from the unit test TestCompactSelection.

  • hbase.store.compaction.ratio = 1.0f

  • hbase.hstore.compaction.min = 3 (files)

  • hbase.hstore.compaction.max = 5 (files)

  • hbase.hstore.compaction.min.size = 10 (bytes)

  • hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 100, 50, 23, 12, and 12 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 23, 12, and 12.

Why?

  • 100 --> No, because sum(50, 23, 12, 12) * 1.0 = 97.

  • 50 --> No, because sum(23, 12, 12) * 1.0 = 47.

  • 23 --> Yes, because sum(12, 12) * 1.0 = 24.

  • 12 --> Yes, because the previous file has been included, and because this does not exceed the max-file limit of 5.

  • 12 --> Yes, because the previous file had been included, and because this does not exceed the max-file limit of 5.

9.7.6.7.4. Minor Compaction File Selection - Example #2 (Not Enough Files To Compact)

This example mirrors an example from the unit test TestCompactSelection.

  • hbase.store.compaction.ratio = 1.0f

  • hbase.hstore.compaction.min = 3 (files)

  • hbase.hstore.compaction.max = 5 (files)

  • hbase.hstore.compaction.min.size = 10 (bytes)

  • hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 100, 25, 12, and 12 bytes apiece (oldest to newest). With the above parameters, no compaction will be started.

Why?

  • 100 --> No, because sum(25, 12, 12) * 1.0 = 47

  • 25 --> No, because sum(12, 12) * 1.0 = 24

  • 12 --> No. Candidate because sum(12) * 1.0 = 12, but there are only 2 files to compact, which is less than the threshold of 3.

  • 12 --> No. Candidate because the previous StoreFile was, but there are not enough files to compact.

9.7.6.7.5. Minor Compaction File Selection - Example #3 (Limiting Files To Compact)

This example mirrors an example from the unit test TestCompactSelection.

  • hbase.store.compaction.ratio = 1.0f

  • hbase.hstore.compaction.min = 3 (files)

  • hbase.hstore.compaction.max = 5 (files)

  • hbase.hstore.compaction.min.size = 10 (bytes)

  • hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 7, 6, 5, 4, 3, 2, and 1 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 7, 6, 5, 4, 3.

Why?

  • 7 --> Yes, because sum(6, 5, 4, 3, 2, 1) * 1.0 = 21. Also, 7 is less than the min-size

  • 6 --> Yes, because sum(5, 4, 3, 2, 1) * 1.0 = 15. Also, 6 is less than the min-size.

  • 5 --> Yes, because sum(4, 3, 2, 1) * 1.0 = 10. Also, 5 is less than the min-size.

  • 4 --> Yes, because sum(3, 2, 1) * 1.0 = 6. Also, 4 is less than the min-size.

  • 3 --> Yes, because sum(2, 1) * 1.0 = 3. Also, 3 is less than the min-size.

  • 2 --> No. Candidate because previous file was selected and 2 is less than the min-size, but the max-number of files to compact has been reached.

  • 1 --> No. Candidate because previous file was selected and 1 is less than the min-size, but max-number of files to compact has been reached.

9.7.6.7.6. Impact of Key Configuration Options

hbase.store.compaction.ratio. A large ratio (e.g., 10) will produce a single giant file. Conversely, a value of .25 will produce behavior similar to the BigTable compaction algorithm - resulting in 4 StoreFiles.

hbase.hstore.compaction.min.size. Because this setting is the "automatic include" limit for all StoreFiles smaller than this value, it may need to be adjusted downward in write-heavy environments where many 1 or 2 MB StoreFiles are being flushed: every such file will be targeted for compaction, and the resulting files may still be under the min-size and require further compaction, and so on.

9.7.6.7.7. Experimental: stripe compactions

Stripe compaction is an experimental feature added in HBase 0.98 which aims to improve compactions for large regions or non-uniformly distributed row keys. In order to achieve smaller and/or more granular compactions, the store files within a region are maintained separately for several row-key sub-ranges, or "stripes", of the region. The division is not visible to the higher levels of the system, so externally each region functions as before.

This feature is fully compatible with default compactions - it can be enabled for existing tables, and the table will continue to operate normally if it's disabled later.

9.7.6.7.8. When to use

You might want to consider using this feature if you have:

  • large regions (in that case, you can get the positive effect of much smaller regions without additional memstore and region management overhead); or

  • non-uniform row keys, e.g. time dimension in a key (in that case, only the stripes receiving the new keys will keep compacting - old data will not compact as much, or at all).

According to performance testing, in these cases read performance can improve somewhat, and read and write performance variability due to compactions is greatly reduced. Over the long term, there is an overall performance improvement on large regions with non-uniform row keys (e.g., a hash-prefixed timestamp key). All of these performance gains are best realized when the table is already large. In the future, the performance improvement might also extend to region splits.

9.7.6.7.8.1. How to enable

To use stripe compactions for a table or a column family, you should set its hbase.hstore.engine.class to org.apache.hadoop.hbase.regionserver.StripeStoreEngine. Due to the nature of compactions, you also need to set the blocking file count to a high number (100 is a good default, which is 10 times the normal default of 10). If changing an existing table, you should do it while the table is disabled. Examples:

alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}

alter 'orders_table', {NAME => 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}}

create 'orders_table', 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}

Then, you can configure the other options if needed (see below) and enable the table. To switch back to default compactions, set hbase.hstore.engine.class to nil to unset it; or set it explicitly to "org.apache.hadoop.hbase.regionserver.DefaultStoreEngine" (this also needs to be done on a disabled table).

When you enable a large table after changing the store engine either way, a major compaction will likely be performed on most regions. This is not a problem with new tables.

9.7.6.7.8.2. How to configure

All of the settings described below are best set on table/cf level (with the table disabled first, for the settings to apply), similar to the above, e.g.

alter 'orders_table', CONFIGURATION => {'key' => 'value', ..., 'key' => 'value'}

9.7.6.7.8.2.1. Region and stripe sizing

Based on your region sizing, you might want to also change your stripe sizing. By default, new regions start with one stripe. When a stripe grows too large (16 times the memstore flush size), it is split into two stripes at the next compaction. Stripe splitting continues in a similar manner as the region grows, until the region itself is big enough to split (region splits work the same as with default compactions).

You can improve this pattern for your data. You should generally aim for a stripe size of at least 1GB, and about 8-12 stripes for uniform row keys; so, for example, if your regions are 30GB, 12 x 2.5GB stripes might be a good idea.

The settings are as follows:

hbase.store.stripe.initialStripeCount
    Initial stripe count to create. You can use it as follows:

      • for relatively uniform row keys, if you know the approximate target number of stripes from the above, you can avoid some splitting overhead by starting with several stripes (2, 5, 10...). Note that if the early data is not representative of the overall row key distribution, this will not be as efficient.

      • for existing tables with lots of data, you can use this to pre-split stripes.

      • for e.g. hash-prefixed sequential keys, with more than one hash prefix per region, you know that some pre-splitting makes sense.

hbase.store.stripe.sizeToSplit
    Maximum stripe size before it's split. You can use this in conjunction with the next setting to control the target stripe size (sizeToSplit = splitPartsCount * target stripe size), according to the above sizing considerations.

hbase.store.stripe.splitPartCount
    The number of new stripes to create when splitting one. The default is 2, which is good for most cases. For non-uniform row keys, you might experiment with increasing the number somewhat (3-4), to isolate the arriving updates into a narrower slice of the region with just one split instead of several.

9.7.6.7.8.2.2. Memstore sizing

By default, the flush creates several files from one memstore, according to existing stripe boundaries and row keys to flush. This approach minimizes write amplification, but can be undesirable if memstore is small and there are many stripes (the files will be too small).

In such cases, you can set hbase.store.stripe.compaction.flushToL0 to true. This will cause flush to create a single file instead; when at least hbase.store.stripe.compaction.minFilesL0 such files (by default, 4) accumulate, they will be compacted into striped files.

9.7.6.7.8.2.3. Normal compaction configuration

All the settings that apply to normal compactions (file size limits, etc.) apply to stripe compactions. The exceptions are the min and max number of files, which are set to higher values by default because the files in stripes are smaller. To control these for stripe compactions, use hbase.store.stripe.compaction.minFiles and .maxFiles.



[25] See Replica Placement: The First Baby Steps on this page: HDFS Architecture
