RegionSplitter (Apache HBase 2.4.0 API)

java.lang.Object
- org.apache.hadoop.hbase.util.RegionSplitter

```
@InterfaceAudience.Private
public class RegionSplitter
extends Object
```
The RegionSplitter class provides utilities for administrators and developers who choose to split regions manually instead of letting HBase split them automatically. The most useful utilities are:
- Create a table with a specified number of pre-split regions
- Execute a rolling split of all regions on an existing table
Both operations can be safely done on a live server.
Question: How do I turn off automatic splitting?
Answer: Automatic splitting is controlled by the configuration value HConstants.HREGION_MAX_FILESIZE. Setting it to Long.MAX_VALUE is not recommended, in case you forget about manual splits. A suggested setting is 100GB, which would result in major compactions longer than 1 hour if reached.
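As a rough illustration of the suggested setting, the sketch below converts 100GB to bytes and prints a matching hbase-site.xml property. It assumes that the property name behind HConstants.HREGION_MAX_FILESIZE is `hbase.hregion.max.filesize`, that the value is expressed in bytes, and that a gigabyte means 1024^3 bytes. The class and method names are hypothetical and not part of HBase.

```java
// Hypothetical helper, not part of HBase: renders the 100GB suggestion
// as an hbase-site.xml property.
class MaxFileSizeSketch {
    // Assumed property name behind HConstants.HREGION_MAX_FILESIZE.
    static final String HREGION_MAX_FILESIZE = "hbase.hregion.max.filesize";

    // Converts binary gigabytes (1024^3 bytes each) to bytes.
    static long gigabytesToBytes(long gb) {
        return Math.multiplyExact(gb, 1024L * 1024L * 1024L);
    }

    // Renders one <property> element for hbase-site.xml.
    static String toProperty(String name, long value) {
        return "<property>\n"
             + "  <name>" + name + "</name>\n"
             + "  <value>" + value + "</value>\n"
             + "</property>";
    }

    public static void main(String[] args) {
        System.out.println(toProperty(HREGION_MAX_FILESIZE, gigabytesToBytes(100)));
    }
}
```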
Question: Why did the original authors decide to manually split?
Answer: Specific workload characteristics of our use case allowed us to benefit from a manual split system.
- Data (~1k) that would grow instead of being replaced
- Data growth was roughly uniform across all regions
- OLTP workload. Data loss is a big deal.
Question: Why is manual splitting good for this workload?
Answer: Although automated splitting is not a bad option, there are benefits to manual splitting.
- With growing amounts of data, splits will continually be needed. Since you always know exactly what regions you have, long-term debugging and profiling are much easier with manual splits. It is hard to trace the logs to understand region-level problems if regions keep splitting and getting renamed.
- Data offlining bugs + unknown number of split regions == oh crap! If a WAL or StoreFile was mistakenly left unprocessed by HBase due to a weird bug and you notice it a day or so later, you can be assured that the regions specified in these files are the same as the current regions, and you have fewer headaches trying to restore/replay your data.
- You can finely tune your compaction algorithm. With roughly uniform data growth, it's easy to cause split / compaction storms as the regions all roughly hit the same data size at the same time. With manual splits, you can let staggered, time-based major compactions spread out your network IO load.
Question: What's the optimal number of pre-split regions to create?
Answer: Mileage will vary depending upon your application.
The short answer for our application is that we started with 10 pre-split regions / server and watched our data growth over time. It's better to err on the side of too few regions and perform a rolling split later.
The more complicated answer is that this depends upon the largest storefile in your region. With a growing data size, this will get larger over time. You want the largest region to be just big enough that the HStore compact selection algorithm only compacts it due to a timed major. If you don't, your cluster can be prone to compaction storms as the algorithm decides to run major compactions on a large series of regions all at once. Note that compaction storms are due to the uniform data growth, not the manual split decision.
If you pre-split your regions too thin, you can increase the major compaction interval by configuring HConstants.MAJOR_COMPACTION_PERIOD. If your data size grows too large, use this script to perform a network IO safe rolling split of all regions.
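
The sizing advice above comes down to simple arithmetic, sketched here under stated assumptions. The 10 regions per server figure and the doubling per rolling split come from this page; the server count and the 14-day interval are made-up inputs. The sketch also assumes that HConstants.MAJOR_COMPACTION_PERIOD is expressed in milliseconds. All names are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sizing helper, not part of HBase.
class PresplitSizingSketch {
    // Initial region count: region servers times regions per server.
    static int initialRegions(int regionServers, int regionsPerServer) {
        return Math.multiplyExact(regionServers, regionsPerServer);
    }

    // Each rolling split divides every region in two, so the count doubles.
    static int regionsAfterRollingSplits(int regions, int rounds) {
        int result = regions;
        for (int i = 0; i < rounds; i++) {
            result = Math.multiplyExact(result, 2);
        }
        return result;
    }

    // Converts a major compaction interval in days to milliseconds.
    static long daysToMillis(long days) {
        return TimeUnit.DAYS.toMillis(days);
    }

    public static void main(String[] args) {
        int regions = initialRegions(6, 10); // 6 servers is an assumed cluster size
        System.out.println("initial regions: " + regions);
        System.out.println("after one rolling split: " + regionsAfterRollingSplits(regions, 1));
        System.out.println("14-day compaction period (ms): " + daysToMillis(14));
    }
}
```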

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`RegionSplitter.DecimalStringSplit` The format of a DecimalStringSplit region boundary is the ASCII representation of reversed sequential number, or any other uniformly distributed decimal value.
`static class`	`RegionSplitter.HexStringSplit` HexStringSplit is a well-known `RegionSplitter.SplitAlgorithm` for choosing region boundaries.
`static class`	`RegionSplitter.NumberStringSplit`
`static interface`	`RegionSplitter.SplitAlgorithm` A generic interface for the RegionSplitter code to use for all its functionality.
`static class`	`RegionSplitter.UniformSplit` A SplitAlgorithm that divides the space of possible keys evenly.
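
To make the idea of evenly spaced string boundaries concrete, here is a minimal sketch of how split points over a fixed-width numeric key space could be computed. It is not the HBase implementation. The 8-character key width, the first and last rows ("00000000" to "FFFFFFFF" for hex, "00000000" to "99999999" for decimal), and the lowercase hex output are assumptions, and the class and method names are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical sketch of evenly spaced string boundaries; not HBase code.
class StringSplitSketch {
    // Returns the n-1 boundaries that cut [first, last] into n equal ranges.
    static String[] split(String first, String last, int radix, int n) {
        if (n < 2) {
            throw new IllegalArgumentException("need at least 2 regions");
        }
        BigInteger lo = new BigInteger(first, radix);
        BigInteger hi = new BigInteger(last, radix);
        // Size of the whole key space, inclusive of both ends.
        BigInteger range = hi.subtract(lo).add(BigInteger.ONE);
        BigInteger step = range.divide(BigInteger.valueOf(n));
        String[] boundaries = new String[n - 1];
        for (int i = 1; i < n; i++) {
            BigInteger b = lo.add(step.multiply(BigInteger.valueOf(i)));
            boundaries[i - 1] = leftPad(b.toString(radix), first.length());
        }
        return boundaries;
    }

    // Pads with leading zeros so all boundaries have the same width.
    static String leftPad(String s, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", split("00000000", "FFFFFFFF", 16, 4)));
        System.out.println(String.join(",", split("00000000", "99999999", 10, 4)));
    }
}
```

With four regions, the hex key space yields the boundaries 40000000, 80000000, and c0000000, and the decimal key space yields 25000000, 50000000, and 75000000.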

Field Summary

Fields
Modifier and Type	Field and Description
`private static org.slf4j.Logger`	`LOG`

Constructor Summary

Constructors
Constructor and Description
`RegionSplitter()`

Method Summary

Modifier and Type	Method and Description
`(package private) static void`	`createPresplitTable(TableName tableName, RegionSplitter.SplitAlgorithm splitAlgo, String[] columnFamilies, org.apache.hadoop.conf.Configuration conf)`
`private static int`	`getRegionServerCount(Connection connection)` Alternative to getCurrentNrHRS, which is no longer available.
`(package private) static LinkedList<Pair<byte[],byte[]>>`	`getSplits(Connection connection, TableName tableName, RegionSplitter.SplitAlgorithm splitAlgo)`
`private static Pair<org.apache.hadoop.fs.Path,org.apache.hadoop.fs.Path>`	`getTableDirAndSplitFile(org.apache.hadoop.conf.Configuration conf, TableName tableName)`
`static void`	`main(String[] args)` The main function for the RegionSplitter application.
`static RegionSplitter.SplitAlgorithm`	`newSplitAlgoInstance(org.apache.hadoop.conf.Configuration conf, String splitClassName)`
`private static byte[]`	`readFile(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path)`
`(package private) static void`	`rollingSplit(TableName tableName, RegionSplitter.SplitAlgorithm splitAlgo, org.apache.hadoop.conf.Configuration conf)`
`(package private) static LinkedList<Pair<byte[],byte[]>>`	`splitScan(LinkedList<Pair<byte[],byte[]>> regionList, Connection connection, TableName tableName, RegionSplitter.SplitAlgorithm splitAlgo)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

private static final org.slf4j.Logger LOG

Constructor Detail
RegionSplitter
```
public RegionSplitter()
```

Method Detail

main
```
public static void main(String[] args)
                 throws IOException,
                        InterruptedException,
                        org.apache.hbase.thirdparty.org.apache.commons.cli.ParseException
```
The main function for the RegionSplitter application. Common uses:
- create a table named 'myTable' with 60 pre-split regions containing 2 column families 'test' & 'rs', assuming the keys are hex-encoded ASCII:
  - bin/hbase org.apache.hadoop.hbase.util.RegionSplitter -c 60 -f test:rs myTable HexStringSplit
- create a table named 'myTable' with 50 pre-split regions, assuming the keys are decimal-encoded ASCII:
  - bin/hbase org.apache.hadoop.hbase.util.RegionSplitter -c 50 myTable DecimalStringSplit
- perform a rolling split of 'myTable' (i.e. 60 => 120 regions), 2 outstanding splits at a time, assuming keys are uniformly distributed bytes:
  - bin/hbase org.apache.hadoop.hbase.util.RegionSplitter -r -o 2 myTable UniformSplit
RegionSplitter has three built-in SplitAlgorithms: HexStringSplit, DecimalStringSplit, and UniformSplit. These are different strategies for choosing region boundaries. See their source code for details.
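
If the tool is launched from Java code rather than a shell, the command lines above could be assembled as argument lists. The sketch below only builds the lists and does not run anything. It uses only the flags shown in the examples (-c, -f, -r, -o) and assumes the `bin/hbase` launcher path is relative to the HBase install directory. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical command builder for the examples above; not HBase code.
class RegionSplitterCommandSketch {
    static final String LAUNCHER = "bin/hbase";
    static final String MAIN_CLASS = "org.apache.hadoop.hbase.util.RegionSplitter";

    // Builds a "create pre-split table" command.
    static List<String> createCommand(String table, String algo, int regions, String... families) {
        List<String> cmd = base();
        cmd.add("-c");
        cmd.add(Integer.toString(regions));
        if (families.length > 0) {
            cmd.add("-f");
            cmd.add(String.join(":", families)); // families are colon-separated
        }
        cmd.add(table);
        cmd.add(algo);
        return cmd;
    }

    // Builds a "rolling split" command with a cap on outstanding splits.
    static List<String> rollingCommand(String table, String algo, int outstanding) {
        List<String> cmd = base();
        cmd.add("-r");
        cmd.add("-o");
        cmd.add(Integer.toString(outstanding));
        cmd.add(table);
        cmd.add(algo);
        return cmd;
    }

    private static List<String> base() {
        List<String> cmd = new ArrayList<>();
        cmd.add(LAUNCHER);
        cmd.add(MAIN_CLASS);
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", createCommand("myTable", "HexStringSplit", 60, "test", "rs")));
        System.out.println(String.join(" ", rollingCommand("myTable", "UniformSplit", 2)));
    }
}
```

The resulting list could be handed to `java.lang.ProcessBuilder` with the working directory set to the HBase install directory.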
Parameters:

args - Usage: RegionSplitter <TABLE> <SPLITALGORITHM> <-c <# regions> -f <family:family:...> | -r [-o <# outstanding splits>]> [-D <conf.param=value>]

Throws:

IOException - HBase IO problem

InterruptedException - user requested exit

org.apache.hbase.thirdparty.org.apache.commons.cli.ParseException - problem parsing user input

createPresplitTable

static void createPresplitTable(TableName tableName,
                                RegionSplitter.SplitAlgorithm splitAlgo,
                                String[] columnFamilies,
                                org.apache.hadoop.conf.Configuration conf)
                         throws IOException,
                                InterruptedException

Throws:

IOException

InterruptedException

getRegionServerCount
```
private static int getRegionServerCount(Connection connection)
                                 throws IOException
```
Alternative to getCurrentNrHRS, which is no longer available.

Parameters:

connection -

Returns:

Rough count of regionservers out on cluster.

Throws:

IOException - if a remote or network exception occurs

readFile

private static byte[] readFile(org.apache.hadoop.fs.FileSystem fs,
                               org.apache.hadoop.fs.Path path)
                        throws IOException

Throws:

IOException

rollingSplit

static void rollingSplit(TableName tableName,
                         RegionSplitter.SplitAlgorithm splitAlgo,
                         org.apache.hadoop.conf.Configuration conf)
                  throws IOException,
                         InterruptedException

Throws:

IOException

InterruptedException
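
The "outstanding splits" limit described for the -o option can be pictured with a small simulation. This is a sketch of the general throttling idea only, not the rollingSplit implementation: it assumes splits are started in order and that the oldest outstanding split is always the next to finish, whereas a real cluster completes splits asynchronously. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation of a throttled rolling split; not HBase code.
class RollingSplitSketch {
    // Returns the ordered events for splitting each region while keeping
    // the number of unfinished splits below maxOutstanding between starts.
    static List<String> simulate(List<String> regions, int maxOutstanding) {
        if (maxOutstanding < 1) {
            throw new IllegalArgumentException("maxOutstanding must be >= 1");
        }
        List<String> events = new ArrayList<>();
        Deque<String> outstanding = new ArrayDeque<>();
        for (String region : regions) {
            outstanding.addLast(region);
            events.add("start " + region);
            // Once the cap is reached, wait for the oldest split before starting more.
            while (outstanding.size() >= maxOutstanding) {
                events.add("finish " + outstanding.removeFirst());
            }
        }
        // Drain whatever is still in flight at the end.
        while (!outstanding.isEmpty()) {
            events.add("finish " + outstanding.removeFirst());
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : simulate(List.of("r1", "r2", "r3"), 2)) {
            System.out.println(event);
        }
    }
}
```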

newSplitAlgoInstance

public static RegionSplitter.SplitAlgorithm newSplitAlgoInstance(org.apache.hadoop.conf.Configuration conf,
                                                                 String splitClassName)
                                                          throws IOException

Throws:

IOException - if the specified SplitAlgorithm class couldn't be instantiated
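
The contract above (a class name in, an instance out, IOException on failure) follows a common reflective loading pattern. The sketch below shows that pattern with plain JDK reflection; it is an assumption about the general approach, not the HBase source, and it does not model any handling of the built-in short names. The class and method names are hypothetical.

```java
import java.io.IOException;

// Hypothetical reflective loader; not HBase code.
class SplitAlgoLoaderSketch {
    // Instantiates className via its no-arg constructor and checks its type.
    static <T> T newInstance(String className, Class<T> type) throws IOException {
        try {
            Class<?> clazz = Class.forName(className);
            return type.cast(clazz.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Surface every reflection failure as an IOException, per the contract.
            throw new IOException("Couldn't instantiate " + className, e);
        }
    }

    // Convenience wrapper that reports success or failure as a boolean.
    static boolean canLoad(String className, Class<?> type) {
        try {
            newInstance(className, type);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Object list = newInstance("java.util.ArrayList", java.util.List.class);
        System.out.println(list.getClass().getName());
        System.out.println(canLoad("no.such.SplitAlgorithm", Object.class));
    }
}
```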

splitScan

static LinkedList<Pair<byte[],byte[]>> splitScan(LinkedList<Pair<byte[],byte[]>> regionList,
                                                 Connection connection,
                                                 TableName tableName,
                                                 RegionSplitter.SplitAlgorithm splitAlgo)
                                          throws IOException,
                                                 InterruptedException

Throws:

IOException

InterruptedException

getTableDirAndSplitFile

private static Pair<org.apache.hadoop.fs.Path,org.apache.hadoop.fs.Path> getTableDirAndSplitFile(org.apache.hadoop.conf.Configuration conf,
                                                                                                 TableName tableName)
                                                                                          throws IOException

Parameters:

conf -

tableName -

Returns:

A Pair where first item is table dir and second is the split file.

Throws:

IOException - if a remote or network exception occurs

getSplits

static LinkedList<Pair<byte[],byte[]>> getSplits(Connection connection,
                                                 TableName tableName,
                                                 RegionSplitter.SplitAlgorithm splitAlgo)
                                          throws IOException

Throws:

IOException
