TableInputFormatBase (Apache HBase 1.4.11 API)

java.lang.Object
- org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
- - org.apache.hadoop.hbase.mapreduce.TableInputFormatBase

Direct Known Subclasses:: TableInputFormat

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class TableInputFormatBase
extends org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>

A base for TableInputFormats. Receives a Connection, a TableName, an Scan instance that defines the input columns etc. Subclasses may use other TableRecordReader implementations. Subclasses MUST ensure initializeTable(Connection, TableName) is called for an instance to function properly. Each of the entry points to this class used by the MapReduce framework, createRecordReader(InputSplit, TaskAttemptContext) and getSplits(JobContext), will call initialize(JobContext) as a convenient centralized location to handle retrieving the necessary configuration information. If your subclass overrides either of these methods, either call the parent version or call initialize yourself.

An example of a subclass:

   class ExampleTIF extends TableInputFormatBase {

     @Override
     protected void initialize(JobContext context) throws IOException {
       // We are responsible for the lifecycle of this connection until we hand it over in
       // initializeTable.
       Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create(
              job.getConfiguration()));
       TableName tableName = TableName.valueOf("exampleTable");
       // mandatory. once passed here, TableInputFormatBase will handle closing the connection.
       initializeTable(connection, tableName);
       byte[][] inputColumns = new byte [][] { Bytes.toBytes("columnA"),
         Bytes.toBytes("columnB") };
       // optional, by default we'll get everything for the table.
       Scan scan = new Scan();
       for (byte[] family : inputColumns) {
         scan.addFamily(family);
       }
       Filter exampleFilter = new RowFilter(CompareOp.EQUAL, new RegexStringComparator("aa.*"));
       scan.setFilter(exampleFilter);
       setScan(scan);
     }
   }

 The number of InputSplits(mappers) match the number of regions in a table by default.
 Set "hbase.mapreduce.input.mappers.per.region" to specify how many mappers per region, set
 this property will disable autobalance below.

 Set "hbase.mapreduce.input.autobalance" to enable autobalance, hbase will assign mappers based on
 average region size; For regions, whose size larger than average region size may assigned more mappers,
 and for continuous small one, they may group together to use one mapper. If actual calculated average
 region size is too big, it is not good to only assign 1 mapper for those large regions. Then use
 "hbase.mapreduce.input.average.regionsize" to set max average region size when enable "autobalanece",
 default was average region size is 8G.

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`INPUT_AUTOBALANCE_MAXSKEWRATIO` Deprecated.
`static String`	`MAPREDUCE_INPUT_AUTOBALANCE` Specify if we enable auto-balance to set number of mappers in M/R jobs.
`static String`	`MAX_AVERAGE_REGION_SIZE` In auto-balance, we split input by ave region size, if calculated region size is too big, we can set it.
`static String`	`NUM_MAPPERS_PER_REGION` Set the number of Mappers for each region, all regions have same number of Mappers
`static String`	`TABLE_ROW_TEXTKEY` Deprecated.

Constructor Summary

Constructors
Constructor and Description

TableInputFormatBase()

Constructors
Constructor and Description
`TableInputFormatBase()`

Method Summary

Methods
Modifier and Type	Method and Description
`List<org.apache.hadoop.mapreduce.InputSplit>`	`calculateAutoBalancedSplits(List<org.apache.hadoop.mapreduce.InputSplit> splits, long maxAverageRegionSize)` Calculates the number of MapReduce input splits for the map tasks.
`List<org.apache.hadoop.mapreduce.InputSplit>`	`calculateRebalancedSplits(List<org.apache.hadoop.mapreduce.InputSplit> list, org.apache.hadoop.mapreduce.JobContext context, long average)` Deprecated.
`protected void`	`closeTable()` Close the Table and related objects that were initialized via `initializeTable(Connection, TableName)`.
`protected List<org.apache.hadoop.mapreduce.InputSplit>`	`createNInputSplitsUniform(org.apache.hadoop.mapreduce.InputSplit split, int n)` Create n splits for one InputSplit, For now only support uniform distribution
`org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result>`	`createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)` Builds a `TableRecordReader`.
`protected Admin`	`getAdmin()` Allows subclasses to get the `Admin`.
`protected org.apache.hadoop.hbase.client.HTable`	`getHTable()` Deprecated. use `getTable()`
`protected RegionLocator`	`getRegionLocator()` Allows subclasses to get the `RegionLocator`.
`Scan`	`getScan()` Gets the scan defining the actual details like columns etc.
`static byte[]`	`getSplitKey(byte[] start, byte[] end, boolean isText)` Deprecated.
`List<org.apache.hadoop.mapreduce.InputSplit>`	`getSplits(org.apache.hadoop.mapreduce.JobContext context)` Calculates the splits that will serve as input for the map tasks.
`protected Pair<byte[][],byte[][]>`	`getStartEndKeys()`
`protected Table`	`getTable()` Allows subclasses to get the `Table`.
`protected boolean`	`includeRegionInSplit(byte[] startKey, byte[] endKey)` Test if the given region is to be included in the InputSplit while splitting the regions of a table.
`protected void`	`initialize(org.apache.hadoop.mapreduce.JobContext context)` Handle subclass specific set up.
`protected void`	`initializeTable(Connection connection, TableName tableName)` Allows subclasses to initialize the table information.
`String`	`reverseDNS(InetAddress ipAddress)` Deprecated. mistakenly made public in 0.98.7. scope will change to package-private
`protected void`	`setHTable(org.apache.hadoop.hbase.client.HTable table)` Deprecated. Use `initializeTable(Connection, TableName)` instead.
`void`	`setScan(Scan scan)` Sets the scan defining the actual details like columns etc.
`protected void`	`setTableRecordReader(TableRecordReader tableRecordReader)` Allows subclasses to set the `TableRecordReader`.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - INPUT_AUTOBALANCE_MAXSKEWRATIO
```
@Deprecated
public static final String INPUT_AUTOBALANCE_MAXSKEWRATIO
```
    Deprecated.
    
    See Also:
    Constant Field Values
  - TABLE_ROW_TEXTKEY
```
@Deprecated
public static final String TABLE_ROW_TEXTKEY
```
    Deprecated.
    
    See Also:
    Constant Field Values
  - MAPREDUCE_INPUT_AUTOBALANCE
```
public static final String MAPREDUCE_INPUT_AUTOBALANCE
```
    Specify if we enable auto-balance to set number of mappers in M/R jobs.
    
    See Also:
    Constant Field Values
  - MAX_AVERAGE_REGION_SIZE
```
public static final String MAX_AVERAGE_REGION_SIZE
```
    In auto-balance, we split input by ave region size, if calculated region size is too big, we can set it.
    
    See Also:
    Constant Field Values
  - NUM_MAPPERS_PER_REGION
```
public static final String NUM_MAPPERS_PER_REGION
```
    Set the number of Mappers for each region, all regions have same number of Mappers
    
    See Also:
    Constant Field Values
- Constructor Detail
  - TableInputFormatBase
```
public TableInputFormatBase()
```
- Method Detail
  - createRecordReader
```
public org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                         org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                           throws IOException
```
    Builds a TableRecordReader. If no TableRecordReader was provided, uses the default.
    
    Specified by:
    
    createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
    
    Parameters:
    split - The split to work with.
    context - The current context.
    
    Returns:
    The newly created record reader.
    
    Throws:
    
    IOException - When creating the reader fails.
    See Also:
    InputFormat.createRecordReader( org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext)
  - getStartEndKeys
```
protected Pair<byte[][],byte[][]> getStartEndKeys()
                                           throws IOException
```
    Throws:
    
    IOException
  - getSplits
```
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                       throws IOException
```
    Calculates the splits that will serve as input for the map tasks. The number of splits matches the number of regions in a table.
    
    Specified by:
    
    getSplits in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
    
    Parameters:
    context - The current job context.
    
    Returns:
    The list of input splits.
    
    Throws:
    
    IOException - When creating the list of splits fails.
    See Also:
    InputFormat.getSplits( org.apache.hadoop.mapreduce.JobContext)
  - createNInputSplitsUniform
```
protected List<org.apache.hadoop.mapreduce.InputSplit> createNInputSplitsUniform(org.apache.hadoop.mapreduce.InputSplit split,
                                                                     int n)
                                                                          throws org.apache.hadoop.hbase.exceptions.IllegalArgumentIOException
```
    Create n splits for one InputSplit, For now only support uniform distribution
    
    Parameters:
    split - A TableSplit corresponding to a range of rowkeys
    n - Number of ranges after splitting. Pass 1 means no split for the range Pass 2 if you want to split the range in two;
    
    Returns:
    A list of TableSplit, the size of the list is n
    
    Throws:
    
    org.apache.hadoop.hbase.exceptions.IllegalArgumentIOException
  - calculateRebalancedSplits
```
@Deprecated
public List<org.apache.hadoop.mapreduce.InputSplit> calculateRebalancedSplits(List<org.apache.hadoop.mapreduce.InputSplit> list,
                                                                                org.apache.hadoop.mapreduce.JobContext context,
                                                                                long average)
                                                                       throws IOException
```
    Deprecated.
    
    Calculates the number of MapReduce input splits for the map tasks. The number of MapReduce input splits depends on the average region size. Make it 'public' for testing
    Deprecated. Former functionality has been replaced by calculateAutoBalancedSplits and will function differently. Do not use.
    
    Parameters:
    list - The list of input splits before balance.
    context - The current job context.
    average - The average size of all regions .
    
    Returns:
    The list of input splits.
    
    Throws:
    
    IOException
  - calculateAutoBalancedSplits
```
public List<org.apache.hadoop.mapreduce.InputSplit> calculateAutoBalancedSplits(List<org.apache.hadoop.mapreduce.InputSplit> splits,
                                                                       long maxAverageRegionSize)
                                                                         throws IOException
```
    Calculates the number of MapReduce input splits for the map tasks. The number of MapReduce input splits depends on the average region size. Make it 'public' for testing
    
    Parameters:
    splits - The list of input splits before balance.
    maxAverageRegionSize - max Average region size for one mapper
    
    Returns:
    The list of input splits.
    
    Throws:
    
    IOException - When creating the list of splits fails.
    See Also:
    InputFormat.getSplits( org.apache.hadoop.mapreduce.JobContext)
  - getSplitKey
```
@Deprecated
public static byte[] getSplitKey(byte[] start,
                            byte[] end,
                            boolean isText)
```
    Deprecated.
    
    Deprecated. Do not use.
    
    Parameters:
    start - Start key of the region
    end - End key of the region
    isText - It determines to use text key mode or binary key mode
    
    Returns:
    The split point in the region.
  - reverseDNS
```
@Deprecated
public String reverseDNS(InetAddress ipAddress)
                  throws NamingException,
                         UnknownHostException
```
    Deprecated. mistakenly made public in 0.98.7. scope will change to package-private
    
    Throws:
    
    NamingException
    
    UnknownHostException
  - includeRegionInSplit
```
protected boolean includeRegionInSplit(byte[] startKey,
                           byte[] endKey)
```
    Test if the given region is to be included in the InputSplit while splitting the regions of a table.
    This optimization is effective when there is a specific reasoning to exclude an entire region from the M-R job, (and hence, not contributing to the InputSplit), given the start and end keys of the same.
    Useful when we need to remember the last-processed top record and revisit the [last, current) interval for M-R processing, continuously. In addition to reducing InputSplits, reduces the load on the region server as well, due to the ordering of the keys.
    
    Note: It is possible that endKey.length() == 0 , for the last (recent) region.
    Override this method, if you want to bulk exclude regions altogether from M-R. By default, no region is excluded( i.e. all regions are included).
    
    Parameters:
    startKey - Start key of the region
    endKey - End key of the region
    
    Returns:
    true, if this region needs to be included as part of the input (default).
  - getHTable
```
@Deprecated
protected org.apache.hadoop.hbase.client.HTable getHTable()
```
    Deprecated. use getTable()
    
    Allows subclasses to get the HTable.
  - getRegionLocator
```
protected RegionLocator getRegionLocator()
```
    Allows subclasses to get the RegionLocator.
  - getTable
```
protected Table getTable()
```
    Allows subclasses to get the Table.
  - getAdmin
```
protected Admin getAdmin()
```
    Allows subclasses to get the Admin.
  - setHTable
```
@Deprecated
protected void setHTable(org.apache.hadoop.hbase.client.HTable table)
                  throws IOException
```
    Deprecated. Use initializeTable(Connection, TableName) instead.
    
    Allows subclasses to set the HTable. Will attempt to reuse the underlying Connection for our own needs, including retreiving an Admin interface to the HBase cluster.
    
    Parameters:
    table - The table to get the data from.
    
    Throws:
    
    IOException
  - initializeTable
```
protected void initializeTable(Connection connection,
                   TableName tableName)
                        throws IOException
```
    Allows subclasses to initialize the table information.
    
    Parameters:
    connection - The Connection to the HBase cluster. MUST be unmanaged. We will close.
    tableName - The TableName of the table to process.
    
    Throws:
    
    IOException
  - getScan
```
public Scan getScan()
```
    Gets the scan defining the actual details like columns etc.
    
    Returns:
    The internal scan instance.
  - setScan
```
public void setScan(Scan scan)
```
    Sets the scan defining the actual details like columns etc.
    
    Parameters:
    scan - The scan to set.
  - setTableRecordReader
```
protected void setTableRecordReader(TableRecordReader tableRecordReader)
```
    Allows subclasses to set the TableRecordReader.
    
    Parameters:
    tableRecordReader - A different TableRecordReader implementation.
  - initialize
```
protected void initialize(org.apache.hadoop.mapreduce.JobContext context)
                   throws IOException
```
    Handle subclass specific set up. Each of the entry points used by the MapReduce framework, createRecordReader(InputSplit, TaskAttemptContext) and getSplits(JobContext), will call initialize(JobContext) as a convenient centralized location to handle retrieving the necessary configuration information and calling initializeTable(Connection, TableName). Subclasses should implement their initialize call such that it is safe to call multiple times. The current TableInputFormatBase implementation relies on a non-null table reference to decide if an initialize call is needed, but this behavior may change in the future. In particular, it is critical that initializeTable not be called multiple times since this will leak Connection instances.
    
    Throws:
    
    IOException
  - closeTable
```
protected void closeTable()
                   throws IOException
```
    Close the Table and related objects that were initialized via initializeTable(Connection, TableName).
    
    Throws:
    
    IOException

Class TableInputFormatBase

Field Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Field Detail

INPUT_AUTOBALANCE_MAXSKEWRATIO

TABLE_ROW_TEXTKEY

MAPREDUCE_INPUT_AUTOBALANCE

MAX_AVERAGE_REGION_SIZE

NUM_MAPPERS_PER_REGION

Constructor Detail

TableInputFormatBase

Method Detail

createRecordReader

getStartEndKeys

getSplits

createNInputSplitsUniform

calculateRebalancedSplits

calculateAutoBalancedSplits

getSplitKey

reverseDNS

includeRegionInSplit

getHTable

getRegionLocator

getTable

getAdmin

setHTable

initializeTable

getScan

setScan

setTableRecordReader

initialize

closeTable