TableInputFormatBase (Apache HBase 1.2.12 API)

java.lang.Object
- org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
- - org.apache.hadoop.hbase.mapreduce.TableInputFormatBase

Direct Known Subclasses: TableInputFormat

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class TableInputFormatBase
extends org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>

A base for TableInputFormats. Receives a Connection, a TableName, and a Scan instance that defines the input columns etc. Subclasses may use other TableRecordReader implementations. Subclasses MUST ensure initializeTable(Connection, TableName) is called for an instance to function properly. Each of the entry points to this class used by the MapReduce framework, createRecordReader(InputSplit, TaskAttemptContext) and getSplits(JobContext), will call initialize(JobContext) as a convenient centralized location to handle retrieving the necessary configuration information. If your subclass overrides either of these methods, either call the parent version or call initialize yourself.

An example of a subclass:

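   import java.io.IOException;

   import org.apache.hadoop.hbase.HBaseConfiguration;
   import org.apache.hadoop.hbase.TableName;
   import org.apache.hadoop.hbase.client.Connection;
   import org.apache.hadoop.hbase.client.ConnectionFactory;
   import org.apache.hadoop.hbase.client.Scan;
   import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
   import org.apache.hadoop.hbase.filter.Filter;
   import org.apache.hadoop.hbase.filter.RegexStringComparator;
   import org.apache.hadoop.hbase.filter.RowFilter;
   import org.apache.hadoop.hbase.mapreduce.TableInputFormatBase;
   import org.apache.hadoop.hbase.util.Bytes;
   import org.apache.hadoop.mapreduce.JobContext;

   class ExampleTIF extends TableInputFormatBase {

     @Override
     protected void initialize(JobContext context) throws IOException {
       // We are responsible for the lifecycle of this connection until we hand it over in
       // initializeTable.
       Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create(
              context.getConfiguration()));
       TableName tableName = TableName.valueOf("exampleTable");
       // mandatory. once passed here, TableInputFormatBase will handle closing the connection.
       initializeTable(connection, tableName);
       byte[][] inputColumns = new byte [][] { Bytes.toBytes("columnA"),
         Bytes.toBytes("columnB") };
       // optional, by default we'll get everything for the table.
       Scan scan = new Scan();
       for (byte[] family : inputColumns) {
         scan.addFamily(family);
       }
       // Keep only rows whose key matches the regular expression "aa.*".
       Filter exampleFilter = new RowFilter(CompareOp.EQUAL, new RegexStringComparator("aa.*"));
       scan.setFilter(exampleFilter);
       setScan(scan);
     }
   }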

Field Summary

Fields
Modifier and Type	Field and Description
`private Admin`	`admin` The `Admin`.
`private Connection`	`connection` The underlying `Connection` of the table.
`private static String`	`INITIALIZATION_ERROR`
`static String`	`INPUT_AUTOBALANCE_MAXSKEWRATIO` Specifies the ratio for data skew in M/R jobs; it is used together with the hbase.mapreduce.input.autobalance property, which enables auto-balancing.
`private static org.apache.commons.logging.Log`	`LOG`
`static String`	`MAPREDUCE_INPUT_AUTOBALANCE` Specifies whether auto-balance is enabled for input in M/R jobs.
`private static String`	`NOT_INITIALIZED`
`private RegionLocator`	`regionLocator` The `RegionLocator` of the table.
`private HashMap<InetAddress,String>`	`reverseDNSCacheMap` The reverse DNS lookup cache mapping: IPAddress => HostName
`private Scan`	`scan` Holds the details for the internal scanner.
`private Table`	`table` The `Table` to scan.
`static String`	`TABLE_ROW_TEXTKEY` Specifies whether the row key in the table is text (ASCII between 32 and 126); the default is true.
`private TableRecordReader`	`tableRecordReader` The reader scanning the table, can be a custom one.

Constructor Summary

Constructors
Constructor and Description
`TableInputFormatBase()`

Method Summary

Methods
Modifier and Type	Method and Description
`List<org.apache.hadoop.mapreduce.InputSplit>`	`calculateRebalancedSplits(List<org.apache.hadoop.mapreduce.InputSplit> list, org.apache.hadoop.mapreduce.JobContext context, long average)` Calculates the number of MapReduce input splits for the map tasks.
`private void`	`close(Closeable... closables)`
`protected void`	`closeTable()` Close the Table and related objects that were initialized via `initializeTable(Connection, TableName)`.
`org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result>`	`createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)` Builds a `TableRecordReader`.
`protected Admin`	`getAdmin()` Allows subclasses to get the `Admin`.
`protected HTable`	`getHTable()` Deprecated. Use `getTable()` instead.
`protected RegionLocator`	`getRegionLocator()` Allows subclasses to get the `RegionLocator`.
`Scan`	`getScan()` Gets the scan defining the actual details like columns etc.
`static byte[]`	`getSplitKey(byte[] start, byte[] end, boolean isText)` Selects a split point in the region.
`List<org.apache.hadoop.mapreduce.InputSplit>`	`getSplits(org.apache.hadoop.mapreduce.JobContext context)` Calculates the splits that will serve as input for the map tasks.
`protected Pair<byte[][],byte[][]>`	`getStartEndKeys()`
`protected Table`	`getTable()` Allows subclasses to get the `Table`.
`protected boolean`	`includeRegionInSplit(byte[] startKey, byte[] endKey)` Test if the given region is to be included in the InputSplit while splitting the regions of a table.
`protected void`	`initialize(org.apache.hadoop.mapreduce.JobContext context)` Handle subclass specific set up.
`protected void`	`initializeTable(Connection connection, TableName tableName)` Allows subclasses to initialize the table information.
`String`	`reverseDNS(InetAddress ipAddress)` Deprecated. Mistakenly made public in 0.98.7; the scope will change to package-private.
`protected void`	`setHTable(HTable table)` Deprecated. Use `initializeTable(Connection, TableName)` instead.
`void`	`setScan(Scan scan)` Sets the scan defining the actual details like columns etc.
`protected void`	`setTableRecordReader(TableRecordReader tableRecordReader)` Allows subclasses to set the `TableRecordReader`.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - MAPREDUCE_INPUT_AUTOBALANCE
```
public static final String MAPREDUCE_INPUT_AUTOBALANCE
```
    Specifies whether auto-balance is enabled for input in M/R jobs.
    
    See Also:
    Constant Field Values
  - INPUT_AUTOBALANCE_MAXSKEWRATIO
```
public static final String INPUT_AUTOBALANCE_MAXSKEWRATIO
```
    Specifies the ratio for data skew in M/R jobs; it is used together with the hbase.mapreduce.input.autobalance property, which enables auto-balancing.
    
    See Also:
    Constant Field Values
  - TABLE_ROW_TEXTKEY
```
public static final String TABLE_ROW_TEXTKEY
```
    Specifies whether the row key in the table is text (ASCII between 32 and 126); the default is true. False means the table uses binary row keys.
    
    See Also:
    Constant Field Values
  - LOG
```
private static final org.apache.commons.logging.Log LOG
```
  - NOT_INITIALIZED
```
private static final String NOT_INITIALIZED
```
    See Also:
    Constant Field Values
  - INITIALIZATION_ERROR
```
private static final String INITIALIZATION_ERROR
```
    See Also:
    Constant Field Values
  - scan
```
private Scan scan
```
    Holds the details for the internal scanner.
    
    See Also:
    Scan
  - admin
```
private Admin admin
```
    The Admin.
  - table
```
private Table table
```
    The Table to scan.
  - regionLocator
```
private RegionLocator regionLocator
```
    The RegionLocator of the table.
  - tableRecordReader
```
private TableRecordReader tableRecordReader
```
    The reader scanning the table, can be a custom one.
  - connection
```
private Connection connection
```
    The underlying Connection of the table.
  - reverseDNSCacheMap
```
private HashMap<InetAddress,String> reverseDNSCacheMap
```
    The reverse DNS lookup cache mapping: IPAddress => HostName
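
    The cache described above maps an IP address to a host name so that repeated lookups for the same address can be skipped. The following is a minimal sketch of that idea, not the class's actual code: the class name `ReverseDnsCacheSketch`, the injected `resolver` function, and the call counter are all hypothetical and exist only to make the behavior observable.

```java
import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of an IPAddress => HostName cache.
final class ReverseDnsCacheSketch {
  private final Map<InetAddress, String> cache = new HashMap<>();
  // Assumed stand-in for the real reverse DNS lookup.
  private final Function<InetAddress, String> resolver;
  private int resolverCalls = 0;

  ReverseDnsCacheSketch(Function<InetAddress, String> resolver) {
    this.resolver = resolver;
  }

  // Returns the cached host name, resolving it only on the first request.
  String hostNameFor(InetAddress ipAddress) {
    String hostName = cache.get(ipAddress);
    if (hostName == null) {
      resolverCalls++;
      hostName = resolver.apply(ipAddress);
      cache.put(ipAddress, hostName);
    }
    return hostName;
  }

  // Number of times the resolver was actually invoked.
  int resolverCalls() {
    return resolverCalls;
  }
}
```

    In this sketch, a second request for the same address is answered from the map without invoking the resolver again.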
- Constructor Detail
  - TableInputFormatBase
```
public TableInputFormatBase()
```
- Method Detail
  - createRecordReader
```
public org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                         org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                           throws IOException
```
    Builds a TableRecordReader. If no TableRecordReader was provided, uses the default.
    
    Specified by:
    
    createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
    
    Parameters:
    split - The split to work with.
    context - The current context.
    
    Returns:
    The newly created record reader.
    
    Throws:
    
    IOException - When creating the reader fails.
    See Also:
    InputFormat.createRecordReader( org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext)
  - getStartEndKeys
```
protected Pair<byte[][],byte[][]> getStartEndKeys()
                                           throws IOException
```
    Throws:
    
    IOException
  - getSplits
```
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                       throws IOException
```
    Calculates the splits that will serve as input for the map tasks. The number of splits matches the number of regions in a table.
    
    Specified by:
    
    getSplits in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
    
    Parameters:
    context - The current job context.
    
    Returns:
    The list of input splits.
    
    Throws:
    
    IOException - When creating the list of splits fails.
    See Also:
    InputFormat.getSplits( org.apache.hadoop.mapreduce.JobContext)
  - reverseDNS
```
@Deprecated
public String reverseDNS(InetAddress ipAddress)
                  throws NamingException,
                         UnknownHostException
```
    Deprecated. Mistakenly made public in 0.98.7; the scope will change to package-private.
    
    Throws:
    
    NamingException
    
    UnknownHostException
  - calculateRebalancedSplits
```
public List<org.apache.hadoop.mapreduce.InputSplit> calculateRebalancedSplits(List<org.apache.hadoop.mapreduce.InputSplit> list,
                                                                     org.apache.hadoop.mapreduce.JobContext context,
                                                                     long average)
                                                                       throws IOException
```
    Calculates the number of MapReduce input splits for the map tasks. The number of MapReduce input splits depends on the average region size and the "data skew ratio" that the user sets in the configuration.
    
    Parameters:
    list - The list of input splits before balance.
    context - The current job context.
    average - The average size of all regions.
    
    Returns:
    The list of input splits.
    
    Throws:
    
    IOException - When creating the list of splits fails.
    See Also:
    InputFormat.getSplits( org.apache.hadoop.mapreduce.JobContext)
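
    This page does not spell out the rebalancing rules, so the following is only one way such a policy could work. It assumes that a region whose size reaches `maxSkewRatio * average` is cut into two halves, and that smaller neighboring regions are merged until they reach that same threshold. Those rules, the class name `RebalanceSketch`, and the use of plain sizes in place of InputSplit objects are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of skew-based rebalancing over region sizes.
final class RebalanceSketch {
  // Returns the sizes of the rebalanced splits, in region order.
  static List<Long> rebalance(List<Long> regionSizes, long average, double maxSkewRatio) {
    double threshold = maxSkewRatio * average;
    List<Long> splits = new ArrayList<>();
    long pending = 0; // accumulated size of small regions not yet emitted
    for (long size : regionSizes) {
      if (size >= threshold) {
        // Emit any merged small regions first to preserve key order.
        if (pending > 0) {
          splits.add(pending);
          pending = 0;
        }
        // Assumed rule: an oversized region becomes two halves.
        splits.add(size / 2);
        splits.add(size - size / 2);
      } else {
        // Assumed rule: merge small regions until the threshold is reached.
        pending += size;
        if (pending >= threshold) {
          splits.add(pending);
          pending = 0;
        }
      }
    }
    if (pending > 0) {
      splits.add(pending);
    }
    return splits;
  }
}
```

    Under these assumptions, five regions of sizes 10, 10, 100, 10, 10 with an average of 28 and a ratio of 1.5 yield four splits of sizes 20, 50, 50, 20.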
  - getSplitKey
```
public static byte[] getSplitKey(byte[] start,
                 byte[] end,
                 boolean isText)
```
    Selects a split point in the region. The selection of the split point is based on a uniform distribution assumption for the keys in a region. Here are some examples:
    startKey: aaabcdefg, endKey: aaafff, split point: aaad
    startKey: 111000, endKey: 1125790, split point: 111b
    startKey: 1110, endKey: 1120, split point: 111_
    startKey: binary key { 13, -19, 126, 127 }, endKey: binary key { 13, -19, 127, 0 }, split point: binary key { 13, -19, 127, -64 }
    This method is public static to make it easier to test.
    
    Parameters:
    start - Start key of the region
    end - End key of the region
    isText - It determines to use text key mode or binary key mode
    
    Returns:
    The split point in the region.
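
    To make the uniform distribution assumption concrete, here is a simplified sketch that covers only the easiest case: text keys whose first differing bytes are at least two apart. It keeps the common prefix and takes the midpoint of the first differing byte, which reproduces the aaabcdefg / aaafff example. The class `SplitKeySketch` and method `midpointKey` are hypothetical; the real method also handles adjacent bytes, empty keys, and binary mode, which this sketch deliberately rejects.

```java
import java.util.Arrays;

// Hypothetical, simplified illustration of picking a midpoint split key.
final class SplitKeySketch {
  static byte[] midpointKey(byte[] start, byte[] end) {
    int limit = Math.min(start.length, end.length);
    for (int i = 0; i < limit; i++) {
      if (start[i] == end[i]) {
        continue; // still inside the common prefix
      }
      if (end[i] - start[i] < 2) {
        // Adjacent or reversed bytes need the next byte too; not covered here.
        throw new IllegalArgumentException("case not covered by this sketch");
      }
      // Keep the common prefix and replace the differing byte with the midpoint.
      byte[] result = Arrays.copyOf(start, i + 1);
      result[i] = (byte) ((start[i] + end[i]) / 2);
      return result;
    }
    // One key is a prefix of the other (or they are equal); not covered here.
    throw new IllegalArgumentException("case not covered by this sketch");
  }
}
```

    For aaabcdefg and aaafff the common prefix is aaa, the first differing bytes are b (98) and f (102), and their midpoint is d (100).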
  - includeRegionInSplit
```
protected boolean includeRegionInSplit(byte[] startKey,
                           byte[] endKey)
```
    Test if the given region is to be included in the InputSplit while splitting the regions of a table.
    This optimization is effective when there is a specific reason to exclude an entire region from the M-R job (so that it does not contribute to the InputSplit), given its start and end keys.
    Useful when we need to remember the last-processed top record and revisit the [last, current) interval for M-R processing, continuously. In addition to reducing InputSplits, this also reduces the load on the region server, due to the ordering of the keys.
    
    Note: It is possible that endKey.length == 0, for the last (recent) region.
    Override this method if you want to bulk exclude regions altogether from M-R. By default, no region is excluded (i.e. all regions are included).
    
    Parameters:
    startKey - Start key of the region
    endKey - End key of the region
    
    Returns:
    true, if this region needs to be included as part of the input (default).
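
    As an illustration of the "last-processed record" use case, the sketch below decides inclusion from a remembered row key. It is written against plain byte arrays so it runs without HBase; in a real subclass the same logic would sit in an override of includeRegionInSplit. The class name `IncludeRegionSketch`, the `lastProcessed` checkpoint, and the helper `compareUnsigned` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of excluding regions that were fully processed.
final class IncludeRegionSketch {
  // Assumed checkpoint: every row up to and including this key is already done.
  private final byte[] lastProcessed;

  IncludeRegionSketch(String lastProcessedRow) {
    this.lastProcessed = lastProcessedRow.getBytes(StandardCharsets.UTF_8);
  }

  // Same parameters as includeRegionInSplit(byte[] startKey, byte[] endKey).
  boolean includeRegionInSplit(byte[] startKey, byte[] endKey) {
    if (endKey.length == 0) {
      return true; // the last region is open-ended, so it may hold newer rows
    }
    // A region holds rows in [startKey, endKey). If endKey <= lastProcessed,
    // every row in it sorts before the checkpoint and can be skipped.
    return compareUnsigned(endKey, lastProcessed) > 0;
  }

  // Lexicographic comparison treating bytes as unsigned values.
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }
}
```

    With a checkpoint of "m", a region [a, g) is skipped, while [g, t) and the open-ended last region are kept.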
  - getHTable
```
@Deprecated
protected HTable getHTable()
```
    Deprecated. Use getTable() instead.
    
    Allows subclasses to get the HTable.
  - getRegionLocator
```
protected RegionLocator getRegionLocator()
```
    Allows subclasses to get the RegionLocator.
  - getTable
```
protected Table getTable()
```
    Allows subclasses to get the Table.
  - getAdmin
```
protected Admin getAdmin()
```
    Allows subclasses to get the Admin.
  - setHTable
```
@Deprecated
protected void setHTable(HTable table)
                  throws IOException
```
    Deprecated. Use initializeTable(Connection, TableName) instead.
    
    Allows subclasses to set the HTable. Will attempt to reuse the underlying Connection for our own needs, including retrieving an Admin interface to the HBase cluster.
    
    Parameters:
    table - The table to get the data from.
    
    Throws:
    
    IOException
  - initializeTable
```
protected void initializeTable(Connection connection,
                   TableName tableName)
                        throws IOException
```
    Allows subclasses to initialize the table information.
    
    Parameters:
    connection - The Connection to the HBase cluster. MUST be unmanaged. We will close it.
    tableName - The TableName of the table to process.
    
    Throws:
    
    IOException
  - getScan
```
public Scan getScan()
```
    Gets the scan defining the actual details like columns etc.
    
    Returns:
    The internal scan instance.
  - setScan
```
public void setScan(Scan scan)
```
    Sets the scan defining the actual details like columns etc.
    
    Parameters:
    scan - The scan to set.
  - setTableRecordReader
```
protected void setTableRecordReader(TableRecordReader tableRecordReader)
```
    Allows subclasses to set the TableRecordReader.
    
    Parameters:
    tableRecordReader - A different TableRecordReader implementation.
  - initialize
```
protected void initialize(org.apache.hadoop.mapreduce.JobContext context)
                   throws IOException
```
    Handle subclass specific set up. Each of the entry points used by the MapReduce framework, createRecordReader(InputSplit, TaskAttemptContext) and getSplits(JobContext), will call initialize(JobContext) as a convenient centralized location to handle retrieving the necessary configuration information and calling initializeTable(Connection, TableName). Subclasses should implement their initialize call such that it is safe to call multiple times. The current TableInputFormatBase implementation relies on a non-null table reference to decide if an initialize call is needed, but this behavior may change in the future. In particular, it is critical that initializeTable not be called multiple times since this will leak Connection instances.
    
    Throws:
    
    IOException
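
    The requirement that initialize be safe to call multiple times can be met with a simple guard. The sketch below uses stand-in types so that it runs without HBase: `FakeConnection` plays the role of a Connection, and the counter shows that a second call opens nothing new. All names here are hypothetical, and a real subclass would rely on its own state rather than on the base class's current null check, which may change.

```java
import java.io.Closeable;

// Hypothetical illustration of an initialize() that tolerates repeated calls.
final class IdempotentInitializeSketch {
  // Stand-in for an HBase Connection.
  static final class FakeConnection implements Closeable {
    @Override
    public void close() {
      // nothing to release in this sketch
    }
  }

  private FakeConnection connection; // null until the first initialize()
  private int connectionsOpened = 0;

  // Safe to call from both getSplits and createRecordReader style entry points.
  void initialize() {
    if (connection != null) {
      return; // already set up; opening another connection would leak one
    }
    connection = new FakeConnection();
    connectionsOpened++;
  }

  // Releases the connection so that a later initialize() starts fresh.
  void closeTable() {
    if (connection != null) {
      connection.close();
      connection = null;
    }
  }

  int connectionsOpened() {
    return connectionsOpened;
  }
}
```

    Calling initialize twice in a row opens a single connection; only after closeTable does a further call open another.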
  - closeTable
```
protected void closeTable()
                   throws IOException
```
    Close the Table and related objects that were initialized via initializeTable(Connection, TableName).
    
    Throws:
    
    IOException
  - close
```
private void close(Closeable... closables)
            throws IOException
```
    Throws:
    
    IOException
