Class TableInputFormat
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase
org.apache.hadoop.hbase.mapreduce.TableInputFormat
- All Implemented Interfaces:
org.apache.hadoop.conf.Configurable
- Direct Known Subclasses:
RoundRobinTableInputFormat
@Public
public class TableInputFormat
extends TableInputFormatBase
implements org.apache.hadoop.conf.Configurable
Convert HBase tabular data into a format that is consumable by Map/Reduce.
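As a quick orientation before the reference material, here is a minimal sketch of wiring this class into a MapReduce job through the configuration constants documented below. The table name "mytable", the column family "cf", and the job name are placeholder assumptions; mapper and output setup are elided.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class ReadFromHBaseJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableInputFormat.INPUT_TABLE, "mytable");    // placeholder table name
    conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "cf");  // placeholder column family
    conf.set(TableInputFormat.SCAN_CACHEDROWS, "500");    // rows fetched per scanner RPC

    Job job = Job.getInstance(conf, "read-from-hbase");
    job.setJarByClass(ReadFromHBaseJob.class);
    // The record reader emits ImmutableBytesWritable (row key) / Result (row) pairs.
    job.setInputFormatClass(TableInputFormat.class);
    // ... configure the mapper and output format here, then submit.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}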
-
Field Summary
Modifier and Type / Field / Description
private org.apache.hadoop.conf.Configuration conf
    The configuration.
static final String INPUT_TABLE
    Job parameter that specifies the input table.
private static final org.slf4j.Logger LOG
static final String SCAN
    Base-64 encoded scanner.
static final String SCAN_BATCHSIZE
    Sets the maximum number of values to return for each call to next().
static final String SCAN_CACHEBLOCKS
    Set to false to disable server-side caching of blocks for this scan.
static final String SCAN_CACHEDROWS
    The number of rows for caching that will be passed to scanners.
static final String SCAN_COLUMN_FAMILY
    Column family to scan.
static final String SCAN_COLUMNS
    Space-delimited list of columns and column families to scan.
static final String SCAN_MAXVERSIONS
    The maximum number of versions to return.
static final String SCAN_ROW_START
    Scan start row.
static final String SCAN_ROW_STOP
    Scan stop row.
static final String SCAN_TIMERANGE_END
    The ending timestamp used to filter columns with a specific range of versions.
static final String SCAN_TIMERANGE_START
    The starting timestamp used to filter columns with a specific range of versions.
static final String SCAN_TIMESTAMP
    The timestamp used to filter columns with a specific timestamp.
static final String SHUFFLE_MAPS
    Specifies whether the map tasks should be shuffled.
private static final String SPLIT_TABLE
    If specified, use start keys of this table to split.
Fields inherited from class org.apache.hadoop.hbase.mapreduce.TableInputFormatBase
MAPREDUCE_INPUT_AUTOBALANCE, MAX_AVERAGE_REGION_SIZE, NUM_MAPPERS_PER_REGION
-
Constructor Summary
TableInputFormat()
-
Method Summary
Modifier and Type / Method / Description
private static void addColumn(Scan scan, byte[] familyAndQualifier)
    Parses a combined family and qualifier and adds either both or just the family in case there is no qualifier.
static void addColumns(Scan scan, byte[][] columns)
    Adds an array of columns specified using the old format, family:qualifier.
private static void addColumns(Scan scan, String columns)
    Convenience method to parse a string representation of an array of column specifiers.
static void configureSplitTable(org.apache.hadoop.mapreduce.Job job, TableName tableName)
    Sets the split table in a map-reduce job.
static Scan createScanFromConfiguration(org.apache.hadoop.conf.Configuration conf)
    Sets up a Scan instance, applying settings from the configuration property constants defined in TableInputFormat.
org.apache.hadoop.conf.Configuration getConf()
    Returns the current configuration.
List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
    Calculates the splits that will serve as input for the map tasks.
protected Pair<byte[][],byte[][]> getStartEndKeys()
protected void initialize(org.apache.hadoop.mapreduce.JobContext context)
    Handle subclass-specific set up.
void setConf(org.apache.hadoop.conf.Configuration configuration)
    Sets the configuration.
Methods inherited from class org.apache.hadoop.hbase.mapreduce.TableInputFormatBase
calculateAutoBalancedSplits, closeTable, createNInputSplitsUniform, createRecordReader, createRegionSizeCalculator, getAdmin, getRegionLocator, getScan, getTable, includeRegionInSplit, initializeTable, reverseDNS, setScan, setTableRecordReader
-
Field Details
-
LOG
-
INPUT_TABLE
Job parameter that specifies the input table.
-
SPLIT_TABLE
If specified, use start keys of this table to split. This is useful when you are preparing data for a bulk load.
-
SCAN
Base-64 encoded scanner. All other SCAN_ confs are ignored if this is specified. See TableMapReduceUtil.convertScanToString(Scan) for more details.
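A minimal sketch of populating SCAN, serializing the Scan with TableMapReduceUtil.convertScanToString(Scan) as referenced above; the column family "cf" is a placeholder assumption.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

public class SerializedScanConf {
  public static Configuration build() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf")); // placeholder column family
    // Store the serialized Scan under SCAN; once set, the individual
    // SCAN_* properties documented below are ignored.
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
    return conf;
  }
}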
-
SCAN_ROW_START
Scan start row.
-
SCAN_ROW_STOP
Scan stop row.
-
SCAN_COLUMN_FAMILY
Column family to scan.
-
SCAN_COLUMNS
Space-delimited list of columns and column families to scan.
-
SCAN_TIMESTAMP
The timestamp used to filter columns with a specific timestamp.
-
SCAN_TIMERANGE_START
The starting timestamp used to filter columns with a specific range of versions.
-
SCAN_TIMERANGE_END
The ending timestamp used to filter columns with a specific range of versions.
-
SCAN_MAXVERSIONS
The maximum number of versions to return.
-
SCAN_CACHEBLOCKS
Set to false to disable server-side caching of blocks for this scan.
-
SCAN_CACHEDROWS
The number of rows for caching that will be passed to scanners.
-
SCAN_BATCHSIZE
Sets the maximum number of values to return for each call to next().
-
SHUFFLE_MAPS
Specifies whether the map tasks should be shuffled.
-
conf
The configuration.
-
-
Constructor Details
-
TableInputFormat
public TableInputFormat()
-
-
Method Details
-
getConf
Returns the current configuration.
- Specified by:
getConf in interface org.apache.hadoop.conf.Configurable
- Returns:
- The current configuration.
- See Also:
-
Configurable.getConf()
-
setConf
Sets the configuration. This is used to set the details for the table to be scanned.
- Specified by:
setConf in interface org.apache.hadoop.conf.Configurable
- Parameters:
configuration - The configuration to set.
- See Also:
-
Configurable.setConf(org.apache.hadoop.conf.Configuration)
-
createScanFromConfiguration
public static Scan createScanFromConfiguration(org.apache.hadoop.conf.Configuration conf) throws IOException
Sets up a Scan instance, applying settings from the configuration property constants defined in TableInputFormat. This allows specifying things such as:
- start and stop rows
- column qualifiers or families
- timestamps or timerange
- scanner caching and batch size
- Throws:
IOException
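A short sketch of driving this method purely through configuration properties; the row keys and column family are placeholder assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;

public class ScanFromConfiguration {
  public static Scan build() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableInputFormat.SCAN_ROW_START, "row-0000");  // placeholder start row
    conf.set(TableInputFormat.SCAN_ROW_STOP, "row-9999");   // placeholder stop row
    conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "cf");    // placeholder family
    conf.set(TableInputFormat.SCAN_MAXVERSIONS, "3");
    conf.set(TableInputFormat.SCAN_CACHEDROWS, "500");
    // The returned Scan reflects the SCAN_* properties set above.
    return TableInputFormat.createScanFromConfiguration(conf);
  }
}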
-
initialize
Description copied from class: TableInputFormatBase
Handle subclass-specific set up. Each of the entry points used by the MapReduce framework, TableInputFormatBase.createRecordReader(InputSplit, TaskAttemptContext) and TableInputFormatBase.getSplits(JobContext), will call TableInputFormatBase.initialize(JobContext) as a convenient centralized location to handle retrieving the necessary configuration information and calling TableInputFormatBase.initializeTable(Connection, TableName). Subclasses should implement their initialize call such that it is safe to call multiple times. The current TableInputFormatBase implementation relies on a non-null table reference to decide if an initialize call is needed, but this behavior may change in the future. In particular, it is critical that initializeTable not be called multiple times, since this will leak Connection instances.
- Overrides:
initialize in class TableInputFormatBase
- Throws:
IOException
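A hedged sketch of an idempotent initialize override in a hypothetical subclass, guarding on the table reference as the contract above suggests; the configuration key "my.input.table" is an invented placeholder, not an HBase property, and the guard mirrors current TableInputFormatBase behavior that may change in the future.

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.mapreduce.TableInputFormatBase;
import org.apache.hadoop.mapreduce.JobContext;

public class MyTableInputFormat extends TableInputFormatBase {
  @Override
  protected void initialize(JobContext context) throws IOException {
    // Guard so repeated framework calls do not invoke initializeTable
    // twice and leak Connection instances.
    if (getTable() != null) {
      return;
    }
    TableName tableName = TableName.valueOf(
        context.getConfiguration().get("my.input.table")); // hypothetical key
    Connection connection =
        ConnectionFactory.createConnection(context.getConfiguration());
    initializeTable(connection, tableName);
  }
}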
-
addColumn
Parses a combined family and qualifier and adds either both or just the family in case there is no qualifier. This assumes the older colon-delimited notation, e.g. "family:qualifier".
- Parameters:
scan - The Scan to update.
familyAndQualifier - family and qualifier
- Throws:
IllegalArgumentException - When familyAndQualifier is invalid.
-
addColumns
Adds an array of columns specified using the old format, family:qualifier. Overrides previous calls to Scan.addColumn(byte[], byte[]) for any families in the input.
- Parameters:
scan - The Scan to update.
columns - Array of columns, formatted as family:qualifier.
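A small usage sketch, assuming a placeholder family "cf": a family:qualifier entry adds a single column, while a bare family entry adds the whole family.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;

public class AddColumnsExample {
  public static Scan scanWithColumns() {
    Scan scan = new Scan();
    byte[][] columns = new byte[][] {
        Bytes.toBytes("cf:name"), // family:qualifier adds one column
        Bytes.toBytes("cf")       // bare family adds the whole family
    };
    TableInputFormat.addColumns(scan, columns);
    return scan;
  }
}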
-
getSplits
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context) throws IOException
Calculates the splits that will serve as input for the map tasks. The number of splits matches the number of regions in a table. Splits are shuffled if required.
- Overrides:
getSplits in class TableInputFormatBase
- Parameters:
context - The current job context.
- Returns:
- The list of input splits.
- Throws:
IOException - When creating the list of splits fails.
- See Also:
-
InputFormat.getSplits(org.apache.hadoop.mapreduce.JobContext)
-
addColumns
Convenience method to parse a string representation of an array of column specifiers.
- Parameters:
scan - The Scan to update.
columns - The columns to parse.
-
getStartEndKeys
- Overrides:
getStartEndKeys in class TableInputFormatBase
- Throws:
IOException
-
configureSplitTable
Sets the split table in a map-reduce job.
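A brief sketch, assuming a placeholder target table "target_table" whose region start keys should drive the splits, e.g. when preparing data for a bulk load into that table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class SplitTableExample {
  public static Job configure() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "prepare-bulkload"); // placeholder job name
    // Split on the start keys of the (placeholder) target table.
    TableInputFormat.configureSplitTable(job, TableName.valueOf("target_table"));
    return job;
  }
}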
-