Package org.apache.hadoop.hbase.mapred
Class TableInputFormatBase
java.lang.Object
org.apache.hadoop.hbase.mapred.TableInputFormatBase
- All Implemented Interfaces:
org.apache.hadoop.mapred.InputFormat<ImmutableBytesWritable,
Result>
- Direct Known Subclasses:
TableInputFormat
@Public
public abstract class TableInputFormatBase
extends Object
implements org.apache.hadoop.mapred.InputFormat<ImmutableBytesWritable,Result>
A Base for
TableInputFormat
s. Receives a Table
, a byte[] of input columns and
optionally a Filter
. Subclasses may use other TableRecordReader implementations.
Subclasses MUST ensure initializeTable(Connection, TableName) is called for an instance to
function properly. Each of the entry points to this class used by the MapReduce framework,
getRecordReader(InputSplit, JobConf, Reporter)
and getSplits(JobConf, int)
,
will call initialize(JobConf)
as a convenient centralized location to handle retrieving
the necessary configuration information. If your subclass overrides either of these methods,
either call the parent version or call initialize yourself.
An example of a subclass:
class ExampleTIF extends TableInputFormatBase { @Override protected void initialize(JobConf context) throws IOException { // We are responsible for the lifecycle of this connection until we hand it over in // initializeTable. Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create(job)); TableName tableName = TableName.valueOf("exampleTable"); // mandatory. once passed here, TableInputFormatBase will handle closing the connection. initializeTable(connection, tableName); byte[][] inputColumns = new byte [][] { Bytes.toBytes("columnA"), Bytes.toBytes("columnB") }; // mandatory setInputColumns(inputColumns); // optional, by default we'll get everything for the given columns. Filter exampleFilter = new RowFilter(CompareOp.EQUAL, new RegexStringComparator("aa.*")); setRowFilter(exampleFilter); } }
-
Field Summary
Modifier and TypeFieldDescriptionprivate Connection
private static final String
private byte[][]
private static final org.slf4j.Logger
private static final String
private RegionLocator
private Filter
private Table
private TableRecordReader
-
Constructor Summary
-
Method Summary
Modifier and TypeMethodDescriptionprivate void
protected void
Close the Table and related objects that were initialized viainitializeTable(Connection, TableName)
.org.apache.hadoop.mapred.RecordReader<ImmutableBytesWritable,
Result> getRecordReader
(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter) Builds a TableRecordReader.org.apache.hadoop.mapred.InputSplit[]
getSplits
(org.apache.hadoop.mapred.JobConf job, int numSplits) Calculates the splits that will serve as input for the map tasks.protected Table
getTable()
Allows subclasses to get theTable
.protected void
initialize
(org.apache.hadoop.mapred.JobConf job) Handle subclass specific set up.protected void
initializeTable
(Connection connection, TableName tableName) Allows subclasses to initialize the table information.protected void
setInputColumns
(byte[][] inputColumns) protected void
setRowFilter
(Filter rowFilter) Allows subclasses to set theFilter
to be used.protected void
setTableRecordReader
(TableRecordReader tableRecordReader) Allows subclasses to set theTableRecordReader
.
-
Field Details
-
LOG
-
inputColumns
-
table
-
regionLocator
-
connection
-
tableRecordReader
-
rowFilter
-
NOT_INITIALIZED
- See Also:
-
INITIALIZATION_ERROR
- See Also:
-
-
Constructor Details
-
TableInputFormatBase
public TableInputFormatBase()
-
-
Method Details
-
getRecordReader
public org.apache.hadoop.mapred.RecordReader<ImmutableBytesWritable,Result> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter) throws IOException Builds a TableRecordReader. If no TableRecordReader was provided, uses the default.- Specified by:
getRecordReader
in interfaceorg.apache.hadoop.mapred.InputFormat<ImmutableBytesWritable,
Result> - Throws:
IOException
- See Also:
-
InputFormat.getRecordReader(InputSplit, JobConf, Reporter)
-
getSplits
public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits) throws IOException Calculates the splits that will serve as input for the map tasks. Splits are created in number equal to the smallest between numSplits and the number ofHRegion
s in the table. If the number of splits is smaller than the number ofHRegion
s then splits are spanned across multipleHRegion
s and are grouped the most evenly possible. In the case splits are uneven the bigger splits are placed first in theInputSplit
array.- Specified by:
getSplits
in interfaceorg.apache.hadoop.mapred.InputFormat<ImmutableBytesWritable,
Result> - Parameters:
job
- the map taskJobConf
numSplits
- a hint to calculate the number of splits (mapred.map.tasks).- Returns:
- the input splits
- Throws:
IOException
- See Also:
-
InputFormat.getSplits(org.apache.hadoop.mapred.JobConf, int)
-
initializeTable
Allows subclasses to initialize the table information.- Parameters:
connection
- The Connection to the HBase cluster. MUST be unmanaged. We will close.tableName
- TheTableName
of the table to process.- Throws:
IOException
-
setInputColumns
- Parameters:
inputColumns
- to be passed inResult
to the map task.
-
getTable
Allows subclasses to get theTable
. -
setTableRecordReader
Allows subclasses to set theTableRecordReader
. to provide otherTableRecordReader
implementations. -
setRowFilter
Allows subclasses to set theFilter
to be used. -
initialize
Handle subclass specific set up. Each of the entry points used by the MapReduce framework,getRecordReader(InputSplit, JobConf, Reporter)
andgetSplits(JobConf, int)
, will callinitialize(JobConf)
as a convenient centralized location to handle retrieving the necessary configuration information and callinginitializeTable(Connection, TableName)
. Subclasses should implement their initialize call such that it is safe to call multiple times. The current TableInputFormatBase implementation relies on a non-null table reference to decide if an initialize call is needed, but this behavior may change in the future. In particular, it is critical that initializeTable not be called multiple times since this will leak Connection instances.- Throws:
IOException
-
closeTable
Close the Table and related objects that were initialized viainitializeTable(Connection, TableName)
.- Throws:
IOException
-
close
- Throws:
IOException
-