Class MultiTableInputFormatBase

java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
org.apache.hadoop.hbase.mapreduce.MultiTableInputFormatBase
Direct Known Subclasses:
MultiTableInputFormat

@Public public abstract class MultiTableInputFormatBase extends org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
A base class for multi-table input formats. It receives a list of Scan instances, each of which defines an input table together with its filters and other scan settings. Subclasses may supply a different TableRecordReader implementation.
  • Method Details

    • createRecordReader

      public org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException, InterruptedException
      Builds a TableRecordReader. If no TableRecordReader was provided through setTableRecordReader, the default implementation is used.
      Specified by:
      createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
      Parameters:
      split - The split to work with.
      context - The current context.
      Returns:
      The newly created record reader.
      Throws:
      IOException - When creating the reader fails.
      InterruptedException - When record reader initialization fails.
      See Also:
      • InputFormat.createRecordReader(InputSplit, TaskAttemptContext)
    • getSplits

      public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context) throws IOException
      Calculates the splits that will serve as input for the map tasks. For each Scan, one split is created per region of the scanned table that overlaps the scan's row range and is not excluded by includeRegionInSplit.
      Specified by:
      getSplits in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
      Parameters:
      context - The current job context.
      Returns:
      The list of input splits.
      Throws:
      IOException - When creating the list of splits fails.
      See Also:
      • InputFormat.getSplits(org.apache.hadoop.mapreduce.JobContext)
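
      The relationship between regions and splits can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the HBase implementation: the class SplitSelectionSketch and the method overlappingRegions are hypothetical names, region boundaries are passed in as plain byte arrays, and an empty array is assumed to mean "unbounded", as it does for the first and last region of a table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of per-region split selection; hypothetical, not HBase code. */
final class SplitSelectionSketch {

    /**
     * Returns the indexes of the regions whose [start, end) key range overlaps
     * the scan range [startRow, stopRow). An empty array means "unbounded".
     */
    static List<Integer> overlappingRegions(byte[][] regionStarts, byte[][] regionEnds,
                                            byte[] startRow, byte[] stopRow) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < regionStarts.length; i++) {
            byte[] regionStart = regionStarts[i];
            byte[] regionEnd = regionEnds[i];
            // The scan begins before the region ends, or one side is unbounded.
            boolean startsBeforeRegionEnd = startRow.length == 0
                    || regionEnd.length == 0
                    || Arrays.compareUnsigned(startRow, regionEnd) < 0;
            // The scan stops after the region begins, or the scan has no stop row.
            boolean stopsAfterRegionStart = stopRow.length == 0
                    || Arrays.compareUnsigned(stopRow, regionStart) > 0;
            if (startsBeforeRegionEnd && stopsAfterRegionStart) {
                selected.add(i); // one split would be produced for this region
            }
        }
        return selected;
    }
}
```

      In this model a scan with empty start and stop rows selects every region, while a bounded scan selects only the regions it touches. Keys are compared as unsigned bytes in lexicographic order, which is the ordering HBase uses for row keys.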
    • includeRegionInSplit

      protected boolean includeRegionInSplit(byte[] startKey, byte[] endKey)
      Tests whether the given region should be included in the InputSplit while splitting the regions of a table.

      This optimization is useful when there is a specific reason to exclude an entire region from the MapReduce job (so that it contributes no InputSplit), given only the region's start and end keys.
      For example, a job that runs continuously can remember the last-processed top record and revisit only the [last, current) interval. Because the keys are ordered, this reduces the number of InputSplits and also the load on the region servers.

      Note: endKey.length may be 0 for the last (most recent) region.
      Override this method to exclude entire regions from the MapReduce job in bulk. By default, no region is excluded (i.e. all regions are included).

      Parameters:
      startKey - Start key of the region
      endKey - End key of the region
      Returns:
      true, if this region needs to be included as part of the input (default).
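
      The kind of predicate an override might apply can be sketched in plain Java. This is a minimal illustration, not HBase code: the class RegionFilterSketch and its lastProcessedKey checkpoint are hypothetical, and the sketch assumes the job only needs rows at or after that checkpoint. A real override would live in a subclass of this class and keep the same method signature.

```java
import java.util.Arrays;

/** Standalone model of an includeRegionInSplit-style predicate; hypothetical. */
final class RegionFilterSketch {

    // Hypothetical checkpoint: the last row key a previous run finished with.
    private final byte[] lastProcessedKey;

    RegionFilterSketch(byte[] lastProcessedKey) {
        this.lastProcessedKey = lastProcessedKey.clone();
    }

    /**
     * Returns true when the region [startKey, endKey) can still hold rows
     * at or after the checkpoint, and false when it lies entirely below it.
     */
    boolean includeRegionInSplit(byte[] startKey, byte[] endKey) {
        // An empty end key marks the last region, which is unbounded above.
        if (endKey.length == 0) {
            return true;
        }
        // The end key is exclusive, so the region holds a key >= checkpoint
        // only if endKey sorts strictly after it (unsigned byte order).
        return Arrays.compareUnsigned(endKey, lastProcessedKey) > 0;
    }
}
```

      With a checkpoint of "m", a region ending at "g" or exactly at "m" would be skipped, while a region ending at "p" and the last region (empty end key) would be kept. The start key is unused here because, under the stated assumption, only the upper bound of the region decides whether it can contain unprocessed rows.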
    • getScans

      protected List<Scan> getScans()
      Allows subclasses to get the list of Scan objects.
    • setScans

      protected void setScans(List<Scan> scans)
      Allows subclasses to set the list of Scan objects.
      Parameters:
      scans - The list of Scan objects used to define the input.
    • setTableRecordReader

      protected void setTableRecordReader(TableRecordReader tableRecordReader)
      Allows subclasses to set the TableRecordReader.
      Parameters:
      tableRecordReader - A different TableRecordReader implementation.