org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>

org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat

Direct Known Subclasses: MultiTableSnapshotInputFormat

@Public public class TableSnapshotInputFormat extends org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>

TableSnapshotInputFormat allows a MapReduce job to run over a table snapshot. The job bypasses the HBase servers and accesses the underlying files (HFiles, recovered edits, WALs, etc.) directly to provide maximum performance. The snapshot does not need to be restored to the live cluster or cloned. This also allows the MapReduce job to run against an online or offline HBase cluster. The snapshot files can be exported to a pure HDFS cluster with the ExportSnapshot tool, and this InputFormat can then run the MapReduce job directly over the exported snapshot files. The snapshot should not be deleted while jobs are reading from its files.

Usage is similar to TableInputFormat, and TableMapReduceUtil.initTableSnapshotMapperJob(String, Scan, Class, Class, Class, Job, boolean, Path) can be used to configure the job.

   // restoreDir is the temporary Path the snapshot is restored into (see setInput below)
   Job job = Job.getInstance(conf);
   Scan scan = new Scan();
   TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName, scan, MyTableMapper.class,
     MyMapKeyOutput.class, MyMapOutputValueWritable.class, job, true, restoreDir);

Internally, this input format restores the snapshot into the given tmp directory. By default, and similar to TableInputFormat, an InputSplit is created per region. Optionally, you can run N mapper tasks per region, in which case the region key range is split into N sub-ranges and an InputSplit is created per sub-range. Each RecordReader opens its region for reading. An internal RegionScanner is used to execute the CellScanner obtained from the user.
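The sub-range idea can be illustrated without any HBase dependency. The following is a minimal sketch, assuming row keys are compared as unsigned big-endian numbers and that the region has non-empty start and end keys. The class and method names (KeyRangeSplitSketch, boundaries) are hypothetical and are not part of HBase, and the actual boundaries in a job come from the configured RegionSplitter.SplitAlgorithm, which may compute them differently.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not HBase code. Shows how one key range can be cut
// into N contiguous sub-ranges, each of which would back one InputSplit.
public class KeyRangeSplitSketch {

  // Reads a key as an unsigned big-endian number, right-padded with zero bytes to 'width'.
  public static BigInteger toNumber(byte[] key, int width) {
    byte[] padded = new byte[width];
    System.arraycopy(key, 0, padded, 0, Math.min(key.length, width));
    return new BigInteger(1, padded);
  }

  // Converts a non-negative number back into a fixed-width big-endian key.
  public static byte[] toKey(BigInteger value, int width) {
    byte[] raw = value.toByteArray(); // may carry a leading sign byte or be shorter than 'width'
    byte[] key = new byte[width];
    int copy = Math.min(raw.length, width);
    System.arraycopy(raw, raw.length - copy, key, width - copy, copy);
    return key;
  }

  // Returns n + 1 boundaries; sub-range i covers [boundary i, boundary i + 1).
  public static List<byte[]> boundaries(byte[] start, byte[] end, int n) {
    if (n <= 0) {
      throw new IllegalArgumentException("n must be positive");
    }
    int width = Math.max(start.length, end.length);
    BigInteger lo = toNumber(start, width);
    BigInteger span = toNumber(end, width).subtract(lo);
    List<byte[]> result = new ArrayList<>();
    for (int i = 0; i <= n; i++) {
      BigInteger offset = span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n));
      result.add(toKey(lo.add(offset), width));
    }
    return result;
  }

  public static void main(String[] args) {
    // A region covering [0x00, 0x80) split into 4 sub-ranges.
    for (byte[] b : boundaries(new byte[] {0x00}, new byte[] {(byte) 0x80}, 4)) {
      System.out.printf("%02x%n", b[0] & 0xff); // prints 00, 20, 40, 60, 80 (one per line)
    }
  }
}
```

With N = 4, the single region above yields four sub-ranges, [00, 20), [20, 40), [40, 60) and [60, 80), so four mapper tasks would each scan one quarter of the region's key space.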

HBase owns all the data and snapshot files on the filesystem. Only the 'hbase' user can read from snapshot files and data files. To read snapshot files directly from the filesystem, the user running the MR job must have sufficient permissions to access the snapshot and reference files. This means that to run MapReduce over snapshot files, the MR job must be run as the HBase user, or the user must have group or other privileges in the filesystem (see HBASE-8369). Note that giving other users read access to snapshot/data files completely circumvents the access control enforced by HBase.

See Also:

TableSnapshotScanner

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| (package private) static class | TableSnapshotInputFormat.TableSnapshotRegionRecordReader | |
| static class | TableSnapshotInputFormat.TableSnapshotRegionSplit | |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| TableSnapshotInputFormat() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static void | cleanRestoreDir(org.apache.hadoop.mapreduce.Job job, String snapshotName) | Cleans the restore directory after a snapshot scan job. |
| org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) | |
| List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext job) | |
| static void | setInput(org.apache.hadoop.mapreduce.Job job, String snapshotName, org.apache.hadoop.fs.Path restoreDir) | Configures the job to use TableSnapshotInputFormat to read from a snapshot. |
| static void | setInput(org.apache.hadoop.mapreduce.Job job, String snapshotName, org.apache.hadoop.fs.Path restoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion) | Configures the job to use TableSnapshotInputFormat to read from a snapshot. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- TableSnapshotInputFormat
  
  public TableSnapshotInputFormat()

Method Details
- createRecordReader
  
  public org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
  
  Specified by:
  
  createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
  
  Throws:
  
  IOException
- getSplits
  
  public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job) throws IOException, InterruptedException
  
  Specified by:
  
  getSplits in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
  
  Throws:
  
  IOException
  
  InterruptedException
- setInput
  
  public static void setInput(org.apache.hadoop.mapreduce.Job job, String snapshotName, org.apache.hadoop.fs.Path restoreDir) throws IOException
  
  Configures the job to use TableSnapshotInputFormat to read from a snapshot.
  
  Parameters:
  
  job - the job to configure
  
  snapshotName - the name of the snapshot to read from
  
  restoreDir - a temporary directory to restore the snapshot into. The current user should have write permission on this directory, and the directory should not be a subdirectory of the HBase rootdir. After the job finishes, restoreDir can be deleted.
  
  Throws:
  
  IOException - if an error occurs
- setInput
  
  public static void setInput(org.apache.hadoop.mapreduce.Job job, String snapshotName, org.apache.hadoop.fs.Path restoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion) throws IOException
  
  Configures the job to use TableSnapshotInputFormat to read from a snapshot.
  
  Parameters:
  
  job - the job to configure
  
  snapshotName - the name of the snapshot to read from
  
  restoreDir - a temporary directory to restore the snapshot into. The current user should have write permission on this directory, and the directory should not be a subdirectory of the HBase rootdir. After the job finishes, restoreDir can be deleted.
  
  splitAlgo - the split algorithm used to divide each region's key range into sub-ranges
  
  numSplitsPerRegion - the number of input splits to generate per region
  
  Throws:
  
  IOException - if an error occurs
- cleanRestoreDir
  
  public static void cleanRestoreDir(org.apache.hadoop.mapreduce.Job job, String snapshotName) throws IOException
  
  Cleans the restore directory after a snapshot scan job.
  
  Parameters:
  
  job - the snapshot scan job
  
  snapshotName - the name of the snapshot to read from
  
  Throws:
  
  IOException - if an error occurs
