java.lang.Object

org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl

@Private public class TableSnapshotInputFormatImpl extends Object

Hadoop MR API-agnostic implementation for mapreduce over table snapshots.

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

TableSnapshotInputFormatImpl.InputSplit

Implementation class for InputSplit logic common between mapred and mapreduce.

static class

TableSnapshotInputFormatImpl.RecordReader

Implementation class for RecordReader logic common between mapred and mapreduce.

Field Summary

Fields

Modifier and Type

Field

Description

private static final float

DEFAULT_LOCALITY_CUTOFF_MULTIPLIER

private static final String

LOCALITY_CUTOFF_MULTIPLIER

See getBestLocations(Configuration, HDFSBlocksDistribution, int)

static final org.slf4j.Logger

LOG

static final String

NUM_SPLITS_PER_REGION

For MapReduce jobs running multiple mappers per region, determines number of splits to generate per region.

protected static final String

RESTORE_DIR_KEY

static final String

SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION

Whether to determine the snapshot region location from the region location in meta.

static final boolean

SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION_DEFAULT

static final boolean

SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_DEFAULT

static final String

SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_KEY

Whether to calculate the block location for splits.

static final String

SNAPSHOT_INPUTFORMAT_ROW_LIMIT_PER_INPUTSPLIT

Limits the rows scanned on each InputSplit, which is useful in some scenarios such as sampling data extraction.

static final String

SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED

Whether to enable scan metrics on the Scan; defaults to true.

static final boolean

SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED_DEFAULT

static final String

SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE

The Scan.ReadType to set on the Scan used to read the HBase snapshot; defaults to STREAM.

static final Scan.ReadType

SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE_DEFAULT

private static final String

SNAPSHOT_NAME_KEY

static final String

SPLIT_ALGO

For MapReduce jobs running multiple mappers per region, determines what split algorithm we should be using to find split points for scanners.

Constructor Summary

Constructors

Constructor

Description

TableSnapshotInputFormatImpl()

Method Summary

Modifier and Type

Method

Description

private static List<String>

calculateLocationsForInputSplit(org.apache.hadoop.conf.Configuration conf, TableDescriptor htd, RegionInfo hri, org.apache.hadoop.fs.Path tableDir)

Compute block locations for snapshot files (which will get the locations for referred hfiles) only when localityEnabled is true.

static void

cleanRestoreDir(org.apache.hadoop.mapreduce.Job job, String snapshotName)

Cleans the restore directory after the snapshot scan job.

static Scan

extractScanFromConf(org.apache.hadoop.conf.Configuration conf)

static List<String>

getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution)

private static List<String>

getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution, int numTopsAtMost)

This computes the locations to be passed from the InputSplit.

static List<RegionInfo>

getRegionInfosFromManifest(SnapshotManifest manifest)

static SnapshotManifest

getSnapshotManifest(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path rootDir, org.apache.hadoop.fs.FileSystem fs)

private static String

getSnapshotName(org.apache.hadoop.conf.Configuration conf)

static RegionSplitter.SplitAlgorithm

getSplitAlgo(org.apache.hadoop.conf.Configuration conf)

static List<TableSnapshotInputFormatImpl.InputSplit>

getSplits(org.apache.hadoop.conf.Configuration conf)

static List<TableSnapshotInputFormatImpl.InputSplit>

getSplits(Scan scan, SnapshotManifest manifest, List<RegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf)

static List<TableSnapshotInputFormatImpl.InputSplit>

getSplits(Scan scan, SnapshotManifest manifest, List<RegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf, RegionSplitter.SplitAlgorithm sa, int numSplits)

static void

setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir)

Configures the job to use TableSnapshotInputFormat to read from a snapshot.

static void

setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion)

Configures the job to use TableSnapshotInputFormat to read from a snapshot.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- LOG
  
  public static final org.slf4j.Logger LOG
- SNAPSHOT_NAME_KEY
  
  private static final String SNAPSHOT_NAME_KEY
  See Also:
  
  Constant Field Values
- RESTORE_DIR_KEY
  
  protected static final String RESTORE_DIR_KEY
  See Also:
  
  Constant Field Values
- LOCALITY_CUTOFF_MULTIPLIER
  
  private static final String LOCALITY_CUTOFF_MULTIPLIER
  
  See getBestLocations(Configuration, HDFSBlocksDistribution, int)
  See Also:
  
  Constant Field Values
- DEFAULT_LOCALITY_CUTOFF_MULTIPLIER
  
  private static final float DEFAULT_LOCALITY_CUTOFF_MULTIPLIER
  See Also:
  
  Constant Field Values
- SPLIT_ALGO
  
  public static final String SPLIT_ALGO
  
  For MapReduce jobs running multiple mappers per region, determines what split algorithm we should be using to find split points for scanners.
  See Also:
  
  Constant Field Values
- NUM_SPLITS_PER_REGION
  
  public static final String NUM_SPLITS_PER_REGION
  
  For MapReduce jobs running multiple mappers per region, determines number of splits to generate per region.
  See Also:
  
  Constant Field Values
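
As a rough illustration of what generating several splits per region means, the sketch below divides one region's key range into numSplits near-equal sub-ranges. It is not HBase's RegionSplitter.SplitAlgorithm: keys are modeled as long values rather than row-key byte arrays, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keys are modeled as longs, not HBase row-key byte arrays.
final class RegionSubSplitSketch {

    // Returns numSplits + 1 boundary keys that divide [start, end) into near-equal sub-ranges.
    static List<Long> boundaries(long start, long end, int numSplits) {
        if (numSplits < 1 || end <= start) {
            throw new IllegalArgumentException("need numSplits >= 1 and start < end");
        }
        List<Long> points = new ArrayList<>();
        long width = end - start;
        for (int i = 0; i <= numSplits; i++) {
            // Integer arithmetic; assumes width * numSplits does not overflow a long.
            points.add(start + width * i / numSplits);
        }
        return points;
    }

    public static void main(String[] args) {
        // One region covering keys [0, 100), read by 4 mappers.
        System.out.println(boundaries(0, 100, 4)); // prints [0, 25, 50, 75, 100]
    }
}
```

Each adjacent pair of boundaries would correspond to one input split, so a region yields numSplits splits instead of one.
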
- SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_KEY
  
  public static final String SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_KEY
  
  Whether to calculate the block location for splits. Defaults to true. If the computing layer runs outside of the HBase cluster, block locality does not matter, so setting this value to false skips the calculation and saves some time. The access modifier is "public" so that this constant can be accessed by test classes of both org.apache.hadoop.hbase.mapred and org.apache.hadoop.hbase.mapreduce.
  See Also:
  
  Constant Field Values
- SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_DEFAULT
  
  public static final boolean SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_DEFAULT
  See Also:
  
  Constant Field Values
- SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION
  
  public static final String SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION
  
  Whether to determine the snapshot region location from the region location in meta. This is much faster than computing block locations for splits.
  See Also:
  
  Constant Field Values
- SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION_DEFAULT
  
  public static final boolean SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION_DEFAULT
  See Also:
  
  Constant Field Values
- SNAPSHOT_INPUTFORMAT_ROW_LIMIT_PER_INPUTSPLIT
  
  public static final String SNAPSHOT_INPUTFORMAT_ROW_LIMIT_PER_INPUTSPLIT
  
  Limits the rows scanned on each InputSplit, which is useful in some scenarios such as sampling data extraction.
  See Also:
  
  Constant Field Values
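
The effect of a per-split row limit can be pictured as a reader that stops early. The sketch below is only an illustration with hypothetical names, not the actual RecordReader logic: it wraps an arbitrary row iterator, and it assumes that a limit of zero or less means "no limit".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: wraps any row iterator and stops after a fixed number of rows.
final class RowLimitSketch<T> implements Iterator<T> {
    private final Iterator<T> rows;
    private final int limit;   // assumption: a value <= 0 means "no limit"
    private int returned = 0;

    RowLimitSketch(Iterator<T> rows, int limit) {
        this.rows = rows;
        this.limit = limit;
    }

    @Override
    public boolean hasNext() {
        boolean underLimit = limit <= 0 || returned < limit;
        return underLimit && rows.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException("row limit reached or input exhausted");
        }
        returned++;
        return rows.next();
    }

    // Drains an iterator into a list; helper for the demo below.
    static <E> List<E> drain(Iterator<E> it) {
        List<E> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        // Sample only the first two rows of this "split".
        System.out.println(drain(new RowLimitSketch<>(rows.iterator(), 2))); // prints [r1, r2]
    }
}
```
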
- SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED
  
  public static final String SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED
  
  Whether to enable scan metrics on the Scan; defaults to true.
  See Also:
  
  Constant Field Values
- SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED_DEFAULT
  
  public static final boolean SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED_DEFAULT
  See Also:
  
  Constant Field Values
- SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE
  
  public static final String SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE
  
  The Scan.ReadType to set on the Scan used to read the HBase snapshot; defaults to STREAM.
  See Also:
  
  Constant Field Values
- SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE_DEFAULT
  
  public static final Scan.ReadType SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE_DEFAULT

Constructor Details
- TableSnapshotInputFormatImpl
  
  public TableSnapshotInputFormatImpl()

Method Details
- getSplits
  
  public static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(org.apache.hadoop.conf.Configuration conf) throws IOException
  
  Throws:
  
  IOException
- getSplitAlgo
  
  public static RegionSplitter.SplitAlgorithm getSplitAlgo(org.apache.hadoop.conf.Configuration conf) throws IOException
  
  Throws:
  
  IOException
- getRegionInfosFromManifest
  
  public static List<RegionInfo> getRegionInfosFromManifest(SnapshotManifest manifest)
- getSnapshotManifest
  
  public static SnapshotManifest getSnapshotManifest(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path rootDir, org.apache.hadoop.fs.FileSystem fs) throws IOException
  
  Throws:
  
  IOException
- extractScanFromConf
  
  public static Scan extractScanFromConf(org.apache.hadoop.conf.Configuration conf) throws IOException
  
  Throws:
  
  IOException
- getSplits
  
  public static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(Scan scan, SnapshotManifest manifest, List<RegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf) throws IOException
  
  Throws:
  
  IOException
- getSplits
  
  public static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(Scan scan, SnapshotManifest manifest, List<RegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf, RegionSplitter.SplitAlgorithm sa, int numSplits) throws IOException
  
  Throws:
  
  IOException
- calculateLocationsForInputSplit
  
  private static List<String> calculateLocationsForInputSplit(org.apache.hadoop.conf.Configuration conf, TableDescriptor htd, RegionInfo hri, org.apache.hadoop.fs.Path tableDir) throws IOException
  
  Compute block locations for snapshot files (which will get the locations for referred hfiles) only when localityEnabled is true.
  
  Throws:
  
  IOException
- getBestLocations
  
  private static List<String> getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution, int numTopsAtMost)
  
  This computes the locations to be passed from the InputSplit. MR/YARN schedulers do not take weights into account, so they treat every location passed from the input split as equal. We do not want to blindly pass all the locations, since we are creating one split per region and the region's blocks are distributed throughout the cluster unless favored node assignment is used. In the expected stable case, only one location will contain most of the blocks as local. With favored node assignment, on the other hand, 3 nodes will contain highly local blocks. Here we use a simple heuristic: pass all hosts that have at least 80% (hbase.tablesnapshotinputformat.locality.cutoff.multiplier) as much block locality as the top host with the best locality. Returns at most numTopsAtMost locations if there are more than that.
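
The heuristic described above can be sketched in plain Java. This is a minimal illustration rather than the HBase implementation: a Map of host name to block weight stands in for HDFSBlocksDistribution, the class and method names are hypothetical, and returning an empty list when no host has any weight is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the cutoff heuristic; a plain map replaces HDFSBlocksDistribution.
final class LocalityCutoffSketch {

    // Keeps hosts whose weight is at least cutoffMultiplier times the top host's weight,
    // in descending weight order, and returns at most numTopsAtMost of them.
    static List<String> bestLocations(Map<String, Long> weightByHost,
                                      double cutoffMultiplier,
                                      int numTopsAtMost) {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(weightByHost.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        List<String> result = new ArrayList<>();
        if (sorted.isEmpty() || sorted.get(0).getValue() <= 0) {
            return result; // assumption: no locality information means no preferred hosts
        }
        double cutoff = sorted.get(0).getValue() * cutoffMultiplier;
        for (Map.Entry<String, Long> entry : sorted) {
            if (result.size() >= numTopsAtMost || entry.getValue() < cutoff) {
                break; // list is sorted, so every later host is below the cutoff too
            }
            result.add(entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> weights = new LinkedHashMap<>();
        weights.put("host-a", 100L);
        weights.put("host-b", 85L);
        weights.put("host-c", 79L);
        weights.put("host-d", 10L);
        // With an 80% cutoff, only hosts with weight >= 80 qualify.
        System.out.println(bestLocations(weights, 0.8, 3)); // prints [host-a, host-b]
    }
}
```

Because the scheduler treats all returned hosts as equally good, dropping hosts far below the top one keeps tasks from being placed on nodes that hold only a small share of the region's blocks.
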
- getBestLocations
  
  public static List<String> getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution)
- getSnapshotName
  
  private static String getSnapshotName(org.apache.hadoop.conf.Configuration conf)
- setInput
  
  public static void setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir) throws IOException
  
  Configures the job to use TableSnapshotInputFormat to read from a snapshot.
  
  Parameters:
  
  conf - the job to configure
  
  snapshotName - the name of the snapshot to read from
  
  restoreDir - a temporary directory to restore the snapshot into. Current user should have write permissions to this directory, and this should not be a subdirectory of rootdir. After the job is finished, restoreDir can be deleted.
  
  Throws:
  
  IOException - if an error occurs
- setInput
  
  public static void setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion) throws IOException
  
  Configures the job to use TableSnapshotInputFormat to read from a snapshot.
  
  Parameters:
  
  conf - the job to configure
  
  snapshotName - the name of the snapshot to read from
  
  restoreDir - a temporary directory to restore the snapshot into. Current user should have write permissions to this directory, and this should not be a subdirectory of rootdir. After the job is finished, restoreDir can be deleted.
  
  numSplitsPerRegion - how many input splits to generate per one region
  
  splitAlgo - SplitAlgorithm to be used when generating InputSplits
  
  Throws:
  
  IOException - if an error occurs
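
The restoreDir requirement (it should not be a subdirectory of rootdir) lends itself to a caller-side check before setInput is invoked. The sketch below is a hedged illustration only: it uses java.nio.file.Path instead of Hadoop's org.apache.hadoop.fs.Path, the class and method names are hypothetical, and it says nothing about what setInput itself validates.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-check; uses java.nio paths rather than Hadoop's Path and FileSystem.
final class RestoreDirCheckSketch {

    // True when candidate is rootDir itself or lies somewhere below it.
    static boolean isInsideRoot(Path rootDir, Path candidate) {
        // normalize() resolves "." and ".." so "/hbase/../tmp" is not mistaken for a child.
        return candidate.normalize().startsWith(rootDir.normalize());
    }

    public static void main(String[] args) {
        Path rootDir = Paths.get("/hbase");
        System.out.println(isInsideRoot(rootDir, Paths.get("/hbase/tmp/restore"))); // prints true
        System.out.println(isInsideRoot(rootDir, Paths.get("/tmp/restore")));       // prints false
    }
}
```

A job driver could reject a restoreDir for which this check returns true before configuring the job.
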
- cleanRestoreDir
  
  public static void cleanRestoreDir(org.apache.hadoop.mapreduce.Job job, String snapshotName) throws IOException
  
  Cleans the restore directory after the snapshot scan job.
  
  Parameters:
  
  job - the snapshot scan job
  
  snapshotName - the name of the snapshot to read from
  
  Throws:
  
  IOException - if an error occurs
