Class TableSnapshotInputFormatImpl
java.lang.Object
org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl
Hadoop MR API-agnostic implementation for mapreduce over table snapshots.
-
Nested Class Summary
Nested Classes
static class TableSnapshotInputFormatImpl.InputSplit
    Implementation class for InputSplit logic common between mapred and mapreduce.
static class TableSnapshotInputFormatImpl.RecordReader
    Implementation class for RecordReader logic common between mapred and mapreduce.
-
Field Summary
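Fields
private static final float DEFAULT_LOCALITY_CUTOFF_MULTIPLIER
private static final String LOCALITY_CUTOFF_MULTIPLIER
static final org.slf4j.Logger LOG
static final String NUM_SPLITS_PER_REGION
    For MapReduce jobs running multiple mappers per region, determines number of splits to generate per region.
protected static final String RESTORE_DIR_KEY
static final String SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION
    Whether to calculate the Snapshot region location by region location from meta.
static final boolean SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION_DEFAULT
static final boolean SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_DEFAULT
static final String SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_KEY
    Whether to calculate the block location for splits.
static final String SNAPSHOT_INPUTFORMAT_ROW_LIMIT_PER_INPUTSPLIT
    In some scenario, scan limited rows on each InputSplit for sampling data extraction
static final String SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED
    Whether to enable scan metrics on Scan, default to true
static final boolean SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED_DEFAULT
static final String SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE
    The Scan.ReadType which should be set on the Scan to read the HBase Snapshot, default STREAM.
static final Scan.ReadType SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE_DEFAULT
private static final String SNAPSHOT_NAME_KEY
static final String SPLIT_ALGO
    For MapReduce jobs running multiple mappers per region, determines what split algorithm we should be using to find split points for scanners.
-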
Constructor Summary
Constructors
TableSnapshotInputFormatImpl()
-
Method Summary
private static List<String> calculateLocationsForInputSplit(org.apache.hadoop.conf.Configuration conf, TableDescriptor htd, RegionInfo hri, org.apache.hadoop.fs.Path tableDir)
    Compute block locations for snapshot files (which will get the locations for referred hfiles) only when localityEnabled is true.
static void cleanRestoreDir(org.apache.hadoop.mapreduce.Job job, String snapshotName)
    clean restore directory after snapshot scan job
static Scan extractScanFromConf(org.apache.hadoop.conf.Configuration conf)
static List<String> getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution)
private static List<String> getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution, int numTopsAtMost)
    This computes the locations to be passed from the InputSplit.
static List<RegionInfo> getRegionInfosFromManifest(SnapshotManifest manifest)
static SnapshotManifest getSnapshotManifest(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path rootDir, org.apache.hadoop.fs.FileSystem fs)
private static String getSnapshotName(org.apache.hadoop.conf.Configuration conf)
static RegionSplitter.SplitAlgorithm getSplitAlgo(org.apache.hadoop.conf.Configuration conf)
static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(org.apache.hadoop.conf.Configuration conf)
static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(Scan scan, SnapshotManifest manifest, List<RegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf)
static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(Scan scan, SnapshotManifest manifest, List<RegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf, RegionSplitter.SplitAlgorithm sa, int numSplits)
static void setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir)
    Configures the job to use TableSnapshotInputFormat to read from a snapshot.
static void setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion)
    Configures the job to use TableSnapshotInputFormat to read from a snapshot.
-
Field Details
-
LOG
-
SNAPSHOT_NAME_KEY
-
RESTORE_DIR_KEY
-
LOCALITY_CUTOFF_MULTIPLIER
-
DEFAULT_LOCALITY_CUTOFF_MULTIPLIER
-
SPLIT_ALGO
For MapReduce jobs that run multiple mappers per region, determines which split algorithm is used to find split points for scanners.
-
NUM_SPLITS_PER_REGION
For MapReduce jobs that run multiple mappers per region, determines the number of splits to generate per region.
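As a rough illustration of what generating several splits per region means, the sketch below divides one region's key range into equally sized sub-ranges. It is a simplified stand-in and not HBase's RegionSplitter.SplitAlgorithm: the class name RegionSplitSketch, the method boundaries, and the modelling of row keys as non-negative integers are all assumptions made for this example.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits a numeric key range into equal parts (not HBase code). */
final class RegionSplitSketch {

    /**
     * Returns numSplits + 1 boundaries from start to end (both included).
     * Each consecutive pair of boundaries would back one input split.
     */
    static List<BigInteger> boundaries(BigInteger start, BigInteger end, int numSplits) {
        if (numSplits < 1 || end.compareTo(start) <= 0) {
            throw new IllegalArgumentException("need numSplits >= 1 and start < end");
        }
        BigInteger range = end.subtract(start);
        BigInteger parts = BigInteger.valueOf(numSplits);
        List<BigInteger> result = new ArrayList<>();
        for (int i = 0; i <= numSplits; i++) {
            // start + range * i / numSplits, using integer division
            BigInteger offset = range.multiply(BigInteger.valueOf(i)).divide(parts);
            result.add(start.add(offset));
        }
        return result;
    }
}
```

With start 0, end 100 and numSplits 4, the boundaries are 0, 25, 50, 75 and 100, which corresponds to four sub-ranges for a single region.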
-
SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_KEY
Whether to calculate the block location for splits. Defaults to true. If the computing layer runs outside of the HBase cluster, block locality does not matter, so setting this value to false skips the calculation and saves some time. The access modifier is "public" so that this field can be accessed by test classes of both org.apache.hadoop.hbase.mapred and org.apache.hadoop.hbase.mapreduce.
-
SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_DEFAULT
-
SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION
Whether to derive the snapshot region location from the region location recorded in meta. This is much faster than computing block locations for splits.
-
SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION_DEFAULT
-
SNAPSHOT_INPUTFORMAT_ROW_LIMIT_PER_INPUTSPLIT
In some scenarios, scans only a limited number of rows on each InputSplit, for example to extract a data sample.
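The effect of a per-split row limit can be pictured as a reader that stops early. The sketch below is a minimal illustration under stated assumptions, not the HBase RecordReader: RowLimitedReader is a hypothetical class, rows are modelled as a plain Iterator, and treating a limit of zero or less as "no limit" is an assumption of this example.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustrative only: wraps a row iterator and stops after rowLimit rows. */
final class RowLimitedReader<T> implements Iterator<T> {
    private final Iterator<T> rows;
    private final int rowLimit; // assumption: <= 0 means "no limit"
    private int returned;

    RowLimitedReader(Iterator<T> rows, int rowLimit) {
        this.rows = rows;
        this.rowLimit = rowLimit;
    }

    @Override
    public boolean hasNext() {
        boolean underLimit = rowLimit <= 0 || returned < rowLimit;
        return underLimit && rows.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException("row limit reached or no more rows");
        }
        returned++;
        return rows.next();
    }
}
```

Wrapping each split's rows this way yields a bounded sample per split, so the total sample size grows with the number of splits rather than with the table size.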
-
SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED
Whether to enable scan metrics on the Scan. Defaults to true.
-
SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED_DEFAULT
-
SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE
The Scan.ReadType which should be set on the Scan to read the HBase Snapshot, default STREAM.
-
SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE_DEFAULT
-
-
Constructor Details
-
TableSnapshotInputFormatImpl
public TableSnapshotInputFormatImpl()
-
-
Method Details
-
getSplits
public static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(org.apache.hadoop.conf.Configuration conf) throws IOException - Throws:
IOException
-
getSplitAlgo
public static RegionSplitter.SplitAlgorithm getSplitAlgo(org.apache.hadoop.conf.Configuration conf) throws IOException - Throws:
IOException
-
getRegionInfosFromManifest
-
getSnapshotManifest
public static SnapshotManifest getSnapshotManifest(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path rootDir, org.apache.hadoop.fs.FileSystem fs) throws IOException - Throws:
IOException
-
extractScanFromConf
public static Scan extractScanFromConf(org.apache.hadoop.conf.Configuration conf) throws IOException - Throws:
IOException
-
getSplits
public static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(Scan scan, SnapshotManifest manifest, List<RegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf) throws IOException - Throws:
IOException
-
getSplits
public static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(Scan scan, SnapshotManifest manifest, List<RegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf, RegionSplitter.SplitAlgorithm sa, int numSplits) throws IOException - Throws:
IOException
-
calculateLocationsForInputSplit
private static List<String> calculateLocationsForInputSplit(org.apache.hadoop.conf.Configuration conf, TableDescriptor htd, RegionInfo hri, org.apache.hadoop.fs.Path tableDir) throws IOException
Computes block locations for snapshot files (which resolves the locations of the referenced HFiles), but only when localityEnabled is true.
- Throws:
IOException
-
getBestLocations
private static List<String> getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution, int numTopsAtMost)
Computes the locations to be passed from the InputSplit. MR/YARN schedulers do not take weights into account, so they treat every location passed from the input split as equal. We do not want to blindly pass all the locations: we create one split per region, and a region's blocks are distributed throughout the cluster unless favored node assignment is used. In the expected stable case, only one location will hold most of the blocks locally. With favored node assignment, by contrast, 3 nodes will hold highly local blocks. The heuristic used here is simple: pass every host that has at least 80% (hbase.tablesnapshotinputformat.locality.cutoff.multiplier) as much block locality as the host with the best locality. At most numTopsAtMost locations are returned if more hosts qualify.
-
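The cutoff heuristic described above can be sketched in plain Java. This is an illustration under stated assumptions, not the HBase source: BestLocationsSketch is a hypothetical class, a Map from host name to locally stored bytes stands in for HDFSBlocksDistribution, and the tie-break by host name is added only to make the output deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: keeps hosts whose locality is close to the best host's. */
final class BestLocationsSketch {

    /**
     * @param hostWeights      host name to bytes of local blocks (stand-in for HDFSBlocksDistribution)
     * @param cutoffMultiplier fraction of the top weight a host must reach, e.g. 0.8
     * @param numTopsAtMost    upper bound on the number of hosts returned
     */
    static List<String> bestLocations(Map<String, Long> hostWeights,
                                      double cutoffMultiplier,
                                      int numTopsAtMost) {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(hostWeights.entrySet());
        // Heaviest host first; equal weights are ordered by host name.
        sorted.sort((a, b) -> {
            int byWeight = Long.compare(b.getValue(), a.getValue());
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        if (sorted.isEmpty() || numTopsAtMost <= 0) {
            return result;
        }
        long topWeight = sorted.get(0).getValue();
        if (topWeight <= 0) {
            return result; // no host holds any local data
        }
        double threshold = topWeight * cutoffMultiplier;
        for (Map.Entry<String, Long> entry : sorted) {
            // Stop at the cap or at the first host below the cutoff.
            if (result.size() >= numTopsAtMost || entry.getValue() < threshold) {
                break;
            }
            result.add(entry.getKey());
        }
        return result;
    }
}
```

Because the scheduler treats all returned hosts as equally good, dropping hosts below the cutoff keeps tasks from being placed on nodes that hold only a small share of the region's data.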
getBestLocations
public static List<String> getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution)
-
getSnapshotName
-
setInput
public static void setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir) throws IOException
Configures the job to use TableSnapshotInputFormat to read from a snapshot.
- Parameters:
conf - the job to configure
snapshotName - the name of the snapshot to read from
restoreDir - a temporary directory to restore the snapshot into. The current user should have write permissions to this directory, and it should not be a subdirectory of rootdir. After the job is finished, restoreDir can be deleted.
- Throws:
IOException - if an error occurs
-
setInput
public static void setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion) throws IOException
Configures the job to use TableSnapshotInputFormat to read from a snapshot.
- Parameters:
conf - the job to configure
snapshotName - the name of the snapshot to read from
restoreDir - a temporary directory to restore the snapshot into. The current user should have write permissions to this directory, and it should not be a subdirectory of rootdir. After the job is finished, restoreDir can be deleted.
splitAlgo - the SplitAlgorithm to be used when generating InputSplits
numSplitsPerRegion - how many input splits to generate per region
- Throws:
IOException - if an error occurs
-
cleanRestoreDir
public static void cleanRestoreDir(org.apache.hadoop.mapreduce.Job job, String snapshotName) throws IOException
Cleans the restore directory after a snapshot scan job.
- Parameters:
job - the snapshot scan job
snapshotName - the name of the snapshot to read from
- Throws:
IOException - if an error occurs
-