Class TableMapReduceUtil
java.lang.Object
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
Utility for TableMapper and TableReducer.
-
Field Summary
-
Constructor Summary
-
Method Summary
- static void addDependencyJars(org.apache.hadoop.mapreduce.Job job)
  Add the HBase dependency jars as well as jars for any of the configured job classes to the job configuration, so that JobClient will ship them to the cluster and add them to the DistributedCache.
- static void addDependencyJarsForClasses(org.apache.hadoop.conf.Configuration conf, Class<?>... classes)
  Add the jars containing the given classes to the job's configuration such that JobClient will ship them to the cluster and add them to the DistributedCache.
- static void addHBaseDependencyJars(org.apache.hadoop.conf.Configuration conf)
  Add HBase and its dependencies (only) to the job configuration.
- private static void addTokenForJob(IOExceptionSupplier<Connection> connSupplier, User user, org.apache.hadoop.mapreduce.Job job)
- static String buildDependencyClasspath(org.apache.hadoop.conf.Configuration conf)
  Returns a classpath string built from the content of the "tmpjars" value in conf.
- static String convertScanToString(Scan scan)
  Writes the given scan into a Base64 encoded string.
- static Scan convertStringToScan(String base64)
  Converts the given Base64 string back into a Scan instance.
- private static String findContainingJar(Class<?> my_class, Map<String, String> packagedClasses)
  Find a jar that contains a class of the same name, if any.
- private static org.apache.hadoop.fs.Path findOrCreateJar(Class<?> my_class, org.apache.hadoop.fs.FileSystem fs, Map<String, String> packagedClasses)
  Finds the Jar for a class or creates it if it doesn't exist.
- private static Class<? extends org.apache.hadoop.mapreduce.InputFormat> getConfiguredInputFormat(org.apache.hadoop.mapreduce.Job job)
- private static String getJar(Class<?> my_class)
  Invoke 'getJar' on a custom JarFinder implementation.
- private static int getRegionCount(org.apache.hadoop.conf.Configuration conf, TableName tableName)
- static void initCredentials(org.apache.hadoop.mapreduce.Job job)
- static void initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.conf.Configuration conf)
  Obtain an authentication token, for the specified cluster, on behalf of the current user and add it to the credentials for the given map reduce job.
- static void initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.conf.Configuration conf, URI uri)
  Obtain an authentication token, for the specified cluster, on behalf of the current user and add it to the credentials for the given map reduce job.
- static void initMultiTableSnapshotMapperJob(Map<String, Collection<Scan>> snapshotScans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir)
  Sets up the job for reading from one or more table snapshots, with one or more scans per snapshot.
- static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job)
  Use this before submitting a TableMap job.
- static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars)
  Use this before submitting a TableMap job.
- static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass)
  Use this before submitting a TableMap job.
- static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job)
  Use this before submitting a TableMap job.
- static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars)
  Use this before submitting a TableMap job.
- static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass)
  Use this before submitting a TableMap job.
- static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass)
  Use this before submitting a TableMap job.
- static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job)
  Use this before submitting a Multi TableMap job.
- static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars)
  Use this before submitting a Multi TableMap job.
- static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials)
  Use this before submitting a Multi TableMap job.
- static void initTableMapperJob(TableName table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job)
  Use this before submitting a TableMap job.
- static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job)
  Use this before submitting a TableReduce job.
- static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner)
  Use this before submitting a TableReduce job.
- static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress)
  Deprecated. Since 3.0.0, will be removed in 4.0.0.
- static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, boolean addDependencyJars)
  Deprecated. Since 3.0.0, will be removed in 4.0.0.
- static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl)
  Deprecated. Since 2.5.9, 2.6.1, 2.7.0, will be removed in 4.0.0.
- static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl, boolean addDependencyJars)
  Deprecated. Since 2.5.9, 2.6.1, 2.7.0, will be removed in 4.0.0.
- static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, URI outputCluster)
  Use this before submitting a TableReduce job.
- static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, URI outputCluster, boolean addDependencyJars)
  Use this before submitting a TableReduce job.
- private static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, IOExceptionRunnable setOutputCluster, boolean addDependencyJars)
- static void initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir)
  Sets up the job for reading from a table snapshot.
- static void initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion)
  Sets up the job for reading from a table snapshot.
- static void limitNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job)
  Ensures that the given number of reduce tasks for the given job configuration does not exceed the number of regions for the given table.
- static void resetCacheConfig(org.apache.hadoop.conf.Configuration conf)
  Enable a basic on-heap cache for these jobs.
- static void setNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job)
  Sets the number of reduce tasks for the given job configuration to the number of regions the given table has.
- static void setScannerCaching(org.apache.hadoop.mapreduce.Job job, int batchSize)
  Sets the number of rows to return and cache with each scanner iteration.
- private static void updateMap(String jar, Map<String, String> packagedClasses)
  Add entries to packagedClasses corresponding to class files contained in jar.
-
Field Details
-
LOG
-
TABLE_INPUT_CLASS_KEY
-
-
Constructor Details
-
TableMapReduceUtil
public TableMapReduceUtil()
-
-
Method Details
-
initTableMapperJob
public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException

Use this before submitting a TableMap job. It will appropriately set up the job.
- Parameters:
  table - The table name to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
- Throws:
  IOException - When setting up the details fails.
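For orientation, here is a minimal map-only job built around this overload. The table name "mytable", the job name, and the identity MyMapper are placeholders; the rest is standard HBase and Hadoop API:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobSetup {

  /** Identity mapper: re-emits each row key with its Result. */
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml
    Job job = Job.getInstance(conf, "scan-mytable");
    job.setJarByClass(ScanJobSetup.class);

    Scan scan = new Scan();
    scan.setCaching(500);       // larger scanner caching helps MR throughput
    scan.setCacheBlocks(false); // MR scans should not churn the block cache

    TableMapReduceUtil.initTableMapperJob(
      "mytable",                    // hypothetical table name
      scan,
      MyMapper.class,
      ImmutableBytesWritable.class, // mapper output key class
      Result.class,                 // mapper output value class
      job);

    job.setNumReduceTasks(0);       // map-only
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```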
-
initTableMapperJob
public static void initTableMapperJob(TableName table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException

Use this before submitting a TableMap job. It will appropriately set up the job.
- Parameters:
  table - The table name to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
- Throws:
  IOException - When setting up the details fails.
-
initTableMapperJob
public static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException

Use this before submitting a TableMap job. It will appropriately set up the job.
- Parameters:
  table - Binary representation of the table name to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
- Throws:
  IOException - When setting up the details fails.
-
initTableMapperJob
public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass) throws IOException

Use this before submitting a TableMap job. It will appropriately set up the job.
- Parameters:
  table - The table name to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
- Throws:
  IOException - When setting up the details fails.
-
initTableMapperJob
public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass) throws IOException

Use this before submitting a TableMap job. It will appropriately set up the job.
- Parameters:
  table - The table name to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
  initCredentials - whether to initialize hbase auth credentials for the job
  inputFormatClass - the input format
- Throws:
  IOException - When setting up the details fails.
-
initTableMapperJob
public static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass) throws IOException

Use this before submitting a TableMap job. It will appropriately set up the job.
- Parameters:
  table - Binary representation of the table name to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
  inputFormatClass - The class of the input format
- Throws:
  IOException - When setting up the details fails.
-
initTableMapperJob
public static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars) throws IOException

Use this before submitting a TableMap job. It will appropriately set up the job.
- Parameters:
  table - Binary representation of the table name to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
- Throws:
  IOException - When setting up the details fails.
-
getConfiguredInputFormat
private static Class<? extends org.apache.hadoop.mapreduce.InputFormat> getConfiguredInputFormat(org.apache.hadoop.mapreduce.Job job)
- Returns:
  TableInputFormat.class unless Configuration has something else at TABLE_INPUT_CLASS_KEY.
-
initTableMapperJob
public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars) throws IOException

Use this before submitting a TableMap job. It will appropriately set up the job.
- Parameters:
  table - The table name to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
- Throws:
  IOException - When setting up the details fails.
-
resetCacheConfig
public static void resetCacheConfig(org.apache.hadoop.conf.Configuration conf)

Enable a basic on-heap cache for these jobs. Any BlockCache implementation based on direct memory will likely cause the map tasks to OOM when opening the region. This is done here instead of in TableSnapshotRegionRecordReader in case an advanced user wants to override this behavior in their job.
-
initMultiTableSnapshotMapperJob
public static void initMultiTableSnapshotMapperJob(Map<String, Collection<Scan>> snapshotScans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir) throws IOException

Sets up the job for reading from one or more table snapshots, with one or more scans per snapshot. It bypasses hbase servers and reads directly from snapshot files.
- Parameters:
  snapshotScans - map of snapshot name to scans on that snapshot.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
- Throws:
  IOException
-
initTableSnapshotMapperJob
public static void initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir) throws IOException

Sets up the job for reading from a table snapshot. It bypasses hbase servers and reads directly from snapshot files.
- Parameters:
  snapshotName - The name of the snapshot (of a table) to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
  tmpRestoreDir - a temporary directory to copy the snapshot files into. The current user should have write permissions to this directory, and it should not be a subdirectory of rootdir. After the job is finished, the restore directory can be deleted.
- Throws:
  IOException - When setting up the details fails.
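A sketch of the snapshot variant, assuming a snapshot named "my_snapshot" and reusing the MyMapper placeholder from the earlier example; the restore path is likewise illustrative:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotJobSetup {
  public static Job create() throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "snapshot-scan");
    job.setJarByClass(SnapshotJobSetup.class);
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "my_snapshot",                      // hypothetical snapshot name
        new Scan(),
        ScanJobSetup.MyMapper.class,        // mapper from the earlier sketch
        ImmutableBytesWritable.class,
        Result.class,
        job,
        true,                               // addDependencyJars
        new Path("/tmp/snapshot-restore")); // writable; not under the HBase rootdir
    return job;
  }
}
```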
-
initTableSnapshotMapperJob
public static void initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion) throws IOException

Sets up the job for reading from a table snapshot. It bypasses hbase servers and reads directly from snapshot files.
- Parameters:
  snapshotName - The name of the snapshot (of a table) to read from.
  scan - The scan instance with the columns, time range etc.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
  tmpRestoreDir - a temporary directory to copy the snapshot files into. The current user should have write permissions to this directory, and it should not be a subdirectory of rootdir. After the job is finished, the restore directory can be deleted.
  splitAlgo - algorithm to split
  numSplitsPerRegion - how many input splits to generate per region
- Throws:
  IOException - When setting up the details fails.
-
initTableMapperJob
public static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException

Use this before submitting a Multi TableMap job. It will appropriately set up the job.
- Parameters:
  scans - The list of Scan objects to read from.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
- Throws:
  IOException - When setting up the details fails.
-
initTableMapperJob
public static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars) throws IOException

Use this before submitting a Multi TableMap job. It will appropriately set up the job.
- Parameters:
  scans - The list of Scan objects to read from.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
- Throws:
  IOException - When setting up the details fails.
-
initTableMapperJob
public static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials) throws IOException

Use this before submitting a Multi TableMap job. It will appropriately set up the job.
- Parameters:
  scans - The list of Scan objects to read from.
  mapper - The mapper class to use.
  outputKeyClass - The class of the output key.
  outputValueClass - The class of the output value.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
  initCredentials - whether to initialize hbase auth credentials for the job
- Throws:
  IOException - When setting up the details fails.
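The multi-table overloads rely on each Scan carrying its target table name as a scan attribute (Scan.SCAN_ATTRIBUTES_TABLE_NAME); a sketch with two hypothetical tables, reusing the placeholder mapper from above:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class MultiScanSetup {
  public static void configure(Job job) throws Exception {
    List<Scan> scans = new ArrayList<>();
    for (String tableName : new String[] { "table1", "table2" }) { // hypothetical tables
      Scan scan = new Scan();
      // each Scan must name its table for the multi-table input format
      scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(tableName));
      scans.add(scan);
    }
    TableMapReduceUtil.initTableMapperJob(
        scans, ScanJobSetup.MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
  }
}
```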
-
addTokenForJob
private static void addTokenForJob(IOExceptionSupplier<Connection> connSupplier, User user, org.apache.hadoop.mapreduce.Job job) throws IOException, InterruptedException
- Throws:
  IOException
  InterruptedException
-
initCredentials
public static void initCredentials(org.apache.hadoop.mapreduce.Job job) throws IOException
- Throws:
  IOException
-
initCredentialsForCluster
public static void initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.conf.Configuration conf) throws IOException

Obtain an authentication token, for the specified cluster, on behalf of the current user and add it to the credentials for the given map reduce job.
- Parameters:
  job - The job that requires the permission.
  conf - The configuration to use in connecting to the peer cluster.
- Throws:
  IOException - When the authentication token cannot be obtained.
-
initCredentialsForCluster
public static void initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.conf.Configuration conf, URI uri) throws IOException

Obtain an authentication token, for the specified cluster, on behalf of the current user and add it to the credentials for the given map reduce job.
- Parameters:
  job - The job that requires the permission.
  conf - The configuration to use in connecting to the peer cluster.
  uri - The connection uri for the given peer cluster.
- Throws:
  IOException - When the authentication token cannot be obtained.
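A hedged sketch of obtaining a token for a peer cluster; the hbase+zk:// connection URI form is an assumption and should be replaced with whatever your cluster's connection registry accepts:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class PeerCredentials {
  public static void addPeerToken(Job job) throws Exception {
    // configuration for the peer cluster; in practice loaded from its hbase-site.xml
    Configuration peerConf = HBaseConfiguration.create();
    // connection URI for the peer; the scheme below is an assumption
    URI peerUri = new URI("hbase+zk://zk1,zk2,zk3:2181/hbase");
    TableMapReduceUtil.initCredentialsForCluster(job, peerConf, peerUri);
  }
}
```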
-
convertScanToString
public static String convertScanToString(Scan scan) throws IOException

Writes the given scan into a Base64 encoded string.
- Parameters:
  scan - The scan to write out.
- Returns:
  The scan saved in a Base64 encoded string.
- Throws:
  IOException - When writing the scan fails.
-
convertStringToScan
public static Scan convertStringToScan(String base64) throws IOException

Converts the given Base64 string back into a Scan instance.
- Parameters:
  base64 - The scan details.
- Returns:
  The newly created Scan instance.
- Throws:
  IOException - When reading the scan instance fails.
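The two conversion methods are inverses; this is how a Scan typically travels through the job configuration (under TableInputFormat.SCAN). The column family below is a placeholder:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class ScanSerde {
  public static void roundTrip(Job job) throws Exception {
    Scan scan = new Scan().addFamily(Bytes.toBytes("cf")); // hypothetical family
    String encoded = TableMapReduceUtil.convertScanToString(scan);
    // how the scan is stored in the job configuration...
    job.getConfiguration().set(TableInputFormat.SCAN, encoded);
    // ...and how a task would get it back
    Scan decoded = TableMapReduceUtil.convertStringToScan(
        job.getConfiguration().get(TableInputFormat.SCAN));
  }
}
```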
-
initTableReducerJob
public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job) throws IOException

Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
- Parameters:
  table - The output table.
  reducer - The reducer class to use.
  job - The current job to adjust.
- Throws:
  IOException - When determining the region count fails.
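A sketch of a reducer wired up with this method; the table "outputtable", the family "cf", and the qualifier "count" are placeholders:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ReduceToTable {
  /** Sums counts per key and writes one Put per key to the output table. */
  static class MyReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(sum));
      context.write(null, put); // TableOutputFormat ignores the key
    }
  }

  public static void configure(Job job) throws IOException {
    // also sets the reducer class and the output format on the job
    TableMapReduceUtil.initTableReducerJob("outputtable", MyReducer.class, job);
  }
}
```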
-
initTableReducerJob
public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner) throws IOException

Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
- Parameters:
  table - The output table.
  reducer - The reducer class to use.
  job - The current job to adjust.
  partitioner - Partitioner to use. Pass null to use default partitioner.
- Throws:
  IOException - When determining the region count fails.
-
initTableReducerJob
@Deprecated
public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress) throws IOException

Deprecated. Since 3.0.0, will be removed in 4.0.0. Use initTableReducerJob(String, Class, Job, Class, URI) instead, where we use the connection uri to specify the target cluster.

Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
- Parameters:
  table - The output table.
  reducer - The reducer class to use.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  partitioner - Partitioner to use. Pass null to use default partitioner.
  quorumAddress - Distant cluster to write to; default is null for output to the cluster that is designated in hbase-site.xml. Set this String to the zookeeper ensemble of an alternate remote cluster when you would have the reduce write to a cluster other than the default; e.g. when copying tables between clusters, the source would be designated by hbase-site.xml and this param would have the ensemble address of the remote cluster. The format to pass is particular: pass <hbase.zookeeper.quorum>:<hbase.zookeeper.client.port>:<zookeeper.znode.parent>, such as server,server2,server3:2181:/hbase.
- Throws:
  IOException - When determining the region count fails.
-
initTableReducerJob
@Deprecated
public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, boolean addDependencyJars) throws IOException

Deprecated. Since 3.0.0, will be removed in 4.0.0. Use initTableReducerJob(String, Class, Job, Class, URI, boolean) instead, where we use the connection uri to specify the target cluster.

Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
- Parameters:
  table - The output table.
  reducer - The reducer class to use.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  partitioner - Partitioner to use. Pass null to use default partitioner.
  quorumAddress - Distant cluster to write to; default is null for output to the cluster that is designated in hbase-site.xml. Set this String to the zookeeper ensemble of an alternate remote cluster when you would have the reduce write to a cluster other than the default; e.g. when copying tables between clusters, the source would be designated by hbase-site.xml and this param would have the ensemble address of the remote cluster. The format to pass is particular: pass <hbase.zookeeper.quorum>:<hbase.zookeeper.client.port>:<zookeeper.znode.parent>, such as server,server2,server3:2181:/hbase.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
- Throws:
  IOException - When determining the region count fails.
-
initTableReducerJob
public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, URI outputCluster) throws IOException

Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
- Parameters:
  table - The output table.
  reducer - The reducer class to use.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  partitioner - Partitioner to use. Pass null to use default partitioner.
  outputCluster - The HBase cluster you want to write to. Default is null, which means output to the same cluster you read from, i.e., the cluster specified when initializing the job's Configuration instance.
- Throws:
  IOException - When determining the region count fails.
-
initTableReducerJob
public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, URI outputCluster, boolean addDependencyJars) throws IOException

Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
- Parameters:
  table - The output table.
  reducer - The reducer class to use.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  partitioner - Partitioner to use. Pass null to use default partitioner.
  outputCluster - The HBase cluster you want to write to. Default is null, which means output to the same cluster you read from, i.e., the cluster specified when initializing the job's Configuration instance.
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
- Throws:
  IOException - When determining the region count fails.
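A sketch of writing to a remote cluster via a connection URI, reusing the placeholder reducer above; as before, the URI form is an assumption:

```java
import java.net.URI;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class RemoteReduceSetup {
  public static void configure(Job job) throws Exception {
    TableMapReduceUtil.initTableReducerJob(
        "outputtable",                 // hypothetical table on the remote cluster
        ReduceToTable.MyReducer.class, // reducer from the earlier sketch
        job,
        null,                          // default partitioner
        new URI("hbase+zk://remote-zk:2181/hbase"), // assumed URI form
        true);                         // ship dependency jars
  }
}
```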
-
initTableReducerJob
private static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, IOExceptionRunnable setOutputCluster, boolean addDependencyJars) throws IOException
- Throws:
  IOException
-
initTableReducerJob
@Deprecated
public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl) throws IOException

Deprecated. Since 2.5.9, 2.6.1, 2.7.0, will be removed in 4.0.0. The serverClass and serverImpl do not take effect any more, just use initTableReducerJob(String, Class, Job, Class, String) instead.

Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
- Parameters:
  table - The output table.
  reducer - The reducer class to use.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  partitioner - Partitioner to use. Pass null to use default partitioner.
  quorumAddress - Distant cluster to write to; default is null for output to the cluster that is designated in hbase-site.xml. Set this String to the zookeeper ensemble of an alternate remote cluster when you would have the reduce write to a cluster other than the default; e.g. when copying tables between clusters, the source would be designated by hbase-site.xml and this param would have the ensemble address of the remote cluster. The format to pass is particular: pass <hbase.zookeeper.quorum>:<hbase.zookeeper.client.port>:<zookeeper.znode.parent>, such as server,server2,server3:2181:/hbase.
  serverClass - redefined hbase.regionserver.class
  serverImpl - redefined hbase.regionserver.impl
- Throws:
  IOException - When determining the region count fails.
-
initTableReducerJob
@Deprecated
public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl, boolean addDependencyJars) throws IOException

Deprecated. Since 2.5.9, 2.6.1, 2.7.0, will be removed in 4.0.0. The serverClass and serverImpl do not take effect any more, just use initTableReducerJob(String, Class, Job, Class, String, boolean) instead.

Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
- Parameters:
  table - The output table.
  reducer - The reducer class to use.
  job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
  partitioner - Partitioner to use. Pass null to use default partitioner.
  quorumAddress - Distant cluster to write to; default is null for output to the cluster that is designated in hbase-site.xml. Set this String to the zookeeper ensemble of an alternate remote cluster when you would have the reduce write to a cluster other than the default; e.g. when copying tables between clusters, the source would be designated by hbase-site.xml and this param would have the ensemble address of the remote cluster. The format to pass is particular: pass <hbase.zookeeper.quorum>:<hbase.zookeeper.client.port>:<zookeeper.znode.parent>, such as server,server2,server3:2181:/hbase.
  serverClass - redefined hbase.regionserver.class
  serverImpl - redefined hbase.regionserver.impl
  addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
- Throws:
  IOException - When determining the region count fails.
-
limitNumReduceTasks
public static void limitNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job) throws IOException

Ensures that the given number of reduce tasks for the given job configuration does not exceed the number of regions for the given table.
- Parameters:
  table - The table to get the region count for.
  job - The current job to adjust.
- Throws:
  IOException - When retrieving the table details fails.
-
setNumReduceTasks
public static void setNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job) throws IOException

Sets the number of reduce tasks for the given job configuration to the number of regions the given table has.
- Parameters:
  table - The table to get the region count for.
  job - The current job to adjust.
- Throws:
  IOException - When retrieving the table details fails.
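The two region-count helpers in combination, against a hypothetical output table:

```java
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
  public static void tune(Job job) throws Exception {
    // one reducer per region of the (hypothetical) output table...
    TableMapReduceUtil.setNumReduceTasks("outputtable", job);
    // ...or pick a count manually and clamp it to the region count
    job.setNumReduceTasks(32);
    TableMapReduceUtil.limitNumReduceTasks("outputtable", job);
  }
}
```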
-
setScannerCaching
public static void setScannerCaching(org.apache.hadoop.mapreduce.Job job, int batchSize)

Sets the number of rows to return and cache with each scanner iteration. Higher caching values will enable faster mapreduce jobs at the expense of requiring more heap to contain the cached rows.
- Parameters:
  job - The current job to adjust.
  batchSize - The number of rows to return in batch with each scanner iteration.
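A usage sketch; the value 500 is a common starting point, not a recommendation made by this class:

```java
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class CachingSetup {
  public static void tune(Job job) {
    // trade more mapper heap for fewer scanner round-trips
    TableMapReduceUtil.setScannerCaching(job, 500);
  }
}
```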
-
addHBaseDependencyJars
public static void addHBaseDependencyJars(org.apache.hadoop.conf.Configuration conf) throws IOException

Add HBase and its dependencies (only) to the job configuration.

This is intended as a low-level API, facilitating code reuse between this class and its mapred counterpart. It is also of use to external tools that need to build a MapReduce job that interacts with HBase but want fine-grained control over the jars shipped to the cluster.
- Parameters:
  conf - The Configuration object to extend with dependencies.
- Throws:
  IOException
-
buildDependencyClasspath
public static String buildDependencyClasspath(org.apache.hadoop.conf.Configuration conf)

Returns a classpath string built from the content of the "tmpjars" value in conf. Also exposed to shell scripts via `bin/hbase mapredcp`.
-
addDependencyJars
public static void addDependencyJars(org.apache.hadoop.mapreduce.Job job) throws IOException

Add the HBase dependency jars as well as jars for any of the configured job classes to the job configuration, so that JobClient will ship them to the cluster and add them to the DistributedCache.
- Throws:
  IOException
-
addDependencyJarsForClasses
@Private
public static void addDependencyJarsForClasses(org.apache.hadoop.conf.Configuration conf, Class<?>... classes) throws IOException

Add the jars containing the given classes to the job's configuration such that JobClient will ship them to the cluster and add them to the DistributedCache.

N.B. this method adds at most one jar per class given. If more than one jar is available containing a class with the same name as a given class, we don't define which of those jars might be chosen.
- Parameters:
  conf - The Hadoop Configuration to modify
  classes - will add just those dependencies needed to find the given classes
- Throws:
  IOException - if an underlying library call fails.
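A sketch of shipping an extra jar, parameterized on the class so no hypothetical class name is needed:

```java
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class ExtraJars {
  public static void ship(Job job, Class<?> customFilterClass) throws Exception {
    // ship the jar containing a custom filter (or any class the default
    // dependency scan would miss) to the cluster via tmpjars
    TableMapReduceUtil.addDependencyJarsForClasses(
        job.getConfiguration(), customFilterClass);
  }
}
```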
-
findOrCreateJar
private static org.apache.hadoop.fs.Path findOrCreateJar(Class<?> my_class, org.apache.hadoop.fs.FileSystem fs, Map<String, String> packagedClasses) throws IOException

Finds the Jar for a class or creates it if it doesn't exist. If the class is in a directory in the classpath, it creates a Jar on the fly with the contents of the directory and returns the path to that Jar. If a Jar is created, it is created in the system temporary directory. Otherwise, returns an existing jar that contains a class of the same name. Maintains a mapping from jar contents to the tmp jar created.
- Parameters:
  my_class - the class to find.
  fs - the FileSystem with which to qualify the returned path.
  packagedClasses - a map of class name to path.
- Returns:
  a jar file that contains the class.
- Throws:
  IOException
-
updateMap
private static void updateMap(String jar, Map<String, String> packagedClasses) throws IOException

Add entries to packagedClasses corresponding to class files contained in jar.
- Parameters:
  jar - The jar whose content to list.
  packagedClasses - map[class -> jar]
- Throws:
  IOException
-
findContainingJar
private static String findContainingJar(Class<?> my_class, Map<String, String> packagedClasses) throws IOException

Find a jar that contains a class of the same name, if any. It will return a jar file, even if that is not the first thing on the class path that has a class with the same name. Looks first on the classpath and then in the packagedClasses map.
- Parameters:
  my_class - the class to find.
- Returns:
  a jar file that contains the class, or null.
- Throws:
  IOException
-
getJar
private static String getJar(Class<?> my_class)

Invoke 'getJar' on a custom JarFinder implementation. Useful for some job configuration contexts (HBASE-8140) and also for testing on MRv2. Check if we have HADOOP-9426.
- Parameters:
  my_class - the class to find.
- Returns:
  a jar file that contains the class, or null.
-
getRegionCount
private static int getRegionCount(org.apache.hadoop.conf.Configuration conf, TableName tableName) throws IOException
- Throws:
  IOException
-