Class HFileOutputFormat2
java.lang.Object
org.apache.hadoop.mapreduce.OutputFormat<K,V>
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<ImmutableBytesWritable,Cell>
org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
- Direct Known Subclasses:
MultiTableHFileOutputFormat
@Public
public class HFileOutputFormat2
extends org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<ImmutableBytesWritable,Cell>
Writes HFiles. Passed Cells must arrive in order. Writes current time as the sequence id for the file. Sets the major compacted attribute on created HFiles. Calling write(null,null) will forcibly roll all HFiles being written.
Using this class as part of a MapReduce job is best done using configureIncrementalLoad(Job, TableDescriptor, RegionLocator).
-
Nested Class Summary
Modifier and TypeClassDescription(package private) static class
(package private) static class
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.Counter
-
Field Summary
Modifier and TypeFieldDescription(package private) static final String
(package private) static Function<ColumnFamilyDescriptor,
String> Serialize column family to block size map to configuration.(package private) static final String
(package private) static final String
(package private) static Function<ColumnFamilyDescriptor,
String> Serialize column family to bloom param map to configuration.(package private) static Function<ColumnFamilyDescriptor,
String> Serialize column family to bloom type map to configuration.(package private) static final String
static final String
(package private) static Function<ColumnFamilyDescriptor,
String> Serialize column family to compression algorithm map to configuration.(package private) static final String
static final String
(package private) static Function<ColumnFamilyDescriptor,
String> Serialize column family to data block encoding map to configuration.private static final boolean
(package private) static final boolean
static final String
ExtendedCell and ExtendedCellSerialization are InterfaceAudience.Private.static final String
Keep locality while generating HFiles for bulkload.private static final org.slf4j.Logger
(package private) static final String
(package private) static final String
static final String
static final String
static final String
static final String
static final String
static final String
protected static final byte[]
Fields inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
BASE_OUTPUT_NAME, COMPRESS, COMPRESS_CODEC, COMPRESS_TYPE, OUTDIR, PART
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
protected static byte[] | combineTableNameSuffix(byte[] tableName, byte[] suffix)
(package private) static void | configureIncrementalLoad(org.apache.hadoop.mapreduce.Job job, List<HFileOutputFormat2.TableInfo> multiTableInfo, Class<? extends org.apache.hadoop.mapreduce.OutputFormat<?, ?>> cls)
static void | configureIncrementalLoad(org.apache.hadoop.mapreduce.Job job, TableDescriptor tableDescriptor, RegionLocator regionLocator) | Configure a MapReduce Job to perform an incremental load into the given table.
static void | configureIncrementalLoad(org.apache.hadoop.mapreduce.Job job, Table table, RegionLocator regionLocator) | Configure a MapReduce Job to perform an incremental load into the given table.
static void | configureIncrementalLoadMap(org.apache.hadoop.mapreduce.Job job, TableDescriptor tableDescriptor)
(package private) static void | configurePartitioner(org.apache.hadoop.mapreduce.Job job, List<ImmutableBytesWritable> splitPoints, boolean writeMultipleTables) | Configure job with a TotalOrderPartitioner, partitioning against splitPoints.
static void | configureRemoteCluster(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.conf.Configuration clusterConf) | Configure the HBase cluster key of the remote cluster from which region locations are loaded when locality-sensitive mode is enabled.
(package private) static void | configureStoragePolicy(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.FileSystem fs, byte[] tableAndFamily, org.apache.hadoop.fs.Path cfPath) | Configure block storage policy for CF after the directory is created.
(package private) static Map<byte[],Integer> | createFamilyBlockSizeMap(org.apache.hadoop.conf.Configuration conf) | Runs inside the task to deserialize column family to block size map from the configuration.
(package private) static Map<byte[],String> | createFamilyBloomParamMap(org.apache.hadoop.conf.Configuration conf) | Runs inside the task to deserialize column family to bloom filter param map from the configuration.
(package private) static Map<byte[],BloomType> | createFamilyBloomTypeMap(org.apache.hadoop.conf.Configuration conf) | Runs inside the task to deserialize column family to bloom filter type map from the configuration.
(package private) static Map<byte[],Compression.Algorithm> | createFamilyCompressionMap(org.apache.hadoop.conf.Configuration conf) | Runs inside the task to deserialize column family to compression algorithm map from the configuration.
private static Map<byte[],String> | createFamilyConfValueMap(org.apache.hadoop.conf.Configuration conf, String confName) | Run inside the task to deserialize column family to given conf value map.
(package private) static Map<byte[],DataBlockEncoding> | createFamilyDataBlockEncodingMap(org.apache.hadoop.conf.Configuration conf) | Runs inside the task to deserialize column family to data block encoding type map from the configuration.
(package private) static <V extends Cell> org.apache.hadoop.mapreduce.RecordWriter<ImmutableBytesWritable,V> | createRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context, org.apache.hadoop.mapreduce.OutputCommitter committer)
org.apache.hadoop.mapreduce.RecordWriter<ImmutableBytesWritable,Cell> | getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
private static List<ImmutableBytesWritable> | getRegionStartKeys(List<RegionLocator> regionLocators, boolean writeMultipleTables) | Return the start keys of all of the regions in this table, as a list of ImmutableBytesWritable.
protected static byte[] | getTableNameSuffixedWithFamily(byte[] tableName, byte[] family)
private static void | mergeSerializations(org.apache.hadoop.conf.Configuration conf)
(package private) static String | serializeColumnFamilyAttribute(Function<ColumnFamilyDescriptor, String> fn, List<TableDescriptor> allTables)
private static void | writePartitions(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path partitionsPath, List<ImmutableBytesWritable> startKeys, boolean writeMultipleTables) | Write out a SequenceFile that can be read by TotalOrderPartitioner and that contains the split points in startKeys.
Methods inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
checkOutputSpecs, getCompressOutput, getDefaultWorkFile, getOutputCommitter, getOutputCompressorClass, getOutputName, getOutputPath, getPathForWorkFile, getUniqueFile, getWorkOutputPath, setCompressOutput, setOutputCompressorClass, setOutputName, setOutputPath
-
Field Details
-
LOG
-
tableSeparator
-
COMPRESSION_FAMILIES_CONF_KEY
- See Also:
-
BLOOM_TYPE_FAMILIES_CONF_KEY
- See Also:
-
BLOOM_PARAM_FAMILIES_CONF_KEY
- See Also:
-
BLOCK_SIZE_FAMILIES_CONF_KEY
- See Also:
-
DATABLOCK_ENCODING_FAMILIES_CONF_KEY
- See Also:
-
DATABLOCK_ENCODING_OVERRIDE_CONF_KEY
- See Also:
-
COMPRESSION_OVERRIDE_CONF_KEY
- See Also:
-
LOCALITY_SENSITIVE_CONF_KEY
Keep locality while generating HFiles for bulkload. See HBASE-12596.
- See Also:
-
DEFAULT_LOCALITY_SENSITIVE
- See Also:
-
OUTPUT_TABLE_NAME_CONF_KEY
- See Also:
-
MULTI_TABLE_HFILEOUTPUTFORMAT_CONF_KEY
- See Also:
-
EXTENDED_CELL_SERIALIZATION_ENABLED_KEY
ExtendedCell and ExtendedCellSerialization are InterfaceAudience.Private. We expose this config for internal usage in jobs like WALPlayer which need to use features of ExtendedCell.
- See Also:
-
EXTENDED_CELL_SERIALIZATION_ENABLED_DEFULT
- See Also:
-
REMOTE_CLUSTER_CONF_PREFIX
- See Also:
-
REMOTE_CLUSTER_ZOOKEEPER_QUORUM_CONF_KEY
- See Also:
-
REMOTE_CLUSTER_ZOOKEEPER_CLIENT_PORT_CONF_KEY
- See Also:
-
REMOTE_CLUSTER_ZOOKEEPER_ZNODE_PARENT_CONF_KEY
- See Also:
-
STORAGE_POLICY_PROPERTY
- See Also:
-
STORAGE_POLICY_PROPERTY_CF_PREFIX
- See Also:
-
compressionDetails
Serialize column family to compression algorithm map to configuration. Invoked while configuring the MR job for incremental load.
-
blockSizeDetails
Serialize column family to block size map to configuration. Invoked while configuring the MR job for incremental load.
-
bloomTypeDetails
Serialize column family to bloom type map to configuration. Invoked while configuring the MR job for incremental load.
-
bloomParamDetails
Serialize column family to bloom param map to configuration. Invoked while configuring the MR job for incremental load.
-
dataBlockEncodingDetails
Serialize column family to data block encoding map to configuration. Invoked while configuring the MR job for incremental load.
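The serializer functions above each turn one column family attribute into a string so that a per-family map can travel inside the job configuration. The following is a minimal, standalone sketch of that idea using only the JDK. It assumes a URL-encoded `family=value` list joined with `&`; this page does not specify the actual wire format, and the `Family` class and `serialize` helper are hypothetical stand-ins, not HBase types.

```java
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.function.Function;

/** Illustrative only: packs one attribute per column family into a single string. */
final class FamilyAttributeSerializerSketch {

  /** Hypothetical stand-in for ColumnFamilyDescriptor: a name and a block size. */
  static final class Family {
    final String name;
    final int blockSize;

    Family(String name, int blockSize) {
      this.name = name;
      this.blockSize = blockSize;
    }
  }

  /** Applies fn to every family and joins the URL-encoded name=value pairs with '&'. */
  static String serialize(Function<Family, String> fn, Iterable<Family> families) {
    StringBuilder out = new StringBuilder();
    try {
      for (Family family : families) {
        if (out.length() > 0) {
          out.append('&');
        }
        // Encoding both sides keeps '&' and '=' inside names or values unambiguous.
        out.append(URLEncoder.encode(family.name, "UTF-8"));
        out.append('=');
        out.append(URLEncoder.encode(fn.apply(family), "UTF-8"));
      }
    } catch (UnsupportedEncodingException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString();
  }
}
```

Passing a different function (for example one returning a compression name instead of a block size) would reuse the same packing for another attribute, which is the role the Function<ColumnFamilyDescriptor, String> fields play here.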
-
-
Constructor Details
-
HFileOutputFormat2
public HFileOutputFormat2()
-
-
Method Details
-
combineTableNameSuffix
-
getRecordWriter
public org.apache.hadoop.mapreduce.RecordWriter<ImmutableBytesWritable,Cell> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException, InterruptedException
- Specified by:
getRecordWriter in class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<ImmutableBytesWritable,Cell>
- Throws:
IOException
InterruptedException
-
getTableNameSuffixedWithFamily
-
createRecordWriter
static <V extends Cell> org.apache.hadoop.mapreduce.RecordWriter<ImmutableBytesWritable,V> createRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context, org.apache.hadoop.mapreduce.OutputCommitter committer) throws IOException
- Throws:
IOException
-
configureStoragePolicy
static void configureStoragePolicy(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.FileSystem fs, byte[] tableAndFamily, org.apache.hadoop.fs.Path cfPath)
Configure block storage policy for CF after the directory is created.
-
getRegionStartKeys
private static List<ImmutableBytesWritable> getRegionStartKeys(List<RegionLocator> regionLocators, boolean writeMultipleTables) throws IOException
Return the start keys of all of the regions in this table, as a list of ImmutableBytesWritable.
- Throws:
IOException
-
writePartitions
private static void writePartitions(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path partitionsPath, List<ImmutableBytesWritable> startKeys, boolean writeMultipleTables) throws IOException
Write out a SequenceFile that can be read by TotalOrderPartitioner and that contains the split points in startKeys.
- Throws:
IOException
-
configureIncrementalLoad
public static void configureIncrementalLoad(org.apache.hadoop.mapreduce.Job job, Table table, RegionLocator regionLocator) throws IOException
Configure a MapReduce Job to perform an incremental load into the given table. This:
- Inspects the table to configure a total order partitioner
- Uploads the partitions file to the cluster and adds it to the DistributedCache
- Sets the number of reduce tasks to match the current number of regions
- Sets the output key/value class to match HFileOutputFormat2's requirements
- Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
- Sets the HBase cluster key used to load region locations for locality-sensitive output
- Throws:
IOException
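As an illustration of where this call sits in a job driver, here is a hedged sketch. It needs the Hadoop and HBase client and MapReduce jars on the classpath. The class name, the `LineToPutMapper`, the `rowkey,value` input format, the `cf`/`q` column, and the three command-line arguments are all assumptions made for this example and are not part of HFileOutputFormat2.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hypothetical driver: args are <input dir> <HFile output dir> <table name>. */
public class BulkLoadPrepareSketch {

  /** Hypothetical mapper: turns "rowkey,value" text lines into Puts on cf:q. */
  public static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final byte[] FAMILY = Bytes.toBytes("cf");
    private static final byte[] QUALIFIER = Bytes.toBytes("q");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",", 2);
      if (parts.length != 2) {
        return; // skip malformed lines
      }
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.addColumn(FAMILY, QUALIFIER, Bytes.toBytes(parts[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hfile-prepare-sketch");
    job.setJarByClass(BulkLoadPrepareSketch.class);
    job.setMapperClass(LineToPutMapper.class);
    // Set the map output classes before configureIncrementalLoad so the
    // reducer can be chosen to match the map output value type.
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    TableName tableName = TableName.valueOf(args[2]);
    boolean succeeded;
    try (Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(tableName);
        RegionLocator locator = connection.getRegionLocator(tableName)) {
      // Configures partitioner, reducer count, output classes and reducer.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      succeeded = job.waitForCompletion(true);
    }
    System.exit(succeeded ? 0 : 1);
  }
}
```

The job only writes HFiles under the output directory; loading them into the table is a separate bulk-load step that this sketch does not cover.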
-
configureIncrementalLoad
public static void configureIncrementalLoad(org.apache.hadoop.mapreduce.Job job, TableDescriptor tableDescriptor, RegionLocator regionLocator) throws IOException
Configure a MapReduce Job to perform an incremental load into the given table. This:
- Inspects the table to configure a total order partitioner
- Uploads the partitions file to the cluster and adds it to the DistributedCache
- Sets the number of reduce tasks to match the current number of regions
- Sets the output key/value class to match HFileOutputFormat2's requirements
- Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
- Throws:
IOException
-
configureIncrementalLoad
static void configureIncrementalLoad(org.apache.hadoop.mapreduce.Job job, List<HFileOutputFormat2.TableInfo> multiTableInfo, Class<? extends org.apache.hadoop.mapreduce.OutputFormat<?, ?>> cls) throws IOException
- Throws:
IOException
-
mergeSerializations
-
configureIncrementalLoadMap
public static void configureIncrementalLoadMap(org.apache.hadoop.mapreduce.Job job, TableDescriptor tableDescriptor) throws IOException
- Throws:
IOException
-
configureRemoteCluster
public static void configureRemoteCluster(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.conf.Configuration clusterConf)
Configure the HBase cluster key of the remote cluster from which region locations are loaded when locality-sensitive mode is enabled. It is not necessary to call this method explicitly when the cluster key of the HBase cluster used to load region locations is already configured in the job configuration. Call this method when the job configuration carries the cluster key of a different HBase cluster. For example, call it when you load data from HBase cluster A using TableInputFormat and generate HFiles for HBase cluster B. Otherwise, HFileOutputFormat2 fetches region locations from cluster A and locality-sensitive output won't work correctly. configureIncrementalLoad(Job, Table, RegionLocator) calls this method using Table.getConfiguration() as clusterConf. See HBASE-25608.
- Parameters:
job - which has configuration to be updated
clusterConf - which contains cluster key of the HBase cluster to be locality-sensitive
- See Also:
-
createFamilyCompressionMap
@Private static Map<byte[],Compression.Algorithm> createFamilyCompressionMap(org.apache.hadoop.conf.Configuration conf)
Runs inside the task to deserialize column family to compression algorithm map from the configuration.
- Parameters:
conf - to read the serialized values from
- Returns:
a map from column family to the configured compression algorithm
-
createFamilyBloomTypeMap
@Private static Map<byte[],BloomType> createFamilyBloomTypeMap(org.apache.hadoop.conf.Configuration conf)
Runs inside the task to deserialize column family to bloom filter type map from the configuration.
- Parameters:
conf - to read the serialized values from
- Returns:
a map from column family to the configured bloom filter type
-
createFamilyBloomParamMap
@Private static Map<byte[],String> createFamilyBloomParamMap(org.apache.hadoop.conf.Configuration conf)
Runs inside the task to deserialize column family to bloom filter param map from the configuration.
- Parameters:
conf - to read the serialized values from
- Returns:
a map from column family to the configured bloom filter param
-
createFamilyBlockSizeMap
@Private static Map<byte[],Integer> createFamilyBlockSizeMap(org.apache.hadoop.conf.Configuration conf)
Runs inside the task to deserialize column family to block size map from the configuration.
- Parameters:
conf - to read the serialized values from
- Returns:
a map from column family to the configured block size
-
createFamilyDataBlockEncodingMap
@Private static Map<byte[],DataBlockEncoding> createFamilyDataBlockEncodingMap(org.apache.hadoop.conf.Configuration conf)
Runs inside the task to deserialize column family to data block encoding type map from the configuration.
- Parameters:
conf - to read the serialized values from
- Returns:
a map from column family to HFileDataBlockEncoder for the configured data block type for the family
-
createFamilyConfValueMap
private static Map<byte[],String> createFamilyConfValueMap(org.apache.hadoop.conf.Configuration conf, String confName)
Run inside the task to deserialize column family to given conf value map.
- Parameters:
conf - to read the serialized values from
confName - conf key to read from the configuration
- Returns:
a map of column family to the given configuration value
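To make the task-side direction concrete, the sketch below parses a serialized string back into a map keyed by column family bytes, using only the JDK. The `family=value&family=value` URL-encoded format is an assumption for illustration (this page does not document the real encoding), and the class and method names are hypothetical.

```java
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: rebuilds a family-to-value map from one serialized string. */
final class FamilyConfValueMapSketch {

  /** Unsigned lexicographic order, so byte[] keys are matched by content. */
  static final Comparator<byte[]> BY_CONTENT = FamilyConfValueMapSketch::compareBytes;

  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  /** Parses "k1=v1&k2=v2" (URL-encoded) into a sorted map keyed by family bytes. */
  static Map<byte[], String> parse(String serialized) {
    Map<byte[], String> result = new TreeMap<>(BY_CONTENT);
    if (serialized == null || serialized.isEmpty()) {
      return result;
    }
    try {
      for (String pair : serialized.split("&")) {
        String[] keyValue = pair.split("=");
        if (keyValue.length != 2) {
          continue; // ignore malformed entries rather than failing the task
        }
        byte[] family = URLDecoder.decode(keyValue[0], "UTF-8")
            .getBytes(StandardCharsets.UTF_8);
        result.put(family, URLDecoder.decode(keyValue[1], "UTF-8"));
      }
    } catch (UnsupportedEncodingException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }
}
```

A comparator-backed TreeMap is used because Java arrays do not define content-based equals or hashCode, so a HashMap keyed by byte[] would never find an entry by a freshly built key. The typed variants (block size, bloom type, and so on) could then convert each String value into its target type.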
-
configurePartitioner
static void configurePartitioner(org.apache.hadoop.mapreduce.Job job, List<ImmutableBytesWritable> splitPoints, boolean writeMultipleTables) throws IOException
Configure job with a TotalOrderPartitioner, partitioning against splitPoints. Cleans up the partitions file after the job exits.
- Throws:
IOException
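Partitioning against split points means that each row key is routed to the reducer whose key range contains it. The JDK-only sketch below shows that lookup as a binary search over sorted split points. It assumes N-1 sorted boundary keys for N partitions and that a key equal to a boundary belongs to the upper partition; the class and method names are hypothetical, and this is not the TotalOrderPartitioner implementation itself.

```java
import java.util.List;

/** Illustrative only: maps a key to a partition index using sorted split points. */
final class SplitPointPartitionSketch {

  /** Unsigned lexicographic comparison of two byte arrays. */
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  /**
   * Returns the number of split points that are less than or equal to key,
   * which is the partition index in the range [0, splitPoints.size()].
   * splitPoints must be sorted ascending by compareBytes.
   */
  static int partitionFor(byte[] key, List<byte[]> splitPoints) {
    int low = 0;
    int high = splitPoints.size();
    while (low < high) {
      int mid = (low + high) >>> 1;
      if (compareBytes(splitPoints.get(mid), key) <= 0) {
        low = mid + 1; // boundary is at or below key: key lies further right
      } else {
        high = mid;
      }
    }
    return low;
  }
}
```

With one partition per region, every reducer receives the keys of a single region's range, so each HFile it writes fits within that region.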
-
serializeColumnFamilyAttribute
@Private static String serializeColumnFamilyAttribute(Function<ColumnFamilyDescriptor, String> fn, List<TableDescriptor> allTables) throws UnsupportedEncodingException
- Throws:
UnsupportedEncodingException
-