Package org.apache.hadoop.hbase.tool
Class BulkLoadHFilesTool
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.hadoop.hbase.tool.BulkLoadHFilesTool
- All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, BulkLoadHFiles, org.apache.hadoop.util.Tool
@LimitedPrivate("Tools")
public class BulkLoadHFilesTool
extends org.apache.hadoop.conf.Configured
implements BulkLoadHFiles, org.apache.hadoop.util.Tool
The implementation of BulkLoadHFiles, which can also be executed from the command line as a tool.
-
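Because the class implements org.apache.hadoop.util.Tool, it can be driven programmatically as well as from the command line. A minimal sketch, assuming the Configuration-accepting constructor shown under Constructor Details; the command-line arguments are hypothetical and should match whatever usage() prints for your version (typically the HFileOutputFormat output directory followed by the table name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.tool.BulkLoadHFilesTool;
import org.apache.hadoop.util.ToolRunner;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Delegate argument parsing and execution to the tool itself, exactly as the
    // command-line entry point does.
    int exitCode = ToolRunner.run(conf, new BulkLoadHFilesTool(conf), args);
    System.exit(exitCode);
  }
}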
Nested Class Summary
private static interface  BulkLoadHFilesTool.BulkHFileVisitor&lt;TFamily&gt;
Nested classes/interfaces inherited from interface org.apache.hadoop.hbase.tool.BulkLoadHFiles
BulkLoadHFiles.LoadQueueItem
-
Field Summary
private boolean  assignSeqIds
static final String  BULK_LOAD_HFILES_BY_FAMILY
    HBASE-24221: support bulkLoadHFile by family to avoid long waits on bulkLoadHFile caused by compaction on the server side.
private boolean  bulkLoadByFamily
private String  bulkToken
private List<String>  clusterIds
private static final boolean  DEFAULT_LOCALITY_SENSITIVE
static final String  FAIL_IF_NEED_SPLIT_HFILE
private boolean  failIfNeedSplitHFile
private FsDelegationToken  fsDelegationToken
static final String  LOCALITY_SENSITIVE_CONF_KEY
    Keep locality while generating HFiles for bulkload.
private static final org.slf4j.Logger  LOG
private int  maxFilesPerRegionPerFamily
static final String  NAME
private int  nrThreads
private final AtomicInteger  numRetries
private boolean  replicate
(package private) static final String  TMP_DIR
private UserProvider  userProvider
private static final String  VALIDATE_HFILES
    Whether to run validation on hfiles before loading.
Fields inherited from interface org.apache.hadoop.hbase.tool.BulkLoadHFiles
ALWAYS_COPY_FILES, ASSIGN_SEQ_IDS, CREATE_TABLE_CONF_KEY, IGNORE_UNMATCHED_CF_CONF_KEY, MAX_FILES_PER_REGION_PER_FAMILY, RETRY_ON_IO_EXCEPTION
-
Constructor Summary
-
Method Summary
Map<BulkLoadHFiles.LoadQueueItem,ByteBuffer>  bulkLoad(TableName tableName, Map<byte[], List<org.apache.hadoop.fs.Path>> family2Files)
    Perform a bulk load of the given directory into the given pre-existing table.
Map<BulkLoadHFiles.LoadQueueItem,ByteBuffer>  bulkLoad(TableName tableName, org.apache.hadoop.fs.Path dir)
    Perform a bulk load of the given directory into the given pre-existing table.
protected void  bulkLoadPhase(AsyncClusterConnection conn, TableName tableName, Deque<BulkLoadHFiles.LoadQueueItem> queue, org.apache.hbase.thirdparty.com.google.common.collect.Multimap<ByteBuffer, BulkLoadHFiles.LoadQueueItem> regionGroups, boolean copyFiles, Map<BulkLoadHFiles.LoadQueueItem, ByteBuffer> item2RegionMap)
    This takes the LQIs grouped by likely regions and attempts to bulk load them.
private boolean  checkHFilesCountPerRegionPerFamily(org.apache.hbase.thirdparty.com.google.common.collect.Multimap<ByteBuffer, BulkLoadHFiles.LoadQueueItem> regionGroups)
private void  checkRegionIndexValid(int idx, List<Pair<byte[], byte[]>> startEndKeys, TableName tableName)
    Checks whether the region index is valid, detecting region holes and overlaps.
private void  cleanup(AsyncClusterConnection conn, TableName tableName, Deque<BulkLoadHFiles.LoadQueueItem> queue, ExecutorService pool)
private static void  copyHFileHalf(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inFile, org.apache.hadoop.fs.Path outFile, Reference reference, ColumnFamilyDescriptor familyDescriptor, AsyncTableRegionLocator loc)
    Copy half of an HFile into a new HFile with favored nodes.
private ExecutorService  createExecutorService()
private void  createTable(TableName tableName, org.apache.hadoop.fs.Path hfofDir, AsyncAdmin admin)
    If the table is created for the first time, then "completebulkload" reads the files twice.
void  disableReplication()
    Disables replication for all bulkloads done via this instance, when bulkload replication is configured.
private static void  discoverLoadQueue(org.apache.hadoop.conf.Configuration conf, Deque<BulkLoadHFiles.LoadQueueItem> ret, org.apache.hadoop.fs.Path hfofDir, boolean validateHFile)
    Walk the given directory for all HFiles, and return a Queue containing all such files.
private Map<BulkLoadHFiles.LoadQueueItem,ByteBuffer>  doBulkLoad(AsyncClusterConnection conn, TableName tableName, Map<byte[], List<org.apache.hadoop.fs.Path>> map, boolean silence, boolean copyFile)
    Perform a bulk load of the given map of families to hfiles into the given pre-existing table.
private Map<BulkLoadHFiles.LoadQueueItem,ByteBuffer>  doBulkLoad(AsyncClusterConnection conn, TableName tableName, org.apache.hadoop.fs.Path hfofDir, boolean silence, boolean copyFile)
    Perform a bulk load of the given directory into the given pre-existing table.
private int  getRegionIndex(List<Pair<byte[], byte[]>> startEndKeys, byte[] key)
private String  getUniqueName()
private Map<byte[],Collection<BulkLoadHFiles.LoadQueueItem>>  groupByFamilies(Collection<BulkLoadHFiles.LoadQueueItem> itemsInRegion)
protected Pair<List<BulkLoadHFiles.LoadQueueItem>,String>  groupOrSplit(AsyncClusterConnection conn, TableName tableName, org.apache.hbase.thirdparty.com.google.common.collect.Multimap<ByteBuffer, BulkLoadHFiles.LoadQueueItem> regionGroups, BulkLoadHFiles.LoadQueueItem item, List<Pair<byte[], byte[]>> startEndKeys)
    Attempt to assign the given load queue item into its target region group.
private Pair<org.apache.hbase.thirdparty.com.google.common.collect.Multimap<ByteBuffer,BulkLoadHFiles.LoadQueueItem>,Set<String>>  groupOrSplitPhase(AsyncClusterConnection conn, TableName tableName, ExecutorService pool, Deque<BulkLoadHFiles.LoadQueueItem> queue, List<Pair<byte[], byte[]>> startEndKeys)
static byte[][]  inferBoundaries(SortedMap<byte[], Integer> bdryMap)
    Infers region boundaries for a new table.
void  initialize()
private static StoreFileWriter  initStoreFileWriter(org.apache.hadoop.conf.Configuration conf, Cell cell, HFileContext hFileContext, CacheConfig cacheConf, BloomType bloomFilterType, org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path outFile, AsyncTableRegionLocator loc)
private boolean  isAlwaysCopyFiles()
private boolean  isCreateTable()
boolean  isReplicationDisabled()
    Returns true if replication has been disabled.
private boolean  isSilence()
void  loadHFileQueue(AsyncClusterConnection conn, TableName tableName, Deque<BulkLoadHFiles.LoadQueueItem> queue, boolean copyFiles)
    Used by the replication sink to load the hfiles from the source cluster.
static void  main(String[] args)
private Map<BulkLoadHFiles.LoadQueueItem,ByteBuffer>  performBulkLoad(AsyncClusterConnection conn, TableName tableName, Deque<BulkLoadHFiles.LoadQueueItem> queue, ExecutorService pool, boolean copyFile)
private static void  populateLoadQueue(Deque<BulkLoadHFiles.LoadQueueItem> ret, Map<byte[], List<org.apache.hadoop.fs.Path>> map)
    Populate the Queue with the given HFiles.
static void  prepareHFileQueue(org.apache.hadoop.conf.Configuration conf, AsyncClusterConnection conn, TableName tableName, org.apache.hadoop.fs.Path hfilesDir, Deque<BulkLoadHFiles.LoadQueueItem> queue, boolean validateHFile, boolean silence)
    Prepare a collection of LoadQueueItem from the list of source hfiles contained in the passed directory and validate that the prepared queue contains only valid table column families.
static void  prepareHFileQueue(AsyncClusterConnection conn, TableName tableName, Map<byte[], List<org.apache.hadoop.fs.Path>> map, Deque<BulkLoadHFiles.LoadQueueItem> queue, boolean silence)
    Prepare a collection of LoadQueueItem from the passed map of family to source hfiles and validate that the prepared queue contains only valid table column families.
int  run(String[] args)
void  setBulkToken(String bulkToken)
void  setClusterIds(List<String> clusterIds)
private static boolean  shouldCopyHFileMetaKey(byte[] key)
(package private) static void  splitStoreFile(AsyncTableRegionLocator loc, org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inFile, ColumnFamilyDescriptor familyDesc, byte[] splitKey, org.apache.hadoop.fs.Path bottomOut, org.apache.hadoop.fs.Path topOut)
    Split a storefile into a top and bottom half with favored nodes, maintaining the metadata, recreating bloom filters, etc.
private List<BulkLoadHFiles.LoadQueueItem>  splitStoreFile(AsyncTableRegionLocator loc, BulkLoadHFiles.LoadQueueItem item, TableDescriptor tableDesc, byte[] splitKey)
private void  tableExists(AsyncClusterConnection conn, TableName tableName)
private void  throwAndLogTableNotFoundException
protected CompletableFuture<Collection<BulkLoadHFiles.LoadQueueItem>>  tryAtomicRegionLoad(AsyncClusterConnection conn, TableName tableName, boolean copyFiles, byte[] first, Collection<BulkLoadHFiles.LoadQueueItem> lqis)
    Attempts to do an atomic load of many hfiles into a region.
private void  usage()
private static void  validateFamiliesInHFiles(TableDescriptor tableDesc, Deque<BulkLoadHFiles.LoadQueueItem> queue, boolean silence)
    Checks whether there is any invalid family name in HFiles to be bulk loaded.
private static <TFamily> void  visitBulkHFiles(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path bulkDir, BulkLoadHFilesTool.BulkHFileVisitor<TFamily> visitor, boolean validateHFile)
    Iterate over the bulkDir hfiles.
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
Field Details
-
LOG
-
LOCALITY_SENSITIVE_CONF_KEY
Keep locality while generating HFiles for bulkload. See HBASE-12596.
- See Also:
-
DEFAULT_LOCALITY_SENSITIVE
- See Also:
-
NAME
- See Also:
-
VALIDATE_HFILES
Whether to run validation on hfiles before loading.
- See Also:
-
BULK_LOAD_HFILES_BY_FAMILY
HBASE-24221: support bulkLoadHFile by family to avoid long waits on bulkLoadHFile caused by compaction on the server side.
- See Also:
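A minimal sketch of toggling these flags, assuming BULK_LOAD_HFILES_BY_FAMILY and LOCALITY_SENSITIVE_CONF_KEY are accessible boolean configuration keys (as their descriptions and the boolean default DEFAULT_LOCALITY_SENSITIVE suggest):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.tool.BulkLoadHFilesTool;

public class BulkLoadConfigExample {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Assumption: both constants name boolean configuration keys.
    conf.setBoolean(BulkLoadHFilesTool.BULK_LOAD_HFILES_BY_FAMILY, true);   // send bulk load requests per family
    conf.setBoolean(BulkLoadHFilesTool.LOCALITY_SENSITIVE_CONF_KEY, true);  // keep locality while generating HFiles
    return conf;
  }
}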
-
FAIL_IF_NEED_SPLIT_HFILE
- See Also:
-
TMP_DIR
- See Also:
-
maxFilesPerRegionPerFamily
-
assignSeqIds
-
bulkLoadByFamily
-
fsDelegationToken
-
userProvider
-
nrThreads
-
numRetries
-
bulkToken
-
clusterIds
-
replicate
-
failIfNeedSplitHFile
-
-
Constructor Details
-
BulkLoadHFilesTool
-
-
Method Details
-
initialize
-
createExecutorService
-
isCreateTable
-
isSilence
-
isAlwaysCopyFiles
-
shouldCopyHFileMetaKey
-
validateFamiliesInHFiles
private static void validateFamiliesInHFiles(TableDescriptor tableDesc, Deque<BulkLoadHFiles.LoadQueueItem> queue, boolean silence) throws IOException
Checks whether there is any invalid family name in HFiles to be bulk loaded.
- Throws:
IOException
-
populateLoadQueue
private static void populateLoadQueue(Deque<BulkLoadHFiles.LoadQueueItem> ret, Map<byte[], List<org.apache.hadoop.fs.Path>> map)
Populate the Queue with given HFiles.
-
visitBulkHFiles
private static <TFamily> void visitBulkHFiles(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path bulkDir, BulkLoadHFilesTool.BulkHFileVisitor<TFamily> visitor, boolean validateHFile) throws IOException
Iterate over the bulkDir hfiles. Skip references, HFileLinks, and files starting with "_". Invalid hfiles are checked and skipped by default; this validation can be skipped by setting VALIDATE_HFILES to false.
- Throws:
IOException
-
discoverLoadQueue
private static void discoverLoadQueue(org.apache.hadoop.conf.Configuration conf, Deque<BulkLoadHFiles.LoadQueueItem> ret, org.apache.hadoop.fs.Path hfofDir, boolean validateHFile) throws IOException
Walk the given directory for all HFiles, and return a Queue containing all such files.
- Throws:
IOException
-
prepareHFileQueue
public static void prepareHFileQueue(AsyncClusterConnection conn, TableName tableName, Map<byte[], List<org.apache.hadoop.fs.Path>> map, Deque<BulkLoadHFiles.LoadQueueItem> queue, boolean silence) throws IOException
Prepare a collection of LoadQueueItem from the passed map of family to source hfiles and validate that the prepared queue contains only valid table column families.
- Parameters:
map - map of family to List of hfiles
tableName - table to which hfiles should be loaded
queue - queue which needs to be loaded into the table
silence - true to ignore unmatched column families
- Throws:
IOException - If any I/O or network error occurred
-
prepareHFileQueue
public static void prepareHFileQueue(org.apache.hadoop.conf.Configuration conf, AsyncClusterConnection conn, TableName tableName, org.apache.hadoop.fs.Path hfilesDir, Deque<BulkLoadHFiles.LoadQueueItem> queue, boolean validateHFile, boolean silence) throws IOException
Prepare a collection of LoadQueueItem from the list of source hfiles contained in the passed directory and validate that the prepared queue contains only valid table column families.
- Parameters:
hfilesDir - directory containing the list of hfiles to be loaded into the table
queue - queue which needs to be loaded into the table
validateHFile - if true, hfiles will be validated for their format
silence - true to ignore unmatched column families
- Throws:
IOException - If any I/O or network error occurred
-
loadHFileQueue
public void loadHFileQueue(AsyncClusterConnection conn, TableName tableName, Deque<BulkLoadHFiles.LoadQueueItem> queue, boolean copyFiles) throws IOException
Used by the replication sink to load the hfiles from the source cluster.
- Parameters:
conn - Connection to use
tableName - Table to which these hfiles should be loaded
queue - LoadQueueItems with hfiles yet to be loaded
- Throws:
IOException
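A hedged sketch of the prepare-then-load flow used by the replication sink, assuming the caller already holds an internal AsyncClusterConnection and the Configuration-accepting constructor; the surrounding class and method names are hypothetical:

import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.AsyncClusterConnection;
import org.apache.hadoop.hbase.tool.BulkLoadHFiles;
import org.apache.hadoop.hbase.tool.BulkLoadHFilesTool;

public class HFileQueueLoader {
  static void prepareAndLoad(Configuration conf, AsyncClusterConnection conn, TableName table,
      Path hfilesDir) throws Exception {
    Deque<BulkLoadHFiles.LoadQueueItem> queue = new ArrayDeque<>();
    // Discover the HFiles under hfilesDir, validate their format (validateHFile = true)
    // and fail on unmatched column families (silence = false).
    BulkLoadHFilesTool.prepareHFileQueue(conf, conn, table, hfilesDir, queue, true, false);
    // Load everything that was queued; copyFiles = false does not force copying the HFiles.
    new BulkLoadHFilesTool(conf).loadHFileQueue(conn, table, queue, false);
  }
}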
-
tryAtomicRegionLoad
@Private
protected CompletableFuture<Collection<BulkLoadHFiles.LoadQueueItem>> tryAtomicRegionLoad(AsyncClusterConnection conn, TableName tableName, boolean copyFiles, byte[] first, Collection<BulkLoadHFiles.LoadQueueItem> lqis)
Attempts to do an atomic load of many hfiles into a region. If it fails, it returns a list of hfiles that need to be retried. If it is successful it will return an empty list. NOTE: To maintain row atomicity guarantees, the region server side should succeed or fail atomically.
- Parameters:
conn - Connection to use
tableName - Table to which these hfiles should be loaded
copyFiles - whether to replicate to the peer cluster while bulkloading
first - the start key of the region
lqis - hfiles to be loaded
- Returns:
- empty list if success, list of items to retry on recoverable failure
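Because the method is protected, it is only reachable from subclasses or tests; a minimal sketch of consuming its retry contract, assuming the Configuration-accepting constructor (the subclass and its helper method are hypothetical):

import java.util.Collection;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.AsyncClusterConnection;
import org.apache.hadoop.hbase.tool.BulkLoadHFiles;
import org.apache.hadoop.hbase.tool.BulkLoadHFilesTool;

public class RetryingBulkLoad extends BulkLoadHFilesTool {
  public RetryingBulkLoad(Configuration conf) {
    super(conf);
  }

  void loadGroup(AsyncClusterConnection conn, TableName table, byte[] first,
      Collection<BulkLoadHFiles.LoadQueueItem> lqis, Deque<BulkLoadHFiles.LoadQueueItem> queue) {
    CompletableFuture<Collection<BulkLoadHFiles.LoadQueueItem>> future =
        tryAtomicRegionLoad(conn, table, false, first, lqis);
    // An empty collection signals success; anything returned is re-queued for another
    // group-or-split pass, mirroring what bulkLoadPhase does.
    future.thenAccept(toRetry -> toRetry.forEach(queue::add));
  }
}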
-
bulkLoadPhase
@Private
protected void bulkLoadPhase(AsyncClusterConnection conn, TableName tableName, Deque<BulkLoadHFiles.LoadQueueItem> queue, org.apache.hbase.thirdparty.com.google.common.collect.Multimap<ByteBuffer, BulkLoadHFiles.LoadQueueItem> regionGroups, boolean copyFiles, Map<BulkLoadHFiles.LoadQueueItem, ByteBuffer> item2RegionMap) throws IOException
This takes the LQIs grouped by likely regions and attempts to bulk load them. Any failures are re-queued for another pass with the groupOrSplitPhase. Protected for testing.
- Throws:
IOException
-
groupByFamilies
private Map<byte[],Collection<BulkLoadHFiles.LoadQueueItem>> groupByFamilies(Collection<BulkLoadHFiles.LoadQueueItem> itemsInRegion) -
checkHFilesCountPerRegionPerFamily
private boolean checkHFilesCountPerRegionPerFamily(org.apache.hbase.thirdparty.com.google.common.collect.Multimap<ByteBuffer, BulkLoadHFiles.LoadQueueItem> regionGroups) -
groupOrSplitPhase
private Pair<org.apache.hbase.thirdparty.com.google.common.collect.Multimap<ByteBuffer, BulkLoadHFiles.LoadQueueItem>, Set<String>> groupOrSplitPhase(AsyncClusterConnection conn, TableName tableName, ExecutorService pool, Deque<BulkLoadHFiles.LoadQueueItem> queue, List<Pair<byte[], byte[]>> startEndKeys) throws IOException
- Parameters:
conn - the HBase cluster connection
tableName - the table name of the table to load into
pool - the ExecutorService
queue - the queue for LoadQueueItem
startEndKeys - start and end keys
- Returns:
- A map that groups LQI by likely bulk load region targets and Set of missing hfiles.
- Throws:
IOException
-
getUniqueName
-
splitStoreFile
private List<BulkLoadHFiles.LoadQueueItem> splitStoreFile(AsyncTableRegionLocator loc, BulkLoadHFiles.LoadQueueItem item, TableDescriptor tableDesc, byte[] splitKey) throws IOException
- Throws:
IOException
-
getRegionIndex
- Parameters:
startEndKeys - the start/end keys of the regions belonging to this table, in ascending order by start key
key - the key whose containing region needs to be found
- Returns:
- region index
-
checkRegionIndexValid
private void checkRegionIndexValid(int idx, List<Pair<byte[], byte[]>> startEndKeys, TableName tableName) throws IOException
A region hole or overlap is considered to exist under the following conditions: 1) idx < 0, meaning the first region info is lost; 2) the end key of a region is not equal to the start key of the next region; 3) the end key of the last region is not empty.
- Throws:
IOException
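A standalone illustration of those three conditions (not the private method itself), using the same startEndKeys representation of (start key, end key) pairs sorted by start key; the class name and the use of IllegalStateException instead of IOException are illustrative only:

import java.util.List;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

final class RegionIndexCheck {
  static void check(int idx, List<Pair<byte[], byte[]>> startEndKeys) {
    if (idx < 0) {
      // Condition 1: no region contains the key, so the first region info is lost.
      throw new IllegalStateException("first region info is lost");
    }
    byte[] endKey = startEndKeys.get(idx).getSecond();
    if (idx == startEndKeys.size() - 1) {
      // Condition 3: the last region must have an empty end key.
      if (endKey.length != 0) {
        throw new IllegalStateException("last region's end key is not empty");
      }
    } else if (!Bytes.equals(endKey, startEndKeys.get(idx + 1).getFirst())) {
      // Condition 2: adjacent regions must share a boundary, otherwise there is a hole or overlap.
      throw new IllegalStateException("region hole or overlap after index " + idx);
    }
  }
}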
-
groupOrSplit
@Private
protected Pair<List<BulkLoadHFiles.LoadQueueItem>, String> groupOrSplit(AsyncClusterConnection conn, TableName tableName, org.apache.hbase.thirdparty.com.google.common.collect.Multimap<ByteBuffer, BulkLoadHFiles.LoadQueueItem> regionGroups, BulkLoadHFiles.LoadQueueItem item, List<Pair<byte[], byte[]>> startEndKeys) throws IOException
Attempt to assign the given load queue item into its target region group. If the hfile boundary no longer fits into a region, physically split the hfile such that the new bottom half will fit, and return the list of LQIs corresponding to the resulting hfiles. Protected for testing.
- Throws:
IOException
- if an IO failure is encountered
-
splitStoreFile
@Private
static void splitStoreFile(AsyncTableRegionLocator loc, org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inFile, ColumnFamilyDescriptor familyDesc, byte[] splitKey, org.apache.hadoop.fs.Path bottomOut, org.apache.hadoop.fs.Path topOut) throws IOException
Split a storefile into a top and bottom half with favored nodes, maintaining the metadata, recreating bloom filters, etc.
- Throws:
IOException
-
initStoreFileWriter
private static StoreFileWriter initStoreFileWriter(org.apache.hadoop.conf.Configuration conf, Cell cell, HFileContext hFileContext, CacheConfig cacheConf, BloomType bloomFilterType, org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path outFile, AsyncTableRegionLocator loc) throws IOException
- Throws:
IOException
-
copyHFileHalf
private static void copyHFileHalf(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inFile, org.apache.hadoop.fs.Path outFile, Reference reference, ColumnFamilyDescriptor familyDescriptor, AsyncTableRegionLocator loc) throws IOException
Copy half of an HFile into a new HFile with favored nodes.
- Throws:
IOException
-
inferBoundaries
static byte[][] inferBoundaries(SortedMap<byte[], Integer> bdryMap)
Infers region boundaries for a new table.
Parameter: bdryMap is a map from keys to an integer belonging to {+1, -1}:
- If a key is a start key of a file, it maps to +1
- If a key is an end key of a file, it maps to -1
Algorithm:
- Poll the keys in order.
- Keep adding the mapped values of these keys (runningSum).
- Each time runningSum reaches 0, add the start key from where the runningSum began to a boundary list.
- Return the boundary list.
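A minimal sketch of that running-sum algorithm as documented (not the production implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

final class BoundaryInference {
  static byte[][] inferBoundaries(SortedMap<byte[], Integer> bdryMap) {
    List<byte[]> boundaries = new ArrayList<>();
    int runningSum = 0;
    byte[] runStart = null;
    for (Map.Entry<byte[], Integer> e : bdryMap.entrySet()) {
      if (runningSum == 0) {
        runStart = e.getKey();       // the key at which this run of overlapping files begins
      }
      runningSum += e.getValue();    // +1 for a start key, -1 for an end key
      if (runningSum == 0) {
        boundaries.add(runStart);    // run closed: record its start key as a region boundary
      }
    }
    return boundaries.toArray(new byte[0][]);
  }
}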
-
createTable
private void createTable(TableName tableName, org.apache.hadoop.fs.Path hfofDir, AsyncAdmin admin) throws IOException
If the table is created for the first time, then "completebulkload" reads the files twice. More modifications are necessary if we want to avoid doing this.
- Throws:
IOException
-
performBulkLoad
private Map<BulkLoadHFiles.LoadQueueItem, ByteBuffer> performBulkLoad(AsyncClusterConnection conn, TableName tableName, Deque<BulkLoadHFiles.LoadQueueItem> queue, ExecutorService pool, boolean copyFile) throws IOException
- Throws:
IOException
-
cleanup
private void cleanup(AsyncClusterConnection conn, TableName tableName, Deque<BulkLoadHFiles.LoadQueueItem> queue, ExecutorService pool) throws IOException
- Throws:
IOException
-
doBulkLoad
private Map<BulkLoadHFiles.LoadQueueItem, ByteBuffer> doBulkLoad(AsyncClusterConnection conn, TableName tableName, Map<byte[], List<org.apache.hadoop.fs.Path>> map, boolean silence, boolean copyFile) throws IOException
Perform a bulk load of the given map of families to hfiles into the given pre-existing table. This method is not threadsafe.
- Parameters:
map - map of family to List of hfiles
tableName - table to load the hfiles into
silence - true to ignore unmatched column families
copyFile - always copy hfiles if true
- Throws:
IOException
-
doBulkLoad
private Map<BulkLoadHFiles.LoadQueueItem, ByteBuffer> doBulkLoad(AsyncClusterConnection conn, TableName tableName, org.apache.hadoop.fs.Path hfofDir, boolean silence, boolean copyFile) throws IOException
Perform a bulk load of the given directory into the given pre-existing table. This method is not threadsafe.
- Parameters:
tableName - table to load the hfiles into
hfofDir - the directory that was provided as the output path of a job using HFileOutputFormat
silence - true to ignore unmatched column families
copyFile - always copy hfiles if true
- Throws:
IOException
-
bulkLoad
public Map<BulkLoadHFiles.LoadQueueItem, ByteBuffer> bulkLoad(TableName tableName, Map<byte[], List<org.apache.hadoop.fs.Path>> family2Files) throws IOException
Description copied from interface: BulkLoadHFiles
Perform a bulk load of the given directory into the given pre-existing table.
- Specified by:
bulkLoad in interface BulkLoadHFiles
- Parameters:
tableName - the table to load into
family2Files - map of family to List of hfiles
- Throws:
TableNotFoundException - if table does not yet exist
IOException
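A brief usage sketch for this overload, assuming the Configuration-accepting constructor; the table name and HFile paths are hypothetical:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.tool.BulkLoadHFilesTool;
import org.apache.hadoop.hbase.util.Bytes;

public class FamilyMapBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // byte[] keys need a byte-wise comparator to behave sensibly in a map.
    Map<byte[], List<Path>> family2Files = new TreeMap<>(Bytes.BYTES_COMPARATOR);
    family2Files.put(Bytes.toBytes("cf"),
        Arrays.asList(new Path("/bulk/output/cf/hfile-0"), new Path("/bulk/output/cf/hfile-1")));
    new BulkLoadHFilesTool(conf).bulkLoad(TableName.valueOf("my_table"), family2Files);
  }
}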
-
bulkLoad
public Map<BulkLoadHFiles.LoadQueueItem, ByteBuffer> bulkLoad(TableName tableName, org.apache.hadoop.fs.Path dir) throws IOException
Description copied from interface: BulkLoadHFiles
Perform a bulk load of the given directory into the given pre-existing table.
- Specified by:
bulkLoad in interface BulkLoadHFiles
- Parameters:
tableName - the table to load into
dir - the directory that was provided as the output path of a job using HFileOutputFormat
- Throws:
TableNotFoundException - if table does not yet exist
IOException
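And the directory-based overload, again assuming the Configuration-accepting constructor and a hypothetical HFileOutputFormat output directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.tool.BulkLoadHFilesTool;

public class DirectoryBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Directory produced by an HFileOutputFormat job for table "my_table".
    new BulkLoadHFilesTool(conf).bulkLoad(TableName.valueOf("my_table"), new Path("/bulk/output"));
  }
}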
-
tableExists
- Throws:
TableNotFoundException
- if table does not exist
IOException
-
throwAndLogTableNotFoundException
- Throws:
TableNotFoundException
-
setBulkToken
-
setClusterIds
-
usage
-
run
- Specified by:
run in interface org.apache.hadoop.util.Tool
- Throws:
Exception
-
main
- Throws:
Exception
-
disableReplication
Description copied from interface: BulkLoadHFiles
Disables replication for all bulkloads done via this instance, when bulkload replication is configured.
- Specified by:
disableReplication in interface BulkLoadHFiles
-
isReplicationDisabled
Description copied from interface: BulkLoadHFiles
Returns true if replication has been disabled.
- Specified by:
isReplicationDisabled in interface BulkLoadHFiles
-