public class IntegrationTestLoadCommonCrawl extends IntegrationTestBase
Run like:
./bin/hbase org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl \
-Dfs.s3n.awsAccessKeyId=<AWS access key> \
-Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
/path/to/test-CC-MAIN-2021-10-warc.paths.gz \
/path/to/tmp/warc-loader-output
Access to the Common Crawl dataset in S3 is made available to anyone by Amazon AWS, but Hadoop's S3N filesystem still requires valid access credentials to initialize.
The input path can specify either a directory or a file. If a directory, the loader expects it to contain one or more WARC files from the Common Crawl dataset. If a file, the loader expects it to contain a list of Hadoop S3N URIs, one per line, pointing to the S3 locations of one or more WARC files from the Common Crawl dataset; the file may optionally be gzip-compressed, and lines must be terminated with the UNIX line terminator.
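As a sketch of the file form of the input, the snippet below writes a one-line list of S3N URIs and compresses it with gzip. The object key shown is a hypothetical placeholder for illustration, not a real Common Crawl path:

```shell
# Build a gzip-compressed list of Hadoop S3N URIs, one per line with UNIX
# line terminators, as the loader expects. The object key below is a
# hypothetical placeholder, not a real Common Crawl key.
cat > test-warc.paths <<'EOF'
s3n://commoncrawl/crawl-data/CC-MAIN-2021-10/segments/0000000000000.0/warc/CC-MAIN-00000.warc.gz
EOF
gzip -f test-warc.paths   # produces test-warc.paths.gz, usable as the loader's input path
```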
Included in hbase-it/src/test/resources/CC-MAIN-2021-10-warc.paths.gz is a list of all WARC files comprising the Q1 2021 crawl archive. There are 64,000 WARC files in this data set, each containing ~1 GB of gzipped data. The WARC files contain several record types, such as metadata, request, and response, but the loader ingests only the response records. If the HBase table schema does not specify compression (the default), there is roughly a 10x expansion, so loading the full crawl archive results in a table approximately 640 TB in size.
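The 640 TB figure follows from simple arithmetic on the numbers above (64,000 files, ~1 GB of gzipped data each, ~10x expansion without table compression), as this back-of-the-envelope sketch shows:

```shell
# Back-of-the-envelope sizing for the full CC-MAIN-2021-10 crawl archive.
files=64000          # WARC files in the crawl
gb_per_file=1        # ~1 GB of gzipped data per file
expansion=10         # ~10x expansion when the table is uncompressed
echo "$(( files * gb_per_file * expansion / 1000 )) TB"   # prints "640 TB"
```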
The loader can optionally drive read load during ingest by incrementing counters for each URL discovered in content. Add -DIntegrationTestLoadCommonCrawl.increments=true to the command line to enable this behavior.
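Combining that flag with the invocation shown earlier might look like the following (paths and credentials remain placeholders, as above):

```shell
./bin/hbase org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl \
  -Dfs.s3n.awsAccessKeyId=<AWS access key> \
  -Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
  -DIntegrationTestLoadCommonCrawl.increments=true \
  /path/to/test-CC-MAIN-2021-10-warc.paths.gz \
  /path/to/tmp/warc-loader-output
```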
You can also split the Loader and Verify stages:
Load with:
./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Loader' \
-files /path/to/hadoop-aws.jar \
-Dfs.s3n.awsAccessKeyId=<AWS access key> \
-Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
/path/to/test-CC-MAIN-2021-10-warc.paths.gz \
/path/to/tmp/warc-loader-output
Note: The hadoop-aws jar will be needed at runtime to instantiate the S3N filesystem. Use the -files ToolRunner argument to add it.
Verify with:
./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Verify' \
/path/to/tmp/warc-loader-output
Modifier and Type | Class and Description |
---|---|
static class | IntegrationTestLoadCommonCrawl.Counts |
static class | IntegrationTestLoadCommonCrawl.HBaseKeyWritable |
static class | IntegrationTestLoadCommonCrawl.Loader |
static class | IntegrationTestLoadCommonCrawl.OneFilePerMapperSFIF<K,V> |
static class | IntegrationTestLoadCommonCrawl.Verify |
Modifier and Type | Field and Description |
---|---|
protected String[] | args |
(package private) static byte[] | CONTENT_FAMILY_NAME |
(package private) static byte[] | CONTENT_LENGTH_QUALIFIER |
(package private) static byte[] | CONTENT_QUALIFIER |
(package private) static byte[] | CONTENT_TYPE_QUALIFIER |
private static AtomicLong | counter |
(package private) static byte[] | CRC_QUALIFIER |
(package private) static byte[] | DATE_QUALIFIER |
(package private) static boolean | DEFAULT_INCREMENTS |
(package private) static String | DEFAULT_TABLE_NAME |
(package private) static String | INCREMENTS_NAME_KEY |
(package private) static int | INFLIGHT_PAUSE_MS |
(package private) static byte[] | INFO_FAMILY_NAME |
(package private) static byte[] | IP_ADDRESS_QUALIFIER |
private static org.slf4j.Logger | LOG |
(package private) static int | MAX_INFLIGHT |
protected org.apache.hadoop.fs.Path | outputDir |
(package private) static byte[] | REF_QUALIFIER |
(package private) static byte[] | SEP |
private static int | shift |
(package private) static String | TABLE_NAME_KEY |
(package private) static byte[] | TARGET_URI_QUALIFIER |
(package private) static byte[] | URL_FAMILY_NAME |
(package private) static Pattern | URL_PATTERN |
protected org.apache.hadoop.fs.Path | warcFileInputDir |
Fields inherited from class IntegrationTestBase: CHAOS_MONKEY_PROPS, monkey, MONKEY_LONG_OPT, monkeyProps, monkeyToUse, NO_CLUSTER_CLEANUP_LONG_OPT, noClusterCleanUp, util
Constructor and Description |
---|
IntegrationTestLoadCommonCrawl() |
Modifier and Type | Method and Description |
---|---|
void | cleanUpCluster() |
private static Collection<String> | extractUrls(byte[] content) |
protected Set<String> | getColumnFamilies() - Provides the name of the CFs that are protected from random Chaos monkey activity (alter) |
private static long | getSequence() |
org.apache.hadoop.hbase.TableName | getTablename() - Provides the name of the table that is protected from random Chaos monkey activity |
(package private) static org.apache.hadoop.hbase.TableName | getTablename(org.apache.hadoop.conf.Configuration c) |
static void | main(String[] args) |
protected void | processOptions(org.apache.hbase.thirdparty.org.apache.commons.cli.CommandLine cmd) |
private static byte[] | rowKeyFromTargetURI(String targetUri) |
int | run(String[] args) |
protected int | runLoader(org.apache.hadoop.fs.Path warcFileInputDir, org.apache.hadoop.fs.Path outputDir) |
int | runTestFromCommandLine() |
protected int | runVerify(org.apache.hadoop.fs.Path inputDir) |
void | setUpCluster() |
Methods inherited from class IntegrationTestBase: addOptions, cleanUp, cleanUpMonkey, cleanUpMonkey, doWork, getConf, getDefaultMonkeyFactory, getTestingUtil, loadMonkeyProperties, processBaseOptions, setUp, setUpMonkey, startMonkey
Methods inherited from class org.apache.hadoop.hbase.util.AbstractHBaseTool: addOption, addOptNoArg, addOptNoArg, addOptWithArg, addOptWithArg, addRequiredOption, addRequiredOptWithArg, addRequiredOptWithArg, doStaticMain, getOptionAsDouble, getOptionAsInt, getOptionAsInt, getOptionAsLong, getOptionAsLong, newParser, parseArgs, parseInt, parseLong, printUsage, printUsage, processOldArgs, setConf
private static final org.slf4j.Logger LOG
static final String TABLE_NAME_KEY
static final String DEFAULT_TABLE_NAME
static final String INCREMENTS_NAME_KEY
static final boolean DEFAULT_INCREMENTS
static final int MAX_INFLIGHT
static final int INFLIGHT_PAUSE_MS
static final byte[] CONTENT_FAMILY_NAME
static final byte[] INFO_FAMILY_NAME
static final byte[] URL_FAMILY_NAME
static final byte[] SEP
static final byte[] CONTENT_QUALIFIER
static final byte[] CONTENT_LENGTH_QUALIFIER
static final byte[] CONTENT_TYPE_QUALIFIER
static final byte[] CRC_QUALIFIER
static final byte[] DATE_QUALIFIER
static final byte[] IP_ADDRESS_QUALIFIER
static final byte[] TARGET_URI_QUALIFIER
static final byte[] REF_QUALIFIER
protected org.apache.hadoop.fs.Path warcFileInputDir
protected org.apache.hadoop.fs.Path outputDir
private static final AtomicLong counter
private static final int shift
static final Pattern URL_PATTERN
public IntegrationTestLoadCommonCrawl()
protected int runLoader(org.apache.hadoop.fs.Path warcFileInputDir, org.apache.hadoop.fs.Path outputDir) throws Exception
Throws: Exception
protected int runVerify(org.apache.hadoop.fs.Path inputDir) throws Exception
Throws: Exception
public int run(String[] args)
Specified by: run in interface org.apache.hadoop.util.Tool
Overrides: run in class org.apache.hadoop.hbase.util.AbstractHBaseTool
protected void processOptions(org.apache.hbase.thirdparty.org.apache.commons.cli.CommandLine cmd)
Overrides: processOptions in class IntegrationTestBase
public void setUpCluster() throws Exception
Specified by: setUpCluster in class IntegrationTestBase
Throws: Exception
public void cleanUpCluster() throws Exception
Specified by: cleanUpCluster in class IntegrationTestBase
Throws: Exception
static org.apache.hadoop.hbase.TableName getTablename(org.apache.hadoop.conf.Configuration c)
public org.apache.hadoop.hbase.TableName getTablename()
Description copied from class: IntegrationTestBase
Specified by: getTablename in class IntegrationTestBase
protected Set<String> getColumnFamilies()
Description copied from class: IntegrationTestBase
Specified by: getColumnFamilies in class IntegrationTestBase
public int runTestFromCommandLine() throws Exception
Specified by: runTestFromCommandLine in class IntegrationTestBase
Throws: Exception
private static long getSequence()
private static byte[] rowKeyFromTargetURI(String targetUri) throws IOException, URISyntaxException, IllegalArgumentException
private static Collection<String> extractUrls(byte[] content)
Copyright © 2007–2020 The Apache Software Foundation. All rights reserved.