org.apache.hadoop.hbase.util.AbstractHBaseTool

org.apache.hadoop.hbase.IntegrationTestBase

org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl

All Implemented Interfaces: org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class IntegrationTestLoadCommonCrawl extends IntegrationTestBase

This integration test loads successful resource retrieval records from the Common Crawl (https://commoncrawl.org/) public dataset into an HBase table and writes records that can be used to later verify the presence and integrity of those records.

Run like:

./bin/hbase org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl \
   -Dfs.s3n.awsAccessKeyId=<AWS access key> \
   -Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
   /path/to/test-CC-MAIN-2021-10-warc.paths.gz \
   /path/to/tmp/warc-loader-output

Amazon Web Services makes the Common Crawl dataset in S3 available to anyone, but Hadoop's S3N filesystem still requires valid access credentials to initialize.

The input path can specify either a directory or a file. If it is a directory, the loader expects it to contain one or more WARC files from the Common Crawl dataset. If it is a file, the loader expects a list of Hadoop S3N URIs, one per line, each pointing to the S3 location of a WARC file from the Common Crawl dataset. The file may optionally be compressed with gzip, and its lines should be terminated with the UNIX line terminator (LF).
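The file form of the input can be illustrated with a minimal sketch that reads such a list, one URI per line, and unwraps gzip when the stream starts with the gzip magic bytes. This is not the loader's own code: the class name WarcPathListReader and the method readPaths are hypothetical, and the sketch uses only the Java standard library rather than Hadoop's filesystem and input format classes.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, for illustration only; not part of HBase.
final class WarcPathListReader {

  /** Reads one URI per line, unwrapping gzip if the gzip magic bytes are present. */
  static List<String> readPaths(InputStream raw) throws IOException {
    BufferedInputStream in = new BufferedInputStream(raw);
    // Peek at the first two bytes, then rewind so the reader sees the whole stream.
    in.mark(2);
    int b0 = in.read();
    int b1 = in.read();
    in.reset();
    boolean gzipped = b0 == 0x1f && b1 == 0x8b;
    InputStream src = gzipped ? new GZIPInputStream(in) : in;

    List<String> paths = new ArrayList<>();
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(src, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty()) {
          paths.add(line); // one WARC file URI per non-empty line
        }
      }
    }
    return paths;
  }
}
```

Detecting gzip from the stream content rather than the file extension means the same method accepts both the compressed and the uncompressed form of the list.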

Included in hbase-it/src/test/resources/CC-MAIN-2021-10-warc.paths.gz is a list of all WARC files comprising the Q1 2021 crawl archive. There are 64,000 WARC files in this dataset, each containing ~1GB of gzipped data. The WARC files contain several record types, such as metadata, request, and response, but only the response records are loaded. If the HBase table schema does not specify compression (the default), the data expands roughly 10x when loaded. Loading the full crawl archive therefore results in a table approximately 640 TB in size.
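The size estimate follows directly from the figures above: 64,000 files at about 1 GB each, expanded about 10x, comes to roughly 640 TB. A small sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB); the class and method names are hypothetical:

```java
// Hypothetical helper, for illustration only; not part of HBase.
final class CrawlSizeEstimate {

  /** Estimated table size in decimal terabytes (1 TB = 1000 GB). */
  static double estimateTableTb(long warcFiles, double gbPerFile, double expansionFactor) {
    return warcFiles * gbPerFile * expansionFactor / 1000.0;
  }

  public static void main(String[] args) {
    // 64,000 files x ~1 GB gzipped each x ~10x expansion without table compression
    System.out.println(estimateTableTb(64_000L, 1.0, 10.0) + " TB"); // prints "640.0 TB"
  }
}
```

The same function gives a quick estimate for a subset of the archive, for example when the input list contains only a few hundred WARC files.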

The loader can optionally drive read load during ingest by incrementing counters for each URL discovered in content. To enable this, add -DIntegrationTestLoadCommonCrawl.increments=true to the command line.
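As a rough illustration of the idea, the sketch below scans content bytes for URLs and bumps one counter per URL found. It rests on several assumptions: the regular expression is a deliberately simple stand-in and is not the class's URL_PATTERN, the in-memory map stands in for counters that the real test would keep in HBase, and the names UrlIncrementSketch and countUrls are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, for illustration only; not part of HBase.
final class UrlIncrementSketch {

  // Assumed, simplified pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
  static final Pattern SIMPLE_URL = Pattern.compile("https?://[^\\s\"'<>]+");

  /** Returns a count per URL found in the content, in order of first appearance. */
  static Map<String, Long> countUrls(byte[] content) {
    Map<String, Long> counts = new LinkedHashMap<>();
    Matcher m = SIMPLE_URL.matcher(new String(content, StandardCharsets.UTF_8));
    while (m.find()) {
      // In-memory stand-in for an increment against a per-URL counter.
      counts.merge(m.group(), 1L, Long::sum);
    }
    return counts;
  }
}
```

Each discovered URL produces one increment, so content that links to the same URL several times drives several counter updates for the same key.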

You can also split the Loader and Verify stages:

Load with:

./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Loader' \
   -files /path/to/hadoop-aws.jar \
   -Dfs.s3n.awsAccessKeyId=<AWS access key> \
   -Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
   /path/to/test-CC-MAIN-2021-10-warc.paths.gz \
   /path/to/tmp/warc-loader-output

Note: The hadoop-aws jar will be needed at runtime to instantiate the S3N filesystem. Use the -files ToolRunner argument to add it.

Verify with:

./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Verify' \
   /path/to/tmp/warc-loader-output

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static enum | IntegrationTestLoadCommonCrawl.Counts | |
| static class | IntegrationTestLoadCommonCrawl.HBaseKeyWritable | |
| static class | IntegrationTestLoadCommonCrawl.Loader | |
| static class | IntegrationTestLoadCommonCrawl.OneFilePerMapperSFIF<K,V> | |
| static class | IntegrationTestLoadCommonCrawl.Verify | |

Nested classes/interfaces inherited from class org.apache.hadoop.hbase.util.AbstractHBaseTool
org.apache.hadoop.hbase.util.AbstractHBaseTool.OptionsOrderComparator
Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected String[] | args | |
| (package private) static final byte[] | CONTENT_FAMILY_NAME | |
| (package private) static final byte[] | CONTENT_LENGTH_QUALIFIER | |
| (package private) static final byte[] | CONTENT_QUALIFIER | |
| (package private) static final byte[] | CONTENT_TYPE_QUALIFIER | |
| private static final AtomicLong | counter | |
| (package private) static final byte[] | CRC_QUALIFIER | |
| (package private) static final byte[] | DATE_QUALIFIER | |
| (package private) static final boolean | DEFAULT_INCREMENTS | |
| (package private) static final String | DEFAULT_TABLE_NAME | |
| (package private) static final String | INCREMENTS_NAME_KEY | |
| (package private) static final int | INFLIGHT_PAUSE_MS | |
| (package private) static final byte[] | INFO_FAMILY_NAME | |
| (package private) static final byte[] | IP_ADDRESS_QUALIFIER | |
| private static final org.slf4j.Logger | LOG | |
| (package private) static final int | MAX_INFLIGHT | |
| protected org.apache.hadoop.fs.Path | outputDir | |
| (package private) static final byte[] | REF_QUALIFIER | |
| (package private) static final byte[] | SEP | |
| private static final int | shift | |
| (package private) static final String | TABLE_NAME_KEY | |
| (package private) static final byte[] | TARGET_URI_QUALIFIER | |
| (package private) static final byte[] | URL_FAMILY_NAME | |
| (package private) static final Pattern | URL_PATTERN | |
| protected org.apache.hadoop.fs.Path | warcFileInputDir | |

Fields inherited from class org.apache.hadoop.hbase.IntegrationTestBase
CHAOS_MONKEY_PROPS, monkey, MONKEY_LONG_OPT, monkeyProps, monkeyToUse, NO_CLUSTER_CLEANUP_LONG_OPT, noClusterCleanUp, util

Fields inherited from class org.apache.hadoop.hbase.util.AbstractHBaseTool
cmdLineArgs, conf, EXIT_FAILURE, EXIT_SUCCESS, LONG_HELP_OPTION, options, SHORT_HELP_OPTION
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| IntegrationTestLoadCommonCrawl() | |
Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | cleanUpCluster() | |
| private static Collection<String> | extractUrls(byte[] content) | |
| protected Set<String> | getColumnFamilies() | Provides the name of the CFs that are protected from random Chaos monkey activity (alter) |
| private static long | getSequence() | |
| org.apache.hadoop.hbase.TableName | getTablename() | Provides the name of the table that is protected from random Chaos monkey activity |
| (package private) static org.apache.hadoop.hbase.TableName | getTablename(org.apache.hadoop.conf.Configuration c) | |
| static void | main(String[] args) | |
| protected void | processOptions(org.apache.hbase.thirdparty.org.apache.commons.cli.CommandLine cmd) | |
| private static byte[] | rowKeyFromTargetURI(String targetUri) | |
| int | run(String[] args) | |
| protected int | runLoader(org.apache.hadoop.fs.Path warcFileInputDir, org.apache.hadoop.fs.Path outputDir) | |
| int | runTestFromCommandLine() | |
| protected int | runVerify(org.apache.hadoop.fs.Path inputDir) | |
| void | setUpCluster() | |

Methods inherited from class org.apache.hadoop.hbase.IntegrationTestBase
addOptions, cleanUp, cleanUpMonkey, cleanUpMonkey, doWork, getConf, getDefaultMonkeyFactory, getTestingUtil, loadMonkeyProperties, processBaseOptions, setUp, setUpMonkey, startMonkey

Methods inherited from class org.apache.hadoop.hbase.util.AbstractHBaseTool
addOption, addOptNoArg, addOptNoArg, addOptWithArg, addOptWithArg, addRequiredOption, addRequiredOptWithArg, addRequiredOptWithArg, doStaticMain, getOptionAsDouble, getOptionAsInt, getOptionAsInt, getOptionAsLong, getOptionAsLong, newParser, parseArgs, parseInt, parseLong, printUsage, printUsage, processOldArgs, setConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- LOG
  
  private static final org.slf4j.Logger LOG
- TABLE_NAME_KEY
  
  static final String TABLE_NAME_KEY
  See Also:
  
  Constant Field Values
- DEFAULT_TABLE_NAME
  
  static final String DEFAULT_TABLE_NAME
  See Also:
  
  Constant Field Values
- INCREMENTS_NAME_KEY
  
  static final String INCREMENTS_NAME_KEY
  See Also:
  
  Constant Field Values
- DEFAULT_INCREMENTS
  
  static final boolean DEFAULT_INCREMENTS
  See Also:
  
  Constant Field Values
- MAX_INFLIGHT
  
  static final int MAX_INFLIGHT
  See Also:
  
  Constant Field Values
- INFLIGHT_PAUSE_MS
  
  static final int INFLIGHT_PAUSE_MS
  See Also:
  
  Constant Field Values
- CONTENT_FAMILY_NAME
  
  static final byte[] CONTENT_FAMILY_NAME
- INFO_FAMILY_NAME
  
  static final byte[] INFO_FAMILY_NAME
- URL_FAMILY_NAME
  
  static final byte[] URL_FAMILY_NAME
- SEP
  
  static final byte[] SEP
- CONTENT_QUALIFIER
  
  static final byte[] CONTENT_QUALIFIER
- CONTENT_LENGTH_QUALIFIER
  
  static final byte[] CONTENT_LENGTH_QUALIFIER
- CONTENT_TYPE_QUALIFIER
  
  static final byte[] CONTENT_TYPE_QUALIFIER
- CRC_QUALIFIER
  
  static final byte[] CRC_QUALIFIER
- DATE_QUALIFIER
  
  static final byte[] DATE_QUALIFIER
- IP_ADDRESS_QUALIFIER
  
  static final byte[] IP_ADDRESS_QUALIFIER
- TARGET_URI_QUALIFIER
  
  static final byte[] TARGET_URI_QUALIFIER
- REF_QUALIFIER
  
  static final byte[] REF_QUALIFIER
- warcFileInputDir
  
  protected org.apache.hadoop.fs.Path warcFileInputDir
- outputDir
  
  protected org.apache.hadoop.fs.Path outputDir
- args
  
  protected String[] args
- counter
  
  private static final AtomicLong counter
- shift
  
  private static final int shift
  See Also:
  
  Constant Field Values
- URL_PATTERN
  
  static final Pattern URL_PATTERN
Constructor Details
- IntegrationTestLoadCommonCrawl
  
  public IntegrationTestLoadCommonCrawl()
Method Details
- runLoader
  
  protected int runLoader(org.apache.hadoop.fs.Path warcFileInputDir, org.apache.hadoop.fs.Path outputDir) throws Exception
  
  Throws:
  
  Exception
- runVerify
  
  protected int runVerify(org.apache.hadoop.fs.Path inputDir) throws Exception
  
  Throws:
  
  Exception
- run
  
  public int run(String[] args)
  
  Specified by:
  
  run in interface org.apache.hadoop.util.Tool
  
  Overrides:
  
  run in class org.apache.hadoop.hbase.util.AbstractHBaseTool
- processOptions
  
  protected void processOptions(org.apache.hbase.thirdparty.org.apache.commons.cli.CommandLine cmd)
  
  Overrides:
  
  processOptions in class IntegrationTestBase
- setUpCluster
  
  public void setUpCluster() throws Exception
  
  Specified by:
  
  setUpCluster in class IntegrationTestBase
  
  Throws:
  
  Exception
- cleanUpCluster
  
  public void cleanUpCluster() throws Exception
  
  Overrides:
  
  cleanUpCluster in class IntegrationTestBase
  
  Throws:
  
  Exception
- getTablename
  
  static org.apache.hadoop.hbase.TableName getTablename(org.apache.hadoop.conf.Configuration c)
- getTablename
  
  public org.apache.hadoop.hbase.TableName getTablename()
  
  Description copied from class: IntegrationTestBase
  
  Provides the name of the table that is protected from random Chaos monkey activity
  
  Specified by:
  
  getTablename in class IntegrationTestBase
  
  Returns:
  
  table to not delete.
- getColumnFamilies
  
  protected Set<String> getColumnFamilies()
  
  Description copied from class: IntegrationTestBase
  
  Provides the name of the CFs that are protected from random Chaos monkey activity (alter)
  
  Specified by:
  
  getColumnFamilies in class IntegrationTestBase
  
  Returns:
  
  set of cf names to protect.
- runTestFromCommandLine
  
  public int runTestFromCommandLine() throws Exception
  
  Specified by:
  
  runTestFromCommandLine in class IntegrationTestBase
  
  Throws:
  
  Exception
- main
  
  public static void main(String[] args) throws Exception
  
  Throws:
  
  Exception
- getSequence
  
  private static long getSequence()
- rowKeyFromTargetURI
  
  private static byte[] rowKeyFromTargetURI(String targetUri) throws IOException, URISyntaxException, IllegalArgumentException
  
  Throws:
  
  IOException
  
  URISyntaxException
  
  IllegalArgumentException
- extractUrls
  
  private static Collection<String> extractUrls(byte[] content)
