Class TableMapReduceUtil

java.lang.Object
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil

@Public public class TableMapReduceUtil extends Object
Utility for TableMapper and TableReducer
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    private static final org.slf4j.Logger
    LOG
    static final String
    TABLE_INPUT_CLASS_KEY
  • Constructor Summary

    Constructors
    Constructor
    Description
    TableMapReduceUtil()
  • Method Summary

    Modifier and Type
    Method
    Description
    static void
    addDependencyJars(org.apache.hadoop.conf.Configuration conf, Class<?>... classes)
    Deprecated.
    Since 1.3.0 and will be removed in 3.0.0.
    static void
    addDependencyJars(org.apache.hadoop.mapreduce.Job job)
    Add the HBase dependency jars as well as jars for any of the configured job classes to the job configuration, so that JobClient will ship them to the cluster and add them to the DistributedCache.
    static void
    addDependencyJarsForClasses(org.apache.hadoop.conf.Configuration conf, Class<?>... classes)
    Add the jars containing the given classes to the job's configuration such that JobClient will ship them to the cluster and add them to the DistributedCache.
    static void
    addHBaseDependencyJars(org.apache.hadoop.conf.Configuration conf)
    Add HBase and its dependencies (only) to the job configuration.
    static String
    buildDependencyClasspath(org.apache.hadoop.conf.Configuration conf)
    Returns a classpath string built from the content of the "tmpjars" value in conf.
    static String
    convertScanToString(Scan scan)
    Writes the given scan into a Base64 encoded string.
    static Scan
    convertStringToScan(String base64)
    Converts the given Base64 string back into a Scan instance.
    private static String
    findContainingJar(Class<?> my_class, Map<String,String> packagedClasses)
    Find a jar that contains a class of the same name, if any.
    private static org.apache.hadoop.fs.Path
    findOrCreateJar(Class<?> my_class, org.apache.hadoop.fs.FileSystem fs, Map<String,String> packagedClasses)
    Finds the Jar for a class or creates it if it doesn't exist.
    private static Class<? extends org.apache.hadoop.mapreduce.InputFormat>
    getConfiguredInputFormat(org.apache.hadoop.mapreduce.Job job)
     
    private static String
    getJar(Class<?> my_class)
    Invoke 'getJar' on a custom JarFinder implementation.
    private static int
    getRegionCount(org.apache.hadoop.conf.Configuration conf, TableName tableName)
     
    static void
    initCredentials(org.apache.hadoop.mapreduce.Job job)
     
    static void
    initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, String quorumAddress)
    Deprecated.
    Since 1.2.0 and will be removed in 3.0.0.
    static void
    initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.conf.Configuration conf)
    Obtain an authentication token, for the specified cluster, on behalf of the current user and add it to the credentials for the given map reduce job.
    static void
    initMultiTableSnapshotMapperJob(Map<String,Collection<Scan>> snapshotScans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir)
    Sets up the job for reading from one or more table snapshots, with one or more scans per snapshot.
    static void
    initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job)
    Use this before submitting a TableMap job.
    static void
    initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars)
    Use this before submitting a TableMap job.
    static void
    initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass)
    Use this before submitting a TableMap job.
    static void
    initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job)
    Use this before submitting a TableMap job.
    static void
    initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars)
    Use this before submitting a TableMap job.
    static void
    initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass)
    Use this before submitting a TableMap job.
    static void
    initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass)
    Use this before submitting a TableMap job.
    static void
    initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job)
    Use this before submitting a Multi TableMap job.
    static void
    initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars)
    Use this before submitting a Multi TableMap job.
    static void
    initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials)
    Use this before submitting a Multi TableMap job.
    static void
    initTableMapperJob(TableName table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job)
    Use this before submitting a TableMap job.
    static void
    initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job)
    Use this before submitting a TableReduce job.
    static void
    initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner)
    Use this before submitting a TableReduce job.
    static void
    initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl)
    Use this before submitting a TableReduce job.
    static void
    initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl, boolean addDependencyJars)
    Use this before submitting a TableReduce job.
    static void
    initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir)
    Sets up the job for reading from a table snapshot.
    static void
    initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion)
    Sets up the job for reading from a table snapshot.
    static void
    limitNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job)
    Ensures that the given number of reduce tasks for the given job configuration does not exceed the number of regions for the given table.
    static void
    resetCacheConfig(org.apache.hadoop.conf.Configuration conf)
    Enable a basic on-heap cache for these jobs.
    static void
    setNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job)
    Sets the number of reduce tasks for the given job configuration to the number of regions the given table has.
    static void
    setScannerCaching(org.apache.hadoop.mapreduce.Job job, int batchSize)
    Sets the number of rows to return and cache with each scanner iteration.
    private static void
    updateMap(String jar, Map<String,String> packagedClasses)
    Add entries to packagedClasses corresponding to class files contained in jar.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

  • Constructor Details

  • Method Details

    • initTableMapperJob

      public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException
      Use this before submitting a TableMap job. It will appropriately set up the job.
      Parameters:
      table - The table name to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      Throws:
      IOException - When setting up the details fails.
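      For example, a minimal driver sketch. The driver class, MyMapper, the job name, table "mytable", and column family "cf" are hypothetical names for illustration, not part of this API:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
        import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
        import org.apache.hadoop.hbase.mapreduce.TableMapper;
        import org.apache.hadoop.hbase.util.Bytes;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

        public class ScanTableDriver {

          // Identity mapper: re-emits each row unchanged. Hypothetical example class.
          static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
              context.write(row, value);
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
            Job job = Job.getInstance(conf, "scan-mytable");
            job.setJarByClass(ScanTableDriver.class);

            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf")); // restrict the scan to one column family

            TableMapReduceUtil.initTableMapperJob(
                "mytable",                    // table to read from
                scan,                         // scan with columns, time range, etc.
                MyMapper.class,               // mapper extending TableMapper
                ImmutableBytesWritable.class, // mapper output key class
                Result.class,                 // mapper output value class
                job);

            job.setNumReduceTasks(0);                         // map-only in this sketch
            job.setOutputFormatClass(NullOutputFormat.class); // discard mapper output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

      The fragments shown with the other methods below reuse this driver's job, MyMapper, and imports.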
    • initTableMapperJob

      public static void initTableMapperJob(TableName table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException
      Use this before submitting a TableMap job. It will appropriately set up the job.
      Parameters:
      table - The table name to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      Throws:
      IOException - When setting up the details fails.
    • initTableMapperJob

      public static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException
      Use this before submitting a TableMap job. It will appropriately set up the job.
      Parameters:
      table - Binary representation of the table name to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      Throws:
      IOException - When setting up the details fails.
    • initTableMapperJob

      public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass) throws IOException
      Use this before submitting a TableMap job. It will appropriately set up the job.
      Parameters:
      table - The table name to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      Throws:
      IOException - When setting up the details fails.
    • initTableMapperJob

      public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass) throws IOException
      Use this before submitting a TableMap job. It will appropriately set up the job.
      Parameters:
      table - The table name to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      initCredentials - whether to initialize hbase auth credentials for the job
      inputFormatClass - the input format
      Throws:
      IOException - When setting up the details fails.
    • initTableMapperJob

      public static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass) throws IOException
      Use this before submitting a TableMap job. It will appropriately set up the job.
      Parameters:
      table - Binary representation of the table name to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      inputFormatClass - The class of the input format
      Throws:
      IOException - When setting up the details fails.
    • initTableMapperJob

      public static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars) throws IOException
      Use this before submitting a TableMap job. It will appropriately set up the job.
      Parameters:
      table - Binary representation of the table name to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      Throws:
      IOException - When setting up the details fails.
    • getConfiguredInputFormat

      private static Class<? extends org.apache.hadoop.mapreduce.InputFormat> getConfiguredInputFormat(org.apache.hadoop.mapreduce.Job job)
      Returns:
      TableInputFormat.class, unless the Configuration has something else at TABLE_INPUT_CLASS_KEY.
    • initTableMapperJob

      public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars) throws IOException
      Use this before submitting a TableMap job. It will appropriately set up the job.
      Parameters:
      table - The table name to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      Throws:
      IOException - When setting up the details fails.
    • resetCacheConfig

      public static void resetCacheConfig(org.apache.hadoop.conf.Configuration conf)
      Enable a basic on-heap cache for these jobs. Any BlockCache implementation based on direct memory will likely cause the map tasks to OOM when opening the region. This is done here instead of in TableSnapshotRegionRecordReader in case an advanced user wants to override this behavior in their job.
    • initMultiTableSnapshotMapperJob

      public static void initMultiTableSnapshotMapperJob(Map<String,Collection<Scan>> snapshotScans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir) throws IOException
      Sets up the job for reading from one or more table snapshots, with one or more scans per snapshot. It bypasses HBase servers and reads directly from snapshot files.
      Parameters:
      snapshotScans - map of snapshot name to scans on that snapshot.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      Throws:
      IOException
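      A short sketch, assuming java.util.* and org.apache.hadoop.fs.Path imports in addition to the driver shown earlier; the snapshot names and restore path are hypothetical:

        Map<String, Collection<Scan>> snapshotScans = new HashMap<>();
        snapshotScans.put("snapshotA",
            Arrays.asList(new Scan().addFamily(Bytes.toBytes("cf1"))));
        snapshotScans.put("snapshotB", Arrays.asList(new Scan()));

        TableMapReduceUtil.initMultiTableSnapshotMapperJob(
            snapshotScans, MyMapper.class,
            ImmutableBytesWritable.class, Result.class, job,
            true,                               // addDependencyJars
            new Path("/tmp/snapshot-restore")); // tmpRestoreDir, deletable after the job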
    • initTableSnapshotMapperJob

      public static void initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir) throws IOException
      Sets up the job for reading from a table snapshot. It bypasses HBase servers and reads directly from snapshot files.
      Parameters:
      snapshotName - The name of the snapshot (of a table) to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      tmpRestoreDir - a temporary directory to copy the snapshot files into. The current user should have write permissions to this directory, and it should not be a subdirectory of rootdir. The restore directory can be deleted after the job is finished.
      Throws:
      IOException - When setting up the details fails.
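      A sketch, assuming an org.apache.hadoop.fs.Path import; the snapshot name and restore path are hypothetical:

        TableMapReduceUtil.initTableSnapshotMapperJob(
            "mysnapshot",                       // snapshot to read from
            new Scan(),                         // full scan over the snapshot
            MyMapper.class,
            ImmutableBytesWritable.class,
            Result.class,
            job,
            true,                               // addDependencyJars
            new Path("/tmp/snapshot-restore")); // writable dir outside rootdir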
    • initTableSnapshotMapperJob

      public static void initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion) throws IOException
      Sets up the job for reading from a table snapshot. It bypasses HBase servers and reads directly from snapshot files.
      Parameters:
      snapshotName - The name of the snapshot (of a table) to read from.
      scan - The scan instance with the columns, time range etc.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      tmpRestoreDir - a temporary directory to copy the snapshot files into. The current user should have write permissions to this directory, and it should not be a subdirectory of rootdir. The restore directory can be deleted after the job is finished.
      splitAlgo - algorithm to split
      numSplitsPerRegion - how many input splits to generate per one region
      Throws:
      IOException - When setting up the details fails.
    • initTableMapperJob

      public static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException
      Use this before submitting a Multi TableMap job. It will appropriately set up the job.
      Parameters:
      scans - The list of Scan objects to read from.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      Throws:
      IOException - When setting up the details fails.
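      Each Scan in the list must identify the table it targets; the multi-table input format reads the target table from the Scan.SCAN_ATTRIBUTES_TABLE_NAME attribute. A sketch with hypothetical table names, reusing the driver shown earlier:

        List<Scan> scans = new ArrayList<>();
        for (String tableName : new String[] { "table1", "table2" }) {
          Scan scan = new Scan();
          scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(tableName));
          scans.add(scan);
        }
        TableMapReduceUtil.initTableMapperJob(
            scans, MyMapper.class, ImmutableBytesWritable.class, Result.class, job);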
    • initTableMapperJob

      public static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars) throws IOException
      Use this before submitting a Multi TableMap job. It will appropriately set up the job.
      Parameters:
      scans - The list of Scan objects to read from.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      Throws:
      IOException - When setting up the details fails.
    • initTableMapperJob

      public static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials) throws IOException
      Use this before submitting a Multi TableMap job. It will appropriately set up the job.
      Parameters:
      scans - The list of Scan objects to read from.
      mapper - The mapper class to use.
      outputKeyClass - The class of the output key.
      outputValueClass - The class of the output value.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      initCredentials - whether to initialize hbase auth credentials for the job
      Throws:
      IOException - When setting up the details fails.
    • initCredentials

      public static void initCredentials(org.apache.hadoop.mapreduce.Job job) throws IOException
      Throws:
      IOException
    • initCredentialsForCluster

      @Deprecated public static void initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, String quorumAddress) throws IOException
      Deprecated.
      Since 1.2.0 and will be removed in 3.0.0. Use initCredentialsForCluster(Job, Configuration) instead.
      Obtain an authentication token, for the specified cluster, on behalf of the current user and add it to the credentials for the given map reduce job. The quorumAddress is the key to the ZK ensemble, which contains hbase.zookeeper.quorum, hbase.zookeeper.client.port, and zookeeper.znode.parent.
      Parameters:
      job - The job that requires the permission.
      quorumAddress - string that contains the three required configurations
      Throws:
      IOException - When the authentication token cannot be obtained.
    • initCredentialsForCluster

      public static void initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.conf.Configuration conf) throws IOException
      Obtain an authentication token, for the specified cluster, on behalf of the current user and add it to the credentials for the given map reduce job.
      Parameters:
      job - The job that requires the permission.
      conf - The configuration to use in connecting to the peer cluster
      Throws:
      IOException - When the authentication token cannot be obtained.
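      A sketch for obtaining a token from a secured peer cluster, reusing the driver shown earlier; the ZooKeeper settings are hypothetical:

        Configuration peerConf = HBaseConfiguration.create();
        peerConf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
        peerConf.set("hbase.zookeeper.client.port", "2181");
        peerConf.set("zookeeper.znode.parent", "/hbase");
        TableMapReduceUtil.initCredentialsForCluster(job, peerConf);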
    • convertScanToString

      public static String convertScanToString(Scan scan) throws IOException
      Writes the given scan into a Base64 encoded string.
      Parameters:
      scan - The scan to write out.
      Returns:
      The scan saved in a Base64 encoded string.
      Throws:
      IOException - When writing the scan fails.
    • convertStringToScan

      public static Scan convertStringToScan(String base64) throws IOException
      Converts the given Base64 string back into a Scan instance.
      Parameters:
      base64 - The scan details.
      Returns:
      The newly created Scan instance.
      Throws:
      IOException - When reading the scan instance fails.
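      The two conversion methods round-trip. A sketch of how a Scan travels through the job configuration; TableInputFormat.SCAN is the configuration key the init methods use for this purpose:

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf"));                        // hypothetical family
        String encoded = TableMapReduceUtil.convertScanToString(scan);
        job.getConfiguration().set(TableInputFormat.SCAN, encoded);
        Scan decoded = TableMapReduceUtil.convertStringToScan(encoded); // equivalent to the original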
    • initTableReducerJob

      public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job) throws IOException
      Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
      Parameters:
      table - The output table.
      reducer - The reducer class to use.
      job - The current job to adjust.
      Throws:
      IOException - When determining the region count fails.
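      For example, with a hypothetical output table and a MyReducer class assumed to extend TableReducer:

        TableMapReduceUtil.initTableReducerJob(
            "output_table",  // table to write to
            MyReducer.class, // reducer emitting Put/Delete mutations
            job);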
    • initTableReducerJob

      public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner) throws IOException
      Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
      Parameters:
      table - The output table.
      reducer - The reducer class to use.
      job - The current job to adjust.
      partitioner - Partitioner to use. Pass null to use default partitioner.
      Throws:
      IOException - When determining the region count fails.
    • initTableReducerJob

      public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl) throws IOException
      Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
      Parameters:
      table - The output table.
      reducer - The reducer class to use.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      partitioner - Partitioner to use. Pass null to use default partitioner.
      quorumAddress - Distant cluster to write to; default is null for output to the cluster that is designated in hbase-site.xml. Set this String to the ZooKeeper ensemble of an alternate remote cluster when you would have the reduce write to a cluster other than the default; e.g. when copying tables between clusters, the source would be designated by hbase-site.xml and this param would have the ensemble address of the remote cluster. The format to pass is particular: pass <hbase.zookeeper.quorum>:<hbase.zookeeper.client.port>:<zookeeper.znode.parent>, such as server,server2,server3:2181:/hbase.
      serverClass - redefined hbase.regionserver.class
      serverImpl - redefined hbase.regionserver.impl
      Throws:
      IOException - When determining the region count fails.
    • initTableReducerJob

      public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl, boolean addDependencyJars) throws IOException
      Use this before submitting a TableReduce job. It will appropriately set up the JobConf.
      Parameters:
      table - The output table.
      reducer - The reducer class to use.
      job - The current job to adjust. Make sure the passed job is carrying all necessary HBase configuration.
      partitioner - Partitioner to use. Pass null to use default partitioner.
      quorumAddress - Distant cluster to write to; default is null for output to the cluster that is designated in hbase-site.xml. Set this String to the ZooKeeper ensemble of an alternate remote cluster when you would have the reduce write to a cluster other than the default; e.g. when copying tables between clusters, the source would be designated by hbase-site.xml and this param would have the ensemble address of the remote cluster. The format to pass is particular: pass <hbase.zookeeper.quorum>:<hbase.zookeeper.client.port>:<zookeeper.znode.parent>, such as server,server2,server3:2181:/hbase.
      serverClass - redefined hbase.regionserver.class
      serverImpl - redefined hbase.regionserver.impl
      addDependencyJars - upload HBase jars and jars for any of the configured job classes via the distributed cache (tmpjars).
      Throws:
      IOException - When determining the region count fails.
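      A sketch for writing to a remote cluster, using the ensemble format described above; the table, reducer, and ensemble names are hypothetical:

        TableMapReduceUtil.initTableReducerJob(
            "output_table", MyReducer.class, job,
            null,                                 // default partitioner
            "server,server2,server3:2181:/hbase", // remote ZK ensemble
            null, null,                           // default server class/impl
            true);                                // addDependencyJars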
    • limitNumReduceTasks

      public static void limitNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job) throws IOException
      Ensures that the given number of reduce tasks for the given job configuration does not exceed the number of regions for the given table.
      Parameters:
      table - The table to get the region count for.
      job - The current job to adjust.
      Throws:
      IOException - When retrieving the table details fails.
    • setNumReduceTasks

      public static void setNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job) throws IOException
      Sets the number of reduce tasks for the given job configuration to the number of regions the given table has.
      Parameters:
      table - The table to get the region count for.
      job - The current job to adjust.
      Throws:
      IOException - When retrieving the table details fails.
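      For example, with the hypothetical output table from the reducer sketches above:

        // one reduce task per region of the output table
        TableMapReduceUtil.setNumReduceTasks("output_table", job);
        // or keep an explicitly chosen count, only lowering it if it exceeds the region count
        TableMapReduceUtil.limitNumReduceTasks("output_table", job);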
    • setScannerCaching

      public static void setScannerCaching(org.apache.hadoop.mapreduce.Job job, int batchSize)
      Sets the number of rows to return and cache with each scanner iteration. Higher caching values will enable faster mapreduce jobs at the expense of requiring more heap to contain the cached rows.
      Parameters:
      job - The current job to adjust.
      batchSize - The number of rows to return in batch with each scanner iteration.
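      For example (the value is hypothetical; tune it against the available mapper heap):

        TableMapReduceUtil.setScannerCaching(job, 500); // fetch 500 rows per scanner round trip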
    • addHBaseDependencyJars

      public static void addHBaseDependencyJars(org.apache.hadoop.conf.Configuration conf) throws IOException
      Add HBase and its dependencies (only) to the job configuration.

      This is intended as a low-level API, facilitating code reuse between this class and its mapred counterpart. It is also of use to external tools that need to build a MapReduce job that interacts with HBase but want fine-grained control over the jars shipped to the cluster.

      Parameters:
      conf - The Configuration object to extend with dependencies.
      Throws:
      IOException
    • buildDependencyClasspath

      public static String buildDependencyClasspath(org.apache.hadoop.conf.Configuration conf)
      Returns a classpath string built from the content of the "tmpjars" value in conf. Also exposed to shell scripts via `bin/hbase mapredcp`.
    • addDependencyJars

      public static void addDependencyJars(org.apache.hadoop.mapreduce.Job job) throws IOException
      Add the HBase dependency jars as well as jars for any of the configured job classes to the job configuration, so that JobClient will ship them to the cluster and add them to the DistributedCache.
      Throws:
      IOException
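      A sketch for a hand-assembled job, reusing the hypothetical MyMapper from the driver sketch shown earlier:

        Job job = Job.getInstance(HBaseConfiguration.create(), "custom-hbase-job");
        job.setMapperClass(MyMapper.class); // configured job classes determine which extra jars ship
        TableMapReduceUtil.addDependencyJars(job);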
    • addDependencyJars

      @Deprecated public static void addDependencyJars(org.apache.hadoop.conf.Configuration conf, Class<?>... classes) throws IOException
      Deprecated.
      Since 1.3.0 and will be removed in 3.0.0. Use addDependencyJars(Job) instead.
      Add the jars containing the given classes to the job's configuration such that JobClient will ship them to the cluster and add them to the DistributedCache.
      Throws:
      IOException
    • addDependencyJarsForClasses

      @Private public static void addDependencyJarsForClasses(org.apache.hadoop.conf.Configuration conf, Class<?>... classes) throws IOException
      Add the jars containing the given classes to the job's configuration such that JobClient will ship them to the cluster and add them to the DistributedCache. N.B. that this method at most adds one jar per class given. If there is more than one jar available containing a class with the same name as a given class, we don't define which of those jars might be chosen.
      Parameters:
      conf - The Hadoop Configuration to modify
      classes - will add just those dependencies needed to find the given classes
      Throws:
      IOException - if an underlying library call fails.
    • findOrCreateJar

      private static org.apache.hadoop.fs.Path findOrCreateJar(Class<?> my_class, org.apache.hadoop.fs.FileSystem fs, Map<String,String> packagedClasses) throws IOException
      Finds the Jar for a class or creates it if it doesn't exist. If the class is in a directory in the classpath, it creates a Jar on the fly with the contents of the directory and returns the path to that Jar. If a Jar is created, it is created in the system temporary directory. Otherwise, returns an existing jar that contains a class of the same name. Maintains a mapping from jar contents to the tmp jar created.
      Parameters:
      my_class - the class to find.
      fs - the FileSystem with which to qualify the returned path.
      packagedClasses - a map of class name to path.
      Returns:
      a jar file that contains the class.
      Throws:
      IOException
    • updateMap

      private static void updateMap(String jar, Map<String,String> packagedClasses) throws IOException
      Add entries to packagedClasses corresponding to class files contained in jar.
      Parameters:
      jar - The jar whose contents to list.
      packagedClasses - map[class -> jar]
      Throws:
      IOException
    • findContainingJar

      private static String findContainingJar(Class<?> my_class, Map<String,String> packagedClasses) throws IOException
      Find a jar that contains a class of the same name, if any. It will return a jar file, even if that is not the first thing on the class path that has a class with the same name. Looks first on the classpath and then in the packagedClasses map.
      Parameters:
      my_class - the class to find.
      Returns:
      a jar file that contains the class, or null.
      Throws:
      IOException
    • getJar

      private static String getJar(Class<?> my_class)
      Invoke 'getJar' on a custom JarFinder implementation. Useful for some job configuration contexts (HBASE-8140) and also for testing on MRv2. Check if we have HADOOP-9426.
      Parameters:
      my_class - the class to find.
      Returns:
      a jar file that contains the class, or null.
    • getRegionCount

      private static int getRegionCount(org.apache.hadoop.conf.Configuration conf, TableName tableName) throws IOException
      Throws:
      IOException