Class TableSnapshotInputFormatImpl
java.lang.Object
org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl
Hadoop MR API-agnostic implementation for mapreduce over table snapshots.
-
Nested Class Summary
Modifier and Type | Class | Description
static class | TableSnapshotInputFormatImpl.InputSplit | Implementation class for InputSplit logic common between mapred and mapreduce.
static class | TableSnapshotInputFormatImpl.RecordReader | Implementation class for RecordReader logic common between mapred and mapreduce.
Field Summary
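Modifier and Type | Field | Description
private static final float | DEFAULT_LOCALITY_CUTOFF_MULTIPLIER |
private static final String | LOCALITY_CUTOFF_MULTIPLIER |
static final org.slf4j.Logger | LOG |
static final String | NUM_SPLITS_PER_REGION | For MapReduce jobs running multiple mappers per region, determines the number of splits to generate per region.
protected static final String | RESTORE_DIR_KEY |
static final String | SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION | Whether to determine the snapshot region location from the region location in meta.
static final boolean | SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION_DEFAULT |
static final boolean | SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_DEFAULT |
static final String | SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_KEY | Whether to calculate the block location for splits.
static final String | SNAPSHOT_INPUTFORMAT_ROW_LIMIT_PER_INPUTSPLIT | In some scenarios, scan a limited number of rows on each InputSplit for sampling data extraction.
static final String | SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED | Whether to enable scan metrics on the Scan. Defaults to true.
static final boolean | SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED_DEFAULT |
static final String | SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE | The Scan.ReadType which should be set on the Scan to read the HBase snapshot. Defaults to STREAM.
static final Scan.ReadType | SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE_DEFAULT |
private static final String | SNAPSHOT_NAME_KEY |
static final String | SPLIT_ALGO | For MapReduce jobs running multiple mappers per region, determines which split algorithm to use to find split points for scanners.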
Constructor Summary
Constructor | Description
TableSnapshotInputFormatImpl() |
Method Summary
Modifier and Type | Method | Description
private static List<String> | calculateLocationsForInputSplit(org.apache.hadoop.conf.Configuration conf, TableDescriptor htd, HRegionInfo hri, org.apache.hadoop.fs.Path tableDir) | Compute block locations for snapshot files (which will get the locations for referred hfiles) only when localityEnabled is true.
static void | cleanRestoreDir(org.apache.hadoop.mapreduce.Job job, String snapshotName) | Clean the restore directory after the snapshot scan job.
static Scan | extractScanFromConf(org.apache.hadoop.conf.Configuration conf) |
static List<String> | getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution) |
private static List<String> | getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution, int numTopsAtMost) | This computes the locations to be passed from the InputSplit.
static List<HRegionInfo> | getRegionInfosFromManifest(SnapshotManifest manifest) |
static SnapshotManifest | getSnapshotManifest(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path rootDir, org.apache.hadoop.fs.FileSystem fs) |
private static String | getSnapshotName(org.apache.hadoop.conf.Configuration conf) |
static RegionSplitter.SplitAlgorithm | getSplitAlgo(org.apache.hadoop.conf.Configuration conf) |
static List<TableSnapshotInputFormatImpl.InputSplit> | getSplits(org.apache.hadoop.conf.Configuration conf) |
static List<TableSnapshotInputFormatImpl.InputSplit> | getSplits(Scan scan, SnapshotManifest manifest, List<HRegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf) |
static List<TableSnapshotInputFormatImpl.InputSplit> | getSplits(Scan scan, SnapshotManifest manifest, List<HRegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf, RegionSplitter.SplitAlgorithm sa, int numSplits) |
static void | setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir) | Configures the job to use TableSnapshotInputFormat to read from a snapshot.
static void | setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion) | Configures the job to use TableSnapshotInputFormat to read from a snapshot.
-
Field Details
-
LOG
-
SNAPSHOT_NAME_KEY
-
RESTORE_DIR_KEY
-
LOCALITY_CUTOFF_MULTIPLIER
-
DEFAULT_LOCALITY_CUTOFF_MULTIPLIER
-
SPLIT_ALGO
For MapReduce jobs running multiple mappers per region, determines which split algorithm to use to find split points for scanners.
-
NUM_SPLITS_PER_REGION
For MapReduce jobs running multiple mappers per region, determines the number of splits to generate per region.
-
SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_KEY
Whether to calculate the block location for splits. Defaults to true. If the computing layer runs outside of the HBase cluster, block locality does not matter; setting this value to false skips the calculation and saves some time. The access modifier is public so that this key can be accessed by test classes of both org.apache.hadoop.hbase.mapred and org.apache.hadoop.hbase.mapreduce.
-
SNAPSHOT_INPUTFORMAT_LOCALITY_ENABLED_DEFAULT
-
SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION
Whether to determine the snapshot region location from the region location in meta. This is much faster than computing block locations for splits.
-
SNAPSHOT_INPUTFORMAT_LOCALITY_BY_REGION_LOCATION_DEFAULT
-
SNAPSHOT_INPUTFORMAT_ROW_LIMIT_PER_INPUTSPLIT
In some scenarios, scan a limited number of rows on each InputSplit for sampling data extraction.
-
SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED
Whether to enable scan metrics on the Scan. Defaults to true.
-
SNAPSHOT_INPUTFORMAT_SCAN_METRICS_ENABLED_DEFAULT
-
SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE
The Scan.ReadType which should be set on the Scan to read the HBase snapshot. Defaults to STREAM.
-
SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE_DEFAULT
-
-
Constructor Details
-
TableSnapshotInputFormatImpl
public TableSnapshotInputFormatImpl()
-
-
Method Details
-
getSplits
public static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(org.apache.hadoop.conf.Configuration conf) throws IOException
Throws:
IOException
-
getSplitAlgo
public static RegionSplitter.SplitAlgorithm getSplitAlgo(org.apache.hadoop.conf.Configuration conf) throws IOException
Throws:
IOException
-
getRegionInfosFromManifest
-
getSnapshotManifest
public static SnapshotManifest getSnapshotManifest(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path rootDir, org.apache.hadoop.fs.FileSystem fs) throws IOException
Throws:
IOException
-
extractScanFromConf
public static Scan extractScanFromConf(org.apache.hadoop.conf.Configuration conf) throws IOException
Throws:
IOException
-
getSplits
public static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(Scan scan, SnapshotManifest manifest, List<HRegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf) throws IOException
Throws:
IOException
-
getSplits
public static List<TableSnapshotInputFormatImpl.InputSplit> getSplits(Scan scan, SnapshotManifest manifest, List<HRegionInfo> regionManifests, org.apache.hadoop.fs.Path restoreDir, org.apache.hadoop.conf.Configuration conf, RegionSplitter.SplitAlgorithm sa, int numSplits) throws IOException
Throws:
IOException
-
calculateLocationsForInputSplit
private static List<String> calculateLocationsForInputSplit(org.apache.hadoop.conf.Configuration conf, TableDescriptor htd, HRegionInfo hri, org.apache.hadoop.fs.Path tableDir) throws IOException
Compute block locations for snapshot files (which will get the locations for referred hfiles) only when localityEnabled is true.
Throws:
IOException
-
getBestLocations
private static List<String> getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution, int numTopsAtMost)
This computes the locations to be passed from the InputSplit. MR/YARN schedulers do not take weights into account, and thus treat every location passed from the input split as equal. We do not want to blindly pass all the locations, since we are creating one split per region, and the region's blocks are distributed throughout the cluster unless favored node assignment is used. In the expected stable case, only one location will contain most of the blocks as local. With favored node assignment, on the other hand, 3 nodes will contain highly local blocks. Here we apply a simple heuristic: pass all hosts which have at least 80% (hbase.tablesnapshotinputformat.locality.cutoff.multiplier) as much block locality as the top host with the best locality. Return at most numTopsAtMost locations if there are more than that.
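The cutoff heuristic described for getBestLocations can be illustrated with a small stand-alone sketch. This is not the HBase implementation: the class name LocalityCutoffSketch, the method bestLocations, and the host names are hypothetical, and a plain Map of host name to local block weight stands in for HDFSBlocksDistribution. It assumes only what the description states: hosts are kept if their weight is at least the multiplier (0.8 here) times the top host's weight, and at most numTopsAtMost hosts are returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch of the locality cutoff heuristic.
// A Map<String, Long> of host -> local block weight stands in for
// HDFSBlocksDistribution; this is not the HBase code itself.
class LocalityCutoffSketch {

    /**
     * Returns the hosts whose weight is at least cutoffMultiplier times the
     * weight of the best host, best host first, capped at numTopsAtMost.
     */
    static List<String> bestLocations(Map<String, Long> hostWeights,
                                      double cutoffMultiplier,
                                      int numTopsAtMost) {
        // Sort hosts by descending weight; break ties by host name so the
        // result is deterministic.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(hostWeights.entrySet());
        sorted.sort((a, b) -> {
            int byWeight = Long.compare(b.getValue(), a.getValue());
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });

        List<String> locations = new ArrayList<>();
        if (sorted.isEmpty()) {
            return locations;
        }

        // Every returned host must reach this fraction of the top host's weight.
        double filterWeight = sorted.get(0).getValue() * cutoffMultiplier;
        int limit = Math.min(numTopsAtMost, sorted.size());
        for (int i = 0; i < limit; i++) {
            Map.Entry<String, Long> entry = sorted.get(i);
            if (entry.getValue() < filterWeight) {
                break; // sorted descending, so no later host can qualify
            }
            locations.add(entry.getKey());
        }
        return locations;
    }

    public static void main(String[] args) {
        // Stable case: one host holds most of the blocks locally.
        Map<String, Long> stable = Map.of("host-a", 100L, "host-b", 30L, "host-c", 20L);
        System.out.println(bestLocations(stable, 0.8, 3)); // prints [host-a]

        // Favored-node case: three hosts are highly local.
        Map<String, Long> favored =
            Map.of("host-a", 100L, "host-b", 95L, "host-c", 90L, "host-d", 10L);
        System.out.println(bestLocations(favored, 0.8, 3)); // prints [host-a, host-b, host-c]
    }
}
```

In the stable case only the top host passes the 80% cutoff, so the scheduler receives a single strongly preferred location; in the favored-node case all three highly local hosts are passed through.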
getBestLocations
public static List<String> getBestLocations(org.apache.hadoop.conf.Configuration conf, HDFSBlocksDistribution blockDistribution)
-
getSnapshotName
-
setInput
public static void setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir) throws IOException
Configures the job to use TableSnapshotInputFormat to read from a snapshot.
Parameters:
conf - the job to configure
snapshotName - the name of the snapshot to read from
restoreDir - a temporary directory to restore the snapshot into. The current user should have write permissions to this directory, and it should not be a subdirectory of rootdir. After the job is finished, restoreDir can be deleted.
Throws:
IOException - if an error occurs
-
setInput
public static void setInput(org.apache.hadoop.conf.Configuration conf, String snapshotName, org.apache.hadoop.fs.Path restoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion) throws IOException
Configures the job to use TableSnapshotInputFormat to read from a snapshot.
Parameters:
conf - the job to configure
snapshotName - the name of the snapshot to read from
restoreDir - a temporary directory to restore the snapshot into. The current user should have write permissions to this directory, and it should not be a subdirectory of rootdir. After the job is finished, restoreDir can be deleted.
numSplitsPerRegion - how many input splits to generate per region
splitAlgo - SplitAlgorithm to be used when generating InputSplits
Throws:
IOException - if an error occurs
-
cleanRestoreDir
public static void cleanRestoreDir(org.apache.hadoop.mapreduce.Job job, String snapshotName) throws IOException
Cleans the restore directory after the snapshot scan job.
Parameters:
job - the snapshot scan job
snapshotName - the name of the snapshot to read from
Throws:
IOException - if an error occurs
-