Class WARCInputFormat
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
org.apache.hadoop.hbase.test.util.warc.WARCInputFormat
public class WARCInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
Hadoop InputFormat for MapReduce jobs (the 'new' API) that process data in WARC files. Usage:
```java
// Job.getInstance replaces the deprecated Job(Configuration) constructor.
Job job = Job.getInstance(getConf());
job.setInputFormatClass(WARCInputFormat.class);
```
Mappers should use a key of LongWritable (which is 1 for the first record in a file, 2 for the second record, etc.) and a value of WARCWritable.
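The key numbering described above can be illustrated without a Hadoop cluster. The following is a minimal sketch, not the actual record reader: `NumberedReaderSketch` is a hypothetical class, and only its method names (`nextKeyValue`, `getCurrentKey`, `getCurrentValue`) are borrowed from Hadoop's RecordReader. Plain strings stand in for WARCWritable values.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical stand-in for a record reader; not part of Hadoop or HBase. */
final class NumberedReaderSketch<V> {
    private final Iterator<V> records;
    private long key = 0;   // 0 means no record has been read yet
    private V value;

    NumberedReaderSketch(Iterator<V> records) {
        this.records = records;
    }

    /** Advances to the next record; returns false at the end of the file. */
    boolean nextKeyValue() {
        if (!records.hasNext()) {
            return false;
        }
        value = records.next();
        key++;              // the first record in a file gets key 1
        return true;
    }

    long getCurrentKey() {
        return key;
    }

    V getCurrentValue() {
        return value;
    }

    public static void main(String[] args) {
        // Strings stand in for the WARC records of one file.
        NumberedReaderSketch<String> reader = new NumberedReaderSketch<>(
            Arrays.asList("warcinfo", "request", "response").iterator());
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " -> " + reader.getCurrentValue());
        }
    }
}
```

A mapper therefore sees the key as a record's position within its own file, not as a byte offset and not as a job-wide counter.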
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.Counter
Field Summary
Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE
Constructor Summary
WARCInputFormat()
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,WARCWritable> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) | Opens a WARC file (possibly compressed) for reading, and returns a RecordReader for accessing it. |
| protected boolean | isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path filename) | Always returns false, as WARC files cannot be split. |

Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, getSplits, listStatus, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
Constructor Details
WARCInputFormat
public WARCInputFormat()
Method Details
createRecordReader
public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,WARCWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException, InterruptedException
Opens a WARC file (possibly compressed) for reading, and returns a RecordReader for accessing it.
Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
Throws:
IOException
InterruptedException
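To make "possibly compressed" concrete, here is a minimal sketch of opening a stream that may or may not be gzip-compressed. How this class actually detects compression is not stated here; the `.gz` suffix check, the class `MaybeCompressedSketch`, and its helper methods are all assumptions made for illustration, and only `java.util.zip` from the JDK is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper; not part of Hadoop or HBase. */
final class MaybeCompressedSketch {

    /** Wraps raw in a gzip decoder when the file name ends in ".gz" (an assumed convention). */
    static InputStream open(String fileName, InputStream raw) throws IOException {
        return fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
    }

    /** Demo helper: gzip-compresses text in memory. */
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole in-memory "file" as UTF-8 text, decompressing if needed. */
    static String readAll(String fileName, byte[] content) {
        try (InputStream in = open(fileName, new ByteArrayInputStream(content))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String header = "WARC/1.0\r\nWARC-Type: warcinfo\r\n";
        // Both calls yield the same text; only the second involves decompression.
        System.out.print(readAll("example.warc", header.getBytes(StandardCharsets.UTF_8)));
        System.out.print(readAll("example.warc.gz", gzip(header)));
    }
}
```

Callers of the sketch read the same record bytes either way, which is the property the method description implies for the returned RecordReader.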
isSplitable
protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path filename)
Always returns false, as WARC files cannot be split.
Overrides:
isSplitable in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
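A practical consequence of returning false is that each WARC file becomes a single input split, and so is handled by one mapper regardless of its size. The sketch below is a simplified model of that effect, not Hadoop's actual split computation, which also applies slack to the last chunk and tracks block locations. `SplitSketch` and `splitLengths` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of per-file split planning; not Hadoop code. */
final class SplitSketch {

    /** Returns the byte length of each split planned for one file. */
    static List<Long> splitLengths(long fileLength, long splitSize, boolean splitable) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        List<Long> lengths = new ArrayList<>();
        if (!splitable) {
            // Non-splittable file: one split covering the whole file.
            lengths.add(fileLength);
            return lengths;
        }
        long remaining = fileLength;
        while (remaining > splitSize) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        lengths.add(remaining);  // final, possibly shorter, chunk
        return lengths;
    }

    public static void main(String[] args) {
        long fileLength = 1000L;
        long splitSize = 128L;
        System.out.println("splittable:     " + splitLengths(fileLength, splitSize, true).size() + " splits");
        System.out.println("not splittable: " + splitLengths(fileLength, splitSize, false).size() + " split");
    }
}
```

For WARC inputs this means parallelism comes from the number of files, so many moderately sized files tend to spread across mappers better than a few very large ones.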