Class WARCFileWriter

java.lang.Object
  org.apache.hadoop.hbase.test.util.warc.WARCFileWriter
Writes WARCRecords to a WARC file, using Hadoop's filesystem APIs. (This means you can write to HDFS, S3 or any other filesystem supported by Hadoop.) This implementation is not tied to the MapReduce APIs; that link is provided by the mapred com.martinkl.warc.mapred.WARCOutputFormat and the mapreduce com.martinkl.warc.mapreduce.WARCOutputFormat.

WARCFileWriter keeps track of how much data it has written (optionally gzip-compressed); when the file becomes larger than some threshold, it is automatically closed and a new segment is started. A segment number is appended to the filename for that purpose. The segment number always starts at 00000, and by default a new segment is started when the file size exceeds 1GB. To change the target size for a segment, set the `warc.output.segment.size` key in the Hadoop configuration to the desired number of bytes. (Files may actually be slightly larger than this threshold, since the current record is finished before a new file is opened.)
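As an illustration of typical usage, here is a minimal sketch that copies records from an existing WARC file into segmented, gzip-compressed output. It assumes the companion WARCRecord and WARCFileReader classes in the same package as WARCFileWriter, that WARCFileReader's read() method throws EOFException at the end of input, and that the paths are placeholders:

```java
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileReader;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.hbase.test.util.warc.WARCRecord;
import org.apache.hadoop.io.compress.CompressionCodec;

public class WarcCopyExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Roll over to a new segment after roughly 256 MB instead of the 1GB default.
    conf.setLong("warc.output.segment.size", 256L * 1024 * 1024);

    // Gzip is the most common compression for WARC files; pass null for no compression.
    CompressionCodec codec = WARCFileWriter.getGzipCodec(conf);

    // The writer appends a segment number and file extensions to this prefix.
    WARCFileWriter writer = new WARCFileWriter(conf, codec, new Path("/data/crawl"));

    // Records are parsed from an existing WARC file using the companion reader
    // (assumed API: read() returns the next record, EOFException at end of input).
    WARCFileReader reader = new WARCFileReader(conf, new Path("/data/input.warc.gz"));
    try {
      while (true) {
        WARCRecord record = reader.read();
        writer.write(record); // a new segment is opened automatically past the threshold
      }
    } catch (EOFException end) {
      // reached the end of the input file
    } finally {
      reader.close();
      writer.close();
    }
  }
}
```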
Field Summary

- private long bytesWritten
- private final org.apache.hadoop.io.compress.CompressionCodec codec
- private final org.apache.hadoop.conf.Configuration conf
- private DataOutputStream dataStream
- static final long DEFAULT_MAX_SEGMENT_SIZE
- private final String extensionFormat
- private static final org.slf4j.Logger logger
- private final long maxSegmentSize
- private final org.apache.hadoop.util.Progressable progress
- private long segmentsAttempted
- private long segmentsCreated
- private final org.apache.hadoop.fs.Path workOutputPath
Constructor Summary

- WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath)
  Creates a WARC file, and opens it for writing.
- WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath, org.apache.hadoop.util.Progressable progress)
  Creates a WARC file, and opens it for writing.
Method Summary

- void close()
  Flushes any buffered data and closes the file.
- private void createSegment()
  Creates an output segment file and sets up the output streams to point at it.
- static org.apache.hadoop.io.compress.CompressionCodec getGzipCodec(org.apache.hadoop.conf.Configuration conf)
  Instantiates a Hadoop codec for compressing and decompressing Gzip files.
- void write(WARCRecord record)
  Appends a WARCRecord to the file, in WARC/1.0 format.
- void write(WARCWritable record)
  Appends a WARCRecord wrapped in a WARCWritable to the file.
Field Details

- logger
  private static final org.slf4j.Logger logger
- DEFAULT_MAX_SEGMENT_SIZE
  static final long DEFAULT_MAX_SEGMENT_SIZE
  See Also: Constant Field Values
- conf
  private final org.apache.hadoop.conf.Configuration conf
- codec
  private final org.apache.hadoop.io.compress.CompressionCodec codec
- workOutputPath
  private final org.apache.hadoop.fs.Path workOutputPath
- progress
  private final org.apache.hadoop.util.Progressable progress
- extensionFormat
  private final String extensionFormat
- maxSegmentSize
  private final long maxSegmentSize
- segmentsCreated
  private long segmentsCreated
- segmentsAttempted
  private long segmentsAttempted
- bytesWritten
  private long bytesWritten
- byteStream
- dataStream
  private DataOutputStream dataStream
Constructor Details
WARCFileWriter

public WARCFileWriter(org.apache.hadoop.conf.Configuration conf,
                      org.apache.hadoop.io.compress.CompressionCodec codec,
                      org.apache.hadoop.fs.Path workOutputPath)
               throws IOException

Creates a WARC file, and opens it for writing. If a file with the same name already exists, an attempt number in the filename is incremented until we find a file that doesn't already exist.

Parameters:
conf - The Hadoop configuration.
codec - If null, the file is uncompressed. If non-null, this compression codec will be used. The codec's default file extension is appended to the filename.
workOutputPath - The directory and filename prefix to which the data should be written. We append a segment number and filename extensions to it.
Throws:
IOException
WARCFileWriter

public WARCFileWriter(org.apache.hadoop.conf.Configuration conf,
                      org.apache.hadoop.io.compress.CompressionCodec codec,
                      org.apache.hadoop.fs.Path workOutputPath,
                      org.apache.hadoop.util.Progressable progress)
               throws IOException

Creates a WARC file, and opens it for writing. If a file with the same name already exists, it is *overwritten*. Note that this differs from the behaviour of the other constructor; this inconsistency will probably be fixed in a future version.

Parameters:
conf - The Hadoop configuration.
codec - If null, the file is uncompressed. If non-null, this compression codec will be used. The codec's default file extension is appended to the filename.
workOutputPath - The directory and filename prefix to which the data should be written. We append a segment number and filename extensions to it.
progress - An object used by the mapred API for tracking a task's progress.
Throws:
IOException
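A minimal sketch of how this constructor might be used from the mapred side, where the framework supplies the Progressable; the factory class here is hypothetical, not part of this package:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.Progressable;

// Hypothetical helper that opens a writer the way a mapred OutputFormat would.
public class WarcWriterFactory {
  static WARCFileWriter open(Configuration conf, Path workOutputPath,
                             Progressable progress) throws IOException {
    CompressionCodec codec = WARCFileWriter.getGzipCodec(conf);
    // Reporting through the Progressable keeps long-running writes from being
    // killed by the framework's task timeout.
    // Caution: unlike the three-argument constructor, this one overwrites an
    // existing file of the same name rather than incrementing an attempt number.
    return new WARCFileWriter(conf, codec, workOutputPath, progress);
  }
}
```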
Method Details
getGzipCodec

public static org.apache.hadoop.io.compress.CompressionCodec getGzipCodec(org.apache.hadoop.conf.Configuration conf)

Instantiates a Hadoop codec for compressing and decompressing Gzip files. This is the most common compression applied to WARC files.

Parameters:
conf - The Hadoop configuration.
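For instance, the returned codec can be handed straight to a constructor, or replaced with null for uncompressed output (the paths here are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.io.compress.CompressionCodec;

public class CodecChoiceExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // Gzip-compressed output, the common case for WARC archives:
    CompressionCodec gzip = WARCFileWriter.getGzipCodec(conf);
    WARCFileWriter compressed = new WARCFileWriter(conf, gzip, new Path("/data/out"));
    compressed.close();

    // Uncompressed output: pass null instead of a codec.
    WARCFileWriter plain = new WARCFileWriter(conf, null, new Path("/data/out-plain"));
    plain.close();
  }
}
```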
createSegment

private void createSegment() throws IOException

Creates an output segment file and sets up the output streams to point at it. If the file already exists, it retries with a different filename. This is a bit nasty; after all, FileOutputFormat's work directory concept is supposed to prevent filename clashes, but it looks like Amazon Elastic MapReduce prevents the use of per-task work directories if the output of a job is on S3. TODO: Investigate this and find a better solution.

Throws:
IOException
write

public void write(WARCRecord record) throws IOException

Appends a WARCRecord to the file, in WARC/1.0 format.

Parameters:
record - The record to be written.
Throws:
IOException
write

public void write(WARCWritable record) throws IOException

Appends a WARCRecord wrapped in a WARCWritable to the file.

Parameters:
record - The wrapper around the record to be written.
Throws:
IOException
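A short sketch contrasting the two overloads; it assumes WARCWritable can be built around an existing record via a WARCRecord-taking constructor, as its role as a Hadoop Writable wrapper suggests:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.hbase.test.util.warc.WARCRecord;
import org.apache.hadoop.hbase.test.util.warc.WARCWritable;

public class WriteOverloadsExample {
  // Both calls append the same record; the WARCWritable overload simply
  // unwraps the record before writing (assumed WARCWritable(WARCRecord) ctor).
  static void emitTwice(WARCFileWriter writer, WARCRecord record) throws IOException {
    writer.write(record);                   // direct append in WARC/1.0 format
    writer.write(new WARCWritable(record)); // same bytes, via the Writable wrapper
  }
}
```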
close

public void close() throws IOException

Flushes any buffered data and closes the file.

Throws:
IOException