Class WARCFileWriter

java.lang.Object
org.apache.hadoop.hbase.test.util.warc.WARCFileWriter

public class WARCFileWriter extends Object
Writes WARCRecords to a WARC file, using Hadoop's filesystem APIs. (This means you can write to HDFS, S3 or any other filesystem supported by Hadoop.) This implementation is not tied to the MapReduce APIs; that link is provided by the mapred com.martinkl.warc.mapred.WARCOutputFormat and the mapreduce com.martinkl.warc.mapreduce.WARCOutputFormat.

WARCFileWriter keeps track of how much data it has written (optionally gzip-compressed). When the file becomes larger than a threshold, it is automatically closed and a new segment is started. A segment number is appended to the filename for that purpose. The segment number always starts at 00000, and by default a new segment is started when the file size exceeds 1 GB.

To change the target size for a segment, set the `warc.output.segment.size` key in the Hadoop configuration to the desired number of bytes. (Files may actually be a bit larger than this threshold, since the current record is always written in full before a new file is opened.)
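The rollover behaviour described above can be sketched without Hadoop. This is a minimal illustration, not the actual WARCFileWriter code: the class `SegmentRolloverSketch`, the in-memory segments and the `.seg-%05d.warc` naming pattern are all assumptions made for the example. It shows why a segment can exceed the threshold: the size check happens between records, never in the middle of one.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative stand-in for the rollover behaviour; not the actual WARCFileWriter code.
final class SegmentRolloverSketch {
    private final String prefix;        // plays the role of workOutputPath
    private final long maxSegmentSize;  // plays the role of warc.output.segment.size
    private final List<String> names = new ArrayList<>();
    private final List<ByteArrayOutputStream> segments = new ArrayList<>();
    private long bytesWritten;          // bytes written to the current segment

    SegmentRolloverSketch(String prefix, long maxSegmentSize) {
        this.prefix = prefix;
        this.maxSegmentSize = maxSegmentSize;
        openSegment();
    }

    private void openSegment() {
        // Assumed naming pattern: zero-padded segment number, starting at 00000.
        names.add(String.format(Locale.ROOT, "%s.seg-%05d.warc", prefix, segments.size()));
        segments.add(new ByteArrayOutputStream());
        bytesWritten = 0;
    }

    void write(byte[] record) {
        // Roll over only between records, so a record is never split across segments.
        if (bytesWritten > maxSegmentSize) {
            openSegment();
        }
        segments.get(segments.size() - 1).write(record, 0, record.length);
        bytesWritten += record.length;
    }

    List<String> segmentNames() {
        return names;
    }

    int segmentSize(int index) {
        return segments.get(index).size();
    }
}
```

With a threshold of 10 bytes, two 8-byte records both land in the first segment (16 bytes, above the threshold), and only the third record opens segment 00001.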
  • Constructor Details

    • WARCFileWriter

      public WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath) throws IOException
      Creates a WARC file, and opens it for writing. If a file with the same name already exists, an attempt number in the filename is incremented until we find a file that doesn't already exist.
      Parameters:
      conf - The Hadoop configuration.
      codec - If null, the file is uncompressed. If non-null, this compression codec will be used. The codec's default file extension is appended to the filename.
      workOutputPath - The directory and filename prefix to which the data should be written. We append a segment number and filename extensions to it.
      Throws:
      IOException
    • WARCFileWriter

      public WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath, org.apache.hadoop.util.Progressable progress) throws IOException
      Creates a WARC file, and opens it for writing. If a file with the same name already exists, it is *overwritten*. Note that this differs from the behaviour of the other constructor, which picks a new filename instead. This inconsistency will probably change in a future version.
      Parameters:
      conf - The Hadoop configuration.
      codec - If null, the file is uncompressed. If non-null, this compression codec will be used. The codec's default file extension is appended to the filename.
      workOutputPath - The directory and filename prefix to which the data should be written. We append a segment number and filename extensions to it.
      progress - An object used by the mapred API for tracking a task's progress.
      Throws:
      IOException
  • Method Details

    • getGzipCodec

      public static org.apache.hadoop.io.compress.CompressionCodec getGzipCodec(org.apache.hadoop.conf.Configuration conf)
      Instantiates a Hadoop codec for compressing and decompressing Gzip files. This is the most common compression applied to WARC files.
      Parameters:
      conf - The Hadoop configuration.
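For readers without Hadoop on the classpath, the effect of writing through a gzip codec can be approximated with `java.util.zip`. The sketch below is a stand-in, not the Hadoop codec itself: the class `GzipRoundTripSketch` and its methods are hypothetical. It compresses a payload, decompresses it again, and appends `.gz`, which is the default extension of Hadoop's gzip codec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper using java.util.zip in place of a Hadoop CompressionCodec.
final class GzipRoundTripSketch {
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data, 0, data.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Mirrors the "codec's default file extension is appended" rule for gzip.
    static String fileName(String base, boolean gzip) {
        return gzip ? base + ".gz" : base;
    }
}
```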
    • createSegment

      private void createSegment() throws IOException
      Creates an output segment file and sets up the output streams to point at it. If the file already exists, retries with a different filename. This is a bit nasty -- after all, FileOutputFormat's work directory concept is supposed to prevent filename clashes -- but it looks like Amazon Elastic MapReduce prevents use of per-task work directories if the output of a job is on S3. TODO: Investigate this and find a better solution.
      Throws:
      IOException
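The retry-on-clash strategy can be illustrated with `java.nio.file` instead of Hadoop's filesystem API. This is a sketch under assumptions: the class `AttemptProbeSketch`, the helper `tempDir` and the `.seg-%05d.attempt-%05d.warc` naming pattern are invented for the example and are not taken from the WARCFileWriter source. It increments an attempt number until file creation succeeds.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical illustration of "retry with a different filename" on a local filesystem.
final class AttemptProbeSketch {
    static Path createUnique(Path dir, String prefix, int segment) {
        for (int attempt = 0; ; attempt++) {
            // Assumed naming pattern; the real writer's pattern may differ.
            String name = String.format(Locale.ROOT,
                    "%s.seg-%05d.attempt-%05d.warc", prefix, segment, attempt);
            try {
                // createFile fails atomically if the file exists, avoiding a check-then-create race.
                return Files.createFile(dir.resolve(name));
            } catch (FileAlreadyExistsException e) {
                // Name is taken: fall through and try the next attempt number.
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Convenience for trying the sketch out in a scratch directory.
    static Path tempDir() {
        try {
            return Files.createTempDirectory("warc-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `createUnique` twice with the same prefix and segment number yields attempt 00000 and then attempt 00001, so neither call clobbers the other's file.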
    • write

      public void write(WARCRecord record) throws IOException
      Appends a WARCRecord to the file, in WARC/1.0 format.
      Parameters:
      record - The record to be written.
      Throws:
      IOException
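As a rough picture of what a WARC/1.0 record looks like on disk, the sketch below serializes one by hand: a `WARC/1.0` version line, named header fields, an empty line, the content block, and two trailing CRLFs. The class `WarcRecordFormatSketch` and the header values in `example()` are made up for illustration; this is not how WARCRecord itself is implemented.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hand-rolled illustration of the WARC/1.0 record layout; not the WARCRecord implementation.
final class WarcRecordFormatSketch {
    private static final String CRLF = "\r\n";

    static byte[] serialize(Map<String, String> headers, byte[] content) {
        StringBuilder head = new StringBuilder("WARC/1.0").append(CRLF);
        for (Map.Entry<String, String> field : headers.entrySet()) {
            head.append(field.getKey()).append(": ").append(field.getValue()).append(CRLF);
        }
        // Content-Length counts the bytes of the content block, not characters.
        head.append("Content-Length: ").append(content.length).append(CRLF);
        head.append(CRLF); // empty line separates the header from the content block

        byte[] headBytes = head.toString().getBytes(StandardCharsets.UTF_8);
        byte[] trailer = (CRLF + CRLF).getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(content, 0, content.length);
        out.write(trailer, 0, trailer.length); // every record ends with two CRLFs
        return out.toByteArray();
    }

    // Example record with made-up header values.
    static byte[] example() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("WARC-Type", "resource");
        headers.put("WARC-Date", "2014-01-01T00:00:00Z");
        headers.put("WARC-Record-ID", "<urn:uuid:00000000-0000-0000-0000-000000000000>");
        return serialize(headers, "hello".getBytes(StandardCharsets.UTF_8));
    }
}
```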
    • write

      public void write(WARCWritable record) throws IOException
      Appends a WARCRecord wrapped in a WARCWritable to the file.
      Parameters:
      record - The wrapper around the record to be written.
      Throws:
      IOException
    • close

      public void close() throws IOException
      Flushes any buffered data and closes the file.
      Throws:
      IOException