Class WARCFileWriter

java.lang.Object
  org.apache.hadoop.hbase.test.util.warc.WARCFileWriter
Writes WARCRecords to a WARC file, using Hadoop's filesystem APIs. (This means you can write to HDFS, S3 or any other filesystem supported by Hadoop.) This implementation is not tied to the MapReduce APIs; that link is provided by the mapred com.martinkl.warc.mapred.WARCOutputFormat and the mapreduce com.martinkl.warc.mapreduce.WARCOutputFormat.

WARCFileWriter keeps track of how much data it has written (optionally gzip-compressed); when the file becomes larger than some threshold, it is automatically closed and a new segment is started. A segment number is appended to the filename for that purpose. The segment number always starts at 00000, and by default a new segment is started when the file size exceeds 1GB. To change the target size for a segment, set the `warc.output.segment.size` key in the Hadoop configuration to the desired number of bytes. (Files may actually be slightly larger than this threshold, since the current record is finished before a new file is opened.)
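As an illustration of typical usage, here is a minimal sketch that copies records from an existing WARC file into segmented, gzip-compressed output. It assumes the companion WARCRecord and WARCFileReader classes in the same package as WARCFileWriter, that WARCFileReader's read() method throws EOFException at the end of input, and that the paths are placeholders:

```java
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileReader;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.hbase.test.util.warc.WARCRecord;
import org.apache.hadoop.io.compress.CompressionCodec;

public class WarcCopyExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Roll over to a new segment after roughly 256 MB instead of the 1GB default.
    conf.setLong("warc.output.segment.size", 256L * 1024 * 1024);

    // Gzip is the most common compression for WARC files; pass null for no compression.
    CompressionCodec codec = WARCFileWriter.getGzipCodec(conf);

    // The writer appends a segment number and file extensions to this prefix.
    WARCFileWriter writer = new WARCFileWriter(conf, codec, new Path("/data/crawl"));

    // Records are parsed from an existing WARC file using the companion reader
    // (assumed API: read() returns the next record, EOFException at end of input).
    WARCFileReader reader = new WARCFileReader(conf, new Path("/data/input.warc.gz"));
    try {
      while (true) {
        WARCRecord record = reader.read();
        writer.write(record); // a new segment is opened automatically past the threshold
      }
    } catch (EOFException end) {
      // reached the end of the input file
    } finally {
      reader.close();
      writer.close();
    }
  }
}
```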
Field Summary

- private long bytesWritten
- private final org.apache.hadoop.io.compress.CompressionCodec codec
- private final org.apache.hadoop.conf.Configuration conf
- private DataOutputStream dataStream
- static final long DEFAULT_MAX_SEGMENT_SIZE
- private final String extensionFormat
- private static final org.slf4j.Logger logger
- private final long maxSegmentSize
- private final org.apache.hadoop.util.Progressable progress
- private long segmentsAttempted
- private long segmentsCreated
- private final org.apache.hadoop.fs.Path workOutputPath
Constructor Summary

- WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath)
  Creates a WARC file, and opens it for writing.
- WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath, org.apache.hadoop.util.Progressable progress)
  Creates a WARC file, and opens it for writing.
Method Summary

- void close()
  Flushes any buffered data and closes the file.
- private void createSegment()
  Creates an output segment file and sets up the output streams to point at it.
- static org.apache.hadoop.io.compress.CompressionCodec getGzipCodec(org.apache.hadoop.conf.Configuration conf)
  Instantiates a Hadoop codec for compressing and decompressing Gzip files.
- void write(WARCRecord record)
  Appends a WARCRecord to the file, in WARC/1.0 format.
- void write(WARCWritable record)
  Appends a WARCRecord wrapped in a WARCWritable to the file.
Field Details

- logger
  private static final org.slf4j.Logger logger
- DEFAULT_MAX_SEGMENT_SIZE
  static final long DEFAULT_MAX_SEGMENT_SIZE
  See Also: Constant Field Values
- conf
  private final org.apache.hadoop.conf.Configuration conf
- codec
  private final org.apache.hadoop.io.compress.CompressionCodec codec
- workOutputPath
  private final org.apache.hadoop.fs.Path workOutputPath
- progress
  private final org.apache.hadoop.util.Progressable progress
- extensionFormat
  private final String extensionFormat
- maxSegmentSize
  private final long maxSegmentSize
- segmentsCreated
  private long segmentsCreated
- segmentsAttempted
  private long segmentsAttempted
- bytesWritten
  private long bytesWritten
- byteStream
- dataStream
  private DataOutputStream dataStream
Constructor Details
WARCFileWriter

public WARCFileWriter(org.apache.hadoop.conf.Configuration conf,
                      org.apache.hadoop.io.compress.CompressionCodec codec,
                      org.apache.hadoop.fs.Path workOutputPath)
               throws IOException

Creates a WARC file, and opens it for writing. If a file with the same name already exists, an attempt number in the filename is incremented until we find a file that doesn't already exist.

Parameters:
conf - The Hadoop configuration.
codec - If null, the file is uncompressed. If non-null, this compression codec will be used. The codec's default file extension is appended to the filename.
workOutputPath - The directory and filename prefix to which the data should be written. We append a segment number and filename extensions to it.
Throws:
IOException
WARCFileWriter

public WARCFileWriter(org.apache.hadoop.conf.Configuration conf,
                      org.apache.hadoop.io.compress.CompressionCodec codec,
                      org.apache.hadoop.fs.Path workOutputPath,
                      org.apache.hadoop.util.Progressable progress)
               throws IOException

Creates a WARC file, and opens it for writing. If a file with the same name already exists, it is *overwritten*. Note that this differs from the behaviour of the other constructor; this inconsistency will probably be fixed in a future version.

Parameters:
conf - The Hadoop configuration.
codec - If null, the file is uncompressed. If non-null, this compression codec will be used. The codec's default file extension is appended to the filename.
workOutputPath - The directory and filename prefix to which the data should be written. We append a segment number and filename extensions to it.
progress - An object used by the mapred API for tracking a task's progress.
Throws:
IOException
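A minimal sketch of how this constructor might be used from the mapred side, where the framework supplies the Progressable; the factory class here is hypothetical, not part of this package:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.Progressable;

// Hypothetical helper that opens a writer the way a mapred OutputFormat would.
public class WarcWriterFactory {
  static WARCFileWriter open(Configuration conf, Path workOutputPath,
                             Progressable progress) throws IOException {
    CompressionCodec codec = WARCFileWriter.getGzipCodec(conf);
    // Reporting through the Progressable keeps long-running writes from being
    // killed by the framework's task timeout.
    // Caution: unlike the three-argument constructor, this one overwrites an
    // existing file of the same name rather than incrementing an attempt number.
    return new WARCFileWriter(conf, codec, workOutputPath, progress);
  }
}
```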
Method Details
getGzipCodec

public static org.apache.hadoop.io.compress.CompressionCodec getGzipCodec(org.apache.hadoop.conf.Configuration conf)

Instantiates a Hadoop codec for compressing and decompressing Gzip files. This is the most common compression applied to WARC files.

Parameters:
conf - The Hadoop configuration.
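For instance, the returned codec can be handed straight to a constructor, or replaced with null for uncompressed output (the paths here are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.io.compress.CompressionCodec;

public class CodecChoiceExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // Gzip-compressed output, the common case for WARC archives:
    CompressionCodec gzip = WARCFileWriter.getGzipCodec(conf);
    WARCFileWriter compressed = new WARCFileWriter(conf, gzip, new Path("/data/out"));
    compressed.close();

    // Uncompressed output: pass null instead of a codec.
    WARCFileWriter plain = new WARCFileWriter(conf, null, new Path("/data/out-plain"));
    plain.close();
  }
}
```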
createSegment

private void createSegment() throws IOException

Creates an output segment file and sets up the output streams to point at it. If the file already exists, it retries with a different filename. This is a bit nasty; after all, FileOutputFormat's work directory concept is supposed to prevent filename clashes, but it looks like Amazon Elastic MapReduce prevents the use of per-task work directories if the output of a job is on S3. TODO: Investigate this and find a better solution.

Throws:
IOException
write

public void write(WARCRecord record) throws IOException

Appends a WARCRecord to the file, in WARC/1.0 format.

Parameters:
record - The record to be written.
Throws:
IOException
write

public void write(WARCWritable record) throws IOException

Appends a WARCRecord wrapped in a WARCWritable to the file.

Parameters:
record - The wrapper around the record to be written.
Throws:
IOException
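A short sketch contrasting the two overloads; it assumes WARCWritable can be built around an existing record via a WARCRecord-taking constructor, as its role as a Hadoop Writable wrapper suggests:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.hbase.test.util.warc.WARCRecord;
import org.apache.hadoop.hbase.test.util.warc.WARCWritable;

public class WriteOverloadsExample {
  // Both calls append the same record; the WARCWritable overload simply
  // unwraps the record before writing (assumed WARCWritable(WARCRecord) ctor).
  static void emitTwice(WARCFileWriter writer, WARCRecord record) throws IOException {
    writer.write(record);                   // direct append in WARC/1.0 format
    writer.write(new WARCWritable(record)); // same bytes, via the Writable wrapper
  }
}
```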
close

public void close() throws IOException

Flushes any buffered data and closes the file.

Throws:
IOException