Class WARCRecord
java.lang.Object
org.apache.hadoop.hbase.test.util.warc.WARCRecord
Immutable implementation of a record in a WARC file. You create a WARCRecord by parsing it out of a DataInput stream.
The file format is documented in the ISO Standard. In a nutshell, it's a textual format consisting of lines delimited by `\r\n`. Each record has the following structure:
- A line indicating the WARC version number, such as `WARC/1.0`.
- Several header lines (in key-value format, similar to HTTP or email headers), giving information about the record. The header is terminated by an empty line.
- A body consisting of raw bytes (the number of bytes is indicated in one of the headers).
- A final separator of `\r\n\r\n` before the next record starts.
The different types of records are documented on WARCRecord.Header.getRecordType().
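The four-part layout described above can be sketched as a small reader. This is a minimal illustration under stated assumptions, not the WARCRecord implementation: the class `WarcLayoutSketch` and its `Parsed` holder are hypothetical names, the body length is assumed to come from the `Content-Length` header defined by the WARC specification, and folded (continuation) header lines are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the record layout; not the WARCRecord source.
public class WarcLayoutSketch {

  /** Hypothetical holder for the parts of one record. */
  public static final class Parsed {
    public final String version;
    public final Map<String, String> headers;
    public final byte[] body;

    Parsed(String version, Map<String, String> headers, byte[] body) {
      this.version = version;
      this.headers = headers;
      this.body = body;
    }
  }

  /** Reads bytes up to the next CRLF and returns them without the delimiter. */
  public static String readLine(DataInput in) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    int prev = -1;
    while (true) {
      int b = in.readUnsignedByte(); // throws EOFException if the stream ends
      if (prev == '\r' && b == '\n') {
        break;
      }
      if (prev != -1) {
        buf.write(prev);
      }
      prev = b;
    }
    return new String(buf.toByteArray(), StandardCharsets.UTF_8);
  }

  public static Parsed readRecord(DataInput in) throws IOException {
    // 1. Version line, such as WARC/1.0.
    String version = readLine(in);
    if (!version.startsWith("WARC/")) {
      throw new IOException("Not a WARC version line: " + version);
    }
    // 2. Key-value header lines, terminated by an empty line.
    Map<String, String> headers = new LinkedHashMap<>();
    String line;
    while (!(line = readLine(in)).isEmpty()) {
      int colon = line.indexOf(':');
      if (colon < 0) {
        throw new IOException("Malformed header line: " + line);
      }
      headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
    }
    // 3. Raw body; its length is assumed to be given by Content-Length.
    String length = headers.get("Content-Length");
    if (length == null) {
      throw new IOException("Missing Content-Length header");
    }
    byte[] body = new byte[Integer.parseInt(length)];
    in.readFully(body);
    // 4. Final separator of two CRLF pairs.
    byte[] sep = new byte[4];
    in.readFully(sep);
    if (sep[0] != '\r' || sep[1] != '\n' || sep[2] != '\r' || sep[3] != '\n') {
      throw new IOException("Missing record separator");
    }
    return new Parsed(version, headers, body);
  }

  public static void main(String[] args) throws IOException {
    String text = "WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n";
    DataInput in = new DataInputStream(
        new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
    Parsed p = readRecord(in);
    System.out.println(p.version + " " + p.headers + " " + p.body.length + " body bytes");
  }
}
```

Reading the body with `readFully` rather than scanning for the separator matters because the body is raw bytes and may itself contain `\r\n\r\n`.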
Nested Class Summary
-
Field Summary
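| Modifier and Type | Field | Description |
| --- | --- | --- |
| private final byte[] | | |
| private static final Pattern | CONTINUATION_PATTERN | |
| private static final String | CRLF | |
| private static final byte[] | CRLF_BYTES | |
| private final WARCRecord.Header | header | |
| private static final Pattern | VERSION_PATTERN | |
| static final String | WARC_VERSION | |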
-
Constructor Summary
| Constructor | Description |
| --- | --- |
| WARCRecord(DataInput in) | Creates a new WARCRecord by parsing it out of a DataInput stream. |
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| byte[] | getContent() | Returns the body of the record, as an unparsed raw array of bytes. |
| WARCRecord.Header | getHeader() | Returns the parsed header structure of the WARC record. |
| private static WARCRecord.Header | readHeader(DataInput in) | |
| private static String | readLine | |
| private static void | readSeparator | |
| String | toString() | Returns a human-readable string representation of the record. |
| void | write(DataOutput out) | Writes this record to a DataOutput stream. |
-
Field Details
- WARC_VERSION
- VERSION_PATTERN
- CONTINUATION_PATTERN
- CRLF
- CRLF_BYTES
- header
Constructor Details
WARCRecord
Creates a new WARCRecord by parsing it out of a DataInput stream.
- Parameters: in - The input source from which one record will be read.
- Throws: IOException
Method Details
readHeader
- Throws: IOException
readLine
- Throws: IOException
readSeparator
- Throws: IOException
getHeader
Returns the parsed header structure of the WARC record.
getContent
Returns the body of the record, as an unparsed raw array of bytes. The content of the body depends on the type of record (see WARCRecord.Header.getRecordType()). For example, for a record of type `response`, the body consists of the full HTTP response returned by the server (HTTP headers followed by the body).
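Because the body of a `response` record is a complete HTTP response, a caller typically has to split it into HTTP headers and payload itself. The following is a minimal sketch, assuming the byte array came from getContent() and that the HTTP header block ends at the first empty line (`\r\n\r\n`), as in HTTP/1.x. The class `ResponseBodySketch` and its methods are hypothetical helpers, not part of WARCRecord.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for splitting the body of a `response` record.
public class ResponseBodySketch {

  /** Returns the index just past the first CRLF CRLF, or -1 if there is none. */
  public static int payloadOffset(byte[] content) {
    for (int i = 0; i + 3 < content.length; i++) {
      if (content[i] == '\r' && content[i + 1] == '\n'
          && content[i + 2] == '\r' && content[i + 3] == '\n') {
        return i + 4;
      }
    }
    return -1;
  }

  /** Returns the status line and HTTP headers, without the trailing empty line. */
  public static String httpHeaders(byte[] content) {
    int off = payloadOffset(content);
    if (off < 0) {
      throw new IllegalArgumentException("No end of HTTP headers found");
    }
    return new String(content, 0, off - 4, StandardCharsets.ISO_8859_1);
  }

  /** Returns the bytes that follow the HTTP headers. */
  public static byte[] payload(byte[] content) {
    int off = payloadOffset(content);
    if (off < 0) {
      throw new IllegalArgumentException("No end of HTTP headers found");
    }
    return Arrays.copyOfRange(content, off, content.length);
  }

  public static void main(String[] args) {
    byte[] content = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhi"
        .getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(httpHeaders(content));
    System.out.println(payload(content).length + " payload bytes");
  }
}
```

The payload is kept as bytes because it may be binary or compressed; only the header block is decoded to text.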
write
Writes this record to a DataOutput stream. In some edge cases, the output may not be byte-for-byte identical to what was parsed from a DataInput. However, it has the same meaning and should not lose any information.
- Parameters: out - The output stream to which this record should be appended.
- Throws: IOException
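For illustration, serializing a record in the layout described in the class overview might look like the sketch below. It is not the actual write implementation: `WarcWriteSketch` and `writeRecord` are hypothetical names, the version line is fixed to `WARC/1.0`, and the sketch assumes the body length is carried in a `Content-Length` header, which it sets from the body.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of writing one record; not the WARCRecord source.
public class WarcWriteSketch {
  private static final String CRLF = "\r\n";

  public static void writeRecord(DataOutput out, Map<String, String> headers, byte[] body)
      throws IOException {
    // Copy the headers so the caller's map is not modified, then make
    // Content-Length agree with the actual body size (an assumption of this sketch).
    Map<String, String> all = new LinkedHashMap<>(headers);
    all.put("Content-Length", Integer.toString(body.length));

    StringBuilder head = new StringBuilder("WARC/1.0").append(CRLF); // version line
    for (Map.Entry<String, String> e : all.entrySet()) {
      head.append(e.getKey()).append(": ").append(e.getValue()).append(CRLF);
    }
    head.append(CRLF); // empty line terminates the header

    out.write(head.toString().getBytes(StandardCharsets.UTF_8));
    out.write(body);                                             // raw body bytes
    out.write((CRLF + CRLF).getBytes(StandardCharsets.UTF_8));   // record separator
  }

  public static void main(String[] args) throws IOException {
    Map<String, String> headers = new LinkedHashMap<>();
    headers.put("WARC-Type", "resource");
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    writeRecord(new DataOutputStream(bytes), headers, "abc".getBytes(StandardCharsets.UTF_8));
    System.out.println(bytes.size() + " bytes written");
  }
}
```

A writer of this shape always emits headers as `Key: value`, so input that used different spacing would not round-trip byte-for-byte even though no information is lost. Whether that is one of the edge cases the method description refers to is not stated in the source.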
toString
Returns a human-readable string representation of the record.