Class WARCFileReader
java.lang.Object
org.apache.hadoop.hbase.test.util.warc.WARCFileReader
Reads `WARCRecord`s from a WARC file, using Hadoop's filesystem APIs. (This means you can read from HDFS, S3 or any other filesystem supported by Hadoop.) This implementation is not tied to the MapReduce APIs -- that link is provided by the mapred com.martinkl.warc.mapred.WARCInputFormat and the mapreduce com.martinkl.warc.mapreduce.WARCInputFormat.
-
Field Summary
Modifier and Type | Field | Description
private long | bytesRead |
private DataInputStream | dataStream |
private final long | fileSize |
private static final org.slf4j.Logger | logger |
private long | recordsRead |
-
Constructor Summary
Constructor | Description
WARCFileReader(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path filePath) | Opens a file for reading.
-
Method Summary
Modifier and Type | Method | Description
void | close() | Closes the file.
long | getBytesRead() | Returns the number of bytes that have been read from the file since it was opened.
float | getProgress() | Returns the proportion of the file that has been read, as a number between 0.0 and 1.0.
long | getRecordsRead() | Returns the number of records that have been read since the file was opened.
WARCRecord | read() | Reads the next record from the file.
-
Field Details
-
logger
-
fileSize
-
byteStream
-
dataStream
-
bytesRead
-
recordsRead
-
-
Constructor Details
-
WARCFileReader
public WARCFileReader(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path filePath) throws IOException
Opens a file for reading. If the filename ends in `.gz`, it is automatically decompressed on the fly.
Parameters:
conf - The Hadoop configuration.
filePath - The Hadoop path to the file that should be read.
Throws:
IOException
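The suffix check described above can be illustrated with a minimal, JDK-only sketch. It is not the class's actual code: `GzipBySuffix`, `open` and `roundTrip` are hypothetical names, and the sketch reads local files through `java.nio.file` rather than through Hadoop's filesystem APIs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of WARCFileReader.
class GzipBySuffix {
    /** Opens a local file, adding a gzip decoder only when the name ends in ".gz". */
    static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        if (!file.getFileName().toString().endsWith(".gz")) {
            return raw;
        }
        try {
            return new GZIPInputStream(raw); // reads and validates the gzip header
        } catch (IOException e) {
            raw.close(); // do not leak the file handle if the header is invalid
            throw e;
        }
    }

    /** Writes text to a temporary file with the given suffix and reads it back via open(). */
    static String roundTrip(String text, String suffix) {
        try {
            Path tmp = Files.createTempFile("sample", suffix);
            try {
                byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
                if (suffix.endsWith(".gz")) {
                    try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
                        out.write(bytes);
                    }
                } else {
                    Files.write(tmp, bytes);
                }
                try (InputStream in = open(tmp)) {
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("WARC/1.0", ".warc.gz"));
        System.out.println(roundTrip("WARC/1.0", ".warc"));
    }
}
```

Callers of `open` see plain bytes in both cases, which is the behaviour the constructor description promises for `.gz` files.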
-
-
Method Details
-
read
Reads the next record from the file.
Returns:
The record that was read.
Throws:
IOException
-
close
Closes the file. No more reading is possible after the file has been closed.
Throws:
IOException
-
getRecordsRead
Returns the number of records that have been read since the file was opened.
-
getBytesRead
Returns the number of bytes that have been read from the file since it was opened. If the file is compressed, this counts compressed bytes, not decompressed bytes.
-
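One way to get a compressed-byte count like the one `getBytesRead` describes is to place a counting wrapper underneath the decompressor. The sketch below is an assumption about how this could be done, not the class's actual implementation. `CompressedByteCount`, `CountingStream`, `gzip` and `readAll` are hypothetical names, and the sketch uses in-memory streams from the JDK in place of a Hadoop file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration, not part of WARCFileReader.
class CompressedByteCount {
    /** Counts the bytes pulled from the wrapped stream. */
    static class CountingStream extends FilterInputStream {
        private long count = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b != -1) count++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = in.skip(n);
            count += skipped;
            return skipped;
        }

        long getCount() {
            return count;
        }
    }

    /** Compresses a byte array with gzip, in memory. */
    static byte[] gzip(byte[] plain) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                out.write(plain);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns {compressed bytes consumed, decompressed bytes produced}. */
    static long[] readAll(byte[] gz) {
        try {
            // The counter sits below the gzip decoder, so it sees compressed bytes only.
            CountingStream counter = new CountingStream(new ByteArrayInputStream(gz));
            try (InputStream in = new GZIPInputStream(counter)) {
                long produced = in.readAllBytes().length;
                return new long[] { counter.getCount(), produced };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the counter wraps the raw stream rather than the decoder's output, its total can be compared directly against the on-disk file size.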
getProgress
Returns the proportion of the file that has been read, as a number between 0.0 and 1.0.
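A proportion in this range can be derived from a byte counter and the file size. The following is a minimal sketch under stated assumptions, not the documented implementation: `ReadProgress` and `progress` are hypothetical names, and returning 0.0 for an empty file and clamping the result are choices made for the example.

```java
// Hypothetical illustration, not part of WARCFileReader.
class ReadProgress {
    /** Returns bytesRead / fileSize, clamped to the range 0.0 to 1.0. */
    static float progress(long bytesRead, long fileSize) {
        if (fileSize <= 0) {
            return 0.0f; // assumption: report no progress for an empty or unknown size
        }
        float p = (float) bytesRead / (float) fileSize;
        return Math.min(1.0f, Math.max(0.0f, p));
    }

    public static void main(String[] args) {
        System.out.println(progress(50, 200));
    }
}
```

If the byte counter tracks compressed bytes, dividing by the compressed file size keeps the two quantities in the same units.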
-