Class WARCFileReader

java.lang.Object
org.apache.hadoop.hbase.test.util.warc.WARCFileReader

public class WARCFileReader extends Object
Reads WARCRecords from a WARC file, using Hadoop's filesystem APIs. (This means you can read from HDFS, S3 or any other filesystem supported by Hadoop). This implementation is not tied to the MapReduce APIs -- that link is provided by the mapred com.martinkl.warc.mapred.WARCInputFormat and the mapreduce com.martinkl.warc.mapreduce.WARCInputFormat.
  • Constructor Details

    • WARCFileReader

      public WARCFileReader(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path filePath) throws IOException
      Opens a file for reading. If the filename ends in `.gz`, it is automatically decompressed on the fly.
      Parameters:
      conf - The Hadoop configuration.
      filePath - The Hadoop path to the file that should be read.
      Throws:
      IOException
  • Method Details

    • read

      public WARCRecord read() throws IOException
      Reads the next record from the file.
      Returns:
      The record that was read.
      Throws:
      IOException
    • close

      public void close() throws IOException
      Closes the file. No more reading is possible after the file has been closed.
      Throws:
      IOException
    • getRecordsRead

      public long getRecordsRead()
      Returns the number of records that have been read since the file was opened.
    • getBytesRead

      public long getBytesRead()
Returns the number of bytes that have been read from the file since it was opened. If the file is compressed, this counts compressed bytes, not decompressed bytes.
    • getProgress

      public float getProgress()
      Returns the proportion of the file that has been read, as a number between 0.0 and 1.0.
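Taken together, the methods above support a simple open/read/close loop. The sketch below is illustrative only: the input path `/data/crawl.warc.gz` is hypothetical, the `WARCRecord` accessors (`getHeader().getRecordType()`) follow the companion WARCRecord class from the same package, and end-of-file is assumed to surface as an `EOFException` from `read()` (the convention the associated WARCInputFormat readers rely on); verify against the actual implementation before use.

```java
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileReader;
import org.apache.hadoop.hbase.test.util.warc.WARCRecord;

public class WARCDump {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Hypothetical path; a ".gz" suffix triggers on-the-fly decompression.
    WARCFileReader reader = new WARCFileReader(conf, new Path("/data/crawl.warc.gz"));
    try {
      while (true) {
        WARCRecord record = reader.read(); // next record from the file
        System.out.println(record.getHeader().getRecordType());
      }
    } catch (EOFException eof) {
      // Assumed end-of-file signal: reading past the last record raises EOFException.
    } finally {
      System.out.printf("read %d records, %d bytes (%.0f%% of file)%n",
          reader.getRecordsRead(), reader.getBytesRead(), reader.getProgress() * 100);
      reader.close(); // no more reading is possible after this
    }
  }
}
```

Note that getBytesRead() and getProgress() are measured against the compressed file, so progress advances in compressed-byte terms even while decompressed records are being returned.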