Class WARCFileReader

java.lang.Object
org.apache.hadoop.hbase.test.util.warc.WARCFileReader

public class WARCFileReader extends Object
Reads WARCRecords from a WARC file, using Hadoop's filesystem APIs. (This means you can read from HDFS, S3 or any other filesystem supported by Hadoop). This implementation is not tied to the MapReduce APIs -- that link is provided by the mapred com.martinkl.warc.mapred.WARCInputFormat and the mapreduce com.martinkl.warc.mapreduce.WARCInputFormat.
  • Constructor Details

    • WARCFileReader

      public WARCFileReader(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path filePath) throws IOException
      Opens a file for reading. If the filename ends in `.gz`, it is automatically decompressed on the fly.
      Parameters:
      conf - The Hadoop configuration.
      filePath - The Hadoop path to the file that should be read.
      Throws:
      IOException
  • Method Details

    • read

      public WARCRecord read() throws IOException
      Reads the next record from the file.
      Returns:
      The record that was read.
      Throws:
      IOException
    • close

      public void close() throws IOException
      Closes the file. No more reading is possible after the file has been closed.
      Throws:
      IOException
    • getRecordsRead

      public long getRecordsRead()
      Returns the number of records that have been read since the file was opened.
    • getBytesRead

      public long getBytesRead()
Returns the number of bytes that have been read from the file since it was opened. If the file is compressed, this counts compressed bytes, not decompressed bytes.
    • getProgress

      public float getProgress()
      Returns the proportion of the file that has been read, as a number between 0.0 and 1.0.
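Taken together, the methods above support a simple open/read/close loop. The sketch below is illustrative only: the input path `/data/crawl.warc.gz` is hypothetical, the `WARCRecord` accessors (`getHeader().getRecordType()`) follow the companion WARCRecord class from the same package, and end-of-file is assumed to surface as an `EOFException` from `read()` (the convention the associated WARCInputFormat readers rely on); verify against the actual implementation before use.

```java
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileReader;
import org.apache.hadoop.hbase.test.util.warc.WARCRecord;

public class WARCDump {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Hypothetical path; a ".gz" suffix triggers on-the-fly decompression.
    WARCFileReader reader = new WARCFileReader(conf, new Path("/data/crawl.warc.gz"));
    try {
      while (true) {
        WARCRecord record = reader.read(); // next record from the file
        System.out.println(record.getHeader().getRecordType());
      }
    } catch (EOFException eof) {
      // Assumed end-of-file signal: reading past the last record raises EOFException.
    } finally {
      System.out.printf("read %d records, %d bytes (%.0f%% of file)%n",
          reader.getRecordsRead(), reader.getBytesRead(), reader.getProgress() * 100);
      reader.close(); // no more reading is possible after this
    }
  }
}
```

Note that getBytesRead() and getProgress() are measured against the compressed file, so progress advances in compressed-byte terms even while decompressed records are being returned.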