Class WARCInputFormat

java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
org.apache.hadoop.hbase.test.util.warc.WARCInputFormat

public class WARCInputFormat extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
Hadoop InputFormat for MapReduce jobs ('new' API) that process data in WARC files. Usage:

```java
// Job.getInstance replaces the deprecated Job(Configuration) constructor.
Job job = Job.getInstance(getConf());
job.setInputFormatClass(WARCInputFormat.class);
```

Mappers should use a key of type LongWritable (1 for the first record in a file, 2 for the second record, and so on) and a value of type WARCWritable.
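The key contract described above can be illustrated without Hadoop on the classpath. The following is a minimal sketch, not the HBase implementation: `RecordNumbering` and its `number` method are hypothetical stand-ins, and plain strings stand in for WARCWritable values.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of Hadoop or HBase.
final class RecordNumbering {

    // Pairs each record of ONE file with its 1-based position,
    // mirroring the documented key contract (1 for the first record, 2 for the second).
    static Map<Long, String> number(List<String> recordsInFile) {
        Map<Long, String> keyed = new LinkedHashMap<>();
        long key = 0;
        for (String record : recordsInFile) {
            keyed.put(++key, record);
        }
        return keyed;
    }

    public static void main(String[] args) {
        // Record labels are placeholders for real WARC records.
        Map<Long, String> keyed = number(Arrays.asList("warcinfo", "request", "response"));
        keyed.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Because the numbering is per file, two input files each yield a record with key 1. A mapper that needs a job-wide identifier would have to combine the key with something else, such as the input file name.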
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    private static class 
     

    Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    org.apache.hadoop.mapreduce.lib.input.FileInputFormat.Counter
  • Field Summary

    Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE
  • Constructor Summary

    Constructors
    Constructor
    Description
     
  • Method Summary

    Modifier and Type
    Method
    Description
    org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,WARCWritable>
    createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
    Opens a WARC file (possibly compressed) for reading, and returns a RecordReader for accessing it.
    protected boolean
    isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path filename)
    Always returns false, as WARC files cannot be split.

    Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, getSplits, listStatus, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

  • Method Details

    • createRecordReader

      public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,WARCWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException, InterruptedException
      Opens a WARC file (possibly compressed) for reading, and returns a RecordReader for accessing it.
      Specified by:
      createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
      Throws:
      IOException
      InterruptedException
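The "possibly compressed" wording suggests the reader decides per file whether to decompress. The sketch below shows one way such a decision could look using only the JDK. It assumes gzip compression signalled by a `.gz` suffix, which this page does not confirm, and `WarcStreamOpener` with its methods is a hypothetical name rather than the class's actual code path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of Hadoop or HBase.
final class WarcStreamOpener {

    // Assumption: a ".gz" suffix means gzip; anything else is read as-is.
    static InputStream open(String fileName, InputStream raw) {
        if (!fileName.endsWith(".gz")) {
            return raw;
        }
        try {
            return new GZIPInputStream(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains a stream into a UTF-8 string.
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Produces gzip bytes so the example needs no files on disk.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = open("crawl.warc.gz", new ByteArrayInputStream(gzip("WARC/1.0")));
        System.out.println(readAll(in));
    }
}
```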
    • isSplitable

      protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path filename)
      Always returns false, as WARC files cannot be split.
      Overrides:
      isSplitable in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
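Returning false here has a practical consequence: each WARC file becomes a single input split, so one map task reads the whole file regardless of its size. The sketch below is a simplified model of how a file-based input format counts splits. `SplitCountSketch` is a hypothetical name, and the 1.1 slop tolerance is included as an assumption about FileInputFormat's behavior, not something this page states.

```java
// Simplified model for illustration; not the actual FileInputFormat code.
final class SplitCountSketch {

    // Assumption: the last split may exceed the split size by up to 10 percent.
    private static final double SPLIT_SLOP = 1.1;

    static int splitCount(long fileLength, long splitSize, boolean splitable) {
        if (!splitable || fileLength == 0) {
            return 1; // the whole file is handed to one map task
        }
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the final, possibly slightly oversized, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long splitSize = 128L * 1024 * 1024;
        System.out.println("splittable:   " + splitCount(oneGiB, splitSize, true));
        System.out.println("WARC (false): " + splitCount(oneGiB, splitSize, false));
    }
}
```

Under this model, parallelism for a WARC job comes from the number of input files, not from their size, so many moderately sized files spread work more evenly than a few very large ones.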