org.apache.hadoop.mapreduce.InputFormat<K,V>

org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>

org.apache.hadoop.hbase.test.util.warc.WARCInputFormat

public class WARCInputFormat extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>

Hadoop InputFormat for MapReduce jobs (the "new" org.apache.hadoop.mapreduce API) that process data in WARC files. Usage:

```java
// Job.getInstance replaces the deprecated new Job(Configuration) constructor.
Job job = Job.getInstance(getConf());
job.setInputFormatClass(WARCInputFormat.class);
```

Mappers should use LongWritable as the key type and WARCWritable as the value type. The key is the record's position within its file: 1 for the first record, 2 for the second, and so on.
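The key numbering described above can be modelled without any Hadoop dependency. The following is only a sketch of the contract as stated (keys start at 1 and increase by one per record within a file); `RecordKeySketch`, `Pair`, and `numberRecords` are hypothetical names for illustration and are not part of WARCInputFormat, and plain strings stand in for WARCWritable values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in that models the (key, value) pairs a mapper would see.
class RecordKeySketch {

    // One (key, value) pair; a String stands in for a WARCWritable value.
    static final class Pair {
        final long key;
        final String value;

        Pair(long key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Numbers the records of a single file, starting at 1 for the first record.
    static List<Pair> numberRecords(List<String> recordsInOneFile) {
        List<Pair> pairs = new ArrayList<>();
        long key = 1;
        for (String record : recordsInOneFile) {
            pairs.add(new Pair(key, record));
            key++;
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> file = Arrays.asList("warcinfo", "request", "response");
        for (Pair p : numberRecords(file)) {
            System.out.println(p.key + "\t" + p.value);
        }
    }
}
```

Because the key is a position rather than a byte offset, it identifies a record only in combination with the file it came from.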

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| `private static class` | `WARCInputFormat.WARCReader` | |

Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat

org.apache.hadoop.mapreduce.lib.input.FileInputFormat.Counter

Field Summary

Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat

DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| `WARCInputFormat()` | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,WARCWritable>` | `createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)` | Opens a WARC file (possibly compressed) for reading, and returns a RecordReader for accessing it. |
| `protected boolean` | `isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path filename)` | Always returns false, as WARC files cannot be split. |

Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, getSplits, listStatus, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize, shrinkStatus

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- WARCInputFormat
  
  public WARCInputFormat()

Method Details
- createRecordReader
  
  public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,WARCWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException, InterruptedException
  
  Opens a WARC file (possibly compressed) for reading, and returns a RecordReader for accessing it.
  
  Specified by:
  
  createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
  
  Throws:
  
  IOException
  
  InterruptedException
- isSplitable
  
  protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path filename)
  
  Always returns false, as WARC files cannot be split.
  
  Overrides:
  
  isSplitable in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,WARCWritable>
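The documentation above does not say why WARC files cannot be split. One plausible contributing factor, assuming the input is gzip-compressed (the page only says "possibly compressed"), is that a gzip stream cannot be decoded starting from an arbitrary byte offset. The sketch below illustrates that property using only the JDK; `GzipSplitSketch` and its methods are hypothetical names and do not reflect how WARCInputFormat is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration: decoding a gzip stream only works from its start.
class GzipSplitSketch {

    // Compresses the given text into a single gzip stream held in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Returns true if the bytes from 'offset' onward decode as a gzip stream.
    static boolean canDecodeFrom(byte[] data, int offset) {
        byte[] tail = Arrays.copyOfRange(data, offset, data.length);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(tail))) {
            byte[] scratch = new byte[4096];
            while (in.read(scratch) != -1) {
                // Drain the stream; any corruption surfaces as an IOException.
            }
            return true;
        } catch (IOException e) {
            // Missing gzip header, broken deflate data, or truncated stream.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = gzip("WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nexample payload\r\n");
        System.out.println("from start:  " + canDecodeFrom(data, 0));
        System.out.println("from middle: " + canDecodeFrom(data, data.length / 2));
    }
}
```

A split that begins in the middle of such a file would hand a reader bytes it cannot interpret, which is consistent with treating each file as a single unit of work for one mapper.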
