public class WARCFileWriter extends Object

Writes WARCRecords to a WARC file, using Hadoop's filesystem APIs. (This means you can write to HDFS, S3, or any other filesystem supported by Hadoop.) This implementation is not tied to the MapReduce APIs; that link is provided by the mapred com.martinkl.warc.mapred.WARCOutputFormat and the mapreduce com.martinkl.warc.mapreduce.WARCOutputFormat.

WARCFileWriter keeps track of how much data it has written (optionally gzip-compressed); when the file becomes larger than some threshold, it is automatically closed and a new segment is started. A segment number is appended to the filename for that purpose. The segment number always starts at 00000, and by default a new segment is started when the file size exceeds 1 GB. To change the target size for a segment, set the `warc.output.segment.size` key in the Hadoop configuration to the desired number of bytes. (Files may actually end up slightly larger than this threshold, since the current record is finished before a new file is opened.)

Nested Class Summary

Modifier and Type | Class and Description
---|---
private class | WARCFileWriter.CountingOutputStream
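
Putting the behavior described above together, here is a minimal usage sketch. The output path is an arbitrary example, and `readRecord()` is a hypothetical record source, not part of this API:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import com.martinkl.warc.WARCFileWriter;
import com.martinkl.warc.WARCRecord;

public class WarcWriterExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Roll over to a new segment at ~512 MB instead of the default 1 GB.
        conf.setLong("warc.output.segment.size", 512L * 1024 * 1024);

        WARCFileWriter writer = new WARCFileWriter(
                conf,
                WARCFileWriter.getGzipCodec(conf),       // or null for uncompressed output
                new Path("hdfs:///crawl/warc/part-00000")); // prefix; segment number and extensions are appended
        try {
            writer.write(readRecord()); // writes in WARC/1.0 format
        } finally {
            writer.close(); // flushes buffered data and closes the current segment
        }
    }

    // Hypothetical helper: obtain a WARCRecord from somewhere, e.g. by parsing
    // an existing WARC file. Not part of the WARCFileWriter API.
    private static WARCRecord readRecord() throws IOException {
        throw new UnsupportedOperationException("supply your own record source");
    }
}
```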
Field Summary

Modifier and Type | Field and Description
---|---
private WARCFileWriter.CountingOutputStream | byteStream
private long | bytesWritten
private org.apache.hadoop.io.compress.CompressionCodec | codec
private org.apache.hadoop.conf.Configuration | conf
private DataOutputStream | dataStream
static long | DEFAULT_MAX_SEGMENT_SIZE
private String | extensionFormat
private static org.slf4j.Logger | logger
private long | maxSegmentSize
private org.apache.hadoop.util.Progressable | progress
private long | segmentsAttempted
private long | segmentsCreated
private org.apache.hadoop.fs.Path | workOutputPath
Constructor Summary

Constructor and Description
---
WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath)
Creates a WARC file, and opens it for writing.

WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath, org.apache.hadoop.util.Progressable progress)
Creates a WARC file, and opens it for writing.
Method Summary

Modifier and Type | Method and Description
---|---
void | close(): Flushes any buffered data and closes the file.
private void | createSegment(): Creates an output segment file and sets up the output streams to point at it.
static org.apache.hadoop.io.compress.CompressionCodec | getGzipCodec(org.apache.hadoop.conf.Configuration conf): Instantiates a Hadoop codec for compressing and decompressing gzip files.
void | write(WARCRecord record): Appends a WARCRecord to the file, in WARC/1.0 format.
void | write(WARCWritable record): Appends a WARCRecord wrapped in a WARCWritable to the file.
Field Detail

private static final org.slf4j.Logger logger
public static final long DEFAULT_MAX_SEGMENT_SIZE
private final org.apache.hadoop.conf.Configuration conf
private final org.apache.hadoop.io.compress.CompressionCodec codec
private final org.apache.hadoop.fs.Path workOutputPath
private final org.apache.hadoop.util.Progressable progress
private final String extensionFormat
private final long maxSegmentSize
private long segmentsCreated
private long segmentsAttempted
private long bytesWritten
private WARCFileWriter.CountingOutputStream byteStream
private DataOutputStream dataStream
Constructor Detail

public WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath) throws IOException

Creates a WARC file, and opens it for writing.

Parameters:
conf - The Hadoop configuration.
codec - If null, the file is uncompressed. If non-null, this compression codec is used, and the codec's default file extension is appended to the filename.
workOutputPath - The directory and filename prefix to which the data should be written. A segment number and filename extensions are appended to it.
Throws:
IOException
public WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath, org.apache.hadoop.util.Progressable progress) throws IOException

Creates a WARC file, and opens it for writing.

Parameters:
conf - The Hadoop configuration.
codec - If null, the file is uncompressed. If non-null, this compression codec is used, and the codec's default file extension is appended to the filename.
workOutputPath - The directory and filename prefix to which the data should be written. A segment number and filename extensions are appended to it.
progress - An object used by the mapred API for tracking a task's progress.
Throws:
IOException
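
A short sketch of this four-argument variant. The helper name is illustrative only; the Progressable would typically come from the mapred framework (for example, a task's Reporter):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Progressable;
import com.martinkl.warc.WARCFileWriter;

public class UncompressedWarcWriterFactory {
    // Illustrative helper: a null codec yields uncompressed output, so no
    // codec file extension is appended to the filename.
    static WARCFileWriter openUncompressed(Configuration conf, Path prefix,
                                           Progressable progress) throws IOException {
        return new WARCFileWriter(conf, null, prefix, progress);
    }
}
```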
Method Detail

public static org.apache.hadoop.io.compress.CompressionCodec getGzipCodec(org.apache.hadoop.conf.Configuration conf)

Instantiates a Hadoop codec for compressing and decompressing gzip files.

Parameters:
conf - The Hadoop configuration.

private void createSegment() throws IOException

Creates an output segment file and sets up the output streams to point at it. FileOutputFormat's work directory concept is supposed to prevent filename clashes, but it looks like Amazon Elastic MapReduce prevents the use of per-task work directories if the output of a job is on S3. TODO: Investigate this and find a better solution.

Throws:
IOException
public void write(WARCRecord record) throws IOException

Appends a WARCRecord to the file, in WARC/1.0 format.

Parameters:
record - The record to be written.
Throws:
IOException
public void write(WARCWritable record) throws IOException

Appends a WARCRecord wrapped in a WARCWritable to the file.

Parameters:
record - The wrapper around the record to be written.
Throws:
IOException
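
WARCWritable is the value type used by the mapred and mapreduce WARCOutputFormat classes mentioned in the class description. A minimal job-setup sketch for the mapreduce variant follows; the NullWritable key type is an assumption, so verify it against the WARCOutputFormat declaration in your version of the library:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.martinkl.warc.WARCWritable;
import com.martinkl.warc.mapreduce.WARCOutputFormat;

public class WarcJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "warc-output-example");
        job.setJarByClass(WarcJobSetup.class);
        // Assumed key/value types: NullWritable keys, WARCWritable values.
        job.setOutputFormatClass(WARCOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(WARCWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        // Mapper/reducer setup omitted; per the class description, the
        // OutputFormat writes through this class, so segment rollover
        // applies within each task's output.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```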
public void close() throws IOException

Flushes any buffered data and closes the file.

Throws:
IOException