public class WARCFileWriter extends Object

Writes WARCRecords to a WARC file, using Hadoop's filesystem APIs. (This means you can write to HDFS, S3, or any other filesystem supported by Hadoop.) This implementation is not tied to the MapReduce APIs; that link is provided by the mapred com.martinkl.warc.mapred.WARCOutputFormat and the mapreduce com.martinkl.warc.mapreduce.WARCOutputFormat.

WARCFileWriter keeps track of how much data it has written (optionally gzip-compressed); when the file becomes larger than some threshold, it is automatically closed and a new segment is started. A segment number is appended to the filename for that purpose. The segment number always starts at 00000, and by default a new segment is started when the file size exceeds 1 GB. To change the target size for a segment, set the `warc.output.segment.size` key in the Hadoop configuration to the desired number of bytes. (Files may actually end up slightly larger than this threshold, since the current record is finished before a new file is opened.)

Nested Class Summary

Modifier and Type | Class and Description
---|---
private class | WARCFileWriter.CountingOutputStream
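
Putting the behavior described above together, here is a minimal usage sketch. The output path is an arbitrary example, and `readRecord()` is a hypothetical record source, not part of this API:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import com.martinkl.warc.WARCFileWriter;
import com.martinkl.warc.WARCRecord;

public class WarcWriterExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Roll over to a new segment at ~512 MB instead of the default 1 GB.
        conf.setLong("warc.output.segment.size", 512L * 1024 * 1024);

        WARCFileWriter writer = new WARCFileWriter(
                conf,
                WARCFileWriter.getGzipCodec(conf),       // or null for uncompressed output
                new Path("hdfs:///crawl/warc/part-00000")); // prefix; segment number and extensions are appended
        try {
            writer.write(readRecord()); // writes in WARC/1.0 format
        } finally {
            writer.close(); // flushes buffered data and closes the current segment
        }
    }

    // Hypothetical helper: obtain a WARCRecord from somewhere, e.g. by parsing
    // an existing WARC file. Not part of the WARCFileWriter API.
    private static WARCRecord readRecord() throws IOException {
        throw new UnsupportedOperationException("supply your own record source");
    }
}
```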
Field Summary

Modifier and Type | Field and Description
---|---
private WARCFileWriter.CountingOutputStream | byteStream
private long | bytesWritten
private org.apache.hadoop.io.compress.CompressionCodec | codec
private org.apache.hadoop.conf.Configuration | conf
private DataOutputStream | dataStream
static long | DEFAULT_MAX_SEGMENT_SIZE
private String | extensionFormat
private static org.slf4j.Logger | logger
private long | maxSegmentSize
private org.apache.hadoop.util.Progressable | progress
private long | segmentsAttempted
private long | segmentsCreated
private org.apache.hadoop.fs.Path | workOutputPath
Constructor Summary

Constructor and Description
---
WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath)
Creates a WARC file, and opens it for writing.

WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath, org.apache.hadoop.util.Progressable progress)
Creates a WARC file, and opens it for writing.
Method Summary

Modifier and Type | Method and Description
---|---
void | close(): Flushes any buffered data and closes the file.
private void | createSegment(): Creates an output segment file and sets up the output streams to point at it.
static org.apache.hadoop.io.compress.CompressionCodec | getGzipCodec(org.apache.hadoop.conf.Configuration conf): Instantiates a Hadoop codec for compressing and decompressing gzip files.
void | write(WARCRecord record): Appends a WARCRecord to the file, in WARC/1.0 format.
void | write(WARCWritable record): Appends a WARCRecord wrapped in a WARCWritable to the file.
Field Detail

private static final org.slf4j.Logger logger
public static final long DEFAULT_MAX_SEGMENT_SIZE
private final org.apache.hadoop.conf.Configuration conf
private final org.apache.hadoop.io.compress.CompressionCodec codec
private final org.apache.hadoop.fs.Path workOutputPath
private final org.apache.hadoop.util.Progressable progress
private final String extensionFormat
private final long maxSegmentSize
private long segmentsCreated
private long segmentsAttempted
private long bytesWritten
private WARCFileWriter.CountingOutputStream byteStream
private DataOutputStream dataStream
Constructor Detail

public WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath) throws IOException

Creates a WARC file, and opens it for writing.

Parameters:
conf - The Hadoop configuration.
codec - If null, the file is uncompressed. If non-null, this compression codec is used, and the codec's default file extension is appended to the filename.
workOutputPath - The directory and filename prefix to which the data should be written. A segment number and filename extensions are appended to it.
Throws:
IOException
public WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath, org.apache.hadoop.util.Progressable progress) throws IOException

Creates a WARC file, and opens it for writing.

Parameters:
conf - The Hadoop configuration.
codec - If null, the file is uncompressed. If non-null, this compression codec is used, and the codec's default file extension is appended to the filename.
workOutputPath - The directory and filename prefix to which the data should be written. A segment number and filename extensions are appended to it.
progress - An object used by the mapred API for tracking a task's progress.
Throws:
IOException
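
A short sketch of this four-argument variant. The helper name is illustrative only; the Progressable would typically come from the mapred framework (for example, a task's Reporter):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Progressable;
import com.martinkl.warc.WARCFileWriter;

public class UncompressedWarcWriterFactory {
    // Illustrative helper: a null codec yields uncompressed output, so no
    // codec file extension is appended to the filename.
    static WARCFileWriter openUncompressed(Configuration conf, Path prefix,
                                           Progressable progress) throws IOException {
        return new WARCFileWriter(conf, null, prefix, progress);
    }
}
```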
Method Detail

public static org.apache.hadoop.io.compress.CompressionCodec getGzipCodec(org.apache.hadoop.conf.Configuration conf)

Instantiates a Hadoop codec for compressing and decompressing gzip files.

Parameters:
conf - The Hadoop configuration.

private void createSegment() throws IOException

Creates an output segment file and sets up the output streams to point at it. FileOutputFormat's work directory concept is supposed to prevent filename clashes, but it looks like Amazon Elastic MapReduce prevents the use of per-task work directories if the output of a job is on S3. TODO: Investigate this and find a better solution.

Throws:
IOException
public void write(WARCRecord record) throws IOException

Appends a WARCRecord to the file, in WARC/1.0 format.

Parameters:
record - The record to be written.
Throws:
IOException
public void write(WARCWritable record) throws IOException

Appends a WARCRecord wrapped in a WARCWritable to the file.

Parameters:
record - The wrapper around the record to be written.
Throws:
IOException
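
WARCWritable is the value type used by the mapred and mapreduce WARCOutputFormat classes mentioned in the class description. A minimal job-setup sketch for the mapreduce variant follows; the NullWritable key type is an assumption, so verify it against the WARCOutputFormat declaration in your version of the library:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.martinkl.warc.WARCWritable;
import com.martinkl.warc.mapreduce.WARCOutputFormat;

public class WarcJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "warc-output-example");
        job.setJarByClass(WarcJobSetup.class);
        // Assumed key/value types: NullWritable keys, WARCWritable values.
        job.setOutputFormatClass(WARCOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(WARCWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        // Mapper/reducer setup omitted; per the class description, the
        // OutputFormat writes through this class, so segment rollover
        // applies within each task's output.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```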
public void close() throws IOException

Flushes any buffered data and closes the file.

Throws:
IOException