Class WARCFileWriter
java.lang.Object
org.apache.hadoop.hbase.test.util.warc.WARCFileWriter
Writes WARCRecords to a WARC file, using Hadoop's filesystem APIs. (This means you can write to HDFS, S3, or any other filesystem supported by Hadoop.) This implementation is not tied to the MapReduce APIs; that link is provided by the mapred com.martinkl.warc.mapred.WARCOutputFormat and the mapreduce com.martinkl.warc.mapreduce.WARCOutputFormat.

WARCFileWriter keeps track of how much data it has written (optionally gzip-compressed); when the file becomes larger than some threshold, it is automatically closed and a new segment is started. A segment number is appended to the filename for that purpose. The segment number always starts at 00000, and by default a new segment is started when the file size exceeds 1GB. To change the target size for a segment, set the `warc.output.segment.size` key in the Hadoop configuration to the desired number of bytes. (Files may actually be slightly larger than this threshold, since the current record is finished before a new file is opened.)
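As a quick orientation before the member-by-member reference, here is a minimal sketch of standalone usage, based only on the constructors and methods documented below; the output path and the 512 MB segment size are arbitrary example values:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;

public class WarcWriterSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Roll over to a new segment after roughly 512 MB instead of the 1GB default.
    conf.setLong("warc.output.segment.size", 512L * 1024 * 1024);

    // A segment number (starting at 00000) and file extensions are appended
    // to this prefix; the exact naming scheme is an implementation detail.
    Path outputPrefix = new Path("hdfs:///data/crawl");

    // Pass null instead of the codec to write uncompressed WARC files.
    WARCFileWriter writer =
      new WARCFileWriter(conf, WARCFileWriter.getGzipCodec(conf), outputPrefix);
    try {
      // writer.write(...) calls go here; see the write() methods below.
    } finally {
      writer.close();
    }
  }
}
```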
Nested Class Summary

Nested Classes
    private static class WARCFileWriter.CountingOutputStream
Field Summary

Fields
    private long bytesWritten
    private WARCFileWriter.CountingOutputStream byteStream
    private final org.apache.hadoop.io.compress.CompressionCodec codec
    private final org.apache.hadoop.conf.Configuration conf
    private DataOutputStream dataStream
    static final long DEFAULT_MAX_SEGMENT_SIZE
    private final String extensionFormat
    private static final org.slf4j.Logger logger
    private final long maxSegmentSize
    private final org.apache.hadoop.util.Progressable progress
    private long segmentsAttempted
    private long segmentsCreated
    private final org.apache.hadoop.fs.Path workOutputPath
Constructor Summary

Constructors
WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath)
    Creates a WARC file, and opens it for writing.
WARCFileWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.io.compress.CompressionCodec codec, org.apache.hadoop.fs.Path workOutputPath, org.apache.hadoop.util.Progressable progress)
    Creates a WARC file, and opens it for writing.
Method Summary

void close()
    Flushes any buffered data and closes the file.
private void createSegment()
    Creates an output segment file and sets up the output streams to point at it.
static org.apache.hadoop.io.compress.CompressionCodec getGzipCodec(org.apache.hadoop.conf.Configuration conf)
    Instantiates a Hadoop codec for compressing and decompressing Gzip files.
void write(WARCRecord record)
    Appends a WARCRecord to the file, in WARC/1.0 format.
void write(WARCWritable record)
    Appends a WARCRecord wrapped in a WARCWritable to the file.
Field Details

logger
    private static final org.slf4j.Logger logger

DEFAULT_MAX_SEGMENT_SIZE
    public static final long DEFAULT_MAX_SEGMENT_SIZE
    See Also:
        Constant Field Values

conf
    private final org.apache.hadoop.conf.Configuration conf

codec
    private final org.apache.hadoop.io.compress.CompressionCodec codec

workOutputPath
    private final org.apache.hadoop.fs.Path workOutputPath

progress
    private final org.apache.hadoop.util.Progressable progress

extensionFormat
    private final String extensionFormat

maxSegmentSize
    private final long maxSegmentSize

segmentsCreated
    private long segmentsCreated

segmentsAttempted
    private long segmentsAttempted

bytesWritten
    private long bytesWritten

byteStream
    private WARCFileWriter.CountingOutputStream byteStream

dataStream
    private DataOutputStream dataStream
Constructor Details

WARCFileWriter
    public WARCFileWriter(org.apache.hadoop.conf.Configuration conf,
                          org.apache.hadoop.io.compress.CompressionCodec codec,
                          org.apache.hadoop.fs.Path workOutputPath)
                   throws IOException

    Creates a WARC file, and opens it for writing. If a file with the same name already exists, an attempt number in the filename is incremented until we find a file that doesn't already exist.

    Parameters:
        conf - The Hadoop configuration.
        codec - If null, the file is uncompressed. If non-null, this compression codec will be used. The codec's default file extension is appended to the filename.
        workOutputPath - The directory and filename prefix to which the data should be written. We append a segment number and filename extensions to it.
    Throws:
        IOException

WARCFileWriter
    public WARCFileWriter(org.apache.hadoop.conf.Configuration conf,
                          org.apache.hadoop.io.compress.CompressionCodec codec,
                          org.apache.hadoop.fs.Path workOutputPath,
                          org.apache.hadoop.util.Progressable progress)
                   throws IOException

    Creates a WARC file, and opens it for writing. If a file with the same name already exists, it is *overwritten*. Note that this is different behaviour from the other constructor. Yes, this sucks. It will probably change in a future version.

    Parameters:
        conf - The Hadoop configuration.
        codec - If null, the file is uncompressed. If non-null, this compression codec will be used. The codec's default file extension is appended to the filename.
        workOutputPath - The directory and filename prefix to which the data should be written. We append a segment number and filename extensions to it.
        progress - An object used by the mapred API for tracking a task's progress.
    Throws:
        IOException
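A hedged sketch of how the Progressable-taking constructor might be wired up from the old mapred API, where Reporter implements Progressable; the helper class and method names here are illustrative, not part of this class:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.util.Progressable;

// Illustrative helper, not part of this class's API. In the mapred API,
// Reporter implements Progressable, so a RecordWriter can forward it here.
class WarcWriterFactory {
  static WARCFileWriter open(Configuration conf, Path workOutputPath,
      Progressable progress) throws IOException {
    // The writer can report progress during long writes, which keeps the
    // framework from timing out the task; note this constructor overwrites
    // an existing file rather than picking a fresh attempt number.
    return new WARCFileWriter(conf, WARCFileWriter.getGzipCodec(conf),
      workOutputPath, progress);
  }
}
```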
Method Details

getGzipCodec
    public static org.apache.hadoop.io.compress.CompressionCodec getGzipCodec(org.apache.hadoop.conf.Configuration conf)

    Instantiates a Hadoop codec for compressing and decompressing Gzip files. This is the most common compression applied to WARC files.

    Parameters:
        conf - The Hadoop configuration.
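For instance, a small illustrative helper (the class and method names are hypothetical) that picks between gzip and no compression when constructing a writer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.io.compress.CompressionCodec;

// Hypothetical helper: choose gzip or no compression for the writer.
class CodecChoice {
  static CompressionCodec gzipOrNull(Configuration conf, boolean compress) {
    // A null return means the writer produces uncompressed files; the gzip
    // codec's default ".gz" extension is appended to segment filenames.
    return compress ? WARCFileWriter.getGzipCodec(conf) : null;
  }
}
```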
createSegment
    private void createSegment() throws IOException

    Creates an output segment file and sets up the output streams to point at it. If the file already exists, it retries with a different filename. This is a bit nasty -- after all, FileOutputFormat's work directory concept is supposed to prevent filename clashes -- but it looks like Amazon Elastic MapReduce prevents use of per-task work directories if the output of a job is on S3. TODO: Investigate this and find a better solution.

    Throws:
        IOException
write
    public void write(WARCRecord record) throws IOException

    Appends a WARCRecord to the file, in WARC/1.0 format.

    Parameters:
        record - The record to be written.
    Throws:
        IOException
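As a usage sketch, the loop below copies records from an existing uncompressed WARC file into a new gzip-compressed one. It assumes WARCRecord can be deserialized from a DataInput, as in the upstream com.martinkl.warc library this class mirrors; the input and output paths are placeholders:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.hbase.test.util.warc.WARCRecord;

public class WarcRecompress {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    WARCFileWriter writer = new WARCFileWriter(conf,
      WARCFileWriter.getGzipCodec(conf), new Path("/data/crawl-recompressed"));
    // Assumption: WARCRecord(DataInput) reads one record, as in the upstream
    // warc library. EOFException signals that the input is exhausted.
    try (DataInputStream in =
        new DataInputStream(fs.open(new Path("/data/crawl.warc")))) {
      while (true) {
        writer.write(new WARCRecord(in)); // re-serialized in WARC/1.0 format
      }
    } catch (EOFException endOfInput) {
      // Normal termination: no more records in the input file.
    } finally {
      writer.close();
    }
  }
}
```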
write
    public void write(WARCWritable record) throws IOException

    Appends a WARCRecord wrapped in a WARCWritable to the file.

    Parameters:
        record - The wrapper around the record to be written.
    Throws:
        IOException
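For completeness, a small sketch of this overload; it assumes WARCWritable exposes a constructor wrapping a WARCRecord, as in the upstream library:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.test.util.warc.WARCFileWriter;
import org.apache.hadoop.hbase.test.util.warc.WARCRecord;
import org.apache.hadoop.hbase.test.util.warc.WARCWritable;

class WritableWriteSketch {
  // Assumption: WARCWritable(WARCRecord) exists, mirroring the upstream
  // com.martinkl.warc API; "record" is a WARCRecord obtained elsewhere.
  static void writeWrapped(WARCFileWriter writer, WARCRecord record)
      throws IOException {
    writer.write(new WARCWritable(record)); // unwraps and appends the record
  }
}
```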
close
    public void close() throws IOException

    Flushes any buffered data and closes the file.

    Throws:
        IOException