Class WARCRecord
java.lang.Object
org.apache.hadoop.hbase.test.util.warc.WARCRecord
Immutable implementation of a record in a WARC file. You create a WARCRecord by parsing it out of a DataInput stream.
The file format is documented in the ISO Standard. In a nutshell, it's a textual format consisting of lines delimited by `\r\n`. Each record has the following structure:
- A line indicating the WARC version number, such as `WARC/1.0`.
- Several header lines (in key-value format, similar to HTTP or email headers), giving information about the record. The header is terminated by an empty line.
- A body consisting of raw bytes (the number of bytes is indicated in one of the headers).
- A final separator of `\r\n\r\n` before the next record starts.
There are various types of records, as documented on WARCRecord.Header.getRecordType().
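The record structure described above can be sketched as a small stand-alone parser. This is an illustration only, not the WARCRecord implementation: the class `WarcLayoutSketch`, its `ParsedRecord` holder, and both methods are hypothetical names. It assumes the body length is given by the `Content-Length` header (the name used by the WARC standard; this page does not name it), and it ignores header continuation lines and case-insensitive header matching.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the record layout; not the WARCRecord implementation.
final class WarcLayoutSketch {

  /** Holder for the three parts of one record (hypothetical type). */
  static final class ParsedRecord {
    final String version;
    final Map<String, String> headers;
    final byte[] body;

    ParsedRecord(String version, Map<String, String> headers, byte[] body) {
      this.version = version;
      this.headers = headers;
      this.body = body;
    }
  }

  /** Reads bytes up to the next CRLF and returns them without the terminator. */
  static String readLine(DataInput in) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    int prev = -1;
    while (true) {
      int b = in.readUnsignedByte(); // throws EOFException if the stream ends early
      if (prev == '\r' && b == '\n') {
        break; // the pending '\r' is part of the terminator, so it is not stored
      }
      if (prev != -1) {
        buf.write(prev);
      }
      prev = b;
    }
    return new String(buf.toByteArray(), StandardCharsets.UTF_8);
  }

  /** Parses exactly one record: version line, headers, body, separator. */
  static ParsedRecord parse(DataInput in) throws IOException {
    // 1. Version line, such as WARC/1.0
    String version = readLine(in);
    if (!version.startsWith("WARC/")) {
      throw new IOException("Not a WARC version line: " + version);
    }

    // 2. Header lines in key-value form, terminated by an empty line
    Map<String, String> headers = new LinkedHashMap<>();
    String line;
    while (!(line = readLine(in)).isEmpty()) {
      int colon = line.indexOf(':');
      if (colon < 0) {
        throw new IOException("Malformed header line: " + line);
      }
      headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
    }

    // 3. Body of raw bytes; the length is assumed to come from Content-Length
    String length = headers.get("Content-Length");
    if (length == null) {
      throw new IOException("Missing Content-Length header");
    }
    byte[] body = new byte[Integer.parseInt(length)];
    in.readFully(body);

    // 4. Final separator before the next record
    byte[] separator = new byte[4];
    in.readFully(separator);
    if (!Arrays.equals(separator, "\r\n\r\n".getBytes(StandardCharsets.US_ASCII))) {
      throw new IOException("Missing record separator");
    }
    return new ParsedRecord(version, headers, body);
  }
}
```

Because the body is read by length rather than by scanning for a delimiter, it may itself contain `\r\n` sequences without confusing the parser.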
Nested Class Summary
Nested Classes
Field Summary
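Fields

| Modifier and Type | Field |
| --- | --- |
| private final byte[] | |
| private static final Pattern | CONTINUATION_PATTERN |
| private static final String | CRLF |
| private static final byte[] | CRLF_BYTES |
| private final WARCRecord.Header | header |
| private static final Pattern | VERSION_PATTERN |
| static final String | WARC_VERSION |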
Constructor Summary
Constructors

| Constructor | Description |
| --- | --- |
| WARCRecord(DataInput in) | Creates a new WARCRecord by parsing it out of a DataInput stream. |
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| byte[] | getContent | Returns the body of the record, as an unparsed raw array of bytes. |
| WARCRecord.Header | getHeader | Returns the parsed header structure of the WARC record. |
| private static WARCRecord.Header | readHeader(DataInput in) | |
| private static String | readLine | |
| private static void | readSeparator | |
| String | toString() | Returns a human-readable string representation of the record. |
| void | write(DataOutput out) | Writes this record to a DataOutput stream. |
-
Field Details
-
WARC_VERSION
-
VERSION_PATTERN
-
CONTINUATION_PATTERN
-
CRLF
-
CRLF_BYTES
-
header
-
-
-
Constructor Details
-
WARCRecord
Creates a new WARCRecord by parsing it out of a DataInput stream.
- Parameters:
in - The input source from which one record will be read.
- Throws:
IOException
-
-
Method Details
-
readHeader
- Throws:
IOException
-
readLine
- Throws:
IOException
-
readSeparator
- Throws:
IOException
-
getHeader
Returns the parsed header structure of the WARC record.
getContent
Returns the body of the record, as an unparsed raw array of bytes. The content of the body depends on the type of record (see WARCRecord.Header.getRecordType()). For example, for a record of type `response`, the body consists of the full HTTP response returned by the server (HTTP headers followed by the body).
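For a `response` record, the returned bytes can be split into HTTP headers and HTTP body at the first blank line. The following is a minimal sketch under that assumption; the class `ResponseBodySketch` and its methods are hypothetical helpers, not part of WARCRecord, and the HTTP header block is decoded as ISO-8859-1 purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for splitting the body of a `response` record.
final class ResponseBodySketch {

  /** Returns the index of the first CRLF CRLF sequence, or -1 if there is none. */
  static int indexOfBlankLine(byte[] content) {
    for (int i = 0; i + 3 < content.length; i++) {
      if (content[i] == '\r' && content[i + 1] == '\n'
          && content[i + 2] == '\r' && content[i + 3] == '\n') {
        return i;
      }
    }
    return -1;
  }

  /** Returns the status line and HTTP headers, without the terminating blank line. */
  static String httpHeaders(byte[] content) {
    int end = indexOfBlankLine(content);
    if (end < 0) {
      throw new IllegalArgumentException("No blank line after HTTP headers");
    }
    return new String(content, 0, end, StandardCharsets.ISO_8859_1);
  }

  /** Returns the raw bytes that follow the blank line. */
  static byte[] httpBody(byte[] content) {
    int end = indexOfBlankLine(content);
    if (end < 0) {
      throw new IllegalArgumentException("No blank line after HTTP headers");
    }
    return Arrays.copyOfRange(content, end + 4, content.length);
  }
}
```

With a parsed record in hand, a call such as `ResponseBodySketch.httpBody(record.getContent())` would yield the payload the server sent, still subject to any transfer or content encoding named in the HTTP headers.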
write
Writes this record to a DataOutput stream. The output may, in some edge cases, not be byte-for-byte identical to what was parsed from a DataInput. However, it has the same meaning and should not lose any information.
- Parameters:
out - The output stream to which this record should be appended.
- Throws:
IOException
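As a rough illustration of serializing a record in the layout from the class description, the sketch below writes a version line, the headers, an empty line, the body, and the final separator. The class `WarcWriteSketch` and its `write` signature are hypothetical and do not reflect how WARCRecord stores or emits its data. The sketch assumes the caller supplies headers whose `Content-Length` already matches the body.

```java
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch of record serialization; not the WARCRecord implementation.
final class WarcWriteSketch {

  private static final String CRLF = "\r\n";

  static void write(DataOutput out, Map<String, String> headers, byte[] body)
      throws IOException {
    // Version line
    out.write(("WARC/1.0" + CRLF).getBytes(StandardCharsets.UTF_8));

    // Header lines, always emitted as "Key: value"
    for (Map.Entry<String, String> header : headers.entrySet()) {
      String line = header.getKey() + ": " + header.getValue() + CRLF;
      out.write(line.getBytes(StandardCharsets.UTF_8));
    }

    // Empty line terminating the header
    out.write(CRLF.getBytes(StandardCharsets.UTF_8));

    // Raw body bytes, written unchanged
    out.write(body);

    // Separator before the next record
    out.write((CRLF + CRLF).getBytes(StandardCharsets.UTF_8));
  }
}
```

A writer of this shape normalizes header spacing, so an input line written as `Key:value` would come back out as `Key: value`. That is one plausible way a round trip can keep its meaning without being byte-for-byte identical; this page does not list the actual edge cases.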
-
toString
Returns a human-readable string representation of the record.
-