public class WARCRecord extends Object
A `WARCRecord` is created by parsing it out of a `DataInput` stream. The file format is documented in the [ISO
Standard](http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf). In a nutshell, it's
a textual format consisting of lines delimited by `\r\n`. Each record has the following
structure:

1. A line indicating the WARC version number, such as `WARC/1.0`.
2. Several header lines (in key-value format, similar to HTTP or email headers), giving information about the record. The header is terminated by an empty line.
3. A body consisting of raw bytes (the number of bytes is indicated in one of the headers).
4. A final separator of `\r\n\r\n` before the next record starts.

There are various types of records, as documented on `WARCRecord.Header.getRecordType()`.
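The four-part record layout described above can be illustrated with a minimal, self-contained sketch that reads one record from a `DataInput` using only the JDK. This is not the source of `WARCRecord`: the class `WarcLayoutSketch` and its methods are hypothetical names chosen for illustration, the sketch assumes the body length is given by the `Content-Length` header (as in the WARC standard), and it ignores header continuation lines.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only; NOT the WARCRecord implementation.
class WarcLayoutSketch {
    String version;
    final Map<String, String> headers = new LinkedHashMap<>();
    byte[] body;

    // Reads bytes up to the next \r\n and returns them without the delimiter.
    static String readLine(DataInput in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int prev = -1;
        while (true) {
            int b = in.readByte() & 0xff; // EOFException if the input ends early
            if (prev == '\r' && b == '\n') {
                break;
            }
            if (prev != -1) {
                line.write(prev);
            }
            prev = b;
        }
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }

    static WarcLayoutSketch parse(DataInput in) throws IOException {
        WarcLayoutSketch r = new WarcLayoutSketch();
        r.version = readLine(in);                     // 1. version line
        String line;
        while (!(line = readLine(in)).isEmpty()) {    // 2. headers, ended by an empty line
            int colon = line.indexOf(':');
            if (colon < 0) {
                throw new IOException("Malformed header line: " + line);
            }
            r.headers.put(line.substring(0, colon).trim(),
                          line.substring(colon + 1).trim());
        }
        String length = r.headers.get("Content-Length"); // assumed length header
        if (length == null) {
            throw new IOException("Missing Content-Length header");
        }
        r.body = new byte[Integer.parseInt(length)];
        in.readFully(r.body);                         // 3. raw body bytes
        byte[] separator = new byte[4];
        in.readFully(separator);                      // 4. \r\n\r\n separator
        if (!"\r\n\r\n".equals(new String(separator, StandardCharsets.US_ASCII))) {
            throw new IOException("Missing record separator");
        }
        return r;
    }

    // Convenience wrapper for in-memory data; wraps IOException as unchecked.
    static WarcLayoutSketch parseBytes(byte[] raw) {
        try {
            return parse(new DataInputStream(new ByteArrayInputStream(raw)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "WARC/1.0\r\n"
                + "WARC-Type: warcinfo\r\n"
                + "Content-Length: 5\r\n"
                + "\r\n"
                + "hello"
                + "\r\n\r\n";
        WarcLayoutSketch r = parseBytes(raw.getBytes(StandardCharsets.UTF_8));
        System.out.println(r.version + " " + r.headers + " body=" + r.body.length + " bytes");
    }
}
```

Reading the body with `readFully` rather than line by line matters because the body is raw bytes and may itself contain `\r\n` sequences.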
Modifier and Type | Class and Description |
---|---|
`static class` | `WARCRecord.Header`: Contains the parsed headers of a `WARCRecord`. |
Modifier and Type | Field and Description |
---|---|
`private byte[]` | `content` |
`private static Pattern` | `CONTINUATION_PATTERN` |
`private static String` | `CRLF` |
`private static byte[]` | `CRLF_BYTES` |
`private WARCRecord.Header` | `header` |
`private static Pattern` | `VERSION_PATTERN` |
`static String` | `WARC_VERSION` |
Constructor and Description |
---|
`WARCRecord(DataInput in)`: Creates a new `WARCRecord` by parsing it out of a `DataInput` stream. |
Modifier and Type | Method and Description |
---|---|
`byte[]` | `getContent()`: Returns the body of the record, as an unparsed raw array of bytes. |
`WARCRecord.Header` | `getHeader()`: Returns the parsed header structure of the WARC record. |
`private static WARCRecord.Header` | `readHeader(DataInput in)` |
`private static String` | `readLine(DataInput in)` |
`private static void` | `readSeparator(DataInput in)` |
`String` | `toString()`: Returns a human-readable string representation of the record. |
`void` | `write(DataOutput out)`: Writes this record to a `DataOutput` stream. |
public static final String WARC_VERSION
private static final Pattern VERSION_PATTERN
private static final Pattern CONTINUATION_PATTERN
private static final String CRLF
private static final byte[] CRLF_BYTES
private final WARCRecord.Header header
private final byte[] content
public WARCRecord(DataInput in) throws IOException
Creates a new `WARCRecord` by parsing it out of a `DataInput` stream.
Parameters:
`in` - The input source from which one record will be read.
Throws:
`IOException`
private static WARCRecord.Header readHeader(DataInput in) throws IOException
Throws:
`IOException`
private static String readLine(DataInput in) throws IOException
Throws:
`IOException`
private static void readSeparator(DataInput in) throws IOException
Throws:
`IOException`
public WARCRecord.Header getHeader()
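public byte[] getContent()
Returns the body of the record, as an unparsed raw array of bytes. How the body should be interpreted depends on the record type (see `WARCRecord.Header.getRecordType()`). For example, in the case of
a record of type `response`, the body consists of the full HTTP response returned by the server
(HTTP headers followed by the body).
public void write(DataOutput out) throws IOException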
Writes this record to a `DataOutput` stream. In some edge cases the output may not be
byte-for-byte identical to what was parsed from a `DataInput`. However, it has the same
meaning and should not lose any information.
Parameters:
`out` - The output stream to which this record should be appended.
Throws:
`IOException`
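As noted under `getContent()`, the body of a `response` record is a full HTTP response (HTTP headers followed by the body). The following is a minimal sketch of how such a byte array could be split at the first empty line. The class `HttpResponseSplitSketch` and its methods are hypothetical helpers, not part of this API, and the sketch assumes the HTTP headers end with `\r\n\r\n` and does no further HTTP parsing (for example of chunked or compressed bodies).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the WARCRecord API.
class HttpResponseSplitSketch {

    // Returns the index just past the first \r\n\r\n, or -1 if there is none.
    static int bodyOffset(byte[] content) {
        for (int i = 0; i + 3 < content.length; i++) {
            if (content[i] == '\r' && content[i + 1] == '\n'
                    && content[i + 2] == '\r' && content[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    // Returns the status line and HTTP headers as text, without the blank line.
    static String httpHeaders(byte[] content) {
        int offset = bodyOffset(content);
        int end = offset < 0 ? content.length : offset - 4;
        return new String(content, 0, end, StandardCharsets.ISO_8859_1);
    }

    // Returns the HTTP body as raw bytes (empty if no blank line was found).
    static byte[] httpBody(byte[] content) {
        int offset = bodyOffset(content);
        return offset < 0 ? new byte[0] : Arrays.copyOfRange(content, offset, content.length);
    }
}
```

In use, the array returned by `getContent()` would be passed as `content`. Keeping the HTTP body as bytes avoids corrupting binary payloads such as images.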
Copyright © 2007–2020 The Apache Software Foundation. All rights reserved.