public class WARCRecord extends Object

A record in a WARC file. A new WARCRecord is created by parsing it out of a DataInput stream.

The file format is documented in the [ISO Standard](http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf). In a nutshell, it's a textual format consisting of lines delimited by `\r\n`. Each record has the following structure:

1. A line indicating the WARC version number, such as `WARC/1.0`.
2. Several header lines (in key-value format, similar to HTTP or email headers), giving information about the record. The header is terminated by an empty line.
3. A body consisting of raw bytes (the number of bytes is indicated in one of the headers).
4. A final separator of `\r\n\r\n` before the next record starts.

There are several types of records, as documented on WARCRecord.Header.getRecordType().

| Modifier and Type | Class and Description |
|---|---|
| static class | WARCRecord.Header: Contains the parsed headers of a WARCRecord. |
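
The four-part record structure described in the class overview can be illustrated with a minimal, standalone parser. This is only a sketch and not the implementation of WARCRecord: the class `WarcReadSketch`, its `Record` holder, and its `readLine` and `readRecord` helpers are hypothetical names introduced here. It assumes the body length is given by the standard `Content-Length` header, treats header names as case-sensitive, and ignores header continuation lines.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of the documented API.
class WarcReadSketch {

    /** Hypothetical holder for the three parts of a parsed record. */
    static final class Record {
        final String version;
        final Map<String, String> headers;
        final byte[] content;

        Record(String version, Map<String, String> headers, byte[] content) {
            this.version = version;
            this.headers = headers;
            this.content = content;
        }
    }

    /** Reads bytes up to the next \r\n and returns them without the terminator. */
    static String readLine(DataInput in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean seenCr = false;
        while (true) {
            byte b = in.readByte();            // throws EOFException at end of input
            if (seenCr && b == '\n') {
                break;                         // found the \r\n terminator
            }
            if (seenCr) {
                buf.write('\r');               // a lone \r is part of the line
            }
            seenCr = (b == '\r');
            if (!seenCr) {
                buf.write(b);
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Reads exactly one record: version line, headers, body, separator. */
    static Record readRecord(DataInput in) throws IOException {
        // 1. Version line, such as WARC/1.0
        String version = readLine(in);
        if (!version.startsWith("WARC/")) {
            throw new IOException("Not a WARC version line: " + version);
        }
        // 2. Key-value header lines, terminated by an empty line
        Map<String, String> headers = new LinkedHashMap<>();
        String line;
        while (!(line = readLine(in)).isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                throw new IOException("Malformed header line: " + line);
            }
            headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        // 3. Body of raw bytes; its length is taken from Content-Length (assumption)
        String length = headers.get("Content-Length");
        if (length == null) {
            throw new IOException("Missing Content-Length header");
        }
        byte[] content = new byte[Integer.parseInt(length)];
        in.readFully(content);
        // 4. Final separator \r\n\r\n before the next record
        byte[] separator = new byte[4];
        in.readFully(separator);
        if (!Arrays.equals(separator, "\r\n\r\n".getBytes(StandardCharsets.US_ASCII))) {
            throw new IOException("Missing record separator");
        }
        return new Record(version, headers, content);
    }

    public static void main(String[] args) throws IOException {
        String raw = "WARC/1.0\r\n"
                + "WARC-Type: warcinfo\r\n"
                + "Content-Length: 5\r\n"
                + "\r\n"
                + "hello"
                + "\r\n\r\n";
        DataInput in = new DataInputStream(
                new ByteArrayInputStream(raw.getBytes(StandardCharsets.UTF_8)));
        Record record = readRecord(in);
        System.out.println(record.headers.get("WARC-Type") + ", "
                + record.content.length + " body bytes");
    }
}
```

Reading the body with `readFully` rather than line by line matters because the body is raw bytes and may itself contain `\r\n` sequences.
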

| Modifier and Type | Field and Description |
|---|---|
| private byte[] | content |
| private static Pattern | CONTINUATION_PATTERN |
| private static String | CRLF |
| private static byte[] | CRLF_BYTES |
| private WARCRecord.Header | header |
| private static Pattern | VERSION_PATTERN |
| static String | WARC_VERSION |

| Constructor and Description |
|---|
| WARCRecord(DataInput in): Creates a new WARCRecord by parsing it out of a DataInput stream. |

| Modifier and Type | Method and Description |
|---|---|
| byte[] | getContent(): Returns the body of the record, as an unparsed raw array of bytes. |
| WARCRecord.Header | getHeader(): Returns the parsed header structure of the WARC record. |
| private static WARCRecord.Header | readHeader(DataInput in) |
| private static String | readLine(DataInput in) |
| private static void | readSeparator(DataInput in) |
| String | toString(): Returns a human-readable string representation of the record. |
| void | write(DataOutput out): Writes this record to a DataOutput stream. |
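
For the writing direction, the same record layout can be emitted to a DataOutput. The following is a sketch under stated assumptions, not the code behind write(DataOutput): `WarcWriteSketch` and `writeRecord` are hypothetical names, and the headers are passed as a plain map rather than a WARCRecord.Header. Because this sketch always writes headers as `Name: value` with a single space, input that used different whitespace would not round-trip byte for byte, although its meaning would be unchanged.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of the documented API.
class WarcWriteSketch {
    private static final String CRLF = "\r\n";

    /** Appends one record (version line, headers, body, separator) to out. */
    static void writeRecord(DataOutput out, String version,
                            Map<String, String> headers, byte[] content) throws IOException {
        StringBuilder head = new StringBuilder();
        head.append(version).append(CRLF);                 // version line
        for (Map.Entry<String, String> e : headers.entrySet()) {
            head.append(e.getKey()).append(": ")           // normalized "Name: value"
                .append(e.getValue()).append(CRLF);
        }
        head.append(CRLF);                                 // empty line ends the header
        out.write(head.toString().getBytes(StandardCharsets.UTF_8));
        out.write(content);                                // raw body bytes
        out.write((CRLF + CRLF).getBytes(StandardCharsets.US_ASCII)); // record separator
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("WARC-Type", "warcinfo");
        headers.put("Content-Length", "5");

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeRecord(new DataOutputStream(bytes), "WARC/1.0", headers,
                "hello".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(bytes.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The caller is responsible for keeping `Content-Length` consistent with the body in this sketch; a fuller version would derive it from `content.length`.
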
public static final String WARC_VERSION
private static final Pattern VERSION_PATTERN
private static final Pattern CONTINUATION_PATTERN
private static final String CRLF
private static final byte[] CRLF_BYTES
private final WARCRecord.Header header
private final byte[] content
public WARCRecord(DataInput in) throws IOException

Creates a new WARCRecord by parsing it out of a DataInput stream.

Parameters: in - The input source from which one record will be read.

Throws: IOException

private static WARCRecord.Header readHeader(DataInput in) throws IOException

Throws: IOException

private static String readLine(DataInput in) throws IOException

Throws: IOException

private static void readSeparator(DataInput in) throws IOException

Throws: IOException

public WARCRecord.Header getHeader()

Returns the parsed header structure of the WARC record.

public byte[] getContent()

Returns the body of the record, as an unparsed raw array of bytes. The content of the body depends on the type of record (see WARCRecord.Header.getRecordType()). For example, in the case of a `response` type record, the body consists of the full HTTP response returned by the server (HTTP headers followed by the body).

public void write(DataOutput out) throws IOException

Writes this record to a DataOutput stream. In some edge cases the output may not be byte-for-byte identical to what was parsed from a DataInput. However, it has the same meaning and should not lose any information.

Parameters: out - The output stream to which this record should be appended.

Throws: IOException

Copyright © 2007–2020 The Apache Software Foundation. All rights reserved.