Class WARCRecord.Header

java.lang.Object
org.apache.hadoop.hbase.test.util.warc.WARCRecord.Header
Enclosing class:
WARCRecord

public static final class WARCRecord.Header extends Object
Contains the parsed headers of a WARCRecord. Each record contains a number of headers in key-value format, where some header keys are standardised, but nonstandard ones can be added.

The documentation of the methods in this class is excerpted from the WARC 1.0 specification. Please see the specification for more detail.

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    private final Map<String,String>
     
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    private
     
  • Method Summary

    Modifier and Type
    Method
    Description
    int
    getContentLength()
    The number of bytes in the body of the record, similar to RFC2616.
    String
    getContentType()
    The MIME type (RFC2045) of the information contained in the record's block.
    String
    getDateString()
    A 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ, described in the W3C profile of ISO8601.
    String
    getField(String field)
    Returns the value of a selected header field, or null if there is no header with that field name.
    String
    getRecordID()
    An identifier assigned to the current record that is globally unique for its period of intended use.
    String
    getRecordType()
    Returns the type of WARC record (the value of the `WARC-Type` header field).
    String
    getTargetURI()
    The original URI whose capture gave rise to the information content in this record.
    String
    toString()
    Formats this header in WARC/1.0 format, consisting of a version line followed by colon-delimited key-value pairs, and `\r\n` line endings.
    void
    write(DataOutput out)
    Appends this header to a DataOutput stream, in WARC/1.0 format.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

  • Constructor Details

  • Method Details

    • getRecordType

      public String getRecordType()
      Returns the type of WARC record (the value of the `WARC-Type` header field). WARC 1.0 defines the following record types (for full definitions, see the specification):
      • `warcinfo`: Describes the records that follow it, up through end of file, end of input, or until next `warcinfo` record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often contains information about the web crawl which generated the following records.

        The format of this descriptive record block may vary, though the use of the `"application/warc-fields"` content-type is recommended. (...)

      • `response`: The record should contain a complete scheme-specific response, including network protocol information where possible. For a target-URI of the `http` or `https` schemes, a `response` record block should contain the full HTTP response received over the network, including headers. That is, it contains the 'Response' message defined by section 6 of HTTP/1.1 (RFC2616).

        The WARC record's Content-Type field should contain the value defined by HTTP/1.1, `"application/http;msgtype=response"`. The payload of the record is defined as its 'entity-body' (per RFC2616), with any transfer-encoding removed.

      • `resource`: The record contains a resource, without full protocol response information. For example: a file directly retrieved from a locally accessible repository or the result of a networked retrieval where the protocol information has been discarded. For a target-URI of the `http` or `https` schemes, a `resource` record block shall contain the returned 'entity-body' (per RFC2616, with any transfer-encodings removed), possibly truncated.
      • `request`: The record holds the details of a complete scheme-specific request, including network protocol information where possible. For a target-URI of the `http` or `https` schemes, a `request` record block should contain the full HTTP request sent over the network, including headers. That is, it contains the 'Request' message defined by section 5 of HTTP/1.1 (RFC2616).

        The WARC record's Content-Type field should contain the value defined by HTTP/1.1, `"application/http;msgtype=request"`. The payload of a `request` record with a target-URI of scheme `http` or `https` is defined as its 'entity-body' (per RFC2616), with any transfer-encoding removed.

      • `metadata`: The record contains content created in order to further describe, explain, or accompany a harvested resource, in ways not covered by other record types. A `metadata` record will almost always refer to another record of another type, with that other record holding original harvested or transformed content.

        The format of the metadata record block may vary. The `"application/warc-fields"` format may be used.

      • `revisit`: The record describes the revisitation of content already archived, and might include only an abbreviated content body which has to be interpreted relative to a previous record. Most typically, a `revisit` record is used instead of a `response` or `resource` record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.

        A `revisit` record shall contain a WARC-Profile field which determines the interpretation of the record's fields and record block. Please see the specification for details.

      • `conversion`: The record shall contain an alternative version of another record's content that was created as the result of an archival process. Typically, this is used to hold content transformations that maintain viability of content after widely available rendering tools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order to keep the information usable with current tools while minimizing loss of information.
      • `continuation`: Record blocks from `continuation` records must be appended to corresponding prior record blocks (e.g. from other WARC files) to create the logically complete full-sized original record. That is, `continuation` records are used when a record that would otherwise cause a WARC file size to exceed a desired limit is broken into segments. A `continuation` record shall contain the named fields `WARC-Segment-Origin-ID` and `WARC-Segment-Number`, and the last `continuation` record of a series shall contain a `WARC-Segment-Total-Length` field. Please see the specification for details.
      • Other record types may be added in future, so this list is not exclusive.
      Returns:
      The record's `WARC-Type` header field, as a string.
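
As a rough illustration of how a caller might branch on the returned type, the sketch below checks a type string against the record types listed above. The class `RecordTypeExample` and its helper methods are hypothetical and are not part of `WARCRecord.Header`; in real code the string would come from `getRecordType()`.

```java
import java.util.Set;

final class RecordTypeExample {
    // Record types named by WARC 1.0. The list is not exclusive, so callers
    // should be prepared to see values that are not in this set.
    static final Set<String> KNOWN_TYPES = Set.of(
        "warcinfo", "response", "resource", "request",
        "metadata", "revisit", "conversion", "continuation");

    /** Returns true if the type is one of the eight types named by WARC 1.0. */
    static boolean isKnownType(String recordType) {
        return recordType != null && KNOWN_TYPES.contains(recordType);
    }

    /**
     * Returns true for the types whose block should hold a full HTTP message
     * when the target URI uses the http or https scheme.
     */
    static boolean mayHoldHttpMessage(String recordType) {
        return "response".equals(recordType) || "request".equals(recordType);
    }

    public static void main(String[] args) {
        // Hypothetical value; in practice this would be header.getRecordType().
        String recordType = "response";
        System.out.println(isKnownType(recordType));
        System.out.println(mayHoldHttpMessage(recordType));
    }
}
```

Treating unknown types as a normal case, rather than an error, follows from the note that other record types may be added in future.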
    • getDateString

      public String getDateString()
      A 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ, described in the W3C profile of ISO8601. The timestamp shall represent the instant that data capture for record creation began. Multiple records written as part of a single capture event shall use the same WARC-Date, even though the times of their writing will not be exactly synchronized.
      Returns:
      The record's `WARC-Date` header field, as a string.
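
Because the value is returned as a string, callers that need to compare or sort capture times have to parse it themselves. The sketch below shows one possible way using `java.time.Instant`; the class `WarcDateExample` and the sample timestamp are illustrative only. Note that `Instant.parse` also accepts some inputs (such as fractional seconds) that the format above does not describe.

```java
import java.time.Instant;

final class WarcDateExample {
    /**
     * Parses a WARC-Date value of the form YYYY-MM-DDThh:mm:ssZ.
     * Throws java.time.format.DateTimeParseException on malformed input.
     */
    static Instant parseWarcDate(String warcDate) {
        return Instant.parse(warcDate);
    }

    public static void main(String[] args) {
        // Hypothetical value; in practice this would be header.getDateString().
        Instant captured = parseWarcDate("2016-09-19T17:20:24Z");
        System.out.println(captured);
    }
}
```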
    • getRecordID

      public String getRecordID()
      An identifier assigned to the current record that is globally unique for its period of intended use. No identifier scheme is mandated by this specification, but each record-id shall be a legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g., via a URI scheme prefix such as `http:` or `urn:`).
      Returns:
      The record's `WARC-Record-ID` header field, as a string.
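
Since each record ID must be a legal URI with an identifiable scheme, a caller could sanity-check it with `java.net.URI`. The sketch below is one way to do that; `RecordIdExample` is a hypothetical helper. It tolerates surrounding angle brackets, which raw WARC files use around URIs. Whether `getRecordID()` returns the value with or without those brackets is not stated here, so the handling is an assumption.

```java
import java.net.URI;

final class RecordIdExample {
    /**
     * Parses a record ID into a URI and requires a scheme such as "urn" or "http".
     * Throws IllegalArgumentException if the value is not a URI or has no scheme.
     */
    static URI parseRecordId(String recordId) {
        String value = recordId.trim();
        // Assumption: the ID may still carry the <...> delimiters from the file.
        if (value.startsWith("<") && value.endsWith(">")) {
            value = value.substring(1, value.length() - 1);
        }
        URI uri = URI.create(value);
        if (uri.getScheme() == null) {
            throw new IllegalArgumentException("record ID has no URI scheme: " + value);
        }
        return uri;
    }

    public static void main(String[] args) {
        // Hypothetical value; in practice this would be header.getRecordID().
        URI id = parseRecordId("<urn:uuid:12345678-1234-1234-1234-123456789abc>");
        System.out.println(id.getScheme());
    }
}
```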
    • getContentType

      public String getContentType()
      The MIME type (RFC2045) of the information contained in the record's block. For example, in HTTP request and response records, this would be `application/http` as per section 19.1 of RFC2616 (or `application/http; msgtype=request` and `application/http; msgtype=response` respectively).

      In particular, the content-type is *not* the value of the HTTP Content-Type header in an HTTP response, but a MIME type to describe the full archived HTTP message (hence `application/http` if the block contains request or response headers).

      Returns:
      The record's `Content-Type` header field, as a string.
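
The returned string may carry parameters such as `msgtype`, with or without a space after the semicolon. The sketch below shows one simple way to separate the base type from its parameters; `ContentTypeExample` is a hypothetical helper, and it deliberately ignores quoted parameter values, which a full RFC2045 parser would have to handle.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class ContentTypeExample {
    /** Returns the lower-cased type/subtype part, without parameters. */
    static String baseType(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon < 0 ? contentType : contentType.substring(0, semicolon);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the parameters after the first ';' as a name-to-value map. */
    static Map<String, String> parameters(String contentType) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            int equals = parts[i].indexOf('=');
            if (equals < 0) {
                continue; // skip fragments that are not name=value pairs
            }
            String name = parts[i].substring(0, equals).trim().toLowerCase(Locale.ROOT);
            String value = parts[i].substring(equals + 1).trim();
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        // Hypothetical value; in practice this would be header.getContentType().
        String contentType = "application/http; msgtype=response";
        System.out.println(baseType(contentType));
        System.out.println(parameters(contentType).get("msgtype"));
    }
}
```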
    • getTargetURI

      public String getTargetURI()
      The original URI whose capture gave rise to the information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. For a `revisit` record, it is the URI that was the target of a retrieval request. Indirectly, such as for a `metadata` or `conversion` record, it is a copy of the `WARC-Target-URI` appearing in the original record to which the newer record pertains. The URI in this value shall be properly escaped according to RFC3986, and written with no internal whitespace.
      Returns:
      The record's `WARC-Target-URI` header field, as a string.
    • getContentLength

      public int getContentLength()
      The number of bytes in the body of the record, similar to RFC2616.
      Returns:
      The record's `Content-Length` header field, parsed into an int.
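
Parsing the field into an `int` limits the representable length to `Integer.MAX_VALUE` bytes. The sketch below shows what a straightforward `Integer.parseInt` conversion does with a raw field value; `ContentLengthExample` is a hypothetical helper, and how this class itself reacts to a malformed or oversized value is not documented here.

```java
final class ContentLengthExample {
    /**
     * Converts a raw Content-Length value to an int.
     * Throws NumberFormatException if the value is not a non-negative
     * decimal integer that fits in an int.
     */
    static int parseContentLength(String rawValue) {
        int length = Integer.parseInt(rawValue.trim());
        if (length < 0) {
            throw new NumberFormatException("negative Content-Length: " + rawValue);
        }
        return length;
    }

    public static void main(String[] args) {
        System.out.println(parseContentLength("1024"));
    }
}
```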
    • getField

      public String getField(String field)
      Returns the value of a selected header field, or null if there is no header with that field name.
      Parameters:
      field - The name of the header to return (case-sensitive).
      Returns:
      The value associated with that field name, or null if not present.
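
Because the lookup is case-sensitive and returns null for a missing field, callers may want to wrap the result before using it. The sketch below models the lookup with a plain `Map` as a stand-in for the header; `FieldLookupExample` and its methods are hypothetical and only illustrate the null-handling pattern.

```java
import java.util.Map;
import java.util.Optional;

final class FieldLookupExample {
    /** Stand-in for header.getField(name): an exact, case-sensitive lookup. */
    static String getField(Map<String, String> fields, String name) {
        return fields.get(name);
    }

    /** Wraps the nullable result so that a missing field cannot be dereferenced by accident. */
    static Optional<String> findField(Map<String, String> fields, String name) {
        return Optional.ofNullable(getField(fields, name));
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of("WARC-Type", "response");
        System.out.println(findField(fields, "WARC-Type").orElse("(missing)"));
        // A different capitalisation does not match.
        System.out.println(findField(fields, "warc-type").orElse("(missing)"));
    }
}
```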
    • write

      public void write(DataOutput out) throws IOException
      Appends this header to a DataOutput stream, in WARC/1.0 format.
      Parameters:
      out - The data output to which the header should be written.
      Throws:
      IOException
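
Any `DataOutput` implementation can serve as the target, for example a `DataOutputStream` wrapped around an in-memory buffer. The sketch below writes an already formatted header string to such a stream. `HeaderWriteExample` is a hypothetical helper, and the UTF-8 encoding is an assumption, since the encoding this method uses is not stated here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class HeaderWriteExample {
    /** Writes a formatted header to the output as raw bytes (UTF-8 assumed). */
    static void writeHeader(String formattedHeader, DataOutput out) throws IOException {
        out.write(formattedHeader.getBytes(StandardCharsets.UTF_8));
    }

    /** Collects the written bytes in memory, which is convenient for tests. */
    static byte[] toBytes(String formattedHeader) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeHeader(formattedHeader, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes("WARC/1.0\r\nWARC-Type: response\r\n");
        System.out.println(bytes.length);
    }
}
```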
    • toString

      public String toString()
      Formats this header in WARC/1.0 format, consisting of a version line followed by colon-delimited key-value pairs, and `\r\n` line endings.
      Overrides:
      toString in class Object
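
The layout described above can be sketched as follows. `HeaderFormatExample` is a hypothetical stand-in that formats a plain `Map`; the single space after each colon and the absence of a trailing blank line are assumptions, not details confirmed by this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HeaderFormatExample {
    private static final String CRLF = "\r\n";

    /** Formats a version line followed by one "Key: Value" line per field. */
    static String format(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("WARC/1.0").append(CRLF);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            sb.append(field.getKey()).append(": ").append(field.getValue()).append(CRLF);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("WARC-Type", "response");
        fields.put("Content-Length", "0");
        System.out.print(format(fields));
    }
}
```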