HTTP Explained


August 7, 1999

Although an understanding of HTTP is not strictly necessary for the development of CGI applications, some appreciation of "what's under the hood" will certainly help you to develop them with more fluency and confidence. As with any field of endeavour, a grasp of the fundamental underlying principles allows you to visualise the structures and processes involved in the CGI transactions between clients and servers - giving you a more comprehensive mental model on which to base your programming.

Beneath the user interface presented by browsers lie the network and the protocols that travel the wires to the servers, or "engines", that process requests and return the various media. The protocol of the web is HTTP, the HyperText Transfer Protocol. HTTP is the underlying mechanism on which CGI operates, and it directly determines what you can and cannot send or receive via CGI.

Tim Berners-Lee implemented the HTTP protocol in 1990-1 at CERN, the European Laboratory for Particle Physics in Geneva, Switzerland. HTTP stands at the very core of the World Wide Web. According to the HTTP 1.0 specification:

The Hypertext Transfer Protocol (HTTP) is an application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. It is a generic, stateless, object-oriented protocol which can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands). A feature of HTTP is the typing and negotiation of data representation, allowing systems to be built independently of the data being transferred.

  • A comprehensive addressing scheme
    The HTTP protocol uses the concept of reference provided by the Uniform Resource Identifier (URI), as a location (URL) or name (URN), to indicate the resource on which a method is to be applied. When an HTML hyperlink is composed, the URL (Uniform Resource Locator) is of the general form http://host:port-number/path/file.html. More generally, a URL reference is of the type service://host/file.file-extension, and in this way the HTTP protocol can subsume the more basic Internet services.

  • Client-Server Architecture
    The HTTP protocol is based on a request/response paradigm. The communication generally takes place over a TCP/IP connection on the Internet. The default port is 80, but other ports can be used. This does not preclude the HTTP/1.0 protocol from being implemented on top of any other protocol on the Internet, so long as reliability can be guaranteed.

  • The HTTP protocol is connectionless and stateless
    After the server has responded to the client's request, the connection between client and server is dropped and forgotten. There is no "memory" between client connections. The pure HTTP server implementation treats every request as if it was brand-new, i.e. without context.

  • An extensible and open representation for data types
    HTTP uses Internet Media Types (formerly referred to as MIME Content-Types) to provide open and extensible data typing and type negotiation. When the HTTP server transmits information back to the client, it includes a MIME-like (Multipurpose Internet Mail Extensions) header to tell the client what kind of data follows the header. Interpreting the data then depends on the client possessing the appropriate utility (image viewer, movie player, etc.) for that data type.

    An HTTP transaction consists of a header followed optionally by an empty line and some data. The header specifies such things as the action required of the server, the type of data being returned, or a status code.

    The header lines received from the client, if any, are placed by the server into the CGI environment variables with the prefix HTTP_ followed by the header name. Any - characters in the header name are changed to _ characters. The server may exclude any headers which it has already processed, such as Authorization, Content-type, and Content-length.

    • HTTP_ACCEPT

      The MIME types which the client will accept, as given by HTTP headers. Other protocols may need to get this information from elsewhere. Each item in this list should be separated by commas as per the HTTP spec.

      Format: type/subtype, type/subtype

    • HTTP_USER_AGENT

      The browser the client is using to send the request. General format: software/version library/version.
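The header-to-variable naming rule described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular server's implementation: the function names `cgi_variable_name` and `parse_accept` are hypothetical, and the upper-casing follows the HTTP_ACCEPT and HTTP_USER_AGENT examples shown here.

```python
def cgi_variable_name(header_name):
    """Map an HTTP request header name to its CGI environment variable name."""
    # Prefix with HTTP_, upper-case the name, and change "-" to "_".
    return "HTTP_" + header_name.upper().replace("-", "_")


def parse_accept(value):
    """Split an HTTP_ACCEPT value into its comma-separated type/subtype items."""
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    print(cgi_variable_name("User-Agent"))
    print(parse_accept("text/html, image/gif"))
```

A CGI program would normally read these values from its process environment, for example `os.environ.get("HTTP_ACCEPT", "")`.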

    The server sends back to the client:
    • A status code that indicates whether the request was successful or not. Typical error codes indicate that the requested file was not found, that the request was malformed, or that authentication is required to access the file.

    • The data itself. Since HTTP is liberal about sending documents of any format, it is ideal for transmitting multimedia such as graphics, audio, and video files.

      It also sends back information about the object being returned.

    Content-Type

    Indicates the media type of the data sent to the recipient or, in the case of the HEAD method, the media type that would have been sent had the request been a GET.
    Content-Type: text/html

    Date

    The date and time at which the message was originated.
    Date: Tue, 15 Nov 1994 08:12:31 GMT

    Expires

    The date after which the information in the document ceases to be valid. Caching clients, including proxies, must not cache this copy of the resource beyond the date given, unless its status has been updated by a later check of the origin server.
    Expires: Thu, 01 Dec 1994 16:00:00 GMT

    From

    An Internet e-mail address for the human user who controls the requesting user agent. The request is being performed on behalf of the person given, who accepts responsibility for the method performed. Robot agents should include this header so that the person responsible for running the robot can be contacted if problems occur on the receiving end.
    From: Stars@WDVL.com

    If-Modified-Since

    Used with the GET method to make it conditional: if the requested resource has not been modified since the time specified in this field, a copy of the resource will not be returned from the server; instead, a 304 (Not Modified) response will be returned without any data.
    If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT

    Last-Modified

    Indicates the date and time at which the sender believes the resource was last modified. Useful for clients that eliminate unnecessary transfers by using caching.
    Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT

    Location

    The Location response header field defines the exact location of the resource that was identified by the request URI. If the value is a full URL, the server returns a "redirect" to the client to retrieve the specified object directly.
    Location: http://WWW.Stars.com/Tutorial/HTTP/index.html
    If you want to reference another file on your own server, you should output a partial URL, such as the following:
    Location: /Tutorial/HTTP/index.html

    Referer

    Allows the client to specify, for the server's benefit, the address (URI) of the resource from which the request URI was obtained. This allows a server to generate lists of back-links to resources for interest, logging, optimized caching, etc. It also allows obsolete or mistyped links to be traced for maintenance.
    Referer: http://WWW.Stars.com/index.html

    Server

    The Server response header field contains information about the software used by the origin server to handle the request.
    Server: CERN/3.0 libwww/2.17

    User-Agent

    Information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations - such as inability to support HTML tables.
    User-Agent: CERN-LineMode/2.15 libwww/2.17b3

    HTTP/1.0 allows an open-ended set of methods to be used to indicate the purpose of a request. The three most often used methods are GET, HEAD, and POST.

    The GET Method
    Information from a form using the GET method is appended onto the end of the action URI being requested. Your CGI program will receive the encoded form input in the environment variable QUERY_STRING.
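As a rough sketch of this round trip, the Python below appends encoded form fields to an action URI and then decodes them again from a QUERY_STRING value. The host name, script path, and helper function names are hypothetical; only the `urllib.parse` calls are standard library.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_get_url(action_uri, fields):
    """Append URL-encoded form fields to the action URI, as a browser does for GET."""
    return action_uri + "?" + urlencode(fields)


def form_fields_from_environ(environ):
    """Decode QUERY_STRING from a CGI environment mapping into a name -> value dict."""
    query = environ.get("QUERY_STRING", "")
    # parse_qs returns a list of values per name; keep the first for simplicity.
    return {name: values[0] for name, values in parse_qs(query).items()}


if __name__ == "__main__":
    # Hypothetical action URI and form fields.
    url = build_get_url("http://www.example.com/cgi-bin/search",
                        {"org": "CyberWeb SoftWare", "users": "10000"})
    print(url)
    # The server places everything after "?" into QUERY_STRING.
    print(form_fields_from_environ({"QUERY_STRING": urlsplit(url).query}))
```

In a real CGI program the mapping passed to `form_fields_from_environ` would be `os.environ`.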

    The GET method is used to ask for a specific document - when you click on a hyperlink, GET is being used. GET should probably be used when a URL access will not change the state of a database (by, for example, adding or deleting information) and POST should be used when an access will cause a change. Many database searches have no visible side-effects and make ideal applications of query forms using GET.

    The semantics of the GET method change to a "conditional GET" if the request message includes an If-Modified-Since header field. A conditional GET requests that the identified resource be transferred only if it has been modified since the date given by the If-Modified-Since header.
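The decision a server makes for a conditional GET can be sketched as follows. This is an illustrative outline under stated assumptions, not a real server's logic: the function name `conditional_get_status` is hypothetical, the resource's modification time is assumed to be available as a timezone-aware `datetime`, and an unparseable date is treated as if the header were absent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get_status(last_modified, if_modified_since=None):
    """Return 304 if the resource is unchanged since the client's date, else 200."""
    if if_modified_since is None:
        return 200  # plain GET: always send the resource
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # assumption: ignore a date that cannot be parsed
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    return 304 if last_modified <= since else 200


if __name__ == "__main__":
    modified = datetime(1994, 11, 15, 12, 45, 26, tzinfo=timezone.utc)
    # Format the modification time the way a Last-Modified header carries it.
    print("Last-Modified:", format_datetime(modified, usegmt=True))
    print(conditional_get_status(modified, "Sat, 29 Oct 1994 19:43:31 GMT"))
```

Here the resource changed after the date the client supplied, so the sketch answers 200 and the full document would follow.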

    The HEAD Method
    The HEAD method is used to ask only for information about a document, not for the document itself. HEAD is much faster than GET, as a much smaller amount of data is transferred. It's often used by clients who use caching, to see if the document has changed since it was last accessed. If it has not, the local copy can be reused; otherwise the updated version must be retrieved with a GET.
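A caching client's check might look roughly like the sketch below. It assumes the client stored the Last-Modified value alongside its cached copy; both function names are hypothetical, and `fetch_last_modified` needs a reachable server, so it is not called in the demonstration.

```python
import http.client
from email.utils import parsedate_to_datetime


def fetch_last_modified(host, path, port=80):
    """Send a HEAD request and return the Last-Modified header value, or None."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.getheader("Last-Modified")
    finally:
        conn.close()


def cached_copy_is_current(cached_last_modified, server_last_modified):
    """True if the server's copy is no newer than the one held in the cache."""
    if not cached_last_modified or not server_last_modified:
        return False  # without both dates, assume a fresh GET is needed
    return (parsedate_to_datetime(server_last_modified)
            <= parsedate_to_datetime(cached_last_modified))


if __name__ == "__main__":
    cached = "Sat, 29 Oct 1994 19:43:31 GMT"
    server = "Tue, 15 Nov 1994 12:45:26 GMT"
    # The server's copy is newer, so the client should issue a GET.
    print(cached_copy_is_current(cached, server))
```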

    The POST Method
    This method transmits all form input in the body of the request, after the request headers and a blank line, rather than as part of the URI. Your CGI program will receive the encoded form input on stdin.

    POST /cgi-bin/post-query HTTP/1.0
    Accept: text/html
    Accept: video/mpeg
    Accept: image/gif
    Accept: application/postscript
    User-Agent: Lynx/2.2 libwww/2.14
    From: Stars@WDVL.com
    Content-type: application/x-www-form-urlencoded
    Content-length: 150
    * a blank line *
    org=CyberWeb%20SoftWare
    &users=10000
    &browsers=lynx

    • This is a "POST" query addressed for the program residing in the file at "/cgi-bin/post-query", that simply echoes the values it receives.

    • The client lists the MIME-types it is capable of accepting, and identifies itself and the version of the WWW library it is using.

    • Finally, it indicates the MIME-type it has used to encode the data it is sending, the number of characters included, and the list of variables and their values it has collected from the user.

    • MIME-type application/x-www-form-urlencoded means that the variable name-value pairs will be encoded the same way a URL is encoded. Any special characters, including punctuation, will be encoded as %nn, where nn is the ASCII value of the character in hexadecimal.
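On the receiving side, a CGI program handling a POST like the one above might decode its input roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the input stream is treated as text (a real program would read from `sys.stdin`, and Content-length counts bytes), and only the first value of each field is kept.

```python
import io
from urllib.parse import parse_qs, quote


def read_form_input(environ, stdin):
    """Read CONTENT_LENGTH characters of url-encoded form data and decode them."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = stdin.read(length)
    # parse_qs turns %nn escapes back into characters and splits on "&" and "=".
    return {name: values[0] for name, values in parse_qs(body).items()}


def percent_encode(text):
    """Encode every special character as %nn (hexadecimal character code)."""
    return quote(text, safe="")


if __name__ == "__main__":
    body = "org=CyberWeb%20SoftWare&users=10000&browsers=lynx"
    environ = {"CONTENT_TYPE": "application/x-www-form-urlencoded",
               "CONTENT_LENGTH": str(len(body))}
    print(read_form_input(environ, io.StringIO(body)))
    print(percent_encode("R&D, Inc."))
```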
    Here is an example of an HTTP response from a server to a client request:
    HTTP/1.0 200 OK
    Date: Wednesday, 02-Feb-95 23:04:12 GMT
    Server: NCSA/1.3
    MIME-version: 1.0
    Last-modified: Monday, 15-Nov-93 23:33:16 GMT
    Content-type: text/html
    Content-length: 2345
    * a blank line *
    <HTML><HEAD><TITLE> . . .
    • The server agrees to use HTTP version 1.0 for communication and sends the status 200 indicating it has successfully processed the client's request.

    • It then sends the date and identifies itself as an NCSA HTTP server.

    • It also indicates it is using MIME version 1.0 to describe the information it is sending, and includes the MIME-type of the information about to be sent in the "Content-type:" header.

    • Finally, it sends the number of characters it is going to send, followed by a blank line and the data itself.

    • Client and server headers are RFC 822 compliant mail headers. A client may send any number of Accept: headers, and the server is expected to convert the data into a form the client can accept.
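To make the layout of such a response concrete, here is a minimal Python sketch that splits a raw response into its status line, headers, and data. The function name `parse_response` is hypothetical, and the sketch assumes well-formed input with CRLF line endings and no folded header lines.

```python
def parse_response(raw):
    """Split a raw HTTP response into (version, status, reason, headers, body)."""
    # The first blank line separates the header block from the data.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        # Split on the first colon only; date values contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


if __name__ == "__main__":
    raw = ("HTTP/1.0 200 OK\r\n"
           "Date: Wednesday, 02-Feb-95 23:04:12 GMT\r\n"
           "Server: NCSA/1.3\r\n"
           "MIME-version: 1.0\r\n"
           "Last-modified: Monday, 15-Nov-93 23:33:16 GMT\r\n"
           "Content-type: text/html\r\n"
           "Content-length: 2345\r\n"
           "\r\n"
           "<HTML><HEAD><TITLE> . . .")
    version, status, reason, headers, body = parse_response(raw)
    print(version, status, reason)
    print(headers["content-type"], headers["content-length"])
```

Header names are lower-cased in the result because they are case-insensitive, which lets a caller look up "content-type" regardless of how the server spelled it.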
    The essential simplicity of HTTP has been a major factor in its rapid adoption, but this very simplicity has become its main drawback. The next generation of HTTP, dubbed "HTTP-NG", will be a replacement for HTTP 1.x with much higher performance and some extra features needed for use in commercial applications. It's designed to make it easy to implement the basic functionality needed by all browsers, whilst making the addition of more powerful features such as security and authentication much simpler.

    The current HTTP 1.0 often causes performance problems on the server side, and on the network, since it sets up a new connection for every request. Simon Spero has published a progress report on what the W3C calls "HTTP Next Generation", or HTTP-NG. HTTP-NG "divides up the connection [between client and server] into lots of different channels ... each object is returned over its own channel." HTTP-NG allows many different requests to be sent over a single connection. These requests are asynchronous - there's no need for the client to wait for a response before sending out a new request. The server can also respond to requests in any order it sees fit - it can even interweave the data from multiple objects, allowing several images to be transferred in "parallel".

    To make these multiple data streams easy to work with, HTTP-NG sends all its messages and data using a "session layer". This divides the connection up into lots of different channels. HTTP-NG sends all control messages (GET requests, meta-information, etc.) over a control channel. Each object is returned over its own channel. This also makes redirection much more powerful - for example, if the object is a video, the server can return the meta-information over the same connection, together with a URL pointing to a dedicated video transfer protocol that will fetch the data for the relevant object. This becomes very important when working with multimedia-aware networking technologies, such as ATM or RSVP.

    Additional Resources: