The Hypertext Transfer Protocol (HTTP) is the common language that web browsers and web servers use to communicate with each other on the Internet. CGI is built on top of HTTP, so to understand CGI fully, it helps to understand HTTP. One of the reasons CGI is so powerful is that it allows you to manipulate the metadata exchanged between the web browser and server and thus perform many useful tricks, including:
Serve content of varying type, language, or other encoding according to the client’s needs.
Check the user’s previous location.
Check the browser type and version and adapt your response to it.
Specify how long the client can cache a page before it is considered outdated and should be reloaded.
We won’t cover all of the details of HTTP, just what is important for our understanding of CGI. Specifically, we’ll focus on the request and response process: how browsers ask for and receive web pages.
If you are interested in understanding more about HTTP than we provide here, visit the World Wide Web Consortium’s web site at http://www.w3.org/Protocols/. On the other hand, if you are eager to get started writing CGI scripts, you may be tempted to skip this chapter. We encourage you not to. Although you can certainly learn to write CGI scripts without learning HTTP, without the bigger picture you may end up memorizing what to do instead of understanding why. This is certainly the most challenging chapter, however, because we cover a lot of material without many examples. So if you find it a little dry and want to peek ahead to the fun stuff, we’ll forgive you. Just be sure to return here later.
During our discussion of HTTP and CGI, we will often be referring to URLs, or Uniform Resource Locators. If you have used the Web at all, then you are probably familiar with URLs. In web terms, a resource represents anything available on the Web, whether it be an HTML page, an image, a CGI script, etc. URLs provide a standard way to locate these resources on the Web.
Note that URLs are not actually specific to HTTP; they can refer to resources in many protocols. Our discussion here will focus strictly on HTTP URLs.
HTTP URLs consist of a scheme, a host name, a port number, a path, a query string, and a fragment identifier, any of which may be omitted under certain circumstances (see Figure 2.1).
HTTP URLs contain the following elements:
The scheme represents the protocol, and for our purposes will be either http or https. https represents a connection to a secure web server. Refer to The Secure Sockets Layer later in this chapter.
The host identifies the machine running a web server. It can be a domain name or an IP address, although using IP addresses in URLs is strongly discouraged. The problem is that IP addresses often change for any number of reasons: a web site may move from one machine to another, or it may relocate to another network. Domain names can remain constant in these cases, allowing these changes to remain hidden from the user.
The port number is optional and may appear in URLs only if the host is also included. The host and port are separated by a colon. If the port is not specified, port 80 is used for http URLs and port 443 is used for https URLs.
It is possible to configure a web server to answer on other ports. This is often done if two different web servers need to operate on the same machine, or if a web server is operated by someone who does not have sufficient rights on the machine to start a server on these ports (e.g., only root may bind to ports below 1024 on Unix machines). However, servers using ports other than the standard 80 and 443 may be inaccessible to users behind firewalls. Some firewalls are configured to restrict access to all but a narrow set of ports representing the defaults for certain allowed protocols.
Path information represents the location of the resource being requested, such as an HTML file or a CGI script. Depending on how your web server is configured, it may or may not map to some actual file path on your system. As we mentioned in the last chapter, the URL path for CGI scripts generally begins with /cgi/ or /cgi-bin/, and these paths are mapped to a similarly named directory on the web server, such as /usr/local/apache/cgi-bin.
Note that the URL for a script may include path information beyond the location of the script itself. For example, say you have a CGI at:
http://localhost/cgi/browse_docs.cgi
You can pass extra path information to the script by appending it to the end, for example:
http://localhost/cgi/browse_docs.cgi/docs/product/description.text
Here the path /docs/product/description.text is passed to the script. We explain how to access and use this additional path information in more detail in the next chapter.
A query string passes additional parameters to scripts. It is sometimes referred to as a search string or an index. It may contain name and value pairs, in which each pair is separated from the next pair by an ampersand (&), and the name and value are separated from each other by an equals sign (=). We discuss how to parse and use this information in your scripts in the next chapter.
Query strings can also include data that is not formatted as name-value pairs. If a query string does not contain an equals sign, it is often referred to as an index. Each argument should be separated from the next by an encoded space (encoded either as + or %20; see Section 2.1.3 below). CGI scripts handle indexes a little differently, as we will see in the next chapter.
Fragment identifiers refer to a specific section in a resource. Fragment identifiers are not sent to web servers, so you cannot access this component of the URLs in your CGI scripts. Instead, the browser fetches a resource and then applies the fragment identifier to locate the appropriate section in the resource. For HTML documents, fragment identifiers refer to anchor tags within the document:
<a name="anchor" >Here is the content you're after...</a>
The following URL would request the full document and then scroll to the section marked by the anchor tag:
http://localhost/document.html#anchor
If no anchor matching the fragment identifier is found, web browsers generally display the document from the top.
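To make the elements above concrete, here is a minimal sketch that splits an HTTP URL into its parts using only core Perl. The subroutine name split_url, the sample URL (including port 8080 and its parameters), and the choice of a pattern adapted from the generic URI regular expression in RFC 3986 are assumptions made for illustration; this is not how a web server or the CPAN modules necessarily do it.

```perl
use strict;
use warnings;

# Split an HTTP URL into scheme, host, port, path, query, and fragment.
# split_url is a hypothetical helper; the pattern is adapted from the
# generic URI regular expression in RFC 3986, Appendix B.
sub split_url {
    my $url = shift;
    my ( $scheme, $authority, $path, $query, $fragment ) =
        $url =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$}s;

    # The host and optional port are separated by a colon.
    my ( $host, $port );
    if ( defined $authority ) {
        ( $host, $port ) = $authority =~ /^(.*?)(?::(\d+))?$/;
    }

    # Fall back to the default port for the scheme when none is given.
    my %default_port = ( http => 80, https => 443 );
    $port = $default_port{ lc $scheme }
        if defined $scheme and not defined $port;

    return (
        scheme => $scheme, host  => $host,  port     => $port,
        path   => $path,   query => $query, fragment => $fragment,
    );
}

my %part = split_url(
    "http://localhost:8080/cgi/browse_docs.cgi/docs/product?name=widget&color=blue#anchor"
);
print "$_: $part{$_}\n" for qw( scheme host port path query fragment );

# Name-value pairs are separated by "&"; each name is separated from
# its value by "=". The pieces are still URL encoded at this point.
for my $pair ( split /&/, $part{query} ) {
    my ( $name, $value ) = split /=/, $pair, 2;
    print "  $name => $value\n";
}
```

Note that a script running on the server would never see the fragment; it appears here only because we are splitting the URL as a browser would.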
Many of the elements within a URL are optional. You may omit the scheme, host, and port number in a URL if the URL is used in a context where these elements can be assumed. For example, if you include a URL in a link on an HTML page and leave out these elements, the browser will assume the link applies to a resource on the same machine as the link. There are two classes of URLs:
URLs that include the hostname are called absolute URLs. An example of an absolute URL is http://localhost/cgi/script.cgi.
URLs without a scheme, host, or port are called relative URLs. These can be further broken down into full and relative paths:
Relative URLs with an absolute path are sometimes referred to as full paths (even though they can also include a query string and fragment identifier). Full paths can be distinguished from URLs with relative paths because they always start with a forward slash. Note that in all these cases, the paths are virtual paths, and do not necessarily correspond to a path on the web server’s filesystem. An example of an absolute path is /index.html.
Relative URLs that begin with a character other than a forward slash are relative paths. Examples of relative paths include script.cgi and ../images/photo.jpg.
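The distinction between these classes can be sketched in a few lines of Perl. The subroutine name url_class is hypothetical, and the check is deliberately simplified: it looks only at how the string begins and does not validate the rest of the URL.

```perl
use strict;
use warnings;

# Classify a URL string by how it begins (a simplified check).
# url_class is a hypothetical helper written for this illustration.
sub url_class {
    my $url = shift;
    return "absolute URL" if $url =~ m{^[a-z][a-z0-9+.-]*://}i;   # scheme and host
    return "full path"    if $url =~ m{^/};                       # leading slash
    return "relative path";                                       # anything else
}

for my $url ( qw( http://localhost/cgi/script.cgi /index.html script.cgi ../images/photo.jpg ) ) {
    printf "%-32s %s\n", $url, url_class($url);
}
```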
Many characters must be encoded within a URL for a variety of reasons. For example, certain characters such as ?, #, and / have special meaning within URLs and will be misinterpreted unless encoded. It is possible to name a file doc#2.html on some systems, but the URL http://localhost/doc#2.html would not point to this document. It points to the fragment 2.html in a (possibly nonexistent) file named doc. We must encode the # character so the web browser and server recognize that it is part of the resource name instead.
Characters are encoded by representing them with a percent sign (%) followed by the two-digit hexadecimal value for that character based upon the ISO Latin 1 character set or ASCII character set (these character sets are identical for the values 0x00 through 0x7F, that is, the first seven bits). For example, the # symbol has a hexadecimal value of 0x23, so it is encoded as %23.
The following characters must be encoded:
Control characters: ASCII 0x00 through 0x1F plus 0x7F
Eight-bit characters: ASCII 0x80 through 0xFF
Characters given special importance within URLs: ; / ? : @ & = + $ ,
Characters often used to delimit (quote) URLs: < > # % "
Characters considered unsafe because they may have special meaning for other protocols used to transmit URLs (e.g., SMTP): { } | ^ [ ] `
Additionally, spaces should be encoded as +, although %20 is also allowed.
As you can see, most characters must be encoded; the list of allowed characters is actually much shorter:
Letters: a-z and A-Z
Digits: 0-9
The following characters: - _ . ! ~ * ' ( )
It is actually permissible and not uncommon for any of the allowed characters to also be encoded by some software. Thus, any application that decodes a URL must decode every occurrence of a percentage sign followed by any two hexadecimal digits.
The following code encodes text for URLs:
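    sub url_encode {
        my $text = shift;
        # Encode every character outside the allowed set; spaces are
        # left alone here and translated to plus signs below.
        $text =~ s/([^a-z0-9_.!~*'() -])/sprintf "%%%02X", ord($1)/gei;
        $text =~ tr/ /+/;
        return $text;
    }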
Any character not in the allowed set is replaced by a percentage sign and its two-digit hexadecimal equivalent. The three percentage signs are necessary because percentage signs indicate format codes for sprintf, and literal percentage signs must be indicated by two percentage signs. Our format code thus includes a percentage sign, %%, plus the format code for two hexadecimal digits, %02X.
Code to decode URL encoded text looks like this:
    sub url_decode {
        my $text = shift;
        $text =~ tr/+/ /;
        # Decode every percent sign followed by two hexadecimal digits.
        $text =~ s/%([a-f0-9][a-f0-9])/chr( hex( $1 ) )/gei;
        return $text;
    }
Here we first translate any plus signs to spaces. Then we scan for a percentage sign followed by two hexadecimal digits and use Perl’s chr function to convert the hexadecimal value into a character.
Neither the encoding nor the decoding operations can be safely repeated on the same text. Text encoded twice differs from text encoded once because the percentage signs introduced in the first step would themselves be encoded in the second. Likewise, you cannot encode or decode entire URLs. If you were to decode a URL, you could no longer reliably parse it, for you may have introduced characters that would be misinterpreted such as / or ?. You should always parse a URL to get the components you want before decoding them; likewise, encode components before building them into a full URL.
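The following sketch illustrates both pitfalls. It assumes the url_encode and url_decode subroutines from this section (repeated here, with the substitutions applied globally, so the example runs on its own); the script name search.cgi and the parameter names are hypothetical.

```perl
use strict;
use warnings;

# Copies of the two subroutines from this section, so the example is
# self-contained.
sub url_encode {
    my $text = shift;
    $text =~ s/([^a-z0-9_.!~*'() -])/sprintf "%%%02X", ord($1)/gei;
    $text =~ tr/ /+/;
    return $text;
}

sub url_decode {
    my $text = shift;
    $text =~ tr/+/ /;
    $text =~ s/%([a-f0-9][a-f0-9])/chr( hex( $1 ) )/gei;
    return $text;
}

# Encoding twice is not the same as encoding once: the percent signs
# and plus sign from the first pass are encoded again by the second.
my $name  = "Q&A #2";
my $once  = url_encode($name);
my $twice = url_encode($once);
print "once:  $once\n";
print "twice: $twice\n";

# Encode each component first, then build the URL.
my %param = ( topic => $name, page => "1" );
my $query = join "&",
    map { url_encode($_) . "=" . url_encode( $param{$_} ) } sort keys %param;
my $url = "http://localhost/cgi/search.cgi?$query";
print "$url\n";

# Going the other way, parse first and decode each component afterward.
my ($raw_query) = $url =~ /\?([^#]*)/;
my %decoded;
for my $pair ( split /&/, $raw_query ) {
    my ( $n, $v ) = split /=/, $pair, 2;
    $decoded{ url_decode($n) } = url_decode($v);
}
print "topic: $decoded{topic}\n";

# Decoding before parsing exposes the "&" inside the value, so the
# query string appears to contain three pieces instead of two.
my @wrong = split /&/, url_decode($raw_query);
print scalar(@wrong), " pieces when decoded too early\n";
```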
Note that it’s good to understand how a wheel works, but reinventing it would be pointless. Even though you have just seen how to encode and decode text for URLs, you shouldn’t do so yourself. The URI::URL module (actually a collection of modules), available on CPAN (see Appendix B), provides many URL-related modules and functions. One of the included modules, URI::Escape, provides the uri_escape and uri_unescape functions. Use them. The subroutines in these modules have been rigorously tested, and future versions will reflect any changes to HTTP as it evolves.[2] Using standard subroutines will also make your code much clearer to those who may have to maintain it later (this includes you).
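As a brief illustration, here is how the file name from the earlier example might be escaped with URI::Escape. This sketch assumes the module has been installed from CPAN; it is not part of the core Perl distribution. Note also that uri_escape encodes a space as %20 rather than +, and uri_unescape does not translate plus signs back into spaces.

```perl
use strict;
use warnings;
use URI::Escape;    # exports uri_escape and uri_unescape by default

my $file    = "doc#2.html";
my $escaped = uri_escape($file);    # the "#" becomes %23
my $url     = "http://localhost/$escaped";

print "$url\n";
print uri_unescape($escaped), "\n";    # back to the original file name
```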
If, despite these warnings, you still insist on writing your own encoding and decoding code, at least place it in appropriately named subroutines. Granted, some of these actions take only a line or two of code, but the code is quite cryptic, and these operations should be clearly labeled.
Now that we have a clearer understanding of URLs, let’s return to the main focus of this chapter: HTTP, the protocol that clients and servers use to communicate on the Web.
When a web browser requests a web page, it sends a request message to a web server. The message always includes a header, and sometimes it also includes a body. The web server in turn replies with a response message. This message also always includes a header, and it usually contains a body.
There are two features that are important in understanding HTTP:
It is a request/response protocol: each response is preceded by a request.
Although requests and responses each contain different information, the header/body structure is the same for both messages. The header contains meta-information (information about the message), and the body contains the content of the message.
Figure 2.2 shows an example of an HTTP transaction. Say you told your browser you wanted a document at http://localhost/index.html. The browser would connect to the machine at localhost on port 80 and send it the following message:
    GET /index.html HTTP/1.1
    Host: localhost
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/xbm, */*
    Accept-Language: en
    Connection: Keep-Alive
    User-Agent: Mozilla/4.0 (compatible; MSIE 4.5; Mac_PowerPC)
Assuming that a web server is running and the path maps to a valid document, the server would reply with the following message:
    HTTP/1.1 200 OK
    Date: Sat, 18 Mar 2000 20:35:35 GMT
    Server: Apache/1.3.9 (Unix)
    Last-Modified: Wed, 20 May 1998 14:59:42 GMT
    ETag: "74916-656-3562efde"
    Content-Length: 141
    Content-Type: text/html

    <HTML>
    <HEAD><TITLE>Sample Document</TITLE></HEAD>
    <BODY>
    <H1>Sample Document</H1>
    <P>This is a sample HTML document!</P>
    </BODY>
    </HTML>
In this example, the request includes a header but no content. The response includes both a header and HTML content, separated by a blank line (see Figure 2.3).
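The header/body structure can be taken apart with a few lines of Perl. This is only a sketch: it works on a shortened, hand-built copy of the response above rather than reading from a real network connection, and the variable names are our own.

```perl
use strict;
use warnings;

# A shortened copy of the response shown above, built by hand with
# carriage return and line feed line endings.
my $response = join "\r\n",
    "HTTP/1.1 200 OK",
    "Server: Apache/1.3.9 (Unix)",
    "Content-Length: 16",
    "Content-Type: text/html",
    "",
    "<P>Sample!</P>\r\n";

# The first blank line separates the header from the body.
my ( $header, $body ) = split /\r\n\r\n/, $response, 2;

# The first header line is the status line; the rest are header fields.
my ( $status_line, @fields ) = split /\r\n/, $header;
my ( $version, $code, $reason ) = split / /, $status_line, 3;

# Each header field is a name and a value separated by a colon.
my %field;
for my $line (@fields) {
    my ( $name, $value ) = $line =~ /^([^:]+):[ \t]*(.*)$/;
    $field{ lc $name } = $value;    # store names in lowercase for lookup
}

print "Status: $code ($reason)\n";
print "Type:   $field{'content-type'}\n";
print "Body is ", length($body), " bytes\n";
```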
If you are familiar with the format of Internet email, this header and body syntax may look familiar to you. Historically, the format of HTTP messages is based upon many of the conventions used by Internet email, as established by MIME (Multipurpose Internet Mail Extensions). Do not be tricked into thinking that HTTP and MIME headers are the same, however. The similarity extends only to certain fields, and many early similarities have changed in later versions of HTTP.
Here are the important things to know about header syntax:
The first line of the header has a unique format and special meaning. It is called a request line in requests and a status line in replies.
The remainder of the header lines contain name-value pairs. The name and value are separated by a colon and any combination of spaces and/or tabs. These lines are called header fields.
Some header fields may have multiple values. This can be represented by having multiple header fields contain the same field name and different values or by including all the values in the header field separated by a comma.
Field names are not case-sensitive; e.g., Content-Type is the same as Content-type.
Header fields don’t have to appear in any special order.
Every line in the header must be terminated by a carriage return and line feed sequence, which is often abbreviated as CRLF and represented as