15.1 Hypertext Fundamentals

The World Wide Web was the first internet technology to engage people who were not otherwise computer users. Its popularity arises from ease of use. Most people can read and understand web pages on a computer screen, and they quickly learn how to navigate the web by clicking on “hypertext links.”

The term hypertext was coined in the 1960s by computer visionary Ted Nelson, who wanted to improve the computing experience. Nelson proposed computer-aided reading that linked one piece of text to another, making it easy for a reader to pursue a specific interest by jumping between related texts. Nelson credits the idea to the “memex” proposed in a 1945 article by former presidential science adviser Vannevar Bush (1890–1974).

In this section, we study the fundamentals of hypertext and the World Wide Web by examining the following three topics:

  1. Web standards

  2. Addressing web pages

  3. Retrieving a web page

We examine the first topic here and the other two topics in subsequent subsections.

Web Standards

The hypertext technology beneath the World Wide Web was developed by physicist Tim Berners-Lee, noted in Section 13.3.1. As in email delivery, the web relies on file transfers. When a browser requests a page, the server transmits a file containing that page. It may take minutes or days for an email to produce a reply, but we receive a rapid response to a web browser’s request.

Although file exchange remains the bedrock of web operations, far more technology underlies today’s web services. The web, like email, has two sets of networking standards:

  1. Formatting standards: how to construct files to be displayed by web browsers. These standards establish different types of markup languages that incorporate sophisticated document formatting into simple text-based documents.

  2. Protocol standards: how hosts communicate in order to display hypertext and locate parts of a hypertext document.

Berners-Lee authored the initial standards, and they were adopted rapidly by the internet community. Although the IETF is responsible for most internet technical standards, many of the basic web standards are managed by the World Wide Web Consortium (W3C). Founded in 1994, the W3C is supported by an international group of academic, private, and government institutions.

Formatting: Hypertext Markup Language

The fundamental format on the web is HTML. Standard off-the-shelf HTML yields documents that contain formatted text, images, and links to other documents on the web. FIGURE 15.1 shows an example of an extremely simple HTML page.

A screenshot of a webpage titled “Welcome to Amalgamated Widget” is shown.

FIGURE 15.1 A very, very simple web page.

Courtesy of Dr. Richard Smith.

FIGURE 15.2 shows the HTML source text that produced the simple page. The “markup” aspect of HTML allows us to mark parts of the document to appear in italics, underlined, and so on. We use text elements called tags to indicate the format we want. Tags also indicate paragraphs, titles, headings, tables, lists, list elements, and so on. Formatted emails, like the one in Figure 14.3, use HTML tags to indicate the formatting.

A screenshot of HTML source code is shown.

FIGURE 15.2 HTML source text that produced Figure 15.1.

The simplest tag contains its name enclosed in angle brackets. The page text in Figure 15.2 begins with the tags <html>, <head>, and <title>.

We use pairs of tags to delimit the text included in the markup. Our sample page uses the headline “Welcome to Amalgamated Widget!” to introduce its content. In Figure 15.2, the third line of HTML assigns that headline to the document’s title. We start the title assignment with a <title> tag and end it with </title>, which we call the “end tag.” The end tag contains the tag’s name prefixed with a slash (“/”) character.

Traditional web pages begin with the <html> tag to mark the file as containing HTML. The <head> and <title> tags mark introductory data. Some applications read only a page’s head tags. The <title> tag places the page’s name on the browser’s window, but the title does not otherwise appear on the page’s display.

The page’s <body> contains the displayed text. There are numbered heading tags specifying different heading levels: h1, h2, h3, and so on. Because the <title> tag doesn’t place our headline on the page, we use the <h1> tag to display it in the page’s text. The paragraph tag <p> indicates simple text to be formatted into a readable paragraph. This text may be further embellished with boldface, underscores, and so on.
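
In outline, these tags combine like the following sketch of a complete page; the paragraph text here is illustrative rather than the exact text in Figure 15.2:

  <html>
  <head>
  <title>Welcome to Amalgamated Widget!</title>
  </head>
  <body>
  <h1>Welcome to Amalgamated Widget!</h1>
  <p>We make the finest widgets available anywhere.</p>
  </body>
  </html>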

Hypertext Links

An HTML page is hypertext because it contains hypertext links, which we will simply call links. When we click on a link, we retrieve information from a different page, either on the same web server or on a different server. We use the <a> tag in HTML to construct a link. FIGURE 15.3 illustrates a simple link. Every link has two parts:

  1. Highlighted text: When we click our mouse on the highlighted text, the web browser follows the link.

  2. Target: When the browser follows a link, it retrieves the web page indicated by the link’s target.

A hypertext link from an HTML document is shown.

FIGURE 15.3 A hypertext link from an HTML document.

The starting tag in Figure 15.3 contains the link’s target: It is the value of the “href” argument. A tag may carry one or more arguments, each one containing the argument’s name, an “=” sign, and the argument’s value.

In this case, the link’s target is another HTML file named “about.html” stored on the same web server. If a file’s link refers to a different web server, then we use a web address, as will be described in Section 15.1.1.

The starting tag and end tag enclose the highlighted text. The end tag never contains arguments, even if they appear in the starting tag.
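
Put together, a link like the one in Figure 15.3 takes roughly this form; the highlighted text shown here is illustrative:

  <a href="about.html">About Amalgamated Widget</a>

Clicking the enclosed text retrieves “about.html” from the same server.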

Most browsers rely on a mouse-based GUI to handle links. If we move our mouse over a link and click on it, the browser follows the link. If we just place the mouse over the link without clicking, we remain on the current page. Most browsers also provide a discreet status display at the bottom of the browser’s window.

When we place the mouse over a link without clicking, most browsers display the target’s file name or web address in the status display. This allows us to identify the link’s destination before we decide to click on it.

Cascading Style Sheets

Originally, authors of web pages relied on HTML tags to control both the style and the structure of web pages. Today, most sites use Cascading Style Sheets, or CSS files, to control the page’s appearance. These files use a distinctive “.css” suffix. Web pages still use the HTML tags to describe the structure and flow of information on the page. The CSS files specify header formatting, typeface selection, bullet formats, and so on.

CSS files provide an easier way to manage web page formats across multiple pages on larger sites. They also provide more effective control over page formatting; for example, CSS allows more precise specification of typefaces and sizes. Even though most browsers still recognize the older HTML formatting tags, most websites avoid using them, because CSS-based formatting is more powerful and easier to manage.
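
For example, a site might collect its formatting rules in a single style sheet file, hypothetically named “style.css,” and attach it to every page with a <link> tag in each page’s head:

  <link rel="stylesheet" href="style.css">

The style sheet itself pairs page elements with formats, along these lines:

  h1 { font-family: Helvetica, sans-serif; color: navy; }
  p { font-family: Georgia, serif; font-size: 12pt; }

Editing the one “.css” file then changes the appearance of every page that links to it.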

Hypertext Transfer Protocol

HTTP is the fundamental protocol for moving data on the web. On the client side, we have web browsers like Microsoft’s Internet Explorer, Mozilla’s Firefox, and Apple’s Safari. Because browsers are application programs that run on desktops, anyone can write their own, and many other browsers exist. All use HTTP to exchange data and HTML plus CSS to format the data for display. Different browsers also provide various levels of support for XML, the “Extensible Markup Language.”

Web server software may be extremely simple. Virus expert Fred Cohen, for example, implemented a simple HTTP server in only 80 lines of source code. Modern servers that implement SSL and scripting languages are far larger and more complex. Today, there are two widely used web server packages:

  1. Apache: a full-featured web server maintained and distributed as open source. By most measures, it is the most widely used web server on the internet.

  2. Internet Information Services (IIS): the full-featured commercial web server distributed by Microsoft. IIS supports most of the same web protocols and features as Apache, plus it supports many proprietary extensions. There are IIS websites that work correctly only when visited by Microsoft’s Internet Explorer browser.

The major differences between these servers rarely arise in purely static websites. Most of the proprietary extensions in IIS appear when the website uses scripts. We discuss scripts in Section 15.3.

Retrieving Data from Other Files or Sites

When a tag in a web page provides a link or an image, it often refers to a different file. For images, the browser uses HTTP to retrieve the image and then displays it on the page. An HTML file also may retrieve some of its HTML text from another HTML file.

Most important, an HTML file may retrieve its content either from its own server or from any other web server on the internet. Any HTML tag that accepts a file name will also accept a reference to another page or file elsewhere on the web.

For example, some websites display advertising that is provided by a separate advertising service. The service tailors the advertising to the specific user who visits the site, as well as to the site itself. The website contents come from the server being visited, but the tailored ads come from the servers for the advertising service.
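
For example, a page on the visited site might embed such an ad with an image tag whose target points at the advertising service’s server; the server and file names here are hypothetical:

  <img src="http://ads.example-adservice.com/banner123.gif">

The page’s HTML comes from the visited server, while the browser retrieves the banner image directly from the advertising service.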

15.1.1 Addressing Web Pages

We locate a web page by typing in that oddly formatted identifier called the Uniform Resource Locator (URL). FIGURE 15.4 illustrates its basic format.

An illustration depicts the format of a URL.

FIGURE 15.4 Format of a URL, which is a web page URI.

Although the web community at large calls these identifiers URLs, modern browsers actually use a more general form called the Uniform Resource Identifier (URI). Strictly speaking, a URL locates while a URI identifies a resource. This text follows the common usage and calls web page identifiers URLs.

The leftmost part of the URL indicates the scheme for interpreting the text following the colon. Typical web-oriented schemes identify the protocol being used to retrieve a page: http, https, ftp, file, and so on. The format to the right of the colon may vary from one scheme to another. Most schemes we use with web pages and resources refer to files stored on servers or hard drives.
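
For example, the URL we retrieve later in this section divides into the scheme, authority, and path name fields as follows:

  http://www.amawig.com/about.html

  Scheme: http
  Authority: www.amawig.com
  Path name: /about.html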

Email URLs

Although web links usually lead to documents or images, some links activate particular applications. An email URL, for example, contains an email address. When we follow an email URL, the browser automatically starts an email client and passes it the email address in the URL. Web pages often provide a “Contact Us” link that contains an email URL.

FIGURE 15.5 describes the format of an email URL, which follows the format of a URI. To the left of the colon is “mailto,” the scheme’s name. To the right is an email address. The @ sign separates the user ID from the email server’s domain name.

An illustration depicts the format of the email URL “mailto:kevin@eifsc.com,” with its parts labeled as follows: “mailto” is the scheme of this URI, “kevin” is the user ID, and “eifsc.com” is the email server’s domain name.

FIGURE 15.5 Format of an email URL.

Hosts and Authorities

Returning to the more common URL format in Figure 15.4, the authority part of the URL identifies the server containing the resource we want. In most cases, the authority simply contains the domain name of the server. The authority field may contain additional fields as shown in FIGURE 15.6. Email URLs are the only ones that routinely use the @ sign or anything to the left of it. The port number (shown on the right following a colon) appears occasionally.

An illustration depicts detailed format of the URL authority field.

FIGURE 15.6 Detailed format of the URL authority field.

The URL’s scheme usually identifies the network service being used with the resource. We identified specific TCP or UDP port numbers associated with well-known services. (See Section 11.3.) When a URL specifies “http,” the client retrieves the resource by opening a connection to port 80 on the server. Some resources may be available through nonstandard port numbers. For example, a web server might provide one set of pages through port 80 and a different set through port 8080. If the URL doesn’t use the default port, it must include the correct port number in its authority field. Figure 15.6 shows the correct format in which the nonstandard port number follows the domain name.
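
For example, a URL for the second set of pages in that case would read:

  http://www.amawig.com:8080/index.html

The browser then opens its connection to port 8080 on the server instead of port 80.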

The user ID and password information are very, very rarely used when retrieving web documents. It is obviously a security risk to embed a reusable password in a URL, because we’ll disclose the password whenever we share that URL. Moreover, URLs often travel in plaintext across the internet. Some browsers, like Firefox, display a warning if a user includes that information in a URL.

Many browsers also allow us to enter a numeric IP address instead of a domain name. For example, many commercial gateways provide web-based administration if we direct the browser to the gateway’s IP address. To reach such a web server, we type its IP address in place of the domain name, for example: http://192.168.1.1. (A numeric IPv6 address must be enclosed in square brackets to keep its colons from being read as a port number.)

Path Name

The authority field ends with a slash and is followed by the file’s path name. Different file systems use different path name delimiters (Table 3.1). Most web servers, however, expect to see the slash “/” character as the path name delimiter, even if the server runs atop Windows or MacOS. The server software interprets the path name and translates it as necessary before passing it to the file system.

Many file systems, notably those on Unix-based systems, treat path and file names as case sensitive. However, not all websites have case-sensitive paths. Many internet identifiers—notably domain names—ignore case when interpreting the text. Some web servers apply this to URLs. To do so, the server adjusts the path name before passing it to the file system.

Default Web Pages

When we type in a web page location, we want to type as little as possible. We often omit the “http://” prefix of the URL, and most browsers fill it in for us. People often type nothing more than the website’s domain name into the browser, and the browser locates the site’s home page.

When we ask a web server for a page and we don’t actually provide a page name, the server tries to guess the page’s name. Typical guesses include file names such as “index.html,” “index.htm,” and “default.htm.”

For example, if we type “http://www.amawig.com” into a web browser, the server will eventually retrieve the home page “index.html” because that file is stored in the site’s home directory.

Most server software may be configured to look for additional default file names. In dynamic servers, the default names may select scripts to execute (see Section 15.3).

15.1.2 Retrieving a Static Web Page

The simplest websites consist of text files formatted in HTML, possibly combined with image files. When a client requests a page, the server retrieves the file containing that page and performs no additional processing. Many modern sites avoid storing static HTML; instead, they construct the page dynamically, as described in Section 15.3.

When Alice visits a web page, she provides a URL. In a static website, the server converts the URL into a file name reference and retrieves the associated file. If the file contains HTML, the browser interprets it and displays it. FIGURE 15.7 summarizes the process.

An illustration depicts retrieving a webpage using HTTP.

FIGURE 15.7 Retrieving a web page using HTTP.

The process involves the following four steps:

  1. Alice enters a URL into a browser or selects a link to visit. We will visit the page “http://www.amawig.com/about.html.”

  2. The browser calls the domain name resolver to retrieve the IP address for www.amawig.com. (See Section 12.3.2.)

  3. The browser opens a connection to the server, using the IP address retrieved by the resolver. The browser connects to the server’s port 80, unless a different port number appeared in the URL. The browser sends an HTTP GET command that contains the path name listed in the URL: about.html.

  4. The server retrieves the file about.html and transmits its contents across the same connection, then it closes the connection. The browser displays the page.

We used Wireshark to list the packets exchanged during a TCP connection. (See Figure 12.6.) That particular example shows another HTTP connection. Instead of retrieving a file named “about.html,” the Wireshark example retrieves the file “ncsi.txt.” Both the GET command and the web page file happen to be small enough to fit in individual packets.

The server’s response to a GET command begins with a status message: one line of text that announces whether the GET succeeded or not. If the GET succeeded, then the file’s contents follow the line of text.
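
Putting the steps together, the messages exchanged for our example page might look like the following sketch; the headers are abbreviated, the page contents are illustrative, and the “Connection: close” header asks the server to close the connection after the transfer:

  GET /about.html HTTP/1.1
  Host: www.amawig.com
  Connection: close

  HTTP/1.1 200 OK
  Content-Type: text/html

  <html> ...contents of about.html... </html>

The first line of the response is the status message: Code 200 reports that the GET succeeded, while the familiar code 404 reports that the requested file was not found.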

Building a Page from Multiple Files

When the HTML file refers to other files, the browser must retrieve the contents in order to display the entire page. To retrieve the other files, the browser opens HTTP connections to the appropriate servers and requests the required files.

Originally, each file transfer required a separate connection. As web server traffic grew, however, sites wanted higher efficiency. When a typical web page refers to another file, that file often resides on the same server, and it wastes time for the client to close a connection that it will immediately reopen to transfer another file. The latest HTTP systems may transfer multiple files through a single connection.

Web Servers and Statelessness

Basic HTTP is a stateless protocol: The basic server software does not keep track of the client’s behavior from one connection to the next. There is no notion of a “session” in HTTP. This makes it relatively easy to implement an HTTP server. This also makes it difficult to implement web applications that keep track of a user’s behavior across several connections, like “shopping carts.” The applications use special techniques we examine in Section 15.3.2.

A static website consists of text files in HTML format plus images stored in standard formats recognized by popular browsers. Common image formats include the Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG) format, and Portable Network Graphics (PNG) format. Most browsers also will retrieve files in raw text format, with the “.txt” file name suffix, and display them in a simple fixed-width text format.

Web Directories and Search Engines

When the World Wide Web was new, the biggest challenge was to find things. In 1993, for example, Bob’s Uncle Joe knew that someone had produced an episode guide for The Twilight Zone TV series and had posted the file on a website, but there was no way to locate the file without a URL or a helpful link through some other site. In fact, this was how many people initially used personal web pages; they linked to interesting pages in other sites. There was no systematic way to locate information.

In the mid-1990s, two strategies emerged for locating information: directories and search engines. Directories provided a structured listing of websites based on their perceived subject matter. An informal directory developed by two graduate students in 1994 turned into the Yahoo! directory, the most prominent early directory. A search engine searches every page it can find on the web and constructs an index of pages that appear relevant to the search. Researchers at DEC’s Western Research Laboratory developed AltaVista in 1995; it was the first effective search engine.

Directories tend to rely on referrals. If Bob wants his web page to appear in a directory, he contacts the directory’s owner and provides the URL and description of his site. Some directories automatically add sites. Others assess sites individually to decide if they are appropriate to list.

Directories also may develop structural problems. A well-organized directory provides a list of the “best” sites in every category. Each category’s list should be short enough for a typical user to study. In practice, this requires an editor to periodically reorganize different topics as they fill up with individual entries. A large, general-purpose directory may have millions of entries and requires an extremely large team of editors.

Web Crawlers

Search engines, on the other hand, collect information about web content automatically. They use special software that systematically “crawls” the web, following the links it finds in web pages. This software typically is called a web crawler, a spider, or even a robot. An individual site may control how search engines treat its pages by installing a special file called “robots.txt” in the site’s top-level directory. If the file is present, a well-behaved crawler examines its contents for instructions before crawling the site.
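
For example, a site could steer crawlers away from part of its contents with a robots.txt file like this sketch; the directory name is hypothetical:

  User-agent: *
  Disallow: /private/

The first line addresses all crawlers, and the second asks them not to crawl anything under the “/private/” directory. Compliance is voluntary; the file itself does not enforce anything.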

Crime via Search Engine

Search engines pose interesting security problems. If a site posts sensitive information by accident and a page links to it, then anyone can find the information using the search engine. Moreover, a clever search can locate sensitive or malicious information anywhere on the internet.

For example, sensitive personal information occasionally appears, including such details as credit card numbers. Card numbers follow a simple format, and they fall into particular numeric ranges depending on the issuing organization. In the past, search engines have allowed searches on those ranges. Such searches might locate pages that were made visible accidentally by an e-commerce site. In other cases, such pages contained information collected by a fraudster.

Although traditional search engines like Google and Bing locate and index documents and other digital media on the internet, search engines like “Shodan” locate other computing resources, like IoT devices, routers, SCADA systems, and even traffic light controls. Shodan is particularly well known for indexing household and private security webcams. If a webcam’s owner fails to password-protect the webcam or to choose a hard-to-guess password, then Shodan users can watch the video feed and may be able to control the camera remotely.
