Now that we have covered the development environment, we can get to the fun material, that is, programming our first bot. To begin with, we will develop an HTTP package, which we can use in our bot applications to handle HTTP requests and responses.
Now that we have discussed our development environment, coding standards, and how basic HTTP requests and responses work, let's create an HTTP request class that actually does something useful. Again, if you are not familiar with developing PHP classes and using PHP objects, you will want to research these topics before reading this section. While this book is a starter book for developing PHP bots, I want to teach you the correct way to develop bots, which means using the power and reusability of classes and objects, also known as OOP (Object-Oriented Programming).
When we use well-designed classes, we can use them as objects in the project we are currently developing, and future projects that require the same type of functionality and logic can reuse them as well. If you desire to be a productive and successful programmer, you will eventually need to accept this point of view, if you haven't already.
In this section, I am going to help you develop a basic HTTP request class which we will be able to use in all of our bots that need this type of tool. Later, we will create an HTTP response class that will easily allow our request class to return objects instead of arrays of information, which will be very useful.
Before you start developing a class, you should always spend some time designing it first. This way you won't have to think as much while you are writing the code. So, let's spend a little time thinking about the design of our HTTP request class.
Here are some requirements we will need in our HTTP request class:
These are some good requirements for a basic HTTP request class. Obviously, we could go much further with the design of our class and add other advanced options and methods such as a debugging mode and logging. However, for the projects we are developing in this book, these requirements will serve us just fine.
Before we start developing the HTTP request class, let's set up a project directory structure through the following steps. You will be able to use this directory structure for all the projects in this book.
1. Set up the project directory structure as follows:

'-- project_directory
    |-- lib
    |   '-- HTTP
    |       |-- Request.php
    |       '-- Response.php
    |-- 01_command_line_app.php (from command line application example)
    '-- 02_http.php

2. Create the HTTP directory in the project_directory/lib directory.
3. Create the Request.php file in the HTTP directory.
4. Add the following code to the Request class located at project_directory/lib/HTTP/:

<?php
namespace HTTP;

/**
 * HTTP Request class - execute HTTP get and head requests
 *
 * @package HTTP
 */
class Request
{
First, in our Request.php file we set the namespace to HTTP. This is a simple way of telling PHP that we want the Request class in the HTTP container. If you are not familiar with namespaces, you can research the topic further. However, for now, you can think of a namespace as a container. Every class, function, method, or constant in the HTTP namespace (or container) should have something to do with HTTP logic. So, for example, if we had the namespace Database, everything in that namespace would include logic and methods for database functionality.
Next, we leave some simple, yet helpful, comments that will allow any developer to quickly determine what the class is used for. And finally, we declare our class name as Request. One important thing to note here is that this code will not work on any PHP version prior to PHP 5.3, because PHP didn't include namespace syntax until PHP 5.3.
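To make the container idea concrete, here is a minimal sketch that is not part of the book's code. It declares two classes with the same short name in two namespaces in a single file. The Database namespace and the name() method are assumptions made purely for illustration, and placing several namespaces in one file is done here only to keep the demonstration short.

```php
<?php
// Sketch only: two namespaces in one file to show that class names do not collide.
namespace HTTP;

class Request
{
    public static function name()
    {
        return __CLASS__; // class name including its namespace
    }
}

namespace Database; // hypothetical second namespace, as mentioned in the text

class Request
{
    public static function name()
    {
        return __CLASS__;
    }
}

// We are still inside the Database namespace at this point.
echo Request::name(), PHP_EOL;       // prints Database\Request
echo \HTTP\Request::name(), PHP_EOL; // prints HTTP\Request (fully qualified name)
```

The leading backslash in \HTTP\Request tells PHP to resolve the name from the global scope rather than from the current namespace.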
The first class method we are going to add to our Request class is a method that formats a timeout value if one has been supplied, or returns a default timeout value if none has been supplied. This method will make more sense later on when we develop the request methods. Here are the next lines of code in our Request class, used for the __formatTimeout() method:
Insert the following snippet of code to our Request class located at project_directory/lib/HTTP/:
/**
 * Format timeout in seconds, if no timeout use default timeout
 *
 * @param int|float $timeout (seconds)
 * @return int|float
 */
private static function __formatTimeout($timeout = 0)
{
    $timeout = (float)$timeout; // format timeout value

    if($timeout < 0.1)
    {
        $timeout = 60; // default timeout
    }

    return $timeout;
}
As you can see, this is a very simple method that takes an optional parameter called timeout and formats the timeout value so that it can be used properly in our request methods. If no timeout value is passed to the method, or the value is smaller than 0.1, it will return the default timeout value (60 seconds).
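The behavior is easy to check in isolation. The following sketch copies the same logic into a standalone function; the name format_timeout() is an assumption used only for this illustration, since the class method itself is private.

```php
<?php
// Standalone copy of the timeout logic, for experimentation only.
function format_timeout($timeout = 0)
{
    $timeout = (float)$timeout; // cast any input to a float

    if ($timeout < 0.1) {
        $timeout = 60; // fall back to the default timeout
    }

    return $timeout;
}

var_dump(format_timeout());      // int(60): no value passed
var_dump(format_timeout(2.5));   // float(2.5): valid value kept
var_dump(format_timeout('abc')); // int(60): non-numeric string casts to 0.0
var_dump(format_timeout(-5));    // int(60): negative values are rejected
```

Note that the default is returned as an integer while supplied values come back as floats, which matches the int|float return type documented on the method.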
Next, we need to create a method that can parse the raw HTTP response that we receive from our request methods. Once the raw HTTP response has been parsed, we can use the response parts with our HTTP Response class (which we'll build later) to form a usable response object. Here is the parse response method:
Insert the following snippet of code to our Request class:
/**
 * Parse HTTP response
 *
 * @param string $body
 * @param array $header
 * @return \HTTP\Response
 */
private static function __parseResponse($body, $header)
{
    $status_code = 0;
    $content_type = '';

    if(is_array($header) && count($header) > 0)
    {
        foreach($header as $v)
        {
            // ex: HTTP/1.x XYZ Message
            if(substr($v, 0, 4) == 'HTTP' && strpos($v, ' ') !== false)
            {
                $status_code = (int)substr($v, strpos($v, ' '), 4); // parse status code
            }
            // ex: Content-Type: *; charset=*
            else if(strncasecmp($v, 'Content-Type:', 13) === 0)
            {
                $content_type = $v;
            }
        }
    }

    return new \HTTP\Response($status_code, $content_type, $body, $header);
}
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
This parse method takes two arguments: the body (string) and the header (array). In the method, we first initialize the variables for the status code and content type. The next part of the method loops through the HTTP header lines and parses out the status code (HTTP response code) and the content type (MIME type). These variables are used when creating the HTTP Response object, which is returned by the method. Finally, we instantiate the HTTP Response object, which we build in the next section, and return it. This method will make more sense once we have finished creating the entire class, but we needed to include it first so that our request methods can utilize it.
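The header loop can be tried on its own before the Response class exists. The sketch below repeats the same string operations in a standalone function; parse_status_and_type() and the sample header lines are assumptions made for illustration only.

```php
<?php
// Standalone version of the header loop; returns [status code, content type line].
function parse_status_and_type(array $header)
{
    $status_code = 0;
    $content_type = '';

    foreach ($header as $line) {
        // Status line, for example "HTTP/1.0 200 OK"
        if (substr($line, 0, 4) == 'HTTP' && strpos($line, ' ') !== false) {
            // Take 4 characters starting at the first space (" 200") and cast to int
            $status_code = (int)substr($line, strpos($line, ' '), 4);
        } elseif (strncasecmp($line, 'Content-Type:', 13) === 0) {
            $content_type = $line; // keep the whole header line
        }
    }

    return [$status_code, $content_type];
}

$sample = [
    'HTTP/1.0 200 OK',
    'Server: gws',
    'Content-Type: text/html; charset=ISO-8859-1',
];

list($status, $type) = parse_status_and_type($sample);
echo $status, PHP_EOL; // prints 200
echo $type, PHP_EOL;   // prints Content-Type: text/html; charset=ISO-8859-1
```

Because the loop overwrites the status code each time it meets a status line, a header array that contains several status lines (as can happen when redirects are followed) ends up with the last one.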
Now, let's build the first request method of our Request class: the get() request method. The get() method will be the most common request method that we will use in our bots. Here is the code for the get() method:
Insert the following snippet of code to our Request class:
/**
 * Execute HTTP GET request
 *
 * @param string $url
 * @param int|float $timeout (seconds)
 * @return \HTTP\Response
 */
public static function get($url, $timeout = 0)
{
    $context = stream_context_create();
    stream_context_set_option($context, 'http', 'timeout', self::__formatTimeout($timeout));

    $http_response_header = NULL; // allow updating

    $res_body = file_get_contents($url, false, $context);

    return self::__parseResponse($res_body, $http_response_header);
}
The preceding method is very simple. Through its parameters, we tell the get() method which URL to send the GET request to, and we can pass an optional timeout value. Our get() method then creates the required context for the HTTP GET request, sends the request, and creates and returns an \HTTP\Response object that we can use.
Next, we can build the head() method. An HTTP HEAD request works like a GET request, except that the response does not include a message body. This request can be useful for simple tasks, such as pinging an HTTP service on a web server.
Here is the code for the HEAD request. Insert the following snippet of code to our Request class located at project_directory/lib/HTTP/:
/**
 * Execute HTTP HEAD request
 *
 * @param string $url
 * @param int|float $timeout
 * @return \HTTP\Response
 */
public static function head($url, $timeout = 0)
{
    $context = stream_context_create();
    stream_context_set_option($context, [
        'http' => [
            'method' => 'HEAD',
            'timeout' => self::__formatTimeout($timeout)
        ]
    ]);

    $http_response_header = NULL; // allow updating

    $res_body = file_get_contents($url, false, $context);

    return self::__parseResponse($res_body, $http_response_header);
}
The preceding method is very straightforward. First, we accept the URL and timeout values. Then, we configure a custom stream context, which is used when we send the HTTP HEAD request. Then, we use the built-in PHP function file_get_contents() to send the HTTP HEAD request. Finally, we return the \HTTP\Response object, which is created in our __parseResponse() method.
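If you want to confirm what a stream context contains before sending anything over the network, PHP can report the options back. The following is a small sketch that builds a context with the same two options and prints them; the literal timeout of 60 is simply the default value from earlier, used here as an assumption.

```php
<?php
// Build a context with the HTTP options used for a HEAD request.
$context = stream_context_create([
    'http' => [
        'method'  => 'HEAD', // request method sent by the http:// wrapper
        'timeout' => 60,     // read timeout in seconds
    ],
]);

// Ask PHP which options are stored on the context (no request is sent).
$options = stream_context_get_options($context);

echo $options['http']['method'], PHP_EOL;  // prints HEAD
echo $options['http']['timeout'], PHP_EOL; // prints 60
```

This kind of check is handy when a request behaves unexpectedly and you want to rule out a misconfigured context.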
Now, we need to create an HTTP Response class that can be used to easily transform our HTTP response data into a usable object. This way, once we receive an HTTP response in our Request class, we simply pass the response data to our Response object and we don't have to think about all the details and methods required to use the HTTP response data in our applications.
Create a file called Response.php in the HTTP package directory and add the code available from the book source code download file project_directory/lib/HTTP/Response.php at Packt Publishing's website.
As you can see, it takes quite a few lines of code to create our \HTTP\Response class; however, the class is very simple.
First, we define HTTP response status codes and status code messages. Status codes are used by web servers to denote the response status. For example, if an HTTP GET request is sent to a web server and the web server returns an HTTP response status code of 200, we know that the request has been handled successfully. You can see by looking at the Response class status codes and status code messages that there are a variety of status codes and messages with which the web server can respond.
Next, in the __construct() method of the Response class, we accept the status code, type (content type), body, and header. These are the only parameters needed to instantiate and initialize the Response object so that it works properly for us.
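For readers who do not have the download at hand, the following heavily trimmed sketch shows the general shape such a class could take. It is not the book's Response class: the class name SketchResponse, the short status message table, and the rule that any 2xx status counts as success are all assumptions made for illustration. Only the four constructor arguments, the success property, and getStatusCode() are taken from the text.

```php
<?php
// Hypothetical, minimal stand-in for a response class (not the book's code).
class SketchResponse
{
    // Small sample of status messages; a real table would be much longer.
    private static $messages = [
        200 => 'OK',
        301 => 'Moved Permanently',
        404 => 'Not Found',
        500 => 'Internal Server Error',
    ];

    private $status;
    private $type;
    private $body;
    private $header;

    public $success = false;

    public function __construct($status_code, $type, $body, $header)
    {
        $this->status = (int)$status_code;
        $this->type   = $type;
        $this->body   = $body;
        $this->header = $header;

        // Assumption: treat any 2xx status code as a successful response.
        $this->success = ($this->status >= 200 && $this->status < 300);
    }

    public function getStatusCode()
    {
        return $this->status;
    }

    public function getStatusMessage()
    {
        return isset(self::$messages[$this->status])
            ? self::$messages[$this->status]
            : 'Unknown';
    }
}

$response = new SketchResponse(404, 'Content-Type: text/html', '', ['HTTP/1.1 404 Not Found']);
echo $response->getStatusCode(), ' ', $response->getStatusMessage(), PHP_EOL; // prints 404 Not Found
```

Keeping the status table and the success rule inside the class means calling code only has to read the success property, which is the usage pattern shown later in this section.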
Now that we have our \HTTP\Request and \HTTP\Response classes completed, let's take a look at why we are developing classes and using objects (Object-Oriented Programming or OOP) instead of using procedural lines of code (Procedure-Oriented Programming or POP). To illustrate this point, I am going to have you execute your first HTTP HEAD request using our HTTP classes.
Create a file called 02_http.php in your project directory, where it will have access to the /project_directory/lib/HTTP directory. Add the following code to the 02_http.php file located at /project_directory/:
<?php
/**
 * Example HTTP HEAD request
 */

// include our classes
require_once './lib/HTTP/Request.php';
require_once './lib/HTTP/Response.php';

// execute example HTTP HEAD request
$response = \HTTP\Request::head('http://www.google.com');

// print out HTTP response (\HTTP\Response object)
echo '<pre>' . print_r($response, true) . '</pre>';
In this code, we first include the Request and Response classes that we have developed. Next, we set the response variable to the response (\HTTP\Response) object that is created by the \HTTP\Request::head() method. Finally, we print the HTTP Response object for illustration, or debugging/testing purposes. If you execute this code in a web browser, you will see something like the following:
HTTP\Response Object
(
    [__body:HTTP\Response:private] =>
    [__encoding:HTTP\Response:private] => ISO-8859-1
    [__header:HTTP\Response:private] => Array
        (
            [0] => HTTP/1.0 200 OK
            [1] => Date: (date/time) GMT
            [2] => Expires: -1
            [3] => Cache-Control: private, max-age=0
            [4] => Content-Type: text/html; charset=ISO-8859-1
            [5] => Set-Cookie: ***
            [6] => Set-Cookie: ***
            [7] => P3P: ***
            [8] => Server: gws
            [9] => X-XSS-Protection: 1; mode=block
            [10] => X-Frame-Options: SAMEORIGIN
        )
    [__mime:HTTP\Response:private] => text/html
    [__status:HTTP\Response:private] => 200
    [__status_message:HTTP\Response:private] => OK
    [success] => 1
)
Success! We have successfully executed an HTTP HEAD request, received a response, parsed it, created an HTTP Response object, and printed the object. Now, we could easily use the object for more useful things; for example, change the 02_http.php file located at /project_directory/ as follows:
// display response status
if($response->success)
{
    echo 'Successful request <br />';
}
else
{
    echo 'Error: request failed, status code: '
        . $response->getStatusCode() . '<br />'; // prints status code
}
If we were using procedural programming (POP) instead of classes and objects (OOP), it would be much more difficult to do this. In addition, using classes with namespaces makes it easy for us to use them in other application frameworks without class naming conflicts. This approach to programming also makes it much easier to determine what type of logic and purpose a class is designed for. For example, it would be easy for another programmer to conclude that the \HTTP\Request class is used to generate HTTP requests.
Now that we have our HTTP package (defined by the namespace HTTP) completed, we can easily use it for other projects and applications. Sometimes, especially when using large packages or library files, it is hard to remember or find out what exactly needs to take place in order for us to get a package ready for use in our own software applications. A package might require extensive configuration settings, class autoloading, common file loading, external package or library files, and more.
A simple solution to this problem is using a bootstrap file. A bootstrap file can be used to initialize everything the package requires to load and initialize properly. Our HTTP package doesn't require much file loading or any configuration settings, but for the sake of example, let's create a simple bootstrap file for our HTTP package:
1. Create a file called bootstrap.php in the HTTP package directory.
2. Add the following code to the bootstrap.php file located at project_directory/lib/HTTP/:

<?php
namespace HTTP;

/**
 * Bootstrap file
 *
 * @package HTTP
 */

// load class files
require_once './lib/HTTP/Request.php';
require_once './lib/HTTP/Response.php';
The preceding code is a very simple bootstrap file. We are simply loading the classes required in our HTTP package. However, it will make the use of our HTTP package even easier!
3. In the 02_http.php file, modify the code to use our bootstrap.php file instead of loading the class files manually:

<?php
/**
 * Example HTTP GET request
 */

// load HTTP package with bootstrap file
require_once './lib/HTTP/bootstrap.php';

// execute example HTTP GET request
$response = \HTTP\Request::get('http://www.google.com');

// display response status
if($response->success)
{
    echo 'Successful request <br />';
}
else
{
    echo 'Error: request failed, status code: '
        . $response->getStatusCode() . '<br />'; // prints status code
}

// print out HTTP response (\HTTP\Response object)
echo '<pre>' . print_r($response, true) . '</pre>';
Although this is an extremely simple example of how a bootstrap file can be used to make the use of a package easier, it is still beneficial to any programmer who uses our code. Also, in later sections of this book we will be using bootstrap files when we develop our bot package.
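One thing to be aware of is that paths such as './lib/HTTP/Request.php' are resolved relative to the current working directory, so the bootstrap file works only when the script is run from the project directory. A common alternative is to let PHP locate class files from their namespaced names. The sketch below is one possible approach and not the book's method: the class_to_path() helper and the lib directory location are assumptions for illustration.

```php
<?php
// Hypothetical helper: map a namespaced class name to a file under a lib directory.
function class_to_path($class, $lib_dir)
{
    // "\HTTP\Request" becomes "HTTP/Request"
    $relative = str_replace('\\', '/', ltrim($class, '\\'));

    return rtrim($lib_dir, '/') . '/' . $relative . '.php';
}

// Register an autoloader that looks for classes under <this file's directory>/lib.
spl_autoload_register(function ($class) {
    $file = class_to_path($class, __DIR__ . '/lib');

    if (is_file($file)) {
        require_once $file; // load the class file only if it exists
    }
});

echo class_to_path('HTTP\\Request', '/project_directory/lib'), PHP_EOL;
// prints /project_directory/lib/HTTP/Request.php
```

With an autoloader like this in place, the directory layout used in this book (one directory per namespace, one file per class) is enough for PHP to find \HTTP\Request and \HTTP\Response on first use.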
At this point in the book, you should be aware of and comfortable with HTTP requests and responses, how to develop HTTP packages (covered earlier in the book), and why we use bootstrap files.
With the knowledge you have gained, we are now ready to develop our first bot, which will be a simple bot that gathers data (documents) based on a list of URLs and datasets (field and field values) that we will require.
First, let's start by creating our bot package directory. So, create a directory called WebBot so that the files in our project_directory/lib directory look like the following:

'-- project_directory
    |-- lib
    |   |-- HTTP (our existing HTTP package)
    |   |   '-- (HTTP package files here)
    |   '-- WebBot
    |       |-- bootstrap.php
    |       |-- Document.php
    |       '-- WebBot.php
    |-- (our other files)
    '-- 03_webbot.php
As you can see, we have a very clean and simple directory and file structure that any programmer should be able to easily follow and understand.
Next, open the WebBot.php file and add the code from the book source code download file project_directory/lib/WebBot/WebBot.php at Packt Publishing's website.
In our WebBot class, we first use the __construct() method to pass the array of URLs (or documents) we want to fetch, and the array of document fields that is used to define the datasets and regular expression patterns. Regular expression patterns are used to populate the dataset values (or document field values). If you are unfamiliar with regular expressions, now would be a good time to study them. Then, in the __construct() method, we verify whether there are URLs to fetch or not. If there are none, we set an error message stating this problem.
Next, we use the __formatUrl() method to properly format the URLs from which we fetch data. This method will also set the correct protocol: either HTTP or HTTPS (Hypertext Transfer Protocol Secure). If the protocol is already set for the URL, for example http://www.[dom].com, we skip setting the protocol. Also, if the class configuration setting conf_force_https is set to true, we force the HTTPS protocol, again unless the protocol is already set for the URL.
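The protocol rules described above can be sketched in a few lines. This is an illustration only and not the code from the download: the function name format_url(), its parameters, and the regular expression used to detect an existing protocol are assumptions.

```php
<?php
// Hypothetical sketch of the URL formatting rules described in the text.
function format_url($url, $force_https = false)
{
    $url = trim($url);

    // Protocol already present (http:// or https://): leave the URL untouched.
    if (preg_match('#^https?://#i', $url)) {
        return $url;
    }

    // Otherwise prepend the protocol chosen by the configuration flag.
    return ($force_https ? 'https://' : 'http://') . $url;
}

echo format_url('www.google.com'), PHP_EOL;              // prints http://www.google.com
echo format_url('www.google.com', true), PHP_EOL;        // prints https://www.google.com
echo format_url('http://www.google.com', true), PHP_EOL; // prints http://www.google.com
```

The last call shows the rule from the text: forcing HTTPS does not override a protocol that is already part of the URL.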
We then use the execute() method to fetch data for each URL, create the Document objects and add them to the array of documents, and track document statistics. This method also implements fetch delay logic that will delay each fetch by x number of seconds if set in the class configuration setting conf_delay_between_fetches. We also include logic that only allows distinct URL fetches, meaning that if we have already fetched data for a URL we won't fetch it again; this eliminates duplicate URL data fetches. The Document object is used as a container for the URL data, and we can use the Document object to access the URL data, the data fields, and their corresponding data field values.
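The distinct-URL and delay logic can be sketched independently of any HTTP code. In the sketch below, fetch_all() and the $fetch callback are assumptions introduced for illustration; the callback stands in for the real HTTP request so the loop can be run without network access.

```php
<?php
// Hypothetical sketch: fetch each distinct URL once, pausing between fetches.
function fetch_all(array $urls, $fetch, $delay_seconds = 0)
{
    $documents = [];
    $seen = []; // URLs that have already been fetched

    foreach ($urls as $id => $url) {
        if (isset($seen[$url])) {
            continue; // duplicate URL: skip it
        }

        // Pause before every fetch except the first one.
        if ($delay_seconds > 0 && count($seen) > 0) {
            sleep((int)$delay_seconds);
        }

        $seen[$url] = true;
        $documents[$id] = call_user_func($fetch, $url);
    }

    return $documents;
}

// Stand-in for a real HTTP request.
$fake_fetch = function ($url) {
    return 'body of ' . $url;
};

$docs = fetch_all([
    'a' => 'http://example.com/',
    'b' => 'http://example.com/', // duplicate of "a"
    'c' => 'http://example.org/',
], $fake_fetch);

print_r(array_keys($docs)); // contains the keys a and c only
```

Passing the fetcher as a callback is a design choice made for this sketch so that the loop can be tested in isolation.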
In the execute() method, you can see that we perform an \HTTP\Request::get() request using the URL and our default timeout value, which is set with the class configuration setting conf_default_timeout. We then pass the \HTTP\Response object that is returned by the \HTTP\Request::get() method to the Document object. Then, the Document object uses the data from the \HTTP\Response object to build the document data.
Finally, we include the getDocuments() method, which simply returns all the Document objects in an array that we can use for our own purposes as we desire.
Next, we need to create a class called Document that can be used to store document data and field names with their values. To do this we will carry out the following steps:
1. Move from the WebBot class to the Document class by opening the Document.php file.
2. Add the code from the book source code download file project_directory/lib/WebBot/Document.php at Packt Publishing's website.
Our Document class accepts the \HTTP\Response object that is set in the WebBot class's execute() method, along with the document fields and document ID.
In the Document __construct() method, we set our class properties: the HTTP Response object, the fields (and regular expression patterns), the document ID, and the URL that we use to fetch the HTTP response.
We then check whether the HTTP response was successful (status code 200), and if it isn't, we set the error with the status code and message.
Finally, we call the __setFields() method.
The __setFields() method parses out and sets the field values from the HTTP response body. For example, if in our fields we have a title field defined as $fields = ['title' => '<title>(.*)</title>'];, the __setFields() method will add the title field and pull all values inside the <title>*</title> tags from the HTML response body. So, if there were two title tags in the URL data, the __setFields() method would add the field and its values to the document as follows:
['title'] => [ 0 => 'title x', 1 => 'title y' ]
If we have the WebBot class configuration variable conf_include_document_field_raw_values set to true, the method will also add the raw values (it will include the tags or other strings as defined in the field's regular expression patterns) as a separate element, for example:
['title'] => [ 0 => 'title x', 1 => 'title y', 'raw' => [ 0 => '<title>title x</title>', 1 => '<title>title y</title>' ] ]
The preceding code is very useful when we want to extract specific data (or field values) from URL data.
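Field extraction of this kind can be sketched with preg_match_all(). The following is an illustration and not the code from the download: the function name extract_fields(), the '#' pattern delimiter, and the case-insensitive flag are assumptions.

```php
<?php
// Hypothetical sketch of field extraction from a response body.
function extract_fields($body, array $fields, $include_raw = false)
{
    $result = [];

    foreach ($fields as $id => $pattern) {
        $result[$id] = [];

        // '#' is used as the delimiter so that '/' in closing tags needs no escaping.
        if (preg_match_all('#' . $pattern . '#i', $body, $matches)) {
            $result[$id] = $matches[1]; // values captured by the first group

            if ($include_raw) {
                $result[$id]['raw'] = $matches[0]; // full matches, tags included
            }
        }
    }

    return $result;
}

$body = "<title>title x</title>\n<title>title y</title>";
$fields = ['title' => '<title>(.*)</title>'];

print_r(extract_fields($body, $fields, true));
```

Here the two title tags sit on separate lines, and since the dot does not match a newline by default, each tag is matched on its own, giving the values 'title x' and 'title y' plus a 'raw' element holding the two full tags.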
To conclude the Document class, we have two more methods as follows:
getFields(): This method simply returns the fields and field values
getHttpResponse(): This method can be used to get the \HTTP\Response object that was originally set by the WebBot execute() method
This will allow us to perform logical requests to internal objects if we wish.
Now we will create a bootstrap.php file (at project_directory/lib/WebBot/) to load the HTTP package and our WebBot package classes, and set our WebBot class configuration settings:
<?php
namespace WebBot;

/**
 * Bootstrap file
 *
 * @package WebBot
 */

// load our HTTP package
require_once './lib/HTTP/bootstrap.php';

// load our WebBot package classes
require_once './lib/WebBot/Document.php';
require_once './lib/WebBot/WebBot.php';

// set unlimited execution time
set_time_limit(0);

// set default timeout to 30 seconds
\WebBot\WebBot::$conf_default_timeout = 30;

// set delay between fetches to 1 second
\WebBot\WebBot::$conf_delay_between_fetches = 1;

// do not use HTTPS protocol (we'll use HTTP protocol)
\WebBot\WebBot::$conf_force_https = false;

// do not include document field raw values
\WebBot\WebBot::$conf_include_document_field_raw_values = false;
We use our HTTP package to handle HTTP requests and responses. In the previous two sections, you have seen how our WebBot class uses HTTP requests to fetch the data, and then uses the HTTP Response object to store the fetched data. That is why we need to include the bootstrap file to load the HTTP package properly.
Then, we load our WebBot package files. Because our WebBot class uses the Document class, we load that class file first.
Next, we use the built-in PHP function set_time_limit() to tell the PHP interpreter that we want to allow unlimited execution time for our script. You don't necessarily have to use unlimited execution time. However, for testing reasons, we will use unlimited execution time for this example.
Finally, we set the WebBot class configuration settings. These settings are used by the WebBot object internally to make our bot work as we desire. We should always make the configuration settings as simple as possible to help other developers understand them. This means we should also include detailed comments in our code to ensure easy usage of package configuration settings.
We have set up four configuration settings in our WebBot class. These are static and public variables, meaning that we can set them from anywhere after we have included the WebBot class, and once we set them they will remain the same for all WebBot objects unless we change the configuration variables. If you do not understand the PHP keyword static, now would be a good time to research this subject.
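As a quick illustration of how public static variables behave, consider the following sketch. The class name SketchBot and its timeout() method are assumptions for this example only; the property name mirrors the setting discussed in this section.

```php
<?php
// Hypothetical class showing that a static property is shared by all objects.
class SketchBot
{
    public static $conf_default_timeout = 30; // one value for the whole class

    public function timeout()
    {
        return self::$conf_default_timeout; // read the shared value
    }
}

$a = new SketchBot();
$b = new SketchBot();

// Change the setting once, from outside the class.
SketchBot::$conf_default_timeout = 10;

echo $a->timeout(), ' ', $b->timeout(), PHP_EOL; // prints 10 10
```

Both objects report the new value because the property belongs to the class itself and not to any single object.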
conf_default_timeout: This variable is used to globally set the default timeout (in seconds) for all WebBot objects we create. The timeout value tells the \HTTP\Request class how long it should continue trying to send a request before stopping and deeming it a bad request, or a timed-out request. By default, this configuration setting value is set to 30 (seconds).
conf_delay_between_fetches: This variable is used to set a time delay (in seconds) between fetches (or HTTP requests). This can be very useful when gathering a lot of data from a website or web service. For example, say you had to fetch one million documents from a website. You wouldn't want to unleash your bot on that type of mission without fetch delays, because the sheer volume of requests could cause problems for that website. By default, this value is set to 0, or no delay.
conf_force_https: This variable, when set to true, can be used to force the HTTPS protocol. As mentioned earlier, this will not override any protocol that is already set in the URL. If the conf_force_https variable is set to false, the HTTP protocol will be used. By default, this value is set to false.
conf_include_document_field_raw_values: This variable, when set to true, will force the Document object to include the raw values gathered from the fields' regular expression patterns. We've discussed this setting in detail in the WebBot Document Class section earlier in this book. By default, this value is set to false.
Now that we have our WebBot class, WebBot Document class, and WebBot bootstrap file completed, we can start testing our bot. Add the following code to the 03_webbot.php file located at project_directory/:
<?php
/**
 * WebBot example
 */

// load WebBot library with bootstrap
require_once './lib/WebBot/bootstrap.php';

// URLs to fetch data from
$urls = [
    'search' => 'www.google.com',
    'chrome' => 'www.google.com/intl/en/chrome/browser/',
    'products' => 'www.google.com/intl/en/about/products/'
];

// document fields [document field ID => document field regex pattern, [...]]
$document_fields = [
    'title' => '<title.*>(.*)</title>',
    'h2' => '<h2[^>]*?>(.*)</h2>',
];

// set WebBot object
$webbot = new \WebBot\WebBot($urls, $document_fields);

// execute fetch data from URLs
$webbot->execute();

// display documents summary
echo $webbot->total_documents . ' total documents <br />';
echo $webbot->total_documents_success . ' total documents fetched successfully <br />';
echo $webbot->total_documents_failed . ' total documents failed to fetch <br /><br />';

// check if fetch(es) successful
if($webbot->success)
{
    // display each document
    foreach($webbot->getDocuments() as /* \WebBot\Document */ $document)
    {
        if($document->success) // was document data fetched successfully?
        {
            // display document meta data
            echo 'Document: ' . $document->id . '<br />';
            echo 'URL: ' . $document->url . '<br />';

            // display/print document fields and values
            $fields = $document->getFields();
            echo '<pre>' . print_r($fields, true) . '</pre>';
        }
        else // failed to fetch document data, display error
        {
            echo 'Document error: ' . $document->error . '<br />';
        }
    }
}
else // not successful, display error
{
    echo 'Failed, error: ' . $webbot->error;
}
First, we load our WebBot package and its configuration by including the WebBot bootstrap file. Next, we set the variable urls, which is an array of URLs that we want to fetch and convert into WebBot Document objects. In this example, I am using www.google.com as the URL. This is for example purposes only, and you should use your own URLs. Use URLs whose pages contain HTML tags such as <title>*</title> and <h2>*</h2>.
Subsequently, we set the document_fields variable with an array of field IDs and the fields' regular expression patterns. In this example, we are defining the document fields title and h2. The title field will include all values in the URL's data (or HTTP response body) that are within the <title>*</title> HTML tags. Likewise, the h2 field will include all values in the URL's data that are within the <h2>*</h2> HTML tags. Again, if you are not familiar with regular expression patterns, you should read more about the topic.
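One detail worth knowing when writing such patterns is the difference between greedy and lazy matching. The sketch below is an illustration only: the sample HTML string and the '#' delimiter are assumptions, and the lazy variant with (.*?) is shown for comparison with the (.*) form used in this example.

```php
<?php
// Two h2 elements on the same line.
$html = '<h2>First</h2><p>text</p><h2>Second</h2>';

// Greedy: (.*) grabs as much as possible, up to the LAST </h2> on the line.
preg_match_all('#<h2[^>]*?>(.*)</h2>#', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </h2> it can reach.
preg_match_all('#<h2[^>]*?>(.*?)</h2>#', $html, $lazy);

print_r($greedy[1]); // one value: First</h2><p>text</p><h2>Second
print_r($lazy[1]);   // two values: First and Second
```

When the tags you are matching appear on separate lines, both forms give the same result, because the dot does not match a newline by default. On pages where several tags share a line, the lazy form is usually the one you want.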
In the next line of code, we set the webbot variable with a WebBot object. We pass our urls and document_fields variables to the object constructor method. These are the only parameters required by our WebBot object, which makes it very simple to use and understand.
Following the instantiation of the WebBot object, we call the WebBot object's execute() method. This tells the WebBot object to start fetching the URL's data by making HTTP requests, and then build a document array of WebBot class Document objects.
In the next block of code, we test if the WebBot object has successfully executed the fetches by checking the success class property. If the success property is true, this doesn't necessarily mean that every URL fetch was executed successfully; it simply means the object was able to call an HTTP GET request for each URL.
In the next section, we loop through each document that we get from the WebBot method getDocuments() and test if the document (or the HTTP response body) was retrieved properly, using the Document class property called success. If the Document object is ready, or was retrieved properly, we display the document ID, the document URL, and print the document fields and field values. Obviously, in real-world applications we would do something more useful with this data, but for this example we can see the results our bot has generated. If the document wasn't retrieved properly, perhaps because the HTTP request encountered a 404 status code (not found), we display the document error, which is the HTTP status code and status code message.
When we execute the project_directory/03_webbot.php file in a web browser, we see something like the following:

3 total documents
3 total documents fetched successfully
0 total documents failed to fetch

Document: search
URL: http://www.google.com
Array
(
    [title] => Array
        (
            [0] => Google
        )
    [h2] => Array
        (
            [0] => Account Options
        )
)

Document: chrome
URL: http://www.google.com/intl/en/chrome/browser/
Array
(
    [title] => Array
        (
            [0] => Chrome Browser
        )
    [h2] => Array
        (
            [0] => Customize your browser
            [1] => Get Chrome for Mobile
            [2] => Up to 15 GB free storage
            [3] => Get a fast, free web browser
        )
)

Document: products
URL: http://www.google.com/intl/en/about/products/
Array
(
    [title] => Array
        (
            [0] => Google - Products
        )
    [h2] => Array
        (
            [0] => Web
            [1] => Mobile
            [2] => Media
            [3] => Geo
            [4] => Specialized Search
            [5] => Home & Office
            [6] => Social
            [7] => Innovation
        )
)
We can see from the results that our bot is operating as expected. First, we can see that we fetched a total of three documents, three documents were fetched successfully, and the bot failed to fetch zero documents. We can then see the document ID, URL, and fields (and field values) for each document. The bot has successfully parsed out each field and field value. This type of bot would be useful for harvesting various types of documents and document fields from desired websites or web services.