Handling errors (Simple)

Errors are a constant companion in web scraping. The web is messy, and you can never be certain that an element exists, that a page returns the data you want, or even that a site's server is up and running. Being able to catch, throw, and handle these exceptions (errors) is essential to web scraping.

Getting ready

Take a look at the getUrl method in the previous scrapers. You'll notice the following lines:

try{
...
}catch(IOException e){
    System.out.println("There was an error connecting to the URL");
    return "";
}

This snippet indicates that the statements in the try block can throw an IOException if things go wrong, and the catch block defines how to handle that exception gracefully if it occurs.
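To see the same pattern run without a network connection, here is a minimal sketch. The names TryCatchSketch, readAll, and brokenReader are hypothetical and do not appear in the earlier scrapers; brokenReader simply simulates a source that fails, the way an unreachable server would.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class TryCatchSketch {
    // Hypothetical helper: reads all lines from a Reader and, like getUrl,
    // falls back to an empty string if an IOException occurs.
    static String readAll(Reader source) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            System.out.println("There was an error reading the source");
            return "";
        }
        return out.toString();
    }

    // A Reader that always fails, standing in for a server that is down.
    static Reader brokenReader() {
        return new Reader() {
            @Override
            public int read(char[] buf, int off, int len) throws IOException {
                throw new IOException("simulated connection failure");
            }

            @Override
            public void close() {
                // Nothing to release in this simulation.
            }
        };
    }

    public static void main(String[] args) {
        // Normal case: the content comes back with a newline after each line.
        System.out.print(readAll(new StringReader("<html>\n</html>")));
        // Failure case: the catch block runs and an empty string is returned.
        System.out.println("Empty on failure: " + readAll(brokenReader()).isEmpty());
    }
}
```

The caller of readAll never sees the IOException; it only sees the empty string, which is why the choice of fallback value in a catch block matters.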

In this recipe, we will look in more detail not just at how to handle exceptions, but also at how to create and throw our own.

How to do it...

Although Java has many built-in general exceptions that can be thrown, it's best to be specific when it's possible to anticipate a problem. For this reason, we will create our own exception classes for our software to throw.

  1. Create a new package in the Scraper project called com.packtpub.JavaScraping.HandlingErrors.
  2. Create the class LinkNotFoundException inside the new package. The code will look like this:
    package com.packtpub.JavaScraping.HandlingErrors;

    public class LinkNotFoundException extends Exception{
      public LinkNotFoundException(String msg) {
        super(msg);
      }
    }

  3. Modify WikiScraper.java to match the following (where methods are the same as they appear in previous sections, the body of the method is replaced by "..."):

    package com.packtpub.JavaScraping.HandlingErrors;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    import java.net.*;
    import java.io.*;
    import java.util.Random;
    
    public class WikiScraper {
      private static Random generator;
      public static void main(String[] args) {
        generator = new Random(31415926);
        for(int i = 0; i < 10; i++){
          try{
            scrapeTopic("/wiki/Java");
          }
          catch(LinkNotFoundException e){
            System.out.println("Trying again!");
          }
        }
      }
      
      public static void scrapeTopic(String url) throws LinkNotFoundException{
        String html = getUrl("http://www.wikipedia.org"+url);
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("#mw-content-text [href~=^/wiki/((?!:).)*$]");
        if(links.size() == 0){
          throw new LinkNotFoundException("No links on page, or page malformed");
        }
        int r = generator.nextInt(links.size());
        System.out.println("Random link is: "+links.get(r).text()+" at url: "+links.get(r).attr("href"));
        scrapeTopic(links.get(r).attr("href"));
      }
      
      public static String getUrl(String url){
        ... 
      }
    }
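The throw-and-catch flow in the steps above can also be tried offline. The following is a rough sketch, not part of the recipe's code: ThrowSketch and pickLink are hypothetical names, the list of strings stands in for the jsoup Elements object, and the exception class is nested only to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ThrowSketch {
    // Same shape as the recipe's exception, nested to keep the sketch in one file.
    static class LinkNotFoundException extends Exception {
        LinkNotFoundException(String msg) {
            super(msg);
        }
    }

    // Hypothetical stand-in for the link selection inside scrapeTopic:
    // an empty list is treated as a malformed page and reported by throwing.
    static String pickLink(List<String> links, Random generator) throws LinkNotFoundException {
        if (links.isEmpty()) {
            throw new LinkNotFoundException("No links on page, or page malformed");
        }
        return links.get(generator.nextInt(links.size()));
    }

    public static void main(String[] args) {
        Random generator = new Random(31415926);
        List<String> none = Collections.emptyList();
        try {
            // Succeeds: one of the two links is chosen.
            System.out.println(pickLink(Arrays.asList("/wiki/Java", "/wiki/Jsoup"), generator));
            // Throws: control jumps straight to the catch block.
            pickLink(none, generator);
            System.out.println("This line is never reached");
        } catch (LinkNotFoundException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

Because LinkNotFoundException extends Exception, it is a checked exception: the compiler insists that pickLink declare it with throws and that the caller either catch it or declare it in turn.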

How it works...

This is the first example we've seen of extending classes in Java. If you've done object-oriented programming before, you're likely to be familiar with the concept of extending classes. When we extend a class, the new class (the subclass) inherits all the attributes and methods of the parent class. The classic zoological example of class extension goes something like this:

public class Animal{...}
public class FourLeggedMammal extends Animal{...}
public class Dog extends FourLeggedMammal{...}

In this example, the Dog class contains all the attributes/methods/behaviors of FourLeggedMammal (and, through it, of Animal), while perhaps adding some methods of its own, specific to Dog (for example, public String bark()).
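A runnable version of this hierarchy might look like the following sketch. The method names describe and legCount, and the method bodies, are invented for illustration; the classes are nested inside a hypothetical InheritanceSketch class only so that everything fits in one file.

```java
class InheritanceSketch {
    static class Animal {
        // Available to every subclass of Animal.
        String describe() {
            return "I am an animal";
        }
    }

    static class FourLeggedMammal extends Animal {
        // Added at this level; inherited by Dog.
        int legCount() {
            return 4;
        }
    }

    static class Dog extends FourLeggedMammal {
        // Specific to Dog.
        String bark() {
            return "Woof";
        }
    }

    public static void main(String[] args) {
        Dog dog = new Dog();
        System.out.println(dog.describe()); // inherited from Animal
        System.out.println(dog.legCount()); // inherited from FourLeggedMammal
        System.out.println(dog.bark());     // defined on Dog itself
    }
}
```

LinkNotFoundException relates to Exception in the same way that Dog relates to FourLeggedMammal here: it gets everything its parent offers and adds only a constructor.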

If you were to take a look at the Java core libraries, you'd see many exception classes similar to the one we've just created. Like ours, they extend the Java Exception class, which (through its own parent class, Throwable) provides methods for getting the stack trace for the error, printing error messages, and other useful things.
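As a small illustration of those inherited methods, the sketch below throws the exception and immediately inspects it. ExceptionMethodsSketch and failAndReport are hypothetical names; getMessage, getStackTrace, and printStackTrace are standard methods that every exception inherits.

```java
class ExceptionMethodsSketch {
    // Same shape as the recipe's class, nested so the sketch is self-contained.
    static class LinkNotFoundException extends Exception {
        LinkNotFoundException(String msg) {
            super(msg);
        }
    }

    // Throws and immediately catches the exception, then uses inherited methods.
    static String failAndReport() {
        try {
            throw new LinkNotFoundException("No links on page, or page malformed");
        } catch (LinkNotFoundException e) {
            // getStackTrace() returns the frames recorded when the exception was created.
            System.out.println("Frames recorded: " + e.getStackTrace().length);
            // printStackTrace() writes the exception and its trace to System.err.
            e.printStackTrace();
            // getMessage() returns the string that was passed to super(msg).
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("Message: " + failAndReport());
    }
}
```

None of these three methods is written in LinkNotFoundException itself; the single call to super(msg) in its constructor is all that is needed to make them work.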

Tip

Be aware that throwing an exception carries considerable overhead, and exceptions should not be used in place of a normal return value. An exception should indicate that something has gone seriously wrong, and that things need to be handled or reset appropriately.
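One way to follow this advice, sketched here under the assumption that a missing value is an ordinary outcome rather than a failure, is to report it through the return type. ReturnValueSketch and firstExternalLink are hypothetical names, and Optional requires Java 8 or later.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

class ReturnValueSketch {
    // Hypothetical helper: a page without external links is an ordinary outcome,
    // so it is reported through the return value rather than an exception.
    static Optional<String> firstExternalLink(List<String> hrefs) {
        for (String href : hrefs) {
            if (href.startsWith("http://") || href.startsWith("https://")) {
                return Optional.of(href);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList("/wiki/Java", "http://example.com/");
        // A match is found, so its value is printed.
        System.out.println(firstExternalLink(hrefs).orElse("none"));
        // No match: the caller chooses a default without any try/catch.
        System.out.println(firstExternalLink(Collections.<String>emptyList()).orElse("none"));
    }
}
```

Whether a given situation is "ordinary" or "seriously wrong" is a judgment call. In WikiScraper, an article with no links at all suggests a malformed page, which is why that case throws.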

There's more...

Oracle has an excellent Java tutorial series at http://docs.oracle.com/javase/tutorial/index.html that provides not just an introductory series of lessons to Java development but also resources on many advanced topics.

Although the throwing, catching, and handling of exceptions might seem like a fairly straightforward exercise as it was presented in this recipe, exception handling is still far from a simple subject. Oracle's lesson on exception handling, which can be found at http://docs.oracle.com/javase/tutorial/essential/exceptions/, is a good starting point for learning more advanced techniques.
