Writing a simple scraper (Simple)

Wikipedia is not just a helpful resource for researching or looking up information; it is also a very interesting website to scrape. It makes no effort to prevent scrapers from accessing the site, and its well-marked-up HTML makes it very easy to find the information you're looking for. In this project, we will scrape an article from Wikipedia and retrieve the first few lines of text from the body of the article.

Getting ready

It is recommended that you complete the first two recipes in this book or, at the very least, have some working knowledge of Java and the ability to create and execute Java programs.

As an example, we will use the article from the following Wikipedia link:

http://en.wikipedia.org/wiki/Java

Note that this article is about the Indonesian island of Java, not the programming language. Regardless, it seems appropriate to use it as a test subject.

We will be using the jsoup library to parse HTML data from websites and convert it into objects whose values can be traversed and indexed (much like an array). In this exercise, we will show you how to download, install, and use Java libraries. In addition, we'll also cover some of the basics of the jsoup library in particular.

How to do it...

Now that we're starting to get into writing scrapers, let's create a new project to keep them all bundled together. Carry out the following steps for this task:

  1. As shown in the previous recipe, open Eclipse and create a new Java project called Scraper.
  2. Although we didn't create a Java package in the previous recipe, packages are handy for bundling collections of classes together within a single project (projects contain multiple packages, and packages contain multiple classes). You can create a new package by highlighting the Scraper project in Eclipse and going to File | New | Package.
  3. By convention, in order to prevent programmers from creating packages with the same name (and causing namespace problems), packages are named starting with the reverse of your domain name (for example, com.mydomain.mypackagename). For the rest of the book, we will begin all our packages with com.packtpub.JavaScraping appended with the package name. Let's create a new package called com.packtpub.JavaScraping.SimpleScraper.
  4. Create a new class, WikiScraper, inside the src folder of the package.
  5. Download the jsoup core library, the first link, from the following URL:

    http://jsoup.org/download

  6. Place the .jar file you downloaded into a lib folder inside the Scraper project you just created (create the folder if it doesn't already exist).
  7. In Eclipse, right-click in the Package Explorer window and select Refresh. This will allow Eclipse to update the Package Explorer to the current state of the workspace folder. Eclipse should show your jsoup-1.7.2.jar file (this file may have a different name depending on the version you're using) in the Package Explorer window.
  8. Right-click on the jsoup JAR file and select Build Path | Add to Build Path.
  9. In your WikiScraper class file, write the following code:
    package com.packtpub.JavaScraping.SimpleScraper;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.net.*;
    import java.io.*;
    
    public class WikiScraper {
      public static void main(String[] args) {
        scrapeTopic("wiki/Java");
      }
      
      public static void scrapeTopic(String url){
        String html = getUrl("http://en.wikipedia.org/"+url);
        Document doc = Jsoup.parse(html);
        String contentText = doc.select("#mw-content-text > p").first().text();
        System.out.println(contentText);
      }
      
      public static String getUrl(String url){
        URL urlObj = null;
        try{
          urlObj = new URL(url);
        }
        catch(MalformedURLException e){
          System.out.println("The url was malformed!");
          return "";
        }
        URLConnection urlCon = null;
        StringBuilder outputText = new StringBuilder();
        try{
          urlCon = urlObj.openConnection();
          BufferedReader in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
          String line;
          while((line = in.readLine()) != null){
            outputText.append(line);
          }
          in.close();
        }catch(IOException e){
          System.out.println("There was an error connecting to the URL");
          return "";
        }
        return outputText.toString();
      }
    }

    Assuming you're connected to the internet, this should compile and run with no errors, and print the first paragraph of text from the article.

    Tip

    Downloading the example code

    You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

How it works...

Unlike our HelloWorld example, this code needs several libraries to work. We incorporate all of these using the import statements before the class declaration. Several jsoup classes are needed, along with two Java packages, java.io and java.net, which are required for creating the connection and retrieving information from the Web.

As always, our program starts in the main method of the class. This method calls the scrapeTopic method, which will eventually print the data we are looking for (the first paragraph of text in the Wikipedia article) to the screen. scrapeTopic relies on another method, getUrl, in order to do this.

getUrl is a function that we will be using throughout the book. It takes in an arbitrary URL and returns the raw source code as a string. Understanding the details of this method isn't important right now—we'll dive into more depth about this in other sections of the book. Essentially, it creates a Java URL object from the URL string, and calls the openConnection method on that URL object. The openConnection method returns a URLConnection object, which can be used to create a BufferedReader object.
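As a hypothetical variant (not the version used in this recipe), the same method can be sketched with try-with-resources, which closes the reader automatically even if an exception is thrown partway through; the class name UrlFetcher is ours, chosen just for this illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class UrlFetcher {
    // Sketch of getUrl: same behavior as the recipe's version, but the
    // BufferedReader is closed automatically by try-with-resources.
    public static String getUrl(String url) {
        URL urlObj;
        try {
            urlObj = new URL(url);
        } catch (MalformedURLException e) {
            System.out.println("The url was malformed!");
            return "";
        }
        StringBuilder outputText = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(urlObj.openConnection().getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                outputText.append(line);
            }
        } catch (IOException e) {
            System.out.println("There was an error connecting to the URL");
            return "";
        }
        return outputText.toString();
    }

    public static void main(String[] args) {
        // An unknown protocol triggers MalformedURLException, so the
        // method returns the empty string rather than throwing.
        System.out.println(getUrl("htp://no-such-protocol").isEmpty());
    }
}
```

Either style works; try-with-resources simply removes the need to remember the explicit in.close() call.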

BufferedReader objects are designed to read from potentially very long streams of text, stopping at a certain size limit or, very commonly, reading the stream one line at a time. Depending on the potential size of the pages you might be reading (or if you're reading from an unknown source), it might be important to set a buffer size here. To keep this exercise simple, however, we will continue to read for as long as Java is able to.
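To see this behavior in isolation (without any network connection), here is a minimal sketch that constructs a BufferedReader with an explicit buffer size and reads an in-memory string one line at a time:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class BufferedReaderDemo {
    public static void main(String[] args) throws IOException {
        String text = "first line\nsecond line\nthird line";
        // The second constructor argument sets the internal buffer to
        // 8192 chars; readLine() still returns one line at a time.
        BufferedReader in = new BufferedReader(new StringReader(text), 8192);
        String line;
        int count = 0;
        while ((line = in.readLine()) != null) {
            count++;
            System.out.println(count + ": " + line);
        }
        in.close();
        // prints:
        // 1: first line
        // 2: second line
        // 3: third line
    }
}
```

The same while loop shape appears in getUrl; only the underlying source (a StringReader here, an InputStreamReader over the network connection there) differs.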

The while loop here retrieves the text from the BufferedReader object one line at a time and adds it to outputText, which is then returned.

After the getUrl method has returned the HTML string to scrapeTopic, jsoup takes over. jsoup is a Java library that turns HTML strings (such as the string returned by our scraper) into more accessible Document objects. There are many ways to access and manipulate these objects, but the method you'll likely find yourself using most often is select. The select method returns an Elements object (a list of all the elements matching the selector passed to it), which can be further manipulated, or turned into text and printed.

The crux of our script can be found in this line:

String contentText = doc.select("#mw-content-text > p").first().text();

This finds all the elements that match #mw-content-text > p (that is, all p elements that are direct children of the element with the ID mw-content-text), selects the first element of this set, and converts the resulting object into plain text (stripping out all tags, such as <a> tags, and any other formatting in the text).
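You can see the effect of select and text without touching the network by parsing a small HTML string directly (a toy snippet of our own, not Wikipedia's actual markup):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectDemo {
    public static void main(String[] args) {
        // A made-up fragment mimicking the structure the selector targets.
        String html = "<div id=\"mw-content-text\">"
                + "<p>Java is an <a href=\"/wiki/Island\">island</a> of Indonesia.</p>"
                + "<p>A second paragraph.</p>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        // first() picks the first matching <p>; text() strips the <a> tag.
        String contentText = doc.select("#mw-content-text > p").first().text();
        System.out.println(contentText);
        // prints: Java is an island of Indonesia.
    }
}
```

Note that first() returns null when nothing matches the selector, so in production code it is worth checking for that before calling text().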

The program ends by printing this line out to the console.

There's more...

Jsoup is a powerful library that we will be working with in many applications throughout this book. For uses that are not covered in the book, I encourage you to read the complete documentation at http://jsoup.org/apidocs/.

What if you find yourself working on a project where jsoup's capabilities don't quite meet your needs? There are dozens of Java-based HTML parsers available.
