Faster scraping through threading (Intermediate)

Although processors were originally developed with only one core, modern multicore processors (processors with multiple separate and independent CPUs embedded on a single chip) run most efficiently when they are executing multiple threads (separate, independent sets of instructions) at once.

If you have a powerful quad-core processor completely dedicated to executing your program, but your program is not using threading, you are essentially limited to using 25 percent of your processor's capabilities. In this recipe, we will learn how to take advantage of your machine (especially important when running on a dedicated server) and speed up your scraping.

Getting ready

Writing software that uses threading can be extremely important in web scraping for several reasons:

  • You're often working on machines that are completely dedicated to scraping and you need to be able to take advantage of multicore processors appropriately
  • Web pages often take a long time to respond, and running multiple threads at once lets other threads make progress on the processor while some are blocked waiting on network I/O
  • You need all your scrapers to share resources easily; global variables, for example, are shared between all threads
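The last point can be illustrated with a short sketch (not part of the recipe code; class and variable names are hypothetical): two threads incrementing a shared static counter, which both can see because static fields live in memory shared by all threads. An `AtomicInteger` is used so that no increments are lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: two threads updating one shared (static) counter.
public class SharedCounterDemo {
    private static final AtomicInteger pagesScraped = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Runnable scraper = () -> {
            for (int i = 0; i < 1000; i++) {
                pagesScraped.incrementAndGet(); // atomic, so no updates are lost
            }
        };
        Thread t1 = new Thread(scraper);
        Thread t2 = new Thread(scraper);
        t1.start();
        t2.start();
        t1.join(); // wait for both threads to finish before reading the total
        t2.join();
        System.out.println("Total pages scraped: " + pagesScraped.get());
    }
}
```

Note that with a plain `int` instead of an `AtomicInteger`, the two threads could interleave their read-increment-write sequences and silently drop updates.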

Because we will be using the code from persisting data, it would be helpful to review the recipe Persisting data (Advanced), have your MySQL database ready, and download the JDBC drivers from http://dev.mysql.com/downloads/connector/j/. Alternatively, you can relatively easily replace the code that actually stores the results in a database and do something else with it, such as simply printing it on the screen.

How to do it...

Up until this point, every class we've written has begun execution either in the main method (when run as a standalone program) or in a method we call explicitly (for example, myWebsite.getString()). Threads, by contrast, extend the core Java Thread class and override its run method, which is invoked on a new thread of execution when myThreadClass.start() is called.
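The distinction between run and start is worth seeing in miniature before the full recipe code (this is an illustrative sketch, not the recipe's classes; the names are hypothetical). Calling start() spawns a new thread that invokes run(); calling run() directly would simply execute it on the current thread.

```java
// Minimal illustration of Thread.start() invoking run() on a new thread.
public class StartVsRun {
    static class Greeter extends Thread {
        public Greeter(String name) {
            super(name); // the thread's name, retrievable later via getName()
        }

        @Override
        public void run() {
            System.out.println("Hello from " + getName());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Greeter g = new Greeter("worker-1");
        g.start(); // schedules run() on a new thread; g.run() would NOT
        g.join();  // block until the thread finishes
        System.out.println("main thread done");
    }
}
```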

  1. We will be basing our thread code off of the code created in the Persisting data (Advanced) recipe. Review this recipe if you haven't completed it already.
  2. Methods that remain unchanged are marked as methodName(){...}. You should create a new package, com.packtpub.JavaScraping.Threading, that contains two classes: ThreadStarter and WikiCrawler.

    The WikiCrawler class:

    package com.packtpub.JavaScraping.Threading;
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.Properties;
    import java.util.Random;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    class WikiCrawler extends Thread {

        private Random generator;   // per-thread, so crawlers don't share state
        private Connection dbConn;  // one database connection per thread

        public WikiCrawler(String str) {
            super(str);
            generator = new Random(27182845);

            String dbURL = "jdbc:mysql://localhost:3306/java";
            Properties connectionProps = new Properties();
            connectionProps.put("user", "<username>");
            connectionProps.put("password", "<password>");
            dbConn = null;
            try {
                dbConn = DriverManager.getConnection(dbURL, connectionProps);
            } catch (SQLException e) {
                System.out.println("There was a problem connecting to the database");
                e.printStackTrace();
            }

            PreparedStatement useStmt;
            try {
                useStmt = dbConn.prepareStatement("USE java");
                useStmt.executeUpdate();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }

        @Override
        public void run() {
            String newUrl = getName();
            for (int i = 0; i < 10; i++) {
                System.out.println(i + " " + getName());
                newUrl = scrapeTopic(newUrl);
                if (newUrl.isEmpty()) {
                    newUrl = getName();
                }
                try {
                    sleep((int) (Math.random() * 1000));
                } catch (InterruptedException e) {}
            }
            System.out.println("DONE! " + getName());
        }

        public String scrapeTopic(String url) {
            ...
        }

        private void writeToDB(String title, String url) {
            ...
        }

        public String getUrl(String url) {
            ...
        }

    }
  3. Replace <username> and <password> in WikiCrawler with the username and password for your MySQL database.
  4. The following class, ThreadStarter, creates two WikiCrawler threads, each with a different Wikipedia URL, and starts them:
    package com.packtpub.JavaScraping.Threading;
    
    class ThreadStarter {
        public static void main (String[] args) {
            WikiCrawler javaIsland = new WikiCrawler("/wiki/Java");
            WikiCrawler javaLanguage = new WikiCrawler("/wiki/Java_(programming_language)");
            javaLanguage.start();
            javaIsland.start(); 
        }
    }

    The output would look something like the following, with the two threads' lines interleaved in a nondeterministic order:

        0 /wiki/Java_(programming_language)
        0 /wiki/Java
        1 /wiki/Java
        1 /wiki/Java_(programming_language)
        ...
        DONE! /wiki/Java
        DONE! /wiki/Java_(programming_language)

How it works...

As mentioned earlier, WikiCrawler.run() acts as a jump-off point in the thread's execution, in much the same way that main does in a normal class. Although WikiCrawler.scrapeTopic() has been slightly modified from the code presented in the Persisting data (Advanced) recipe, it serves much the same purpose: it retrieves the web page for a given Wikipedia article, draws a new random number bounded by the number of links to other articles found on that page, and writes a new row to a table in the database. In this case, however, it also returns the next URL as a string for the main loop in the run method to use on its next pass.

It is tempting to create static variables in the Thread class for things like the starting URL (for example, /wiki/Java) that we feed it, but keep in mind that threads share memory. If such a static variable were created, every thread would crawl from the URL given to the last-constructed thread (in this case, /wiki/Java_(programming_language)). Instead, we use thread names to differentiate the threads, and it is convenient to use each thread's name as the URL it is assigned as a starting point for the web crawling.
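The pitfall can be demonstrated concretely (a hypothetical sketch, separate from the recipe code): a static field is overwritten by each constructor call, so every instance ends up seeing the last value, while the per-instance thread name is preserved.

```java
// Hypothetical demo of the shared-static pitfall described above.
public class StaticPitfall {
    static class Crawler extends Thread {
        private static String sharedStartUrl;  // shared across ALL instances
        private final String instanceStartUrl; // one per thread

        Crawler(String url) {
            super(url);           // thread name doubles as the start URL
            sharedStartUrl = url; // each construction overwrites this field
            instanceStartUrl = url;
        }

        @Override
        public void run() {
            System.out.println(getName() + " static=" + sharedStartUrl
                    + " instance=" + instanceStartUrl);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Crawler a = new Crawler("/wiki/Java");
        Crawler b = new Crawler("/wiki/Java_(programming_language)");
        a.start();
        a.join(); // run sequentially so the output order is stable
        b.start();
        b.join();
    }
}
```

The first thread prints the second thread's URL in its `static=` slot, showing that the value it was constructed with has been clobbered.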

The implementation of the ThreadStarter class is relatively straightforward. This class, as a normal, non-thread Java class, starts execution in the main method. It initializes two new WikiCrawler threads and starts them in one line each.
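One thing ThreadStarter does not do is wait for its threads: main returns immediately after the two start() calls, and the JVM keeps running only because the crawler threads are still alive. If you wanted main to block until the crawling is done, you could use Thread.join(), as in this hypothetical variant (the work here is a placeholder print rather than a real crawl):

```java
// Hypothetical variant of ThreadStarter that waits for its workers.
class ThreadStarterJoin {
    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> System.out.println("crawl /wiki/Java"));
        Thread b = new Thread(
                () -> System.out.println("crawl /wiki/Java_(programming_language)"));
        a.start();
        b.start();
        a.join(); // block until each thread completes
        b.join();
        System.out.println("All crawlers finished");
    }
}
```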

There's more...

This was a very simple, two-class example designed to demonstrate how to create and run threads. Although the threads were all of one type in this example (the WikiCrawler class), it is certainly possible to have threads that are designed only to wait for pages to load and return the page to a main class, threads to analyze the data in the pages, and threads to store the data in a database. Because threads share memory, it is easy to pass information between them.
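That multi-role design can be sketched with a shared java.util.concurrent.BlockingQueue passing work from a "fetcher" thread to a "storer" thread (a simplified, hypothetical sketch: the pages here are placeholder strings, not real downloads, and the poison-pill shutdown value is an assumption of this example):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a two-stage pipeline: one thread produces pages, another stores them.
public class PipelineDemo {
    private static final String POISON = "__DONE__"; // shutdown signal

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> pages = new LinkedBlockingQueue<>();

        Thread fetcher = new Thread(() -> {
            for (int i = 1; i <= 3; i++) {
                pages.add("page-" + i); // stand-in for a downloaded page
            }
            pages.add(POISON); // tell the consumer there is no more work
        });

        Thread storer = new Thread(() -> {
            try {
                String page;
                while (!(page = pages.take()).equals(POISON)) {
                    System.out.println("stored " + page); // stand-in for a DB write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        fetcher.start();
        storer.start();
        fetcher.join();
        storer.join();
        System.out.println("pipeline finished");
    }
}
```

The BlockingQueue handles the synchronization for you: take() simply blocks until a page is available, with no manual locking required.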

Although you cannot know, without explicit synchronization, exactly when another thread has finished (and therefore whether the data you want is present yet), you can certainly create a loop that repeatedly checks whether data has been added to a shared collection and, if so, processes it before continuing the loop and waiting for the next piece of data to arrive.

Threading gets very complicated very fast, but it has immense power and can vastly improve the performance of your programs. For more information about different ways of using threads to improve runtime, the following links are a good start:

http://docs.oracle.com/javase/tutorial/essential/concurrency/

http://docs.oracle.com/javase/tutorial/essential/concurrency/pools.html
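The thread-pool approach covered by the second link can be sketched as follows (a hypothetical example: the URLs are placeholders and the "work" is just a print statement). Instead of creating one Thread per URL, a fixed pool of workers drains a list of submitted tasks, which keeps the number of live threads bounded no matter how many URLs you queue:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of a fixed-size thread pool crawling a list of URLs.
public class PoolDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2); // two workers
        String[] urls = {
            "/wiki/Java",
            "/wiki/Java_(programming_language)",
            "/wiki/Coffee"
        };
        for (String url : urls) {
            pool.submit(() -> System.out.println("crawling " + url)); // placeholder work
        }
        pool.shutdown();                              // stop accepting new tasks
        pool.awaitTermination(10, TimeUnit.SECONDS);  // wait for submitted tasks
        System.out.println("pool shut down");
    }
}
```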
