Wikipedia is not just a helpful resource for researching or looking up information; it is also a very interesting website to scrape. Wikipedia makes no effort to prevent scrapers from accessing its pages, and its well-marked-up HTML makes it easy to find the information you're looking for. In this project, we will scrape an article from Wikipedia and retrieve the first few lines of text from the body of the article.
Before starting, it is recommended that you complete the first two recipes in this book, or at least have some working knowledge of Java and the ability to create and execute Java programs.
As an example, we will use the article from the following Wikipedia link:
http://en.wikipedia.org/wiki/Java
Note that this article is about the Indonesian island of Java, not the programming language. Regardless, it seems an appropriate test subject.
We will be using the jsoup library to parse HTML data from websites and convert it into an object that can be traversed and indexed, much like an array. This exercise will show you how to download, install, and use Java libraries, and will also cover some of the basics of the jsoup library in particular.
Now that we're starting to get into writing scrapers, let's create a new project to keep them all bundled together. Carry out the following steps for this task:
1. Create a new Java project called Scraper.
2. Create a new package by right-clicking on the Scraper project in Eclipse and going to File | New | Package. Package names conventionally follow a reversed domain name (com.mydomain.mypackagename). For the rest of the book, we will begin all our packages with com.packtpub.JavaScraping appended with the package name. Let's create a new package called com.packtpub.JavaScraping.SimpleScraper.
3. Create a new class, WikiScraper, inside the src folder of the package.
4. Drag the jsoup .jar file you downloaded into the lib folder of the package you just created, inside your workspace folder. Eclipse should show your jsoup-1.7.2.jar file (this file may have a different name depending on the version you're using) in the Package Explorer window.
5. In the WikiScraper class file, write the following code:

package com.packtpub.JavaScraping.SimpleScraper;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.net.*;
import java.io.*;

public class WikiScraper {

    public static void main(String[] args) {
        scrapeTopic("/wiki/Java");
    }

    public static void scrapeTopic(String url) {
        String html = getUrl("http://en.wikipedia.org" + url);
        Document doc = Jsoup.parse(html);
        String contentText = doc.select("#mw-content-text > p").first().text();
        System.out.println(contentText);
    }

    public static String getUrl(String url) {
        URL urlObj = null;
        try {
            urlObj = new URL(url);
        } catch (MalformedURLException e) {
            System.out.println("The url was malformed!");
            return "";
        }
        URLConnection urlCon = null;
        BufferedReader in = null;
        String outputText = "";
        try {
            urlCon = urlObj.openConnection();
            in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
            String line = "";
            while ((line = in.readLine()) != null) {
                outputText += line;
            }
            in.close();
        } catch (IOException e) {
            System.out.println("There was an error connecting to the URL");
            return "";
        }
        return outputText;
    }
}
Assuming you're connected to the internet, this should compile and run with no errors, and print the first paragraph of text from the article.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Unlike our HelloWorld example, this code needs a number of libraries to work. We incorporate all of these using the import statements before the class declaration. Several jsoup classes are needed, along with two standard Java packages, java.io and java.net, which are used for creating the connection and retrieving information from the Web.
As always, our program starts in the main method of the class. This method calls the scrapeTopic method, which will eventually print the data that we are looking for (the first paragraph of text in the Wikipedia article) to the screen. scrapeTopic relies on another method, getUrl, in order to do this.
getUrl is a function that we will be using throughout the book. It takes an arbitrary URL and returns the raw source code as a string. Understanding the details of this method isn't important right now; we'll dive into more depth about it in other sections of the book. Essentially, it creates a Java URL object from the URL string and calls the openConnection method on that URL object. The openConnection method returns a URLConnection object, which can be used to create a BufferedReader object.
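To see when the MalformedURLException branch of getUrl actually fires, here is a small stand-alone sketch; the isWellFormed helper is our own invention for illustration, not part of the scraper:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCheck {
    // Mirrors getUrl's guard: constructing a URL object throws
    // MalformedURLException when the string is not a valid URL.
    public static boolean isWellFormed(String url) {
        try {
            new URL(url);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("http://en.wikipedia.org/wiki/Java")); // true
        System.out.println(isWellFormed("not a url")); // false: no protocol
    }
}
```

Note that a URL with a valid protocol but a nonexistent host is still well-formed; that failure only surfaces later, in getUrl's IOException branch.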
BufferedReader objects are designed to read from potentially very long streams of text, stopping at a certain size limit or, very commonly, reading streams one line at a time. Depending on the potential size of the pages you might be reading (or if you're reading from an unknown source), it can be important to cap how much you read. To simplify this exercise, however, we will continue to read as long as Java is able to.
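If you do want such a cap, one way to impose it is to read through a fixed-size character buffer and stop once a limit is reached. The following is a minimal stand-alone sketch; the readAtMost helper and its 1,024-character limit are our own invention, not part of the scraper:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class BoundedRead {
    // Reads at most maxChars characters from the reader, as a defensive
    // cap for pages of unknown size.
    public static String readAtMost(Reader reader, int maxChars) {
        char[] buf = new char[1024];
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(reader)) {
            int read;
            // Stop once the cap is reached or the stream ends, whichever
            // comes first; never request more than the remaining budget.
            while (out.length() < maxChars
                    && (read = in.read(buf, 0, Math.min(buf.length, maxChars - out.length()))) != -1) {
                out.append(buf, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "x".repeat(5000); // stand-in for a very long page
        String capped = readAtMost(new StringReader(page), 1024);
        System.out.println(capped.length()); // prints 1024
    }
}
```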
The while loop here retrieves the text from the BufferedReader object one line at a time and adds it to outputText, which is then returned.
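As a side note, concatenating with += inside a loop copies the accumulated string on every iteration; a StringBuilder does the same job without the re-copying. Here is a stand-alone sketch of the same loop (the readAll helper is our own, not part of the scraper):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadAll {
    // The same line-by-line loop as getUrl, but accumulating into a
    // StringBuilder instead of repeatedly concatenating strings.
    public static String readAll(BufferedReader in) {
        StringBuilder outputText = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                outputText.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return outputText.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for the network stream.
        BufferedReader in = new BufferedReader(new StringReader("one\ntwo\nthree"));
        System.out.println(readAll(in)); // prints onetwothree
    }
}
```

Note that readLine strips the line terminators, so, exactly as in getUrl, the lines are joined together with nothing in between.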
After the getUrl method has returned the HTML string to scrapeTopic, jsoup takes over. jsoup is a Java library that turns HTML strings (such as the string returned by our scraper) into more accessible objects. There are many ways to access and manipulate these objects, but the function you'll likely find yourself using most often is select. The select function returns an Elements object, a list of every element matching the given selector, which can be further manipulated, or turned into text and printed.
The crux of our script can be found in this line:
String contentText = doc.select("#mw-content-text > p").first().text();
This finds all the elements that match #mw-content-text > p (that is, all p elements that are children of the element with the ID mw-content-text), selects the first element of this set, and turns the resulting object into plain text (stripping out all tags, such as <a> tags, and other formatting that might be in the text).
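You can see the selector in action without fetching anything from the Web by running it against a small inline HTML string. The markup below is a made-up stand-in for Wikipedia's page, and the example requires the jsoup .jar on your classpath:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectDemo {
    public static void main(String[] args) {
        // Hypothetical markup imitating Wikipedia's structure, not the real page.
        String html = "<div id=\"mw-content-text\">"
                + "<p>Java is an <a href=\"/wiki/Island\">island</a> of Indonesia.</p>"
                + "<p>A second paragraph.</p>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        // Same selector as the scraper: the first <p> child of the
        // element with the ID mw-content-text, with all tags stripped.
        String text = doc.select("#mw-content-text > p").first().text();
        System.out.println(text); // prints: Java is an island of Indonesia.
    }
}
```

Notice that the <a> tag disappears from the output, while its inner text survives; this is exactly what happens to the links in the real article's first paragraph.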
The program ends by printing this line out to the console.
Jsoup is a powerful library that we will be working with in many applications throughout this book. For uses that are not covered in the book, I encourage you to read the complete documentation at http://jsoup.org/apidocs/.
What if you find yourself working on a project where jsoup's capabilities aren't quite meeting your needs? There are dozens of Java-based HTML parsers available.