Preface

Welcome to Instant Web Scraping with Java! Although web scraping may seem like a fairly specific topic, there's more to it than simply turning URLs into HTML. What happens when you find that a page has a redirect, the server has placed a rate limiter on your IP address, or the data you want is behind a wall of Ajax or a form? How do you solve problems in a robust way? What do you do with the data after you get it?

This book assumes that you have some basic foundation in programming, and probably know your way around the command line a little. You can troubleshoot and improvise, and aren't afraid to play around with the code. Although we'll hold your hand and walk you through the basic examples, you're not going to get very far unless you combine the techniques learned in this book to fit your own needs. In the There's more... section of each recipe, we'll give you some pointers on where to go from here, but there's a lot of ground to cover and not many pages to do it in.

If you're new to Java, I would recommend starting by carefully reading the recipes on Setting up your Java Environment (Simple) and Writing and executing HelloWorld.java (Simple). More advanced programmers may be fine starting with the sections on RMI and threading. Whenever a later recipe depends on code used in a previous recipe, it will be mentioned in the Getting ready section of the recipe.

What this book covers

Setting up your Java Environment (Simple) explains that Java is not a lightweight language, and it requires some similarly heavy-duty installations. We'll guide you through setting up the Java Development Kit, as well as an Integrated Development Environment.

Writing and executing HelloWorld.java (Simple) shows you how to write and execute basic Java programs, and introduces you to the architecture of the language. If you've never written a line of Java in your life, this recipe will be pretty helpful.

Writing a simple scraper (Simple) introduces you to some basic scraping functions and outputs some text grabbed from the Internet. If you only write one scraper this year, write this one.
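To give a flavor of what that first scraper looks like, here is a minimal sketch using only the JDK's built-in HTTP client; the target URL and the regex-based title extraction are illustrative placeholders, not the book's own code (the recipes use proper HTML-parsing libraries):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleScraper {
    // Pull the contents of the first <title> tag out of a page,
    // or return null if no title is present.
    static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) throws Exception {
        // Fetch a page and print whatever its <title> says
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://www.example.com")).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(extractTitle(response.body()));
    }
}
```

Regular expressions are a fragile way to read HTML, which is exactly why the later recipes move to real parsers; this sketch just shows the fetch-then-extract shape every scraper shares.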

Writing a more complicated scraper (Intermediate) teaches you that you have to crawl before you can walk. Let's write a scraper that traverses a series of links and runs indefinitely, reporting data back as it goes.

Handling errors (Simple) explains that the web is dark and full of terrors, but that won't stop our scrapers. Learn how to handle missing code, broken links, and other unexpected situations gracefully by throwing and catching errors.
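The essential pattern is wrapping network calls in try/catch so a dead link doesn't kill a long-running crawl. A small sketch (the fallback-string approach is one illustrative strategy, not the book's prescribed one):

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Scanner;

public class SafeFetcher {
    // Fetch a page's text, returning a fallback instead of crashing
    // when the URL is malformed or the server is unreachable.
    static String fetchOrDefault(String address, String fallback) {
        try (Scanner in = new Scanner(new URL(address).openStream(), "UTF-8")) {
            return in.useDelimiter("\\A").hasNext() ? in.next() : fallback;
        } catch (MalformedURLException e) {
            System.err.println("Bad URL: " + address);
            return fallback;
        } catch (IOException e) {
            System.err.println("Could not reach: " + address);
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A broken "link" becomes a logged message, not a stack trace
        System.out.println(fetchOrDefault("not a url", "(no page)"));
    }
}
```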

Writing robust, scalable code (Advanced) teaches you how to create multiple packages, classes, and methods that communicate with one another, scrape multiple sites at once, and document your code so others can follow along. If you wanted to write one-page scripts you would have used Perl.

Persisting data (Advanced) explains how to connect to remote databases and store results from the web for later retrieval. Nothing lasts forever, but we can try.

Writing tests (Intermediate) shows you the tips and tricks of the JUnit testing framework. JUnit is especially important in web scraping, and many Java web scraping libraries are based on JUnit. It will be used in several recipes in the book. It's important whether you want to check your own website for unexpected problems, or poke around someone else's.

Going undercover (Intermediate) explains how to masquerade as a web browser in order to prevent being blocked by sites. Although nearly every piece of data on the web is accessible to scrapers, some is a little harder to get to than others.
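The first step in that masquerade is usually sending the headers a browser would send. A sketch using the JDK's HTTP client (the User-Agent string below is an illustrative example, not a value taken from the book):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class DisguisedRequest {
    // Build a request that identifies itself like a desktop browser
    // instead of a bare Java HTTP client.
    static HttpRequest browserLikeRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .header("Accept-Language", "en-US,en;q=0.9")
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = browserLikeRequest("http://www.example.com");
        System.out.println(req.headers().firstValue("User-Agent").orElse(""));
    }
}
```

Headers alone won't fool every site, of course; cookies, timing, and JavaScript execution all come into play in the later recipes.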

Submitting a basic form (Advanced) teaches you to take off the gloves and actually interact with sites by filling in fields, clicking buttons, and submitting forms. So far we've been looking but not touching when we crawl websites.

Scraping Ajax pages (Advanced) explains that the dynamic web brings some difficult challenges for web scrapers. Learn how to execute JavaScript as an actual web browser would, and access data that might be hidden by a layer of Ajax.

Faster scraping through threading (Intermediate) assumes that you've moved beyond running scripts on your laptop and want to take full advantage of a dedicated server or larger machine, and explains how to create multithreaded scrapers that return data as fast as possible, without being hung up waiting for page loads.
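The core idea is a pool of worker threads fetching pages concurrently. In this sketch the page fetch is simulated with a sleep so it runs without a network connection; everything here is illustrative, not the book's own code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadedScraper {
    // Stand-in for a slow page load: sleep, then return a result
    static String fakeFetch(String page) throws InterruptedException {
        Thread.sleep(100);
        return "scraped " + page;
    }

    // Fetch all pages concurrently on a fixed-size thread pool,
    // collecting results in the original order.
    static List<String> scrapeAll(List<String> pages) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String p : pages) {
                futures.add(pool.submit(() -> fakeFetch(p)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks only until that task finishes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Four 100 ms "fetches" finish in roughly one fetch's time,
        // since all four run in parallel on the pool
        System.out.println(scrapeAll(List.of("a", "b", "c", "d")));
    }
}
```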

Faster scraping with RMI (Advanced) shows you that using multiple servers for large scraping operations isn't just about the increase in speed (although that is nice!) but also about the increased robustness. Blocking one IP address is easy, but blocking all of them is hard. Learn how to communicate between servers using Java's built-in Remote Method Invocation.
