Let's take the largely useless capitalization application and do something practical with it. Here, our goal is to build a rudimentary spider. In doing so, we'll accomplish the following tasks:
These kinds of applications are written every day, and they're the ones that benefit the most from concurrency and non-blocking code.
It probably goes without saying, but this is not a particularly elegant web scraper. For starters, it only knows a few start points: the five URLs that we supply it. Also, it is neither recursive nor thread-safe in terms of data integrity.
That said, the following code demonstrates how we can use channels and the select statement:
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"sync"
	"time"
)

var applicationStatus bool
var urls []string
var urlsProcessed int
var foundUrls []string
var fullText string
var totalURLCount int
var wg sync.WaitGroup
var v1 int
First, we have our most basic global variables that we'll use for the application state. The applicationStatus variable tells us that our spider process has begun, and urls is our slice of simple string URLs. The rest are simple data storage variables and/or application flow mechanisms. The following code snippet is our function to read the URLs and pass them across the channel:
func readURLs(statusChannel chan int, textChannel chan string) {
	time.Sleep(time.Millisecond * 1)
	fmt.Println("Grabbing", len(urls), "urls")
	for i := 0; i < totalURLCount; i++ {
		fmt.Println("Url", i, urls[i])
		resp, err := http.Get(urls[i])
		if err != nil {
			// Still send the ping so the processed count can reach the total.
			fmt.Println("Could not fetch", urls[i], err)
			statusChannel <- 0
			continue
		}
		text, err := ioutil.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			fmt.Println("No HTML body")
		}
		textChannel <- string(text)
		statusChannel <- 0
	}
}
The readURLs function takes statusChannel and textChannel for communication and loops through the urls slice, sending the text of each page on textChannel and a simple ping on statusChannel. Next, let's look at the function that will append scraped text to the full text:
func addToScrapedText(textChannel chan string, processChannel chan bool) {
	for {
		select {
		case pC := <-processChannel:
			if pC == true {
				// hang on
			}
			if pC == false {
				close(textChannel)
				close(processChannel)
				// Stop here: receiving again from the closed channels
				// would yield false and attempt a second close.
				return
			}
		case tC := <-textChannel:
			fullText += tC
		}
	}
}
We use the addToScrapedText function to accumulate processed text and add it to a master text string. We also close our two primary channels when we get a kill signal on our processChannel. Let's take a look at the evaluateStatus() function:
func evaluateStatus(statusChannel chan int, textChannel chan string, processChannel chan bool) {
	for {
		select {
		case status := <-statusChannel:
			fmt.Print(urlsProcessed, totalURLCount)
			urlsProcessed++
			if status == 0 {
				fmt.Println("Got url")
			}
			if status == 1 {
				close(statusChannel)
			}
			if urlsProcessed == totalURLCount {
				fmt.Println("Read all top-level URLs")
				processChannel <- false
				applicationStatus = false
			}
		}
	}
}
At this juncture, all that the evaluateStatus function does is determine what's happening in the overall scope of the application. When we send a 0 (our aforementioned ping) through this channel, we increment our urlsProcessed variable. When we send a 1, it's a message that we can close the channel. Finally, let's look at the main function:
func main() {
	applicationStatus = true
	statusChannel := make(chan int)
	textChannel := make(chan string)
	processChannel := make(chan bool)
	totalURLCount = 0

	urls = append(urls, "http://www.mastergoco.com/index1.html")
	urls = append(urls, "http://www.mastergoco.com/index2.html")
	urls = append(urls, "http://www.mastergoco.com/index3.html")
	urls = append(urls, "http://www.mastergoco.com/index4.html")
	urls = append(urls, "http://www.mastergoco.com/index5.html")

	fmt.Println("Starting spider")

	urlsProcessed = 0
	totalURLCount = len(urls)

	go evaluateStatus(statusChannel, textChannel, processChannel)
	go readURLs(statusChannel, textChannel)
	go addToScrapedText(textChannel, processChannel)

	for {
		if applicationStatus == false {
			fmt.Println(fullText)
			fmt.Println("Done!")
			break
		}
		select {
		case sC := <-statusChannel:
			fmt.Println("Message on StatusChannel", sC)
		}
	}
}
This is a basic extrapolation of our last function, the capitalization function. However, each piece here is responsible for some aspect of reading URLs or appending its respective content to a larger variable.
In the following code, we created a sort of master loop that lets you know when a URL has been grabbed on statusChannel:
for {
	if applicationStatus == false {
		fmt.Println(fullText)
		fmt.Println("Done!")
		break
	}
	select {
	case sC := <-statusChannel:
		fmt.Println("Message on StatusChannel", sC)
	}
}
Often, you'll see this wrapped in go func() as part of a WaitGroup struct, or not wrapped at all (depending on the type of feedback you require).
Control flow, in this case, is handled by evaluateStatus, which works as a channel monitor that counts the status pings and signals when the work is complete. The readURLs function immediately begins reading our URLs, extracting the underlying data and passing it on to textChannel. At this point, our addToScrapedText function takes each sent HTML file and appends it to the fullText variable. When evaluateStatus determines that all URLs have been read, it sets applicationStatus to false. At this point, the infinite loop at the bottom of main() quits.
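Two properties of this flow are worth keeping in mind. A value sent on a channel is delivered to exactly one receiver, so when both main and evaluateStatus receive from statusChannel, each ping reaches only one of them. In addition, a flag such as applicationStatus is shared between goroutines without synchronization. One common alternative, sketched below under simplifying assumptions, is to give each channel a single receiver and signal completion by closing a done channel instead of polling a flag. The fetch function is a hypothetical stand-in for the HTTP request so that the example runs offline, and crawl is an illustrative name.

```go
package main

import (
	"fmt"
	"strings"
)

// fetch is a stand-in for the HTTP request so the sketch runs offline.
func fetch(url string) string {
	return "<html>" + url + "</html>"
}

// crawl sends one page per URL from a producer goroutine, while a single
// collector goroutine owns the accumulated text and closes done when it has
// received every page.
func crawl(urls []string) string {
	textChannel := make(chan string)
	done := make(chan struct{})
	var fullText strings.Builder

	// Collector: the only goroutine that writes to fullText.
	go func() {
		for i := 0; i < len(urls); i++ {
			fullText.WriteString(<-textChannel)
		}
		close(done)
	}()

	// Producer: fetches the URLs in order and sends each page.
	go func() {
		for _, u := range urls {
			textChannel <- fetch(u)
		}
	}()

	<-done // blocks until the collector has received every page
	return fullText.String()
}

func main() {
	urls := []string{"index1.html", "index2.html"}
	fmt.Println(crawl(urls))
	fmt.Println("Done!")
}
```

Because main reads the accumulated text only after done is closed, there is no shared flag to poll and no second receiver competing for messages.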
As mentioned, a crawler cannot get much more rudimentary than this, but seeing a real-world example of how goroutines can work in concert will set us up for safer and more complex examples in the coming chapters.