Time for action – making a links extraction function

Sometimes it's handy to build and test a function in a separate stack, and then copy the finished function into your application stack. The following steps build a links extraction function:

Create a new Mainstack and save it, just to be safe!
Add a couple of fields and a button.

Set the button's script to this:

```
on mouseUp
  put url "http://www.runrev.com/" into field 1
  put getLinks(field 1) into field 2
end mouseUp
```

Edit the stack script and create the getLinks function. Start by returning whatever it was sent:
```
function getLinks pPageSource
  return pPageSource
end getLinks
```
If you try clicking on the button at this point, you will see that the whole page source appears in field 2.
We're going to use the filter command, and it needs the text to be in separate lines. So, we want every link to be on a line of its own. The replace command can do this nicely. Add these two lines to the script (before the "return" line, of course!):
```
  replace "/a>" with "/a>" & return in pPageSource
  replace "<a" with return & "<a" in pPageSource
```
Try clicking on the button now. The two fields will look much the same, but every link will now be on a line of its own.
Add a line that filters the list so that only the lines with links in them remain:
```
  filter pPageSource with "*a href*/a>"
```
The * characters are wildcards, so the filter keeps only the lines that contain both a href and /a>. Try the button again.
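To see the replace and filter steps in isolation, here is a minimal sketch that runs them on a single made-up line of markup. The handler name testFilter and the sample text are assumptions for illustration only, not part of the stack being built:
```livecode
-- Hypothetical test handler; tSample is made-up markup, not the real page source
on testFilter
  local tSample
  put "<p>Intro <a href=" & quote & "/products" & quote & ">Products</a> text</p>" into tSample
  -- break the text so that the link sits on a line of its own
  replace "/a>" with "/a>" & return in tSample
  replace "<a" with return & "<a" in tSample
  -- keep only the lines that contain both "a href" and end in "/a>"
  filter tSample with "*a href*/a>"
  -- the result appears in the message box; only the <a ...>...</a> line is left
  put tSample
end testFilter
```
Note that the pattern has no trailing *, so a line only matches if it ends with /a>, which is exactly what the first replace arranges.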
Now you'll see that there are only lines with links in them, but they still include the junk either side of the link itself. The part we need is between the first and second quote marks, and using the itemdelimiter, we can get at that bit. Add the following lines:
```
  set the itemdelimiter to quote
  repeat with a = 1 to the number of lines in pPageSource
    put item 2 of line a of pPageSource into line a of pPageSource
  end repeat
```
When you now click on the button, you should get a list of only the URL part of each line. Note, however, that most of the links start with / and not http.
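The item trick can be tried on a single link as a small sketch. The handler name testItemDelimiter and the sample link are hypothetical; the sketch also assumes that href is the first attribute in the tag and that its value is in double quotes, as the step above does:
```livecode
-- Hypothetical test handler; tLine is a made-up link, not real page source
on testItemDelimiter
  local tLine
  put "<a href=" & quote & "/products" & quote & ">Products</a>" into tLine
  set the itemdelimiter to quote
  -- item 1 is <a href=    item 2 is /products    item 3 is >Products</a>
  put item 2 of tLine
end testItemDelimiter
```
If another quoted attribute comes before href, item 2 would be that attribute's value instead, so real pages may need extra checks.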
Make another function in the stack script that will change each link to a full path:
```
function getPath pPageURL,pLinkURL
end getPath
```
Now, add the code needed to cope with the variations of URL (to function getPath), starting with the case where the link is already a full path:
```
  if pLinkURL contains "://" then
    return pLinkURL
  end if
```
If you recall from earlier, we saved the URL of the main page in a global variable, gPageURL. For the case where the link is root-relative (it starts with a /), we want to combine the host location and the link URL:
```
  set the itemdelimiter to "/"
  if char 1 of pLinkURL is "/" then
    return item 1 to 3 of pPageURL & pLinkURL
  else
```

When that first character is not /, the link may start with ../ to step up one level in the server structure. Deleting the last two parts of the page URL (the file name and the folder it is in) will give us what we need to combine with the link URL:

```
    if char 1 to 3 of pLinkURL is "../" then
      delete the last item of pPageURL
      delete the last item of pPageURL
      delete char 1 to 2 of pLinkURL
      return pPageURL & pLinkURL
    else
```
For other cases, we combine the page URL and the link URL:
```
      delete the last item of pPageURL
      return pPageURL & "/" & pLinkURL
    end if
  end if
```

Lastly, if all of these checks fail, we will return an empty string, so that a strangely structured link URL doesn't go on to confuse us later:
```
  return ""
end getPath
```
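
As a quick check of the four cases, the following sketch calls getPath with made-up URLs. It assumes getPath is in the stack script as built above; the handler name testGetPath and the example.com and example.org addresses are hypothetical:
```livecode
-- Hypothetical test handler; the URLs are made up for illustration
on testGetPath
  local tPage, tResult
  put "http://www.example.com/docs/guide/index.html" into tPage
  -- already a full path, returned unchanged:
  -- http://www.example.org/a.html
  put getPath(tPage, "http://www.example.org/a.html") & return into tResult
  -- root-relative, joined to items 1 to 3 of the page URL:
  -- http://www.example.com/about.html
  put getPath(tPage, "/about.html") & return after tResult
  -- one level up, so the file name and its folder are removed:
  -- http://www.example.com/docs/faq.html
  put getPath(tPage, "../faq.html") & return after tResult
  -- same folder, so only the file name is replaced:
  -- http://www.example.com/docs/guide/contact.html
  put getPath(tPage, "contact.html") after tResult
  put tResult
end testGetPath
```
Because pPageURL and pLinkURL are passed by value, the deletions inside getPath do not change tPage in the calling handler.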

To make the getLinks function use the getPath function, we need to change the repeat loop that we added to getLinks earlier:
```
  repeat with a = 1 to the number of lines in pPageSource
    put getPath(gPageURL,item 2 of line a of pPageSource) into line a of pPageSource
  end repeat
```

What just happened?

In stages, we developed a function that finds the links in a web page's source text, ending with a set of full path URLs that we can present to the user.

The missing links

The one missing piece in the test stack is the global variable that stores the page URL. In the case of the app stack, the value is provided by the browser control's browserFinishedLoading handler, but here, we need to plug in a value for testing purposes.

Place a global declaration line in the button's script and the stack script. In the button script, fill in the variable with our test case value. The script will then be like this:

```
global gPageURL

on mouseUp
  put "http://www.runrev.com/" into gPageURL
  put url gPageURL into field 1
  put getLinks(field 1) into field 2
end mouseUp
```

Try the button now. You should see a list of full path URLs in your second field. If it works correctly, copy the two stack functions and the global declaration line and paste them into the stack script of the WebScraper stack.
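
For reference, the test stack's script at this point should look roughly like the following sketch. It is assembled from the fragments in the steps above; the comments are additions, and your own script may differ in layout:
```livecode
global gPageURL

function getLinks pPageSource
  -- put every link on a line of its own
  replace "/a>" with "/a>" & return in pPageSource
  replace "<a" with return & "<a" in pPageSource
  -- keep only the lines that contain a link
  filter pPageSource with "*a href*/a>"
  -- the URL sits between the first and second quote marks
  set the itemdelimiter to quote
  repeat with a = 1 to the number of lines in pPageSource
    put getPath(gPageURL,item 2 of line a of pPageSource) into line a of pPageSource
  end repeat
  return pPageSource
end getLinks

function getPath pPageURL,pLinkURL
  -- already a full path
  if pLinkURL contains "://" then
    return pLinkURL
  end if
  set the itemdelimiter to "/"
  if char 1 of pLinkURL is "/" then
    -- root-relative: keep only the scheme and host of the page URL
    return item 1 to 3 of pPageURL & pLinkURL
  else
    if char 1 to 3 of pLinkURL is "../" then
      -- one level up: drop the file name and its folder
      delete the last item of pPageURL
      delete the last item of pPageURL
      delete char 1 to 2 of pLinkURL
      return pPageURL & pLinkURL
    else
      -- same folder: replace the file name with the link
      delete the last item of pPageURL
      return pPageURL & "/" & pLinkURL
    end if
  end if
  return ""
end getPath
```
The itemdelimiter is local to the handler that sets it, so getPath switching it to "/" does not disturb the quote delimiter that the getLinks loop relies on.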

One more thing…

The tab bar script includes an init line. This will call the card script; in this case, the Links card script, but it doesn't exist yet! Let's make it.
