Time for action – making a links extraction function

Sometimes it's handy to build and test a function in a separate stack, and then copy the finished function into your application stack. The following steps walk through making a links extraction function:

  1. Create a new Mainstack and save it, just to be safe!
  2. Add a couple of fields and a button.
  3. Set the button's script to this:
    on mouseUp
      put url "http://www.runrev.com/" into field 1
      put getLinks(field 1) into field 2
    end mouseUp
  4. Edit the stack script and create a function for getLinks. Start by simply returning what it was sent:
    function getLinks pPageSource
      return pPageSource
    end getLinks
  5. If you try clicking on the button at this point, you will see that the whole page source appears in field 2.
  6. We're going to use the filter command, which works on separate lines of text, so we want every link to be on a line of its own. The replace command can do this nicely. Add these two lines to the script (before the return line, of course!):
      replace "/a>" with "/a>" & return in pPageSource
      replace "<a" with return & "<a" in pPageSource
  7. Try clicking on the button now. The two fields will look much the same, but every link will now be on a line of its own.
  8. Add a line to filter the list, as it stands, to reduce it so that it shows only the lines with links in them:
      filter pPageSource with "*a href*/a>"
  9. The * characters are wildcards, so the filter reduces the list to only those lines that contain a href and end with /a>. Try the button again.
  10. Now you'll see that there are only lines with links in them, but they still include the junk either side of the link itself. The part we need is between the first and second quote marks, and using the itemdelimiter, we can get at that bit. Add the following lines:
      set the itemdelimiter to quote
      repeat with a = 1 to the number of lines in pPageSource
        put item 2 of line a of pPageSource into line a of pPageSource
      end repeat
  11. When you click on the button now, you should get a list of only the URL part of each line. However, note that most of the links start with / and not http.
  12. Make another function in the stack script that will convert the links to full path URLs:
    function getPath pPageURL,pLinkURL
    end getPath
  13. Now, add the code needed to cope with the variations of URL (to function getPath), starting with the case where the link is already a full path:
      if pLinkURL contains "://" then
        return pLinkURL
      end if
  14. If you recall from earlier, we saved the URL of the main page in a global variable, gPageURL. For the case where the link is root relative (it starts with a /), we want to combine the host location and the link URL:
      set the itemdelimiter to "/"
      if char 1 of pLinkURL is "/" then
        return item 1 to 3 of pPageURL & pLinkURL
      else
  15. When that first character is not /, it may start with ../ to step up one level in the server structure. Deleting the last part of the page URL will give us what we need to combine with the link URL:
        if char 1 to 3 of pLinkURL is "../" then
          delete the last item of pPageURL
          delete the last item of pPageURL
          delete char 1 to 2 of pLinkURL
          return pPageURL & pLinkURL
        else
    For other cases, we combine the page URL and the link URL:
          delete the last item of pPageURL
          return pPageURL & "/" & pLinkURL
        end if
      end if
  16. Lastly, as a safety net in case none of these checks returns a value, we will return an empty string, so that a strangely structured link URL doesn't go on to confuse us later:
      return ""
    end getPath
  17. To get the getLinks function to use the getPath function, we need to make a change to the repeat loop shown in step 10:
      repeat with a = 1 to the number of lines in pPageSource
        put getPath(gPageURL,item 2 of line a of pPageSource) into line a of pPageSource
      end repeat
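
Put together, the steps above should leave the stack script looking roughly like the following. This is only an assembled sketch of the lines given in the steps, with comments added; the global declaration at the top is there because getLinks reads gPageURL.

```livecode
global gPageURL -- URL of the page whose source is being scanned

function getLinks pPageSource
  -- put every link on a line of its own
  replace "/a>" with "/a>" & return in pPageSource
  replace "<a" with return & "<a" in pPageSource
  -- keep only the lines that hold a link
  filter pPageSource with "*a href*/a>"
  -- the URL sits between the first and second quote marks
  set the itemdelimiter to quote
  repeat with a = 1 to the number of lines in pPageSource
    put getPath(gPageURL,item 2 of line a of pPageSource) into line a of pPageSource
  end repeat
  return pPageSource
end getLinks

function getPath pPageURL,pLinkURL
  -- already a full path
  if pLinkURL contains "://" then
    return pLinkURL
  end if
  set the itemdelimiter to "/"
  if char 1 of pLinkURL is "/" then
    -- root relative: protocol and host, then the link
    return item 1 to 3 of pPageURL & pLinkURL
  else
    if char 1 to 3 of pLinkURL is "../" then
      -- step up one level: drop the page name and its folder
      delete the last item of pPageURL
      delete the last item of pPageURL
      delete char 1 to 2 of pLinkURL
      return pPageURL & pLinkURL
    else
      -- relative to the page's own folder
      delete the last item of pPageURL
      return pPageURL & "/" & pLinkURL
    end if
  end if
  return ""
end getPath
```

The itemdelimiter is local to the handler that sets it, so changing it to "/" inside getPath does not disturb the quote delimiter that getLinks is using in its loop.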

What just happened?

In stages, we developed a function that can find the links in a web page's source text, ending with a set of full path URLs that we can present to the user.
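
To see the replace, filter, and itemdelimiter stages at work without loading a real page, something like the following could be placed in a spare button. The sample HTML and the variable name tSource are made up for illustration, and the handler assumes the stack still has a field 2.

```livecode
on mouseUp
  -- a small, invented piece of page source holding two links
  put "<p>See <a href=" & quote & "/products" & quote & ">Products</a>" into tSource
  put " and <a href=" & quote & "http://www.example.com/" & quote & ">Example</a>.</p>" after tSource

  -- each link ends up on a line of its own
  replace "/a>" with "/a>" & return in tSource
  replace "<a" with return & "<a" in tSource

  -- only the two lines holding links survive the filter
  filter tSource with "*a href*/a>"

  -- item 2 is the text between the first pair of quote marks
  set the itemdelimiter to quote
  repeat with a = 1 to the number of lines in tSource
    put item 2 of line a of tSource into line a of tSource
  end repeat

  -- field 2 should now hold two lines:
  --   /products
  --   http://www.example.com/
  put tSource into field 2
end mouseUp
```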

The missing links

The one missing piece in the test stack is the global variable that stores the page URL. In the case of the app stack, the value is provided by the browser control's browserFinishedLoading handler, but here, we need to plug in a value for testing purposes.

Place a global declaration line in the button's script and the stack script. In the button script, fill in the variable with our test case value. The script will then be like this:

global gPageURL

on mouseUp
  put "http://www.runrev.com/" into gPageURL
  put url gPageURL into field 1
  put getLinks(field 1) into field 2
end mouseUp

Try the button now; you should see a list of full path URLs in your second field. If it works correctly, copy the two stack functions and the global declaration line, and paste them into the stack script of the WebScraper stack.
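
Before copying the functions across, it may be worth checking getPath on its own against each kind of link. The following is a minimal sketch for a second test button; the example.com addresses and the variable names tPage and tResult are assumptions made up for the test, and getPath is expected to be in the stack script as built in the steps above.

```livecode
on mouseUp
  -- an invented page address, two folders deep
  put "http://www.example.com/docs/guide/index.html" into tPage

  -- full path: returned unchanged
  put getPath(tPage,"http://other.example.com/a.html") & return into tResult
  -- root relative: http://www.example.com/images/logo.png
  put getPath(tPage,"/images/logo.png") & return after tResult
  -- up one level: http://www.example.com/docs/faq.html
  put getPath(tPage,"../faq.html") & return after tResult
  -- same folder: http://www.example.com/docs/guide/intro.html
  put getPath(tPage,"intro.html") after tResult

  put tResult into field 2
end mouseUp
```

Parameters are passed by value, so the deletions getPath makes to pPageURL do not alter tPage between calls.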

One more thing…

The tab bar script includes an init line, which calls a handler in the card script. In this case, that is the Links card script, but it doesn't exist yet! Let's make it.
