Sometimes it's handy to develop and test a function in a separate stack and then move the finished function into your application stack. The following steps build a links extraction function this way:
In a test stack with two fields and a button, set the button's script to this:

   on mouseUp
      put url "http://www.runrev.com/" into field 1
      put getLinks(field 1) into field 2
   end mouseUp
In the stack script, create a function named getLinks. Start with returning what it has been sent:

   function getLinks pPageSource
      return pPageSource
   end getLinks
Each link needs to be on a line of its own, and the replace command can do this nicely. Add these two lines to the script (before the "return" line, of course!):

   replace "/a>" with "/a>" & return in pPageSource
   replace "<a" with return & "<a" in pPageSource
Next, add a line that filters the list:

   filter pPageSource with "*a href*/a>"

The * characters are wildcards, so the filter reduces the list to only the lines that contain both a href and /a>. Try the button again.

The part of each line that we need sits between the first and second quote marks. By using the itemdelimiter, we can get at that bit. Add the following lines:

   set the itemdelimiter to quote
   repeat with a = 1 to the number of lines in pPageSource
      put item 2 of line a of pPageSource into line a of pPageSource
   end repeat
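To see what the item 2 step does, it may help to try it on a single line. The following is a minimal sketch: the handler name testOneLine and the sample link text are made up for illustration, and it assumes the test stack has a second field to show the result.

```livecode
on testOneLine
   local tLine
   -- a made-up line of the kind that is left after the filter step
   put "<a href=" & quote & "/products/" & quote & ">Products</a>" into tLine
   -- with quote as the delimiter, item 1 is <a href= and item 2 is the URL
   set the itemdelimiter to quote
   put item 2 of tLine into field 2 -- field 2 shows /products/
end testOneLine
```

Everything before the first quote mark and after the second one is discarded, which is why only the URL part of each line survives the repeat loop.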
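The links in the list may start with / and not http. Create a new function in the stack script that will turn such links into full path URLs:

   function getPath pPageURL,pLinkURL
   end getPath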
Now add code to getPath to cover the different kinds of link URL, starting with a link that is already a full path:

   if pLinkURL contains "://" then
      return pLinkURL
   end if
The URL of the page itself is stored in the global variable gPageURL. For the case where the link is root relative (it starts with a /), we want to combine the host location and the link URL:

   set the itemdelimiter to "/"
   if char 1 of pLinkURL is "/" then
      return item 1 to 3 of pPageURL & pLinkURL
   else
If the link does not start with /, it may start with ../ to step up one level in the server structure. Deleting the last two parts of the page URL (the page's file name and its folder) will give us what we need to combine with the link URL:

      if char 1 to 3 of pLinkURL is "../" then
         delete the last item of pPageURL
         delete the last item of pPageURL
         delete char 1 to 2 of pLinkURL
         return pPageURL & pLinkURL
      else

For other cases we combine the page URL and the link URL:

         delete the last item of pPageURL
         return pPageURL & "/" & pLinkURL
      end if
   end if

Finish the function with a fallback that returns an empty string:

   return ""
end getPath
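Put together, the fragments above might look like the following sketch. The testGetPath handler and its example.com URLs are hypothetical and only there to exercise each branch. The expected results in the comments assume a page URL that ends in a file name, as in the sample.

```livecode
function getPath pPageURL, pLinkURL
   -- a link that is already a full path is returned unchanged
   if pLinkURL contains "://" then
      return pLinkURL
   end if
   set the itemdelimiter to "/"
   if char 1 of pLinkURL is "/" then
      -- root relative: items 1 to 3 are "http:", "" and the host name
      return item 1 to 3 of pPageURL & pLinkURL
   else
      if char 1 to 3 of pLinkURL is "../" then
         -- step up one level: drop the file name, then its folder
         delete the last item of pPageURL
         delete the last item of pPageURL
         delete char 1 to 2 of pLinkURL
         return pPageURL & pLinkURL
      else
         -- same folder as the page: drop only the file name
         delete the last item of pPageURL
         return pPageURL & "/" & pLinkURL
      end if
   end if
   return ""
end getPath

-- hypothetical test handler with made-up URLs
on testGetPath
   local tPage
   put "http://www.example.com/docs/index.html" into tPage
   -- http://www.example.org/a.html
   put getPath(tPage, "http://www.example.org/a.html") into line 1 of field 2
   -- http://www.example.com/about.html
   put getPath(tPage, "/about.html") into line 2 of field 2
   -- http://www.example.com/news.html
   put getPath(tPage, "../news.html") into line 3 of field 2
   -- http://www.example.com/docs/faq.html
   put getPath(tPage, "faq.html") into line 4 of field 2
end testGetPath
```

Because every branch of the if structure returns a value, the final return "" acts only as a safety net.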
To get the getLinks function to use the getPath function, we need to make a change to the repeat loop that we added earlier:

   repeat with a = 1 to the number of lines in pPageSource
      put getPath(gPageURL,item 2 of line a of pPageSource) into line a of pPageSource
   end repeat
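With that change in place, the whole getLinks function might look like the sketch below. It assumes that the getPath function developed above is in the same stack script, and that gPageURL is declared as a global there and has been given a value before getLinks is called.

```livecode
global gPageURL -- holds the URL of the page whose source is being scanned

function getLinks pPageSource
   -- put every link on a line of its own
   replace "/a>" with "/a>" & return in pPageSource
   replace "<a" with return & "<a" in pPageSource
   -- keep only the lines that contain a link
   filter pPageSource with "*a href*/a>"
   -- the URL is the text between the first and second quote marks
   set the itemdelimiter to quote
   repeat with a = 1 to the number of lines in pPageSource
      put getPath(gPageURL, item 2 of line a of pPageSource) into line a of pPageSource
   end repeat
   return pPageSource
end getLinks
```

The itemdelimiter is local to the handler that sets it, so the "/" delimiter used inside getPath does not disturb the quote delimiter used here.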
Working in stages, we developed a function that finds the links in a web page's source text and ends up with a set of full path URLs that we can present to the user.
The one missing piece in the test stack is the global variable that stores the page URL. In the case of the app stack, the value is provided by the browser control's browserFinishedLoading handler, but here, we need to plug in a value for testing purposes.
Place a global declaration line in the button's script and the stack script. In the button script, fill in the variable with our test case value. The script will then be like this:
   global gPageURL

   on mouseUp
      put "http://www.runrev.com/" into gPageURL
      put url gPageURL into field 1
      put getLinks(field 1) into field 2
   end mouseUp
Try the button now. You should see a list of full path URLs in your second field. If it works correctly, copy the two stack functions and the global declaration line, and paste them into the stack script of the WebScraper stack.