We have the URLs for all of our stories, but unfortunately, that isn't enough to train on; we'll need the full article body. Extracting that ourselves could become a huge challenge if we roll our own scraper, especially if we're pulling stories from dozens of sites: we would need to write code that targets the article body while carefully avoiding all the other site gunk that surrounds it. Fortunately, there are a number of free services that will do this for us. I'm going to use Embedly here, but there are several other services you could use instead.
The first step is to sign up for Embedly API access. You can do that at https://app.embed.ly/signup. It is a straightforward process. Once you confirm your registration, you will receive an API key. That's really all you'll need. You'll just use that key in your HTTP request. Let's do that now:
import json
import urllib.parse

import requests

EMBEDLY_KEY = 'your_embedly_api_key_here'

def get_html(x):
    try:
        qurl = urllib.parse.quote(x)
        rhtml = requests.get('https://api.embedly.com/1/extract?url='
                             + qurl + '&key=' + EMBEDLY_KEY)
        ctnt = json.loads(rhtml.text).get('content')
    except Exception:
        # network errors or unparseable responses: skip this story
        return None
    return ctnt

df.loc[:, 'html'] = df['urls'].map(get_html)  # assumes the URL column is named 'urls'
df
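As an aside, the `urllib.parse.quote` call is what makes the story URL safe to embed as a query parameter in the Embedly request. A quick sketch of what it does (the example URL here is made up):

```python
import urllib.parse

# quote() percent-encodes characters that aren't safe inside a URL
# component; by default it leaves '/' alone, so the ':' and the space
# are the visible changes here.
encoded = urllib.parse.quote('https://example.com/a story')
print(encoded)
# https%3A//example.com/a%20story
```

Alternatively, `requests.get` accepts a `params` dictionary and performs this encoding for you, which avoids building the query string by hand.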
The preceding code results in the following output:
And with that, we have the HTML of each story.
Since the content is embedded in HTML markup, and we want to feed plain text into our model, we'll use a parser to strip out the markup tags:
from bs4 import BeautifulSoup

def get_text(x):
    if x is None:
        # get_html returns None on failure, so pass that through
        return None
    soup = BeautifulSoup(x, 'html5lib')
    text = soup.get_text()
    return text

df.loc[:, 'text'] = df['html'].map(get_text)
df
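If you're curious what `get_text` is doing under the hood, the same idea can be sketched with just the standard library. This is only an illustration, not the book's code: the `TextExtractor` class and `strip_tags` helper are hypothetical names, and BeautifulSoup handles far more malformed HTML than this does.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script/style tags."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    return ''.join(parser.parts)

print(strip_tags('<p>Hello <b>world</b></p><script>var x;</script>'))
# Hello world
```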
The preceding code results in the following output:
And with that, we have our training set ready. We can now move on to a discussion of how to transform our text into something that a model can work with.