Training Doc2vec on Yelp sentiment data

We use a random sample of 500,000 Yelp reviews (see Chapter 13, Working with Text Data) with their associated star ratings (see the notebook yelp_sentiment):

import pandas as pd

# Load only the star ratings and the review text
df = (pd.read_parquet('yelp_reviews.parquet', engine='fastparquet')
      .loc[:, ['stars', 'text']])

# Draw 100,000 reviews for each of the five star ratings
stars = range(1, 6)
sample = pd.concat([df[df.stars == s].sample(n=100000) for s in stars])

We apply simple preprocessing to remove stopwords and punctuation using NLTK's RegexpTokenizer, and drop reviews with 10 or fewer tokens:

import nltk
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Tokenize on word characters; this also strips punctuation
tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))

def clean(review):
    tokens = tokenizer.tokenize(review)
    return ' '.join([t for t in tokens if t not in stopword_set])

# Lowercase, clean, and keep only reviews with more than 10 tokens
sample.text = sample.text.str.lower().apply(clean)
sample = sample[sample.text.str.split().str.len() > 10]
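
With the reviews cleaned, we can train the Doc2vec model on the sample using gensim. The following is a minimal sketch assuming gensim >= 4; the hyperparameter values (vector_size, window, min_count, epochs, negative) are illustrative choices rather than tuned settings:

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# One TaggedDocument per review; the tag identifies its document vector
corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(sample.text)]

# dm=1 selects the distributed-memory (PV-DM) variant;
# passing documents trains the model on construction
model = Doc2Vec(documents=corpus,
                dm=1,
                vector_size=300,
                window=5,
                min_count=10,
                epochs=5,
                negative=5,
                workers=4)

# Trained document vectors are indexed by tag via model.dv (gensim >= 4)
review_vector = model.dv[0]

Each review's learned vector can then serve as a feature for a sentiment classifier, with the associated star rating as the label.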