The whole process involves the following specific tasks:
- Specify the search keyword.
- Create a PubMed search query from the keyword.
- Perform the search, limiting it to the first 50 articles.
- Extract the abstract texts and store them in an object.
Here is the code to carry out the preceding tasks:
library(pubmed.mineR)
library(RISmed)

keyword <- "Deep Learning"
# Build the PubMed query and cap the result set at 50 records
search_query <- EUtilsSummary(keyword, retmax = 50)
summary(search_query)

# Fetch the matching records and pull out the fields of interest
extractedResult <- EUtilsGet(search_query)
pmid <- PMID(extractedResult)
years <- YearPubmed(extractedResult)
Jtitle <- Title(extractedResult)               # journal titles
articleTitle <- ArticleTitle(extractedResult)  # article titles
abstracts <- AbstractText(extractedResult)
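As a quick sanity check, you can inspect what came back before moving on. The following is a sketch that assumes the code above has run successfully and a network connection to PubMed is available; `QueryCount()` is RISmed's accessor for the total number of hits the query matched:

```r
# How many articles matched the query in total
QueryCount(search_query)

# Number of abstracts actually retrieved; should be at most 50 (retmax)
length(abstracts)

# Publication years of the retrieved set
table(years)

# First 200 characters of the first abstract
substr(abstracts[1], 1, 200)
```

If `length(abstracts)` is zero, revisit the keyword before attempting any pre-processing.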
Once you have the abstracts in your R session, the next step is to pre-process the text. Here are the steps for pre-processing:
- Convert all texts to either lowercase or uppercase.
- Remove punctuation from the text.
- Remove digits from the text.
- Remove stop words.
- Stem the words, i.e., reduce each word to its root form.
Before implementing these tasks, you should create a corpus of the text data. The whole process is implemented using functions from the tm library:
library(tm)

# Build a corpus, one document per abstract
AbstractCorpus <- Corpus(VectorSource(abstracts))
# Lowercase the text, then strip punctuation and digits
AbstractCorpus <- tm_map(AbstractCorpus, content_transformer(tolower))
AbstractCorpus <- tm_map(AbstractCorpus, removePunctuation)
AbstractCorpus <- tm_map(AbstractCorpus, removeNumbers)
# Remove English stop words
Stopwords <- stopwords('english')
AbstractCorpus <- tm_map(AbstractCorpus, removeWords, Stopwords)
# Reduce each word to its stem
AbstractCorpus <- tm_map(AbstractCorpus, stemDocument)
Once you have done all the initial processing, the final step is to create a term-document matrix. This is a large sparse matrix in which each row corresponds to a term and each column to a document; each entry records how many times that term occurs in that document (zero if it is absent). To get the term-document matrix, run the following code:
# wordLengths = c(1, Inf) keeps even one-letter terms
# (wordLengths is the current tm control option; minWordLength is legacy)
trmDocMat <- TermDocumentMatrix(AbstractCorpus,
                                control = list(wordLengths = c(1, Inf)))
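To see what the resulting matrix looks like, here is a minimal, self-contained sketch that uses a two-document toy corpus in place of the real abstracts; the documents are chosen so the counts are easy to verify by hand:

```r
library(tm)

# Toy corpus standing in for the preprocessed abstracts
docs <- c("deep learning models", "learning deep networks deep")
corpus <- Corpus(VectorSource(docs))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

m <- as.matrix(tdm)      # rows = terms, columns = documents
m["deep", ]              # occurrences of "deep" per document: 1 and 2

findFreqTerms(tdm, lowfreq = 2)      # terms occurring at least twice overall
sort(rowSums(m), decreasing = TRUE)  # overall term frequencies
```

On the real data, `inspect(trmDocMat)` prints the matrix dimensions and its sparsity, and `findFreqTerms()` is a convenient way to pull out the most common terms across all abstracts.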