Search engine

PyStemmer 1.0.1 provides the Snowball stemming algorithms, which are used in information retrieval tasks and for constructing search engines. It includes the Porter stemming algorithm along with stemmers for many other languages, including most European languages.
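To make the idea of stemming concrete, here is a toy suffix-stripping function. It is only a sketch of what a stemmer does; the function name and suffix rules are illustrative and are not part of PyStemmer (the real Porter/Snowball algorithms are far more careful):

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustration only;
    the real Porter/Snowball algorithms use much richer rules)."""
    for suffix in ("ingly", "edly", "ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tokens that share a stem are treated as one index term:
print(toy_stem("connected"))   # connect
print(toy_stem("connecting"))  # connect
print(toy_stem("cats"))        # cat
```

A real stemmer also handles exceptions and recoding steps, which is why libraries such as PyStemmer are preferred in practice.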

We can construct a vector space search engine by converting the texts into vectors.

The following are the steps involved in constructing a vector space search engine:

  1. Consider the following code for the removal of stopwords and tokenization:

    A stemmer is a program that accepts words and reduces them to their stems. Tokens that share a stem usually have similar meanings. Stopwords are also eliminated from the text.

    def eliminatestopwords(self, words):
        """
        Eliminate words which occur often and carry little significance
        from the context point of view.
        """
        return [word for word in words if word not in self.stopwords]

    def tokenize(self, string):
        """
        Split the text into tokens and stem each token.
        """
        string = self.clean(string)
        words = string.split(" ")
        return [self.stemmer.stem(word, 0, len(word) - 1) for word in words]
  2. Consider the following code for mapping keywords into vector dimensions:

    def obtainvectorkeywordindex(self, documentList):
        """
        Generate, for each keyword, its element position in the
        document vectors.
        """
        # Map the documents into a single string
        vocabstring = " ".join(documentList)
        vocablist = self.parser.tokenize(vocabstring)
        # Eliminate common words that have no search significance
        vocablist = self.parser.eliminatestopwords(vocablist)
        uniqueVocablist = util.removeDuplicates(vocablist)

        vectorIndex = {}
        offset = 0
        # Attach a position to each keyword; this position is the
        # dimension used to represent the token
        for word in uniqueVocablist:
            vectorIndex[word] = offset
            offset += 1
        return vectorIndex  # (keyword: position)
  3. Here, a simple term count model is used. Consider the following code for the conversion of text strings into vectors:

    def constructVector(self, wordString):
        # Initialise the vector with zeros
        vector = [0] * len(self.vectorKeywordIndex)
        tokList = self.parser.tokenize(wordString)
        tokList = self.parser.eliminatestopwords(tokList)
        for word in tokList:
            # Simple term count model
            vector[self.vectorKeywordIndex[word]] += 1
        return vector
  4. We search for similar documents by computing the cosine of the angle between their document vectors. If the cosine value is 1, the angle is 0 degrees and the vectors are parallel, which means the documents are related. If the cosine value is 0, the angle is 90 degrees and the vectors are perpendicular, which means the documents are unrelated. Let's see the code for computing the cosine between the text vectors using NumPy's dot and norm:

    from numpy import dot
    from numpy.linalg import norm

    def cosine(vec1, vec2):
        """
        cosine = (X * Y) / (||X|| * ||Y||)
        """
        return float(dot(vec1, vec2) / (norm(vec1) * norm(vec2)))
  5. We perform the mapping of keywords to the vector space. We construct a temporary text that represents the items to be searched and then compare it with the document vectors using the cosine measure. Let's see the following code for searching the vector space:

    def searching(self, searchinglist):
        """
        Search for texts that match the given list of search items.
        """
        askVector = self.buildQueryVector(searchinglist)
        ratings = [util.cosine(askVector, textVector)
                   for textVector in self.documentVectors]
        ratings.sort(reverse=True)
        return ratings
  6. We will now consider the following code, which can be used for detecting the language of a source text:

    import sys

    try:
        from nltk import wordpunct_tokenize
        from nltk.corpus import stopwords
    except ImportError:
        print('An error has occurred')
        sys.exit(1)

    #----------------------------------------------------------------------
    def _calculate_languages_ratios(text):
        """
        Compute a score for the given text against each language and
        return a dictionary that looks like
        {'german': 2, 'french': 4, 'english': 1}.
        """
        languages_ratios = {}
        # nltk.wordpunct_tokenize() splits all punctuation into separate tokens:
        # wordpunct_tokenize("I hope you like the book interesting .")
        # ['I', 'hope', 'you', 'like', 'the', 'book', 'interesting', '.']
        tokens = wordpunct_tokenize(text)
        words = [word.lower() for word in tokens]

        # Count the occurrence of each language's unique stopwords in the text
        for language in stopwords.fileids():
            stopwords_set = set(stopwords.words(language))
            words_set = set(words)
            common_elements = words_set.intersection(stopwords_set)
            languages_ratios[language] = len(common_elements)  # language "score"
        return languages_ratios

    #----------------------------------------------------------------------
    def detect_language(text):
        """
        Compute the score of the given text for each language and return the
        highest-scoring one. It uses the stopwords-based approach: it counts
        the unique stopwords of each language that occur in the analyzed text.
        """
        ratios = _calculate_languages_ratios(text)
        most_rated_language = max(ratios, key=ratios.get)
        return most_rated_language

    if __name__ == '__main__':
        text = '''
    All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things. As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a
    new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another.
    The word 'religion' does not have a derogatory meaning - A literal meaning of religion is 'A
    personal or institutionalized system grounded in belief in a God or Gods and the activities connected
    with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good
    life'. It has never been the purpose of religion to divide people into groups of isolated followers that
    cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties
    and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the
    name of religion to polarize Hindus votes. Muslim fanatics in Bangladesh retaliated and destroyed a
    number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with
    the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Bangladeshi Doctor-cum-Writer in her controversial novel 'Lajja' (1993) in which, she seems to utilizes fiction's mass emotional appeal, rather than its potential for nuance and universality.
    '''
        language = detect_language(text)
        print(language)

The preceding code counts stopword matches and detects the language of the text, which in this case is English.
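The detection approach can also be sketched without the NLTK corpora by using tiny hand-made stopword sets. The sets below are illustrative assumptions only, far smaller than NLTK's real stopword lists:

```python
# Tiny illustrative stopword sets (NLTK's real lists are much larger)
STOPWORDS = {
    'english': {'the', 'is', 'and', 'of', 'to', 'in', 'that'},
    'french':  {'le', 'la', 'et', 'de', 'est', 'que', 'dans'},
    'german':  {'der', 'die', 'und', 'ist', 'das', 'zu', 'auf'},
}

def detect_language(text):
    """Score each language by how many of its stopwords occur in the
    text, and return the highest-scoring language."""
    words = {w.lower().strip('.,!?') for w in text.split()}
    ratios = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(ratios, key=ratios.get)

print(detect_language("The book is interesting and easy to read."))  # english
```

The scoring is identical in spirit to `_calculate_languages_ratios`: the language whose stopwords overlap most with the text wins.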

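Pulling steps 1 through 5 together, a minimal self-contained vector space search engine might look like the following sketch. Stemming is omitted, `cosine` is written in pure Python, and all names here are illustrative rather than the classes used above:

```python
import math

STOPWORDS = {'the', 'a', 'an', 'is', 'of', 'and', 'to'}

def tokenize(text):
    """Lower-case and split; drop stopwords (no stemming in this sketch)."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def build_index(documents):
    """Map each unique keyword to a vector dimension."""
    vocab = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)
    return {word: i for i, word in enumerate(vocab)}

def construct_vector(text, index):
    """Simple term count model."""
    vector = [0] * len(index)
    for word in tokenize(text):
        if word in index:
            vector[index[word]] += 1
    return vector

def cosine(v1, v2):
    """cosine = (X * Y) / (||X|| * ||Y||), with a zero-vector guard."""
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return sum(a * b for a, b in zip(v1, v2)) / (norm1 * norm2)

def search(query, documents):
    """Rank document indices by cosine similarity to the query."""
    index = build_index(documents)
    doc_vectors = [construct_vector(d, index) for d in documents]
    q = construct_vector(query, index)
    return sorted(range(len(documents)),
                  key=lambda i: cosine(q, doc_vectors[i]),
                  reverse=True)

docs = ["stemming helps search engines",
        "cosine similarity compares vectors",
        "search engines rank documents"]
print(search("search engines", docs))  # documents about search engines rank first
```

In a real system, the tokenizer would also stem each token (as in step 1), so that "engine" and "engines" fall into the same vector dimension.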