Feature Extraction: Text Data

Many machine learning problems use text as explanatory variables. Raw text cannot be used directly to train a model, however, so it must be transformed into a representation that encodes as much of its meaning as possible in a feature vector.

The Bag-of-words model:

The most commonly used representation of text is the bag-of-words model. It treats a document as an unordered "bag" of the words it contains, ignoring syntax, word order, and grammar, and creates one feature for each word that appears in the text. The representation is based on the intuition that documents containing similar words generally have similar meanings, and it can be used effectively for document classification.

Some terminology (made concrete in the sketch after this list):

  • Corpus: A collection of documents.
  • Vocabulary: All unique words in a corpus.
  • Dimension: The number of elements that comprise a feature vector.
  • Dictionary: A mapping from the vocabulary to indices in the feature vector.
  • Tokenization: The process of splitting a string into tokens (meaningful sequences of characters, typically words).
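
To make these terms concrete, here is a minimal hand-rolled sketch in plain Python (it simply lowercases and splits on whitespace rather than using scikit-learn's tokenizer), applied to the two-document corpus used in the example further below:

>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game']  # a corpus of two documents

>> tokens = [document.lower().split() for document in corpus]  # tokenization of each document

>> vocabulary = sorted({word for document in tokens for word in document})  # all unique words in the corpus

>> dictionary = {word: index for index, word in enumerate(vocabulary)}  # maps each word to its index in the feature vector

>> print(len(vocabulary))  # the dimension of the feature vectors

8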

In the bag-of-words representation, each element in the feature vector is a binary value that represents whether or not the corresponding word appeared in the document.

In Python, scikit-learn's CountVectorizer class produces a bag-of-words representation from a collection of strings. By default, CountVectorizer tokenizes using a regular expression that splits strings on whitespace and extracts sequences of characters that are two or more characters in length.
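
To see this tokenization in isolation, here is a small sketch using build_analyzer(), which returns the preprocessing-and-tokenization callable that CountVectorizer applies to each document (with the default settings the text is also lowercased, and one-character tokens such as 'a' are dropped):

>> from sklearn.feature_extraction.text import CountVectorizer

>> analyzer = CountVectorizer().build_analyzer()

>> print(analyzer('The dog ate a sandwich'))

['the', 'dog', 'ate', 'sandwich']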

Python implementation:

>> from sklearn.feature_extraction.text import CountVectorizer

>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game']

>> vectorizer = CountVectorizer(binary=True)

>> print(vectorizer.fit_transform(corpus).todense())

>> print (vectorizer.vocabulary_)

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]

{u'duke': 1, u'basketball': 0, u'lost': 4, u'played': 5, u'game': 2, u'unc': 7, u'in': 3, u'the': 6}

Here we have used the bag-of-words representation to create feature vectors that encode whether or not each word from the corpus's vocabulary appears in a document. These features encode neither grammar, nor word order, nor word frequencies. Intuitively, though, the frequency with which a word occurs in a document also carries information about how similar documents are.

So, instead of using a binary value for each element of the feature vector, we can use an integer that represents the number of times the word appears in the document. Note that raw counts are only comparable when the documents in the corpus have similar lengths. In the example below we also pass stop_words='english' so that common function words such as 'the' and 'and' are excluded from the vocabulary.

Python implementation:

>> from sklearn.feature_extraction.text import CountVectorizer

>> corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']

>> vectorizer = CountVectorizer(stop_words='english', binary=False)

>> print(vectorizer.fit_transform(corpus).todense())

>> print(vectorizer.vocabulary_)

[[2 1 3 1 1]]

{u'sandwich': 2, u'wizard': 4, u'dog': 1, u'transfigured': 3, u'ate': 0}

Another problem with this representation is that the feature vectors assign large weights to terms that occur frequently in a document, even when those terms occur frequently in most documents of the corpus. Such terms do not help to represent the meaning of a particular document relative to the rest of the corpus. For example, most documents in a corpus of articles about the Indian cricket team will contain words like cricket, players, and batsmen. These corpus-specific words cannot be used to measure the similarity between documents in the corpus.

The inverse document frequency (IDF) is a measure of how rare or common a word is across a corpus: it grows with the ratio of the total number of documents to the number of documents that contain the word, so ubiquitous words receive low weights and rare words receive high ones. The TfidfVectorizer class combines term frequencies with IDF weights and, by default, L2-normalizes each document's feature vector.

Python implementation:

>> from sklearn.feature_extraction.text import TfidfVectorizer

>> corpus = ['The dog ate sandwich and I ate a sandwich', 'The wizard transfigured a sandwich']

>> vectorizer = TfidfVectorizer(stop_words='english')

>> print(vectorizer.fit_transform(corpus).todense())

[[0.754  0.377  0.536  0.     0.   ]
 [0.     0.     0.449  0.632  0.632]]

In this representation, words that are common to many of the documents in the corpus, such as sandwich, have been penalized relative to words that appear in fewer documents.
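
The learned IDF weights make this penalty visible. As a small check, continuing from the fitted vectorizer above (get_feature_names_out() requires a recent scikit-learn release; by default scikit-learn uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t):

>> print(vectorizer.get_feature_names_out())

['ate' 'dog' 'sandwich' 'transfigured' 'wizard']

>> print(vectorizer.idf_.round(3))

[1.405 1.405 1.    1.405 1.405]

Here sandwich, which appears in both documents, receives the lowest IDF weight, while terms that appear in only one document are weighted more heavily.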

 

Follow for more updates! Happy reading!

 

 
