Questions on NLP

Chinmayi Sahu
Mar 8, 2021

Why should we convert text to a vector?

  1. Computers understand numerical values
  2. Once text is converted to a vector, we can leverage the power of linear algebra

What rules need to be followed while converting text to a vector?

In the case of reviews, if R1 and R2 are more similar to each other than to R3, then the distance between V1 and V2 should be less than the distance to V3. Here R is a review and V is its vector. In other words, similar points should be closer geometrically.
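This rule can be checked with cosine similarity. A minimal sketch, assuming three toy reviews (the texts are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = [
    "the food was great and tasty",   # R1
    "the food was tasty and great",   # R2, similar to R1
    "the delivery was very late",     # R3, different topic
]
vectors = CountVectorizer().fit_transform(reviews)

# Similar reviews should map to vectors with higher cosine similarity
sim_12 = cosine_similarity(vectors[0], vectors[1])[0, 0]
sim_13 = cosine_similarity(vectors[0], vectors[2])[0, 0]
print(sim_12 > sim_13)  # True: R1 is closer to R2 than to R3
```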

How to convert text to vector?

  1. Bag of Words (BoW)

Import as: from sklearn.feature_extraction.text import CountVectorizer

Size of vector: the number of unique words in the corpus

Each element holds the count of the corresponding word, i.e. how many times it occurred in the document.

Sparse vector — Most of the elements are 0.

Sparsity = Number of 0 elements / Total Elements

The values are stored as a sparse matrix to save memory.

There is also a Binary Bag of Words, where each element is 0 or 1 depending on whether the word is absent or present.

2. TFIDF:

Import as: from sklearn.feature_extraction.text import TfidfVectorizer

TFIDF(w, d) = TF(w, d) * IDF(w), where IDF(w) = log(N / n_w), N is the total number of documents and n_w is the number of documents containing w

TF is a probability, i.e. its value ranges from 0 to 1.

N / n_w can be a large number, so we take its log to bring it into a manageable range.

For a frequent word, IDF is low but TF is high; for a rare word, IDF is high but TF is low.

3. Word2Vec (W2V)

Import as: from gensim.models import Word2Vec, KeyedVectors

w2v_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

w2v_model['computer']

w2v_model.most_similar('great')

w2v_model.similarity('woman', 'man')

Each word is an N-dimensional vector. The higher the dimension, the more information the vector can capture (given enough training data).

4. Avg-W2V

(W2V(W1) + W2V(W2) + …….. + W2V(WN)) / N

5. TFIDF-W2V

(TFIDF(W1) * W2V(W1) + TFIDF(W2) * W2V(W2) + …….. + TFIDF(WN) * W2V(WN)) / (TFIDF(W1) + TFIDF(W2) + ….. + TFIDF(WN))

Why we use log in IDF?

According to Zipf's law, word frequencies naturally follow a power-law distribution. Taking the log compresses this heavy-tailed range of values into a much narrower, more Gaussian-like scale.
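A quick numeric illustration of how the log compresses the IDF range (the document counts are made-up toy numbers):

```python
import math

N = 1_000_000                      # total documents (assumed)
doc_freqs = [900_000, 10_000, 10]  # a frequent, a medium, and a rare word

raw = [N / df for df in doc_freqs]
logged = [math.log(r) for r in raw]

print(raw)                              # spans roughly 1.1 to 100000
print([round(x, 2) for x in logged])    # spans roughly 0.11 to 11.51
```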

What are the steps in text pre-processing?

  1. Remove the HTML tags
  2. Stop word removal ('not' is a stop word, but it is important for determining polarity; n-grams can help avoid losing such important stop words)
  3. Remove non-alphanumeric characters
  4. Change to Lower case
  5. Lemmatization (nltk.stem.wordnet.WordNetLemmatizer)
  6. Stemming (nltk.stem.PorterStemmer / nltk.stem.SnowballStemmer)
  7. Tokenization
  8. Converting text to vector
  9. Thresholding
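The steps above can be sketched as a small pipeline. A minimal version assuming a hand-rolled stop-word list (kept tiny for illustration; `nltk.corpus.stopwords` would normally supply it) and the Snowball stemmer mentioned in step 6:

```python
import re
from nltk.stem import SnowballStemmer

# Minimal stop-word list for illustration; "not" is deliberately kept
# because it matters for polarity
stop_words = {"the", "a", "an", "is", "was", "and", "it"}
stemmer = SnowballStemmer("english")

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)            # 1. remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)     # 3. remove non-alphanumerics
    tokens = text.lower().split()                   # 4. lowercase, 7. tokenize
    tokens = [t for t in tokens if t not in stop_words]  # 2. stop-word removal
    return [stemmer.stem(t) for t in tokens]        # 6. stemming

tokens = preprocess("<br>The movie was NOT good!")
print(tokens)
```

The cleaned tokens would then be fed into one of the vectorizers above (step 8).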
