Questions on NLP
Why should we convert text to a vector?
- Computers operate on numerical values, not on raw text
- Once we convert text to a vector, we can leverage the power of linear algebra
What rules need to be followed while converting text to a vector?
Take reviews as an example, where R is a review and V is its vector. If R1 and R2 are more similar to each other than either is to R3, then the distance between V1 and V2 should be smaller than the distance between V1 and V3. In other words, similar texts should map to points that are geometrically closer.
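The rule above can be checked on a toy example. This is only a sketch: the three reviews are made up, the vectors are simple word counts, and Euclidean distance is assumed as the distance measure (other measures such as cosine distance are equally common).

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Map a text to a list of word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical reviews: r1 and r2 are similar, r3 is about something else.
r1 = "this pasta is very tasty"
r2 = "this pasta is tasty"
r3 = "the delivery was late and slow"

# Vocabulary = all unique words across the three reviews.
vocabulary = sorted(set(" ".join([r1, r2, r3]).split()))
v1, v2, v3 = (count_vector(r, vocabulary) for r in (r1, r2, r3))

d12 = euclidean_distance(v1, v2)  # similar reviews
d13 = euclidean_distance(v1, v3)  # dissimilar reviews
print(d12 < d13)  # True: the similar reviews are closer
```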
How do we convert text to a vector?
1. Bag of Words (BoW)
Import as: from sklearn.feature_extraction.text import CountVectorizer
Size of vector: the number of unique words in the corpus
Each element holds a count: the number of times the corresponding word occurred in the document.
Sparse vector: most of the elements are 0.
Sparsity = Number of 0 elements / Total number of elements
The values are stored as a sparse matrix.
There is also a binary Bag of Words, where each element is 0 or 1 (word absent or present).
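A minimal sketch of these points with CountVectorizer, assuming scikit-learn is installed. The three-document corpus is made up for illustration; `binary=True` is the CountVectorizer option that yields the 0/1 variant.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
corpus = [
    "this pasta is very tasty",
    "this pasta is not tasty",
    "tasty tasty pasta",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # SciPy sparse matrix: one row per document

# Vector size equals the number of unique words in the corpus.
vocab_size = len(vectorizer.vocabulary_)

# Sparsity = number of zero elements / total number of elements.
total = bow.shape[0] * bow.shape[1]
sparsity = (total - bow.nnz) / total

# Binary Bag of Words: 1 if the word is present, 0 otherwise.
binary_bow = CountVectorizer(binary=True).fit_transform(corpus)

print(bow.shape, vocab_size, sparsity)
print(bow.toarray())
print(binary_bow.toarray())
```

On a corpus this small the matrix is not very sparse; with a realistic vocabulary of tens of thousands of words, almost every element is 0, which is why the sparse representation matters.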
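2. TF-IDF (Term Frequency, Inverse Document Frequency)
Import as: from sklearn.feature_extraction.text import TfidfVectorizer
TF-IDF = TF * log(IDF)
TF is the number of times a word occurs in a document divided by the total number of words in that document. It is a probability, so its value ranges from 0 to 1.
IDF is the total number of documents divided by the number of documents that contain the word. It can be a large number, so we take the log to bring it into a comparable range.
A word that is frequent across the corpus has a low IDF but usually a high TF; a rare word has a high IDF but usually a low TF.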
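The formula can be worked through by hand on a tiny corpus. The sketch below follows the notation used here (TF times the log of the raw document ratio) and uses a made-up corpus and helper names. TfidfVectorizer applies smoothing and normalization by default, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus, already tokenized.
corpus = [
    "this pasta is very tasty".split(),
    "this pasta is not tasty".split(),
    "this delivery was slow".split(),
]

def tf(word, doc):
    """Occurrences of the word in the document / total words in the document."""
    return doc.count(word) / len(doc)

def raw_idf(word, corpus):
    """Total documents / documents containing the word.
    Assumes the word occurs in at least one document."""
    docs_with_word = sum(1 for doc in corpus if word in doc)
    return len(corpus) / docs_with_word

def tf_idf(word, doc, corpus):
    return tf(word, doc) * math.log(raw_idf(word, corpus))

doc = corpus[0]
for word in ("this", "pasta", "very"):
    print(word, round(tf_idf(word, doc, corpus), 4))
```

"this" appears in every document, so its score is 0; "very" appears in only one document and gets the highest score of the three.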
3. Word2Vec (W2V)
Import as: from gensim.models import Word2Vec, KeyedVectors
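Each word is represented as a dense N-dimensional vector. A higher dimension can capture more information about the word, at the cost of more training data and compute.
To get a vector for a whole sentence or review with words W1 ... WN:
Average W2V = (W2V(W1) + W2V(W2) + ... + W2V(WN)) / N
TF-IDF weighted W2V = (TFIDF(W1) * W2V(W1) + TFIDF(W2) * W2V(W2) + ... + TFIDF(WN) * W2V(WN)) / (TFIDF(W1) + TFIDF(W2) + ... + TFIDF(WN))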
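Both sentence-vector formulas can be sketched in plain Python. The 3-dimensional word vectors and the TF-IDF weights below are invented for illustration; in practice the vectors would come from a trained Word2Vec model and the weights from a fitted TF-IDF vectorizer.

```python
# Hypothetical word vectors (a real model would typically use 50 to 300 dimensions).
word_vectors = {
    "pasta": [1.0, 0.0, 2.0],
    "is":    [0.0, 1.0, 0.0],
    "tasty": [2.0, 2.0, 1.0],
}
# Hypothetical TF-IDF weight for each word in this sentence.
tfidf_weights = {"pasta": 0.5, "is": 0.0, "tasty": 1.5}

def average_w2v(words, word_vectors):
    """Sum of the word vectors divided by the number of words."""
    dim = len(word_vectors[words[0]])
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(word_vectors[word]):
            total[i] += value
    return [value / len(words) for value in total]

def tfidf_weighted_w2v(words, word_vectors, weights):
    """TF-IDF weighted sum of the word vectors divided by the sum of the weights.
    Assumes at least one word has a non-zero weight."""
    dim = len(word_vectors[words[0]])
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        weight_sum += weights[word]
        for i, value in enumerate(word_vectors[word]):
            total[i] += weights[word] * value
    return [value / weight_sum for value in total]

sentence = ["pasta", "is", "tasty"]
avg = average_w2v(sentence, word_vectors)
weighted = tfidf_weighted_w2v(sentence, word_vectors, tfidf_weights)
print(avg)       # [1.0, 1.0, 1.0]
print(weighted)  # [1.75, 1.5, 1.25]
```

The weighted version pulls the sentence vector toward "tasty", the word with the largest TF-IDF weight, while "is" contributes nothing.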
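Why do we use log in IDF?
According to Zipf's law, word frequencies follow a power-law distribution: a few words occur in almost every document, while most words occur in very few. The raw ratio (total documents / documents containing the word) therefore spans several orders of magnitude. Taking the log compresses this range so that IDF does not completely dominate TF in the product.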
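A quick numeric sketch of what the log does to the range, assuming a hypothetical corpus of one million documents and the natural logarithm:

```python
import math

n_docs = 1_000_000  # hypothetical corpus size

rows = []
for docs_with_word in (1, 100, 10_000, 1_000_000):
    raw = n_docs / docs_with_word          # raw ratio, before the log
    rows.append((docs_with_word, raw, round(math.log(raw), 2)))

for row in rows:
    print(row)
# (1, 1000000.0, 13.82)
# (100, 10000.0, 9.21)
# (10000, 100.0, 4.61)
# (1000000, 1.0, 0.0)
```

The raw ratio ranges from 1 to 1,000,000, while its log stays between 0 and about 14, which is much closer to the 0 to 1 range of TF.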
What are the steps in text pre-processing?
- Remove the HTML tags
- Stop word removal ("not" is a stop word, but it is important for determining polarity; using n-grams such as "not good" helps keep the meaning carried by important stop words)
- Remove non-alphanumeric characters
- Convert to lowercase
- Lemmatization (nltk.stem.wordnet.WordNetLemmatizer)
- Stemming (nltk.stem.PorterStemmer / nltk.stem.SnowballStemmer)
- Converting text to vector
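The first few steps can be sketched with the standard library alone. This is an illustration under assumptions: the stop-word list is a tiny made-up one that deliberately keeps "not", the regular expression for HTML tags is a simple approximation, and lowercasing is done before stop-word removal so that matching is case-insensitive. Lemmatization or stemming with the nltk classes named above would slot in after stop-word removal.

```python
import re

# Hypothetical, deliberately tiny stop-word list; "not" is left out on purpose.
STOP_WORDS = {"this", "is", "the", "a", "an", "and", "was"}

def preprocess(text):
    """Return a list of cleaned tokens for one review."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)  # remove non-alphanumeric characters
    tokens = text.lower().split()               # lowercase, then split on whitespace
    return [t for t in tokens if t not in STOP_WORDS]

def bigrams(tokens):
    """Pairs of adjacent tokens, e.g. ('not', 'tasty')."""
    return list(zip(tokens, tokens[1:]))

tokens = preprocess("<p>This pasta is <b>NOT</b> tasty!!</p>")
print(tokens)           # ['pasta', 'not', 'tasty']
print(bigrams(tokens))  # [('pasta', 'not'), ('not', 'tasty')]
```

The bigram ('not', 'tasty') preserves the negation that a single-word representation would lose.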