  1. Computers operate on numerical values, not on raw text
  2. Once we convert text to a vector, we can leverage the power of linear algebra
  1. Bag of Words (BoW)
  1. Remove the HTML tags
  2. Stop word removal ("not" is a stop word, but it is important for determining polarity. We can use n-grams such as "not good" to avoid losing important stop words.)
  3. Remove non-alphanumeric characters
  4. Convert to lowercase
  5. Lemmatization (nltk.stem.wordnet.WordNetLemmatizer)
  6. Stemming (nltk.stem.PorterStemmer / nltk.stem.SnowballStemmer)
  7. Tokenization
  8. Converting the text to a vector
  9. Thresholding
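
The steps above can be sketched in plain Python as follows. This is only a minimal illustration under several assumptions, not a reference implementation: the stop-word list is a tiny hypothetical one (with "not" deliberately left out so that polarity survives), the function names `preprocess`, `build_vocabulary`, and `to_bow_vector` are made up for this example, stemming and lemmatization are skipped (the nltk classes listed above would slot in after tokenization), and "thresholding" is interpreted here as dropping words that appear in fewer than `min_doc_freq` documents, which may not be the intended meaning. The step order is also adapted slightly, since stop words can only be removed once the text is tokenized.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
# "not" is intentionally excluded because it carries polarity.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "was", "and", "i"}


def preprocess(text):
    """Clean one document and return its list of tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove non-alphanumeric characters
    text = text.lower()                           # convert to lowercase
    tokens = text.split()                         # simple whitespace tokenization
    # Stemming or lemmatization would go here (omitted in this sketch).
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(token_lists, min_doc_freq=1):
    """Map each kept word to a column index.

    Assumption: thresholding means keeping only words that occur
    in at least `min_doc_freq` documents.
    """
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word once per document
    kept = sorted(w for w, c in doc_freq.items() if c >= min_doc_freq)
    return {word: index for index, word in enumerate(kept)}


def to_bow_vector(tokens, vocab):
    """Convert a token list to a Bag of Words count vector."""
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] += 1
    return vector


docs = [
    "<p>This movie is good, good!</p>",
    "This movie is <b>not</b> good.",
]
token_lists = [preprocess(d) for d in docs]
vocab = build_vocabulary(token_lists)
vectors = [to_bow_vector(t, vocab) for t in token_lists]

print(vocab)    # {'good': 0, 'movie': 1, 'not': 2}
print(vectors)  # [[2, 1, 0], [1, 1, 1]]
```

Because "not" is kept, the two reviews end up with different vectors even though they share the words "movie" and "good". Calling `build_vocabulary(token_lists, min_doc_freq=2)` would drop "not" again, since it appears in only one document, which shows how a threshold can discard rare but meaningful words.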




