All too often, companies have only the vaguest idea of what data they hold, because that data is often buried in a variety of databases and fragmented across departments. We identify this data and bring it to light, making it visible, cohesive, comparable and easy to understand, so that it really does support YOU in making the right decisions. And if need be, we can also identify any missing data and define a concept to fill the gap.

Text classification – Predicting product recommendations from reviews


One of the leading retail companies wanted to understand, from the reviews of a given product, what prompts users to recommend it.


The reviews for the product, the satisfaction score and whether the reviewer recommends the product to others were scraped. This was the input data, which was then pre-processed as described below.

Tokenization: The text was split into sentences and the sentences into words. Words were lowercased and punctuation was removed. Words with fewer than 3 characters and all stop words were removed.

Lemmatization: Words were lemmatized, i.e., words in the third person were changed to the first person, and verbs in past and future tenses were changed to the present tense.

Stemming: Words were stemmed, i.e., reduced to their root form.
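The cleaning steps above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the code used in the project: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (such as the Porter stemmer). Lemmatization is left out because it needs a dictionary resource such as WordNet. The function names are our own.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "and", "was", "for", "this", "that", "with", "are",
              "but", "have", "has", "had", "very", "they", "them", "you"}

# Assumption: crude suffix stripping as a stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def split_sentences(text):
    """Split text into sentences on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_stem(word):
    """Strip one common suffix, keeping at least 3 characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return the cleaned, stemmed tokens of a review."""
    tokens = []
    for sentence in split_sentences(text):
        # Lowercase, then keep alphabetic runs only, which drops punctuation and digits.
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if len(word) < 3 or word in STOP_WORDS:
                continue  # drop short words and stop words
            tokens.append(naive_stem(word))
    return tokens
```

For example, `preprocess("The delivery was fast. I loved the shoes!")` returns `["delivery", "fast", "lov", "shoe"]`. As with real stemmers, the resulting stems need not be dictionary words.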

The TF-IDF approach was then used to convert the cleaned text into features.
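To show what TF-IDF computes, here is a small sketch of the basic textbook weighting: term frequency within a review multiplied by the log of the inverse document frequency. The exact variant used in the project is not stated; library implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization by default, so their numbers differ. The function name `tfidf` is our own.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf = share of the document's tokens; idf = log(N / df), unsmoothed.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized reviews.
docs = [["great", "shoe", "great"], ["bad", "shoe"]]
weights = tfidf(docs)
```

In this toy input, "shoe" appears in every review, so its weight is 0, while "great" and "bad" each appear in only one review and receive positive weights. This is the property that makes TF-IDF useful here: terms that distinguish reviews from one another count for more than terms shared by all of them.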


A pipeline was used to first identify the key features (using methods such as LASSO regression and some filter methods) and then apply a series of supervised machine learning models to these shortlisted features. The final model was chosen on the basis of accuracy scores.
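A pipeline of this kind could be sketched with scikit-learn as follows. This is an illustration under assumptions, not the project's actual configuration: the reviews and labels are made-up toy data, the `alpha` value, the two candidate classifiers and the 3-fold cross-validation are our own choices, and `build_pipeline` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = recommends the product, 0 = does not.
reviews = [
    "love these shoes great fit",
    "great quality love the color",
    "really great product would buy again",
    "love it great value",
    "great comfort love wearing them",
    "love the great design",
    "poor quality broke after a week",
    "broke quickly poor stitching",
    "poor fit asked for a refund",
    "very poor material it broke",
    "poor value broke on first use",
    "broke in days poor build",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def build_pipeline(classifier):
    """TF-IDF features -> LASSO-based feature selection -> classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        # Keep only the terms to which the L1-penalized model gives a non-zero weight.
        ("select", SelectFromModel(Lasso(alpha=0.01))),
        ("clf", classifier),
    ])


# Assumption: two example candidates; the source does not name the models tried.
candidates = {
    "logistic_regression": LogisticRegression(),
    "naive_bayes": MultinomialNB(),
}

# Mean cross-validated accuracy per candidate model.
scores = {
    name: cross_val_score(build_pipeline(clf), reviews, labels,
                          cv=3, scoring="accuracy").mean()
    for name, clf in candidates.items()
}

# Refit the best-scoring candidate on all of the data.
best_name = max(scores, key=scores.get)
best_model = build_pipeline(candidates[best_name]).fit(reviews, labels)
```

Once fitted, `best_model.predict(["some new review text"])` returns 1 or 0 for each review passed in. Keeping the vectorizer and the feature selection inside the pipeline means they are refitted on each training fold, so the accuracy scores are not inflated by information from the held-out reviews.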


The derived classification algorithm will be used to predict, based on the review, whether a customer will recommend the product or not. This eliminates the need to capture the recommendation score.

Venugopala Rao Manneni

I hold a doctorate in statistics from Osmania University. I have been working in the fields of data analysis and research for the last 14 years. My expertise is in data mining and machine learning, and I have also published papers in these fields. I love to play cricket and badminton.

