All too often, companies have only the vaguest idea of what data they hold, because that data is often buried in a variety of databases and fragmented across departments. We identify this data and bring it to light, making it visible, cohesive, comparable and easy to understand, so that it really does support you in making the right decisions. If need be, we can also identify missing data and define a concept to fill the gap.

Identifying which statistical concepts are applied in research papers using topic modeling


Statistics is a major contributor to research and overlaps a great deal with machine learning and deep learning applications. We have therefore developed a model that uses topic modeling to find out which statistical concepts are applied in a given research paper.


We applied LDA (Latent Dirichlet Allocation), a topic modeling technique. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. LDA is one example of a topic model and is used to assign the text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
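For readers who want the model spelled out, the standard LDA generative story can be written as below. The notation is the usual textbook one and is an assumption here, not something taken from our implementation: K topics, D documents, Dirichlet hyperparameters α and β, per-document topic proportions θ_d, per-topic word distributions φ_k, and for the n-th word of document d a topic assignment z_dn and an observed word w_dn.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{dn} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{dn} \mid z_{dn}, \phi &\sim \mathrm{Categorical}(\phi_{z_{dn}})
\end{align*}
```

Fitting LDA means inferring θ and φ from the observed words; θ_d is the "topics per document" model and φ_k the "words per topic" model.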

We scraped the abstracts of various research papers and then preprocessed the text with the following steps:

  • Tokenization: the text is split into sentences and the sentences into words. The words are lowercased and punctuation is removed.
  • Words that have fewer than 3 characters are removed.
  • All stopwords are removed.
  • Words are lemmatized: words in third person are changed to first person, and verbs in past and future tenses are changed into present.
  • Words are stemmed: words are reduced to their root form.
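The preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code we ran: the stopword list is a tiny made-up sample, and `strip_suffix` is a hypothetical, deliberately crude stand-in for real lemmatization and stemming (a real pipeline would plug in something like NLTK's lemmatizer and stemmer through the `normalize` parameter).

```python
import re

# Tiny illustrative stopword list (an assumption); a real pipeline would use a
# fuller list such as the ones shipped with NLTK or gensim.
STOPWORDS = {"the", "and", "are", "for", "was", "were", "with",
             "that", "this", "from", "into", "using"}


def strip_suffix(word):
    """Crude stand-in for lemmatization plus stemming (hypothetical helper)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, normalize=strip_suffix):
    """Return the cleaned tokens of one abstract."""
    tokens = []
    # Split the text into sentences, then each sentence into lowercase words;
    # the regex keeps letters only, which also drops punctuation and digits.
    for sentence in re.split(r"[.!?]+", text):
        for word in re.findall(r"[a-z]+", sentence.lower()):
            # Drop short words and stopwords before normalizing.
            if len(word) < 3 or word in STOPWORDS:
                continue
            tokens.append(normalize(word))
    return tokens


tokens = preprocess("We sampled the data. Regression models were fitted!")
```

Passing `normalize=lambda w: w` turns the suffix stripping off, which is handy for checking what the filtering steps alone do.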

The preprocessed text is then converted into features using the TF-IDF algorithm, and these features are used to derive the topics with LDA. The derived topics are:

  • Probability
  • Distributions
  • Inference
  • Design of experiments
  • Sampling theory
  • Regression analysis
  • Multivariate analysis
  • Stochastic process
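To make the TF-IDF step mentioned above concrete, here is a small sketch of one common textbook variant: term frequency is the word's share of the document, and inverse document frequency is log(N / df). The function name `tfidf` and the toy documents are assumptions for illustration; libraries such as scikit-learn use a smoothed formula by default, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = relative frequency in this document; idf = log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Two made-up, already preprocessed abstracts.
docs = [["regression", "model"], ["sampling", "model"]]
weights = tfidf(docs)
```

A word that appears in every document, like "model" here, gets weight zero, while words specific to one document are weighted up. That is why TF-IDF helps the topics separate.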

Here is the list of the terms under each topic

The topic model is visualized with a heatmap and a dendrogram to validate it. Once the model was validated, we developed a classification algorithm using naive Bayes to assign future documents to these topics.
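One plausible way to produce the numbers behind such a heatmap is to compare the topics' word distributions pairwise. The sketch below assumes cosine similarity and uses made-up topic weights; which quantity was actually plotted is not specified here, so treat this purely as an illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(topic_word):
    """topic_word: one row of word weights per topic. Returns a square matrix."""
    return [[cosine(u, v) for v in topic_word] for u in topic_word]


# Made-up weights for three topics over a four-word vocabulary.
topics = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]
matrix = similarity_matrix(topics)
```

The matrix can be handed to any plotting library to draw the heatmap, and 1 minus the similarity can serve as a distance for hierarchical clustering (for example with SciPy's `scipy.cluster.hierarchy` functions) to draw the dendrogram. Topics that sit close together in either view may be candidates for merging.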

Impact: whenever we receive a new document, we can easily classify it according to which statistical concept it applies.
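Classifying a new document with naive Bayes can be sketched as follows. This is a minimal multinomial naive Bayes with add-one smoothing written from scratch for illustration; the class name, the toy tokens and the labels are all invented, and we assume each training document has already been given a topic label.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(label) + sum of log P(word | label); words never seen in
            # training are skipped.
            score = math.log(n_docs / total_docs)
            for word in tokens:
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set; tokens and labels are invented for illustration.
train_docs = [
    ["regression", "coefficient", "residual"],
    ["linear", "regression", "predictor"],
    ["sample", "stratum", "survey"],
    ["survey", "sample", "cluster"],
]
train_labels = ["Regression analysis", "Regression analysis",
                "Sampling theory", "Sampling theory"]
model = NaiveBayes().fit(train_docs, train_labels)
predicted = model.predict(["survey", "sample", "size"])
```

Working in log probabilities avoids underflow on long abstracts, and the add-one smoothing keeps a word that is missing from one topic from zeroing out that topic's score.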

Venugopala Rao Manneni

I hold a doctorate in statistics from Osmania University. I have been working in the fields of data analysis and research for the last 14 years. My expertise is in data mining and machine learning, fields in which I have also published papers. I love to play cricket and badminton.
