A very basic introduction to Topic Modeling
In this post, we will walk you through the concept of topic modeling.
Let’s say I have found your diary (yeah, I know all the great hiding spots!) and I have only two minutes to understand your innermost secrets! How about reading it from scratch? In two minutes? Nah, not possible! But I have a text mining robo-buddy who can process and analyze the whole diary in less than two minutes and, through topic modeling, extract much of the information out of it. Text mining techniques can quickly derive valuable knowledge and insights from large-scale (unstructured) text-based datasets such as books, journals, articles, speeches, digital documents and emails.
What is Topic Modeling?
Topic modeling is a form of text mining that employs unsupervised and supervised statistical machine learning techniques to identify patterns in a corpus, or large amount of unstructured text. It can take your huge collection of documents, group the words into clusters based on their similarity, and identify the topics those clusters represent.
That sounds a bit technical and complicated so let’s simplify the process of topic modeling! Suppose you are reading a newspaper and you have a set of colored highlighters in your hand. Huh, old-fashioned? I know these days very few people read newspapers in print, everything is digital and highlighters are so yesterday! Pretend you are your father or your mother! So, as you are reading the newspaper you are highlighting the interesting keywords. One more assumption! You use a different color for highlighting the keywords of different themes. You group the keywords based on the assigned color and themes. Each list of words identified by a specific color is the list of keywords for a topic. The number of distinct colors you used represents the number of topics.
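To make the highlighter picture concrete, here is a minimal sketch of it in Python. It is only the manual analogy, not an actual topic modeling algorithm: the keywords, the colors and the helper name group_by_color are all made up for illustration, and a real topic model would have to discover the groupings by itself instead of being handed the colors.

```python
from collections import defaultdict

# Hypothetical highlights: (keyword, highlighter color) pairs made while reading.
highlights = [
    ("election", "yellow"),
    ("parliament", "yellow"),
    ("goalkeeper", "green"),
    ("tournament", "green"),
    ("inflation", "pink"),
    ("minister", "yellow"),
    ("interest rates", "pink"),
]

def group_by_color(pairs):
    """Collect the highlighted keywords under the color used for them."""
    topics = defaultdict(list)
    for keyword, color in pairs:
        topics[color].append(keyword)
    return dict(topics)

# Each color is one topic; its keyword list describes that topic.
topics = group_by_color(highlights)

# The number of distinct colors used is the number of topics.
num_topics = len(topics)

for color, keywords in topics.items():
    print(f"{color}: {', '.join(keywords)}")
print(f"Number of topics: {num_topics}")
```

Here the reader (you, with the highlighters) does all the hard work of deciding which color a keyword gets. An automated topic model has to make that decision from the text alone.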
This is the most basic topic modeling. It facilitates understanding, organizing and summarizing huge text datasets. But remember, to be useful, automated topic models generally need a large collection of text. If you have a short document, it might be better to go old-fashioned and use highlighters! Spending some time getting to know the data is also helpful. By doing this, you will have a general idea of what you expect the topic model to discover. For example, that diary might be devoted to your current and past relationships, so I would expect my text mining robo-buddy to produce related topics. This can help you better assess the quality of the topics it finds and refine the keyword sets, if required.
In the following posts, we will talk more about the different types of topic modeling and how we can tell whether a topic model is good or not…