Diving into Machine Learning!

Many believe robotics is on the cusp of becoming the next technological revolution and that we should expect a significant impact from “intelligent” robots in the near future. Even today, robots and intelligent systems are already at work in many different domains. But robotics is only one application of machine learning and artificial intelligence (AI). Clearly, machine learning is serious business! In this post, we review some machine learning techniques at a very abstract level.

In a sense, the cornerstone of computer software is the algorithm. You may think of an algorithm as a set of step-by-step commands, written by a human commander, for the computer to follow. In machine learning, however, machines can learn the rules from data, discover hidden patterns, create new rules, and ultimately become their own commander! But how do machines learn? Although there are many different groupings, machine learning techniques can generally be categorized into three types:

  • Supervised Learning: Here, the algorithm has access to labeled data, where the label is the desired output. That is, for example, how the classification module in WordStat works!
  • Unsupervised Learning: Here, the algorithm is not lucky enough to have labeled data and must find patterns in the input data on its own! An example is WordStat’s cluster analysis.
  • Reinforcement Learning: The machine here is very social: it interacts with a dynamic environment, receives rewards for good moves, and gets punished for mistakes! This is how your fancy car can move around the parking lot without you!
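
To make the first two types a little more concrete, here is a minimal Python sketch. It is a toy illustration only and says nothing about how WordStat works internally: the function names (`train_nearest_centroid`, `predict`, `two_means`) and the sample numbers are made up for this post. The supervised part learns one average value per label from labeled examples; the unsupervised part receives the same numbers without labels and has to split them into two groups by itself.

```python
from statistics import mean


def train_nearest_centroid(samples, labels):
    """Supervised: average the labeled examples of each class into a centroid."""
    grouped = {}
    for x, y in zip(samples, labels):
        grouped.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in grouped.items()}


def predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(samples, iterations=10):
    """Unsupervised: split unlabeled numbers into two groups (1-D k-means, k=2)."""
    low, high = min(samples), max(samples)  # initial guesses for the two centers
    for _ in range(iterations):
        group_low = [x for x in samples if abs(x - low) <= abs(x - high)]
        group_high = [x for x in samples if abs(x - low) > abs(x - high)]
        if not group_high:  # all values identical, nothing to split
            break
        low, high = mean(group_low), mean(group_high)
    return sorted(group_low), sorted(group_high)


# Hypothetical data: six measurements, with and without labels.
values = [1.0, 1.2, 0.8, 5.0, 5.5, 4.8]
labels = ["short", "short", "short", "long", "long", "long"]

centroids = train_nearest_centroid(values, labels)
print(predict(centroids, 1.5))  # the label learned from the labeled examples
print(two_means(values))        # two groups found without any labels
```

Notice that the supervised model can name its answer ("short" or "long") because we gave it names to learn from, while the unsupervised one can only tell us that two groups exist. Reinforcement learning needs an environment to interact with, so it is left out of this small sketch.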

Now, when it comes to text analytics, you will hear a lot about Natural Language Processing (NLP) techniques. NLP is concerned with the interaction between computers and human language, aiming to develop systems able to interpret human text or speech. We humans are very difficult beasts for computers to master! We use a lot of tricky stuff when we communicate, such as sarcasm, colloquialisms, and abbreviations. These inconsistencies can drive a normally placid computer crazy! It’s a tough day at the office having to deal with these unpredictable humanoids. Want to send a computer in search of Prozac? You can start misspelling words and make smiley facess 🙂 (oops, a double whammy). Starting from simple rule-based text mining systems in the early days, text analytics pipelines have evolved significantly, especially over the past decade, and now employ NLP and machine learning techniques to explore unstructured data.

 

In general, you may think of a text analysis pipeline as having three levels. At the low level, you need to perform some text processing tasks on the input to give structure to the unstructured text data and make it understandable to the machine. These tasks may include tokenization, segmentation, stop-word removal, lemmatization, stemming, etc. The next level of analysis is concerned with extracting abstract knowledge from the corpus. This mid-level analysis may include extracting themes and topics from the corpus, summarizing a collection of documents, or named-entity extraction. At the highest level of analysis, the user may be interested in identifying and categorizing the opinions expressed in a corpus. Sentiment analysis, for example, which uses NLP and machine learning techniques to discover people’s opinions or feelings about a topic, falls into this category.
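
Here is a rough Python sketch of what those levels can look like in miniature. Everything in it is an assumption made for illustration: the tiny stop-word list, the suffix-chopping `crude_stem` function, and the two-word-list `sentiment_score` are deliberately naive stand-ins, not WordStat features and not how a production NLP library does the job.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "and", "of", "to", "in", "it", "this"}
SUFFIXES = ("ing", "ed", "s")
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}


def tokenize(text):
    """Low level: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Low level: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Low level: chop a common suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the low-level steps in order: tokenize, filter, stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


def top_terms(text, n=2):
    """Mid level (very roughly): the most frequent stems hint at the themes."""
    return Counter(preprocess(text)).most_common(n)


def sentiment_score(text):
    """High level (very roughly): positive word count minus negative word count."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


print(preprocess("The robots are learning and the robot learned"))
print(top_terms("The robots are learning and the robot learned"))
print(sentiment_score("I love this great robot, but the battery is bad"))
```

A crude stemmer like this one happily mangles words (it would turn "loved" into "lov"), and a word-list sentiment score has no idea what sarcasm is, which is exactly why real pipelines lean on proper NLP and machine learning techniques.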

The good news is WordStat can help you perform a comprehensive text analysis at all of the above-mentioned levels! How? Check out our video tutorials to learn more!