What is K-Nearest Neighbors?!

In our previous post on machine learning, we explained that there are supervised and unsupervised beasts lurking in machine learning land! One of the simplest machine learning techniques is K-nearest neighbors (KNN). In this post, we will briefly review how KNN works.

We’ve all heard of lazy guys! With technology becoming more accessible and easier to use, it’s even easier to get lazy! What do you do if you’re on the couch and the remote is on the other side of the room? Just relax and download an app to change the TV channel right from your smartphone!

You are going to love KNN. It is very lazy! KNN is the couch potato of algorithms: it basically stores the entire dataset rather than building a model of it. Sure, you can use some clever data structures to improve efficiency, but there still is no model! So the first question that comes to mind with KNN is: “is it really machine learning!?” Hmmm, yeah, it can be, but it is definitely a “lazy learner.” KNN can be applied to both classification (i.e., predicting discrete labels) and regression (i.e., predicting continuous values) problems.

Oddly enough, being lazy can come in handy. There are some advantages to slothfulness! The main strength of KNN lies in its ease of interpretation. It is also relatively robust to noisy data. And since there is no model, there is no training. The flip side is that all the work happens at prediction time, so the computational cost can be significant in some cases.

Okay, we have a lazy tool, but how does it really work? Let’s assume you hired a sneaky ninja and obtained data on your colleagues’ weight and age group. For simplicity, assume there is some sort of positive relationship between age group and weight, so you end up with a plot like the following:
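(If you’d like to recreate a toy version of such a plot yourself, here is a minimal Python sketch; the weights and the two age groups below are made up purely for illustration.)

```python
import matplotlib.pyplot as plt

# Made-up data: colleagues' weights (kg) for two age groups.
younger = [55, 58, 60, 62, 65, 68]   # plotted as blue circles
older   = [72, 76, 80, 84, 88, 92]   # plotted as red diamonds

plt.scatter(younger, [1] * len(younger), marker="o", c="blue", label="younger group")
plt.scatter(older,   [2] * len(older),   marker="D", c="red",  label="older group")

plt.xlabel("Weight (kg)")
plt.yticks([1, 2], ["younger", "older"])
plt.ylabel("Age group")
plt.legend()
plt.show()
```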

Now, a new handsome guy is joining your company, and somehow you already know his weight. Rather than asking him directly, you want to predict his age group! How? Let’s put him on the plot first…

Technically speaking, you would like to find the label (age group) of the new data point (the new guy). Suppose the new data point can belong either to the red diamond group or to the blue circle group and nothing else. Now it’s time to find out the role of “K” in the KNN algorithm! K is a parameter given to the algorithm that tells it how many neighboring data points to consider when making a decision about the new point. For example, if we set K=3, the algorithm finds the 3 nearest neighbors of the new data point and, in the case of classification, assigns the majority label among those 3 neighbors to the new point. Hmmm, sounds easy! KNN can also be used for regression problems. There, instead of returning the label with the most votes, the algorithm returns a continuous value, for example by averaging the outcomes of the nearest neighbors.
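To make the voting and averaging concrete, here is a minimal from-scratch sketch in Python. The weights, age-group labels, and the choice of K=3 are made-up illustration values, not a recommended setup.

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, regression=False):
    """Predict the label (or value) of `query` from its k nearest neighbors.

    KNN is "lazy": there is no training step. We keep the whole dataset
    and do all the work here, at prediction time.
    """
    # Distance from the query to every stored point (a single feature: weight).
    distances = [(abs(x - query), y) for x, y in zip(train_X, train_y)]
    # Keep the outcomes of the k closest points.
    neighbors = [y for _, y in sorted(distances, key=lambda d: d[0])[:k]]
    if regression:
        # Regression: average the neighbors' continuous outcomes.
        return sum(neighbors) / len(neighbors)
    # Classification: majority vote among the neighbors' labels.
    return Counter(neighbors).most_common(1)[0][0]

# Made-up example: colleagues' weights (kg) and their age-group labels.
weights    = [55, 58, 62, 70, 75, 80, 85, 90]
age_groups = ["20s", "20s", "20s", "30s", "30s", "40s", "40s", "40s"]

# The new guy weighs 72 kg: what do his 3 nearest neighbors say?
print(knn_predict(weights, age_groups, query=72, k=3))  # -> "30s"
```

In practice you would rarely write this yourself; libraries such as scikit-learn provide `KNeighborsClassifier` and `KNeighborsRegressor`, including tree-based data structures that speed up the neighbor search.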

Although KNN is very simple, it can perform nicely in many cases. As we’ve seen, the most important parameter to tune is K, and different values of K can noticeably affect the algorithm’s performance. But how do you tune it?!? Wait for the next post 😊