K-NN Algorithm Explained
Introduction to K-Nearest Neighbour Algorithm using Examples.
The K-nearest neighbors (KNN) algorithm is a simple technique and works by finding the distance between a new data point and all the existing data points.
Introduction to KNN
KNN is one of the simplest supervised learning algorithms. Given a new observation, it looks at the k labelled data points closest to it and assigns the new observation to the group that those neighbors belong to. (It should not be confused with k-means, an unsupervised clustering method that partitions observations into k clusters, each represented by a centroid.) The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near each other.
What is KNN all about:
The KNN algorithm assumes similarity between a new data point and the available data points around it, and assigns the new data point to the category it is most similar to.
KNN algorithm can be used for Regression as well as for classification but mostly it is used for Classification problems.
It is also called a Lazy Learner Algorithm because it does not learn from the training set immediately; instead, it stores the data set and performs the computation only when a prediction is needed.
For example, suppose we have an image and we want to classify it as a cat or a dog. We can use the KNN algorithm to extract the features of the new image and compare them against the features of known cat and dog images. Based on the most similar features, it will place the image in either the cat or the dog category.
Where to use KNN
KNN is mostly used in simple recommendation systems, image recognition, and general decision models. Many e-commerce companies use it to recommend products to buy. Remember the activity we did earlier around increasing the conversion ratio of a website.
How KNN works
How KNN works can be explained with the following steps:
Step 1: Select the number K of neighbors.
Step 2: Calculate the Euclidean distance from the new data point to every existing data point.
Step 3: Take the K-Nearest neighbors as per the calculated Euclidean distance.
Step 4: Among these K neighbors, count the number of data points in each category.
Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
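The five steps above can be sketched in a few lines of Python. This is a minimal illustration only; the toy data points and the choice of K are placeholders you would replace with your own data.

```python
import math
from collections import Counter

def knn_predict(new_point, data, labels, k=3):
    """Classify new_point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the new point to every existing point
    distances = [
        (math.dist(new_point, point), label)
        for point, label in zip(data, labels)
    ]
    # Step 3: keep the K nearest neighbors
    nearest = sorted(distances)[:k]
    # Steps 4 and 5: count the categories and pick the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two obvious groups
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_predict((2, 2), points, labels, k=3))  # -> 'A'
```

The point (2, 2) sits next to the three 'A' points, so all three of its nearest neighbors vote 'A'.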
OUR MODEL IS READY!
Advantages of KNN Algorithm
It is simple to implement.
It is relatively robust to noisy training data, especially for larger values of K.
It is effective if training data is large.
Disadvantages of KNN Algorithm
Determining the right value of K can be difficult.
Calculating the Euclidean distance between the new data point and every point in the training set can be computationally expensive.
In a scatter plot of the data, we can see that same-colored data points lie close to each other. The KNN algorithm is built on this observation. The most important part is selecting the value of K.
How do we start? Choosing the right value of K
A simple way to do this is to start with a K suited to your data and try various options until you minimize the number of errors while maintaining the accuracy of the model.
A few important things to keep in mind:
As the value of K approaches 1, predictions become less stable.
As we increase the value of K, predictions become more stable due to averaging, but beyond a certain point errors increase again, indicating that K has become too large.
Normally, K is chosen as an odd number so that majority votes cannot end in a tie.
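One way to make that search concrete is to scan candidate odd values of K and measure the leave-one-out error for each: predict every point from all the others and count the mistakes. The data set below is a hypothetical toy example, and the simple classifier is a sketch, not a production implementation.

```python
import math
from collections import Counter

def knn_predict(new_point, data, labels, k):
    """Majority vote among the k nearest neighbors of new_point."""
    distances = sorted(
        (math.dist(new_point, p), lbl) for p, lbl in zip(data, labels)
    )
    votes = Counter(lbl for _, lbl in distances[:k])
    return votes.most_common(1)[0][0]

def loo_error(data, labels, k):
    """Leave-one-out error rate: predict each point from all the others."""
    mistakes = 0
    for i, (point, label) in enumerate(zip(data, labels)):
        rest = data[:i] + data[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(point, rest, rest_labels, k) != label:
            mistakes += 1
    return mistakes / len(data)

# Hypothetical 2-D data: two well-separated groups of four points each
data = [(1, 1), (1, 2), (2, 1), (2, 2), (6, 6), (6, 7), (7, 6), (7, 7)]
labels = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']

# Scan odd values of K and keep the one with the lowest error
best_k = min((loo_error(data, labels, k), k) for k in (1, 3, 5, 7))[1]
```

Note what happens at K = 7 here: once a point is held out, only three of its own class remain against four of the other class, so every vote goes the wrong way. That is the "K has become too large" effect from the list above.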
Recommender Systems Based on KNN
We have all used content platforms such as YouTube, Amazon, and Netflix; now we know how they are able to magically recommend movies we like using KNN.
Gather Movies Data
We could use data from the UCI Machine Learning Repository or IMDb's data set, or create our own manually from various content platforms.
Explore, Clean, and Prepare Data
There may be missing data or errors in the data set, so we need to fix them before running the classifier models.
KNN implementations rely on structured data, so the data needs to be in a table format.
Now, what could be the possible parameters in these movies?
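As an illustration, here is one possible structured layout. The feature names and values below are hypothetical, not taken from a real data set.

```python
# Each row is one movie; the columns are numeric parameters that KNN can
# measure distance over. All names and values here are illustrative.
# Note that ratings and years live on very different scales, so in
# practice each column should be normalized before computing distances.
movies = [
    # (title,         rating, year, is_action, is_comedy, is_drama)
    ("The Matrix",    8.7,    1999, 1, 0, 0),
    ("Toy Story",     8.3,    1995, 0, 1, 0),
    ("The Godfather", 9.2,    1972, 0, 0, 1),
]
```

Genres become 0/1 indicator columns so that they can take part in the same distance calculation as the numeric fields.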
Use the Algorithm
If you have used content platforms, you have seen features such as "Play me something" or "Watch similar". Behind the scenes, these platforms query a recommendation data set using KNN.
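A "Watch similar" lookup can be sketched as a nearest-neighbor query over movie feature vectors. The titles, feature columns, and values here are hypothetical stand-ins for a real catalogue.

```python
import math

# Hypothetical features: (action score, comedy score, IMDb rating / 10),
# already scaled to comparable ranges. Values are illustrative only.
movies = {
    "The Matrix":    (1.0, 0.0, 0.87),
    "John Wick":     (1.0, 0.1, 0.74),
    "Toy Story":     (0.0, 1.0, 0.83),
    "Groundhog Day": (0.1, 1.0, 0.80),
    "Inception":     (0.9, 0.0, 0.88),
}

def recommend(liked, k=2):
    """Return the k movies nearest in feature space to the one just watched."""
    target = movies[liked]
    others = [
        (math.dist(target, feats), title)
        for title, feats in movies.items() if title != liked
    ]
    return [title for _, title in sorted(others)[:k]]

print(recommend("The Matrix"))  # -> ['Inception', 'John Wick']
```

Someone who just finished The Matrix gets the two movies closest to it in feature space, which is exactly the K-nearest-neighbors query from earlier, applied to movies instead of labelled classes.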
To recap: the K-nearest neighbors (KNN) algorithm is a simple technique that works by finding the distances between a new data point and all existing data points.
Key features of KNN are:
The KNN model bases its prediction on the data points surrounding the new point, i.e. its neighbors.
We need to map the properties of a new data point and match it to its nearest neighboring group.
Finding the value of K is the most important part of this algorithm.
I post Blogs on Maths, Machine Learning, and Python every week. You can find them here: (Gaurav Pandey).
Thanks for Reading!