A Complete Guide On KNN Algorithm In R With Examples
With the amount of data that we’re generating, the need for advanced Machine Learning Algorithms has increased. One such algorithm is the K Nearest Neighbour algorithm. In this blog on KNN Algorithm In R, you will understand how the KNN algorithm works and its implementation using the R Language.
It is essential that you know Machine Learning basics before you get started with this KNN Algorithm In R blog. Here’s a list of Machine learning blogs to get you started:
What Is KNN Algorithm?
KNN which stand for K Nearest Neighbor is a Supervised Machine Learning algorithm that classifies a new data point into the target class, depending on the features of its neighboring data points.
Let’s try to understand the KNN algorithm with a simple example. Let’s say we want a machine to distinguish between images of cats & dogs. To do this we must input a dataset of cat and dog images and we have to train our model to detect the animals based on certain features. For example, features such as pointy ears can be used to identify cats and similarly we can identify dogs based on their long ears.
After studying the dataset during the training phase, when a new image is given to the model, the KNN algorithm will classify it into either cats or dogs depending on the similarity in their features. So if the new image has pointy ears, it will classify that image as a cat because it is similar to the cat images. In this manner, the KNN algorithm classifies data points based on how similar they are to their neighboring data points.
Now let’s discuss the features of the KNN algorithm.
Features Of KNN Algorithm
The KNN algorithm has the following features:
- KNN is a Supervised Learning algorithm that uses labeled input data set to predict the output of the data points.
- It is one of the most simple Machine learning algorithms and it can be easily implemented for a varied set of problems.
- It is mainly based on feature similarity. KNN checks how similar a data point is to its neighbor and classifies the data point into the class it is most similar to.
- Unlike most algorithms, KNN is a non-parametric model which means that it does not make any assumptions about the data set. This makes the algorithm more effective since it can handle realistic data.
- KNN is a lazy algorithm, this means that it memorizes the training data set instead of learning a discriminative function from the training data.
- KNN can be used for solving both classification and regression problems.
KNN Algorithm Example
To make you understand how KNN algorithm works, let’s consider the following scenario:
- In the above image, we have two classes of data, namely class A (squares) and Class B (triangles)
- The problem statement is to assign the new input data point to one of the two classes by using the KNN algorithm
- The first step in the KNN algorithm is to define the value of ‘K’. But what does the ‘K’ in the KNN algorithm stand for?
- ‘K’ stands for the number of Nearest Neighbors and hence the name K Nearest Neighbors (KNN).
- In the above image, I’ve defined the value of ‘K’ as 3. This means that the algorithm will consider the three neighbors that are the closest to the new data point in order to decide the class of this new data point.
- The closeness between the data points is calculated by using measures such as Euclidean and Manhattan distance, which I’ll be explaining below.
- At ‘K’ = 3, the neighbors include two squares and 1 triangle. So, if I were to classify the new data point based on ‘K’ = 3, then it would be assigned to Class A (squares).
- But what if the ‘K’ value is set to 7? Here, I’m basically telling my algorithm to look for the seven nearest neighbors and classify the new data point into the class it is most similar to.
- At ‘K’ = 7, the neighbors include three squares and four triangles. So, if I were to classify the new data point based on ‘K’ = 7, then it would be assigned to Class B (triangles) since the majority of its neighbors were of class B.
In practice, there’s a lot more to consider while implementing the KNN algorithm. This will be discussed in the demo section of the blog.
Earlier I mentioned that KNN uses Euclidean distance as a measure to check the distance between a new data point and its neighbors, let’s see how.
- Consider the above image, here we’re going to measure the distance between P1 and P2 by using the Euclidian Distance measure.
- The coordinates for P1 and P2 are (1,4) and (5,1) respectively.
- The Euclidian Distance can be calculated like so:
It is as simple as that! KNN makes use of simple measure in order to solve complex problems, this is one of the reasons why KNN is such a commonly used algorithm.
To sum it up, let’s look at the pseudocode for KNN Algorithm.
KNN Algorithm Pseudocode
Consider the set, (Xi, Ci),
- Where Xi denotes feature variables and ‘i’ are data points ranging from i=1, 2, ….., n
- Ci denotes the output class for Xi for each i
The condition, Ci ∈ {1, 2, 3, ……, c} is acceptable for all values of ‘i’ by assuming that the total number of classes is denoted by ‘c’.
Now let’s pretend that there’s a data point ‘x’ whose output class needs to be predicted. This can be done by using the K-Nearest Neighbour (KNN) Algorithm.
KNN Algorithm Pseudocode:
Calculate D(x, xi), where 'i' =1, 2, ….., n and 'D' is the Euclidean measure between the data points.
The calculated Euclidean distances must be arranged in ascending order.
Initialize k and take the first k distances from the sorted list.
Figure out the k points for the respective k distances.
Calculate ki, which indicates the number of data points belonging to the ith class among k points i.e. k ≥ 0
If ki >kj ∀ i ≠ j; put x in class i.
The above pseudocode can be used for solving a classification problem by using the KNN Algorithm.
Before we get into the practical implementation of KNN, let’s look at a real-world use case of the KNN algorithm.
KNN Algorithm Use-Case
Surely you have shopped on Amazon! Have you ever noticed that when you buy a product, Amazon gives you a list of recommendations based on your purchase? Not only this, Amazon displays a section which says, ‘customers who bought this item also bought this.. ‘.
Machine learning plays a huge role in Amazon’s recommendation system. The logic behind a recommendation engine is to suggest products to customers based on other customers who have a similar shopping behavior.
Consider an example, let’s say that a customer A who loves mystery novels bought the Game Of Thrones and Lord Of The Rings book series. Now a couple of weeks later, another customer B who reads books of the same genre buys Lord Of The Rings. He does not buy the Game of Thrones book series, but Amazon recommends it customer B since his shopping behaviors and his choice in books is quite similar to customer A.
Therefore, Amazon recommends products to customers based on how similar their shopping behaviors are. This similarity can be understood by implementing the KNN algorithm which is mainly based on feature similarity.
Now that you know how KNN works and how it is used in real-world applications, let’s discuss the implementation of KNN using the R language. If you’re not familiar with the R language, you can go through this video recorded by our Machine Learning experts.
Practical Implementation Of KNN Algorithm In R
Problem Statement: To study a bank credit dataset and build a Machine Learning model that predicts whether an applicant’s loan can be approved or not based on his socio-economic profile.
Dataset Description: The bank credit dataset contains information about 1000s of applicants. This includes their account balance, credit amount, age, occupation, loan records, etc. By using this data, we can predict whether or not to approve the loan of an applicant.
Logic: This problem statement can be solved using the KNN algorithm that will classify the applicant’s loan request into two classes:
Now that you know the objective of this project, let’s get started with the coding part.
Step 1: Import the dataset
#Import the dataset
loan <- read.csv("C:/Users/zulaikha/Desktop/DATASETS/knn dataset/credit_data.csv")
After importing the dataset, let’s take a look at the structure of the dataset:
str(loan)
'data.frame': 1000 obs. of 21 variables:
$ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
$ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
$ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
$ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
$ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
$ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
$ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
$ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
$ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
$ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
$ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
$ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
$ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
$ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
$ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
$ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
$ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
$ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
$ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
$ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
$ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
Note that, the ‘Creditability’ variable is our output variable or the target variable. The value of the credibility variable represents whether an applicant’s loan is approved or rejected.
Step 2: Data Cleaning
From the structure of the dataset, we can see that there are 21 predictor variables that will help us decide whether or not an applicant’s loan must be approved.
Some of these variables are not essential in predicting the loan of an applicant, for example, variables such as Telephone, Concurrent. Credits, Duration.in.Current.address, Type.of.apartment, etc. Such variables must be removed because they will only increase the complexity of the Machine Learning model.
loan.subset <- loan[c('Creditability','Age..years.','Sex...Marital.Status','Occupation','Account.Balance','Credit.Amount','Length.of.current.employment','Purpose')]
In the above code snippet, I’ve filtered down the predictor variables. Now let’s take a look at how our dataset looks:
str(loan.subset)
'data.frame': 1000 obs. of 8 variables:
$ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
$ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
$ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
$ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
$ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
$ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
$ Length.of.current.employment: int 2 3 4 3 3 2 4 2 1 1 ...
$ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
Now we have narrowed down 21 variables to 8 predictor variables that are significant for building the model.
Step 3: Data Normalization
You must always normalize the data set so that the output remains unbiased. To explain this, let’s take a look at the first few observations in our data set.
head(loan.subset)
  Creditability Age..years. Sex...Marital.Status Occupation Account.Balance Credit.Amount
1 1 21 2 3 1 1049
2 1 36 3 3 1 2799
3 1 23 2 2 2 841
4 1 39 3 2 1 2122
5 1 38 3 2 1 2171
6 1 48 3 2 1 2241
  Length.of.current.employment Purpose
1 2 2
2 3 0
3 4 9
4 3 0
5 3 0
6 2 0
Notice the Credit amount variable, its value scale is in 1000s, whereas the rest of the variables are in single digits or 2 digits. If the data isn’t normalized it will lead to a baised outcome.
#Normalization
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x))) }
In the below code snippet, we’re storing the normalized data set in the ‘loan.subset.n’ variable and also we’re removing the ‘Credibility’ variable since it’s the response variable that needs to be predicted.
loan.subset.n <- as.data.frame(lapply(loan.subset[,2:8], normalize))
This is the normalized data set:
head(loan.subset.n)
   Age..years. Sex..Marital Occupation Account.Balance Credit.Amount
1 0.03571429 0.3333333 0.6666667 0.0000000 0.04396390
2 0.30357143 0.6666667 0.6666667 0.0000000 0.14025531
3 0.07142857 0.3333333 0.3333333 0.3333333 0.03251898
4 0.35714286 0.6666667 0.3333333 0.0000000 0.10300429
5 0.33928571 0.6666667 0.3333333 0.0000000 0.10570045
6 0.51785714 0.6666667 0.3333333 0.0000000 0.10955211
Length.of.current.employment Purpose
            0.25 0.2
            0.50 0.0
            0.75 0.9
            0.50 0.0
            0.50 0.0
            0.25 0.0
Step 4: Data Splicing
After cleaning the data set and formatting it, the next step is data splicing. Data splicing basically involves splitting the data set into training and testing data set. This is done in the following code snippet:
set.seed(123)
dat.d <- sample(1:nrow(loan.subset.n),size=nrow(loan.subset.n)*0.7,replace = FALSE) #random selection of 70% data.
train.loan <- loan.subset[dat.d,] # 70% training data
test.loan <- loan.subset[-dat.d,] # remaining 30% test data
After deriving the training and testing data set, the below code snippet is going to create a separate data frame for the ‘Creditability’ variable so that our final outcome can be compared with the actual value.
#Creating seperate dataframe for 'Creditability' feature which is our target.
train.loan_labels <- loan.subset[dat.d,1]
test.loan_labels <-loan.subset[-dat.d,1]
Step 5: Building a Machine Learning model
At this stage, we have to build a model by using the training data set. Since we’re using the KNN algorithm to build the model, we must first install the ‘class’ package provided by R. This package has the KNN function in it:
#Install class package
install.packages('class')
# Load class package
library(class)
Next, we’re going to calculate the number of observations in the training data set. The reason we’re doing this is that we want to initialize the value of ‘K’ in the KNN model. One of the ways to find the optimal K value is to calculate the square root of the total number of observations in the data set. This square root will give you the ‘K’ value.
#Find the number of observation
NROW(train.loan_labels) 
[1] 700
So, we have 700 observations in our training data set. The square root of 700 is around 26.45, therefore we’ll create two models. One with ‘K’ value as 26 and the other model with a ‘K’ value as 27.
knn.26 <- knn(train=train.loan, test=test.loan, cl=train.loan_labels, k=26)
knn.27 <- knn(train=train.loan, test=test.loan, cl=train.loan_labels, k=27)
Step 6: Model Evaluation
After building the model, it is time to calculate the accuracy of the created models:
#Calculate the proportion of correct classification for k = 26, 27
ACC.26 <- 100 * sum(test.loan_labels == knn.26)/NROW(test.loan_labels)
ACC.27 <- 100 * sum(test.loan_labels == knn.27)/NROW(test.loan_labels)
ACC.26
[1] 67.66667
ACC.27
[1] 67.33333
As shown above, the accuracy for K = 26 is 67.66 and for K = 27 it is 67.33. We can also check the predicted outcome against the actual value in tabular form:
# Check prediction against actual value in tabular form for k=26
table(knn.26 ,test.loan_labels)
       test.loan_labels
knn.26 0 1
0 11 7
1 90 192
knn.26
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
[51] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[101] 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[151] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
[201] 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
[251] 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
Levels: 0 1
# Check prediction against actual value in tabular form for k=27
table(knn.27 ,test.loan_labels)
       test.loan_labels
knn.27 0 1
0 11 8
1 90 191
knn.27
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
[51] 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[101] 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[151] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
[201] 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
[251] 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
Levels: 0 1
You can also use the confusion matrix to calculate the accuracy. To do this we must first install the infamous Caret package:
install.packages('caret')
library(caret)
Now, let’s use the confusion matrix to calculate the accuracy of the KNN model with K value set to 26:
confusionMatrix(table(knn.26 ,test.loan_labels))
Confusion Matrix and Statistics
       test.loan_labels
knn.26 0 1
     0 11 7
     1 90 192
                                          
               Accuracy : 0.6767          
                 95% CI : (0.6205, 0.7293)
    No Information Rate : 0.6633          
    P-Value [Acc > NIR] : 0.3365          
                                          
                  Kappa : 0.0924          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.10891         
            Specificity : 0.96482         
         Pos Pred Value : 0.61111         
         Neg Pred Value : 0.68085         
             Prevalence : 0.33667         
         Detection Rate : 0.03667         
   Detection Prevalence : 0.06000         
      Balanced Accuracy : 0.53687         
                                          
       'Positive' Class : 0
So, from the output, we can see that our model predicts the outcome with an accuracy of 67.67% which is good since we worked with a small data set. A point to remember is that the more data (optimal data) you feed the machine, the more efficient the model will be.
Step 7: Optimization
In order to improve the accuracy of the model, you can use n number of techniques such as the Elbow method and maximum percentage accuracy graph. In the below code snippet, I’ve created a loop that calculates the accuracy of the KNN model for ‘K’ values ranging from 1 to 28. This way you can check which ‘K’ value will result in the most accurate model:
i=1
k.optm=1
for (i in 1:28){
+ knn.mod <- knn(train=train.loan, test=test.loan, cl=train.loan_labels, k=i)
+ k.optm[i] <- 100 * sum(test.loan_labels == knn.mod)/NROW(test.loan_labels)
+ k=i
+ cat(k,'=',k.optm[i],'\n')
+ }
1 = 60.33333
2 = 58.33333
3 = 60.33333
4 = 61
5 = 62.33333
6 = 62
7 = 63.33333
8 = 63.33333
9 = 63.33333
10 = 64.66667
11 = 64.66667
12 = 65.33333
13 = 66
14 = 64
15 = 66.66667
16 = 67.66667
17 = 67.66667
18 = 67.33333
19 = 67.66667
20 = 67.66667
21 = 66.33333
22 = 67
23 = 67.66667
24 = 67
25 = 68
26 = 67.66667
27 = 67.33333
28 = 66.66667
From the output you can see that for K = 25, we achieve the maximum accuracy, i.e. 68%. We can also represent this graphically, like so:
#Accuracy plot
plot(k.optm, type="b", xlab="K- Value",ylab="Accuracy level")
The above graph shows that for ‘K’ value of 25 we get the maximum accuracy. Now that you know how to build a KNN model, I’ll leave it up to you to build a model with ‘K’ value as 25.
Now that you know how the KNN algorithm works, I’m sure you’re curious to learn more about the various Machine learning algorithms. Here’s a list of blogs that covers the different types of Machine Learning algorithms in depth:
With this, we come to the end of this KNN Algorithm In R blog. I hope you found this blog informative, if you have any doubts, leave a comment and we’ll get back to you.
Stay tuned for more blogs on trending technologies.


Thank you for the article.
Ho would you compare KNN vs Random Forest?, what is better?