K-nearest neighbors (KNN) is a foundational technique in machine learning (ML). This guide will help you understand KNN, how it works, and its applications, benefits, and challenges.
Table of contents
What is the k-nearest neighbors algorithm?
Difference between k-nearest neighbors and other algorithms
How is KNN used in machine learning?
What is the k-nearest neighbors algorithm?
The k-nearest neighbors (KNN) algorithm is a supervised learning technique used for both classification and regression. KNN determines the label (classification) or predicted value (regression) of a given data point by evaluating nearby data points in the dataset.
How does KNN work?
KNN is based on the premise that data points that are spatially close to each other in a dataset tend to have similar values or belong to similar categories. KNN uses this simple but powerful idea to classify a new data point by finding a preset number, k, of neighboring data points within the labeled training dataset. The value k is a hyperparameter: a configuration variable that ML practitioners set in advance to control how the algorithm behaves.
Then, the algorithm identifies the k labeled data points that are closest to the new data point and assigns it the label held by the majority of those neighbors (for regression, it averages their values instead). The chosen value of k affects model performance. Smaller values make the model more sensitive to noise, while larger values make it more robust but may cause it to miss local patterns.
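The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function name `knn_classify`, the toy points, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Distance from the query to every labeled point (Euclidean in this sketch).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two clusters, plus one mislabeled point inside cluster A.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (1.1, 1.0)]
labels = ["A", "A", "A", "B", "B", "B"]  # the last point is deliberate noise

query = (1.1, 0.95)
label_k1 = knn_classify(points, labels, query, k=1)
label_k3 = knn_classify(points, labels, query, k=3)
print(label_k1, label_k3)
```

With k=1 the query takes the label of the single mislabeled point next to it ("B"), while with k=3 two of the three nearest neighbors belong to cluster A, so the noisy point is outvoted. This is the noise sensitivity of small k values in miniature.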
The closeness, or distance, between data points is calculated using metrics originally developed to measure the similarity of points in a mathematical space. Common metrics include Euclidean distance, Manhattan distance, and Minkowski distance. KNN performance is affected by the chosen metric, and different metrics perform better with different types and sizes of data.
For example, the number of dimensions in the data, which are individual attributes describing each data point, can affect metric performance. Regardless of the chosen distance metric, the goal is to categorize or predict a new data point based on its distance from other data points.
- Euclidean distance is the distance along a straight line between two points in space and is the most commonly used metric. It’s best used for data with a lower number of dimensions and no significant outliers.
- Manhattan distance is the sum of the absolute differences between the coordinates of the data points being measured. This metric is useful when data is high-dimensional or when data points form a grid-like structure.
- Minkowski distance is a tunable metric that can act like either the Euclidean or Manhattan distance depending on the value of an adjustable parameter. Adjusting this parameter controls how distance is calculated, which is useful for adapting KNN to different types of data.
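The three metrics in the list above can be written out as short functions. This is a sketch for illustration; the function names and the parameter name `p` for the Minkowski exponent are choices made here, not taken from a particular library.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    """Tunable distance: p=1 matches Manhattan, p=2 matches Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0, 0), (3, 4)
print(euclidean(a, b))       # straight-line distance
print(manhattan(a, b))       # grid-style distance
print(minkowski(a, b, p=1))  # same as Manhattan
print(minkowski(a, b, p=2))  # same as Euclidean
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7, and the Minkowski function reproduces each one when its parameter is set to 2 or 1 respectively.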
Other, less common metrics include Chebyshev, Hamming, and Mahalanobis distances. These metrics are more specialized and are suited to particular data types and distributions. For example, the Mahalanobis distance measures the distance of a point from a distribution of points, taking into account the relationships between variables. As such, Mahalanobis distance is well suited for working with data where features use different scales or are correlated with each other.
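As a rough sketch of these specialized metrics, the functions below implement Chebyshev and Hamming distances, plus a Mahalanobis distance restricted to two features so that the covariance matrix can be inverted by hand. The function names, the two-feature restriction, and the sample data are assumptions made for this example.

```python
import math


def chebyshev(a, b):
    """Largest absolute difference along any single coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mahalanobis_2d(point, data):
    """Distance of a 2D `point` from the distribution of 2D points in `data`."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^T S^-1 d, with the inverse of the 2x2 matrix written out explicitly.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)


# Hypothetical data; the second copy puts feature 2 on a 1000x larger scale.
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
rescaled_data = [(x, y * 1000) for x, y in data]

original = mahalanobis_2d((3.0, 1.0), data)
rescaled = mahalanobis_2d((3.0, 1000.0), rescaled_data)
print(original, rescaled)
```

The two Mahalanobis results agree up to floating-point rounding even though one feature was rescaled by a factor of 1,000, which is the scale-insensitivity described above.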
KNN is often called a “lazy” learning algorithm because it doesn’t need training, unlike many other algorithms. Instead, KNN stores data and uses it to make decisions only when new data points need regression or classification. However, this means that predictions often have high computational requirements since the entire dataset is evaluated for each prediction.
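To make the "lazy" behavior concrete, here is a minimal sketch of a KNN classifier whose `fit` step only stores data. The class name `LazyKNN` and its methods are hypothetical and written for this example; they are not the API of any specific library.

```python
import math
from collections import Counter


class LazyKNN:
    """Toy KNN classifier: storing data is the whole 'training' step."""

    def __init__(self, k=3):
        self.k = k
        self.points = []
        self.labels = []

    def fit(self, points, labels):
        # No model is built here; the labeled data is simply appended.
        self.points.extend(points)
        self.labels.extend(labels)
        return self

    def predict(self, query):
        # All the work happens now: one distance per stored point.
        ranked = sorted(
            zip(self.points, self.labels),
            key=lambda pair: math.dist(pair[0], query),
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


model = LazyKNN(k=3).fit(
    [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)],
    ["low", "low", "low", "high", "high", "high"],
)
first = model.predict((0.5, 0.5))

# New labeled data is usable immediately, with no retraining step.
model.fit([(9, 9), (9, 10), (10, 9)], ["extreme"] * 3)
second = model.predict((9.5, 9.5))
print(first, second)
```

Note that `predict` scans every stored point, so the cost of each prediction grows with the amount of stored data.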
Difference between k-nearest neighbors and other algorithms
KNN is part of a larger family of supervised ML techniques geared toward classification and regression, which includes decision trees / random forests, logistic regression, and support vector machines (SVMs). However, KNN differs from these techniques due to its simplicity and direct approach to handling data, among other factors.
Decision trees and random forests
Like KNN, decision trees and random forests are used for classification and regression. However, these algorithms use explicit rules learned from the data during training, unlike KNN’s distance-based approach. Decision trees and random forests tend to have faster prediction speeds because they have pre-trained rules. This means they are better suited than KNN for real-time prediction tasks and handling large datasets.
Logistic regression
Logistic regression assumes that categories can be separated by a linear decision boundary: a straight line or, in higher-dimensional spaces, a hyperplane. KNN, on the other hand, doesn't assume a particular data distribution or boundary shape. As such, KNN can adapt more easily to complex or non-linear data, while logistic regression is best used when the categories are roughly linearly separable.
Support vector machines
Instead of looking at distances between points like KNN, support vector machines (SVMs) focus on creating a clear dividing line between groups of data points, often with the goal of making the gap between them as wide as possible. SVMs are well suited to complex datasets with many features and to cases where a clear separation between groups of data points is necessary. In comparison, KNN is simpler to use and understand but doesn't perform as well on large datasets.
How is KNN used in machine learning?
Many ML algorithms can handle only one type of task. KNN stands out for its ability to handle not one but two common use cases: classification and regression.
Classification
KNN classifies data points by using a distance metric to determine the k-nearest neighbors and assigning a label to the new data point based on the neighbors’ labels. Common KNN classification use cases include email spam classification, grouping customers into categories based on purchase history, and handwritten number recognition.
Regression
KNN performs regression by estimating the value of a data point based on the average (or weighted average) of its k-nearest neighbors. For example, KNN can predict house prices based on similar properties in the neighborhood, stock prices based on historical data for similar stocks, or temperature based on historical weather data in similar locations.
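The averaging step can be sketched as follows. The function name `knn_regress`, the inverse-distance weighting scheme, and the house sizes and prices are all assumptions invented for this illustration, not real market data.

```python
import math


def knn_regress(train_points, train_values, query, k=3, weighted=False):
    """Predict a value as the (optionally distance-weighted) mean of k neighbors."""
    # Pair each training value with its distance to the query, closest first.
    ranked = sorted(
        (math.dist(point, query), value)
        for point, value in zip(train_points, train_values)
    )[:k]
    if not weighted:
        return sum(value for _, value in ranked) / len(ranked)
    # Inverse-distance weights: closer neighbors count more.
    # The small epsilon avoids division by zero on exact matches.
    weights = [1.0 / (dist + 1e-9) for dist, _ in ranked]
    total = sum(w * value for w, (_, value) in zip(weights, ranked))
    return total / sum(weights)


# Hypothetical houses: size in square meters -> price in thousands.
sizes = [(50,), (60,), (80,), (100,), (120,)]
prices = [150.0, 180.0, 240.0, 300.0, 360.0]

plain = knn_regress(sizes, prices, (75,), k=2)
weighted = knn_regress(sizes, prices, (75,), k=2, weighted=True)
print(plain, weighted)
```

For a 75 square meter house with k=2, the neighbors are the 80 and 60 square meter houses. The plain average is 210, while the weighted average is about 225 because the 80 square meter house is three times closer and so carries three times the weight.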
Applications of the KNN algorithm in ML
Due to its relative simplicity and its ability to perform both classification and regression, KNN has a wide range of applications. These include image recognition, recommendation systems, and text classification.
Image recognition
Image recognition is one of the most common applications of KNN due to its classification abilities. KNN performs image recognition by comparing features in the unknown image, like colors and shapes, to features in a labeled image dataset. This makes KNN useful in fields like computer vision.
Recommendation systems
KNN can recommend products or content to users by comparing their preference data to the data of similar users. For example, if a user has listened to several classic jazz songs, KNN can find users with similar preferences and recommend songs that those users enjoyed. As such, KNN can help personalize the user experience by recommending products or content based on similar data.
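A user-based recommendation of this kind might look like the sketch below. It assumes each user is represented by a vector of listen counts over a fixed song list and uses cosine similarity to find neighbors; the function names, user names, song names, and counts are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two preference vectors (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(target, others, songs, k=2):
    """Suggest songs the target has not heard, based on the k most similar users."""
    neighbors = sorted(
        others.values(),
        key=lambda vec: cosine_similarity(target, vec),
        reverse=True,
    )[:k]
    scores = {}
    for vec in neighbors:
        for song, mine, theirs in zip(songs, target, vec):
            # Only score songs that are new to the target user.
            if mine == 0 and theirs > 0:
                scores[song] = scores.get(song, 0) + theirs
    return sorted(scores, key=scores.get, reverse=True)


songs = ["jazz_a", "jazz_b", "jazz_c", "rock_a", "rock_b"]
target = [5, 4, 0, 0, 0]  # listen counts for the user receiving recommendations
others = {
    "ana": [4, 5, 3, 0, 0],
    "ben": [5, 3, 4, 0, 1],
    "cy": [0, 0, 0, 5, 4],
}

recommendations = recommend(target, others, songs, k=2)
print(recommendations)
```

The two jazz listeners are selected as neighbors, so the unheard jazz song ranks first, and nothing is recommended from the rock-only listener, who is not among the nearest neighbors.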
Text classification
Text classification seeks to classify uncategorized text based on its similarity to pre-categorized text. KNN’s ability to evaluate the closeness of word patterns makes it an effective tool for this use case. Text classification is particularly useful for tasks like sentiment analysis, where texts are classified as positive, negative, or neutral, or determining the category of a news article.
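One simple way to measure the closeness of word patterns is to count words in each text and compare the counts. The sketch below assumes a bag-of-words representation with Euclidean distance; the function names and the tiny labeled review set are invented for illustration, and real systems typically use richer text features.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as lowercase word counts."""
    return Counter(text.lower().split())


def text_distance(a, b):
    """Euclidean distance between two word-count vectors."""
    words = set(a) | set(b)
    # A Counter returns 0 for words it has not seen.
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in words))


def classify_text(labeled_texts, query, k=3):
    """Label `query` by majority vote among the k most similar labeled texts."""
    query_bag = bag_of_words(query)
    ranked = sorted(
        labeled_texts,
        key=lambda item: text_distance(bag_of_words(item[0]), query_bag),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical pre-categorized reviews.
reviews = [
    ("great movie loved it", "positive"),
    ("loved the acting great fun", "positive"),
    ("wonderful great story", "positive"),
    ("terrible movie hated it", "negative"),
    ("boring plot hated the acting", "negative"),
    ("awful terrible waste", "negative"),
]

sentiment_a = classify_text(reviews, "great fun loved the story")
sentiment_b = classify_text(reviews, "terrible boring waste")
print(sentiment_a, sentiment_b)
```

Texts that share more words with the query end up closer to it, so the first query lands among the positive reviews and the second among the negative ones.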
Advantages of the KNN algorithm in ML
KNN has several notable benefits, including its simplicity, versatility, and lack of a training phase.
Simplicity
Compared to many other ML algorithms, KNN is easy to understand and use. The logic behind KNN is intuitive—it classifies or predicts (regression) new data points based on the values of nearby data points—making it a popular choice for ML practitioners, especially beginners. In addition, other than choosing a value for k, minimal hyperparameter tuning is required to use KNN.
Versatility
KNN can be used for both classification and regression tasks, which means that it can be applied to a large range of problems and types of data, from image recognition to numerical value prediction. Unlike specialized algorithms limited to one type of task, KNN can be applied to any appropriately structured labeled dataset.
No explicit training phase
Many ML models require a time- and resource-intensive training phase before becoming useful. KNN, on the other hand, simply stores the training data and uses it directly at prediction time. As such, KNN can be updated with new data, which is immediately available for use in prediction. This makes KNN particularly appealing for small datasets.
Disadvantages of the KNN algorithm in ML
Despite its strengths, KNN also comes with several challenges. These include high computational and memory costs, sensitivity to noise and irrelevant features, and the “curse of dimensionality.”
Computational cost of prediction
Since KNN calculates the distance between a new data point and every data point in its training dataset each time it makes a prediction, the computational cost of prediction grows in proportion to the size of the dataset. This can result in slow predictions when the dataset is large or when KNN is run on underpowered hardware.
Curse of dimensionality
KNN suffers from the so-called “curse of dimensionality,” which limits its ability to handle high-dimensional data. As the number of features in a dataset increases, most data points become sparse and almost equidistant from each other. As such, distance metrics become less useful, which makes it hard for KNN to find neighbors in high-dimensional datasets that are truly nearby.
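This effect can be observed with a small experiment on random data. The sketch below assumes points drawn uniformly from a unit hypercube and compares the nearest and farthest distances from a random query; the function name `distance_contrast` and the specific point counts and dimensions are arbitrary choices for illustration.

```python
import math
import random


def distance_contrast(n_points, n_dims, seed=0):
    """Ratio of nearest to farthest distance from a random query point.

    A ratio near 0 means the nearest neighbor is much closer than the
    farthest point. A ratio near 1 means all points are almost equidistant.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    distances = [math.dist(point, query) for point in points]
    return min(distances) / max(distances)


low = distance_contrast(n_points=200, n_dims=2)
high = distance_contrast(n_points=200, n_dims=500)
print(f"2 dimensions:   {low:.3f}")
print(f"500 dimensions: {high:.3f}")
```

In two dimensions the nearest point is far closer than the farthest one, so the ratio is small. In 500 dimensions the ratio moves much closer to 1, meaning the "nearest" neighbor is barely nearer than any other point.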
Memory intensive
A defining feature of KNN is that it stores the entire training dataset in memory for use at prediction time. When memory is limited or the dataset is large, this can be impractical. Other ML algorithms avoid this challenge by condensing training data into a compact set of learned parameters during training. KNN, on the other hand, must retain every data point, which means that memory use grows linearly with the size of the training dataset.
Sensitivity to noise and irrelevant features
The power of KNN lies in its simple, intuitive distance calculation. However, this also means that unimportant features or noise can cause misleading distance calculations, negatively affecting prediction accuracy. As such, feature selection or dimensionality reduction techniques, like principal component analysis (PCA), are often used with KNN to make sure the important features have the most influence on the prediction.
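The example below sketches how a single irrelevant feature can mislead the distance calculation. It uses manual feature selection (dropping a column known to be noise) rather than PCA, and the function names, data, and the assumption that the second feature is irrelevant are all made up for this illustration.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def select_features(points, keep):
    """Keep only the feature columns whose indices are listed in `keep`."""
    return [tuple(point[i] for i in keep) for point in points]


# Feature 0 separates the classes; feature 1 is large-scale noise (by assumption).
points = [
    (1.0, 900.0), (1.2, 700.0), (0.8, 100.0),  # class A: feature 0 near 1
    (5.0, 120.0), (5.2, 150.0), (4.8, 880.0),  # class B: feature 0 near 5
]
labels = ["A", "A", "A", "B", "B", "B"]
query = (1.1, 130.0)  # feature 0 says this point belongs with class A

with_noise = knn_classify(points, labels, query, k=3)
without_noise = knn_classify(
    select_features(points, keep=[0]),
    labels,
    select_features([query], keep=[0])[0],
    k=3,
)
print(with_noise, without_noise)
```

With both features, the noisy second feature dominates the distances and the query is labeled "B". After the irrelevant feature is dropped, the three nearest neighbors all belong to class A and the prediction changes accordingly.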