K-means Clustering and its use-case in the Security Domain
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
Unsupervised learning is a machine learning technique in which the training data has no labels. Instead, the algorithm tries to learn the underlying patterns or distributions that govern the data.
Clustering is one of the most common exploratory data analysis techniques used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different.
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.
The various types of clustering are:
- Hierarchical clustering
- Partitioning clustering
Hierarchical clustering is further subdivided into:
- Agglomerative clustering
- Divisive clustering
Partitioning clustering is further subdivided into:
- K-Means clustering
- Fuzzy C-Means clustering
K-means clustering is a type of unsupervised learning used when you have unlabeled data (i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K. Data points are clustered based on feature similarity.
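To make "feature similarity" concrete, k-means is commonly described as minimizing the within-cluster sum of squared distances. The formula below is a sketch of that standard objective; the symbols are notation introduced here and do not come from the article: x_i is a data point, C_j is the set of points assigned to cluster j, and mu_j is the centroid of that cluster.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller J means the points sit closer to their own centroid, which is one way to read "data points in the same cluster are very similar".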
Where is the k-means clustering algorithm used?
The k-means clustering algorithm is used in machine learning when the available historical data is unlabeled or not properly categorized, so the model has to learn without supervision.
What are the basic steps for K-means clustering?
- Step 1: Choose the number of clusters k.
- Step 2: Select k random points from the data as centroids.
- Step 3: Assign all the points to the closest cluster centroid.
- Step 4: Re-compute the centroids of newly formed clusters.
- Step 5: Repeat steps 3 and 4 until the centroids stop changing or a maximum number of iterations is reached.
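The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `max_iters` and `seed` parameters, the use of Euclidean distance, and the toy data are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [None] * len(points)

    for _ in range(max_iters):
        # Step 3: assign every point to its closest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels

        # Step 4: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))

    return centroids, labels


# Step 1: choose k. Here k = 2 for two well-separated toy groups (made-up data).
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(labels)
```

On this toy data the first three points should share one label and the last three the other. Which group gets label 0 depends on the random initial centroids.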
Applications of K-Means Clustering
K-Means clustering is applied in a variety of real-life settings and business cases, such as:
- Academic performance
- Diagnostic systems
- Search engines
- Wireless sensor networks
How Does K-Means Clustering Work?
The flowchart below shows how k-means clustering works:
Limitations of K-means Clustering
- It is often hard to predict the right number of clusters, that is, the value of k.
- The output is highly influenced by the initial inputs, for example, the chosen number of clusters.
- The arrangement of the input data can substantially affect the final result.
- When clusters have complex spatial shapes, k-means is not a good choice.
- K-means is sensitive to scale: rescaling the data points through normalization or standardization can change the output entirely.
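The sensitivity to scale follows from k-means relying on distances between points. The sketch below does not run k-means itself. It only shows, on made-up (age, income) records, that min-max normalization changes which record is closest to the first one. The helper names `min_max_scale` and `nearest_other` are hypothetical.

```python
import math


def min_max_scale(points):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [
        tuple((value - low) / span for value, low, span in zip(p, lows, spans))
        for p in points
    ]


def nearest_other(points, index):
    """Index of the point closest to points[index], excluding itself."""
    others = [i for i in range(len(points)) if i != index]
    return min(others, key=lambda i: math.dist(points[index], points[i]))


# Made-up (age in years, income in dollars) records.
people = [(30, 50_000), (60, 51_000), (31, 54_000), (45, 60_000)]

raw_neighbor = nearest_other(people, 0)
scaled_neighbor = nearest_other(min_max_scale(people), 0)
print(raw_neighbor, scaled_neighbor)
```

On the raw data, income dominates the distance, so the first record is closest to the second one (index 1). After scaling, age counts as much as income and the closest record becomes the third one (index 2). Cluster assignments built on these distances can shift in the same way.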
K-Means Use-Cases in the Security Domain
1. Identifying crime localities
When crime data is available for specific localities in a city, clustering on the category of crime, the area where it occurred, and the association between the two can give useful insight into crime-prone areas within the city.
2. Call record detail analysis
A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a customer. Combined with customer demographics, this information provides greater insight into the customer’s needs. We can cluster customer activity over 24 hours using the unsupervised k-means clustering algorithm to understand segments of customers by their hourly usage.
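Before any clustering, each customer's activity has to be turned into a fixed-length feature vector. One possible sketch, assuming a hypothetical record format of (customer_id, hour_of_day) pairs, counts events per hour to produce a 24-element profile per customer. Real CDRs carry many more fields, and the function name `hourly_profiles` is made up for this example.

```python
from collections import defaultdict


def hourly_profiles(records):
    """Map each customer id to a 24-element list of activity counts per hour."""
    profiles = defaultdict(lambda: [0] * 24)
    for customer_id, hour in records:
        profiles[customer_id][hour] += 1
    return dict(profiles)


# Hypothetical (customer_id, hour_of_day) events, for illustration only.
events = [("A", 9), ("A", 9), ("A", 18), ("B", 23), ("B", 0), ("B", 23)]
profiles = hourly_profiles(events)
print(profiles["A"][9], profiles["B"][23])
```

The resulting 24-dimensional vectors are the kind of input a k-means implementation could group into usage segments, such as daytime-heavy and late-night users.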
3. Automatic clustering of IT alerts
Large enterprise IT infrastructure components such as network, storage, or database systems generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened for prioritization for downstream processes. Clustering of data can provide insight into categories of alerts and mean time to repair, and help in failure predictions.
4. Crime document classification
Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard classification problem and k-means is a highly suitable algorithm for this purpose. Initial processing is needed to represent each document as a vector, using term frequency to identify commonly used terms that help classify the document. The document vectors are then clustered to help identify similarities in document groups.
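The vectorization step can be sketched as follows. This is a minimal illustration using raw term counts and a naive lowercase tokenizer. The function name `term_frequency_vectors` and the sample snippets are made up, and a real pipeline would likely add stop-word removal or weighting.

```python
import re
from collections import Counter


def term_frequency_vectors(documents):
    """Return (vocabulary, vectors): one raw term-count vector per document."""
    # Naive tokenizer: lowercase the text and keep runs of letters only.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Made-up report snippets, for illustration only.
reports = [
    "burglary reported at night burglary suspect fled",
    "phishing email stole bank login",
    "night burglary at warehouse",
]
vocabulary, vectors = term_frequency_vectors(reports)
print(vocabulary)
print(vectors[0])
```

Every document ends up as a vector of the same length, one position per vocabulary term, so the vectors can be compared by distance and passed to a k-means implementation.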
5. Cyber-profiling criminals
Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea is derived from criminal profiling, which provides information to the investigation division to classify the types of criminals who were at the crime scene.
6. Rideshare data analysis
The publicly available Uber ride information dataset provides a large amount of valuable data around traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.