What is K-means Clustering & its real use-cases in the security domain

5 min read · Jul 18, 2021

So, let us start by understanding how machine learning tasks are performed:

Machine Learning tasks can be performed in two ways:
1. Supervised Learning (Labeled Data)
2. Unsupervised Learning (Unlabeled Data)

Supervised Learning

A task is supervised if you are using labeled data. We use the term labeled to mean that the data already contains the solutions, called labels.

Unsupervised Learning

A task is considered unsupervised if you are using unlabeled data. This means you don’t need to provide the model with any kind of label or solution while it is being trained.

Major steps in Machine Learning Process
Step1 : Define the Problem
Step2 : Build the Dataset
Step3 : Train the model
Step4 : Evaluate the model
Step5 : Use the model

Now it is time to understand what K-means clustering is

K-means clustering is an unsupervised learning algorithm. Unlike supervised learning, where we have labeled data, here the dataset comes with no predefined labels.

Here we have a set of data that we want to group; as the name suggests, we want to put the data points into clusters. By this I mean putting together objects that are similar in nature or have similar characteristics. That is what K-means clustering is all about.

The term “K” is a number: we are basically telling the system how many clusters we want.

Let us consider an example: we will identify the bowlers and batsmen with K-means clustering. In layman’s terms, we have a list of players and we want to divide that list into two groups, bowlers and batsmen.

It is reasonable to expect that batsmen will have a higher number of runs and bowlers will have a higher number of wickets.

On the Y-axis we have runs made and on the X-axis we have wickets taken. Here the value of “K” is 2, which means we have 2 clusters: batsmen and bowlers.

Now we add the concept of centroids. Every cluster has its own centroid, a point that represents the center of the cluster. On the basis of the centroid values we will group the data present in the list.

The next step is to calculate the distance of each data point from each of the randomly assigned centroids. For every point, the distance is measured from both centroids. Whichever distance is smaller decides the cluster: the data point is assigned to its nearest centroid. The distance between the observed value and the centroid is measured as the Euclidean distance.
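The assignment step can be sketched in a few lines of plain Python. The player numbers and the two starting centroids below are made up for illustration; each point is a (wickets, runs) pair, matching the X and Y axes of the example. For two points p and q, the Euclidean distance is d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2), which is what `math.dist` computes.

```python
import math

# Hypothetical (wickets, runs) pairs; these are not real player statistics.
players = [(2, 4500), (5, 5200), (1, 3900), (150, 300), (210, 450), (180, 250)]

# Two illustrative starting centroids (K = 2).
# In the real algorithm these start at random positions.
centroids = [(10, 4000), (160, 400)]

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

assignments = [nearest_centroid(p, centroids) for p in players]
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Because runs are numerically much larger than wickets in this toy data, they dominate the distance; scaling the features first is a common remedy.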

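Each centroid is then repositioned to the mean of the points assigned to it. If the clusters are not stable, the assignment and repositioning steps repeat until the clusters are fully stable, that is, until points no longer change cluster.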
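The whole loop (assign each point to its nearest centroid, move each centroid, repeat until nothing changes) might be sketched as follows. This is a minimal plain-Python illustration, not the scikit-learn implementation; the `kmeans` function name, the player data and the starting centroids are all hypothetical.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Assign points to the nearest centroid, then move each centroid to the
    mean of its points, repeating until the assignments stop changing."""
    centroids = list(centroids)  # work on a copy
    assignments = None
    for _ in range(max_iters):
        new_assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # clusters are stable
        assignments = new_assignments
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids

# Hypothetical (wickets, runs) pairs and arbitrary starting centroids.
players = [(2, 4500), (5, 5200), (1, 3900), (150, 300), (210, 450), (180, 250)]
labels, final_centroids = kmeans(players, [(0, 0), (100, 1000)])
print(labels)
print(final_centroids)
```

With this toy data the first three players (high runs) end up in one cluster and the last three (high wickets) in the other.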

Code for K-means Clustering:

First we import the various libraries that are required.

Next we create blobs with make_blobs, a helper readily available in scikit-learn that generates clusters of synthetic data points.
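A minimal sketch of the import and blob-creation steps might look like this, assuming scikit-learn is installed. The value `centers=3` comes from this post; `n_samples`, `cluster_std` and `random_state` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs

# Generate synthetic 2-D points grouped around 3 centers.
# n_samples, cluster_std and random_state are illustrative choices.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

print(X.shape)  # (300, 2)
```

`make_blobs` also returns the true group of each point (`y_true` here), but K-means never sees it; it is only useful for checking the result afterwards.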

To know more about the scikit-learn library, visit scikit-learn.org

We have set centers=3, which means we want 3 clusters of test data; as you can see, there are 3 distinguishable clusters.

After generating the random data with make_blobs, we create an instance of the clustering model and then call the fit method, as with any other machine learning model, to train it.
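The training step could be sketched as below. `KMeans` and `fit` are the scikit-learn names; the blob parameters other than `centers=3`, along with `n_init` and `random_state`, are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# n_clusters is the "K" in K-means; n_init and random_state are illustrative choices.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)  # no labels are passed: the model only sees X

print(kmeans.cluster_centers_.shape)  # (3, 2)
```

After fitting, `cluster_centers_` holds one centroid per cluster.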

y_kmeans holds the cluster assigned to each data point, that is, the centroid it is nearest to.
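A sketch of how `y_kmeans` might be computed, assuming it is the output of `predict` (the usual scikit-learn pattern). The data parameters are the same illustrative assumptions as before, and `distances` is a name introduced here only to show the per-centroid distances.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cluster index (0, 1 or 2) of the nearest centroid for every data point.
y_kmeans = kmeans.predict(X)

# Distance from every data point to each of the 3 centroids.
distances = kmeans.transform(X)

print(y_kmeans[:10])
print(distances.shape)  # (300, 3)
```

The cluster numbers themselves are arbitrary; only the grouping matters.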

Final Output:

K-means use-cases in the security domain

A data archive exists to preserve data so that it stays available for future studies and generations; it does not perish unless someone deletes it. There are quite a few ways that data can be lost from a file. It might be deleted accidentally or maliciously. While a lot of software is available that can watch for specific known threats on an operating system, the aim here is to detect unusual, anomalous behavior, such as random file removal patterns.

Our approach to detecting this kind of problem is machine learning. We can create a machine learning model and train it so that it learns the normal behaviour of deletions; anything that falls outside this learned behaviour is treated as an outlier.

This is called data inspection: looking for anything that is outside the norm.

We trained a model on file deletion patterns and implemented a k-means clustering solution to detect anomalous file deletions. This approach can also be used to detect other anomalies.
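One way such a detector could work is sketched below; it is not the code from the linked repository. The features (files deleted and distinct directories touched per time window), the synthetic "normal" data and the 99th-percentile threshold are all assumptions made for illustration. The idea is to fit K-means on normal behaviour only, then flag any new window that lies unusually far from every centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical features per time window: [files deleted, distinct directories touched].
# Two made-up "normal" modes: routine cleanup and occasional bulk cleanup.
routine = rng.normal(loc=[5, 2], scale=[1.5, 0.5], size=(200, 2))
bulk = rng.normal(loc=[40, 3], scale=[5.0, 0.8], size=(50, 2))
normal_windows = np.vstack([routine, bulk])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal_windows)

# Distance from each training window to its nearest centroid.
train_dist = kmeans.transform(normal_windows).min(axis=1)
threshold = np.percentile(train_dist, 99)  # illustrative cut-off

def is_anomalous(windows):
    """Flag windows that lie farther from every centroid than the threshold."""
    dist = kmeans.transform(np.asarray(windows, dtype=float)).min(axis=1)
    return dist > threshold

# One ordinary-looking window and one mass deletion across many directories.
new_windows = [[6, 2], [300, 120]]
flags = is_anomalous(new_windows)
print(flags)
```

In practice the features would usually be scaled first, and the threshold tuned against how many false alarms are acceptable.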

Unsupervised learning is often used in the field of anomaly detection, e.g. detecting security breaches, where labeled data is unavailable.

This technique identifies groups or clusters of similar data and can be used to identify anomalous events (outliers).

Find code here:

sachinjoshi72/K-Means-Clustering


github.com

Connect with me here:

Sachin Joshi - Summer Internship - LinuxWorld Informatics Pvt Ltd | LinkedIn

Currently pursuing a Bachelor of Science in Information Technology (BSc.IT), I have created and presented projects based…

www.linkedin.com

Thank you for reading. Do share the blog.