How confusion matrices can help ML models in Cyber Crime Investigation

Using machine learning concepts in malware analysis.

Sachin Joshi
6 min read · Jun 5, 2021

Let us first understand what a confusion matrix is.

WHAT IS A CONFUSION MATRIX?

A confusion matrix is a table that summarizes the performance of a classification model (or “classifier”) on a set of test data for which the actual values are known. It is the tool we reach for when we want to measure the effectiveness of a trained model: it compares each predicted class with the actual class and counts the outcomes.

Let’s now define the terminology used in the confusion matrix and in our use case. These are whole numbers (not rates):

  • true positives (TP): The model predicted “yes” (the patient has the disease), and in reality they do have the disease.
  • true negatives (TN): The model predicted “no”, and in reality they don’t have the disease.
  • false positives (FP): The model predicted “yes”, but in reality they don’t have the disease. (Also known as a “Type I error.”)
  • false negatives (FN): The model predicted “no”, but in reality they do have the disease. (Also known as a “Type II error.”)
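
As a minimal sketch of how these four counts are obtained, the Python snippet below tallies them from a list of actual labels and a list of predicted labels. The function name `confusion_counts` and the six sample labels are illustrative assumptions, not part of any library or dataset.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Tally TP, TN, FP and FN for a 2-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted yes, actually yes
        elif p != positive and a != positive:
            tn += 1   # predicted no, actually no
        elif p == positive and a != positive:
            fp += 1   # predicted yes, actually no (Type I error)
        else:
            fn += 1   # predicted no, actually yes (Type II error)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical labels for six patients
actual    = ["yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

counts = confusion_counts(actual, predicted)
print(counts)
```

The four counts always add up to the number of test samples, which is a quick sanity check on any confusion matrix.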

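How to calculate confusion matrix metrics for a 2-class classification problem?

Let’s understand the confusion matrix through math. The figures below assume a test set of 165 patients with TP = 100, TN = 50, FP = 10, and FN = 5.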

  • Precision:

Precision measures how reliable the positive predictions are: when the model predicts yes, how often is it correct?

TP/predicted yes = 100/110 = 0.91

  • Accuracy:

Accuracy is the ratio of correct predictions made by the model to the total number of samples.

Overall, how often is the classifier correct?

(TP+TN)/total = (100+50)/165 = 0.91

  • Error Rate:

Overall, how often is it wrong?

(FP+FN)/total = (10+5)/165 = 0.09

equivalent to 1 minus Accuracy

also known as “Misclassification Rate”

  • True Positive Rate:

When it’s actually yes, how often does it predict yes?

TP/actual yes = 100/105 = 0.95

also known as “Sensitivity” or “Recall”

  • False Positive Rate:

When it’s actually no, how often does it predict yes?

FP/actual no = 10/60 = 0.17

  • True Negative Rate:

When it’s actually no, how often does it predict no?

TN/actual no = 50/60 = 0.83

equivalent to 1 minus False Positive Rate

also known as “Specificity”

  • Prevalence:

How often does the yes condition actually occur in our sample?

actual yes/total = 105/165 = 0.64
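
The arithmetic above can be checked with a short Python sketch. The counts TP = 100, TN = 50, FP = 10, FN = 5 are the ones implied by the fractions in the formulas (for example, "predicted yes" = 110 = TP + FP); the variable and key names are illustrative.

```python
# Counts implied by the worked fractions above (165 samples in total)
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN

metrics = {
    "accuracy":            (TP + TN) / total,   # overall, how often correct
    "error_rate":          (FP + FN) / total,   # 1 - accuracy
    "true_positive_rate":  TP / (TP + FN),      # sensitivity / recall
    "false_positive_rate": FP / (FP + TN),
    "true_negative_rate":  TN / (TN + FP),      # specificity
    "precision":           TP / (TP + FP),
    "prevalence":          (TP + FN) / total,   # share of actual "yes"
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that accuracy and precision both round to 0.91 here only by coincidence: they are computed from different denominators (all 165 samples versus the 110 predicted positives).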

WHAT IS CYBERSECURITY?

Cybersecurity is the way in which systems, networks, and programs utilize technologies, processes, and practices to protect against digital attacks. Cyberattacks often target sensitive information and data, and by gaining access to this data, cyber criminals extort money from users and companies, interrupt normal processes, and take down entire sites.

Effective cybersecurity is a crucial component to any business, and even more is at stake for small- and medium-sized organizations, as they often lack the resources to recover from such attacks. In our modern data-driven world, protecting against cyber-attacks is becoming increasingly challenging due to the growing amount of data and devices available.

Cybersecurity Threats

The massive growth in internet use has given new proportions to a longstanding challenge in technology. Cyberthreats are now among the most urgent concerns in the world, according to the Global Risks Report.

In this troublesome context, machine learning can either aggravate or mitigate the issue. As the WEF points out, new malware uses machine learning techniques to avoid detection. At the same time, malware analysis techniques are being developed and implemented to better detect these cyber threats (World Economic Forum, 2019). In general, the objective of malware analysis is to identify and classify the structures that make up a piece of malware, by dissecting its binary code and observing its behavior.

Types of Cybersecurity Threats

There are a few main types of cybersecurity threats any individual and organization should be aware of:

  • Phishing: Phishing attacks involve fraudulent emails that look as if they come from a credible source but are in reality spoofed messages meant to deploy malicious software. The most common type of cyber-attack, phishing can steal sensitive information like credit card numbers and login credentials. Solutions include awareness training or technology that filters these emails.
  • Ransomware: Ransomware is a type of malicious software that aims to extort money by restricting access to files or the computer system until a ransom is paid. However, payment does not guarantee that the system will be restored or that the files will be recovered. A new report suggests that organisations in India are hit by an average of 213 ransomware attacks each week (according to Indian Express).
  • Malware: Malware is software designed to gain access to a computer and cause damage. It is code written by cybercriminals to attack any programmable device, service, or network.
  • Social engineering: Social engineering is a tactic that manipulates people into revealing sensitive information, with the aim of soliciting monetary payments or gaining access to confidential data. It is often combined with other threats to increase effectiveness. Victims are manipulated into making security mistakes and giving away sensitive data without realizing it. This can happen through direct interaction with the victim on social media, or through SMS or voice messages.

To illustrate how machine learning can be employed in malware analysis, consider the task of disassembling a malicious program. To properly obtain the binary code in its intelligible form, it is crucial to know which compiler generated the executable program. Without this information, it is rather impractical to identify a correspondence between a set of incomprehensible commands and standard library functions. This problem is known as the compiler provenance (CP) problem.

Machine learning is widely used for malware detection, and its success relies heavily on selecting the features that make up the data for analysis. In this research, we use machine learning to distinguish malicious software from benign software. The algorithms used to carry out the machine learning procedure include instance-based learning (IB1), Random Forest (RF), Naïve Bayes, and Support Vector Machine (SVM). Based on our research, these algorithms help produce accurate detection. In machine learning, several metrics are used to evaluate the effectiveness of each algorithm [14]. In this work, we use a confusion matrix and Youden's index. The weighted mean of these metrics is also taken into consideration: it assigns a weight to each element so that each element contributes to the final result in proportion to its importance.
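
To make the weighted mean concrete, here is a small Python sketch. It assumes the weights are the number of samples in each class, which is one common choice; the text does not say which weights were actually used, and the recall values and sample counts below are hypothetical.

```python
def weighted_mean(values, weights):
    """Mean in which each value contributes in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical per-class recall for [malicious, benign] ...
recall_per_class = [0.90, 0.80]
# ... weighted by a hypothetical number of samples in each class
samples_per_class = [300, 100]

weighted_recall = weighted_mean(recall_per_class, samples_per_class)
plain_recall = sum(recall_per_class) / len(recall_per_class)
print(weighted_recall, plain_recall)
```

Because the malicious class has three times as many samples, the weighted result sits closer to its recall than the plain average does.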

SIMPLE USE CASE USED IN INDUSTRY TO PREDICT CYBERTHREATS

Cyber Crime investigation using the confusion matrix

True positive (tp), false positive (fp), true negative (tn), and false negative (fn) values are used to calculate the following performance measures:

  1. True Positive Rate/recall/sensitivity (tpr): the fraction of ransomware samples correctly identified as ransomware;
  2. False Positive Rate (fpr = 1 − tnr): the fraction of goodware samples incorrectly identified as ransomware;
  3. True Negative Rate/specificity (tnr): the fraction of goodware samples correctly identified as goodware;
  4. False Negative Rate (fnr = 1 − tpr): the fraction of ransomware samples incorrectly classified as goodware;
  5. Accuracy is reported as the fraction of all samples correctly identified. More specifically, Accuracy = (tp + tn) / (tp + tn + fp + fn);
  6. Precision is calculated as precision = tp / (tp + fp); and
  7. Youden's index is calculated as Y = tpr + tnr − 1
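
The seven measures above can be computed together from the four counts. The Python sketch below is one way to do it; the function name `detection_report` and the sample counts (200 ransomware and 200 goodware samples) are hypothetical and chosen only to illustrate the formulas.

```python
def detection_report(tp, fp, tn, fn):
    """Performance measures for a ransomware-vs-goodware classifier."""
    tpr = tp / (tp + fn)   # ransomware correctly flagged
    tnr = tn / (tn + fp)   # goodware correctly passed
    return {
        "tpr": tpr,
        "fpr": 1 - tnr,
        "tnr": tnr,
        "fnr": 1 - tpr,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "youden": tpr + tnr - 1,
    }

# Hypothetical counts: 200 ransomware samples and 200 goodware samples
report = detection_report(tp=180, fp=10, tn=190, fn=20)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Youden's index ranges from −1 to 1: a value of 0 means the classifier does no better than chance, and 1 means it makes no false positives and no false negatives.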

Thank you for reading. Do share the article.
