Member Names:

1. Introduction

💡 Project Objective: The primary goal of this study is to develop and evaluate machine learning models that classify network traffic as either "Normal" or one of several attack categories. This report details the implementation of a supervised learning approach using XGBoost, highlighting the critical impact of data preprocessing and hyperparameter optimization.

The rapid proliferation of the Internet of Things (IoT) has revolutionized digital connectivity, embedding smart devices into critical infrastructure, healthcare, and smart homes. However, this expansion has introduced significant security vulnerabilities. Because of their limited processing power and the lack of standardized security protocols, IoT devices are frequent targets for cybercriminals and are often compromised to form "botnets" capable of launching massive Distributed Denial of Service (DDoS) attacks.

Traditional security mechanisms are often insufficient against dynamic IoT attacks. Consequently, there is an urgent need for intelligent Network Intrusion Detection Systems (NIDS) capable of identifying malicious traffic patterns in real time.

For this project, we utilize the BoT-IoT dataset, created by the UNSW Canberra Cyber Centre. Unlike older datasets, BoT-IoT was generated in a realistic testbed environment incorporating both normal traffic and various botnet attacks (DDoS, DoS, OS Fingerprinting, and Service Scanning).

2. Data Cleaning Process

Before performing any machine learning experiments, the raw BoT-IoT dataset underwent a general data cleaning procedure to ensure consistency, remove noise, and prepare the data for subsequent preprocessing steps. The initial distribution of attack categories (Figure: Pie Chart Before Cleaning) showed many labels that appeared redundant, overly granular, or incorrectly formatted. Several categories represented the same type of attack but were split into sub-labels (e.g., DoS UDP, DoS HTTP, DDoS UDP, DDoS TCP), and some labels for reconnaissance and theft appeared as very small slices, indicating either inconsistent naming or extremely rare occurrences.
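As a minimal sketch of how such sub-labels could be folded into their parent categories, the snippet below normalizes label formatting and compares the number of distinct categories before and after. The label strings, the `parent_category` helper, and the canonical names are hypothetical illustrations, not the exact values or code used in this project.

```python
from collections import Counter

# Hypothetical raw labels for illustration; the real BoT-IoT strings may differ.
raw_labels = [
    "DoS UDP", "DoS HTTP", "DDoS UDP", "DDoS TCP",
    "DDoS TCP", "Normal", "Reconnaissance ", "theft",
]

# Assumed canonical spellings for the parent categories.
CANONICAL = {
    "dos": "DoS",
    "ddos": "DDoS",
    "normal": "Normal",
    "reconnaissance": "Reconnaissance",
    "theft": "Theft",
}


def parent_category(label: str) -> str:
    """Trim whitespace and map a sub-label (e.g. "DoS UDP") to its parent."""
    cleaned = " ".join(label.split())          # strip and squeeze whitespace
    first_token = cleaned.split(" ")[0].lower()
    # Unknown labels are returned cleaned but otherwise unchanged.
    return CANONICAL.get(first_token, cleaned)


before = Counter(raw_labels)
after = Counter(parent_category(label) for label in raw_labels)

print("distinct labels before:", len(before))
print("distinct labels after: ", len(after))
for category, count in after.most_common():
    print(f"{category}: {count / len(raw_labels):.1%}")
```

Keeping the mapping in a single dictionary makes the merging rules easy to review, and leaving unknown labels unchanged means unexpected categories stay visible rather than being silently dropped.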

During cleaning, the dataset was processed to:

Figure: Pie Chart Before Cleaning (category_pie_before.png)