operational-technology-use-case (#35)
Operational technology attack/fault detection notebook, scripts and an example model.
closes issue #36

Authors:
  - https://github.com/gbatmaz

Approvers:
  - Tad ZeMicheal (https://github.com/tzemicheal)
  - https://github.com/raykallen

URL: #35
gbatmaz authored Feb 24, 2023
1 parent cb91cf4 commit 88253d9
Showing 8 changed files with 1,329 additions and 0 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -72,6 +72,9 @@ This model shows an application of a graph neural network for anomalous authenti
## [Asset Clustering using Windows Event Logs](/asset-clustering)
This model is a clustering algorithm to assign each host present in the dataset to a cluster based on aggregated and derived features from Windows Event Logs of that particular host.

## [Industrial Control System (ICS) Cyber Attack Detection](/operational-technology)
This model is an XGBoost classifier that predicts each event on a power system based on dataset features.

# Repo Structure
Each prototype has its own directory that contains everything belonging to the specific prototype. Directories can include the following subfolders and documentation:

94 changes: 94 additions & 0 deletions operational-technology/README.md
@@ -0,0 +1,94 @@
## Industrial Control System (ICS) Cyber Attack Detection

## Use Case
Classify events into various categories based on power system data.

### Version
1.0

### Model Overview
The model is a multi-class XGBoost classifier that predicts each event on a power system based on dataset features.

### Model Architecture
XGBoost Classifier

### Requirements
Requirements can be installed with
```
pip install -r requirements.txt
```
and for `p7zip`
```
apt update
apt install p7zip-full p7zip-rar
```

### Training

#### Training data
In this project, we use the publicly available [Industrial Control System (ICS) Cyber Attack Datasets](https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets) [1] from Oak Ridge National Laboratory (ORNL) and UAH. We use the 3-class version of the dataset, whose labels are Natural Events, No Events, and Attack Events. All features are numeric, and the dataset has no timestamp or interval information.
Dataset features contain synchrophasor measurements and data logs from Snort, a simulated control panel, and relays. There are 78377 rows in the dataset. In our notebooks and scripts, we download the compressed version from its source, then extract the files and merge all rows into a single dataframe. The `inf` values are replaced with `nan`, and the three labels are replaced with 0, 1, and 2.
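
The cleaning steps described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the helper name `clean_frame` and the toy column names `f1` and `f2` are made up, while the `marker` label column, the label-to-integer mapping, and the median fill follow the inference script included in this commit.

```python
import numpy as np
import pandas as pd

# Label encoding taken from the inference script in this commit.
LABEL_TO_ID = {"NoEvents": 0, "Attack": 1, "Natural": 2}


def clean_frame(df: pd.DataFrame, label_col: str = "marker") -> pd.DataFrame:
    """Return a cleaned copy: inf -> NaN, labels -> integers, NaN -> column median."""
    # Replace infinite values with NaN (returns a new dataframe).
    out = df.replace([np.inf, -np.inf], np.nan)
    # Encode the string labels as integers.
    out[label_col] = out[label_col].map(LABEL_TO_ID)
    # Fill missing feature values with the median of each column.
    feature_cols = [c for c in out.columns if c != label_col]
    out[feature_cols] = out[feature_cols].fillna(out[feature_cols].median())
    return out


# Tiny made-up frame (not real dataset rows) to show the effect.
toy = pd.DataFrame(
    {
        "f1": [1.0, np.inf, 3.0],
        "f2": [0.5, 0.7, -np.inf],
        "marker": ["NoEvents", "Attack", "Natural"],
    }
)
cleaned = clean_frame(toy)
print(cleaned)
```

Infinite values become `nan` first so that they do not distort the column medians used for filling.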

#### Training parameters

Most of the default XGBoost parameters are used in the training code. The performance could be improved by finding better hyperparameters. We experimented with a random search but excluded that part from the notebook for brevity.
The search was set up as follows:
```
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# xgb_clf (the XGBClassifier) and kfold (the cross-validation splitter)
# are assumed to be defined earlier in the notebook.
params = {
    'max_depth': [2, 3, 6, 10, 20],
    'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
    'n_estimators': [500, 750, 1000, 1200],
    'min_child_weight': [1, 2, 5, 8, 10, 12, 15, 20],
    'gamma': [0.5, 0.75, 1, 1.5, 2, 5, 7, 8, 10, 12],
    'subsample': [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
    'colsample_bytree': [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
}
scorer = {'f1_score': make_scorer(f1_score, average='weighted')}
grid = RandomizedSearchCV(xgb_clf, params, cv=kfold, random_state=2,
                          scoring=scorer, refit=False, n_iter=40)
```
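
To make the role of `n_iter=40` concrete, the sketch below draws 40 candidate settings from the same search space using only the standard library. It is an illustration of the idea, not scikit-learn's implementation: the helper `sample_candidates` is hypothetical, and it samples with replacement for simplicity.

```python
import random

# Search space mirroring the snippet above (one list per parameter).
param_space = {
    "max_depth": [2, 3, 6, 10, 20],
    "learning_rate": [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
    "n_estimators": [500, 750, 1000, 1200],
    "min_child_weight": [1, 2, 5, 8, 10, 12, 15, 20],
    "gamma": [0.5, 0.75, 1, 1.5, 2, 5, 7, 8, 10, 12],
    "subsample": [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
    "colsample_bytree": [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
}


def sample_candidates(space, n_iter, seed):
    """Draw n_iter random settings, picking one value per parameter each time."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


candidates = sample_candidates(param_space, n_iter=40, seed=2)

# Size of the full grid that an exhaustive search would have to cover.
grid_size = 1
for values in param_space.values():
    grid_size *= len(values)

print(f"evaluating {len(candidates)} of {grid_size} possible combinations")
```

Because only a small fraction of the full grid is evaluated, a different seed or a larger `n_iter` can surface a different best combination.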

The hyperparameter set below came up as the best combination; different experiments may give different results.
```
{'subsample': 0.8, 'n_estimators': 1200, 'min_child_weight': 2, 'max_depth': 20, 'learning_rate': 0.15, 'gamma': 0.5, 'colsample_bytree': 0.1}
```

#### Model accuracy

The label distribution in the dataset is imbalanced, so we do not use the accuracy score. Instead, we use the weighted F1 score as the metric. The weighted F1 score was over 0.91 on the test set.
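
The following sketch shows why accuracy can mislead on imbalanced labels. It reimplements the weighted F1 score in plain Python for illustration (the scripts themselves use `sklearn.metrics.f1_score` with `average="weighted"`); the label lists are made up and are not dataset results.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example: a classifier that always predicts the majority class (1).
y_true = [1] * 8 + [0, 2]
y_pred = [1] * 10

print("accuracy   :", accuracy(y_true, y_pred))
print("weighted F1:", round(weighted_f1(y_true, y_pred), 4))
```

In this toy case the always-majority classifier scores lower on weighted F1 than on accuracy, because the two minority classes contribute an F1 of zero.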


#### Training script

To train the model, run the following script:

```
python ot-xgboost-train.py \
--model ../models/ot-xgboost-20230207.pkl
```
This downloads the data (if it is not already present), trains a model on the training set, and saves the model under the `models` directory.

### Inference

The inference script can be run as:
```
python ot-xgboost-inference.py \
--model ../models/ot-xgboost-20230207.pkl \
--output ot-validation-output.jsonlines
```
This downloads the dataset (if it is not already present), runs prediction on the test set, and saves the output to a file.
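
The output file holds one JSON record per line, with the model's prediction stored in a `predictions` field (as written by the inference script). A minimal sketch for tallying predictions from such a file is shown here; the helper `count_predictions` and the sample records with a `feature_a` column are hypothetical stand-ins for real output.

```python
import json
import os
import tempfile
from collections import Counter


def count_predictions(path):
    """Tally the `predictions` field across a JSON Lines file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                counts[json.loads(line)["predictions"]] += 1
    return counts


# Made-up records standing in for real script output.
sample_rows = [
    {"feature_a": 0.1, "predictions": 1},
    {"feature_a": 0.4, "predictions": 0},
    {"feature_a": 0.9, "predictions": 1},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonlines", delete=False, encoding="utf-8"
) as tmp:
    for row in sample_rows:
        tmp.write(json.dumps(row) + "\n")

counts = count_predictions(tmp.name)
os.remove(tmp.name)  # clean up the temporary demo file
print(dict(counts))
```

For a real run, pass the path given to `--output` (for example `ot-validation-output.jsonlines`) to `count_predictions`.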

### How To Use This Model
This model can be used to detect cyber attacks and natural faults in power systems. A training notebook is also included so that users can update the model as more labelled data is collected.

### Input
The input for this model is the 127 features in the dataset, which consist of synchrophasor measurements and data logs from Snort, a simulated control panel, and relays.

### Output
The multi-class classifier predicts one of three labels: Natural Events, No Events, or Attack Events.
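
The model itself returns integer class ids. Assuming the encoding used in the inference script (`NoEvents` as 0, `Attack` as 1, `Natural` as 2), a small hypothetical helper such as `decode` can turn them back into readable labels:

```python
# Integer ids follow the encoding in the inference script; the display
# names are the label names used in this README.
ID_TO_LABEL = {0: "No Events", 1: "Attack Events", 2: "Natural Events"}


def decode(predictions):
    """Map integer class ids (plain ints, NumPy ints, or whole floats) to label names."""
    return [ID_TO_LABEL[int(p)] for p in predictions]


print(decode([0, 2, 1]))
```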

### Ethical considerations
N/A

### References
1. https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets
2. http://www.ece.uah.edu/~thm0009/icsdatasets/PowerSystem_Dataset_README.pdf
121 changes: 121 additions & 0 deletions operational-technology/inference/ot-xgboost-inference.py
@@ -0,0 +1,121 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Example Usage:
python ot-xgboost-inference.py \
--model ../models/ot-xgboost-20230207.pkl \
--output ot-validation-output.jsonlines
"""

import argparse
import glob
import os.path
import pickle
import subprocess

import numpy as np
import pandas as pd
import requests
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

import cudf


def inference(model, output):
    """Load a pickled model, predict on the test split, and write JSON lines."""

    # Download the dataset if it is not already present
    if not os.path.isfile("triple.7z"):
        URL = "http://www.ece.uah.edu/~thm0009/icsdatasets/triple.7z"
        response = requests.get(URL)
        with open("triple.7z", "wb") as archive:
            archive.write(response.content)

    # Extract the dataset (-k keeps the archive, -d decompresses)
    if not os.path.isfile("data1.csv"):
        subprocess.run(['p7zip', '-k', '-d', 'triple.7z'], stdout=subprocess.PIPE)

    # Read the data into a dataframe; reuse the merged 3class.csv if it exists
    if not os.path.isfile("3class.csv"):
        all_files = glob.glob(os.path.join("*.csv"))

        dflist = []
        for i in all_files:
            dflist.append(pd.read_csv(i))
        df = pd.concat(dflist)
        df.reset_index(drop=True, inplace=True)
    else:
        df = pd.read_csv("3class.csv")

    # Replace infinite values with nan
    df.replace([np.inf, -np.inf], np.nan, inplace=True)

    # Replace labels with numbers
    df["marker"] = df["marker"].replace("NoEvents", 0)
    df["marker"] = df["marker"].replace("Attack", 1)
    df["marker"] = df["marker"].replace("Natural", 2)

    # Replace the nan values with the median of each column
    df = df.fillna(df.median())

    # Create dataframes for input and labels (the label is the last column)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    X = cudf.from_pandas(X)
    y = cudf.from_pandas(y)

    # Create train and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Start an XGBoost classifier, then load the trained model over it
    xgb_clf = XGBClassifier()
    with open(model, "rb") as file:
        xgb_clf = pickle.load(file)

    # Use the loaded model for predictions
    y_pred = xgb_clf.predict(X_test)

    f1 = f1_score(y_test.to_numpy(), y_pred, average="weighted")

    print("F1 score is ", f1)

    # Attach the predictions to the test features and save as JSON lines
    X_test["predictions"] = y_pred
    X_test.to_json(output, orient='records', lines=True)


def main():
    # args is set at module level in the __main__ block below
    inference(args.model, args.output)
    print("Inference completed, output saved")


if __name__ == "__main__":

    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--model", required=True, help="trained model")
    parser.add_argument("--output", required=True, help="output filename")
    args = parser.parse_args()

    main()
6 changes: 6 additions & 0 deletions operational-technology/inference/requirements.txt
@@ -0,0 +1,6 @@
cudf==22.8.1
numpy==1.22.4
pandas==1.3.5
requests==2.28.1
scikit_learn==1.2.1
xgboost==1.7.3
3 changes: 3 additions & 0 deletions operational-technology/models/ot-xgboost-20230207.pkl
Git LFS file not shown