operational-technology-use-case #35

Merged: 11 commits, Feb 24, 2023
Showing changes from 10 commits
3 changes: 3 additions & 0 deletions README.md
@@ -72,6 +72,9 @@ This model shows an application of a graph neural network for anomalous authenti
## [Asset Clustering using Windows Event Logs](/asset-clustering)
This model is a clustering algorithm to assign each host present in the dataset to a cluster based on aggregated and derived features from Windows Event Logs of that particular host.

## [Industrial Control System (ICS) Cyber Attack Detection](/operational-technology)
This model is an XGBoost classifier that predicts each event on a power system based on dataset features.

# Repo Structure
Each prototype has its own directory that contains everything belonging to the specific prototype. Directories can include the following subfolders and documentation:

86 changes: 86 additions & 0 deletions operational-technology/README.md
@@ -0,0 +1,86 @@
## Industrial Control System (ICS) Cyber Attack Detection

## Use Case
Classify events into various categories based on power system data.

### Version
1.0

### Model Overview
The model is an XGBoost classifier that assigns each power system event to one of three classes (Natural Events, No Events, or Attack Events) based on the dataset features.

### Model Architecture
XGBoost Classifier

### Requirements
Requirements can be installed with
```
pip install -r requirements.txt
```
and for `p7zip`
```
apt update
apt install p7zip-full p7zip-rar
```

### Training

#### Training data
In this project, we use the publicly available [**Industrial Control System (ICS) Cyber Attack Datasets**](https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets) [1] from Oak Ridge National Laboratory (ORNL) and UAH. We use the 3-class version of the dataset, whose labels are Natural Events, No Events, and Attack Events.
The features consist of synchrophasor measurements and data logs from Snort, a simulated control panel, and relays. The dataset has 78,377 rows. Our notebooks and scripts download the compressed archive from its source, extract it, and merge all rows into a single dataframe.

#### Training parameters

The training code leaves most XGBoost parameters at their defaults, so performance could likely be improved by tuning the hyperparameters. We experimented with a random search but excluded that part from the notebook for brevity. The search was set up as follows:
```
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# xgb_clf (the classifier) and kfold (the cross-validation splitter) are defined elsewhere
params = {
    'max_depth': [2, 3, 6, 10, 20],
    'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
    'n_estimators': [500, 750, 1000, 1200],
    'min_child_weight': [1, 2, 5, 8, 10, 12, 15, 20],
    'gamma': [0.5, 0.75, 1, 1.5, 2, 5, 7, 8, 10, 12],
    'subsample': [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
    'colsample_bytree': [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
}
scorer = {'f1_score': make_scorer(f1_score, average='weighted')}
grid = RandomizedSearchCV(xgb_clf, params, cv=kfold, random_state=2, scoring=scorer, refit=False, n_iter=40)
```

The hyperparameter set below came up as the best combination; different experiments may give different results.
```
{'subsample': 0.8, 'n_estimators': 1200, 'min_child_weight': 2, 'max_depth': 20, 'learning_rate': 0.15, 'gamma': 0.5, 'colsample_bytree': 0.1}
```
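
To give a sense of how small a sample 40 iterations is, the sketch below counts the combinations in the search space and draws 40 of them at random using only the standard library. It is an illustration, not how `RandomizedSearchCV` samples internally: the helper names `grid_size` and `sample_candidates` are hypothetical, and each value is drawn independently, so repeated combinations are possible.

```python
import math
import random

# Search space from the random search described above (one list per hyperparameter).
params = {
    "max_depth": [2, 3, 6, 10, 20],
    "learning_rate": [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
    "n_estimators": [500, 750, 1000, 1200],
    "min_child_weight": [1, 2, 5, 8, 10, 12, 15, 20],
    "gamma": [0.5, 0.75, 1, 1.5, 2, 5, 7, 8, 10, 12],
    "subsample": [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
    "colsample_bytree": [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
}


def grid_size(space):
    """Number of distinct combinations in a full grid over the space."""
    return math.prod(len(values) for values in space.values())


def sample_candidates(space, n_iter, seed=2):
    """Draw n_iter candidates, picking one value per hyperparameter (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


total = grid_size(params)
candidates = sample_candidates(params, n_iter=40)
print(f"{total} combinations in the grid, {len(candidates)} sampled")
# prints "345600 combinations in the grid, 40 sampled"
```

With 345,600 possible combinations, 40 iterations cover roughly 0.01% of the grid, which is one reason different runs may settle on different "best" settings.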

#### Model accuracy

The label distribution in the dataset is imbalanced, so accuracy alone would be misleading and we do not use it. Instead, we use the weighted F1 score as the metric. The weighted F1 score was over 0.91 on a held-out test set.
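
As a worked illustration of how accuracy and weighted F1 differ on imbalanced labels, the sketch below computes both by hand for a classifier that always predicts the majority class. The label counts are made up for the example and are not the dataset's real distribution; the function names are hypothetical.

```python
from collections import Counter


def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(
        f1_for_label(y_true, y_pred, label) * count / n
        for label, count in support.items()
    )


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up, imbalanced labels: 0 = NoEvents, 1 = Attack, 2 = Natural
y_true = [1] * 70 + [2] * 25 + [0] * 5
y_pred = [1] * 100  # a degenerate model that always predicts the majority class

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} weighted_f1={wf1:.3f}")
# prints "accuracy=0.700 weighted_f1=0.576"
```

The degenerate model scores 0.70 on accuracy but only about 0.58 on weighted F1, because the two minority classes each contribute an F1 of zero.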


#### Training script

To train the model, run the following script:

```
python ot-xgboost-train.py \
--model ../models/ot-xgboost-20230207.pkl
```
This downloads the data if it is not already present, trains a model on the training split, and saves the model under the `models` directory.

### Inference

Run the inference script as follows:
```
python ot-xgboost-inference.py \
--model ../models/ot-xgboost-20230207.pkl \
--output ot-validation-output.jsonlines
```
This downloads the dataset if it is not already present, runs prediction on the test set, and saves the output to the given file.
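
The output file can then be inspected with the standard library. The sketch below assumes each line is a JSON object with a `predictions` field holding the numeric class (0 = NoEvents, 1 = Attack, 2 = Natural, matching the encoding in the inference script); the helper name `count_predictions` is hypothetical. It writes a tiny demo file so that it runs on its own.

```python
import json
import os
import tempfile
from collections import Counter

# Numeric encoding used by the inference script for the "marker" label.
LABELS = {0: "NoEvents", 1: "Attack", 2: "Natural"}


def count_predictions(path):
    """Count predicted classes in a JSON Lines file (hypothetical helper)."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            counts[LABELS.get(int(record["predictions"]), "unknown")] += 1
    return counts


# Demo data standing in for real output; real records also carry the feature columns.
demo_path = os.path.join(tempfile.mkdtemp(), "demo-output.jsonlines")
with open(demo_path, "w", encoding="utf-8") as handle:
    for pred in [1, 1, 0, 2, 1]:
        handle.write(json.dumps({"predictions": pred}) + "\n")

counts = count_predictions(demo_path)
print(dict(counts))
```

To summarize a real run, pass the path given to `--output` (for example `ot-validation-output.jsonlines`) instead of `demo_path`.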


### Ethical considerations
N/A

### References
1. https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets
2. http://www.ece.uah.edu/~thm0009/icsdatasets/PowerSystem_Dataset_README.pdf
121 changes: 121 additions & 0 deletions operational-technology/inference/ot-xgboost-inference.py
@@ -0,0 +1,121 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Example Usage:
python ot-xgboost-inference.py \
--model ../models/ot-xgboost-20230207.pkl \
--output ot-validation-output.jsonlines
"""

import argparse
import glob
import os.path
import pickle
import subprocess

import numpy as np
import pandas as pd
import requests
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

import cudf


def inference(model, output):

    # Download the dataset if it is not already present

    if not os.path.isfile("triple.7z"):

        URL = "http://www.ece.uah.edu/~thm0009/icsdatasets/triple.7z"
        response = requests.get(URL)
        # Fail early instead of writing an error page to disk
        response.raise_for_status()
        with open("triple.7z", "wb") as archive:
            archive.write(response.content)

    # Extract the dataset

    if not os.path.isfile("data1.csv"):

        subprocess.run(['p7zip', '-k', '-d', 'triple.7z'], stdout=subprocess.PIPE)

    # Read the data into a dataframe, reusing a merged copy (3class.csv) if one exists

    if not os.path.isfile("3class.csv"):
        all_files = glob.glob(os.path.join("*.csv"))

        dflist = []
        for i in all_files:
            dflist.append(pd.read_csv(i))
        df = pd.concat(dflist)
        df.reset_index(drop=True, inplace=True)

    else:
        df = pd.read_csv("3class.csv")

    # Replace infinite values with nan

    df.replace([np.inf, -np.inf], np.nan, inplace=True)

    # Replace labels with numbers
    df["marker"] = df["marker"].replace("NoEvents", 0)
    df["marker"] = df["marker"].replace("Attack", 1)
    df["marker"] = df["marker"].replace("Natural", 2)

    # Replace the nan values with the median of each column.

    df = df.fillna(df.median())

    # Create dataframes for input and labels (the label is the last column).

    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    X = cudf.from_pandas(X)
    y = cudf.from_pandas(y)

    # Create train and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Start an XGBoost classifier and load the trained model from disk

    xgb_clf = XGBClassifier()
    with open(model, "rb") as file:
        xgb_clf = pickle.load(file)

    # Use the loaded model for predictions

    y_pred = xgb_clf.predict(X_test)

    f1 = f1_score(y_test.to_numpy(), y_pred, average="weighted")

    print("F1 score is ", f1)
    X_test["predictions"] = y_pred
    X_test.to_json(output, orient='records', lines=True)


def main():

    # args is the module-level namespace parsed in the __main__ block below
    inference(args.model, args.output)
    print("Inference completed, output saved")


if __name__ == "__main__":

    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--model", required=True, help="trained model")
    parser.add_argument("--output", required=True, help="output filename")
    args = parser.parse_args()

    main()
6 changes: 6 additions & 0 deletions operational-technology/inference/requirements.txt
@@ -0,0 +1,6 @@
cudf==22.8.1
numpy==1.22.4
pandas==1.3.5
requests==2.28.1
scikit_learn==1.2.1
xgboost==1.7.3
3 changes: 3 additions & 0 deletions operational-technology/models/ot-xgboost-20230207.pkl
Git LFS file not shown