operational-technology-use-case (#35)

Operational technology attack/fault detection notebook, scripts, and an example model. Closes issue #36.

Authors:
- https://github.com/gbatmaz

Approvers:
- Tad ZeMicheal (https://github.com/tzemicheal)
- https://github.com/raykallen

URL: #35

## Industrial Control System (ICS) Cyber Attack Detection

## Use Case
Classify power system events as attacks, natural events, or no events based on power system measurement data.

### Version
1.0

### Model Overview
The model is a multi-class XGBoost classifier that categorizes each power system event based on the dataset features.

### Model Architecture
XGBoost Classifier

### Requirements
Requirements can be installed with
```
pip install -r requirements.txt
```
and, for `p7zip`:
```
apt update
apt install p7zip-full p7zip-rar
```

### Training

#### Training data
In this project, we use the publicly available [**Industrial Control System (ICS) Cyber Attack Datasets**](https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets) [1] from Oak Ridge National Laboratory (ORNL) and UAH. We use the 3-class version of the dataset, whose labels are Natural Events, No Events, and Attack Events. All features contain numeric values, and the dataset has no timestamp or interval information.
Dataset features contain synchrophasor measurements and data logs from Snort, a simulated control panel, and relays. There are 78,377 rows in the dataset. In our notebooks and scripts, we download the compressed version from its source, then extract and merge all the rows into a dataframe. The `inf` values are replaced with `nan`, and the three labels are replaced with 0, 1, and 2.
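A minimal pandas sketch of this preprocessing, assuming the extracted CSV files sit in the working directory (the column and label names follow the inference script below):
```
import glob

import numpy as np
import pandas as pd

# Merge all extracted CSV files into a single dataframe
df = pd.concat((pd.read_csv(f) for f in glob.glob("*.csv")), ignore_index=True)

# Replace infinite values with nan
df.replace([np.inf, -np.inf], np.nan, inplace=True)

# Map the three labels to 0, 1, and 2
df["marker"] = df["marker"].replace({"NoEvents": 0, "Attack": 1, "Natural": 2})
```
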
#### Training parameters

Most of the default XGBoost parameters are used in the training code. The performance could be improved by finding better hyperparameters. We experimented with a random search but excluded that part from the notebook for brevity, e.g.:
```
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# xgb_clf (an XGBClassifier) and kfold (a CV splitter) are defined earlier in the notebook
params = {'max_depth': [2, 3, 6, 10, 20],
          'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
          'n_estimators': [500, 750, 1000, 1200],
          'min_child_weight': [1, 2, 5, 8, 10, 12, 15, 20],
          'gamma': [0.5, 0.75, 1, 1.5, 2, 5, 7, 8, 10, 12],
          'subsample': [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
          'colsample_bytree': [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
          }
scorer = {'f1_score': make_scorer(f1_score, average='weighted')}
grid = RandomizedSearchCV(xgb_clf, params, cv=kfold, random_state=2, scoring=scorer, refit=False, n_iter=40)
```
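Since `refit=False` is used with a scoring dict, the best combination is read from `cv_results_` rather than `best_params_`; a sketch, assuming the usual `X_train`/`y_train` split:
```
import pandas as pd

grid.fit(X_train, y_train)

# With refit=False, best_params_ is unavailable, so rank the tried
# combinations by the weighted F1 metric instead
results = pd.DataFrame(grid.cv_results_)
print(results.sort_values("rank_test_f1_score")[["params", "mean_test_f1_score"]].head())
```
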
The hyperparameter set below came up as the best combination; different experiments may give different results.
```
{'subsample': 0.8, 'n_estimators': 1200, 'min_child_weight': 2, 'max_depth': 20, 'learning_rate': 0.15, 'gamma': 0.5, 'colsample_bytree': 0.1}
```
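A sketch of training with this combination, assuming the same train split as in the scripts:
```
from xgboost import XGBClassifier

# Instantiate the classifier with the best combination found above
xgb_clf = XGBClassifier(subsample=0.8, n_estimators=1200, min_child_weight=2,
                        max_depth=20, learning_rate=0.15, gamma=0.5,
                        colsample_bytree=0.1)
xgb_clf.fit(X_train, y_train)
```
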
#### Model accuracy

The label distribution in the dataset is imbalanced, so we do not use the accuracy score. Instead, we use the weighted F1 score as the metric. The F1 score was over 0.91 on a test set.
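The metric is computed with scikit-learn, for example:
```
from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred, average="weighted")
```
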
#### Training script

To train the model, run the following script:

```
python ot-xgboost-train.py \
    --model ../models/ot-xgboost-20230207.pkl
```
This downloads the data (if it is not already present), trains a model on the training split, and saves the model under the `models` directory.
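The training script itself is not included in this excerpt; a hedged sketch of its save step, mirroring the pickle format that the inference script loads:
```
# Hypothetical save step; the actual ot-xgboost-train.py is not shown here
import pickle

with open("../models/ot-xgboost-20230207.pkl", "wb") as f:
    pickle.dump(xgb_clf, f)
```
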
### Inference

The inference script can be run as:
```
python ot-xgboost-inference.py \
    --model ../models/ot-xgboost-20230207.pkl \
    --output ot-validation-output.jsonlines
```
This downloads the dataset, runs predictions on the test set, and saves the output to a file.
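The output file can then be loaded for analysis, for example:
```
import pandas as pd

# Each line is one test-set record: the input features plus a "predictions" column
results = pd.read_json("ot-validation-output.jsonlines", orient="records", lines=True)
print(results["predictions"].value_counts())
```
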
### How To Use This Model
This model can be used to detect cyber attacks and natural faults in power systems. A training notebook is also included so that users can update the model as more labelled data is collected.

### Input
The input to this model is the 127 features in the dataset, which consist of synchrophasor measurements and data logs from Snort, a simulated control panel, and relays.

### Output
The multi-class classifier predicts one of three labels: Natural Events, No Events, or Attack Events.
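A small sketch mapping the numeric predictions back to label names, reusing `results` from the inference output sketch above and following the integer encoding used in preprocessing (0 = No Events, 1 = Attack, 2 = Natural):
```
# Map integer predictions back to human-readable event types
label_names = {0: "No Events", 1: "Attack Events", 2: "Natural Events"}
results["event_type"] = results["predictions"].map(label_names)
```
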
### Ethical considerations
N/A

### References
1. https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets
2. http://www.ece.uah.edu/~thm0009/icsdatasets/PowerSystem_Dataset_README.pdf

## operational-technology/inference/ot-xgboost-inference.py

# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Example Usage:
python ot-xgboost-inference.py \
    --model ../models/ot-xgboost-20230207.pkl \
    --output ot-validation-output.jsonlines
"""

import argparse
import glob
import os.path
import pickle
import subprocess

import numpy as np
import pandas as pd
import requests
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

import cudf


def inference(model, output):

    # Download the dataset
    if not os.path.isfile("triple.7z"):
        URL = "http://www.ece.uah.edu/~thm0009/icsdatasets/triple.7z"
        response = requests.get(URL)
        with open("triple.7z", "wb") as archive:
            archive.write(response.content)

    # Unzip the dataset
    if not os.path.isfile("data1.csv"):
        subprocess.run(['p7zip', '-k', '-d', 'triple.7z'], stdout=subprocess.PIPE)

    # Read the data into a dataframe and save a copy of the merged dataframe
    if not os.path.isfile("3class.csv"):
        all_files = glob.glob(os.path.join("*.csv"))
        dflist = []
        for i in all_files:
            dflist.append(pd.read_csv(i))
        df = pd.concat(dflist)
        df.reset_index(drop=True, inplace=True)
        df.to_csv("3class.csv", index=False)
    else:
        df = pd.read_csv("3class.csv")

    # Replace infinite values with nan
    df.replace([np.inf, -np.inf], np.nan, inplace=True)

    # Replace labels with numbers
    df["marker"] = df["marker"].replace("NoEvents", 0)
    df["marker"] = df["marker"].replace("Attack", 1)
    df["marker"] = df["marker"].replace("Natural", 2)

    # Replace the nan values with the median of each column
    df = df.fillna(df.median())

    # Create dataframes for input and labels
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    X = cudf.from_pandas(X)
    y = cudf.from_pandas(y)

    # Create train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Load the trained XGBoost classifier
    with open(model, "rb") as file:
        xgb_clf = pickle.load(file)

    # Use the loaded model for predictions
    y_pred = xgb_clf.predict(X_test)

    f1 = f1_score(y_test.to_numpy(), y_pred, average="weighted")

    print("F1 score is ", f1)
    X_test["predictions"] = y_pred
    X_test.to_json(output, orient='records', lines=True)


def main():
    inference(args.model, args.output)
    print("Inference completed, output saved")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--model", required=True, help="trained model")
    parser.add_argument("--output", required=True, help="output filename")
    args = parser.parse_args()

    main()

## requirements.txt

cudf==22.8.1
numpy==1.22.4
pandas==1.3.5
requests==2.28.1
scikit_learn==1.2.1
xgboost==1.7.3