operational-technology-use-case (#35)

Operational technology attack/fault detection notebook, scripts and an example model. Closes issue #36.

Authors:
- https://github.com/gbatmaz

Approvers:
- Tad ZeMicheal (https://github.com/tzemicheal)
- https://github.com/raykallen

URL: #35
nv-morpheus · Feb 24, 2023 · 88253d9
1 parent cb91cf4
commit 88253d9
Showing 8 changed files with 1,329 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -72,6 +72,9 @@ This model shows an application of a graph neural network for anomalous authenti
 ## [Asset Clustering using Windows Event Logs](/asset-clustering)
 This model is a clustering algorithm to assign each host present in the dataset to a cluster based on aggregated and derived features from Windows Event Logs of that particular host.
 
+## [Industrial Control System (ICS) Cyber Attack Detection](/operational-technology)
+This model is an XGBoost classifier that classifies each event on a power system as a natural event, no event, or an attack, based on dataset features.
+
 # Repo Structure
 Each prototype has its own directory that contains everything belonging to the specific prototype. Directories can include the following subfolders and documentation:
 

diff --git a/operational-technology/README.md b/operational-technology/README.md
@@ -0,0 +1,94 @@
+## Industrial Control System (ICS) Cyber Attack Detection
+
+## Use Case
+Classify events into various categories based on power system data.
+
+### Version
+1.0
+
+### Model Overview
+The model is a multi-class XGBoost classifier that assigns each event on a power system to one of three classes based on dataset features.
+
+### Model Architecture
+XGBoost Classifier
+
+### Requirements
+The Python requirements can be installed with
+```
+pip install -r requirements.txt
+```
+and `p7zip`, which the scripts use to extract the dataset archive, can be installed with
+```
+apt update
+apt install p7zip-full p7zip-rar
+```
+
+### Training
+
+#### Training data
+In this project, we use the publicly available [**Industrial Control System (ICS) Cyber Attack Datasets**](https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets) [1] from Oak Ridge National Laboratory (ORNL) and the University of Alabama in Huntsville (UAH). We use the 3-class version of the dataset, whose labels are Natural Events, No Events and Attack Events. All features contain numeric values, and the dataset has no timestamp or interval information.
+Dataset features contain synchrophasor measurements and data logs from Snort, a simulated control panel, and relays. There are 78377 rows in the dataset. In our notebooks and scripts, we download the compressed version from its source and then extract and merge all the rows into a dataframe. The `inf` values are replaced with `nan`, and the three labels are replaced with 0 (No Events), 1 (Attack Events) and 2 (Natural Events).
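The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the label mapping is the one used in the inference script in this change, while the `preprocess` helper, the `feature_a` column, and the toy values are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Label mapping as used in the inference script of this change.
LABEL_MAP = {"NoEvents": 0, "Attack": 1, "Natural": 2}


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Replace infinities with NaN and encode the string labels as integers."""
    # DataFrame.replace returns a new frame, so the input is left untouched.
    out = df.replace([np.inf, -np.inf], np.nan)
    out["marker"] = out["marker"].map(LABEL_MAP)
    return out


# Toy frame with made-up values; "feature_a" is a hypothetical column name.
toy = pd.DataFrame({
    "feature_a": [70.4, np.inf, -np.inf],
    "marker": ["NoEvents", "Attack", "Natural"],
})
clean = preprocess(toy)
print(clean)
```

The scripts additionally fill the resulting `nan` values with each column's median before training and inference.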
+
+#### Training parameters
+
+The training code uses mostly default XGBoost parameters. Performance could likely be improved by tuning the hyperparameters. We experimented with a random search but excluded that part from the notebook for brevity. The search was set up as follows:
+```
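+from sklearn.metrics import f1_score, make_scorer
+from sklearn.model_selection import RandomizedSearchCV
+
+# xgb_clf (an XGBClassifier) and kfold (a CV splitter) are assumed to be defined earlier.
+# 'colsample_bytree' was listed twice; only the later list took effect, so it is kept here.
+params = {
+    'max_depth': [2, 3, 6, 10, 20],
+    'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
+    'n_estimators': [500, 750, 1000, 1200],
+    'min_child_weight': [1, 2, 5, 8, 10, 12, 15, 20],
+    'gamma': [0.5, 0.75, 1, 1.5, 2, 5, 7, 8, 10, 12],
+    'subsample': [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
+    'colsample_bytree': [0.05, 0.1, 0.3, 0.6, 0.8, 1.0],
+}
+scorer = {'f1_score': make_scorer(f1_score, average='weighted')}
+grid = RandomizedSearchCV(xgb_clf, params, cv=kfold, random_state=2, scoring=scorer, refit=False, n_iter=40)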
+```
+
+The hyperparameter set below came up as the best combination; different experiments may give different results.
+```
+{'subsample': 0.8, 'n_estimators': 1200, 'min_child_weight': 2, 'max_depth': 20, 'learning_rate': 0.15, 'gamma': 0.5, 'colsample_bytree': 0.1}
+```
+
+#### Model accuracy
+
+The label distribution in the dataset is imbalanced, so we do not use the accuracy score. Instead, we use the weighted F1 score as the metric. The weighted F1 score was over 0.91 on the test set.
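To illustrate why the weighted F1 score is preferred over accuracy on imbalanced labels, here is a small pure-Python sketch. It computes F1 per class and averages with weights proportional to each class's support, which is what `average='weighted'` means in scikit-learn's `f1_score`. The `weighted_f1` helper and the label lists are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # treat an undefined F1 as 0
        total += f1 * count
    return total / len(y_true)


# Made-up labels in which class 1 dominates and class 0 is never predicted correctly.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = weighted_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} weighted_f1={score:.3f}")
```

In this toy case the accuracy is higher than the weighted F1 score because the completely missed minority class contributes an F1 of zero.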
+
+
+#### Training script
+
+To train the model, run the following script:
+
+```
+python ot-xgboost-train.py \
+    --model ../models/ot-xgboost-20230207.pkl
+```
+This downloads the data if it is not already present, trains a model on the training set, and saves the model under the `models` directory.
+
+### Inference
+
+The inference script can be run as:
+```
+python ot-xgboost-inference.py \
+    --model ../models/ot-xgboost-20230207.pkl \
+    --output ot-validation-output.jsonlines
+```
+This downloads the dataset if it is not already present, runs prediction on the test set, and saves the output to the given file.
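The inference script writes one JSON record per line, with the feature columns plus a `predictions` field. A minimal sketch of reading such a file back and counting the predicted labels is shown below. The `count_predictions` helper, the sample file name, and the `feature_a` field are hypothetical; the label encoding follows the inference script.

```python
import json
from collections import Counter

# Label encoding used by the inference script.
LABEL_NAMES = {0: "NoEvents", 1: "Attack", 2: "Natural"}


def count_predictions(path):
    """Count predicted labels in a JSON Lines file with a "predictions" field."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                counts[LABEL_NAMES[int(record["predictions"])]] += 1
    return counts


# Write a tiny stand-in file; the feature name and values are made up.
sample_path = "ot-sample-output.jsonlines"
with open(sample_path, "w") as handle:
    for pred in (0, 1, 1, 2):
        handle.write(json.dumps({"feature_a": 1.0, "predictions": pred}) + "\n")

counts = count_predictions(sample_path)
print(dict(counts))
```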
+
+### How To Use This Model
+This model can be used to detect cyber attacks and natural faults in power systems. A training notebook is also included so that users can update the model as more labelled data is collected. 
+
+### Input
+The input for this model is the 127 features in the dataset, which consist of synchrophasor measurements and data logs from Snort, a simulated control panel, and relays.
+
+### Output
+The multi-class classifier predicts one of three labels: Natural Events, No Events, or Attack Events.
+
+### Ethical considerations
+N/A
+
+### References
+1. https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets
+2. http://www.ece.uah.edu/~thm0009/icsdatasets/PowerSystem_Dataset_README.pdf
diff --git a/operational-technology/inference/ot-xgboost-inference.py b/operational-technology/inference/ot-xgboost-inference.py
@@ -0,0 +1,121 @@
+# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Example Usage:
+python ot-xgboost-inference.py \
+    --model ../models/ot-xgboost-20230207.pkl \
+    --output ot-validation-output.jsonlines
+"""
+
+import argparse
+import glob
+import os.path
+import pickle
+import subprocess
+
+import numpy as np
+import pandas as pd
+import requests
+from sklearn.metrics import f1_score
+from sklearn.model_selection import train_test_split
+from xgboost import XGBClassifier
+
+import cudf
+
+
+def inference(model, output):
+
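+    # Download the dataset if it is not already present
+
+    if not os.path.isfile("triple.7z"):
+
+        URL = "http://www.ece.uah.edu/~thm0009/icsdatasets/triple.7z"
+        response = requests.get(URL)
+        # Fail early on an HTTP error instead of writing an error page to disk
+        response.raise_for_status()
+        with open("triple.7z", "wb") as archive:
+            archive.write(response.content)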
+
+    # Unzip the dataset
+
+    if not os.path.isfile("data1.csv"):
+
+        subprocess.run(['p7zip', '-k', '-d', 'triple.7z'], stdout=subprocess.PIPE)
+
+    # Read the data into a dataframe and save a copy of the merged dataframe
+
+    if not os.path.isfile("3class.csv"):
+        all_files = glob.glob("*.csv")
+
+        dflist = []
+        for i in all_files:
+            dflist.append(pd.read_csv(i))
+        df = pd.concat(dflist)
+        df.reset_index(drop=True, inplace=True)
+        # Save the merged copy so later runs can load it directly
+        df.to_csv("3class.csv", index=False)
+
+    else:
+        df = pd.read_csv("3class.csv")
+
+    # Replace infinite values with nan
+
+    df.replace([np.inf, -np.inf], np.nan, inplace=True)
+
+    # Replace labels with numbers
+    df["marker"] = df["marker"].replace({"NoEvents": 0, "Attack": 1, "Natural": 2})
+
+    # Replace the nan values with the median of each column.
+
+    df = df.fillna(df.median())
+
+    # Create dataframes for input and labels.
+
+    X = df.iloc[:, :-1]
+    y = df.iloc[:, -1]
+    X = cudf.from_pandas(X)
+    y = cudf.from_pandas(y)
+
+    # Create train and test set
+    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
+
+    # Load the trained XGBoost classifier from the pickle file
+
+    with open(model, "rb") as file:
+        xgb_clf = pickle.load(file)
+
+    # Use the loaded model for predictions
+
+    y_pred = xgb_clf.predict(X_test)
+
+    f1 = f1_score(y_test.to_numpy(), y_pred, average="weighted")
+
+    print("F1 score is", f1)
+    X_test["predictions"] = y_pred
+    X_test.to_json(output, orient='records', lines=True)
+
+
+def main():
+    # Parse the command-line arguments here so main() does not rely on a global
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--model", required=True, help="trained model")
+    parser.add_argument("--output", required=True, help="output filename")
+    args = parser.parse_args()
+
+    inference(args.model, args.output)
+    print("Inference completed, output saved")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/operational-technology/inference/requirements.txt b/operational-technology/inference/requirements.txt
@@ -0,0 +1,6 @@
+cudf==22.8.1
+numpy==1.22.4
+pandas==1.3.5
+requests==2.28.1
+scikit_learn==1.2.1
+xgboost==1.7.3
diff --git a/operational-technology/models/ot-xgboost-20230207.pkl b/operational-technology/models/ot-xgboost-20230207.pkl