Code that won the [Kaggle Microsoft Malware Classification Competition](https://www.kaggle.com/c/malware-classification). Much of the credit goes to teammate daxiongshu for organizing everything!
Please see the PDF for our methods and for instructions on running the code. The code relies heavily on [XGBoost](https://github.com/dmlc/xgboost).
This is a fork of the original winning code ported to Python 3 with updated dependencies. This fork requires Python 3.6.1 to run properly.
- Clone the repository
- (Optional, highly recommended) Create a virtual environment
- Install the required packages: `python -m pip install -r requirements.txt`
- Install pypy for Python 3
- Set up your PATH variable so that `pypy` points to the executable that runs pypy (so that pypy may be run as `pypy [arguments]` without specifying the full path to pypy)

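
To sanity-check the PATH setup, a minimal sketch like the one below can confirm that `pypy` resolves to an executable. The helper name `find_pypy` is hypothetical and not part of this repository; it only wraps the standard-library `shutil.which`.

```python
import shutil
import subprocess


def find_pypy(name="pypy"):
    """Return the full path that `name` resolves to on PATH, or None."""
    # find_pypy is an illustrative helper, not something this repository ships.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_pypy()
    if path is None:
        print("pypy was not found on PATH")
    else:
        print("pypy resolves to", path)
        # Ask the interpreter to report its version as a final check.
        subprocess.run([path, "--version"], check=False)
```

If the script reports that `pypy` was not found, the directory containing the pypy executable is probably missing from PATH.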
To train a model and perform predictions, see the PDF.
If you performed a custom split of the dataset into a train and a test set and you want to assess the prediction performance, run `prediction_performance.py [path to predictions] [path to true labels]`, where `[path to predictions]` is a path to the CSV containing predictions generated by one of the models, and `[path to true labels]` is a path to the CSV containing test labels (having the same structure as the CSV for the train labels on the Kaggle site).
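
For orientation, here is a rough sketch of how such an assessment could be computed by hand. It is not the code in `prediction_performance.py`, which may report different or additional metrics. It assumes the predictions CSV has an ID column followed by one probability column per class, the labels CSV has an ID column followed by a class number starting at 1, and multiclass log loss is the metric of interest. All function names are illustrative.

```python
import csv
import math


def read_predictions(path):
    """Map each sample ID to its list of per-class probabilities."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return {row[0]: [float(x) for x in row[1:]] for row in reader}


def read_labels(path):
    """Map each sample ID to its true class (assumed to be numbered from 1)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return {row[0]: int(row[1]) for row in reader}


def multiclass_log_loss(preds, labels, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for sample_id, true_class in labels.items():
        probs = preds[sample_id]
        # Clip to avoid log(0) when a model is confidently wrong or right.
        p = min(max(probs[true_class - 1], eps), 1 - eps)
        total -= math.log(p)
    return total / len(labels)
```

A lower value is better. This sketch does not renormalize each row of probabilities, so it assumes every row already sums to 1.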