Microsoft-Malware-Detection

A machine learning project for learning and understanding different machine learning concepts. We experimented with different categorical encodings and machine learning algorithms, and benchmarked them on a balanced dataset.

View the Project on GitHub jatin7gupta/Microsoft-Malware-Detection

Kaggle Competition: Microsoft-Malware-Detection

Requirements

* Do not use pandas 1.1+ because it is incompatible with category-encoders 2.2.2’s CatBoost encoder.
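A quick way to check this requirement before running the scripts is sketched below. The helper name pandas_version_supported is hypothetical and not part of the project; it only compares the major and minor version numbers against the 1.1 limit stated above.

```python
def pandas_version_supported(version: str) -> bool:
    """Return True when the version string is older than 1.1."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (1, 1)


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        installed = version("pandas")
    except PackageNotFoundError:
        print("pandas is not installed")
    else:
        if pandas_version_supported(installed):
            print(f"pandas {installed}: ok")
        else:
            print(f"pandas {installed}: too new for category-encoders 2.2.2")
```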

Usage Instructions

Feature Selection

Run the feature_selection.py script to perform feature selection. The result is printed to stdout. While running, the script reports each removed feature together with the reason for its removal. At the end of the output, it lists all of the selected features.

Parameters for the script are:

  1. path/to/dataset, the path to the dataset file
  2. encoding, encoding method to be used to encode categorical variables. Available encoding methods are target, js, catboost, and freq*.

* See the program arguments glossary below.

Usage examples:

python3 feature_selection.py data/randombalancedsample10000_train.csv freq
python3 feature_selection.py data/randombalancedsample100000_train.csv catboost > catboost_100k_features.txt
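To give a feel for what an encoding does to a categorical column, here is a minimal sketch of frequency encoding in plain Python. It assumes that the freq option means "replace each category with its relative frequency"; the function name and the sample values are illustrative only, and the project's own encoder may differ in its details.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


# Illustrative column: "win10" appears 3 times out of 5, the others once each.
column = ["win10", "win7", "win10", "win8", "win10"]
print(frequency_encode(column))  # [0.6, 0.2, 0.6, 0.2, 0.6]
```

The target, js (James-Stein), and catboost encoders instead derive each category's value from the label, which is why they need the target column and frequency encoding does not.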

Model Benchmarking

Run the benchmark.py script to benchmark a model. The result is printed to stdout. It shows the model parameters selected by GridSearchCV and the highest score (ROC AUC) from the grid search, followed by the ROC AUC score obtained by applying the model to the test set. If the model's algorithm is tree-based (cart, rf, adaboost, lgbm), it also shows a list of features sorted by importance.

Parameters for the script are:

  1. path/to/dataset, the path to the dataset file to train the model
  2. algorithm, the algorithm to train the model. Available algorithms are adaboost, bagging, cart, knn, lgbm, logistic, rf, and svm*.
  3. encoding, encoding method to be used to encode categorical variables. Available encoding methods are target, js, catboost, and freq*.

* See the program arguments glossary below.
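The three positional arguments can be validated as in the sketch below. This is only an illustration of the documented interface using argparse; the actual script may parse its arguments differently, and build_parser is a hypothetical name.

```python
import argparse

# Values taken from the parameter list above.
ALGORITHMS = ["adaboost", "bagging", "cart", "knn", "lgbm", "logistic", "rf", "svm"]
ENCODINGS = ["target", "js", "catboost", "freq"]


def build_parser():
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument("dataset", help="path to the dataset file used for training")
    parser.add_argument("algorithm", choices=ALGORITHMS)
    parser.add_argument("encoding", choices=ENCODINGS)
    return parser


args = build_parser().parse_args(
    ["data/randombalancedsample10000_train.csv", "knn", "freq"]
)
print(args.algorithm, args.encoding)  # knn freq
```

With choices set, an unsupported algorithm or encoding name makes argparse exit with a usage message instead of failing later during training.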

Usage examples:

python3 benchmark.py data/randombalancedsample10000_train.csv knn freq
python3 benchmark.py data/randombalancedsample100000_train.csv lgbm catboost > lgbm_catboost_100k.txt
python3 benchmark.py data/randombalancedsample10000_train.csv svm js > svm_js_10k.txt
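For readers unfamiliar with the reported metric, the following sketch computes ROC AUC directly from its definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. It is a plain-Python illustration and not the code used by benchmark.py; the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise form is quadratic in the number of examples, so it suits small checks rather than full-size samples.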

Sampling

The sampling procedure uses 2 scripts, sampler/splitter.py and sampler/random_balanced_sampler.py.

The splitter.py script splits the original dataset into 2 files: one containing only positive examples and one containing only negative examples.

Parameters:

  1. path/to/input/dataset
  2. path/to/output/negative/example/file
  3. path/to/output/positive/example/file

Example:

python3 splitter.py data/train.csv data/train0.csv data/train1.csv
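The splitting step can be sketched as follows. This is not the project's splitter.py; it assumes that the label column is named HasDetections with values 0 and 1, and that each output file repeats the header row. The function name split_by_label is hypothetical.

```python
import csv


def split_by_label(input_path, negative_path, positive_path,
                   label_column="HasDetections"):
    """Stream the input once, routing each row to the file for its label."""
    with open(input_path, newline="") as src, \
         open(negative_path, "w", newline="") as neg, \
         open(positive_path, "w", newline="") as pos:
        reader = csv.DictReader(src)
        writers = {
            "0": csv.DictWriter(neg, fieldnames=reader.fieldnames),
            "1": csv.DictWriter(pos, fieldnames=reader.fieldnames),
        }
        for writer in writers.values():
            writer.writeheader()  # assumption: outputs keep the header row
        counts = {"0": 0, "1": 0}
        for row in reader:
            label = row[label_column]
            writers[label].writerow(row)
            counts[label] += 1
    # Returns (negative count, positive count).
    return counts["0"], counts["1"]
```

Streaming row by row keeps memory use flat, which matters for a training file with millions of rows. The returned counts are the two numbers that the sampling script later asks for.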

The random_balanced_sampler.py script takes the examples in the 2 split files and randomly combines them into a single balanced sample file. For example, if the specified sample size is 2S, it takes S random positive examples and S random negative examples and combines them into a single file containing 2S examples.

Note that the script needs the number of examples in each input file for the random sampling to work. We provide a script, sampler/counter.py, that can help in getting these numbers.
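If counter.py is not at hand, the counts can be obtained with a few lines such as the sketch below. It is not the project's counter.py; the function name and the has_header flag are assumptions, and it counts CSV records rather than raw lines so that quoted line breaks are not counted twice.

```python
import csv


def count_examples(path, has_header=True):
    """Count the data rows in a CSV file, optionally skipping a header row."""
    with open(path, newline="") as f:
        total = sum(1 for _ in csv.reader(f))
    return max(total - 1, 0) if has_header else total
```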

Parameters:

  1. path/to/input/negative/example/file
  2. path/to/input/positive/example/file
  3. path/to/output/sample/file
  4. number_of_examples_in_negative_file
  5. number_of_examples_in_positive_file
  6. output_sample_size

Example:

python3 random_balanced_sampler.py train0.csv train1.csv randombalancedsample10000_train.csv 4462591 4458892 10000
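The reason the row counts are needed can be seen in the sketch below: knowing the total in advance lets the sampler pick random row indices first and then read each file only once. This is one possible approach under stated assumptions, not the project's implementation. It assumes both input files start with a header row, and the names pick_rows, balanced_sample, and the seed parameter are hypothetical.

```python
import csv
import random


def pick_rows(path, total_rows, sample_rows, rng):
    """Stream `path` once and return `sample_rows` randomly chosen data rows."""
    wanted = set(rng.sample(range(total_rows), sample_rows))
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumption: the first row is a header
        rows = [row for i, row in enumerate(reader) if i in wanted]
    return header, rows


def balanced_sample(neg_path, pos_path, out_path, n_neg, n_pos,
                    sample_size, seed=None):
    """Write sample_size // 2 negative and sample_size // 2 positive rows."""
    rng = random.Random(seed)
    half = sample_size // 2
    header, neg_rows = pick_rows(neg_path, n_neg, half, rng)
    _, pos_rows = pick_rows(pos_path, n_pos, half, rng)
    combined = neg_rows + pos_rows
    rng.shuffle(combined)  # avoid writing all negatives before all positives
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(combined)
    return len(combined)
```

If a count passed in is larger than the real number of rows, some chosen indices will never be reached and the output will contain fewer examples than requested, so accurate counts matter.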

Helper Modules

The feature selection and model benchmarking scripts require these modules:

Result Files

Feature selection result files are stored in the feature_selection_results directory. The file name format is [encoding]_[sample_size]_features.txt. For example, the result file of the feature selection process using the CatBoost encoder and a sample of 100,000 examples is catboost_100k_features.txt.

Model benchmark result files are stored in the benchmark_results directory. The file name format is [algorithm]_[encoding]_[sample_size].txt. For example, the benchmark result file of a model trained by the random forest algorithm using the James-Stein encoder and a sample of 10,000 examples is rf_js_10k.txt.
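The two naming formats can be reproduced with small helpers like the ones sketched here. The function names are hypothetical, and the handling of sample sizes that are not a multiple of 1,000 is an assumption, since the examples above only cover 10k and 100k.

```python
def sample_size_tag(sample_size: int) -> str:
    """Abbreviate whole thousands, e.g. 10000 -> '10k', 100000 -> '100k'."""
    if sample_size >= 1000 and sample_size % 1000 == 0:
        return f"{sample_size // 1000}k"
    return str(sample_size)  # assumption: other sizes are written in full


def feature_selection_result_name(encoding: str, sample_size: int) -> str:
    return f"{encoding}_{sample_size_tag(sample_size)}_features.txt"


def benchmark_result_name(algorithm: str, encoding: str, sample_size: int) -> str:
    return f"{algorithm}_{encoding}_{sample_size_tag(sample_size)}.txt"


print(feature_selection_result_name("catboost", 100000))  # catboost_100k_features.txt
print(benchmark_result_name("rf", "js", 10000))           # rf_js_10k.txt
```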

Program Arguments Glossary

Dataset:

https://www.kaggle.com/c/microsoft-malware-prediction/data