A machine learning project to learn and understand different concepts of machine learning. We experimented with different encodings and machine learning algorithms and benchmarked them on a balanced dataset.
Project repository: jatin7gupta/Microsoft-Malware-Detection
- Python 3.7+
- numpy 1.19.1
- scikit-learn 0.23.1
- category-encoders 2.2.2
- pandas 1.0.5*
- lightgbm 2.3.1

\* Do not use pandas 1.1+, because it is incompatible with the CatBoost encoder in category-encoders 2.2.2.
Run the `feature_selection.py` script to conduct the feature selection process. The result is printed to stdout. During the process, the script shows each removed feature along with the reason for its removal. At the end of the output, it lists all of the selected features.
Parameters for the script are:

- `path/to/dataset`, the path to the dataset file
- `encoding`, the encoding method used to encode categorical variables. Available encoding methods are `target`, `js`, `catboost`, and `freq`. See the program arguments glossary below.
Usage examples:

```
python3 feature_selection.py data/randombalancedsample10000_train.csv freq
python3 feature_selection.py data/randombalancedsample100000_train.csv catboost > catboost_100k_features.txt
```
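To illustrate one of the supported encodings, `freq` (frequency encoding) replaces each category with its relative frequency in the data. A minimal sketch with pandas; the column values here are made up for illustration and are not taken from the actual dataset:

```python
import pandas as pd

def frequency_encode(series: pd.Series) -> pd.Series:
    """Replace each category with its relative frequency in the column."""
    freqs = series.value_counts(normalize=True)
    return series.map(freqs)

# Hypothetical categorical feature with repeated values.
col = pd.Series(["a", "b", "a", "c", "a", "b"])
encoded = frequency_encode(col)
# "a" occurs 3/6 of the time, "b" 2/6, "c" 1/6.
```

Unlike target-based encodings, frequency encoding never looks at the label, so it cannot leak target information into the features.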
Run the `benchmark.py` script to conduct a model benchmark test. The result is printed to stdout. It shows the model parameters selected by `GridSearchCV` and the highest score (ROC AUC) from the grid search. It then shows the ROC AUC score obtained by applying the model to the test set. If the model's algorithm is tree-based (`cart`, `rf`, `adaboost`, `lgbm`), it also shows the list of features sorted by importance.
Parameters for the script are:

- `path/to/dataset`, the path to the dataset file used to train the model
- `algorithm`, the algorithm used to train the model. Available algorithms are `adaboost`, `bagging`, `cart`, `knn`, `lgbm`, `logistic`, `rf`, and `svm`.
- `encoding`, the encoding method used to encode categorical variables. Available encoding methods are `target`, `js`, `catboost`, and `freq`. See the program arguments glossary below.
Usage examples:

```
python3 benchmark.py data/randombalancedsample10000_train.csv knn freq
python3 benchmark.py data/randombalancedsample100000_train.csv lgbm catboost > lgbm_catboost_100k.txt
python3 benchmark.py data/randombalancedsample10000_train.csv svm js > svm_js_10k.txt
```
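The benchmark flow described above (grid search scored by ROC AUC, then evaluation on a held-out test set, then feature importances for tree-based models) can be sketched with scikit-learn as follows. The data, parameter grid, and model here are synthetic stand-ins, not the actual grids used by `benchmark.py`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for an encoded, balanced sample.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid search selects parameters by cross-validated ROC AUC.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc",
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("best CV ROC AUC:", grid.best_score_)

# Apply the selected model to the held-out test set.
test_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("test ROC AUC:", test_auc)

# Tree-based models expose importances, sortable as in the report.
importances = sorted(
    enumerate(grid.best_estimator_.feature_importances_),
    key=lambda kv: kv[1],
    reverse=True,
)
```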
The sampling procedure uses 2 scripts, `sampler/splitter.py` and `sampler/random_balanced_sampler.py`.

The `splitter.py` script splits the original dataset into 2 files: one containing only positive examples and one containing only negative examples.
Parameters:

- `path/to/input/dataset`
- `path/to/output/negative/example/file`
- `path/to/output/positive/example/file`
Example:

```
python3 splitter.py data/train.csv data/train0.csv data/train1.csv
```
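A minimal sketch of the split step, assuming the dataset is a CSV with a binary label column (named `HasDetections` in the Kaggle data; adjust if your copy differs). The real `splitter.py` may stream the file instead of loading it whole, since the full dataset has millions of rows:

```python
import pandas as pd

def split_by_label(input_path, neg_path, pos_path, label="HasDetections"):
    """Write negative-only and positive-only example files from one dataset."""
    df = pd.read_csv(input_path)
    df[df[label] == 0].to_csv(neg_path, index=False)
    df[df[label] == 1].to_csv(pos_path, index=False)
```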
The `random_balanced_sampler.py` script takes the examples in the 2 split files and randomly combines them into a single balanced sample file. For example, if the specified sample size is `2S`, it takes `S` random positive examples and `S` random negative examples and combines them into a single file containing `2S` examples.
Note that the script needs the number of examples in each input file for the random sampling behaviour to work. We provide a script, `sampler/counter.py`, that might help in getting these numbers.
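Counting examples can be as simple as counting lines. A sketch of such a helper, assuming one example per line (the actual `counter.py` may differ, e.g. in how it handles a header row):

```python
def count_lines(path):
    """Count the lines of a file without loading it all into memory."""
    with open(path) as f:
        return sum(1 for _ in f)
```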
Parameters:

- `path/to/input/negative/example/file`
- `path/to/input/positive/example/file`
- `path/to/output/sample/file`
- `number_of_examples_in_negative_file`
- `number_of_examples_in_positive_file`
- `output_sample_size`
Example:

```
python3 random_balanced_sampler.py train0.csv train1.csv randombalancedsample10000_train.csv 4462591 4458892 10000
```
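The balanced-sampling idea can be sketched as follows: choose `S` random line indices from each split file and write only those lines to the output. This is a simplified stand-in for `random_balanced_sampler.py`, assuming plain files with one example per line and no header:

```python
import random

def balanced_sample(neg_path, pos_path, out_path, n_neg, n_pos, sample_size):
    """Write sample_size examples: half random negatives, half random positives."""
    half = sample_size // 2
    neg_idx = set(random.sample(range(n_neg), half))
    pos_idx = set(random.sample(range(n_pos), half))
    with open(out_path, "w") as out:
        for path, chosen in ((neg_path, neg_idx), (pos_path, pos_idx)):
            with open(path) as f:
                for i, line in enumerate(f):
                    if i in chosen:
                        out.write(line)
```

Knowing the line counts up front is what lets the script draw the random indices before reading, so each input file is scanned only once.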
The feature selection and model benchmarking scripts require these modules:
- `attr_map.py`: Map of the feature selection results for each encoding method. It is required by the benchmark script and the preparer module to take only the selected features when training the model.
- `attr_classes.py`: Classification of features. It is required by the feature selection script to decide which feature is nominal categorical, continuous, boolean, etc.
- `preparer.py`: Contains a helper function to prepare the dataset for model benchmarking. The preparation includes dropping features and cleansing the values.

Feature selection result files are stored in the `feature_selection_results` directory. The file name format is `[encoding]_[sample_size]_features.txt`.

For example, the result file of the feature selection process using the CatBoost encoder and a sample of 100,000 examples is `catboost_100k_features.txt`.
Model benchmark result files are stored in the `benchmark_results` directory. The file name format is `[algorithm]_[encoding]_[sample_size].txt`.

For example, the benchmark result file of the model trained by the random forest algorithm using the James-Stein encoder and a sample of 10,000 examples is `rf_js_10k.txt`.
- `target`: Base target encoding
- `catboost`: CatBoost encoding
- `js`: James-Stein encoding
- `freq`: Frequency encoding
- `adaboost`: AdaBoost ensemble algorithm
- `bagging`: Bagging ensemble algorithm
- `cart`: scikit-learn implementation of the decision tree algorithm
- `knn`: K-nearest neighbour algorithm
- `lgbm`: LightGBM implementation of the gradient boosting algorithm
- `logistic`: Logistic regression algorithm
- `rf`: Random forest algorithm
- `svm`: Support vector machine algorithm

Dataset: https://www.kaggle.com/c/microsoft-malware-prediction/data
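For reference, the `target` glossary entry (base target encoding) replaces each category with the mean target value observed for that category. A deliberately minimal sketch with pandas, without the smoothing or leakage handling that the category-encoders implementations (and its James-Stein and CatBoost variants) add on top; the example values are invented:

```python
import pandas as pd

def target_encode(col: pd.Series, y: pd.Series) -> pd.Series:
    """Base target encoding: map each category to its mean target value
    (no smoothing, no cross-fold leakage protection)."""
    means = y.groupby(col).mean()
    return col.map(means)

col = pd.Series(["a", "b", "a", "b", "a"])
y = pd.Series([1, 0, 1, 1, 0])
encoded = target_encode(col, y)
# "a" has targets 1, 1, 0 -> mean 2/3; "b" has targets 0, 1 -> mean 0.5
```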