Merge branch 'reorganize' into main

evandowning · Feb 24, 2021 · 25df887
2 parents 0feb1b1 + ed4fd37
commit 25df887

Showing 147 changed files with 1,918 additions and 99,653 deletions.
diff --git a/README.md b/README.md
@@ -7,17 +7,10 @@ For technical details, please see the paper cited below.
 
 **Overview**:
   - Input: unpacked malware PE binary
-  - Middle: list of all basic blocks in binary along with their reconstruction error values
-  - Output: choosing a threshold (based on average reconstruction error value per function), identifies regions of interest (i.e., basic blocks above threshold), and clusters the averaged feature vectors of RoIs
+  - Middle: list of all basic blocks in binary along with their reconstruction error values (MSE values)
+  - Output: choosing a threshold (based on average MSE value per function), identifies regions of interest (RoI) (i.e., basic blocks with MSE values above threshold), and clusters the averaged feature vectors of RoIs
 
-**Usage**: Using ground-truth malware binaries, choose an error value threshold which gives the analyst their desired results (tune to favor increasing TPR or decreasing FPR).
-
-## Citation
-  - ```
-    things and stuff
-    ```
-  - Paper: link to paper
-  - [Reproducing paper experiments](reproducing_paper/README.md)
+**Usage**: Using ground-truth malware binaries, choose an MSE threshold which gives the analyst their desired results (tune to favor increasing TPR or decreasing FPR).
 
 ## Coming soon
   - Dockerfile
@@ -28,7 +21,7 @@ For technical details, please see the paper cited below.
     - Tested on Debian 10 (Buster)
     - Python 3 (tested with Python 3.7.3) and pip
     - virtualenvwrapper (optional, but recommended)
-    - BinaryNinja (used only to extract features and function information from binaries)
+    - BinaryNinja 2.2 (used to extract features and function information from binaries)
     - parallel (optional, but recommended)
   - Setup:
     ```
@@ -40,69 +33,113 @@ For technical details, please see the paper cited below.
 
 ## Usage
   - Obtain unpacked benign and malicious PE file datasets
-    - I put benign unpacked binaries in `/data/benign_unipacker/` and malicious unpacked binaries in `/data/malicious_unipacker/` because I unpacked them via unipacker.
-    - Under each directory is a subdirectory of the family label. I.e., `/data/benign_unipacker/benign/`, `/data/malicious_unipacker/virut/`, `/data/malicious_unipacker/zbot/`, etc.
-  - Extract BinaryNinja DB file
-    ```
-    (dr) $ find /data/benign_unipacker -type f > samples_benign_unipacker.txt
-    (dr) $ ./write_commands_binja.sh samples_benign_unipacker.txt /data/benign_unipacker_bndb/ > commands_benign_unipacker_bndb.txt
-    (dr) $ time parallel --memfree 2G --retries 10 -a commands_benign_unipacker_bndb.txt 2> error.txt > output.txt
-    ```
-  - Extracting features:
-    ```
-    (dr) $ find /data/benign_unipacker_bndb -type f > samples_benign_unipacker_bndb.txt
-    (dr) $ ./write_commands_acfg_plus_binja.sh samples_benign_unipacker_bndb.txt benign_unipacker_bndb_acfg_plus/ > commands_benign_unipacker_bndb_acfg_plus.txt
-    (dr) $ time parallel --memfree 2G --retries 10 -a commands_benign_unipacker_bndb_acfg_plus.txt 2> error.txt > output.txt
-
-    (dr) $ find benign_unipacker_bndb_acfg_plus -type f > samples_benign_unipacker_bndb_acfg_plus.txt
-    (dr) $ ./write_commands_acfg_plus_feature.sh samples_benign_unipacker_bndb_acfg_plus.txt benign_unipacker_bndb_acfg_plus_feature/ > commands_benign_unipacker_bndb_acfg_plus_feature.txt
-    (dr) $ time parallel --memfree 2G --retries 10 -a commands_benign_unipacker_bndb_acfg_plus_feature.txt 2> error.txt > output.txt
-    ```
-  - Extract function-related data from BinaryNinja DB files
+    - Benign folder layout:    `benign_unpacked/benign/<binary_files>`
+    - Malicious folder layout: `malicious_unpacked/<family_label>/<binary_files>`
+  - Extract binary features & data
     ```
-    (dr) $ find /data/malicious_unipacker_bndb/ -type f > samples_bndb.txt
-    (dr) $ time ./write_commands_get_function.sh samples_bndb.txt /data/malicious_unipacker_bndb_function/ > commands_get_function.txt
-    (dr) $ time parallel --memfree 4G --retries 10 -a commands_get_function.txt 2> parallel_get_function_stderr.txt > parallel_get_function_stdout.txt
+    (dr) $ ./extract.sh benign_unpacked/
+    (dr) $ ./extract.sh malicious_unpacked/
     ```
   - Train autoencoder:
     ```
-    (dr) $ python split.py /data/benign_unipacker_bndb_acfg_plus_feature/ train.txt test.txt
-    (dr) $ for fn in 'train.txt' 'test.txt' 'valid.txt'; do shuf $fn > tmp.txt; mv tmp.txt $fn; done
+    (dr) $ cd ./autoencoder/
+
+    # Split & shuffle benign dataset
+    (dr) $ python split.py benign_unpacked_bndb_raw_feature/ train.txt test.txt > split_stdout.txt
+    (dr) $ for fn in 'train.txt' 'test.txt'; do shuf $fn > tmp.txt; mv tmp.txt $fn; done
 
     # Check that benign samples use all features:
-    (dr) $ time python feature_check.py train.txt test.txt valid.txt
-    # Do the same for malicious samples as well
-    (dr) $ find /data/malicious_unipacker_bndb_acfg_plus_feature/ -type f > malicious.txt
-    (dr) $ time python feature_check.py malicious.txt valid.txt valid.txt
+    (dr) $ python feature_check.py train.txt
+    (dr) $ python feature_check.py test.txt
+    # Check that malicious samples use all features:
+    (dr) $ find malicious_unpacked_bndb_raw_feature/ -type f > malicious.txt
+    (dr) $ python feature_check.py malicious.txt
+
+    # Get max values (for normalizing)
+    (dr) $ python normalize.py --train train.txt \
+                               --test test.txt \
+                               --output normalize.npy
 
     # Train model
-    (dr) $ time python autoencoder.py --kernel 24 --strides 1 --option 2 acfg_plus --train train.txt --test test.txt --valid valid.txt --model ./models/m2_normalize_24_12.h5 --map benign_map.txt --normalize True > output.txt
-    ```
-  - Extract reconstruction errors:
+    (dr) $ time python autoencoder.py --train train.txt \
+                                      --test test.txt \
+                                      --normalize normalize.npy \
+                                      --model dr.h5 > autoencoder_stdout.txt 2> autoencoder_stderr.txt
+    ```
+  - Cluster suspicious functions:
+    - Extract MSE values for each malware basic block:
+      ```
+      (dr) $ cd ./autoencoder/
+      (dr) $ time python mse.py --feature malicious.txt \
+                                --model dr.h5 \
+                                --normalize normalize.npy \
+                                --output malicious_unpacked_bndb_raw_feature_mse/ 2> mse_stderr.txt
+      ```
+    - Identify desired threshold. See [Grading](#grading).
+    - Extract RoI (basic blocks):
+      ```
+      (dr) $ cd ./autoencoder/
+      (dr) $ mkdir roi/
+      (dr) $ time python roi.py --bndb-func malicious_unpacked_bndb_function/ \
+                                --feature malicious_unpacked_bndb_raw_feature/ \
+                                --mse malicious_unpacked_bndb_raw_feature_mse/ \
+                                --normalize normalize.npy \
+                                --output roi/ \
+                                --bb --avg --thresh 7.293461392658043e-06 > roi/stdout.txt 2> roi/stderr.txt
+      ```
+    - Cluster functions containing RoI:
+      ```
+      (dr) $ cd ./cluster/
+      (dr) $ time python pca_hdbscan.py --x ../autoencoder/roi/x.npy \
+                                        --fn ../autoencoder/roi/fn.npy \
+                                        --addr ../autoencoder/roi/addr.npy > pca_hdbscan_stdout.txt
+      ```
+    - Graph percentage of functions highlighted:
+      ```
+      (dr) $ cd ./cluster/
+      (dr) $ time python function_coverage.py --functions malicious_unpacked_bndb_function/ \
+                                              --fn ../autoencoder/roi/fn.npy \
+                                              --addr ../autoencoder/roi/addr.npy \
+                                              --output function_coverage.png > function_coverage_stdout.txt
+      ```
+
+## Grading
+  - Graph ROC curves
     ```
-    (dr) $ time python autoencoder_eval_all.py acfg_plus --acfg-feature /data/malicious_unipacker_bndb_acfg_plus_feature/ \
-                                                         --model ./models/autoencoder_benign_unipacker_plus/m2_normalize_24_12.h5 \
-                                                         --normalize True \
-                                                         --output /data/malicious_unipacker_bndb_acfg_plus_feature_error/ 2> autoencoder_eval_all_stderr.txt
+    (dr) $ cd grader/
+    (dr) $ ./roc.sh &> roc_stdout_stderr.txt
     ```
-  - Extract regions of interest:
+    - [roc_rbot.png](grader/roc_rbot.png)
+    - [roc_pegasus.png](grader/roc_pegasus.png)
+    - [roc_carbanak.png](grader/roc_carbanak.png)
+    - [roc_combined.png](grader/roc_combined.png)
+  - Pick desired threshold
     ```
-    (dr) $ time python autoencoder_roi.py acfg_plus --data /data/malicious_unipacker_bndb_acfg_plus_feature_error/ \
-                                                    --bndb-func /data/malicious_unipacker_bndb_function/ \
-                                                    --acfg /data/malicious_unipacker_bndb_acfg_plus_feature/ \
-                                                    --output ./autoencoder_roi/ \
-                                                    --bb --avg --thresh 7.293461392658043e-06 > ./autoencoder_roi/stdout.txt 2> ./autoencoder_roi/stderr.txt
+    $ vim roc_stdout_stderr.txt
     ```
-  - Cluster regions of interest:
+  - Examine FPs & FNs due to chosen threshold
     ```
-    (dr) $ time python pca_hdbscan.py --x autoencoder_roi/x_train.npy \
-                                      --fn autoencoder_roi/train_fn.npy \
-                                      --addr autoencoder_roi/train_addr.npy > pca_hdbscan_output.txt
+    $ examine.sh 9.053894787328584e-08 &> examine_stdout_stderr.txt
+    $ vim examine_stdout_stderr.txt
     ```
 
 ## FAQs
   - Why don't you release the binaries used to train and evaluate DeepReflect (other than ground-truth samples)?
     - We cannot release malware binaries because of our agreement with those who provided them to us.
-      - If you're looking for malware binaries, you might consider the [SOREL dataset](https://github.com/sophos-ai/SOREL-20M)
+      - If you're looking for malware binaries, you might consider the [SOREL dataset](https://github.com/sophos-ai/SOREL-20M) or contacting [VirusTotal](https://www.virustotal.com/).
     - We cannot release benign binaries because of copyright rules.
+      - If you're looking for benign binaries, you might consider [crawling](https://github.com/evandowning/selenium-crawler) them on [CNET](https://download.cnet.com/windows/). Make sure to verify they're not malicious via [VirusTotal](https://www.virustotal.com/).
     - We do, however, release our extracted features so models can be trained from scratch.
+
+## Citing
+  ```
+  @inproceedings{deepreflect_2021,
+      author = {Downing, Evan and Mirsky, Yisroel and Park, Kyuhong and Lee, Wenke},
+      title = {DeepReflect: {Discovering} {Malicious} {Functionality} through {Binary} {Reconstruction}},
+      booktitle = {{USENIX} {Security} {Symposium}},
+      year = {2021}
+  }
+  ```
+  - Paper: link to paper
+  - [Reproducing paper experiments](reproducing_paper/README.md)
+
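
Note on the README overview in the diff above: the Output step keeps the basic blocks whose MSE value exceeds a chosen threshold. The following is a minimal sketch of that selection in plain Python, assuming each basic block is a numeric feature vector and that the autoencoder's reconstruction of it is already available. The names `block_mse` and `regions_of_interest` are hypothetical and are not taken from the repository's `mse.py` or `roi.py`; this is an illustration, not the authors' implementation.

```python
def block_mse(original, reconstructed):
    """Mean squared error between one basic block's feature vector
    and its reconstruction (hypothetical helper, not from the repo)."""
    if len(original) != len(reconstructed):
        raise ValueError("feature vectors must have the same length")
    if not original:
        raise ValueError("feature vectors must not be empty")
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def regions_of_interest(features, reconstructions, threshold):
    """Return (block_index, mse) for every basic block whose MSE is
    strictly above the threshold (hypothetical helper, not from the repo)."""
    roi = []
    for idx, (orig, recon) in enumerate(zip(features, reconstructions)):
        mse = block_mse(orig, recon)
        if mse > threshold:
            roi.append((idx, mse))
    return roi


if __name__ == "__main__":
    # Toy data: two basic blocks with two features each (made-up values).
    features = [[0.0, 0.0], [1.0, 1.0]]
    reconstructions = [[0.0, 0.0], [0.0, 1.0]]
    print(regions_of_interest(features, reconstructions, threshold=0.1))
```

Raising the threshold keeps fewer blocks (fewer false positives, more false negatives); lowering it does the opposite, which is the TPR/FPR trade-off the README's Usage line describes.
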
diff --git a/__init__.py b/__init__.py