Merge branch 'reorganize' into main
evandowning committed Feb 24, 2021
2 parents 0feb1b1 + ed4fd37 commit 25df887
Showing 147 changed files with 1,918 additions and 99,653 deletions.
151 changes: 94 additions & 57 deletions README.md
@@ -7,17 +7,10 @@ For technical details, please see the paper cited below.

**Overview**:
- Input: unpacked malware PE binary
- Middle: list of all basic blocks in binary along with their reconstruction error values
- Output: choosing a threshold (based on average reconstruction error value per function), identifies regions of interest (i.e., basic blocks above threshold), and clusters the averaged feature vectors of RoIs
- Middle: list of all basic blocks in binary along with their reconstruction error values (MSE values)
- Output: given a chosen threshold (based on the average MSE value per function), identifies regions of interest (RoIs) (i.e., basic blocks with MSE values above the threshold) and clusters the averaged feature vectors of the RoIs

**Usage**: Using ground-truth malware binaries, choose an error value threshold which gives the analyst their desired results (tune to favor increasing TPR or decreasing FPR).

## Citation
- ```
things and stuff
```
- Paper: link to paper
- [Reproducing paper experiments](reproducing_paper/README.md)
**Usage**: Using ground-truth malware binaries, choose an MSE threshold which gives the analyst their desired results (tune to favor increasing TPR or decreasing FPR).
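
The overview above can be illustrated with a minimal sketch of the two core ideas: computing a per-basic-block MSE between a feature vector and its reconstruction, and keeping the blocks whose MSE exceeds a threshold. The function names, the toy feature vectors, and the threshold of `0.1` are all hypothetical and are not taken from the repository's scripts; a real run would use the autoencoder's output instead of hand-written reconstructions.

```python
from typing import Dict, List, Sequence


def block_mse(original: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Mean squared error between a block's features and their reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("feature vectors must have the same length")
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def regions_of_interest(mse_by_block: Dict[int, float], threshold: float) -> List[int]:
    """Return the addresses of basic blocks whose MSE is strictly above the threshold."""
    return sorted(addr for addr, mse in mse_by_block.items() if mse > threshold)


# Toy data (invented): block address -> (features, reconstruction).
blocks = {
    0x401000: ([0.0, 1.0, 0.5], [0.0, 1.0, 0.5]),  # reconstructed perfectly
    0x401020: ([1.0, 0.0, 0.0], [0.0, 0.0, 0.0]),  # reconstructed poorly
    0x401040: ([0.5, 0.5, 0.5], [0.5, 0.5, 0.0]),  # small error, below threshold
}

mse_by_block = {addr: block_mse(x, y) for addr, (x, y) in blocks.items()}
roi = regions_of_interest(mse_by_block, threshold=0.1)
print([hex(addr) for addr in roi])
```

Raising the threshold shrinks the set of RoIs (fewer false positives, more false negatives), while lowering it does the opposite, which is the trade-off the usage note refers to.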

## Coming soon
- Dockerfile
@@ -28,7 +21,7 @@ For technical details, please see the paper cited below.
- Tested on Debian 10 (Buster)
- Python 3 (tested with Python 3.7.3) and pip
- virtualenvwrapper (optional, but recommended)
- BinaryNinja (used only to extract features and function information from binaries)
- BinaryNinja 2.2 (used to extract features and function information from binaries)
- parallel (optional, but recommended)
- Setup:
```
@@ -40,69 +33,113 @@ For technical details, please see the paper cited below.

## Usage
- Obtain unpacked benign and malicious PE file datasets
- I put benign unpacked binaries in `/data/benign_unipacker/` and malicious unpacked binaries in `/data/malicious_unipacker/` because I unpacked them via unipacker.
- Under each directory is a subdirectory of the family label. I.e., `/data/benign_unipacker/benign/`, `/data/malicious_unipacker/virut/`, `/data/malicious_unipacker/zbot/`, etc.
- Extract BinaryNinja DB file
```
(dr) $ find /data/benign_unipacker -type f > samples_benign_unipacker.txt
(dr) $ ./write_commands_binja.sh samples_benign_unipacker.txt /data/benign_unipacker_bndb/ > commands_benign_unipacker_bndb.txt
(dr) $ time parallel --memfree 2G --retries 10 -a commands_benign_unipacker_bndb.txt 2> error.txt > output.txt
```
- Extracting features:
```
(dr) $ find /data/benign_unipacker_bndb -type f > samples_benign_unipacker_bndb.txt
(dr) $ ./write_commands_acfg_plus_binja.sh samples_benign_unipacker_bndb.txt benign_unipacker_bndb_acfg_plus/ > commands_benign_unipacker_bndb_acfg_plus.txt
(dr) $ time parallel --memfree 2G --retries 10 -a commands_benign_unipacker_bndb_acfg_plus.txt 2> error.txt > output.txt
(dr) $ find benign_unipacker_bndb_acfg_plus -type f > samples_benign_unipacker_bndb_acfg_plus.txt
(dr) $ ./write_commands_acfg_plus_feature.sh samples_benign_unipacker_bndb_acfg_plus.txt benign_unipacker_bndb_acfg_plus_feature/ > commands_benign_unipacker_bndb_acfg_plus_feature.txt
(dr) $ time parallel --memfree 2G --retries 10 -a commands_benign_unipacker_bndb_acfg_plus_feature.txt 2> error.txt > output.txt
```
- Extract function-related data from BinaryNinja DB files
- Benign folder layout: `benign_unpacked/benign/<binary_files>`
- Malicious folder layout: `malicious_unpacked/<family_label>/<binary_files>`
- Extract binary features & data
```
(dr) $ find /data/malicious_unipacker_bndb/ -type f > samples_bndb.txt
(dr) $ time ./write_commands_get_function.sh samples_bndb.txt /data/malicious_unipacker_bndb_function/ > commands_get_function.txt
(dr) $ time parallel --memfree 4G --retries 10 -a commands_get_function.txt 2> parallel_get_function_stderr.txt > parallel_get_function_stdout.txt
(dr) $ ./extract.sh benign_unpacked/
(dr) $ ./extract.sh malicious_unpacked/
```
- Train autoencoder:
```
(dr) $ python split.py /data/benign_unipacker_bndb_acfg_plus_feature/ train.txt test.txt
(dr) $ for fn in 'train.txt' 'test.txt' 'valid.txt'; do shuf $fn > tmp.txt; mv tmp.txt $fn; done
(dr) $ cd ./autoencoder/
# Split & shuffle benign dataset
(dr) $ python split.py benign_unpacked_bndb_raw_feature/ train.txt test.txt > split_stdout.txt
(dr) $ for fn in 'train.txt' 'test.txt'; do shuf $fn > tmp.txt; mv tmp.txt $fn; done
# Check that benign samples use all features:
(dr) $ time python feature_check.py train.txt test.txt valid.txt
# Do the same for malicious samples as well
(dr) $ find /data/malicious_unipacker_bndb_acfg_plus_feature/ -type f > malicious.txt
(dr) $ time python feature_check.py malicious.txt valid.txt valid.txt
(dr) $ python feature_check.py train.txt
(dr) $ python feature_check.py test.txt
# Check that malicious samples use all features:
(dr) $ find malicious_unpacked_bndb_raw_feature/ -type f > malicious.txt
(dr) $ python feature_check.py malicious.txt
# Get max values (for normalizing)
(dr) $ python normalize.py --train train.txt \
--test test.txt \
--output normalize.npy
# Train model
(dr) $ time python autoencoder.py --kernel 24 --strides 1 --option 2 acfg_plus --train train.txt --test test.txt --valid valid.txt --model ./models/m2_normalize_24_12.h5 --map benign_map.txt --normalize True > output.txt
```
- Extract reconstruction errors:
(dr) $ time python autoencoder.py --train train.txt \
--test test.txt \
--normalize normalize.npy \
--model dr.h5 > autoencoder_stdout.txt 2> autoencoder_stderr.txt
```
- Cluster suspicious functions:
- Extract MSE values for each malware basic block:
```
(dr) $ cd ./autoencoder/
(dr) $ time python mse.py --feature malicious.txt \
--model dr.h5 \
--normalize normalize.npy \
--output malicious_unpacked_bndb_raw_feature_mse/ 2> mse_stderr.txt
```
- Identify desired threshold. See [Grading](#grading).
- Extract RoI (basic blocks):
```
(dr) $ cd ./autoencoder/
(dr) $ mkdir roi/
(dr) $ time python roi.py --bndb-func malicious_unpacked_bndb_function/ \
--feature malicious_unpacked_bndb_raw_feature/ \
--mse malicious_unpacked_bndb_raw_feature_mse/ \
--normalize normalize.npy \
--output roi/ \
--bb --avg --thresh 7.293461392658043e-06 > roi/stdout.txt 2> roi/stderr.txt
```
- Cluster functions containing RoI:
```
(dr) $ cd ./cluster/
(dr) $ time python pca_hdbscan.py --x ../autoencoder/roi/x.npy \
--fn ../autoencoder/roi/fn.npy \
--addr ../autoencoder/roi/addr.npy > pca_hdbscan_stdout.txt
```
- Graph percentage of functions highlighted:
```
(dr) $ cd ./cluster/
(dr) $ time python function_coverage.py --functions malicious_unpacked_bndb_function/ \
--fn ../autoencoder/roi/fn.npy \
--addr ../autoencoder/roi/addr.npy \
--output function_coverage.png > function_coverage_stdout.txt
```
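
Two of the data-preparation ideas in the steps above can be sketched in plain Python: scaling each feature by its maximum over the training set (the README only says the normalization step gets "max values"), and averaging the feature vectors of RoI basic blocks per function before clustering. This is an illustration under assumptions, not the behavior of `normalize.py` or `roi.py`: the function names, the function labels such as `sub_401000`, and the toy vectors are invented, and reading `--avg` as "average per function" is only one plausible interpretation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def feature_maxima(train: Sequence[Sequence[float]]) -> List[float]:
    """Per-feature maximum over all training vectors."""
    return [max(column) for column in zip(*train)]


def normalize(vector: Sequence[float], maxima: Sequence[float]) -> List[float]:
    """Divide each feature by its training maximum; features whose maximum is 0 stay 0."""
    return [v / m if m != 0 else 0.0 for v, m in zip(vector, maxima)]


def average_by_function(
    roi_blocks: Sequence[Tuple[str, Sequence[float]]]
) -> Dict[str, List[float]]:
    """Average the feature vectors of RoI basic blocks that belong to the same function."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for func, vector in roi_blocks:
        grouped[func].append(vector)
    return {
        func: [sum(column) / len(vectors) for column in zip(*vectors)]
        for func, vectors in grouped.items()
    }


# Toy training vectors (invented) used only to derive the per-feature maxima.
train = [[2.0, 10.0, 0.0], [4.0, 5.0, 0.0]]
maxima = feature_maxima(train)

# Toy RoI blocks (invented): (function name, raw feature vector).
raw_roi = [
    ("sub_401000", [4.0, 5.0, 0.0]),
    ("sub_401000", [2.0, 10.0, 0.0]),
    ("sub_402000", [1.0, 0.0, 0.0]),
]
normalized_roi = [(func, normalize(vec, maxima)) for func, vec in raw_roi]
averaged = average_by_function(normalized_roi)
print(averaged)
```

Each function then contributes a single vector, which is the kind of input a dimensionality-reduction and clustering step would consume.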

## Grading
- Graph ROC curves
```
(dr) $ time python autoencoder_eval_all.py acfg_plus --acfg-feature /data/malicious_unipacker_bndb_acfg_plus_feature/ \
--model ./models/autoencoder_benign_unipacker_plus/m2_normalize_24_12.h5 \
--normalize True \
--output /data/malicious_unipacker_bndb_acfg_plus_feature_error/ 2> autoencoder_eval_all_stderr.txt
(dr) $ cd grader/
(dr) $ ./roc.sh &> roc_stdout_stderr.txt
```
- Extract regions of interest:
- [roc_rbot.png](grader/roc_rbot.png)
- [roc_pegasus.png](grader/roc_pegasus.png)
- [roc_carbanak.png](grader/roc_carbanak.png)
- [roc_combined.png](grader/roc_combined.png)
- Pick desired threshold
```
(dr) $ time python autoencoder_roi.py acfg_plus --data /data/malicious_unipacker_bndb_acfg_plus_feature_error/ \
--bndb-func /data/malicious_unipacker_bndb_function/ \
--acfg /data/malicious_unipacker_bndb_acfg_plus_feature/ \
--output ./autoencoder_roi/ \
--bb --avg --thresh 7.293461392658043e-06 > ./autoencoder_roi/stdout.txt 2> ./autoencoder_roi/stderr.txt
$ vim roc_stdout_stderr.txt
```
- Cluster regions of interest:
- Examine FPs & FNs due to chosen threshold
```
(dr) $ time python pca_hdbscan.py --x autoencoder_roi/x_train.npy \
--fn autoencoder_roi/train_fn.npy \
--addr autoencoder_roi/train_addr.npy > pca_hdbscan_output.txt
$ examine.sh 9.053894787328584e-08 &> examine_stdout_stderr.txt
$ vim examine_stdout_stderr.txt
```
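
The threshold choice described in this section can be sketched as follows: for each candidate threshold, compute the true-positive rate (TPR) and false-positive rate (FPR) against ground-truth labels, then keep the candidate with the highest TPR whose FPR stays within a budget. This is not the logic of `roc.sh` or `examine.sh`, whose internals are not shown here; the function names, the per-block labels, and the MSE values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def tpr_fpr(
    mse: Sequence[float], is_malicious: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """TPR and FPR when every block with MSE strictly above the threshold is flagged."""
    tp = fp = fn = tn = 0
    for value, label in zip(mse, is_malicious):
        flagged = value > threshold
        if flagged and label:
            tp += 1
        elif flagged and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def best_threshold(
    mse: Sequence[float], is_malicious: Sequence[bool], max_fpr: float
) -> Optional[Tuple[float, float, float]]:
    """Among observed MSE values, pick the one with the highest TPR and FPR <= max_fpr."""
    best: Optional[Tuple[float, float, float]] = None
    for candidate in sorted(set(mse)):
        tpr, fpr = tpr_fpr(mse, is_malicious, candidate)
        if fpr <= max_fpr and (best is None or tpr > best[1]):
            best = (candidate, tpr, fpr)
    return best


# Toy ground truth (invented): one MSE value and one label per basic block.
mse_values: List[float] = [0.01, 0.02, 0.05, 0.20, 0.30, 0.40]
labels: List[bool] = [False, False, True, False, True, True]

strict = best_threshold(mse_values, labels, max_fpr=0.0)    # no false positives allowed
relaxed = best_threshold(mse_values, labels, max_fpr=0.34)  # tolerate some false positives
print(strict, relaxed)
```

On this toy data the strict budget selects a higher threshold that misses one malicious block, while the relaxed budget selects a lower threshold that catches all of them at the cost of one false positive.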

## FAQs
- Why don't you release the binaries used to train and evaluate DeepReflect (other than ground-truth samples)?
- We cannot release malware binaries because of our agreement with those who provided them to us.
- If you're looking for malware binaries, you might consider the [SOREL dataset](https://github.com/sophos-ai/SOREL-20M)
- If you're looking for malware binaries, you might consider the [SOREL dataset](https://github.com/sophos-ai/SOREL-20M) or contacting [VirusTotal](https://www.virustotal.com/).
- We cannot release benign binaries because of copyright rules.
- If you're looking for benign binaries, you might consider [crawling](https://github.com/evandowning/selenium-crawler) them on [CNET](https://download.cnet.com/windows/). Make sure to verify they're not malicious via [VirusTotal](https://www.virustotal.com/).
- We do, however, release our extracted features so models can be trained from scratch.

## Citing
```
@inproceedings{deepreflect_2021,
author = {Downing, Evan and Mirsky, Yisroel and Park, Kyuhong and Lee, Wenke},
title = {DeepReflect: {Discovering} {Malicious} {Functionality} through {Binary} {Reconstruction}},
booktitle = {{USENIX} {Security} {Symposium}},
year = {2021}
}
```
- Paper: link to paper
- [Reproducing paper experiments](reproducing_paper/README.md)

Empty file added __init__.py
Empty file.