Data usage issues #1

Open
WF-CIL opened this issue May 10, 2021 · 3 comments

Comments

@WF-CIL

WF-CIL commented May 10, 2021

The program uses multiple datasets, and I am a little confused about how to download the data from the website you designated, for example ../data/cleaned-and-combined.csv.

@MJafarMashhadi
Owner

Hi! The data is not included in this GitHub repository. One reason is that the dataset is live and gets updated very frequently.
However, the README file includes a list of the sources I gathered the data from, in this section. I got the data from multiple sources and combined them all into one dataset. Hope it helps!

@WF-CIL
Author

WF-CIL commented May 10, 2021

Thank you very much for taking the time to reply to my question, but I may not have explained it well. I know how to download the data; what confuses me is how to use it. There are many datasets and they differ from one another, so I don't know which to use.
When I ran the program, I got KeyError: 'type'.
I think I am using the data incorrectly.
Did you use the downloaded data directly, without processing it?

I don't know which of the downloaded files should be used where dataset_ops.py calls pd.read_csv('../data/mixed.csv', index_col=False).
There is also def load_data(*, file_name='../data/cleaned-and-combined.csv', split_ratio=None, random_state=42):
I don't know which of the downloaded data to choose in order to run your program.
I hope you can help. Thank you again.

@MJafarMashhadi
Owner

It's been more than a year, so my memory of the details is a bit rusty. I did combine the different data sources into one file here. Reading through that function, you can see what the filenames should be. Feel free to use any data source you want.

At the end of the day, you just need a table with two columns, url and type. The type column is the label: 1 = malicious, 0 = benign.
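
As a rough sketch of that target format (this is not the repository's actual combining code), something like the following builds a two-column table from separate lists of URLs. The helper names `to_labeled_frame` and `combine_sources` and the example URLs are hypothetical; only the column names url and type and the 1/0 label meaning come from this thread.

```python
import pandas as pd

# The only two columns the thread says are needed.
REQUIRED_COLUMNS = ["url", "type"]


def to_labeled_frame(urls, label):
    """Hypothetical helper: wrap raw URLs in a frame with a constant label.

    label follows the convention from the thread: 1 = malicious, 0 = benign.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 (benign) or 1 (malicious)")
    urls = list(urls)
    return pd.DataFrame({"url": urls, "type": [label] * len(urls)})


def combine_sources(frames):
    """Hypothetical helper: stack per-source frames into one url/type table."""
    combined = pd.concat(frames, ignore_index=True)
    # Fail early with a clear message instead of a bare KeyError: 'type' later.
    missing = [c for c in REQUIRED_COLUMNS if c not in combined.columns]
    if missing:
        raise KeyError(f"combined data is missing columns: {missing}")
    combined = combined[REQUIRED_COLUMNS].dropna(subset=["url"])
    # Keep the first occurrence when the same URL appears more than once.
    combined = combined.drop_duplicates(subset="url")
    return combined.reset_index(drop=True)
```

Saving the result with `combined.to_csv('../data/cleaned-and-combined.csv', index=False)` should give a file that has the url and type columns, which is presumably what the KeyError: 'type' above was complaining about.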
