Feat(models): Implemented three models for license similarity #69

Open
wants to merge 1 commit into
base: master

Conversation

Kaushl2208
Member

@Kaushl2208 Kaushl2208 commented Aug 11, 2020

Description

This PR implements Logistic Regression, Multinomial Naive Bayes, and Linear SVC classifiers trained on the license dataset licenseList.csv. The goal is to find a model that can make atarashi faster and more accurate.

Files

  • train.py (trains the models and saves them as binary files)
  • test.py (for testing)
  • lr_model.pkl (binary file for Logistic Regression)
  • nb_model.pkl (binary file for Multinomial Naive Bayes)
  • svc_model.pkl (binary file for Linear SVC)
  • vectorizer.pkl (binary file storing the vocabulary)
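
For readers unfamiliar with this layout, here is a minimal sketch of how three classifiers and a shared vectorizer could be trained and pickled with scikit-learn. It is not the code in train.py: the choice of `TfidfVectorizer`, the toy `TEXTS`/`LABELS` data, and the helper names `train_models`, `to_pickles`, and `predict` are all assumptions made for illustration. The point it shows is that the vectorizer must be saved alongside the models so that prediction uses the same vocabulary as training.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-in for licenseList.csv (assumption): license text and shortname.
TEXTS = [
    "permission is hereby granted free of charge to any person obtaining a copy",
    "redistribution and use in source and binary forms with or without modification",
    "licensed under the apache license version 2.0 you may not use this file",
    "this program is free software you can redistribute it under the gnu general public license",
]
LABELS = ["MIT", "BSD-3-Clause", "Apache-2.0", "GPL-2.0"]


def train_models(texts, labels):
    """Fit one shared vectorizer and three classifiers on the same features."""
    vectorizer = TfidfVectorizer()  # assumption: the PR may use another vectorizer
    features = vectorizer.fit_transform(texts)
    models = {
        "lr_model": LogisticRegression(max_iter=1000),
        "nb_model": MultinomialNB(),
        "svc_model": LinearSVC(),
    }
    for model in models.values():
        model.fit(features, labels)
    return vectorizer, models


def to_pickles(vectorizer, models):
    """Serialize everything to bytes; a training script would write these to .pkl files."""
    blobs = {name: pickle.dumps(model) for name, model in models.items()}
    blobs["vectorizer"] = pickle.dumps(vectorizer)
    return blobs


def predict(blobs, model_name, text):
    """Load the shared vectorizer and the chosen model, then classify one text."""
    vectorizer = pickle.loads(blobs["vectorizer"])
    model = pickle.loads(blobs[model_name])
    return model.predict(vectorizer.transform([text]))[0]
```

A caller would run `train_models(TEXTS, LABELS)` once, store the result of `to_pickles`, and later call `predict(blobs, "svc_model", some_text)`.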

How to use?

  • Test the models

    • atarashi -a lr_classifier path/to/file (Logistic Regression)
    • atarashi -a nb_classifier path/to/file (Multinomial Naive Bayes)
    • atarashi -a svc_classifier path/to/file (Linear SVC)
  • Train the models (Optional)

    • From the base folder, run: python3 atarashi/agents/models/train.py
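
As an illustration of the commands above, the following sketch shows one way the three agent names could be resolved to the pickled model files. It is not atarashi's actual argument handling: `AGENT_TO_MODEL`, `build_parser`, and `resolve` are hypothetical names, and only the `-a` option and the positional file path are taken from the usage lines in this description.

```python
import argparse

# Hypothetical mapping from the agent names used with -a to the model files
# listed in this PR.
AGENT_TO_MODEL = {
    "lr_classifier": "lr_model.pkl",
    "nb_classifier": "nb_model.pkl",
    "svc_classifier": "svc_model.pkl",
}


def build_parser():
    """Build a parser that accepts only the three classifier agents."""
    parser = argparse.ArgumentParser(prog="atarashi")
    parser.add_argument("-a", dest="agent", required=True,
                        choices=sorted(AGENT_TO_MODEL))
    parser.add_argument("input_file")
    return parser


def resolve(argv):
    """Return (model file name, input file path) for a command line."""
    args = build_parser().parse_args(argv)
    return AGENT_TO_MODEL[args.agent], args.input_file
```

For example, `resolve(["-a", "svc_classifier", "path/to/file"])` selects `svc_model.pkl` for the given file, and an unknown agent name is rejected by `argparse` before any model is loaded.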

ToDo

  • Test that the algorithms work, and measure their accuracy, using evaluator.py

  • Properly integrate with atarashii.py

Accuracy Score

| Model Name | Accuracy Score (%) | Time taken on 100 files (sec) |
|---|---|---|
| Logistic Regression | 31 | 88.6 |
| Linear SVC | 36 | 79.4 |
| Multinomial Naive Bayes | 30 | 83.72 |
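
To make the two measured quantities concrete, here is a minimal sketch of how an accuracy percentage and a wall-clock time could be computed for any predictor. It does not reproduce evaluator.py: the `evaluate` helper, the `keyword_predict` stand-in, and the sample data are assumptions for illustration only.

```python
import time


def evaluate(predict, samples):
    """Return (accuracy in %, elapsed seconds) for predict over (text, label) pairs."""
    start = time.perf_counter()
    correct = sum(1 for text, expected in samples if predict(text) == expected)
    elapsed = time.perf_counter() - start
    accuracy = 100.0 * correct / len(samples) if samples else 0.0
    return accuracy, elapsed


def keyword_predict(text):
    """Toy predictor (assumption): stands in for a trained model's predict call."""
    lowered = text.lower()
    if "apache" in lowered:
        return "Apache-2.0"
    if "gnu general public" in lowered:
        return "GPL-2.0"
    return "MIT"


SAMPLES = [
    ("Licensed under the Apache License", "Apache-2.0"),
    ("GNU General Public License version 2", "GPL-2.0"),
    ("Permission is hereby granted, free of charge", "MIT"),
    ("Redistribution and use in source and binary forms", "BSD-3-Clause"),
]

accuracy, seconds = evaluate(keyword_predict, SAMPLES)
```

Timing only the prediction loop, as done here, excludes model loading; whether the numbers in the table include loading time is not stated in this PR.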

Future Scope

  • A well-defined dataset should improve the similarity accuracy further. By well-defined I mean a dataset that also includes newly updated licenses; a license file in the (1 class to n licenses) style would do the job.

CC: @hastagAB @GMishx @ag4ums

Signed off by: Kaushlendra Pratap Singh [email protected]

@Kaushl2208
Member Author

@hastagAB @GMishx, I added the model commands to atarashii.py, but it seems I have missed an update somewhere in the code.

@Kaushl2208
Member Author

@GMishx @ag4ums I have run all three models on the test files and am attaching screenshots of the results.

SVC: [screenshot of results]

NB: [screenshot of results]

Logistic Regression: [screenshot of results]

@GMishx added the GSOC-20 label on Aug 20, 2020
Resolved review threads:

  • atarashi/atarashii.py (1 thread, outdated)
  • atarashi/agents/models/test.py (4 threads, 3 outdated)
  • atarashi/agents/models/train.py (2 threads, outdated)

def model_train():

    data = pd.read_csv("atarashi/data/licenses/licenseList.csv")
Member

As a future improvement, SPDX license data can be pulled in using atarashi.license.licenseDownloader.LicenseDownloader.download_license and merged with main list using atarashi.license.license_merger.license_merger.

Member

@GMishx GMishx left a comment

A few more changes are required. Please also squash your commits.

Resolved review threads:

  • atarashi/agents/models/test.py (3 threads, 1 outdated)
  • atarashi/agents/models/train.py (1 thread)
  • atarashi/atarashii.py (1 thread, outdated)
  • README.md (2 threads, outdated)
Member

@GMishx GMishx left a comment

Agent looks good.
Tested with pip install .

Labels
GSOC-20 Label to tag issues and pull request for GSOC 2020 activities

3 participants