Cross-lingual NLP: Developing NLP models that can effectively process and translate multiple languages, especially low-resource languages, to help bridge language barriers and make information more accessible.
Jayveersinh Raj
Makar Shevchenko
Nikolay Pavlenko
This is a project for abuse reporting, trained on the Jigsaw (Google) toxic comment dataset of 150k+ English comments. The project aims to accomplish zero-shot transfer for abuse detection in an arbitrary language while being trained only on an English dataset. It attempts to achieve this through the vector-space alignment that is the core idea behind multilingual embedding models such as XLM-RoBERTa and MUSE. Different embeddings were tested with the dataset to find the best-performing embedder. Our project/model can be used by any platform or software engineer/enthusiast who deals with multiple languages, either to flag toxic behaviour directly or to identify a valid user report of toxic behaviour. The use case can be application specific, but the idea is to make the model work with an arbitrary language while training on data from a single language.
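The alignment intuition can be sanity-checked directly: if translations of the same sentence land close together in the shared embedding space, a classifier trained on English vectors should transfer to other languages. A minimal sketch, assuming `xlm-roberta-base` with mean pooling (both are illustrative choices, not necessarily the project's exact setup):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative choice of multilingual encoder; the project compared several embedders
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = AutoModel.from_pretrained("xlm-roberta-base")

def embed(text):
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)          # mean-pool to a single vector

en = embed("I hate you")
hi = embed("मुझे तुमसे नफरत है")  # the same sentence in Hindi

# If the spaces are aligned, translations should score high cosine similarity,
# which is what lets an English-trained classifier work zero-shot
print(torch.cosine_similarity(en, hi, dim=0).item())
```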
NOTE: The classifier head can have an arbitrary number of parameters or hidden states; the above diagram is only a general idea. Diagram credits: Samuel Leonardo Gracio
Dailymotion (credits: Samuel Leonardo Gracio)
jigsaw-toxic-comment-classification
We merged all the classes into one, since they all belong to a single super class of toxicity. Our hypothesis is that this flags severely toxic behaviour, severe enough to ban or block a user.
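A minimal sketch of the merge, assuming the column names of the Kaggle `train.csv` release (toxic, severe_toxic, obscene, threat, insult, identity_hate):

```python
import pandas as pd

# Jigsaw training data; column names follow the Kaggle release
df = pd.read_csv("train.csv")
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Collapse the six fine-grained classes into one binary super class:
# a comment is toxic (1) if any of the original labels is set
df["label"] = (df[label_cols].sum(axis=1) > 0).astype(int)
```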
- Non-toxic sentences
  - 100% accuracy: our model generalizes well.
- Toxic sentences
  - 75% accuracy → the misclassified sentences were not severely toxic; their labels were subjective to the human annotator.
  - Proof of claim: GPT-4 translated them and judged them not severe, yet refused to generate toxic sentences itself.
  - Conclusion: after generating the translations, GPT-4 said they were toxic, contradicting itself; hence our model is better at detecting abuse/toxicity and judging its severity.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the PolyGuard tokenizer and classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Jayveersinh-Raj/PolyGuard")
model = AutoModelForSequenceClassification.from_pretrained("Jayveersinh-Raj/PolyGuard")
from transformers import XLMRobertaForSequenceClassification, AutoTokenizer
import torch

model_name = "Jayveersinh-Raj/PolyGuard"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = XLMRobertaForSequenceClassification.from_pretrained(model_name)

text = "Jayveer is a great NLP engineer, and a noob in CV"

# Tokenize and classify; truncate inputs longer than 512 tokens
inputs = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)
outputs = model(inputs)[0]
probabilities = torch.softmax(outputs, dim=1)
predicted_class = torch.argmax(probabilities).item()

# Class 1 = toxic, class 0 = not toxic
if predicted_class == 1:
    print("Toxic")
else:
    print("Not toxic")
The XLM-RoBERTa tokenizer depends on sentencepiece:

pip install sentencepiece
- Requirements
  - Single GPU
- Source: LINK