kdMoura/label-noise

R script for generating imbalanced data, injecting label noise, and evaluating ensemble noise detections as described in the article "Label noise detection under the Noise at Random model with ensemble filters".

Overview

This repository contains the code used for the article Label noise detection under the Noise at Random model with ensemble filters.

Installation

Install required packages

# classifiers
install.packages('RSNNS') #mlp
install.packages('party') #decision tree (ctree)
install.packages('class') #knn
install.packages('e1071') #svm
install.packages('randomForest') #randomForest
install.packages('RoughSets') #CN2
install.packages('naivebayes') #Naive Bayes
install.packages('RWeka') #J48 

install.packages("dismo") #"kfold" for stratified cross validation
install.packages('foreign') #read.arff
install.packages('caTools') #sample.split

Load libraries and source files

# libraries
library(RSNNS)
library(party)
library(class)
library(e1071)
library(randomForest)
library(RoughSets)
library(naivebayes)
library(RWeka)
library(dismo)
library(foreign)
library(caTools)

#sources
setwd("[path to the label-noise repository folder]/label-noise")
source('./LNLib/Classifiers.R')
source('./LNLib/HandleData.R')
source('./LNLib/Classification.R')
source('./LNLib/NoiseInjection.R')
source('./LNLib/NoiseDetection.R')

How it works

(process diagram)

1. Data Cleaning

dat = LNLib.readAndTreatDataset("sample_data.arff")
cleaned.data = LNLib.getCleanedData(dat)

2. Generate Imbalance Ratio (IR)

#For IR of 50:50, use LNLib.IR.VALUES$IR.5050
#For IR of 30:70, use LNLib.IR.VALUES$IR.3070
#For IR of 20:80, use LNLib.IR.VALUES$IR.2080
data.2080 = LNLib.generateIR(cleaned.data,LNLib.IR.VALUES$IR.2080)
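LNLib.generateIR's internals are not shown in this README; conceptually, imposing a 20:80 class ratio amounts to resampling until the proportions match. A minimal language-agnostic sketch in Python (the helper name and the undersampling strategy are assumptions for illustration, not the repository's actual code):

```python
import random

def generate_ir(rows, minority_frac, label_key="class", minority=1):
    """Undersample the majority class so the minority class makes up
    minority_frac of the result (e.g. 0.20 for a 20:80 ratio).
    Illustrative only -- not LNLib.generateIR's implementation."""
    minor = [r for r in rows if r[label_key] == minority]
    major = [r for r in rows if r[label_key] != minority]
    # keep all minority rows; shrink the majority to hit the target ratio
    n_major = round(len(minor) * (1 - minority_frac) / minority_frac)
    random.seed(165421)
    return minor + random.sample(major, min(n_major, len(major)))

data = [{"x": i, "class": 1} for i in range(50)] + \
       [{"x": i, "class": 2} for i in range(300)]
balanced = generate_ir(data, 0.20)
# 50 minority rows kept, 200 majority rows sampled: a 20:80 split
```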

3. Split data

sample = sample.split(data.2080$class, SplitRatio = 0.70)
train = subset(data.2080, sample == TRUE)
test  = subset(data.2080, sample == FALSE)
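caTools::sample.split draws a stratified split, keeping the class proportions of data.2080$class in both partitions. The stratification idea can be sketched in Python (a hypothetical helper, shown only to illustrate what the R one-liner does):

```python
import random

def stratified_split(rows, ratio=0.70, label_key="class", seed=165421):
    """Split rows into train/test while preserving per-class proportions,
    mimicking what caTools::sample.split does in the R workflow above."""
    random.seed(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label_key], []).append(r)
    train, test = [], []
    for members in by_class.values():
        random.shuffle(members)          # randomize within each class
        cut = round(len(members) * ratio)
        train.extend(members[:cut])      # ~70% of every class
        test.extend(members[cut:])       # remaining ~30%
    return train, test

rows = [{"x": i, "class": 1} for i in range(20)] + \
       [{"x": i, "class": 2} for i in range(80)]
train, test = stratified_split(rows)
# train: 14 class-1 + 56 class-2 rows; test: 6 + 24
```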

4. Noise Injection

#Inject 15% label noise under the NCAR noise model:
info.injection = LNLib.injectNoise(test, 15, LNLib.NOISE.MODEL$NCAR)
noisy.test = info.injection$noisy.data
About noise models

The label noise model should be one of the following:

  1. LNLib.NOISE.MODEL$NCAR - noise equally distributed per class
  2. or LNLib.NOISE.MODEL$NAR.MIN - more noise in minority class, proportion of 1:9
  3. or LNLib.NOISE.MODEL$NAR.MAJ - more noise in majority class, proportion of 1:9

Example: suppose a dataset with 200 rows into which we want to inject 10% noise, i.e., 20 noisy rows:

  • for NCAR, approx. 10 noisy rows are injected into each class
  • for NAR.MIN, approx. 18 noisy rows are injected into the minority class and 2 into the majority class
  • for NAR.MAJ, approx. 18 noisy rows are injected into the majority class and 2 into the minority class
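The per-class arithmetic above can be checked with a small sketch. The following Python helper (`noise_per_class` is a hypothetical name, not part of LNLib) splits the noise budget 50:50 for NCAR and 9:1 toward the targeted class for the NAR models, using the proportions stated in this README:

```python
def noise_per_class(n_rows, noise_perc, model):
    """Split a total label-noise budget across two classes.
    NCAR: 50/50; NAR.MIN / NAR.MAJ: 9:1 toward the targeted class
    (proportions taken from this README, not from LNLib's code)."""
    total = round(n_rows * noise_perc / 100)
    share = 0.5 if model == "NCAR" else 0.9  # NAR sends 9 of 10 parts one way
    targeted = round(total * share)
    return targeted, total - targeted

# 200 rows, 10% noise -> 20 noisy labels in total
print(noise_per_class(200, 10, "NCAR"))     # (10, 10): equal per class
print(noise_per_class(200, 10, "NAR.MIN"))  # (18, 2): minority gets 18
```

The same arithmetic covers NAR.MAJ; only which class receives the larger share changes.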
About LNLib.injectNoise result
set.seed(165421)

#Dataset with 10 rows ( 5 rows from class 1, and 5 rows from class 2)
dat = as.data.frame(
          list( col1 = c(1,2,3,4,5,6,7,8,9,10),
                col2 = c(10,9,8,7,6,5,4,3,2,1), 
                class  = c(1,1,1,1,1,2,2,2,2,2)))
                
##If we inject 10% noise...
LNLib.injectNoise(dat,10,LNLib.NOISE.MODEL$NCAR)

##We get:
$changes.made
  changed.rows original.class new.class
1            5              1         2
2           10              2         1

$noisy.data
   col1 col2 class
1     1   10     1
2     2    9     1
3     3    8     1
4     4    7     1
5     5    6     2
6     6    5     2
7     7    4     2
8     8    3     2
9     9    2     2
10   10    1     1

$noise.perc
[1] 10

$noise.ratio
[1] 1

5. Select vote threshold / Ensemble prediction / Evaluation

#Get classifications (from all algorithms) after the noise has been injected
classified.test = LNLib.getClassification(train,noisy.test) 

#Set the ensemble vote threshold to 60% (i.e., an instance is flagged as noisy if at least 60% of the algorithms misclassify it)
ensemble.threshold = 60.0

measures = LNLib.getNoiseDetectionMeasures(info.injection,
                              classified.test,
                              ensemble.threshold)
Sample result of measures
$precision
[1] 28
$recall
[1] 87.5
$f_measure
[1] 42.4
$n.correct.detection #Number of correct detections
[1] 7
$n.noise.detected #Number of detections (correct or not)
[1] 25
$n.noisy.labels #Number of noisy labels in data
[1] 8
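The sample measures above follow from the standard precision/recall definitions applied to the three detection counts. A small Python sketch of that arithmetic (the function name is illustrative, not LNLib's):

```python
def detection_measures(n_correct, n_detected, n_noisy):
    """Noise-detection precision, recall, and F-measure as percentages.
    n_correct: correct detections; n_detected: all detections (correct
    or not); n_noisy: noisy labels actually present in the data."""
    precision = 100 * n_correct / n_detected
    recall = 100 * n_correct / n_noisy
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, round(f_measure, 1)

print(detection_measures(7, 25, 8))  # (28.0, 87.5, 42.4), as above
```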
About ensemble threshold

It is also possible to pass one of the following options as the threshold:

  1. LNLib.ENSEMBLE.THRESHOLD$CONSENSUS (100% of algorithms)
  2. or LNLib.ENSEMBLE.THRESHOLD$MAJORITY (50% + 1 algorithms)
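Whichever option is used, the ensemble rule is the same: count how many algorithms misclassify each instance and compare that count against the threshold. A hedged Python sketch (using >= at the boundary is an assumption about LNLib's behavior, not confirmed by this README):

```python
def flag_noisy(misclassified_by, n_algorithms, threshold_perc):
    """Return indices of instances flagged as noisy: those misclassified
    by at least threshold_perc% of the ensemble's algorithms."""
    return [i for i, m in enumerate(misclassified_by)
            if 100 * m / n_algorithms >= threshold_perc]

# misclassification counts for 5 instances under an 8-algorithm ensemble
votes = [8, 5, 2, 6, 0]
print(flag_noisy(votes, 8, 100))  # consensus: only instance 0
print(flag_noisy(votes, 8, 60))   # 60% threshold: instances 0, 1, and 3
```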

How to reproduce the article experiments

  • Set the same seed before running the code: set.seed(165421)

  • Use the same datasets (all data should be in .arff format)

Dataset | Source
--- | ---
arcene | UCI Machine Learning Repository or OpenML
breast-c | UCI Machine Learning Repository or OpenML
column2c | UCI Machine Learning Repository
credit | UCI Machine Learning Repository or OpenML
cylinder-bands | OpenML
diabetes | OpenML
eeg-eye-state | UCI Machine Learning Repository or OpenML
glass0 | KEEL-dataset repository or OpenML
glass1 | KEEL-dataset repository
heart-c | OpenML
heart-statlog | OpenML
hill-valley | OpenML
ionosphere | UCI Machine Learning Repository
kr-vs-kp | KEEL-dataset repository
mushroom | UCI Machine Learning Repository or OpenML
pima | KEEL-dataset repository
sonar | OpenML
steel-plates-fault | OpenML
tic-tac-toe | UCI Machine Learning Repository
voting | UCI Machine Learning Repository
  • Given the memory and processing time required, each step (data cleaning, IR generation, etc.) was executed for every dataset and the results were temporarily saved to a file

  • The training and testing step was run multiple times and the measures were averaged across runs

  • Important: the code is a script with no error handling. Make sure the data is well structured, organized, and correct, and that the script is executed in the order described above.

Citation

@Article{Moura2022,
	author={Moura, Kecia G. and Prud{\^e}ncio, Ricardo B.C. and Cavalcanti, George D.C.},
	title={Label noise detection under the noise at random model with ensemble filters},
	journal={Intelligent Data Analysis},
	year={2022},
	publisher={IOS Press},
	volume={26},
	pages={1119-1138},
	keywords={Label noise; noise detection; ensemble methods; noise at random; ensemble noise filtering},
	note={5},
	issn={1571-4128},
	doi={10.3233/IDA-215980},
	url={https://doi.org/10.3233/IDA-215980}
}

License

All data is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
