This repository contains the code used for the article *Label noise detection under the Noise at Random model with ensemble filters*.
```r
# classifiers
install.packages('RSNNS')        # mlp
install.packages('party')        # decision tree
install.packages('class')        # knn
install.packages('e1071')        # svm
install.packages('randomForest') # randomForest
install.packages('RoughSets')    # CN2
install.packages('naivebayes')   # Naive Bayes
install.packages('RWeka')        # J48
install.packages("dismo")        # "kfold" for stratified cross-validation
install.packages('foreign')      # read.arff
install.packages('caTools')      # sample.split
# libraries
library(RSNNS)
library(party)
library(class)
library(e1071)
library(randomForest)
library(RoughSets)
library(naivebayes)
library(RWeka)
library(dismo)
library(foreign)
library(caTools)
```
```r
# sources
setwd("[path to the label-noise repository folder]/label-noise")
source('./LNLib/Classifiers.R')
source('./LNLib/HandleData.R')
source('./LNLib/Classification.R')
source('./LNLib/NoiseInjection.R')
source('./LNLib/NoiseDetection.R')

dat = LNLib.readAndTreatDataset("sample_data.arff")
cleaned.data = LNLib.getCleanedData(dat)
```
```r
# For IR of 50:50, use LNLib.IR.VALUES$IR.5050
# For IR of 30:70, use LNLib.IR.VALUES$IR.3070
# For IR of 20:80, use LNLib.IR.VALUES$IR.2080
data.2080 = LNLib.generateIR(cleaned.data, LNLib.IR.VALUES$IR.2080)

sample = sample.split(data.2080$class, SplitRatio = 0.70)
train = subset(data.2080, sample == TRUE)
test = subset(data.2080, sample == FALSE)

# For 15% of noise (the noise ratio) and the NCAR noise model:
info.injection = LNLib.injectNoise(test, 15, LNLib.NOISE.MODEL$NCAR)
noisy.test = info.injection$noisy.data
```
The label noise model should be one of the following:

- `LNLib.NOISE.MODEL$NCAR`: noise equally distributed per class
- `LNLib.NOISE.MODEL$NAR.MIN`: more noise in the minority class, proportion of 1:9
- `LNLib.NOISE.MODEL$NAR.MAJ`: more noise in the majority class, proportion of 1:9

Example: suppose a dataset with 200 rows into which we want to inject 10% of noise, i.e., 20 noisy rows:

- for `NCAR`, approx. 10 noisy rows will be injected into each class
- for `NAR.MIN`, approx. 18 noisy rows will be injected into the minority class and 2 into the majority one
- for `NAR.MAJ`, approx. 18 noisy rows will be injected into the majority class and 2 into the minority one
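The arithmetic in this example can be checked with a short sketch in plain R (independent of LNLib; the 9:1 split for the NAR models is taken from the proportions described above):

```r
# Expected per-class noisy-row counts for a 200-row dataset with 10% noise,
# under each noise model (assumes the 9:1 NAR proportion described above).
n.rows    <- 200
noise.pct <- 10
n.noisy   <- n.rows * noise.pct / 100                                # 20 rows in total

ncar    <- c(minority = n.noisy / 2,    majority = n.noisy / 2)      # 10 and 10
nar.min <- c(minority = n.noisy * 9/10, majority = n.noisy * 1/10)   # 18 and 2
nar.maj <- c(minority = n.noisy * 1/10, majority = n.noisy * 9/10)   # 2 and 18
```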
```r
set.seed(165421)
# Dataset with 10 rows (5 rows from class 1 and 5 rows from class 2)
dat = as.data.frame(
  list(col1 = c(1,2,3,4,5,6,7,8,9,10),
       col2 = c(10,9,8,7,6,5,4,3,2,1),
       class = c(1,1,1,1,1,2,2,2,2,2)))
# If we inject 10% of noise...
LNLib.injectNoise(dat, 10, LNLib.NOISE.MODEL$NCAR)
```
We get:

```
$changes.made
  changed.rows original.class new.class
1            5              1         2
2           10              2         1

$noisy.data
   col1 col2 class
1     1   10     1
2     2    9     1
3     3    8     1
4     4    7     1
5     5    6     2
6     6    5     2
7     7    4     2
8     8    3     2
9     9    2     2
10   10    1     1

$noise.perc
[1] 10

$noise.ratio
[1] 1
```
```r
# Get classification (of all algorithms) after noise has been injected
classified.test = LNLib.getClassification(train, noisy.test)
# Set ensemble vote threshold = 60% (i.e., an instance is considered noisy
# if 60% of all algorithms misclassify it)
ensemble.threshold = 60.0
measures = LNLib.getNoiseDetectionMeasures(info.injection,
                                           classified.test,
                                           ensemble.threshold)
```
```
$precision
[1] 28

$recall
[1] 87.5

$f_measure
[1] 42.4

$n.correct.detection  # number of correct detections
[1] 7

$n.noise.detected     # number of detections (correct or not)
[1] 25

$n.noisy.labels       # number of noisy labels in the data
[1] 8
```
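These values follow from the three counts at the bottom; a quick check of the standard precision/recall/F-measure formulas in plain R (not LNLib code):

```r
# Recompute the detection measures from the raw counts reported above.
n.correct.detection <- 7    # true positives: noisy labels correctly flagged
n.noise.detected    <- 25   # all flagged instances (correct or not)
n.noisy.labels      <- 8    # actual noisy labels in the data

precision <- 100 * n.correct.detection / n.noise.detected   # 28
recall    <- 100 * n.correct.detection / n.noisy.labels     # 87.5
f.measure <- 2 * precision * recall / (precision + recall)  # ~42.4
```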
It is also possible to use one of the following options as input:

- `LNLib.ENSEMBLE.THRESHOLD$CONSENSUS` (100% of algorithms)
- `LNLib.ENSEMBLE.THRESHOLD$MAJORITY` (50% + 1 algorithms)
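As a sketch of how these options translate into a vote count (a hypothetical helper, not part of LNLib): for an ensemble of `m` algorithms, consensus requires all `m` misclassification votes, while majority requires `floor(m / 2) + 1`.

```r
# Misclassification votes needed to flag an instance as noisy, for an
# ensemble of m algorithms (illustrative helper, not part of LNLib).
votes.needed <- function(m, rule = c("consensus", "majority")) {
  rule <- match.arg(rule)
  if (rule == "consensus") m else floor(m / 2) + 1
}

votes.needed(9, "consensus")  # 9: every algorithm must misclassify
votes.needed(9, "majority")   # 5: more than half must misclassify
```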
- Set the same seed used before running the code: `set.seed(165421)`
- Use the same datasets (all data should be in `.arff` format):
Dataset | Source |
---|---|
arcene | UCI Machine Learning Repository or OpenML |
breast-c | UCI Machine Learning Repository or OpenML |
column2c | UCI Machine Learning Repository |
credit | UCI Machine Learning Repository or OpenML |
cylinder-bands | OpenML |
diabetes | OpenML |
eeg-eye-state | UCI Machine Learning Repository or OpenML |
glass0 | KEEL-dataset repository or OpenML |
glass1 | KEEL-dataset repository |
heart-c | OpenML |
heart-statlog | OpenML |
hill-valley | OpenML |
ionosphere | UCI Machine Learning Repository |
kr-vs-kp | KEEL-dataset repository |
mushroom | UCI Machine Learning Repository or OpenML |
pima | KEEL-dataset repository |
sonar | OpenML |
steel-plates-fault | OpenML |
tic-tac-toe | UCI Machine Learning Repository |
voting | UCI Machine Learning Repository |
- Given the memory and processing time needed, each step (data cleaning, IR generation, etc.) was executed for every dataset, and the results were temporarily saved to a file.
- The training and testing step was run multiple times, and the measures were evaluated by their average values.
- Important: the code is a script and has no error handling. Make sure the data is well structured, organized, and correct, and that the script is executed as described above.
```bibtex
@Article{Moura2022,
  author    = {Moura, Kecia G. and Prud{\^e}ncio, Ricardo B.C. and Cavalcanti, George D.C.},
  title     = {Label noise detection under the noise at random model with ensemble filters},
  journal   = {Intelligent Data Analysis},
  year      = {2022},
  publisher = {IOS Press},
  volume    = {26},
  pages     = {1119-1138},
  keywords  = {Label noise; noise detection; ensemble methods; noise at random; ensemble noise filtering},
  note      = {5},
  issn      = {1571-4128},
  doi       = {10.3233/IDA-215980},
  url       = {https://doi.org/10.3233/IDA-215980}
}
```
All data is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license and is available in full here and summarized here.