This repository contains the code used for the article *Label noise detection under the Noise at Random model with ensemble filters*.
```r
# classifiers
install.packages('RSNNS')        # mlp
install.packages('party')        # decision tree
install.packages('class')        # knn
install.packages('e1071')        # svm
install.packages('randomForest') # randomForest
install.packages('RoughSets')    # CN2
install.packages('naivebayes')   # Naive Bayes
install.packages('RWeka')        # J48
install.packages("dismo")        # "kfold" for stratified cross-validation
install.packages('foreign')      # read.arff
install.packages('caTools')      # sample.split
# libraries
library(RSNNS)
library(party)
library(class)
library(e1071)
library(randomForest)
library(RoughSets)
library(naivebayes)
library(RWeka)
library(dismo)
library(foreign)
library(caTools)
```
```r
# sources
setwd("[path to the label-noise repository folder]/label-noise")
source('./LNLib/Classifiers.R')
source('./LNLib/HandleData.R')
source('./LNLib/Classification.R')
source('./LNLib/NoiseInjection.R')
source('./LNLib/NoiseDetection.R')

dat = LNLib.readAndTreatDataset("sample_data.arff")
cleaned.data = LNLib.getCleanedData(dat)
```
```r
# For IR of 50:50, use LNLib.IR.VALUES$IR.5050
# For IR of 30:70, use LNLib.IR.VALUES$IR.3070
# For IR of 20:80, use LNLib.IR.VALUES$IR.2080
data.2080 = LNLib.generateIR(cleaned.data, LNLib.IR.VALUES$IR.2080)

sample = sample.split(data.2080$class, SplitRatio = 0.70)
train = subset(data.2080, sample == TRUE)
test = subset(data.2080, sample == FALSE)

# For 15% of noise (the noise ratio) and the NCAR noise model:
info.injection = LNLib.injectNoise(test, 15, LNLib.NOISE.MODEL$NCAR)
noisy.test = info.injection$noisy.data
```
The label noise model should be one of the following:

- `LNLib.NOISE.MODEL$NCAR`: noise equally distributed per class
- `LNLib.NOISE.MODEL$NAR.MIN`: more noise in the minority class, proportion of 1:9
- `LNLib.NOISE.MODEL$NAR.MAJ`: more noise in the majority class, proportion of 1:9

Example: suppose a dataset with 200 rows into which we want to inject 10% of noise, i.e., 20 noisy rows:

- for `NCAR`, approx. 10 noisy rows will be injected into each class
- for `NAR.MIN`, approx. 18 noisy rows will be injected into the minority class and 2 into the majority one
- for `NAR.MAJ`, approx. 18 noisy rows will be injected into the majority class and 2 into the minority one
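The arithmetic in this example can be checked with a short sketch in plain R (independent of LNLib; the 9:1 split for the NAR models is taken from the proportions described above):

```r
# Expected per-class noisy-row counts for a 200-row dataset with 10% noise,
# under each noise model (assumes the 9:1 NAR proportion described above).
n.rows    <- 200
noise.pct <- 10
n.noisy   <- n.rows * noise.pct / 100                                # 20 rows in total

ncar    <- c(minority = n.noisy / 2,    majority = n.noisy / 2)      # 10 and 10
nar.min <- c(minority = n.noisy * 9/10, majority = n.noisy * 1/10)   # 18 and 2
nar.maj <- c(minority = n.noisy * 1/10, majority = n.noisy * 9/10)   # 2 and 18
```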
```r
set.seed(165421)
# Dataset with 10 rows (5 rows from class 1 and 5 rows from class 2)
dat = as.data.frame(
  list(col1 = c(1,2,3,4,5,6,7,8,9,10),
       col2 = c(10,9,8,7,6,5,4,3,2,1),
       class = c(1,1,1,1,1,2,2,2,2,2)))
# If we inject 10% of noise...
LNLib.injectNoise(dat, 10, LNLib.NOISE.MODEL$NCAR)
```
We get:

```
$changes.made
  changed.rows original.class new.class
1            5              1         2
2           10              2         1

$noisy.data
   col1 col2 class
1     1   10     1
2     2    9     1
3     3    8     1
4     4    7     1
5     5    6     2
6     6    5     2
7     7    4     2
8     8    3     2
9     9    2     2
10   10    1     1

$noise.perc
[1] 10

$noise.ratio
[1] 1
```
```r
# Get classification (of all algorithms) after noise has been injected
classified.test = LNLib.getClassification(train, noisy.test)
# Set ensemble vote threshold = 60% (i.e., an instance is considered noisy
# if 60% of all algorithms misclassify it)
ensemble.threshold = 60.0
measures = LNLib.getNoiseDetectionMeasures(info.injection,
                                           classified.test,
                                           ensemble.threshold)
```
```
$precision
[1] 28

$recall
[1] 87.5

$f_measure
[1] 42.4

$n.correct.detection  # number of correct detections
[1] 7

$n.noise.detected     # number of detections (correct or not)
[1] 25

$n.noisy.labels       # number of noisy labels in the data
[1] 8
```
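These values follow from the three counts at the bottom; a quick check of the standard precision/recall/F-measure formulas in plain R (not LNLib code):

```r
# Recompute the detection measures from the raw counts reported above.
n.correct.detection <- 7    # true positives: noisy labels correctly flagged
n.noise.detected    <- 25   # all flagged instances (correct or not)
n.noisy.labels      <- 8    # actual noisy labels in the data

precision <- 100 * n.correct.detection / n.noise.detected   # 28
recall    <- 100 * n.correct.detection / n.noisy.labels     # 87.5
f.measure <- 2 * precision * recall / (precision + recall)  # ~42.4
```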
It is also possible to use one of the following options as input:

- `LNLib.ENSEMBLE.THRESHOLD$CONSENSUS` (100% of algorithms)
- `LNLib.ENSEMBLE.THRESHOLD$MAJORITY` (50% + 1 algorithms)
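As a sketch of how these options translate into a vote count (a hypothetical helper, not part of LNLib): for an ensemble of `m` algorithms, consensus requires all `m` misclassification votes, while majority requires `floor(m / 2) + 1`.

```r
# Misclassification votes needed to flag an instance as noisy, for an
# ensemble of m algorithms (illustrative helper, not part of LNLib).
votes.needed <- function(m, rule = c("consensus", "majority")) {
  rule <- match.arg(rule)
  if (rule == "consensus") m else floor(m / 2) + 1
}

votes.needed(9, "consensus")  # 9: every algorithm must misclassify
votes.needed(9, "majority")   # 5: more than half must misclassify
```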
- Set the same seed used before running the code: `set.seed(165421)`
- Use the same datasets (all data should be in `.arff` format):
Dataset | Source |
---|---|
arcene | UCI Machine Learning Repository or OpenML |
breast-c | UCI Machine Learning Repository or OpenML |
column2c | UCI Machine Learning Repository |
credit | UCI Machine Learning Repository or OpenML |
cylinder-bands | OpenML |
diabetes | OpenML |
eeg-eye-state | UCI Machine Learning Repository or OpenML |
glass0 | KEEL-dataset repository or OpenML |
glass1 | KEEL-dataset repository |
heart-c | OpenML |
heart-statlog | OpenML |
hill-valley | OpenML |
ionosphere | UCI Machine Learning Repository |
kr-vs-kp | KEEL-dataset repository |
mushroom | UCI Machine Learning Repository or OpenML |
pima | KEEL-dataset repository |
sonar | OpenML |
steel-plates-fault | OpenML |
tic-tac-toe | UCI Machine Learning Repository |
voting | UCI Machine Learning Repository |
- Given the memory and processing time needed, each step (data cleaning, IR generation, etc.) was executed for every dataset, and the results were temporarily saved to a file.
- The training and testing step was run multiple times, and the measures were evaluated by their average values.
- Important: the code is a script and has no error handling. Make sure the data is well structured, organized, and correct, and that the script is executed as described above.
```bibtex
@Article{Moura2022,
  author    = {Moura, Kecia G. and Prud{\^e}ncio, Ricardo B.C. and Cavalcanti, George D.C.},
  title     = {Label noise detection under the noise at random model with ensemble filters},
  journal   = {Intelligent Data Analysis},
  year      = {2022},
  publisher = {IOS Press},
  volume    = {26},
  pages     = {1119-1138},
  keywords  = {Label noise; noise detection; ensemble methods; noise at random; ensemble noise filtering},
  note      = {5},
  issn      = {1571-4128},
  doi       = {10.3233/IDA-215980},
  url       = {https://doi.org/10.3233/IDA-215980}
}
```
All data is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license and is available in full here and summarized here.