added stopwords cleaner with tests #228

Open
wants to merge 2 commits into
base: master
5 changes: 4 additions & 1 deletion .gitignore
@@ -10,4 +10,7 @@ target/
*.iml
*.ipr
*.iws
/.idea/
/.idea/

.DS_Store

22 changes: 22 additions & 0 deletions duke-core/.idea/compiler.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions duke-core/.idea/copyright/profiles_settings.xml

42 changes: 42 additions & 0 deletions duke-core/.idea/misc.xml

9 changes: 9 additions & 0 deletions duke-core/.idea/modules.xml

562 changes: 562 additions & 0 deletions duke-core/.idea/workspace.xml

Large diffs are not rendered by default.

@@ -0,0 +1,63 @@

package no.priv.garshol.duke.cleaners;

import no.priv.garshol.duke.Cleaner;

import java.io.*;
import java.util.List;
import java.util.ArrayList;
import java.util.HashSet;

/**
* A cleaner which removes English stopwords from a string.
*/

public class StopwordsCleaner implements Cleaner {
private LowerCaseNormalizeCleaner sub;
HashSet<String> stopwords = new HashSet<String>();
private ArrayList<String> wordsList = new ArrayList<String>();


public StopwordsCleaner() {
this.sub = new LowerCaseNormalizeCleaner();

try {
this.stopwords = loadStopwords();
} catch (IOException e) {
throw new RuntimeException(e);
Review comment (Owner):

Please throw DukeException instead.

}
}


public String clean(String value) {

value = sub.clean(value);
if (value == null || value.equals(""))
return value;


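// Tokenize the normalized value on spaces.
String[] words = value.split(" ");
// Collect into a local list so results do not accumulate across calls.
List<String> kept = new ArrayList<String>();
for (String word : words) {
if (!stopwords.contains(word))
kept.add(word);
}

return String.join(" ", kept);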

}

private HashSet<String> loadStopwords() throws IOException {
String mapfile = "no/priv/garshol/duke/english-stopwords.txt";

BufferedReader in = new BufferedReader(new FileReader(mapfile));
String str;

HashSet<String> stopwords = new HashSet<String>();
while((str = in.readLine()) != null){
stopwords.add(str);
}

in.close();
return stopwords;
}

}
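
The cleaner above can be read as three steps: load the stopword set once, normalize the input, then keep only the tokens that are not in the set. The following is a minimal stand-alone sketch of that flow, not the Duke implementation. The class name `StopwordsFilterSketch` is hypothetical, it does not implement Duke's `Cleaner` interface, it lowercases with plain JDK calls instead of `LowerCaseNormalizeCleaner`, and it wraps the I/O failure in the JDK's `UncheckedIOException` where the review asks for `DukeException`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-alone sketch; not part of Duke. */
class StopwordsFilterSketch {
    private final Set<String> stopwords;

    /** Reads one stopword per line from the given source. */
    StopwordsFilterSketch(Reader source) {
        Set<String> loaded = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    loaded.add(word);
                }
            }
        } catch (IOException e) {
            // Assumption: inside Duke this would be DukeException, as the review requests.
            // UncheckedIOException is a JDK stand-in so the sketch compiles on its own.
            throw new UncheckedIOException(e);
        }
        this.stopwords = loaded;
    }

    /** Lowercases the value and drops every token found in the stopword set. */
    String clean(String value) {
        if (value == null || value.isEmpty()) {
            return value;
        }
        // A local list, so nothing carries over between calls.
        List<String> kept = new ArrayList<>();
        for (String word : value.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty() && !stopwords.contains(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }
}
```

Taking a `Reader` keeps the sketch independent of where the word list lives, so it can be exercised with a `StringReader` in a test. Keeping the surviving tokens in a local list rather than a field means repeated calls cannot leak words from one input into the next.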

Large diffs are not rendered by default.
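
The stopword list itself is not rendered here. `loadStopwords()` opens it with `FileReader`, which resolves the path against the working directory. If the file is meant to ship inside the jar, one alternative is to read it as a classpath resource. The sketch below is an assumption-laden illustration: `StopwordResourceSketch` and `load` are hypothetical names, and UTF-8 is assumed for the file encoding.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper; the class and method names are not part of Duke. */
class StopwordResourceSketch {
    /** Loads one word per line from a classpath resource such as "a/b/words.txt". */
    static Set<String> load(String resourcePath) throws IOException {
        ClassLoader loader = StopwordResourceSketch.class.getClassLoader();
        InputStream raw = loader.getResourceAsStream(resourcePath);
        if (raw == null) {
            // getResourceAsStream signals a missing resource with null, not an exception.
            throw new FileNotFoundException("not on classpath: " + resourcePath);
        }
        Set<String> words = new HashSet<>();
        // Assumption: the word list is UTF-8 encoded.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(raw, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

The try-with-resources block closes the stream even when reading fails, which the explicit `in.close()` in the diff does not guarantee.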

@@ -0,0 +1,32 @@

package no.priv.garshol.duke.cleaners;

import org.junit.Before;
import org.junit.Test;

import static junit.framework.Assert.assertEquals;
import static junit.framework.Assert.assertTrue;

public class StopwordsCleanerTest extends LowerCaseNormalizeCleanerTest {

@Before
public void setUp() {
cleaner = new StopwordsCleaner();
}

@Test
public void testMapping() {
assertEquals("Hello my name is duke", cleaner.clean("hello name duke"));
Review comment (Owner):

This test looks wrong. Surely the two string literals should be swapped?

Please also add two more tests: one with null and one with "". We need to know the cleaner can handle these two cases.

}

@Test
public void testEmpty() {
assertEquals("", cleaner.clean(""));
}

@Test
public void testNull() {
assertTrue(cleaner.clean(null) == null);
}


}
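
The review points above come down to two habits: pass the expected value first and the cleaner output second, and compare strings by content rather than with `==`. The following plain-Java sketch illustrates both without depending on JUnit. Everything in it is hypothetical: `StopwordsChecksSketch` is not part of the PR, its `assertEquals` only mimics the argument order of JUnit's method, and its `clean` is a hard-coded stand-in that lowercases and drops "my" and "is".

```java
import java.util.Locale;
import java.util.Objects;

/** Hypothetical plain-Java sketch of the checks the review asks for; not JUnit. */
class StopwordsChecksSketch {
    /** Same argument order as JUnit's assertEquals: expected first, actual second. */
    static void assertEquals(Object expected, Object actual) {
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError("expected <" + expected + "> but was <" + actual + ">");
        }
    }

    /** Stand-in for the cleaner under test: lowercases, then drops "my" and "is". */
    static String clean(String value) {
        if (value == null || value.isEmpty()) {
            return value;
        }
        StringBuilder out = new StringBuilder();
        for (String word : value.toLowerCase(Locale.ROOT).split(" ")) {
            if (word.isEmpty() || word.equals("my") || word.equals("is")) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }

    static void runAll() {
        // Expected output first, raw input handed to the cleaner second.
        assertEquals("hello name duke", clean("Hello my name is duke"));
        // Content comparison; == on strings would only compare references.
        assertEquals("", clean(""));
        assertEquals(null, clean(null));
    }
}
```

Calling `StopwordsChecksSketch.runAll()` passes with the literals in this order. With the two literals in `testMapping` as submitted, the same check throws, because the cleaner is handed text that has already been cleaned.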