maoristemmer

Snowball stemmer for te kupu reo Māori, and derivative class for Lucene/Solr SnowballPorterFilterFactory

Introduction

This stemmer is written in Snowball (http://snowballstem.org), so it is portable across many platforms and uses. The original intention was to autogenerate SnowballPorter stemmer classes for Apache Lucene and include them in analyzer libraries, so that DSpace and other web apps that use Apache Solr for search can specify a Māori stemmer for field types, in addition to English and other languages.

Quick start (Lucene libraries)

To test this stemmer in Apache Solr 4.10.4 (e.g. DSpace 6):

Back up your existing Lucene Common Analyzers jar (probably something like lucene-analyzers-common-4.10.4.jar)
Copy lucene-analyzers-common-4.10.4-SNAPSHOT.jar into your library / classpath
In your Solr schema, create a new fieldType, or modify an existing fieldType to include a SnowballPorterFilterFactory in both the index and query analyzers, with the language parameter set to "Maori" (no macron): <filter class="solr.SnowballPorterFilterFactory" language="Maori" protected="protwords.txt"/>
Restart Tomcat / Solr and rebuild your Solr index
You should now be able to see stemming occurring if you test out queries / indexes in the dashboard, or POST new documents which use fields of the type you added the filter to.

As an example, the passive verb ending '-tia' is stemmed down to the active verb form, so a search for waihangatia returns results matching waihanga and vice versa.
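The '-tia' behaviour described above can be illustrated with a minimal Python sketch. This is not the Snowball source of this stemmer, only a toy version of that single rule: the function name `strip_passive_tia` and the `min_stem_len` guard are hypothetical, added here so that very short words are left alone.

```python
def strip_passive_tia(word: str, min_stem_len: int = 3) -> str:
    """Strip a trailing '-tia' passive ending if enough of the word remains.

    Illustrative only: the real stemmer is written in Snowball and may
    apply different conditions. min_stem_len is an assumed safeguard.
    """
    suffix = "tia"
    if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
        return word[: -len(suffix)]
    return word


# The passive and active forms reduce to the same term,
# which is why a search for one can match the other.
for term in ("waihangatia", "waihanga"):
    print(term, "->", strip_passive_tia(term))
```

Because both forms map to the same stem at index time and at query time, the filter needs to be present in both analyzers, as noted in the quick start steps.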

TODO

Identify other stemming rules
Collaborate with te reo experts to review and improve general stemming rules
Contribute Snowball code back to the main Snowballstems algorithm package
Generate (and fix if necessary) a new Java class which works with Lucene master / the branch to be used for the next release
Contribute autogenerated java class to Apache Lucene
Identify lists of stopwords for Māori search
Identify lists of protected words (which will not be manipulated by the stemmer)
