Snowball stemmer for te kupu reo Māori, and derivative class for Lucene/Solr SnowballPorterFilterFactory
This stemmer is written in Snowball (http://snowballstem.org), so it is portable across many platforms and uses. The original intention was to autogenerate SnowballPorter stemmer classes for Apache Lucene and include them in the analyzer libraries, so that DSpace and other web applications that use Apache Solr for search can specify a Māori stemmer for field types, alongside English and other languages.
To test this stemmer in Apache Solr 4.10.4 (e.g. DSpace 6):
- Back up your existing Lucene Common Analyzers jar (probably something like lucene-analyzers-common-4.10.4.jar)
- Copy lucene-analyzers-common-4.10.4-SNAPSHOT.jar into your library / classpath
- In your Solr schema, create a new fieldType, or modify an existing fieldType, to include a SnowballPorterFilterFactory in both the index and query analyzers, with the language parameter set to "Maori" (no macron):
  <filter class="solr.SnowballPorterFilterFactory" language="Maori" protected="protwords.txt"/>
- Restart Tomcat / Solr and rebuild your Solr index
- You should now be able to see stemming occurring if you test out queries / indexes in the dashboard, or POST new documents which use fields of the type you added the filter to.
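As a rough illustration of the schema step above, a complete field type might look like the sketch below. The name text_mi and the choice of tokenizer and lowercase filter are assumptions made for this example and are not taken from this project; only the SnowballPorterFilterFactory line comes from the steps above. Adapt it to whatever analysis chain your schema already uses.

```xml
<!-- Hypothetical field type; "text_mi" is an illustrative name only -->
<fieldType name="text_mi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Same filter as in the steps above; note "Maori" has no macron -->
    <filter class="solr.SnowballPorterFilterFactory" language="Maori" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Maori" protected="protwords.txt"/>
  </analyzer>
</fieldType>
```

Keeping the same filter in both analyzers matters: if only the index side is stemmed, a query for an unstemmed form will not match the stemmed terms in the index.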
For example, the passive verb ending '-tia' is stemmed down to the active verb form, so a search for waihangatia returns results matching waihanga, and vice versa.
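The effect of the '-tia' rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the Snowball source or the generated Java class: the function names and the minimum stem length guard are hypothetical, and the real stemmer may apply further conditions.

```python
# Illustrative sketch only; the actual rules live in the Snowball source.
PASSIVE_SUFFIX = "tia"  # the passive ending discussed above
MIN_STEM_LENGTH = 3     # hypothetical guard, not taken from the real stemmer


def stem(word: str) -> str:
    """Strip a trailing '-tia' so passive and active forms share one stem."""
    token = word.lower()
    remaining = len(token) - len(PASSIVE_SUFFIX)
    if token.endswith(PASSIVE_SUFFIX) and remaining >= MIN_STEM_LENGTH:
        return token[:remaining]
    return token


def matches(query: str, indexed: str) -> bool:
    """Two terms match when both reduce to the same stem."""
    return stem(query) == stem(indexed)


if __name__ == "__main__":
    print(stem("waihangatia"))
    print(matches("waihanga", "waihangatia"))
```

Because the same function is applied to the query term and the indexed term, the match works in both directions, which mirrors why the filter is added to both the index and query analyzers.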
To do:
- Identify other stemming rules
- Collaborate with te reo experts to review and improve general stemming rules
- Contribute Snowball code back to the main Snowball stemming algorithms package
- Generate (and fix if necessary) a new Java class that works with Lucene master / branch, to be used for the next release
- Contribute the autogenerated Java class to Apache Lucene
- Identify lists of stopwords for Māori search
- Identify lists of protected words (which will not be manipulated by the stemmer)