Author: Karthik Narasimhan ([email protected])
Unsupervised Discovery of Morphological Chains (TACL 2015)
- A model for unsupervised morphological analysis that integrates orthographic and semantic views of words.
- Model consistently outperforms three state-of-the-art baselines on the task of morphological segmentation on Arabic, English and Turkish.
You can clone the repository and use the production2 branch (default) for the latest code.
- This project uses the LBFGS-B algorithm for optimization (the jar files for the library are included in lib/). However, we recommend that you download and install the lbfgsb_wrapper for Java yourself, since installation on Mac OS X may require additional steps. At the end of the install, move the files lbfgsb_wrapper-.jar and liblbfgsb_wrapper.so (or liblbfgsb_wrapper.dylib on OS X) into the lib/ directory.
- External library: commons-lang3-3.3.2.jar (included in lib/)
- Install the JUnit framework by following the instructions at http://junit.org/ or by using Maven.
- Replace the path for jdk.home.1.7 in the build.properties file with your local install.
- (optional) Change path.variable.maven_repository in build.properties to your local Maven repository if you wish to use your Maven installs.
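The two build.properties edits above can also be scripted. The following is a minimal sketch, assuming the file stores each setting as a single `key=value` line; the sample file contents and both local paths are made up for illustration, and the sketch works on a throwaway copy rather than on the repository's own file.

```shell
# Work on a throwaway copy so the sketch is safe to run anywhere.
WORK_DIR=$(mktemp -d)
PROPS="$WORK_DIR/build.properties"

# Made-up sample contents; the real file ships with the repository.
printf '%s\n' \
  'path.variable.maven_repository=/old/.m2/repository' \
  'jdk.home.1.7=/old/jdk' > "$PROPS"

# Hypothetical local paths; replace them with your own.
LOCAL_JDK="/usr/lib/jvm/java-7-openjdk"
LOCAL_M2="/home/me/.m2/repository"

# Rewrite both settings in place, keeping a .bak backup of the original.
sed -i.bak \
  -e "s|^jdk\.home\.1\.7=.*|jdk.home.1.7=${LOCAL_JDK}|" \
  -e "s|^path\.variable\.maven_repository=.*|path.variable.maven_repository=${LOCAL_M2}|" \
  "$PROPS"

# Show the result for a quick visual check.
cat "$PROPS"
```

To apply the same edit to the real file, point PROPS at the build.properties in the project directory and skip the step that writes the sample contents.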
Use 'ant all' to compile from the terminal (requires Ant version > 1.6). You can also import the entire directory into IntelliJ or Eclipse and compile using the GUI.
Here is an example of how to run the code from the home directory of the project. The output will contain the predicted segmentations for all the words in the test file. If you do not have gold segmentations to test against, you can input a file in which each word is listed as its own segmentation (i.e. the word itself in place of its segmentation on each line of the file; see FORMATS.txt for details).
PARAMS_FILE=params.properties;
OUT_FILE=output.txt;
java -ea -Djava.library.path=lib/ -classpath "./lib/*:./out/production/Morphology" Main $PARAMS_FILE >$OUT_FILE
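If no gold segmentations are available, the placeholder test file described above can be generated from a plain word list. This is a minimal sketch, assuming each line of the test file has the form `word:segmentation`, as the colon in the description suggests; confirm the exact layout against FORMATS.txt before relying on it. The file names and sample words are hypothetical.

```shell
# Hypothetical file names in a temporary directory; adjust to your setup.
WORK_DIR=$(mktemp -d)
WORDS_FILE="$WORK_DIR/words.txt"
TEST_FILE="$WORK_DIR/test_placeholder.txt"

# Small sample word list (one word per line) so the sketch runs on its own.
printf '%s\n' walking walked unhappy > "$WORDS_FILE"

# For every non-empty line, emit the word as its own segmentation.
# ASSUMPTION: the separator is ":"; check FORMATS.txt.
awk 'NF { print $1 ":" $1 }' "$WORDS_FILE" > "$TEST_FILE"

# Show the generated file.
cat "$TEST_FILE"
```

The resulting file can then be used as the test file; whichever entry in params.properties names the test file would need to point at it.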
Most parameters in the model can be changed in the file params.properties.
A good tool for producing your own vectors from a raw corpus is word2vec. You can also use any pre-existing vectors as long as they follow the format specified in FORMATS.txt.
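As a rough sketch of the word2vec step, the commands below train text-format vectors with the original word2vec command-line tool. The corpus and output file names, the hyperparameter values, and the WORD2VEC_BIN variable are assumptions chosen for illustration, not settings recommended by this project. The text output of word2vec may also need adjusting (for example, its first line holds the vocabulary size and dimension) to match FORMATS.txt.

```shell
# Hypothetical names; override WORD2VEC_BIN if the binary is not on your PATH.
WORD2VEC_BIN="${WORD2VEC_BIN:-word2vec}"
CORPUS="corpus.txt"     # raw text corpus (assumed name)
VECTORS="vectors.txt"   # output vectors (assumed name)

# Train CBOW vectors and write them as plain text (-binary 0).
# The hyperparameter values are illustrative only.
train_vectors() {
  "$WORD2VEC_BIN" -train "$1" -output "$2" \
    -size 200 -window 5 -cbow 1 -min-count 5 -binary 0
}

# Run only when both the tool and the corpus are actually present.
if command -v "$WORD2VEC_BIN" >/dev/null 2>&1 && [ -f "$CORPUS" ]; then
  train_vectors "$CORPUS" "$VECTORS"
else
  echo "word2vec binary or $CORPUS not found; skipping training" >&2
fi
```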
Please use the issue tracker or email me if you have any questions/suggestions.