Chris Jordan, Jane E. Mason, John Healy, Vlado Keselj and Carolyn Watters (2008)
Swordfish2: Using Kernel Density Estimation to Smooth N-gram Histograms for Morphological Analysis

Swordfish2 is an extension of the original Swordfish algorithm, an unsupervised approach to morphological analysis using character N-gram probabilistic models. In the original algorithm, a generative character N-gram model is created from a word list composed of word-frequency pairs. Morphological word splits are discovered by comparing a given term's probability to the probabilities of its corresponding N-grams. Swordfish2 builds multiple generative N-gram models instead of a single one. Each model represents a different location in a word, thus incorporating the word location of N-grams into the morphological analysis. Histograms for each N-gram are constructed from these generative models and smoothed using kernel density estimation. The original Swordfish algorithm provided evidence that morphemes are highly probable N-grams. Swordfish2 attempted to extend the original approach to include word location, but was unable to significantly improve the morphological analysis on the 2005 Morpho Challenge and CELEX data sets. While these results do provide evidence supporting our hypothesis that the most probable N-grams are morphemes, designing a probabilistic model that proves this hypothesis remains an open question.
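
The idea of position-specific N-gram histograms smoothed by kernel density estimation can be illustrated with a minimal sketch. It is not the Swordfish2 implementation: the abstract does not specify the kernel, the bandwidth, or how word locations are indexed, so the Gaussian kernel, the bandwidth of 1.0, the use of absolute start positions, and the names `positional_ngram_counts` and `smooth_histogram` are all assumptions made for illustration, as is the toy word list.

```python
import math
from collections import defaultdict


def positional_ngram_counts(word_freqs, n):
    """Map each character n-gram to {start position: frequency-weighted count}.

    word_freqs is a list of (word, frequency) pairs, as in the word list
    described above. Using absolute start positions is an assumption.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for word, freq in word_freqs:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]][i] += freq
    return counts


def smooth_histogram(hist, num_positions, bandwidth=1.0):
    """Smooth a position histogram with a Gaussian kernel (assumed kernel).

    Returns a list of length num_positions, renormalised to sum to 1.
    """
    smoothed = []
    for x in range(num_positions):
        total = 0.0
        for pos, weight in hist.items():
            z = (x - pos) / bandwidth
            total += weight * math.exp(-0.5 * z * z)
        smoothed.append(total)
    norm = sum(smoothed)
    return [v / norm for v in smoothed] if norm > 0 else smoothed


# Toy word list (hypothetical data, for illustration only).
word_freqs = [("walking", 10), ("talking", 5), ("walked", 8), ("king", 3)]
counts = positional_ngram_counts(word_freqs, 3)

# Raw histogram of start positions for the trigram "ing".
ing_hist = dict(counts["ing"])

# Smoothed histogram over start positions 0..5.
smoothed = smooth_histogram(ing_hist, num_positions=6)
for position, density in enumerate(smoothed):
    print(position, round(density, 3))
```

In this toy example the raw histogram for "ing" has mass only at the positions where the trigram was observed, while the smoothed version spreads some of that mass to neighbouring positions, which is the effect smoothing is meant to have on sparse position counts.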