



Chris Jordan, Jane E. Mason, John Healy, Vlado Keselj and Carolyn Watters (2008)
Swordfish2: Using Kernel Density Estimation to Smooth Ngram Histograms for Morphological Analysis
Swordfish2 is an extension of the original Swordfish algorithm, an unsupervised approach to morphological analysis using probabilistic character Ngram models. In the original algorithm, a generative character Ngram model is created from a word list composed of word-frequency pairs. Morphological word splits are discovered by comparing a given term's probability to the probabilities of its corresponding Ngrams. Swordfish2 builds multiple generative Ngram models instead of a single one. Each model represents a different location in a word, thus incorporating the word location of Ngrams into the morphological analysis. Histograms for each Ngram are constructed from these generative models and smoothed using kernel density estimation. The original Swordfish algorithm provided evidence that morphemes are highly probable Ngrams. Swordfish2 attempted to extend the original Swordfish approach to include word location, but was unable to significantly improve the morphological analysis on the 2005 Morpho Challenge and CELEX data sets. While these results do provide evidence that supports our hypothesis that the most probable Ngrams are morphemes, the design of the probabilistic model that proves it remains an open question.
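The idea of location-specific Ngram histograms smoothed by a kernel can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the number of position bins, the relative-position binning scheme, the Gaussian kernel, and the bandwidth are all assumptions made for illustration, and the smoothing is a discretized approximation of kernel density estimation applied over bin indices.

```python
import math
from collections import defaultdict


def position_histograms(word_freqs, n, bins=5):
    """Count each character n-gram by its relative start position in the word.

    word_freqs maps word -> frequency (the word-frequency pairs of a word list).
    The binning scheme (hypothetical) maps start index 0 to the first bin and
    the last valid start index to the last bin.
    """
    hists = defaultdict(lambda: [0.0] * bins)
    for word, freq in word_freqs.items():
        last = len(word) - n  # last valid start index; negative if word is too short
        for i in range(last + 1):
            rel = i / last if last > 0 else 0.0
            b = min(int(rel * bins), bins - 1)
            hists[word[i:i + n]][b] += freq
    return dict(hists)


def kde_smooth(hist, bandwidth=1.0):
    """Smooth a histogram with a Gaussian kernel over bin indices.

    The result is rescaled so that the total count is preserved.
    """
    smoothed = [
        sum(c * math.exp(-0.5 * ((x - j) / bandwidth) ** 2) for j, c in enumerate(hist))
        for x in range(len(hist))
    ]
    total, s = sum(hist), sum(smoothed)
    return [v * total / s for v in smoothed] if s > 0 else smoothed


# Toy word list (made up for illustration, not from the paper's data sets).
word_freqs = {"walking": 3, "talking": 2, "king": 1}
hists = position_histograms(word_freqs, n=3)
smoothed_ing = kde_smooth(hists["ing"])
```

In this toy word list, all occurrences of "ing" fall in the final position bin, and smoothing spreads some of that mass to neighbouring bins while keeping the peak at the end of the word.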







