Abstract

Unknown-word categorization remains an important topic in computational linguistics, because languages constantly coin new words to accommodate novel concepts. Static word lists and lexicons cannot cover all results of this process. Some earlier work has addressed this gap for German: a lexicon-less finite-state morphology was built that offers hypotheses about the possible morphological features of a given word form. This thesis evaluates two approaches for selecting the correct analysis of a word form from such a set of hypotheses.

The first approach explores whether methods and theoretical considerations derived from the minimum description length (MDL) principle can be used to find a mapping between corpus word forms and hypotheses that allows maximal compression of the corpus. By Ockham’s razor, this minimal mapping is expected to contain mostly correct choices. Although this effect is clearly visible in the results, it does not appear strong enough to make the approach practically applicable.
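The compression idea can be illustrated with a minimal sketch. It is not the procedure used in the thesis: the two-part cost function, the `BITS_PER_CHAR` constant, the function names, and the toy hypotheses are all assumptions made for illustration. Each hypothesis is reduced to a (stem, suffix) split, and the mapping with the shortest total description length is found by exhaustive search, which is only feasible for tiny examples.

```python
import itertools
import math
from collections import Counter

BITS_PER_CHAR = 5  # assumption: roughly log2(32) bits to spell out one letter


def description_length(analyses):
    """Two-part code length in bits for a sequence of (stem, suffix) analyses."""
    stems = Counter(stem for stem, _ in analyses)
    suffixes = Counter(suffix for _, suffix in analyses)
    # Model cost: spell out every distinct stem and suffix once (+1 for an end marker).
    model = sum((len(m) + 1) * BITS_PER_CHAR for m in list(stems) + list(suffixes))
    # Data cost: encode each word form as a stem pointer plus a suffix pointer.
    n = len(analyses)
    data = sum(-math.log2(stems[s] / n) - math.log2(suffixes[x] / n)
               for s, x in analyses)
    return model + data


def best_mapping(hypotheses):
    """Pick one hypothesis per word form so that the total code length is minimal."""
    words = list(hypotheses)
    best = min(itertools.product(*(hypotheses[w] for w in words)),
               key=description_length)
    return dict(zip(words, best))


# Hypothetical toy data: two competing (stem, suffix) splits per word form.
toy_hypotheses = {
    "lachst": [("lach", "st"), ("lachs", "t")],
    "lachen": [("lach", "en"), ("lache", "n")],
    "machst": [("mach", "st"), ("machs", "t")],
    "machen": [("mach", "en"), ("mache", "n")],
}
toy_mapping = best_mapping(toy_hypotheses)
```

In this toy setting the shortest description reuses the stems "lach" and "mach" with the suffixes "st" and "en", because shared morphemes only have to be spelled out once in the model part.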

The second approach uses the “number-of-hits feature”, which modern web search engines return as metadata in response to any query, as a confidence measure for each hypothesis about a word form. Even a basic heuristic on this metadata for picking the most likely hypothesis already achieves a satisfactory level of accuracy, which suggests potential for further improvement.
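A basic heuristic of this kind might look like the following sketch. It assumes that one hit count per hypothesis has already been obtained. How the queries are built and sent is not specified here, so the function name `rank_hypotheses`, the hypothesis labels, and the counts are hypothetical placeholders rather than thesis data.

```python
def rank_hypotheses(hit_counts):
    """Rank hypotheses by hit count and attach each one's share of all hits.

    hit_counts maps a hypothesis label to the number of hits reported for
    its query (hypothetical input; no search engine is contacted here).
    """
    total = sum(hit_counts.values())
    if total == 0:
        return []  # no evidence for any hypothesis
    ranked = sorted(hit_counts.items(), key=lambda item: item[1], reverse=True)
    return [(hypothesis, count / total) for hypothesis, count in ranked]


# Made-up counts for three competing analyses of one word form.
example_counts = {"noun": 1200, "verb": 300, "adjective": 0}
example_ranking = rank_hypotheses(example_counts)
top_hypothesis, top_confidence = example_ranking[0]
```

Normalizing by the total gives a rough confidence in the range 0 to 1, so a caller could reject a decision when the top share is too close to the runner-up.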

The thesis concludes that, while the first approach rests on a compelling theoretical background, further investigation should concentrate on the more promising, data-mining-oriented second approach.

Note: Get the full text here.

Contact

thkruege ät uos döt de

Note: This website primarily contains resources for the BSc thesis in Computational Linguistics that I wrote in late 2009 to conclude my studies of Cognitive Science at the University of Osnabrueck. My work produced a certain amount of code and data, which I share on these pages.
