Note: This README documents the implementation (downloadable here) of the work from my thesis.

Description

This README describes the functionality and requirements of the core implementation of the approaches presented in my BSc thesis.

For the most up-to-date version of this document, refer to http://www-lehre.inf.uos.de/~thkruege/readme.html.

General Requirements

In order to run these scripts, you should have the following installed:

Requirements that are already included in the distribution archive for convenience are:

web.lua

This script is called with a search engine name (one of google, yahoo or bing) as the first argument and any German noun form to search for as the second.

Call it with, e.g.:

$ lua web.lua google "Auto"

Note that if you wish to use this script with search engines other than google, you need to modify it to include valid App IDs (obtainable by registering the application with the respective search engine). Do not hesitate to contact me if you wish to try this with the App IDs I originally registered in order to collect the data for my thesis.

As described in detail in my thesis, the script will then hypothesize about possible inflections of this word, attach matching articles to each hypothesis and use the resulting phrases to query for the number of hits.
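To give a rough idea of what this hypothesis step amounts to, here is a toy sketch in shell. It is not taken from web.lua: the function name hypothesize, the suffix list and the article table are all assumptions made for illustration, and the actual rules are the ones given in the thesis.

```shell
# Toy illustration only. The suffixes and the article table below are
# assumptions, not the rules implemented in web.lua.
hypothesize() {
    word=$1
    for suffix in "" s es e en er n; do
        case $suffix in
            "")   articles="der die das" ;;  # base form: nominative singular
            s|es) articles="des" ;;          # candidate genitive singular
            *)    articles="die" ;;          # candidate plural forms
        esac
        for article in $articles; do
            # One candidate query phrase per line, e.g. "des Autos"
            printf '%s %s%s\n' "$article" "$word" "$suffix"
        done
    done
}

hypothesize "Auto"
```

For "Auto" this prints nine candidate phrases, among them "das Auto" and "des Autos". Each such phrase would then be sent to the search engine, and the hit counts compared.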

Note that the script relies on caching the results in a special file inside a directory called data. Make sure this directory exists and is writeable, otherwise the script will fail ungracefully. My apologies for neglecting to perform proper error checking.
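Since the script itself does not check for the cache directory, a small pre-flight check in the shell can save some confusion. The following is a minimal sketch, assuming that data is resolved relative to the directory you call the script from; the helper name ensure_cache_dir is made up and is not part of the distribution.

```shell
# Hypothetical helper, not part of the distribution.
# Creates the cache directory if it is missing and verifies it is writable.
ensure_cache_dir() {
    dir=${1:-data}                 # defaults to "data", the name used above
    mkdir -p "$dir" || return 1    # no-op if the directory already exists
    if [ ! -w "$dir" ]; then
        echo "error: cache directory '$dir' is not writable" >&2
        return 1
    fi
}
```

With this function defined, the script is only started once the directory is usable:

$ ensure_cache_dir data && lua web.lua google "Auto"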

The output will be written to stdout in CSV format and contains the results of the individual queries along with the queries themselves. If called in the stand-alone manner described above, some columns of the output will show N/A. This is to be expected, because without the gold standard, the script can't have any preconceptions about the correctness of any one result. The rest of the output should be largely self-explanatory, but feel free to bug me with any questions you might have.

Note that if you want diacritics to be processed properly (which is likely), you have to set your terminal to latin-1. To start an xterm session in latin-1 do this:

$ LANG=en_GB.ISO-8859-1 XTERM_LOCALE=en_GB.ISO-8859-1 xterm

batchweb.lua

This script uses web.lua, described above, to perform web searches in batch mode. The reference to the TIGER corpus as the data source is hard-coded; edit the script if you want to change this. The script relies on the presence of the gold standard to determine which hypotheses are correct and which are false. I include it in the downloadable archive for convenience.

Otherwise, what is true for web.lua above is also true for batchweb.lua.

mdl.lua

Somewhat misnamed, this script contains the bulk of the implementation of the first approach detailed in the thesis. Its behavior is predominantly controlled through the file lib/config.lua.

Acknowledgements

Thanks to Peter Adolphs for providing me with the code for the implementation of his own diploma thesis (paper here). Without it, this work would not have been possible.

Contact

thkruege ät uos döt de.

Note: This website primarily contains resources concerning my BSc thesis that I wrote in late 2009 in the area of Computational Linguistics to conclude my study of Cognitive Science at the University of Osnabrueck. My work produced a certain amount of code and data, which I share on these pages.
