Managing word form variation of text retrieval in practice – Why language technology is not the only cure for better IR performance?

Dr. Kimmo Kettunen


Purpose: The article discusses, on a general methodological level, the methods that have been used to manage word form variation of single keywords over the history of textual information retrieval. It offers the reader an overall practical guide for choosing among these methods for different types of European languages. The methods compared include stemming, lemmatization, truncation, syllabification, unsupervised morphological methods, character n-gramming and generation of inflected word forms.
Methodology/Approach: Based on empirical findings and results achieved by other researchers, the paper discusses several pros and cons of different keyword variation management methods in a broader context than is usual in IR, where normally only the achieved effectiveness results are considered. The study proposes a list of five criteria for comparing conflation methods in general and offers a heuristic for choosing a suitable conflation method for a specific language.
Findings: Simpler character-based methods may be preferable in IR to very sophisticated linguistic methods. It is also suggested that for morphologically simple languages, such as English, any kind of keyword variation management may be futile, as the resulting increase in IR effectiveness may be very low. Morphologically more complex languages can be conflated quite effectively with the simple methods for present IR search engines.
Keywords: Information retrieval; Management of word form variation; Comparison of word form variation management methods; IR performance; Effectiveness; Language technology
Paper Type: Meta-analysis


Creative Commons License The TRIM is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License