Morphological Analysis from the Raw Kashmiri Corpus Using Open Source Extract Tool

Manzoor Ahmad Chachoo, S. M. K. Quadri

Abstract


Purpose: Morphological information is a key component in the design of any machine translation engine, information retrieval system or other natural language processing application. Because manual lexicon development is highly time consuming, it is important to investigate how it can be automated while maintaining the quality that makes the lexicon useful to applications. The paper describes how extraction rules, supplied along with raw text, can guide the automated extraction of morphological information with the help of an extraction tool such as Extract v2.0.

Design/methodology/approach: We used Extract v2.0, an open source tool for extracting linguistic information from raw text, in particular inflectional information on words based on the word forms appearing in the text. The input to Extract is a file containing an un-annotated Kashmiri corpus and a file containing the Extract rules for the language. The tool's output is a list of analyses; each analysis consists of a sequence of words annotated with an identifier that describes some linguistic information about the word.
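The idea of rule-guided extraction can be illustrated with a minimal Python sketch. It is not Extract v2.0 and does not use Extract's rule syntax; the rule table `RULES`, the function `extract` and the Latin-script toy corpus are hypothetical placeholders chosen for illustration, not Kashmiri data. The sketch assumes a rule names a paradigm and lists the suffixes whose word forms must all be attested in the corpus before a stem is assigned to that paradigm.

```python
# Hypothetical paradigm rules: name -> suffixes that must all be attested
# for a stem. Toy Latin-script placeholders, not real Kashmiri rules and
# not Extract v2.0 rule syntax.
RULES = {
    "noun_a": ["", "s"],
    "verb_a": ["", "ed", "ing"],
}


def extract(tokens, rules, min_stem_len=2):
    """Return sorted (stem, paradigm) pairs whose required forms all occur."""
    vocab = set(tokens)
    analyses = set()
    for name, suffixes in rules.items():
        for word in vocab:
            for suffix in suffixes:
                # Skip words that cannot be an instance of this suffix.
                if suffix and not word.endswith(suffix):
                    continue
                stem = word[: len(word) - len(suffix)] if suffix else word
                if len(stem) < min_stem_len:
                    continue
                # Accept the stem only if every form of the paradigm is attested.
                if all(stem + s in vocab for s in suffixes):
                    analyses.add((stem, name))
    return sorted(analyses)


corpus = "walk walked walking book books talk talked".split()
result = extract(corpus, RULES)
print(result)
```

In this toy run, "talk" is not assigned to the hypothetical `verb_a` paradigm because "talking" never occurs in the corpus; requiring every form to be attested trades coverage for precision.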

Findings: The study presents the fundamental extraction rules that guide Extract v2.0 in extracting inflectional information and that support the development of a full lexicon usable by different natural language processing applications. The major contributions of the study are:

  • Orthography component: A Unicode infrastructure to accommodate the Perso-Arabic script of Kashmiri.
  • Morphology component: A type system that covers the language abstraction and an inflection engine that covers word-and-paradigm morphological rules for all word classes.
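The two components can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `normalize` helper, the `Paradigm` class and the toy endings are hypothetical names introduced here. It assumes the orthography layer normalizes text to Unicode NFC and drops the purely typographic tatweel character, and that a paradigm is a typed table from feature labels to suffixes.

```python
import unicodedata
from dataclasses import dataclass

TATWEEL = "\u0640"  # Arabic elongation character, purely typographic


def normalize(text: str) -> str:
    """Apply NFC and drop tatweel so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text).replace(TATWEEL, "")


@dataclass
class Paradigm:
    """Hypothetical word-and-paradigm entry: feature label -> suffix."""
    name: str
    word_class: str
    endings: dict

    def inflect(self, stem: str) -> dict:
        # Build the full inflection table for one stem.
        return {feat: normalize(stem + suf) for feat, suf in self.endings.items()}


# Toy Latin-script paradigm; real Kashmiri endings are not given in the source.
toy_noun = Paradigm("toy_noun", "noun", {"sg": "", "pl": "s"})
table = toy_noun.inflect("book")
print(table)
```

Normalizing before comparison means that, for example, alef followed by a combining madda and the precomposed alef with madda are treated as the same character when word forms are matched.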

Research Implications: The study does not include all the rules, but it can be taken as a prototype for extending the functionality of the lexicon. It is an attempt to automate the extraction of morphological information using the Extract tool.

Originality/Value: Kashmiri is the most widely spoken language in the state of Jammu and Kashmir, yet very few software tools and applications exist for it. The study provides a framework for developing a full-size lexicon for Kashmiri from raw text, as an attempt to provide lexicon support for applications that make use of the language. The study can be extended to develop a spoken lexicon of Kashmiri for use in spoken dialogue systems.

Keywords: Natural Language Processing; Morphology; Lexicon; Kashmiri Morphology; Extract Tool; Logic

Paper Type: Design


TRIM is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.