LANGUAGE IN INDIA

Strength for Today and Bright Hope for Tomorrow

Volume 17:1 January 2017
ISSN 1930-2940

Managing Editor: M. S. Thirumalai, Ph.D.
Editors: B. Mallikarjun, Ph.D.
         Sam Mohanlal, Ph.D.
         B. A. Sharada, Ph.D.
         A. R. Fatihi, Ph.D.
         Lakhan Gusain, Ph.D.
         Jennifer Marie Bayer, Ph.D.
         G. Baskaran, Ph.D.
         L. Ramamoorthy, Ph.D.
         C. Subburaman, Ph.D. (Economics)
         N. Nadaraja Pillai, Ph.D.
         Renuga Devi, Ph.D.
         Soibam Rebika Devi, M.Sc., Ph.D.
Assistant Managing Editor: Swarna Thirumalai, M.A.


Copyright © 2016
M. S. Thirumalai

Publisher: M. S. Thirumalai, Ph.D.
11249 Oregon Circle
Bloomington, MN 55438
USA



An Experiment with the CRF++ Parts of Speech (POS) Tagger for Odia

Pitambar Behera, M.A., B.Ed., M.Phil., Ph.D.


Abstract

This work presents a probability-based CRF++ parts of speech (POS) tagger for the Odia language. A corpus of approximately 600k tokens was annotated manually for Odia under the Indian Languages Corpora Initiative (ILCI) project. The whole Odia corpus was annotated with the Bureau of Indian Standards (BIS) tagset developed by the DIT, Government of India, with some modifications made under the ILCI. The tagger was trained on 236,793 tokens and tested on 128,646 tokens. It achieves 94.39% accuracy on seen data and 88.87% on unseen data, measured in terms of precision and recall. In addition, the study reports an inter-annotator (IA) agreement and an error analysis that identifies the most salient labelling errors made by the automatic tagger, and it offers suggestions for improving the tagger's efficiency. Finally, it describes the user-interface architecture and its functionalities.

Keywords: Indo-Aryan language, Odia, BIS, ILCI, POS tagger, CRF++, NLP.

Overview

Parts of Speech (POS) tagging, also referred to as annotation or labelling (Mitkov, 2003), is the task of assigning a grammatical category label to each token on the basis of the linguistic and contextual information available within a sentence. Several approaches to POS annotation exist, of which the rule-based, statistical and hybrid methods are the most salient.
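To make the idea of "linguistic and contextual information" concrete, the following minimal Python sketch shows the kind of per-token features a statistical tagger might consult: the word itself, its suffixes, and its neighbours. This is an illustration only and not the feature set used in the study; the function names, the feature names, the `<BOS>`/`<EOS>` boundary markers and the English placeholder tokens are all assumptions introduced here.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Build illustrative features for the token at position i.

    Hypothetical feature set: the word form and its suffixes stand in for
    linguistic cues, the neighbouring words for contextual cues.
    """
    word = tokens[i]
    return {
        "word": word,
        "suffix2": word[-2:],  # last two characters (or the whole word if shorter)
        "suffix3": word[-3:],  # last three characters (or the whole word if shorter)
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Return one feature dictionary per token of the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    # Placeholder English tokens; a real system would use tokenised Odia text.
    sentence = ["the", "dogs", "barked", "loudly"]
    for token, features in zip(sentence, sentence_features(sentence)):
        print(token, features)
```

A statistical tagger would learn, from annotated data, how strongly each such feature predicts each tag; a rule-based tagger would instead encode comparable cues as hand-written rules.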

Indian languages have long been challenging for both linguistics and NLP because they are numerous, diverse and morphologically rich, and because they exhibit other distinctive features as well. India is home to five language families, namely Indo-Aryan, Dravidian, Austro-Asiatic, Tibeto-Burman, and Andamanese (Abbi, 2001, p. 24).


This is only the beginning of the article. The entire article is available in the printer-friendly version.



Pitambar Behera, M.A., B.Ed., M.Phil., Ph.D.
Centre for Linguistics
School of Language, Literature and Culture Studies
Jawaharlal Nehru University
New Delhi-110067
India
pitamb38llh@jnu.ac.in
pitambarbehera2@gmail.com
