LANGUAGE IN INDIA

Strength for Today and Bright Hope for Tomorrow

Volume 18:2 February 2018
ISSN 1930-2940

Managing Editor: M. S. Thirumalai, Ph.D.
Editors: B. Mallikarjun, Ph.D.
         Sam Mohanlal, Ph.D.
         B. A. Sharada, Ph.D.
         A. R. Fatihi, Ph.D.
         Lakhan Gusain, Ph.D.
         Jennifer Marie Bayer, Ph.D.
         G. Baskaran, Ph.D.
         L. Ramamoorthy, Ph.D.
         C. Subburaman, Ph.D. (Economics)
         N. Nadaraja Pillai, Ph.D.
         Renuga Devi, Ph.D.
         Soibam Rebika Devi, M.Sc., Ph.D.
         Dr. S. Chelliah, Ph.D.
Assistant Managing Editor: Swarna Thirumalai, M.A.

Language in India www.languageinindia.com is included in the UGC Approved List of Journals. Serial Number 49042.





Copyright © 2016
M. S. Thirumalai

Publisher: M. S. Thirumalai, Ph.D.
11249 Oregon Circle
Bloomington, MN 55438
USA



Creation and Compilation of Hindi Newspaper Text Corpus

Vandana Mishra and Niladri Sekhar Dash
Indian Statistical Institute


Abstract

Developing a corpus for studying various aspects of a language is a challenging task that requires effective planning and implementation. The prime concern in corpus development is the overall design criteria. In this chapter we present some theoretical guidelines on the design criteria of the Hindi Newspaper Text Corpus (HNTC), a one-million-word digital corpus developed as part of an ongoing research activity. After setting out the planning stage, we give a comprehensive description of the steps involved in developing the corpus, followed by an overview of the developed corpus with detailed specifications. Since the corpus is to be used subsequently for various kinds of linguistic analysis, it has been documented carefully. This chapter also stresses the documentation, storage and management of the corpus, a highly tedious task that requires extreme care on the part of the corpus builder. Proper documentation ensures the authenticity and retrievability of the corpus and makes it usable in a wider range of potential areas in the future.

Keywords: Corpus, Compilation, Hindi, Newspaper, Documentation

1. Introduction

The development of text corpora in Indian languages began with the Kolhapur Corpus of Indian English (KCIE), which Shastri (1988) designed as an individual effort to identify the similarities and differences among American English, British English and Indian English. Since then, several individual attempts may have been made to develop corpora for the major Indian languages, but these have received little recognition and are not well attested in the history of corpus generation and application in India.

The next important milestone is the TDIL (Technology Development of Indian Languages) project, which was initiated in 1991 by the Department of Electronics (DoE), Ministry of Communication and Information Technology (MCIT), Govt. of India. Its mission was to develop corpora in electronic form in all Indian languages included in the 8th Schedule of the Constitution of India for subsequent work in language technology (Dash 2007). The Central Institute of Indian Languages (CIIL), Mysore was entrusted with coordinating the corpus development task on behalf of the MCIT, and with developing the tools and systems required to convert the corpus into Unicode format and to store, manage, and disseminate it for use by interested researchers. The CIIL collaborated with Lancaster University, UK on these tasks (Baker, McEnery 2003).


This is only the beginning part of the article.


Vandana Mishra
Senior Research Fellow
vandana.mishra87@gmail.com

Niladri Sekhar Dash
Associate Professor
ns_dash@yahoo.com

Linguistic Research Unit
Indian Statistical Institute
203 B.T. Road
Kolkata-700108
West Bengal
India

