- E-mail your articles and book-length reports in Microsoft Word to
languageinindiaUSA@gmail.com.
- PLEASE READ THE GUIDELINES GIVEN IN HOME PAGE
IMMEDIATELY AFTER THE LIST OF CONTENTS.
- Your articles and book-length reports should be written following the APA, MLA, LSA, or IJDL Stylesheet.
- The Editorial Board has the right to accept, reject, or suggest modifications to the articles submitted for publication, and to make suitable stylistic adjustments. High quality, academic integrity, ethics and morals are
expected from the authors and discussants.
Copyright © 2016
M. S. Thirumalai
Publisher: M. S. Thirumalai, Ph.D.
11249 Oregon Circle
Bloomington, MN 55438
USA
Creation and Compilation of Hindi Newspaper Text Corpus
Vandana Mishra and Niladri Sekhar Dash
Indian Statistical Institute
Abstract
Developing a corpus for the study of various aspects of a language is a highly challenging task that requires effective planning and implementation. The prime concern in corpus development is the overall design criteria. In this chapter we present some theoretical guidelines on the design criteria of a one-million-word digital corpus, the Hindi Newspaper Text Corpus (HNTC), which has been developed as part of an ongoing research activity. After the planning stage, we give a comprehensive description of the steps involved in developing the corpus, followed by an overview of the developed corpus with detailed specifications. Since the corpus is to be used subsequently for various kinds of linguistic analysis, it has been documented carefully. This chapter also emphasizes the documentation, storage and management of the corpus, a highly tedious task that demands extreme care on the part of the corpus builder. Proper documentation will ensure the authenticity and retrievability of the corpus and make it usable across a wider range of potential areas in future.
Keywords: Corpus, Compilation, Hindi, Newspaper, Documentation
1. Introduction
The development of text corpora in Indian languages began with the Kolhapur Corpus of Indian English (KCIE), designed by Shastri (1988) as an individual effort to identify the similarities and differences among American English, British English and Indian English. Since then, several individual attempts may have been made to develop corpora for the major Indian languages, but these are not well recognized or attested in the history of corpus generation and application in India.
The next important milestone along this path is the TDIL (Technology Development for Indian Languages) project, initiated in 1991 by the Department of Electronics (DoE), Ministry of Communication and Information Technology (MCIT), Govt. of India. It was launched with the mission of developing corpora in electronic form in all Indian languages included in the Eighth Schedule of the Constitution of India, for subsequent work in language technology (Dash 2007). The Central Institute of Indian Languages (CIIL), Mysore, was entrusted with coordinating the corpus development task on behalf of the MCIT. It was also responsible for developing the tools and systems required to convert the corpus into Unicode format and to store, manage, disseminate, and make it available to interested researchers. The CIIL collaborated with Lancaster University, UK, on these tasks (Baker, McEnery 2003).
This is only the beginning part of the article.
Vandana Mishra
Senior Research Fellow
vandana.mishra87@gmail.com
Niladri Sekhar Dash
Associate Professor
ns_dash@yahoo.com
Linguistic Research Unit
Indian Statistical Institute
203 B.T Road
Kolkata-700108
West Bengal
India
- Please ensure that your name, academic degrees, institutional affiliation
and institutional address, and your e-mail address are all given in
the first page of your article. Also include a declaration that your
article or work submitted for publication in LANGUAGE IN INDIA is an
original work by you and that you have duly acknowledged the work or
works of others you used in writing your articles, etc.
Remember that by maintaining academic integrity we not only do the right
thing but also help the growth, development and recognition of Indian/South Asian scholarship.