LANGUAGE IN INDIA

Strength for Today and Bright Hope for Tomorrow

Volume 10 : 11 November 2010
ISSN 1930-2940

Managing Editor: M. S. Thirumalai, Ph.D.
Editors: B. Mallikarjun, Ph.D.
         Sam Mohanlal, Ph.D.
         B. A. Sharada, Ph.D.
         A. R. Fatihi, Ph.D.
         Lakhan Gusain, Ph.D.
         K. Karunakaran, Ph.D.
         Jennifer Marie Bayer, Ph.D.
         S. M. Ravichandran, Ph.D.
         G. Baskaran, Ph.D.

Copyright © 2010
M. S. Thirumalai


 

Development of Punjabi-Hindi Aligned Parallel Corpus
from Web Using Machine Translation

Gurpreet Singh Josan, Ph.D.
Jagroop Kaur, M. Tech.


Abstract

An aligned parallel corpus plays a vital role in research on various automatic NLP tasks. The World Wide Web is a constantly growing resource for collecting parallel text. This paper discusses a novel approach for collecting parallel text for the Punjabi-Hindi language pair. We use machine translation and the DOM (Document Object Model) to find parallel text on the internet. The collected text is heterogeneous in nature and is aligned at the word level with high precision. The approach discussed in this paper guarantees high-quality parallel data in a short time span.

1. Introduction

Recent advancements in natural language processing are largely based on statistical approaches. A parallel corpus plays a vital role in statistical approaches because it allows empirical studies for various NLP applications such as language studies, machine translation, cross-language information retrieval, and bilingual lexicon development. A parallel corpus is a collection of original texts translated into another language, in which the texts, paragraphs, and sentences, down to the word level, are typically linked to each other.

There exist multilingual parallel corpora such as Europarl, the Bible, and OPUS, as well as bilingual parallel corpora such as the IJS-ELAN Slovene-English corpus and English-Chinese and English-Norwegian parallel corpora. English enjoys a privileged position when it comes to the creation of parallel corpora: most of the time, it is one of the two languages in the pair. The size of the available corpora is also limited. Another constraint is the limited domain: most existing corpora are developed from either government documents or newswire texts. There is a scarcity of parallel corpora for language pairs that exclude English, particularly among Indian languages. This problem is a big barrier to the development of NLP applications involving Indian languages.

The World Wide Web is a constantly evolving source of parallel text. Electronically accessible information on the web is increasing day by day. Web mining seems promising and can be used to build parallel corpora for under-privileged and minority languages. Collecting a parallel corpus from the internet, particularly for resource-starved languages, is among the challenging problems in NLP. It is not a trivial task, because the sheer size of the network makes the process very labor intensive. Besides, there is a chance that useful documents and high-quality translations are mixed up with garbage.

Therefore, researchers have designed several systems to automate this construction process. This idea led to the development of software for automatically discovering parallel text on the World Wide Web, such as BITS (Xiaoyi and Liberman, 1999), PTMiner (Chen and Nie, 2000), and Strand (Resnik, 1998; Resnik and Smith, 2003). This paper describes a technique for the automatic generation of parallel corpora for Punjabi and Hindi. We use the best available techniques and supplement them with additional resources. We show why the existing systems are not suitable for our work and then describe how a machine translation system helps in identifying and aligning the parallel corpus obtained from the web.

2. Existing Systems

Resnik (1998) proposed a simple method based on the anchor tag. A simple query is posted to AltaVista to locate pages that point to a pair of pages and contain anchor text indicating the language of each parallel text. This is the case for an Index.html file which contains pointers to two parallel texts anchored as "English version" and "French version". However, this simple method can catch only a small part of all parallel pages; many other parallel pages do not satisfy this condition.
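
As a rough illustration of the anchor-text check described above, the following Python sketch scans one candidate parent page and returns a pair of links whose anchor text names two different languages. It is only a minimal sketch under stated assumptions, not Resnik's implementation: the names AnchorCollector and find_parallel_pair are hypothetical, the language keywords are matched by plain substring search, and the search-engine query step that locates candidate parent pages is not shown.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalise whitespace in the anchor text before storing it.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def find_parallel_pair(html, lang_a="english", lang_b="french"):
    """Return (href_a, href_b) if the page links to both language versions.

    Assumption: a link counts as a language version when its anchor text
    contains the language name, e.g. "English version".
    """
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()

    by_lang = {}
    for href, text in parser.links:
        lowered = text.lower()
        for lang in (lang_a, lang_b):
            # Keep only the first link found for each language.
            if lang in lowered and lang not in by_lang:
                by_lang[lang] = href

    if lang_a in by_lang and lang_b in by_lang and by_lang[lang_a] != by_lang[lang_b]:
        return by_lang[lang_a], by_lang[lang_b]
    return None
```

A page that carries no such language-labelled anchors yields None, which reflects the limitation noted above: parallel pages that are not linked from a common parent with explicit language labels are missed by this kind of check.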


This is only the beginning part of the article.



Gurpreet Singh Josan, Ph.D.
Rayat & Bahra Institute of Engineering & Biotechnology
Sahauran
Mohali
Punjab, India
josangurpreet@rediffmail.com

Jagroop Kaur, M.Tech.
University College of Engineering
Punjabi University
Patiala
Punjab, India
jagroop_80@rediffmail.com

 