Development of Punjabi-Hindi Aligned Parallel Corpus
from Web Using Machine Translation

Gurpreet Singh Josan, Ph. D.
Jagroop Kaur, M. Tech.


An aligned parallel corpus plays a vital role in research on various automatic NLP tasks. The World Wide Web is a constantly growing resource for collecting parallel text. This paper discusses a novel approach for collecting parallel text for the Punjabi-Hindi language pair. We use machine translation and the Document Object Model (DOM) to find parallel text on the internet. The collected text is heterogeneous in nature and is aligned at the word level with high precision. The approach discussed in this paper guarantees high-quality parallel data in a short time span.

1. Introduction

Recent advancements in natural language processing are largely based on statistical approaches. A parallel corpus plays a vital role in statistical approaches because it allows empirical studies for various NLP applications such as language studies, machine translation, cross-language information retrieval, and bilingual lexicon development. A parallel corpus is a collection of original texts translated into another language, in which the texts, paragraphs, and sentences, down to the word level, are typically linked to each other.

There exist multilingual parallel corpora such as Europarl, the Bible, and OPUS, as well as bilingual parallel corpora such as the IJS-ELAN Slovene-English, English-Chinese, and English-Norwegian parallel corpora. English enjoys a privileged position when it comes to the creation of parallel corpora: most of the time, it is one of the two languages in the pair. The size of the available corpora is also limited. Another constraint is the limited domain: most existing corpora are developed from either government documents or newswire texts. Parallel corpora for language pairs that exclude English are scarce, particularly among Indian languages. This scarcity is a major barrier to the development of NLP applications involving Indian languages.

The World Wide Web is a constantly evolving source of parallel text. The amount of electronically accessible information on the web is increasing day by day. Web mining seems promising and can be used for building parallel corpora for under-privileged and minority languages. Collecting parallel corpora from the internet, particularly for resource-starved languages, is among the challenging problems in NLP. It is not a trivial task, because the sheer size of the network makes the process very labor intensive. Besides, useful documents and high-quality translations are likely to be mixed up with garbage.

Therefore, researchers have designed several systems to automate this construction process. This has led to the development of software for automatically discovering parallel text on the World Wide Web, such as BITS (Xiaoyi and Liberman, 1999), PTMiner (Chen and Nie, 2000), and STRAND (Resnik, 1998; Resnik and Smith, 2003). This paper describes a technique for the automatic generation of parallel corpora for Punjabi and Hindi. We try to utilize the best available techniques and supplement them with additional resources. We show why the existing systems are not suitable for our work and then describe how a machine translation system helps in identifying and then aligning the parallel corpus obtained from the web.

2. Existing systems

Resnik (1998) proposed a simple method based on the anchor tag. A simple query is posted to AltaVista to locate pages that point to a pair of pages and whose anchor text indicates the language of each parallel text. This is the case for an Index.html file that contains pointers to two parallel texts anchored as "English version" and "French version". However, this simple method can catch only a small part of all parallel pages, since many other parallel pages do not satisfy this condition.
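The anchor-text idea can be illustrated with a minimal sketch. It is not Resnik's implementation: it skips the search-engine query entirely and only shows the check applied to one candidate parent page, namely whether the page links to two documents whose anchor text names different languages. The cue table LANGUAGE_CUES, the class AnchorCollector, and the function find_candidate_pair are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed cue words mapping anchor text to a language code (illustrative only).
LANGUAGE_CUES = {"english": "en", "french": "fr"}


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def find_candidate_pair(html, base_url, lang_a="en", lang_b="fr"):
    """Return (url_a, url_b) if the page links to both language versions, else None."""
    parser = AnchorCollector()
    parser.feed(html)
    found = {}
    for href, text in parser.anchors:
        lowered = text.lower()
        for cue, code in LANGUAGE_CUES.items():
            # Keep the first link found for each language.
            if cue in lowered and code not in found:
                found[code] = urljoin(base_url, href)
    if lang_a in found and lang_b in found:
        return found[lang_a], found[lang_b]
    return None


page = ('<a href="en/index.html">English version</a> '
        '<a href="fr/index.html">French version</a>')
print(find_candidate_pair(page, "http://example.org/"))
```

A page that links to only one language version, or that labels its links with icons or flags instead of text, yields no pair, which illustrates why the method misses many parallel pages.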

