LANGUAGE IN INDIA

Strength for Today and Bright Hope for Tomorrow

Volume 3 : 12 December 2003

Editor: M. S. Thirumalai, Ph.D.
Associate Editors: B. Mallikarjun, Ph.D.
         Sam Mohanlal, Ph.D.
         B. A. Sharada, Ph.D.


Copyright © 2001
M. S. Thirumalai

CURRENT TRENDS IN INDIAN LANGUAGES TECHNOLOGY
Girish Nath Jha, Ph.D.


1. IT DEVELOPMENTS AND INDIAN LANGUAGES

This paper outlines developments in Information Technology (IT) applications for the major Indian languages. Besides describing the current linguistic scene in India, it surveys the significant Research & Development (R&D) in the major areas of Indian languages technology. The paper has the following sections:
  1. Introduction - describing the language scene in India, language families, constitutional status, problems and politics, and the need to use IT for solving language issues and promoting inter-communication among Indian languages,
  2. Language technology in India - describing the latest developments in the field,
  3. Conclusion, and
  4. References.

2. THE CURRENT INDIAN LINGUISTIC SCENE - BLEAK OR BRIGHT?

India is considered a linguistic area by language typologists [2]. There are over a thousand languages with well-defined grammars [3] and five language families - Indo-Aryan, Dravidian, Austro-Asiatic, Tibeto-Burman, and Andamanese [1]. These language families are distinguished on a linguistic basis, and each family has been studied extensively except Andamanese [1]. Today India has 18 national languages, of which Hindi is the National Official Language (NOL). English, which does not have a place among the national languages, serves as the Associate Official Language (AOL).

Some scholars claim that there has traditionally been no organized effort on language policy [8]. The government has certainly vacillated in its approach to language issues since the linguistic reorganization of the states, and efforts to implement the Three Language Formula have not been successful. The government also faces continuous pressure from linguistic communities to include their languages in the Language Schedule of the Constitution of India. Government efforts to promote Hindi as the NOL have been only partially successful, though Hindi has received a lot of popular support from film and the electronic media. The race to evolve and promote an NOL after independence has also led to the neglect of many languages spoken by comparatively smaller population groups. Upward socio-economic mobility causes language death in India when younger generations stop learning and using their native languages. Barriers among the major Indian languages have also caused conflicts and fueled political hot-spots.

3. IT IN INDIA

India is one of the world's major IT providers and also one of the countries with the largest number of languages. Its technical edge can also be applied to solving many language problems and to preserving data from dying languages. IT can help Indian languages in the following ways -

  • creating fonts and word processing systems
  • creating Optical Character Recognition (OCR) tools
  • building data warehouses
  • corpora building
  • dialect mapping
  • preparing multilingual resources
  • building uni/bi/multimodal natural language interfaces
  • creating M(A)T tools
  • preparing online resources and search engines
  • building speech recognition and synthesis tools
  • building Text to Speech (TTS) systems
  • doing localization for major software systems

Given this complex linguistic scenario and the need to build software systems for Indian languages, the government has, over the last couple of decades, started IT initiatives in Indian languages and knowledge-based systems at many technology centres, universities, and institutes. These are funded by the Technology Development for Indian Languages (TDIL) programme of the Ministry of Information Technology (MIT) and also by the UNDP. The University Grants Commission (UGC) also supports minor and major research projects. Significant contributions in this field have come from the Indian Institutes of Technology (IITs), Indian Institutes of Information Technology (IIITs), Centre for Development of Advanced Computing (C-DAC), Indian Institute of Science (IISc), Indian Statistical Institute (ISI), Jawaharlal Nehru University (JNU), Mahatma Gandhi International Hindi University (MGIHU), the major Sanskrit universities, and other institutes. Private enterprises such as the Tata Institute of Fundamental Research (TIFR) and Tata Consultancy Services (TCS) have also funded Indian language technology R&D.

4. LANGUAGE TECHNOLOGY IN INDIA

The IT initiatives in Indian languages have been funded primarily by government agencies. The TDIL programme of the Government of India (GOI) has been actively supporting language technology R&D in leading Indian universities, institutes and technology centres. In its latest issue (Jan 2002), the TDIL journal Vishvabharata [9] outlined the programme's short-, medium- and long-term goals, some of which are -
  • Corpora creation and analysis
  • Smart content creation
  • Integration of language technology into curricula
  • Indian language speech databases
  • Multilingual multimedia content development
  • Speech engine: speech recognition, specific speech I/O
  • Indian language support on internet applications
  • Machine Aided Translation
  • Cross Lingual Information Retrieval Tools (CLIR), and
  • Speech to speech translation

The TDIL programme is executed through several nodal centres such as the IITs, IIITs, National Centre for Software Technology (NCST), C-DAC, ISI, Utkal University, Thapar Institute of Engineering and Technology, University of Hyderabad, IISc, Central Electronics Engineering Research Institute (CEERI), TIFR, Anna University, IBM India Research Lab, Megasoft India, Tata Infotech Ltd., Electronics Research & Development Centre (ERDC), Summit Information Technologies Pvt. Ltd., Webdunia.com (India) Pvt. Ltd., Modular Infotech Pvt. Ltd., Banasthali Vidyapeeth and JNU.

The September 2001 issue of Vishvabharata [10] lists some of the current projects under the following categories -

  1. Machine Aided Translation
    • MaTra - Human Aided Machine Translation System for English-Hindi
    • MANTRA - Machine Assisted Translation Tool from English-Hindi
  2. Operating System
    • IndiX - Localisation of Graphical User Interface of Linux Operating System
  3. Human Machine Interface System
    • Optical Character Recognition systems for Indian languages
    • High quality PC based Parametric synthesizer
    • HindiVani- Text to Speech Synthesis System for Hindi
    • Speech/Speaker recognition systems
    • Limited Domain speech synthesis
  4. Tools
    • ITERM - Indian script terminal for Unix X windows
    • Word processors for Indian languages
    • Anusaraka - a language accessor from other Indian languages into Hindi
    • Sanskrit Authoring Systems (SAAS)
    • Devanagari search engines for Unicode
    • Linux ISFOC script manager
    • Indica multilingual solutions
  5. E-content
    • Bilingual electronic dictionary
    • Word net, corpora building and speech databases
    • Web based education systems
    • Multilingual Ayurvedic knowledge base
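One practical issue behind the Devanagari search engines for Unicode listed above is that the same visible letter can be stored as more than one code point sequence. The sketch below is an illustration only and does not describe the code of any TDIL project. It assumes plain Python with the standard unicodedata module, and the function names normalize_devanagari and find_all are hypothetical. Both query and text are normalized to NFC before matching, so that the precomposed letter QA (U+0958) and the sequence KA plus NUKTA (U+0915 U+093C) are treated as the same string.

```python
import unicodedata
from typing import List


def normalize_devanagari(text: str) -> str:
    """Return the NFC form so canonically equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def find_all(query: str, text: str) -> List[int]:
    """Return start offsets (in the normalized text) of every match of query."""
    q = normalize_devanagari(query)
    t = normalize_devanagari(text)
    if not q:
        return []
    offsets = []
    start = t.find(q)
    while start != -1:
        offsets.append(start)
        start = t.find(q, start + 1)
    return offsets
```

Note that the offsets refer to the normalized text, which may differ in length from the input. A real search engine would also need tokenization and indexing, which are outside the scope of this sketch.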

Complementing the above language technology efforts, the GOI and the UNDP jointly funded another programme, the Knowledge Based Computer Systems Programme (KBCSP), which started in 1986. The KBCS nodal centres at C-DAC, Pune and IISc, Bangalore have done commendable research since then. NLP R&D at C-DAC reached its take-off point with the development of GIST technology, which made it possible to perform input and output in a large number of scripts, including non-Indian ones. C-DAC has been working on Indian language fonts and software for over a decade. Many language technology projects now use its font standards, if not its text-processing software. A brief summary of some of its activities is as follows [4] -

  • Simulation of Artificial Neural Networks on C-DAC's parallel hardware
  • Porting of CS-PROLOG and STRAND to multi-transputer systems and implementation of SCHEME and TLISP
  • Development of Indian language (Devanagari) interface for AI programming languages and Expert System Shells
  • Expert Systems for rural applications with the help of AI packages with Indian language interfaces
  • Sanskrit Grammar for Natural Language Understanding (NLU) and Natural Language Interface (NLI)
  • Intelligent Tutoring System for Sanskrit Grammar
  • MT tools and systems
  • Ayurveda based Diagnostic Expert System
  • Musico-linguistic study of Music
  • Development of OCR (Optical Character Recognition) for Devanagari script.

Besides C-DAC, the IITs at Kanpur, Delhi and Madras are doing state-of-the-art work on language technology. The nodal centre on Expert Systems at IIT, Madras has been designing Expert Systems (ES) with special reference to medical and health areas in the rural context, and to engineering-related areas. Substantial progress has been made in its projects EKALAVYA: An Intelligent Referral System for Primary Child Health Care, and IITM-DESS (Diagnostic Expert System Shell) [4].

IIT, Kanpur has long been working on a project to design an MT system for translation among the major Indian languages using Sanskrit as an interlingua. They have developed a tool called Anusaraka, which renders text from one Indian language into another in a nearly comprehensible form. The lexical resources for Indian languages that they have collected are being used for the Anusaraka multilingual transfer tool. IIIT, Hyderabad has been doing significant work on Machine Translation and font development for Indian languages. There have also been efforts to build domain-based translation systems. A study of the technical register of the share market in English and its translation into Hindi was done at JNU, and a bidirectional lexical interface called LexFace was built using Object Pascal and InterBase [6].
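The idea of a bidirectional lexical interface can be illustrated with a small sketch. This is not the design of LexFace, which was built in Object Pascal and InterBase; it is a minimal in-memory illustration in Python, and the class name BidirectionalLexicon, its methods, and the sample entries are all assumptions made for the example. Each entry is indexed in both directions so that a term can be looked up from English or from Hindi.

```python
from collections import defaultdict
from typing import List


class BidirectionalLexicon:
    """Hypothetical two-way English-Hindi term store (illustration only)."""

    def __init__(self) -> None:
        self._en_to_hi = defaultdict(list)
        self._hi_to_en = defaultdict(list)

    def add(self, english: str, hindi: str) -> None:
        """Register a translation pair in both directions, skipping duplicates."""
        en = english.strip().lower()
        hi = hindi.strip()
        if hi not in self._en_to_hi[en]:
            self._en_to_hi[en].append(hi)
        if en not in self._hi_to_en[hi]:
            self._hi_to_en[hi].append(en)

    def to_hindi(self, english: str) -> List[str]:
        """Return Hindi equivalents in insertion order, or an empty list."""
        return list(self._en_to_hi.get(english.strip().lower(), []))

    def to_english(self, hindi: str) -> List[str]:
        """Return English equivalents in insertion order, or an empty list."""
        return list(self._hi_to_en.get(hindi.strip(), []))
```

A domain lexicon would normally carry more than plain word pairs (part of speech, register, usage notes), so a persistent database with richer records would be the natural next step.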

Projects completed or under way with GOI sponsorship at places like JNU and at Sanskrit centres and vidyapeethas such as Gurukul Kangri (Haridwar), Sampurnanand Sanskrit Vidyapeeth (Varanasi), and Banasthali Vidyapeeth (Banasthali) have been remarkable. For example, the CASTLE (Computer Assisted Sanskrit Teaching and Learning Environment) project at JNU built a Prolog system for analyzing, teaching and learning Sanskrit on the Paninian model. Another project at JNU is preparing an online learning system for Japanese, Chinese and Sanskrit. The Academy of Sanskrit Research, Melkote, Mysore has been actively involved in bringing scholars doing technology R&D for Sanskrit and the shastras onto a single platform. In 1993, it organised seminars on 'Sanskrit and computer based linguistics' and in 1994, a seminar on 'Interface Mechanisms in Shastras and Computer Science'. The latter, among other things, brought out similarities between traditional Indian theories and the principles of Artificial Intelligence [5]. The newly opened centre of Sanskrit studies in JNU has started research on Navya Nyaya language and methodology. The centre has already started an ambitious unfunded project to develop an online multilingual Amarakosha with a Java front-end and an SQL server as the back-end. The M.A. students of the centre recently finished sample projects in Prolog on about 20 topics from Sanskrit language and literature.

Many NGOs and private enterprises have also taken an interest in this area. The Tata Institute of Fundamental Research (TIFR), Tata Consultancy Services (TCS) and other companies have started funding Indian language and KBCS R&D. The Computer Systems and Communication Group of TIFR, Bombay aims at developing a Voice Oriented Interactive Computing Environment (VOICE) using KBCS technologies. Some progress has already been made in preparing a speech database, a phoneme-to-speech synthesis system, multi-speaker speech recognition systems, and a Computer Tutor in speech mode [4].

5. CORPORA BUILDING AND DATA BASES

There have been significant developments in corpora building, and databases of Indian language data have been prepared. The lexical resources project at IIIT, Hyderabad collects and represents multilingual lexical data from Indian languages. IIT, Delhi has been developing Indian language corpora in machine-readable form, with a target of about 3 million words for each language drawn from works published between 1980 and 1990.
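The selection criteria mentioned above (a word target per language and a publication window) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used at IIT, Delhi: the Document record, the function names, and the whitespace-based word count are all hypothetical simplifications, and real Indian language corpora would need script-aware tokenization.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Document:
    title: str
    year: int   # year of publication
    text: str


def count_words(text: str) -> int:
    """Crude word count: split on whitespace (an assumption for this sketch)."""
    return len(text.split())


def select_for_corpus(
    documents: Iterable[Document],
    target_words: int = 3_000_000,
    first_year: int = 1980,
    last_year: int = 1990,
) -> Tuple[List[Document], int]:
    """Pick in-window documents in order until the word target is reached."""
    selected: List[Document] = []
    total = 0
    for doc in documents:
        if not (first_year <= doc.year <= last_year):
            continue  # outside the publication window
        if total >= target_words:
            break     # target already met
        selected.append(doc)
        total += count_words(doc.text)
    return selected, total
```

The function returns both the chosen documents and the running word total, so the caller can see how far a given language is from its target.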

The Mahatma Gandhi International Hindi University (MGIHU) started a mega project on databases and dialect mapping of Hindi. This project, called 'Hindi Samgraha,' has the following objectives [7] -

  • Creating databases for varieties of Hindi
  • Creating a digital warehouse of audio and visual data
  • Generating interactive linguistic atlas
  • Generating reports on analysis and linguistic patterns
  • Creating a dynamic content multimedia website
  • Creating a multi-dialectal lexicon and search mechanism for Hindi

In this five-year project, trained linguists and folk literature specialists will do fieldwork in the Hindi belt. First-hand data will be recorded digitally and stored as files, with only references to those files kept in the database. Software with a VB front-end and an SQL Server back-end will allow easy data entry, user management, reporting, etc. A dynamic-content, interactive website will be created so that data can be added from the field itself. Geographical mapping software will be used to generate linguistic maps from the collected data.
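The storage approach described here, with recordings kept as files and only references held in the database, can be sketched as below. The project itself specifies a VB front-end and SQL Server; this sketch substitutes Python's built-in sqlite3 module purely for illustration, and every table, column, function name, and sample value is a hypothetical assumption rather than the project's actual schema.

```python
import sqlite3
from typing import List, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    """Create illustrative tables: dialects, and recordings that point to files."""
    conn.executescript(
        """
        CREATE TABLE dialect (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE recording (
            id         INTEGER PRIMARY KEY,
            dialect_id INTEGER NOT NULL REFERENCES dialect(id),
            place      TEXT NOT NULL,
            file_path  TEXT NOT NULL  -- reference only; audio stays on disk
        );
        """
    )


def add_recording(conn: sqlite3.Connection, dialect: str, place: str,
                  file_path: str) -> int:
    """Store a reference to a recording file and return the new row id."""
    conn.execute("INSERT OR IGNORE INTO dialect(name) VALUES (?)", (dialect,))
    (dialect_id,) = conn.execute(
        "SELECT id FROM dialect WHERE name = ?", (dialect,)
    ).fetchone()
    cur = conn.execute(
        "INSERT INTO recording(dialect_id, place, file_path) VALUES (?, ?, ?)",
        (dialect_id, place, file_path),
    )
    return cur.lastrowid


def recordings_for(conn: sqlite3.Connection, dialect: str) -> List[Tuple[str, str]]:
    """Return (place, file_path) pairs for one dialect, oldest first."""
    return conn.execute(
        """
        SELECT r.place, r.file_path
        FROM recording AS r JOIN dialect AS d ON d.id = r.dialect_id
        WHERE d.name = ?
        ORDER BY r.id
        """,
        (dialect,),
    ).fetchall()
```

Keeping large audio and video files outside the database keeps the tables small, while the place column gives a mapping tool something to join against when drawing a linguistic atlas.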

This year, the MGIHU organized a seminar on '21st century reality: language culture and technology' at New Delhi. The technology section of the seminar featured many important presentations by scholars. Among these, the presentations on creating multimedia for preserving, teaching and learning languages and on building lexical resources for Indian languages were significant. Recently, the School of Computer and Systems Sciences, JNU organized a workshop on Sanskrit Informatics at IGNCA, New Delhi to discuss current trends in language technology for Sanskrit and goals for the future.

6. CONCLUSION

The marriage of languages with advances in IT has ushered in a new information age in which the globe has shrunk and information has become the most valuable commodity, responsible for progress and prosperity. This is more relevant in India than in any other country or continent. With its unmatched linguistic and cultural diversity, the Indian linguistic scene has been a prime target for sociological and computational studies.

IT R&D and applications for Indian languages have multiplied in the last two decades, thanks to liberal funding from the government. However, there is a growing feeling that coordination is significantly lacking among the various R&D units and nodal centres doing language technology and knowledge-based research. Sometimes this leads to duplication of work. Government funding also does not always produce the expected results in terms of quality and time. There is also a feeling that the Indian companies at the forefront of IT development around the world do not extend adequate support to research on Indian languages. Work on the needs of the linguistically challenged populations in India is conspicuous by its near absence.

India has a very diverse linguistic scene, and there is tremendous demand on the language industry to develop systems for MT, information retrieval, man-machine interfaces, knowledge representation, etc. There is also great potential in developing KB systems for Indian languages. This calls for greater cooperation among linguists, computer scientists, and cognitive psychologists on the one hand and the R&D nodal centres on the other. Continued liberal funding from both private and government sources is a must, because funding in these areas is known to dry up when results are poor or slow.


References

1. Abbi, Anvita, 2001, A Manual of Linguistic Fieldwork and Structures of Indian Languages, Lincom Europa, Muenchen

2. Emeneau, Murray B., 1956, India as a Linguistic Area, Language 32.1:3-16

3. Grierson, George A., 1903-28, Linguistic Survey of India, Vols. I-XI, Calcutta, Repr. 1968, Motilal Banarsidass, Delhi.

4. Jha, Girish N, 1993, Morphology of Sanskrit Case Affixes: a computational analysis, M.Phil dissertation, J.N.U., New Delhi

5. Jha, Girish N, 1994, Indian theory of knowledge: an AI perspective, Proceedings of National Seminar on Interface Mechanisms in Shastras and Computer Science, Academy of Sanskrit Research, Melkote, Mysore, April, 1994.

6. Jha, Girish N, 1996, Designing a lexical semantic component for English-Hindi machine translation. Ph.D. thesis, JNU.

7. Jha, Girish N, 2002, Hindi Samgraha: The technology of IT, presented as the project outlines in a brainstorming session of experts, IIC, New Delhi, Dec 12, 2002

8. Kachru, Braj, 1981, Language Policy in South Asia, Annual Review of Applied Linguistics (Kaplan, ed. 1982), Newbury House Publishers Inc., Rowley, Massachusetts

9. Vishvabharata, a Ministry of Information Technology publication on language technology in India, (Jan 2002).

10. Vishvabharata, a Ministry of Information Technology publication on language technology in India, (Sept 2001).




Girish Nath Jha, Ph.D.
Centre for Sanskrit Studies
Jawaharlal Nehru University
New Delhi-110067, India
E-mail: girishj@mail.jnu.ac.in