LANGUAGE IN INDIA

Strength for Today and Bright Hope for Tomorrow

Volume 22:5 May 2022
ISSN 1930-2940

Editors:
         Sam Mohanlal, Ph.D.
         B. Mallikarjun, Ph.D.
         A. R. Fatihi, Ph.D.
         G. Baskaran, Ph.D.
         T. Deivasigamani, Ph.D.
         Pammi Pavan Kumar, Ph.D.
         Soibam Rebika Devi, M.Sc., Ph.D.

Managing Editor & Publisher: M. S. Thirumalai, Ph.D.

Celebrate India!
Unity in Diversity!!


Copyright © 2022
M. S. Thirumalai

Publisher: M. S. Thirumalai, Ph.D.
11249 Oregon Circle
Bloomington, MN 55438
USA



Crawler and Its Linguistic Challenges in the Arabic Language Sites

Asmaa Al-Haj Badran


Abstract

A crawler, also known as a Web indexing program or an Internet robot/bot (Spetka, 2004), is a software application that runs automated scripts over the Internet. Search engines use it to keep their content and site indexes up to date by copying every accessible page and processing it into an index, so that users can search far more efficiently. Crawling is the first stage of this pipeline: it downloads Web documents, which the indexer then indexes for later use by the searching module, with feedback from the other modules. The crawling module can also provide on-demand crawling services for search engines. Yet, given the massive amount of data being fed onto the Web, we still encounter problems and challenges while crawling data. Moreover, despite the wide-open access available to all search engines, Arabic content remains only scantily accessible. This paper descriptively details the issues and challenges that the Arabic language, the fifth most spoken language in the world, may grapple with while its data is being crawled.

Keywords: Web crawler, Arabic Language Sites, search engine, methodology, challenges, limitations, newspapers’ logistic expressions.

1. Introduction

The fast-paced growth of the Internet attracts researchers to ease the loading of data from different fields, providing metadata for all users regardless of their mother tongue. The Web search engine confronts various problems that may improve or detract from the crawling of online data. These problems are either novel, like those I faced during my own crawling process, or deep-rooted: they have been dealt with before but never solved. This project aims to raise awareness of these problems, which could, in turn, help improve the maintenance of the crawler server so that it indexes data better.

A Web crawler starts from a set of URLs, called seeds, that are to be indexed. These URLs are visited recursively according to a series of policies that may or may not allow access to the data. The initial step is to choose a Web page from the URL list; the page is then processed by extracting its text and links, and the extracted links are added to the URL frontier to be crawled (Bal, 2012). Even the largest crawlers, faced with the enormous number of pages on the Internet, prove inadequate for presenting a complete index.

Generally, the crawler has two options for maintaining its index. In the first, the crawler examines the Web until the collection reaches a certain number of pages and then stops visiting pages; when it is time to refresh the collection, the crawler builds a new one using the same procedure outlined above and then substitutes the old collection with the new one. Conversely, the crawler may continue visiting sites after the collection has reached its desired size in order to modify the current collection incrementally (Cho & Garcia-Molina, 2000). A minimal sketch of the basic crawl loop is given below.
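The crawl loop just described (seed URLs, a frontier of links waiting to be visited, fetching a page, extracting its links, and feeding them back into the frontier) can be illustrated with a short program. The following is only a minimal Python sketch, not the implementation used in this study; the seed URL, the page limit, and the in-memory dictionary standing in for the indexer are illustrative assumptions.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=10):
    frontier = deque(seeds)   # URLs waiting to be visited
    visited = set()
    index = {}                # URL -> page text (a stand-in for the indexing stage)

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue          # skip pages that cannot be fetched
        index[url] = html     # hand the page over to the indexer
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            frontier.append(urljoin(url, link))  # resolve relative links
    return index


if __name__ == "__main__":
    pages = crawl(["https://example.com/"], max_pages=3)
    print(len(pages), "pages fetched")

A real crawler would, of course, deduplicate URLs more carefully, obey per-site politeness delays, and hand pages to a proper indexer rather than a dictionary; the sketch only shows the seed-frontier-extract cycle described above.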

2. Crawler and the Search Engine

Some search engines use spidering software to keep their Web indices and the content drawn from other sites up to date. Every page visited by the Web crawler can be copied for later processing by the search engine, which indexes the pages so that they are available for users to search more efficiently. A crawler can visit pages without explicit approval, yet the problem is that not all pages allow their data to be crawled.
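Whether a site allows its pages to be crawled is conventionally signalled through its robots.txt file, which well-behaved crawlers consult before fetching a page. The sketch below, again in Python and purely illustrative (the user-agent name and example URL are assumptions), shows how such a check can be made with the standard library.

from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def allowed_to_crawl(url, user_agent="ExampleCrawler"):
    """Return True if the site's robots.txt permits user_agent to fetch url."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    try:
        parser.read()          # download and parse the site's robots.txt
    except OSError:
        return True            # no readable robots.txt: treat the site as unrestricted
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed_to_crawl("https://example.com/some/page.html"))

Here can_fetch() returns False for paths that the site's robots.txt disallows for that user agent, which is precisely the case referred to above as pages that do not allow their data to be crawled.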


This is only the beginning part of the article. The entire article is available in the printer-friendly version.


Asmaa Al-Haj Badran
Jawaharlal Nehru University, New Delhi 110067
asmaaalhajbadran@gmail.com


