Tuesday, August 30, 2011

Web Crawling

What are some of the challenges of crawling the web?
How can a web developer make their pages rank better?

Some challenges I found:
Because bandwidth is finite and costly, a crawler is best served by spending it on the most important content first. With so much server-side generated content being served by websites, avoiding duplicate content becomes harder (e.g. many GET-parameter combinations that all return the same data). Spider traps can also be a severe problem. Take, for example, a web calendar: if the spider were to follow the calendar's links, it would keep crawling to the next day, year, and so on indefinitely.
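One common mitigation for the GET-parameter problem is URL canonicalization: normalize each URL before checking it against the set of pages already seen. Below is a minimal sketch (the example URLs are hypothetical) using Python's standard `urllib.parse`:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def canonicalize(url):
    """Normalize a URL so that different GET-parameter orderings
    (and fragments) map to one canonical form."""
    parts = urlparse(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Lowercase scheme and host, and drop the fragment entirely.
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.params, query, ""))

seen = set()
for url in ["http://example.com/page?b=2&a=1",
            "http://EXAMPLE.com/page?a=1&b=2#top"]:
    seen.add(canonicalize(url))
# Both URLs collapse to a single canonical entry in `seen`.
```

A depth or per-host page limit on top of this also helps against spider traps like the calendar: the crawler simply refuses to follow link chains past a fixed depth.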

Web page rankings:
There is an entire field, SEO (search engine optimization), which is in a sense reverse engineering search engines to get better rankings; it is too large to cover here. The typical tips you hear are avoiding duplicate content, attracting more inbound links, etc.
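The inbound-links tip connects to link-analysis ranking in the style of PageRank: a page is important if important pages link to it. Here is a minimal power-iteration sketch over a hypothetical four-page link graph (the graph, damping factor, and iteration count are illustrative assumptions, not a real search engine's settings):

```python
# Hypothetical link graph: page -> list of pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = list(links)
d = 0.85  # damping factor: probability of following a link vs. jumping randomly
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # power iteration until roughly converged
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            # Each page splits its rank evenly among its outgoing links.
            new[q] += d * rank[p] / len(outs)
    rank = new

# Page "c" has the most inbound links, so it ends up with the highest rank.
```

This is why inbound links matter for rankings: each one channels a share of the linking page's own importance to yours.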

1 comment:

  1. I think one more challenge is crawling the so-called "invisible web": databases that normal crawlers are not able to access. There is a lot of active research going on in this area.

    Dr. Rao has also published a paper on the related topic of the Deep Web: http://rakaposhi.eas.asu.edu/www11-sourcerank.pdf