Wednesday, August 31, 2011

Use of Social Media for ranking results

This post is related to the previous post (Google explores re-ranking).

Using social networking data (as well as user preferences, profiles, even browsing history) to give better results looks like the only way (or at least the easy way!), as of now, for any search engine to improve its results. This is one of the reasons why Facebook (750 million active users as of now) is considered a huge threat to Google. The notion of Facebook entering the search domain has been around for some time now (and has been ridiculed by many). The fact remains that most users would prefer looking at a webpage or buying something if it has already been recommended by someone they know (Google's +1 or FB's "like", "shared" features).

My concern is that this would leave out people who do not use social networking sites much. A medical professional might still use Google on a regular basis but might not be active on social networking sites, or may not even have a Google/Gmail ID (to get profile preferences).


Another concern is the misuse of these parameters. There are already known ways to boost a website's Google rank (some good, some not so good). This was also discussed in class with the example of the "miserable failure" keywords pointing to websites of George Bush. There is an interesting term for this concept: "Google bomb". There is a good chance that these new social networking features might also be misused. Here is an article where a user was able to boost his website's ranking based only on "like" clicks, without any backlinks.

This is also mentioned in the previous post's article -
But if Google’s going to start using those +1 votes, the company is virtually inviting the world’s spammers and blackhat SEO magicians to flood its social networking system with fake profiles and fake votes — potentially ruining it and possibly making the problem of search spam even worse.


For these reasons, I think research should continue on improved IR methods and ranking algorithms that rely on minimal user data.


Google explores re-ranking search results using +1 button data

Hi Friends,
I came across an interesting article about Google using its social network to re-rank the results.

"Google's biggest weakness is the possibility that someone will figure out how to build a better search engine -- and there's many who bet the way to do that is to make search involve more of a human touch and less of a machine's".

The above line reminds me of the "Sweet Spot" concept taught in class.

http://www.cnn.com/2011/08/30/tech/web/google-ranking-plus-results/index.html?hpt=te_bn5 is the URL to the article.

Thanks
Abhishek

Tuesday, August 30, 2011

Mooers' Law


I came across an interesting law and thought of sharing it. 

Mooers’ Law:

"An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have it." -- Calvin N. Mooers 
                  and Wikipedia

Web Crawling

Questions:
What are some of the challenges with crawling the web?
How can a web developer make his/her page get better rankings?

Some challenges I found:
Because bandwidth is finite and costly, it is in the crawler's best interest to spend it on the most important content. With all sorts of server-side generated content being served by many websites, avoiding duplicate content becomes more difficult (e.g., many GET parameter combinations that result in the same data). Spider traps can also be a severe problem. Take, for example, a web calendar: if the spider were to follow the links on the calendar, it would keep crawling to the next day, the next year, and so on without end.
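To make the two problems above a bit more concrete, here is a minimal sketch of how a crawler might guard against them. It is only an illustration under my own assumptions: the names `IGNORED_PARAMS`, `MAX_DEPTH`, `canonicalize` and `should_crawl` are hypothetical, the list of ignorable parameters is made up, and a depth cap is just one crude way to escape a trap like the calendar.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of query parameters assumed not to change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "sessionid"}
# Assumed cap on link depth; a crude guard against spider traps.
MAX_DEPTH = 5


def canonicalize(url):
    """Return a normalized form of url so equivalent URLs compare equal."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so parameter order is irrelevant.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    # Lowercase scheme and host, default the path to "/", discard the fragment.
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",
    ))


def should_crawl(url, depth, seen):
    """Decide whether to fetch url, recording its canonical form in seen."""
    if depth > MAX_DEPTH:
        # Too far from the seed page: possibly an endless calendar-style chain.
        return False
    key = canonicalize(url)
    if key in seen:
        # Same content already queued under a different parameter ordering.
        return False
    seen.add(key)
    return True
```

With this, `http://Example.com/page?b=2&a=1` and `http://example.com/page?a=1&b=2&sessionid=xyz#top` map to the same key, so only the first one is fetched. Real crawlers use more careful heuristics, but the idea is the same.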

Web page rankings:
There is an entire topic called SEO (search engine optimization), which in a sense is reverse engineering search engines to get better rankings. It is too large to cover here. The typical tips you hear about are avoiding duplicate content, getting more inbound links, etc.

Inverted Index for storing bag of words



Creating a bag of words plays an important role. There are a lot of documents on the web. Keeping a separate bag of words for each document is not a good idea, as it increases the time and space costs. So many search engines make use of the inverted index data structure.
The image above illustrates how the bag of words is created.

The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the database. It is the most popular data structure used in document retrieval systems, and it is used on a large scale, for example, in search engines.
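Here is a small sketch of the idea, just to illustrate the trade-off described above. It assumes a toy whitespace tokenizer and a tiny made-up document collection; the function names `tokenize`, `build_inverted_index` and `search` are my own and not from any particular library.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # The extra work happens here, once, when a document is added.
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # One posting-set lookup per term, then intersect them.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Made-up example collection: document ID -> text.
docs = {
    1: "web crawling is hard",
    2: "ranking web pages",
    3: "inverted index for ranking",
}
index = build_inverted_index(docs)
```

Calling `search(index, "web ranking")` only touches the posting sets for "web" and "ranking" instead of scanning every document, which is where the speed-up at query time comes from.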

I think we will use this data structure in our search engine project.

Regards.
Rajasekhar.

Abe Lincoln as an awesome IR engine..

Here is my all-time favorite IR sketch.. written by Woody Allen, about a putative incident from Lincoln's
life. Read it with today's IR discussion in the back of your mind, and you will enjoy this already funny sketch
even more..


Feel free to blog-comment on any IR connections you made. 

Rao

=====
  "like Reduced Shakespeare Company: The better you know the original,
the funnier it gets"

--LA Times review of "Dave Barry Slept here, A
sort of history of United States"