Wednesday, August 31, 2011

Use of Social Media for ranking results

This post is related to the previous post (Google explores re-ranking).

Using social networking data (as well as user preferences, profiles, and even browsing history) to give better results looks like the only way (the easy way!), as of now, for any search engine to improve its results. This is one of the reasons why Facebook (750 million active users as of now) is considered a huge threat to Google. The notion of Facebook entering the search domain has been around for some time now (and has been ridiculed by many). The fact remains that most users would prefer looking at a webpage or buying something if it is already recommended by someone they know (Google's +1 or FB's "like" and "shared" features).

My concern is that this would leave out people who do not use social networking sites much. A medical professional might still use Google on a regular basis but might not be active on social networking sites, or may not even have a Google/Gmail ID (to get profile preferences).


Another concern is the misuse of these parameters. There are already known ways to boost a website's Google rank (some good, some not so good). This was also discussed in class with the example of the "miserable failure" keywords pointing to websites of George Bush. There is an interesting term for this concept: "Google bomb". There is a good chance that these new social network features might also be misused. Here is an article where a user was able to boost his website's ranking based only on "like" clicks, without any backlinks.

This is also mentioned in the previous post's article -
But if Google’s going to start using those +1 votes, the company is virtually inviting the world’s spammers and blackhat SEO magicians to flood its social networking system with fake profiles and fake votes — potentially ruining it and possibly making the problem of search spam even worse.


For these reasons, I think the research on improved IR methods/ranking algorithms should continue with minimal user data.


Google explores re-ranking search results using +1 button data

Hi Friends,
I came across an interesting article about Google using its social network to re-rank the results.

"Google's biggest weakness is the possibility that someone will figure out how to build a better search engine -- and there's many who bet the way to do that is to make search involve more of a human touch and less of a machine's".

The above line reminds me of the "Sweet Spot" concept taught in the class

http://www.cnn.com/2011/08/30/tech/web/google-ranking-plus-results/index.html?hpt=te_bn5 is the URL to the article.

Thanks
Abhishek

Tuesday, August 30, 2011

Mooers' Law


I came across an interesting law and thought of sharing it. 

Mooers’ Law:

"An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have it." -- Calvin N. Mooers 
                  and Wikipedia

Web Crawling

Questions:
What are some of the challenges with crawling the web?
How can a web developer make his/her page get better rankings?

Some challenges I found:
Because bandwidth is finite and costly, it is in a crawler's best interest to crawl the most important content as efficiently as possible. With all sorts of server-side generated content being served by many websites, avoiding duplicate content becomes more difficult (e.g., many GET parameter combinations that result in the same data). Spider traps can also be a severe problem. Take, for example, a web calendar. If the spider were to follow the links on the calendar, it would keep crawling to the next day, the next year, and so on, without end.
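To make the two problems above concrete, here is a minimal sketch of how a crawler might guard against them. It is only an illustration under stated assumptions: the names `canonicalize`, `should_crawl`, `IGNORED_PARAMS`, and `MAX_DEPTH` are made up for this example, and real crawlers use far more elaborate rules.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: a hypothetical list of query parameters that do not change the page content.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "ref"}
# Assumption: an arbitrary cap on link depth, used as a crude escape from spider traps.
MAX_DEPTH = 5


def canonicalize(url):
    """Normalize a URL so that equivalent GET-parameter variants compare equal."""
    parts = urlsplit(url)
    # Drop ignorable parameters and sort the rest, so ?a=1&b=2 equals ?b=2&a=1.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(query), "")
    )


def should_crawl(url, depth, seen):
    """Return True if the URL is worth fetching; remember it in `seen` if so."""
    if depth > MAX_DEPTH:
        # Too many hops from the seed page: possibly a trap such as an endless calendar.
        return False
    key = canonicalize(url)
    if key in seen:
        # Already fetched an equivalent URL, so skip the duplicate.
        return False
    seen.add(key)
    return True
```

A depth cap is a blunt tool: it stops the endless "next day" links of a calendar, but it can also cut off legitimate deep pages, so the limit would need tuning per site.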

Web page rankings:
There is an entire topic called SEO (search engine optimization), which in a sense is reverse engineering search engines to get better rankings, and which is too large to cover here. The typical tips you hear about are avoiding duplicate content, getting more inbound links, etc.

Inverted Index for storing bag of words



Creating a bag of words plays an important role in retrieval. There are a lot of documents on the web. Storing and scanning a separate bag of words for each document is not a good idea, as it increases the time and space costs of a search. So many search engines make use of the inverted index data structure.
The image above illustrates how a bag of words is created.

The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the database. It is the most popular data structure used in document retrieval systems, and it is used on a large scale, for example, in search engines.
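As a rough illustration of the idea, here is a small sketch of an inverted index that maps each word to the set of documents containing it. The function names (`tokenize`, `build_inverted_index`, `search`) and the sample documents are assumptions made for this example; a real search engine would also store positions and frequencies, and would handle stemming and stop words.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Split text into lowercase words, stripping surrounding punctuation."""
    words = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in words if word]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample collection, for illustration only.
docs = {1: "The quick brown fox", 2: "The lazy dog", 3: "Quick dog tricks"}
index = build_inverted_index(docs)
```

With this structure, answering a query only touches the entries for the query terms instead of scanning every document; the extra work happens when documents are added to the index.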

I think we will use this data structure in our search engine project.

Regards.
Rajasekhar.

Abe Lincoln as an awesome IR engine..

Here is my all-time favorite IR sketch.. written by Woody Allen, about a putative incident from Lincoln's
life. Read it with today's IR discussion in the back of your mind, and you will enjoy this already funny sketch
even more..


Feel free to blog-comment on any IR connections you made. 

Rao

=====
  "like Reduced Shakespeare Company: The better you know the original,
the funnier it gets"

--LA Times review of "Dave Barry Slept here, A
sort of history of United States"

Information collection


I was trying to understand a few basic questions:
a. Does a search engine query the entire Web whenever a new search is done, or does it keep a copy of all the websites and search that copy?
b. How does the search engine know when a new webpage or document is added, or when existing documents are updated, somewhere on a remote web server? How does it update itself?

I found this video helpful to answer the above questions http://www.youtube.com/watch?v=RLyKLo6StLg

Google, for example, has programs which it sends out periodically to the web to find new pages or documents. These in turn return with the words used in each page. Google has a big database where it maps the words to the webpages that contain them. So when a user does a search, Google looks into this database and finds the webpages matching the text entered by the user.

Regards
Bharath

regarding missing tweetnotes...

If you sent a tweetnote and didn't see it posted to the blog, don't re-send. There seems to be a hiccup in having them posted--I will fix it (your mails themselves are safe--they just didn't get posted to the blog).

rao

[cse494] Readings for this week: Text retrieval

[This was sent originally last week--but before the class population stabilized. So apparently several people didn't get this mail. I am re-sending this. Fortunately, the readings will start becoming more useful only starting now.]

Rao


---------- Forwarded message ----------
From: Subbarao Kambhampati <rao@asu.edu>
Date: Mon, Aug 22, 2011 at 6:55 AM
Subject: [cse494] Readings for this week: Text retrieval
To: Rao Kambhampati <rao@asu.edu>


We will start discussing Text retrieval this week. You can either start reading the first link in the readings for text retrieval
 (in that order)



Rao





shifted office hours today...2:30--3:30

Due to another conflict, I cannot hold the office hours from 1-2pm today. Instead, I will be available 2:30--3:30pm

You can also try skype if you don't want to trudge in the sun. My skype id is subbarao2z and I will keep it on 

rao

" The Filter Bubble"

Hi All,
Here is an interesting TED video in which Eli Pariser shares his views on the danger of personalized search, which might eventually narrow your world view.



---Abilash

Sunday, August 28, 2011

Homework 0: Linear algebra refresher

Folks:

 I posted a linear algebra refresher homework (you can reach it from the "home work" menu item of the web page; the 

You are highly encouraged to complete the homework as soon as possible since most of our lectures will assume you remember these things. 

regards
Rao

Saturday, August 27, 2011

Page Rank

Unable to get a PageRank for my sample webpage, I tried to understand how exactly PageRank works and found an interesting article that explains it.

Highlights:

* PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine.

* The PageRank process has been patented, and the patent is assigned to Stanford University, not to Google!

* It is a numeric value that represents how important a page is on the web.

* The PageRank of a webpage also depends on the PageRank of the pages that link to it.

For more details:
http://www.webworkshop.net/pagerank.html
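
To see how a page's rank can depend on the ranks of the pages linking to it, here is a minimal sketch of the commonly described iterative PageRank computation. This is not Google's actual implementation: the damping factor of 0.85, the fixed number of iterations, the even handling of pages without outgoing links, and the tiny example graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as {page: [pages it links to]}."""
    pages = set(links) | {target for targets in links.values() for target in targets}
    n = len(pages)
    # Start with the same rank for every page.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of its inbound links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: D links to C, but nobody links to D.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(example_links)
```

In this simplified model a page divides its rank among all of its outgoing links, so a single link from a high-rank page that links to a great many pages passes on only a small share.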

Questions:

* Why does a webpage not get a PageRank immediately?

* If the PageRank of a webpage depends on the PageRank of the pages linking to it, can I get a high PageRank for my sample webpage by linking to it from my Facebook profile? (Facebook has a PageRank of 10/10.)


Friday, August 26, 2011

Retailing services

The professor talked about Big Idea 2:

"How do you “incentivize” people into letting you steal their brain cycles? Pay them! (Amazon mturk.com ) Make it fun (ESP game)"

The "pay them" option looks good, but the issue is checking the integrity of the results produced. So I wondered: how about a retailing-service kind of system, where you go exactly to the place where you can get answers to your questions, and the website takes your questions and passes them to an expert panel, who answer them? In this way the user, who is also a consumer, pays a certain amount X; part of it goes to the website that handles the request and part of it goes to the expert panel.

One of the websites I saw recently was "www.awaaz.de", where farmers can ask specific questions about which fertilizers to use, crop cultivation techniques, etc. It mainly focuses on Indian farmers (most of them are not computer literate, but they own mobile phones and can call any number), as they are less equipped (sometimes financially, and sometimes in farming knowledge) than their international counterparts.

I was astonished to find that farmers at my native place had no idea about Jatropha, which is used for making bio-diesel. If you read more about Jatropha, you will come to know that it is a "raining money" plant. So a system that gives very specialized answers to the queries of a very specific community is a nice idea. And yes, in my own experience, the answers to all questions are not present on the internet, but they are surely present with the people or community involved.

Insight into Google's Search Algorithm

Hello All,

The mystery around google search:

1) 500 improvements are made to the search algorithm every year to get the best results.

2) Sandbox -- live experiments are conducted on real users; a very small fraction of the actual Google traffic is sent through a sandbox to calculate the metrics.

3) Spelling suggestions (full page replacement) -- the user can use the escape hatch when the search suggestion goes wrong; the escape hatch would pass the test only if users clicked it 1 in every 50 times.

Every commentator in the video (refer to the link below) emphasizes at the end that they make these changes with the user in mind (which makes me wonder whether this is the reason our searches are no longer private, and why we have so-called personalized searches).

Thursday, August 25, 2011

Re: Experimental collective "tweet notes" for the class..

Here is another ground rule
 
 Sign your note (you can use a short name).

This is because the blog won't keep the email header information. 

rao








Experimental collective "tweet notes" for the class..

Folks:

 Given the twists and turns of the lectures, where we wind up discussing much more than what is explicitly on the slides, 
 I often wonder what points "stick" and which get blurry or washed out. 

I was thinking that it would be cool for people to be able to "tweet" short points they grasped so that the collection of
tweets can be viewed as a representation of what got through. 

For various technical reasons (including the fact that twitter doesn't quite allow us to have persistent searchable topics for a semester long
period), we can't quite use twitter service. 

So, I set up an alternative service. If you send a short mail to the address "494tweetnotes" at gmail, then whatever you send will be archived on the blog I set up expressly for this purpose:  http://tweetnotes-cse494-f11.blogspot.com/ 
 (this is different from the class blog--it is mostly meant to be read only. )  

Here are the ground rules:

Subject line should be the date of the class (e.g 8/25/2011) and the body should be a short point (in the spirit of 
short notes, I suggest that you keep each mail for just one point. You can send multiple mails). 

The best time to send these is close to the class time, when what you heard is still fresh. 

You will see a seed tweetnote I posted.

Note that this will not be curated--i.e. if the point you got is wrong, then I am not going to correct it. 

Also, the entries will only be posted on the blog and won't be sent to you by email (if you want you can "subscribe" to 
the feed or some such). 

Let's give it a shot and see if we can make collective notes. 

yours experimentally
Rao



Cuil

Hi,

During the first IR class, Prof. Rao spoke about what we want the web to do. One of the points discussed was privacy, and there was a long discussion about its importance.
I was going through a random list of search engines that currently exist, and I was surprised to realize that my usage of search engines is limited to Google, Bing and Altavista :( Going through the list, I found an interesting article about the long-dead search engine Cuil.
A couple of features of the search engine that caught my attention were:
1. Its creators were ex-Google, IBM and Altavista employees, and it was termed a "Google-Killer" (fancy name!)
2. User privacy is one of the arenas where Cuil excelled.
I quote directly from one of the articles:
“The one area where Cuil excels is user privacy. Whereas Google stores user-specific searches for up to 18 months, Cuil never stores personally identifiable information or search histories.”
I wonder how genuine this feature is and, if true, why Google is not adopting it? (Just a cosmic question.)
3. Interestingly, Cuil's demise was one of the fastest amongst startup firms. Why did it happen? Due to the excessive load of users, the servers crashed within hours of the launch, and the company eventually shut down.

I found this interesting and thought of sharing it. Here's the link to the article, if you are interested :)
 http://www.time.com/time/business/article/0,8599,1827331,00.html


Class audio/video available.. (+ stan

As some of you might have noticed, the audio and video recordings of the classes are available along with the powerpoint slides
at the class home page (http://rakaposhi.eas.asu.edu/cse494).

The audio is in .WAV form and is typically below 20 MB. The video is in .mp4 form and is about 5 GB.

Typically, these are posted pretty much right after the class. 

If you miss a class (which, you should not, to the extent possible) and/or want to review what happened, you are encouraged to use these. [There are no immediate plans to make the 
videos smaller or streaming--think of the download time as the price to pay if you miss the class.]

(For the video recording, I am still looking for a more active volunteer--  who, for example, is willing to pan the camera to the board if
I move to the board for extended periods. )

regards
Rao



Welcome to the class mailing list



Dear all:

  You are getting this as you are (*still*!) registered for CSE494/598. 

This mailing list will be used for class-related announcements. Everything sent to this
list will also be archived at the mailing list archive (http://rakaposhi.eas.asu.edu/f11-cse494-mailarchive/threads.html ) and
posted to class blog (http://cse494-f11.blogspot.com/ ). Both of them are conveniently accessible from the class home page.

Speaking about the "class blog", you should have received invitations to join the blog (Yay to Nathan Briscoe, who is the 
first to accept the invitation... the party can start now).   Once you register, you can post to the blog anything you deem relevant/related to
the class--including questions about the class that you want either me, the TAs or the fellow students to weigh-in on. 

I will occasionally post to the blog inviting responses (which you have to post as comments to those articles--see 
http://cse494-s10.blogspot.com for the blog from the last offering). 

see you in the evening

regards
Rao
------
Subbarao Kambhampati