Thursday, September 29, 2011

Wednesday, September 28, 2011

Responses to a few common concerns from Class survey

Here are some responses to a few common concerns from the class survey:

1. videos: I understand that videos can be compressed and even be
streamed. However, since I am doing the videos on my time, the only
way I can afford the time is if they cost me very little time-wise. 
As of now, when I go back from the class, I just plug in the flip
recorder, rename the files, and copy them over to the homepage (which
takes some 15min). Anything more involved is likely to just not get
done.  The college does do professional videotaping--but that is
limited only to online classes. Although online classes are more
remunerative, I prefer to do classes in a classroom setting. All of
this is a long-winded way of saying "don't look the gift horse in the
mouth" ;-)

Also on videos, the best way to view them would be to download them
onto your laptop while you are at school (or on some highspeed
network) and watch them later (vlcviewer is a very good platform
independent viewer)

2. accent/talking speed etc: I am sympathetic to these issues, but I
don't see much point in me telling you that I can change this
drastically. At some point, it becomes hard to teach new tricks to old
dogs ;-). You will have to make-up using the audio/video (or, since
you are still young dogs, teach yourself the trick of following fast
speakers ;-)

3. Tweetnotes: The intent of tweetnotes is not so much that they are
curated summaries of the classes, but that they are uncurated
impressions of each class. They are most useful for me in terms of
figuring out what points got through to students. 


Tuesday, September 27, 2011

Class survey results

Thanks to all of you for taking the time to fill the survey!

The survey results are available at

These are basically untouched (other than putting in html format). 

The results are of most relevance to me (and I have read them and noted your concerns). I am sharing them with you mostly so you know whether your particular take on the class falls anywhere close to the median view. Since I clearly can't run 53 different classes for the 53 students, I will have to look at the majority/median concerns. 


added a page-rank/authorities-hubs problem to the homework. It will be due next Tuesday at the beginning of class


Office hours shift for today: 2:30--3:30pm

I will hold office hours from 2:30 to 3:30 today


Monday, September 26, 2011

Sunday, September 25, 2011

Reminder: CSE 494/598 Class Survey: Please participate..(by next class: 9/27)

Please ignore if you are one of the 13 that have already participated..


---------- Forwarded message ----------
From: Subbarao Kambhampati <>
Date: Thu, Sep 22, 2011 at 8:55 PM
Subject: CSE 494/598 Class Survey: Please participate..(by next class: 9/27)
To: Rao Kambhampati <>


 Here is a brief class survey.

This is your chance to provide anonymous feedback on how the class is going for you. 

The poll will be closed by 9/27; I will compile the results and mail them to you, and will discuss/course-correct
any issues that are brought up. 

Please participate (but only once per person ;-)


Friday, September 23, 2011

Markov Chains for next week...

In preparation for the discussion of link analysis next week, it would be very useful to refresh your memory about "Markov Chains".

The wikipedia page  is a pretty good start.. Read at least to the end of section 3.


Thursday, September 22, 2011

Santorum: the latest of the google bombs.... ;-)

Check this article (from yesterday!)

Santorum asks Google to clean up search results for his name

Rao (kicking himself for not finding this before today's class..)

CSE 494/598 Class Survey: Please participate..(by next class: 9/27)


 Here is a brief class survey.

This is your chance to provide anonymous feedback on how the class is going for you. 

The poll will be closed by 9/27; I will compile the results and mail them to you, and will discuss/course-correct
any issues that are brought up. 

Please participate (but only once per person ;-)


Re: CSE 494/598 Question

[I am answering this on the blog rather than in class because I want to use the class for more 
immediate issues in understanding LSI--which, for example, prevent you from doing the homework problems.
These questions are considering the higher order consequences of LSI.]

On Thu, Sep 22, 2011 at 6:25 AM, Shubhendra Singh <> wrote:
I have two questions

1. How does LSI tackle errors in text like spelling errors etc., or are they ignored?

LSI doesn't directly correct spelling, however, its analysis might be able to tell it that "computer" and "copmuter" are functional synonyms (like "computer" and "science") because they seem to occur in the same context. 

2. How the scenario changes when new phrases are added for example, lets say "cloud computing" is the new buzz word. I think few years back this word might not have been tokened but there must be several pages on "cloud" and "computing" separately. So do the things change in the same way as you showed in class for two documents "D1" and "D2" which were added in the with 50 occurrences of Database and Sql respectively?

If the semantics of the usage have changed over time, then unless you do LSI on a corpus reflecting the new usage, you won't capture the new semantics. (Consider, for example, running the query "computer" on a corpus of documents that were pre-1900--you will get a lot of information about human computers--since "computer" meant the human who took up a job to calculate/compute[

If "cloud computing" was not used in the original document corpus  with the modern meaning, and you are not recomputing the SVD (i.e., you are getting by with q*TF approach to convert the query), then LSI can't capture the new meaning of cloud computing. 


thank you,

Tuesday, September 20, 2011

Homework questions on LSI (and association clusters) posted

I posted the homework questions on LSI and association clusters (as Homework 2 at )

One way you can figure out whether you understand LSI or not is to try and to the middle two LSI problems (and if you run into difficulties, you can
then ask me questions for next class).


Online Gaming, Science and Computation


*optional* pre-class help/discussion sections..

I will try to get to the class room closer to  4:15pm and will be happy to discuss/answer any questions related to the class
until 4:30 (when the real class starts).  This is entirely optional. Nothing *required* will be done until 4:30.


Saturday, September 17, 2011

[CSE 494] Term appears twice in the index

  A number of you have emailed me asking why sometimes there are two occurrences of a term in the index.
  In addition to indexing the contents of a document, lucene also indexes metadata like the title, url etc. When terms appear in these metadata, they are indexed separately, hence some terms occur twice. I suggest you either pick the first occurrence of the term only, or check the term.field( ) property (and only consider those terms that return "contents").
Thanks and Regards,
Sushovan De

clarification re: tweetnotes...


The tweetnotes are meant to be short points you got from the class lectures, written in your own words. 

Please avoid copy-pasting possibly related stuff from the web..


ps: Another caveat is that given the "crowd-sourced" nature of these notes, their accuracy is not guaranteed.
I do look at what gets posted, and press +1 on notes that I think capture some nontrivial details well.  I can only attest
to the accuracy of these notes. There do exist other notes that I know are wrong--but there is no -1 button unfortunately..

GOOGLE concatenates multi-page websites into one single view all page, to avoid the annoying paging mechanism

"Google Will Provide Single-Page Versions of Content in its Search Results: Instead of returning paginated content that makes you click next several times on a page, Google will point searchers to a single-page version of the content if one is available."



Friday, September 16, 2011


His is a link to the radio show NPR Science Friday that interviews author Steven Levy from Wired Magazine.

It gives a brief overview of google history. Very casual, non technical. Get's better later inutes in. Discusses the Filter Bubble Problem (20min) and the legal aspects of search results/ranking being Opinion only. Just Disregard ignorant/ paranoid callers. After the analysis of search & ads, the comments at the end are directed toward getting HIRED by Google.


The Plex: How Google Thinks, Works, and Shapes Our Lives by

Steven Levy

Google's motto "Don't be Evil"

**PLEASE Disregard** Fwd: [cse494-f11] Time management lecture link...

This mail was supposed to be sent to my ASU 101 class and I seem to have sent to the 494 instead.

You can disregard the mail and the deadline.


---------- Forwarded message ----------
From: Subbarao Kambhampati <>
Date: Fri, Sep 16, 2011 at 10:01 AM
Subject: [cse494-f11] Time management lecture link...

Here is the link to the Randy Pausch time-management talk:

Please watch it, and write a short summary/response on the blog as a comment to this post (deadline is 9/30). 


Posted By Subbarao Kambhampati to cse494-f11 at 9/16/2011 10:01:00 AM

Time management lecture link...

Here is the link to the Randy Pausch time-management talk:

Please watch it, and write a short summary/response on the blog as a comment to this post (deadline is 9/30). 


Thursday, September 15, 2011

Clarification: Homework 1 - raw term frequencies

A number of you have emailed asking for clarifications regarding what "raw term frequency" means in the context of Homework 1, Question 2.

As mentioned in the question, by "raw term frequency", it means that you must not normalize the TF. So, the term frequency of d1 for the term t3 is 9, and not 9/|d1|.

When computing the tf/idf, you must multiply this raw term frequency by the idf. If you do so, you will get the answers as mentioned in the slides.
Thanks and Regards,
Sushovan De

(purely optional) proof of the relation between eigen decomposition and low-rank approximation

This is from the readings:

On the issue of holding precision constant over time--a slight modification...

So while Jadiel is right that in general precision can't be held constant w.r.t time, there are two important exceptions--namely precision at 1 and precision at 0. 

If you only give the relevant answers and nothing but the relevant answers, then your precision remains at 1 at successive periods. Similarly, if you only give irrelevant answers and nothing but irrelevant answers, your precision remains at 0 at successive periods.


You can skip relevance feedback question. e: CSE 494 H/W

Since we havent formally covered relevance feedback, you can skip it.  


On Thursday, September 15, 2011, Mary Forsberg <> wrote:
> Hello Dr. Rao,
> In calculating the relevance feedback using Rocchio
> method for part 4 of problem 2, do we consider
> the document collection to be the same
> original 10?
>   Or is the collection of relevant documents {d4,d6}
> irrelevant documents {d4}
> Additionally, is a negative value in the Query vector
> set to zero in this feedback model?
> Also, do we need to use the alpha, beta, gamma values in the
> Rocchio equation.
> Since the table of tf/idf vectors was incomplete
> I wondered if I misunderstood the problem
> parameters.
> Thanks for any clarification,
> Mary

Clarification: Homework 1 - raw term frequencies

A number of you have emailed asking for clarifications regarding what "raw term frequency" means in the context of Homework 1, Question 2.

As mentioned in the question, by "raw term frequency", it means that you must not normalize the TF. So, the term frequency of d1 for the term t3 is 9, and not 9/|d1|.

When computing the tf/idf, you must multiply this raw term frequency by the idf. If you do so, you will get the answers as mentioned in the slides.

Tuesday, September 13, 2011

[Thinking Cap] Directed review of linear algebra

Your ability to appreciate the finer points of the following lectures depends on your background in linear algebra. To that extent, here is a directed review of linear algebra.

0. Convince yourself that matrix multiplication is essentially a bunch of dot products 
between row vectors of one matrix and the column vectors of the other matrix. This should also give another reason as to why the inner dimensions of the matrices must match before you can multiply them (since you can't define the dot product between two vectors of differing dimensions)

1. Remember the notion of linear dependence. A vector is considered linearly dependent on another set of vectors if it can be written as the linear combination of that set of vectors. Specifically as c1v1+c2v2+..ckvk where vi are the vectors and ci are scalar constants. Show to yourself that the vector <6, 6> is linearly dependent on the vectors <2, 1> and <1, 2>

2. Remember the notion of a space spanned by a set of vectors. Given a set S of vectors, the space spanned by S is the set of all vectors which can be written as the linear combination of the vectors in S. 

3. Remember the notion of linear independence. A set of vectors is linearly independent if none of the vectors in that set can be written as the linear combination of the rest of the vectors. 

4.  Remember  the notion if "basis" for a space. A set is considered a basis for a space if (a) the set is linearly independent and (b) every vector in that space can be written as a linear combination of the basis vectors. The "dimensionality" of a space is the size of its basis set. 

5. Remember the notion of "orthogonal basis" for a space--which is a set of vectors that  is a basis *and* the vectors are orthogonal to each other (i.e., their pairwise dot product vi*vj is 0 if i != j

6. Remember the notion of "orthonormal basis" for a space--which is a set of vectors that forms an orthogonal basis *and* the vectors are all unit vectors.

7. Remember that the row-rank of a matrix is the size of the basis set of the space spanned by the row vectors of the matrix. The column-rank of a matrix is the size of the basis of the space spanned by the column vectors. The Row and Column ranks of a matrix are always equal and this number is the rank of the matrix. The matrix is considered "full rank" if its rank is equal to its number of rows and number of columns (so full rank matrix has to be a square matrix)

Now use this knowledge to answer the following to yourself (feel free to enter answers on the blog). 

Q1. For three dimensional euclidean space, is the set [ <1, 0, 0>, <0, 1, 0>] linearly independent? Is it a basis? 

Q2. For 2-dimensional euclidean space, is the set [<1, 1>, <1, 2>]  linearly independent? Is it a basis? Is it an orthogonal basis? Is it an orthonormal basis? 

Q3. For two dimensional euclidean space, how many different basis sets can you get?  How many of them are orthogonal bases? How many of them are ortho-normal bases? 

Q4: If a matrix is full rank, then are its column vectors linearly independent? how about is row vectors? Are they also orthogonal?  

That is it for now..


relation between the "rank" of a matrix and the maximum number of linearly independent  rows (or columns). Note that a set S of vectors is considered linearly independent if no member of that set can be written as a linear combination of the 

Project help office hours by Sushovan Mon/Thu: 3-4pm

Sushovan will hold office hours Monday and Thursday 3-4pm until further notice.
He sits in the lab right across from my office.


1 trillion pages on the web

Read an interesting article today about the estimated number of pages on the web (rough estimate - 1 trillion)

A group called the World Wide Web foundation (Founded by Tim Berners-Lee - inventor of WWW and funded by google) is going to determine the exact number of pages on the web.


A History of Search Engines

The above link talks about the brief history search engines.


Google Slips

I came across an interesting news today about the search engine giant Google. Google search has decreased and Microsoft Bing is gaining momentum. You can read more from here


Change in office hours today: 2:30--3:30

Please note the change just for today


Sunday, September 11, 2011

Re: [CSE 494 Information Retrieval] Clarification regarding synopsis in search results

If you are returning a document d as a result to the user query, the snippet is supposed to convince the user 
that d is indeed relevant to the query. To do this, it has to (a) show where the text of the query occurs in the document
and (b) what are the distinguishing features of this document.  "b" can be satisfied by listing terms in the document with high idf values
(as these are the parts of the document that are its most distinguishing characteristics) 


On Sun, Sep 11, 2011 at 6:07 AM, Ganesh Krishnamurti <> wrote:
Dear Prof Rao,

You mentioned on Sep 6th lecture that IDF is also used in displaying the synopsis with the search results. Unfortunately I am not getting this point. It would be nice if you can explain this to me more elaborately.


Saturday, September 10, 2011

Importance and Beauty of Structuring data

Hii All,
Here's is an interesting ted video where the speaker beautifully epitomizes the importance and advantages of structuring data, and how a good design can unveil unseen patterns and connections.

Friday, September 9, 2011

Google Servers?

Websites all around the world maintains many servers to handle traffic. Heavy traffic may lead to website crashes. So i would like to give a small idea about the number of servers the worlds best search engine maintains.
Google is a very large service that requires a huge amount of server power to run and serve queries quickly. As well as running the search results, Google also runs a number of services such as Adsense, Google Docs, Google News, Trends, Gmail etc.... The image below was designed to give you some impression of how many servers Google actually has.


for clear image please check this image.

Saturday November 5th Desert Code Camp

Since most of us are in ASU's CS programs, I'm sure you're all focused on *nix systems.

If your interested in some newer programming topics, Chandler Gilber Community College is hosting a second Desert Code Camp on Saturday the 5th of November. It's an all day, multi-track event with FREE breakfast and lunch.

The first Camp last April was very informative. Select just the topics that interest you.

Presently I show talks on OpenGL graphics, jquery for webforms, and intro to Objective C into for anyone wanting to develop on mac/iphone/ipad, plus many others. The last camp had 30 or 40 different lectures.
So if you want free food. I know I do! Nov 5th is the day.

PS there was a bunch of local tech companies hanging around--a great opportunity for professional networking!

Reminder about the distinction between the class blog and the tweetnotes...


 I notice that a couple of you posted stuff to tweetnotes blog that is not a summary of what was said in the
class, but rather other interesting things you came across and wanted to share.

Please note that the place for general (and longer) postings is the class blog. Tweetnotes should be restricted to
short summary points you got from a particular class (and should be signed with your name).


Homework 1 assigned; due in class on 9/20


 Homework 1 is ready; you can reach it from the homepage, or use the following direct link:

The homework is due, in hard copy, on 9/20 in class


Wednesday, September 7, 2011

Project part 1 released. Due 9/27--but you should try to get done by 22nd...

Project 1part 1 is accessible at

Please get started early. 

We keep  this part simple--but since the other two build on top of this part, if you don't have this one, you will be held back
on those parts. 


Something to think about: Similarity/Distance Metrics

We saw several similarity metrics for documents these last two classes. A question is will any old
function be considered a similarity function or whether there are some minimum disiderata. 

You should keep in mind that similarity is really a measure of "distance"--the more distant two
objects are from each other, the less similar they are.

This brings us to desirable properties of distance (and thus similarity) metrics.

The typical required properties of a distance metric are given below (from wikipedia entry).

You want to check to see if each of the metrics we have seen until now--jaccard, euclidean, cosine-theta--satisfy all these properties. 

In the literature, people do consider "distance" functions that do not satisfy all these properties. Typically, the first one to be sacrificed is 
triangle inequality property.  Sometimes even the symmetry property is given up. An example of a distance metric that is asymmetric is 
the distance metric between two probability distributions--called KL-divergence (see )

metric on a set X is a function (called the distance function or simply distance)

d : X × X → R

(where R is the set of real numbers). For all xyz in X, this function is required to satisfy the following conditions:

  1. d(xy) ≥ 0     (non-negativity)
  2. d(xy) = 0   if and only if   x = y     (identity of indiscernibles. Note that condition 1 and 2 together produce positive definiteness)
  3. d(xy) = d(yx)     (symmetry)
  4. d(xz) ≤ d(xy) + d(yz)     (subadditivity / triangle inequality).


Tuesday, September 6, 2011

Readings for next class

We will do indexing/retrieval issues next class. The relevant readings are under "Distributed Indexing" topic of the readings--reproduced below
(note that the indented one is optional and is a brand-new paper from google folks on the indexing issues faced by search engines)


Thursday, September 1, 2011

Gentle reminders regarding attendance and attentiveness

This is just a gentle reminder, for those of you who need it (see below), that attendance and attentiveness are not optional but mandatory for this class.

This particular reminder is triggered by (1) the sighting of a couple of people staring into their laptops and/or cell phones all through the class (you know who you are; and thanks to the roster mug shots, so do I ;-)  and (2) the apparent absence of  about 10 people who are currently shown to be registered and did not  sign the attendance sheet today. 

Regarding 1, while I try my best not to single out people during the class to avoid embarrassment and  disruption to the class flow, it does distract me. Regarding 2, remember that you are supposed to let me know *beforehand* if you must miss the class for unavoidable reasons.

thanks for your cooperation


Same query Same results

I was trying to google something and was not able to find relevant results. Surprisingly I tried the same query again after a couple of days and got the same set of results. I was logged-in in gmail both the times. I was wondering why google search is not learning that I don't want these results; I want something else, something different.

When in last IR class (8/30/2011) we discussed the factors contributing to the relevance number R(.) of a document, I found that it very well depends upon the already retrieved documents. I questioned, then why google or any other search engine has not yet implemented it.

While I was googling about the same, I learned that it is difficult to completely remove the documents which are already shown, (and not judged by the user,) when the user re-enters the same query. This is because those documents are actually relevant and they are ranked based on their relevance to the query.
As we very well know that user is not sure what he is looking for, he is not inclined to read certain results. That doesn't imply that next same search should eliminate those documents. The maverick user might find the same set of results useful some other day (in some other light :-) ).

On the other hand, user might be aware of certain results and sure that they are definitely not relevant to his needs and hence never bothered to visit them. In such cases, he might find it frustrating to get the same results.

Hence, how to find "sweet spot" in such cases? If time permits, we can discuss this in tomorrow's class.