Thursday, October 27, 2011

Homework on clustering/classification/recommendation systems assigned

It will be due on Nov 10th (2 weeks from today).

Rao

No office hours today

Sorry for the late notice but I am at lunch with the visitor and won't be available for office hours

Rao

midterm marks

Folks

 Here are the midterm marks, listed by posting ID. I will return the graded exams in class today.

Rao

Graded out of 65 points
Average for 494 students: 44
Standard deviation: 11
Max: 57.5; Min: 27

Average for 598 students: 46
Standard deviation: 10.22
Max: 63; Min: 14


Undergrad
Midterm (65)
4257-016
51.5 0076-297
48.5 2980-612
7246-789
49 4620-483
40.5 7825-853
27 2875-514
31 7467-586
57.5 6692-410
54 1598-032
8943-657
7581-369
7137-693
GRAD
47.5 4171-660
33 0833-776
32.5 9676-570
46.5 3681-882
51.5 2635-502
50 2340-625
30.5 0196-819
61.5 8103-075
35 3613-980
47.5 9374-305
60 5365-064
53.5 5355-367
41 0233-710
46 5552-935
22.5 6370-935
37 8290-753
63 9661-342
45 2987-507
50 3633-513
56 6311-340
42.5 8207-143
52.5 1942-482
53 1864-476
40 2855-634
51.5 5497-814
52 9400-307
54 7074-143
44 7106-112
56 3236-450
58.5 8487-449
45 2671-439
14 6013-730
52.5 0062-742
34.5 6622-345
46 5146-793
59 9522-612
51 5068-787
35.5 4967-738
51.5 1847-585
45.5 6146-759
45 5929-040
45 4074-088
43.5 3661-462

Wednesday, October 26, 2011

[Reminder: Talk relevant to the last 1/3rd of CSE 494] Invited Seminar: From Data to Decisions (Ullas Nambiar, IBM IRL; 10/27 11AM; BY 420)



---------- Forwarded message ----------
From: Subbarao Kambhampati <rao@asu.edu>
Date: Tue, Oct 25, 2011 at 6:36 AM
Subject: Invited Seminar: From Data to Decisions (Ullas Nambiar, IBM IRL; 10/27 11AM; BY 420)
To: Rao Kambhampati <rao@asu.edu>


From Data to Decisions 

  Folks:
     This talk is relevant to the topics we will cover in the last 1/3rd of the CSE 494 course.  --Rao


                                                                       Dr. Ullas Nambiar
                                                                   IBM India Research Labs

 11AM. Thu, October 27, 2011.  BY 420  



Abstract: 

  Creating a single view of entities of interest is increasingly becoming a requirement for effective insight generation about product/service quality and customer expectations. The explosive growth of data both within an enterprise and outside it (Web & social media) has made the task of creating a unified view challenging and interesting. In this talk, I will show how Big Data impacts Single View Creation, present use cases showing that the problem is not just a business problem but also arises in areas of common interest like better traffic management and public safety, and then give a deep dive into a solution that augments enterprise data by extracting relevant information from Web sources.


Bio: 
Dr. Ullas Nambiar is a researcher in the Information Management & Analytics group at IBM Research India. He received a PhD in Computer Science in 2005 from ASU. His research focuses on data integration, analytics and retrieval over distributed heterogeneous sources. He is a Senior Member of ACM. More details are available at
www.research.ibm.com/people/u/ubnambiar


Tuesday, October 25, 2011

Invited Seminar: From Data to Decisions (Ullas Nambiar, IBM IRL; 10/27 11AM; BY 420)

From Data to Decisions 

                                                     Dr. Ullas Nambiar, IBM India Research Labs

 11AM. Thu, October 27, 2011.  BY 420  



Abstract: 

  Creating a single view of entities of interest is increasingly becoming a requirement for effective insight generation about product/service quality and customer expectations. The explosive growth of data both within an enterprise and outside it (Web & social media) has made the task of creating a unified view challenging and interesting. In this talk, I will show how Big Data impacts Single View Creation, present use cases showing that the problem is not just a business problem but also arises in areas of common interest like better traffic management and public safety, and then give a deep dive into a solution that augments enterprise data by extracting relevant information from Web sources.


Bio: 
Dr. Ullas Nambiar is a researcher in the Information Management & Analytics group at IBM Research India. He received a PhD in Computer Science in 2005 from ASU. His research focuses on data integration, analytics and retrieval over distributed heterogeneous sources. He is a Senior Member of ACM. More details are available at
www.research.ibm.com/people/u/ubnambiar

Saturday, October 22, 2011

[CSE 494] Clarification about project - A/H

A number of you seem to have some questions about the authorities and hubs computation in Project part 2. In short, this is what you have to do:
 
Step 1: Find the top-k results from TF/IDF. Call that your "root set".
At this stage you have 10 documents.
 
Step 2: Find all the documents that the root set points to and is pointed to by. Call that your "base set".
At this stage you will have, say, 80 documents.
 
Step 3: Create the adjacency matrix for these 80 documents.
The size of the adjacency matrix would be 80 x 80. You will need to make more calls to LinkAnalysis to populate this matrix.
 
Step 4: Create an initial authorities vector and an initial hubs vector. Then use the techniques from the slides to iteratively compute the next authorities and hubs values. Remember to normalize after every iteration. Repeat this until it converges.
You can test convergence by checking whether the sum of the squares of the differences between the current values and the previous values is less than some threshold you choose (see the sketch after Step 5).
 
Step 5: Print out the top-N authorities and top-N hubs.
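For concreteness, here is a minimal sketch of the iteration in Steps 4 and 5 (written in Python/numpy just to show the math; the project itself is in Java, and the function and variable names below are purely illustrative, not part of the project code):

import numpy as np

def hits(adj, n_top=10, eps=1e-6, max_iter=1000):
    # adj[i][j] = 1 if page i in the base set links to page j (the matrix from Step 3)
    n = adj.shape[0]
    auth = np.ones(n)
    hub = np.ones(n)
    for _ in range(max_iter):
        new_auth = adj.T @ hub                 # a page is a good authority if good hubs point to it
        new_hub = adj @ new_auth               # a page is a good hub if it points to good authorities
        new_auth /= np.linalg.norm(new_auth)   # normalize after every iteration
        new_hub /= np.linalg.norm(new_hub)
        # convergence test: sum of squared differences below the chosen threshold
        if np.sum((new_auth - auth) ** 2) + np.sum((new_hub - hub) ** 2) < eps:
            auth, hub = new_auth, new_hub
            break
        auth, hub = new_auth, new_hub
    return np.argsort(-auth)[:n_top], np.argsort(-hub)[:n_top]   # top-N authorities, top-N hubs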
 
I hope this clears some of the confusion.
 
Thanks and Regards,
Sushovan De

Monday, October 17, 2011

*Please* do not spend time on the hidden slides..

I just had a couple of students who were confused because they tried to make sense of hidden slides in the ppt lecture notes.

As I mentioned multiple times, you should focus only on the non-hidden slides. Like junk DNA, the hidden slides may not always have a direct connection to what we discussed in class this time (and in some cases may even be erroneous).

If you are viewing the slides on the computer, use the slideshow mode, which automatically skips over hidden slides. If you are printing, remember to turn off the "hidden slides" option.

regards
Rao

ps: on a different topic--you might want to note that the page-rank vector doesn't have to be normalized if you start with [1/n ... 1/n]' as the initial vector. Multiplying a probability distribution with M*--a stochastic matrix--will give rise to another probability distribution (i.e., the entries will all be positive and will add up to 1).
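A quick way to convince yourself of this is the following tiny numpy sketch (the 3x3 M* below is made up; the only requirement is that it is column-stochastic, i.e., each column sums to 1, matching the convention where the update is p <- M* p):

import numpy as np

M_star = np.array([[0.05, 0.50, 0.05],
                   [0.90, 0.05, 0.05],
                   [0.05, 0.45, 0.90]])   # each column sums to 1 (stochastic matrix)

p = np.full(3, 1/3)                        # the [1/n ... 1/n]' starting vector
for _ in range(5):
    p = M_star @ p                         # one iteration
    print(p, p.sum())                      # entries stay positive and keep summing to 1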

Graded homework 2 is available at Sushovan's desk

Homework 2 is now graded. In case you want to look at your graded work before the exam, you can pick it up from Sushovan's desk (he sits 
in the lab across from my office).

Otherwise, I will bring them to class either tomorrow or Thu.

Rao

Homework 2 - Solution for Question 3

The solution for HW2 – Q3 was missing from the solutions. Here it is:

Answer 1

td =

0.2500 0.5300 0.7500
0.7300 0.5000 0.2300

 To find the first term of the SVD, we find the eigenvectors of td * td'.
td * td' =

0.9059 0.6200
0.6200 0.8358

 <Eigenvector calculation here>

 Eigenvectors of td * td':

-0.7268
-0.6869

and

 0.6869
-0.7268

The eigenvalues are 1.4918 and 0.2499.

[Note: the eigenvectors might also come out with their signs flipped; that is a valid solution too.]

Therefore, the first term of the SVD is

-0.7268 0.6869
-0.6869 -0.7268

The second term of the SVD is the diagonal matrix of the square roots of the eigenvalues found above:

1.2214 0 0
0 0.4999 0

To find the third term of the SVD, we find the eigenvectors of td' * td.

td' * td =
0.5954 0.4975 0.3554
0.4975 0.5309 0.5125
0.3554 0.5125 0.6154

<eigenvector computation>

The eigenvectors are:

First eigenvector:
 0.5593
 0.5965
 0.5756

Second eigenvector:
-0.7179
 0.0013
 0.6962

Third eigenvector:
-0.4146
 0.8026
-0.4290

Therefore, the third and last term in the SVD is

-0.5593  0.7179 -0.4146
-0.5965 -0.0013  0.8026
-0.5756 -0.6962 -0.4290

 <But we did not find the right signs for the matrix here, so we need to find this matrix in a different way (see the recent mail sent by Dr. Rao).>

 Answer 2

After removing the lesser of the two singular values, your s matrix becomes:

1.2214 0 0
0      0 0

 

Then to recreate the matrix, you multiply u * s * v'

0.4965 0.5296 0.5110
0.4692 0.5005 0.4829

Does it look close to the original? In my opinion, NO, it does not. We did eliminate a significant part of the variance.

(Also accept a YES answer if it claims that if you scale it properly, you would end up sort of where the original documents were.)

Answer 3

The query vector is q = [1, 0]

In the factor space, it would be qf = q * tf:

qf = -0.7268 -0.6869

The documents in the factor space are:

D1: -0.6831 0.3589
D2: -0.7286 -0.0006
D3: -0.7031 -0.3481

The similarities are:
sim(q,D1) = 0.3239
sim(q,D2) = 0.7274
sim(q,D3) = 0.9561

Answer 4

Before the transformation:

<plot of the documents in the original space>

After the transformation:

<plot of the documents in the transformed space>

Clearly, after the transformation, the documents are easily separable using only one axis (the y-axis). So we can get rid of the x-axis and still differentiate between them.

Answer 5

The newly added keyword is a linear combination of the previous keywords (0.5 * k1 + 0.5 * k2).

It will have absolutely no impact on the calculations.

This tells us that SVD manages to find the real dimensionality of the data.
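
If you want to sanity-check the numbers in Answers 1-3 yourself, here is a small numpy sketch (not part of the homework; the eigenvector signs can come out flipped relative to the hand computation, but the eigenvalues, the reconstruction, and the similarities are unaffected):

import numpy as np

td = np.array([[0.25, 0.53, 0.75],
               [0.73, 0.50, 0.23]])

# Answer 1: eigen-decomposition of td * td' gives the term-factor matrix and the eigenvalues
evals, tf = np.linalg.eigh(td @ td.T)
order = np.argsort(evals)[::-1]            # largest eigenvalue first
evals, tf = evals[order], tf[:, order]
print(evals)                               # ~ [1.4918, 0.2499]
print(np.sqrt(evals))                      # ~ [1.2214, 0.4999]  (the diagonal of the middle term)

# Answer 2: rank-1 reconstruction (keep only the larger singular value)
U, s, Vt = np.linalg.svd(td)
td_rank1 = U @ np.diag([s[0], 0.0]) @ Vt[:2, :]
print(td_rank1)                            # ~ [[0.4965 0.5296 0.5110], [0.4692 0.5005 0.4829]]

# Answer 3: query and documents in the (full, 2-d) factor space, then cosine similarities
q = np.array([1.0, 0.0])
qf = q @ tf                                # query in factor space
docs_f = td.T @ tf                         # rows are the documents in factor space (df * ff, up to signs)
for d in docs_f:
    print(qf @ d / (np.linalg.norm(qf) * np.linalg.norm(d)))   # ~ 0.3239, 0.7274, 0.9561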

 

Thanks and Regards,
Sushovan De

additional office hours for mid-term: 4-5:30pm today (Monday)

I will be available for questions between 4 and 5:30pm today.

Also, tomorrow, the office hours will be from 2:30 to 3:30.

rao

Thursday, October 13, 2011

Precision-Recall question in Homework 1

In the Precision-Recall question of homework 1, the expected answer is this:

<expected precision-recall curve>

In the slides, the precision-recall curve was plotted at specific recall values (0, 0.1, 0.2, etc.). Some of you misinterpreted that to mean "You should plot the precision-recall curve only at those points at which a relevant document is retrieved." That is not true in general.
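
To make this concrete, here is a small sketch (with made-up relevance judgments, not the homework's ranking) of how the precision/recall points are tabulated after every retrieved document, not only at the ranks where a relevant document appears:

# relevance of the results at ranks 1..10 (made-up example) and the total
# number of relevant documents in the collection
ranked = [True, False, True, True, False, False, True, False, False, False]
total_relevant = 5

hits = 0
for rank, rel in enumerate(ranked, start=1):
    hits += rel
    precision = hits / rank           # fraction of retrieved docs so far that are relevant
    recall = hits / total_relevant    # fraction of all relevant docs retrieved so far
    print(f"rank {rank}: recall={recall:.2f}, precision={precision:.2f}")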

 

Thanks and Regards,

Sushovan De

 

page synopses ==> page snippets..

Apparently some of you are confused about what "page synopsis" means--I meant the short description that accompanies each result.
They are also called snippets.

In the following, the synopses/snippets are the text in black.


  1. Writing a Synopsis

    The synopsis. I've placed some great synopsis how-to links and books below. Don't miss the sample synopsis page link. Some Books to Help. Using the links ...
  2. Tips About How To Write a Synopsis from Fiction Writer's Connection -

    Once you already have an agent and you are discussing future projects, you can present your ideas in this one-page synopsis format for your agent to look at ...
  3. How to Write a One-Page Synopsis or Treatment | eHow.com

    Feb 16, 2009 – How to Write a One-Page Synopsis or Treatment. If you are trying to sell a screenplay, you will often be asked to submit a "one-sheet" or one ...

Wednesday, October 12, 2011

On getting the eigenvector signs to work out while computing SVD by hand

Some of you have asked me how to get the signs of the eigenvectors in tf and df to correlate correctly so that when you multiply df*ff*tf' you get the original matrix back. One idea is to compute one of the matrices (either tf or df) by explicitly computing the eigenvectors, and compute the other using the SVD equation.

Specifically, suppose you computed tf--picking whatever signs for the eigenvectors.

Then, rather than computing df by looking at the eigenvectors of the dd matrix, you can compute df directly as

df = dt*tf*ff-inverse  (where ff-inverse is the inverse of a diagonal matrix, and so it will just be another diagonal matrix whose entries are the reciprocals of the entries of ff)

This way, you are guaranteed that dt is reconstructible by the SVD equation (and it also saves the trouble of computing eigenvectors twice).

(Here is the simple derivation of the formula above:

dt = df * ff * tf'  (SVD equation)


dt*tf = df * ff * tf' * tf  = df*ff  (the last product, tf' * tf, is I since tf has orthonormal columns)

dt*tf*ff-inverse = df * ff * ff-inverse = df)
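
Here is also a tiny numpy sketch of this procedure (the dt matrix below is made up purely for illustration; the point is only that the reconstruction works without any sign juggling):

import numpy as np

dt = np.array([[1.0, 0.0, 1.0],        # a made-up documents x terms matrix
               [0.0, 1.0, 1.0],
               [1.0, 1.0, 0.0],
               [1.0, 1.0, 1.0]])

# tf: eigenvectors of the term-term matrix dt' * dt, largest eigenvalue first
evals, tf = np.linalg.eigh(dt.T @ dt)
order = np.argsort(evals)[::-1]
evals, tf = evals[order], tf[:, order]

ff = np.diag(np.sqrt(evals))             # singular values on the diagonal
ff_inv = np.diag(1.0 / np.sqrt(evals))   # inverse of a diagonal matrix: reciprocals on the diagonal

df = dt @ tf @ ff_inv                    # df computed directly via df = dt * tf * ff-inverse

print(np.allclose(df @ ff @ tf.T, dt))   # True: df * ff * tf' reconstructs dt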

Here is a link which has a worked out example done this way


Rao

Tuesday, October 11, 2011

Midterm coverage

Search engine engineering issues, which we discussed in the last two classes, are included in the midterm.

Clustering, which we started today and will continue on Thursday, will not be included.

Rao

Saturday, October 8, 2011

practice exam with solutions

A practice midterm exam along with solutions can be found at http://rakaposhi.eas.asu.edu/cse494/exams.html

You are strongly encouraged to try to answer the questions yourselves before looking at the solutions..


Rao

Thursday, October 6, 2011

Paper Reading assignment: Due 10/13 (Updated the homework page too)

  • Homework 3 (or think of it as Homework 2 part b). Due 10/13 in class. (We might discuss your answers)
  • ("Reading Comprehension") Read the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", which contains a description of the Google search engine circa 1998 (i.e., before it became a company). Answer the following questions:
    1. What are Fancy hits?
    2. Why are there two types of barrels--the short and the long?
    3. How is indexing parallelized?
    4. How does Google show that it doesn't quite care about recall?
    5. How does Google avoid crawling the same URL multiple times?
    6. What are some of the memory saving things they do?
    7. Do they use TF/IDF directly or indirectly?
    8. Do they normalize the vectors? (why not?)
    9. Can they support proximity queries?
    10. How are "page synopses" made?
    11. List some of the parameters that need to be initialized for their search engine. Are the default values they pick reasonable?

Solutions for homework 2 are posted

http://rakaposhi.eas.asu.edu/cse494/hw2-f11-solns.pdf

(they are also linked from the homework page)

rao

Tuesday, October 4, 2011

Monday, October 3, 2011

You may skip part 2 of the pagerank/a&h question on the homework..

Part 2 of the A&H/Pagerank question requires understanding an idiosyncrasy of  A&H called tyranny of majority--I was supposed to have covered it
on Friday but didn't get around to it. So, you may skip that part (it won't be graded). 

rao

on the self-link issue.. (qn 5 in homework)

Question 5 in the homework has two nodes that have outgoing links, but those links point back to the pages themselves.

So the question is whether these pages are to be considered sink pages or not. 

One reasonable interpretation would be to consider them sink pages--since there is no way of leaving those pages and going anywhere else.

However, the specific "repair" we have for sink pages only works if we consider a page to be a sink page ONLY IF IT HAS NO OUTGOING LINKS
(not even links pointing back to itself). Otherwise, the repair will leave you with a probability mass of more than 1.0 on the outgoing links.

So, for this problem, you have to consider these self-looping pages as non-sink pages (despite the obvious fact that you can't leave them ;-).

========

On a related note, although the pagerank approach was developed with both the Z and K matrices, if you are planning to use a uniform reset matrix, then in a way the sink-page repair is superfluous (basically, after the K matrix is added, you will already have an irreducible matrix).
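
To see both points on a toy example, here is a small numpy sketch (the 4-page graph, the damping value c = 0.85, and the row-stochastic convention used here are illustrative assumptions, not the homework's setup): page 3 is a true sink and gets the uniform-row repair, while page 2 only links to itself and is treated as a non-sink.

import numpy as np

links = {0: [1, 2], 1: [2], 2: [2], 3: []}   # made-up graph; page 2 self-links, page 3 is a true sink
n = 4

M = np.zeros((n, n))
for i, outs in links.items():
    if outs:
        for j in outs:
            M[i, j] = 1.0 / len(outs)        # split the probability over the outgoing links
    else:
        M[i, :] = 1.0 / n                    # sink-page repair: only for pages with NO outgoing links

c = 0.85
K = np.full((n, n), 1.0 / n)                 # uniform reset matrix
M_star = c * M + (1 - c) * K                 # every row still sums to exactly 1

p = np.full(n, 1.0 / n)
for _ in range(100):
    p = M_star.T @ p                         # power iteration
print(p, p.sum())                            # the pagerank estimate; the entries still sum to 1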

Rao

Project part 2 released. Due 10/27.

Part 2 of the project has been officially released. It will be due Oct 27th.

regards
Rao