Thursday, October 27, 2011

Homework on clustering/classification/recommendation systems assigned

It will be due on Nov 10th (2 weeks from today).

Rao

No office hours today

Sorry for the late notice but I am at lunch with the visitor and won't be available for office hours

Rao

midterm marks

Folks

 Here are the midterm marks, listed by posting ID. I will return the graded exams in class today.

Rao

Graded out of 65 points
Average for 494 students: 44
Standard deviation: 11
Max: 57.5; Min: 27

Average for 598 students: 46
Standard deviation: 10.22
Max: 63; Min: 14


Undergrad
Midterm (65)
4257-016
51.5 0076-297
48.5 2980-612
7246-789
49 4620-483
40.5 7825-853
27 2875-514
31 7467-586
57.5 6692-410
54 1598-032
8943-657
7581-369
7137-693
GRAD
47.5 4171-660
33 0833-776
32.5 9676-570
46.5 3681-882
51.5 2635-502
50 2340-625
30.5 0196-819
61.5 8103-075
35 3613-980
47.5 9374-305
60 5365-064
53.5 5355-367
41 0233-710
46 5552-935
22.5 6370-935
37 8290-753
63 9661-342
45 2987-507
50 3633-513
56 6311-340
42.5 8207-143
52.5 1942-482
53 1864-476
40 2855-634
51.5 5497-814
52 9400-307
54 7074-143
44 7106-112
56 3236-450
58.5 8487-449
45 2671-439
14 6013-730
52.5 0062-742
34.5 6622-345
46 5146-793
59 9522-612
51 5068-787
35.5 4967-738
51.5 1847-585
45.5 6146-759
45 5929-040
45 4074-088
43.5 3661-462

Wednesday, October 26, 2011

[Reminder: Talk relevant to the last 1/3rd of CSE 494] Invited Seminar: From Data to Decisions (Ullas Nambiar, IBM IRL; 10/27 11AM; BY 420)



---------- Forwarded message ----------
From: Subbarao Kambhampati <rao@asu.edu>
Date: Tue, Oct 25, 2011 at 6:36 AM
Subject: Invited Seminar: From Data to Decisions (Ullas Nambiar, IBM IRL; 10/27 11AM; BY 420)
To: Rao Kambhampati <rao@asu.edu>


From Data to Decisions 

  Folks:
     This talk is relevant to the topics we will cover in the last 1/3rd of the CSE 494 course.  --Rao


                                                                       Dr. Ullas Nambiar
                                                                   IBM India Research Labs

 11AM. Thu, October 27, 2011.  BY 420  



Abstract: 

  Creating a single view of entities of interest is increasingly becoming a requirement for effective insight generation about product/service quality and customer expectations. The explosive growth of data both within an enterprise and outside it (Web & social media) has made the task of creating a unified view challenging and interesting. In this talk, I will show how Big Data impacts Single View Creation, present use cases showing that the problem is not just a business problem but also arises in areas of common interest like better traffic management and public safety, and then give a deep dive into a solution that augments enterprise data by extracting relevant information from Web sources.


Bio: 
Dr. Ullas Nambiar is a researcher in the Information Management & Analytics group at IBM Research India. He received a PhD in Computer Science in 2005 from ASU. His research focuses on data integration, analytics and retrieval over distributed heterogeneous sources. He is a Senior Member of ACM. More details are available at
www.research.ibm.com/people/u/ubnambiar


Tuesday, October 25, 2011

Invited Seminar: From Data to Decisions (Ullas Nambiar, IBM IRL; 10/27 11AM; BY 420)

From Data to Decisions 

                                                     Dr. Ullas Nambiar, IBM India Research Labs

 11AM. Thu, October 27, 2011.  BY 420  



Abstract: 

  Creating a single view of entities of interest is increasingly becoming a requirement for effective insight generation about product/service quality and customer expectations. The explosive growth of data both within an enterprise and outside it (Web & social media) has made the task of creating a unified view challenging and interesting. In this talk, I will show how Big Data impacts Single View Creation, present use cases showing that the problem is not just a business problem but also arises in areas of common interest like better traffic management and public safety, and then give a deep dive into a solution that augments enterprise data by extracting relevant information from Web sources.


Bio: 
Dr. Ullas Nambiar is a researcher in the Information Management & Analytics group at IBM Research India. He received a PhD in Computer Science in 2005 from ASU. His research focuses on data integration, analytics and retrieval over distributed heterogeneous sources. He is a Senior Member of ACM. More details are available at
www.research.ibm.com/people/u/ubnambiar

Saturday, October 22, 2011

[CSE 494] Clarification about project - A/H

A number of you seem to have some questions about the authorities and hubs computation in Project part 2. In short, this is what you have to do:
 
Step 1: Find the top-k results from TF/IDF. Call that your "root set".
At this stage you have 10 documents.
 
Step 2: Find all the documents that the root set points to and is pointed to by. Call that your "base set".
At this stage you will have, say, 80 documents.
 
Step 3: Create the adjacency matrix for these 80 documents.
The size of the adjacency matrix would be 80 x 80. You will need to make more calls to LinkAnalysis to populate this matrix.
 
Step 4: Create an initial authorities vector and an initial hubs vector. Then use the techniques from the slides to iteratively compute the next authorities and hubs values. Remember to normalize after every iteration. Repeat this until it converges.
You can test convergence by checking whether the sum of the squares of the differences between the current values and the previous values is less than some threshold you choose (see the sketch after Step 5).
 
Step 5: Print out the top-N authorities and top-N hubs.
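For concreteness, here is a minimal sketch of the iteration in Steps 4 and 5 (written in Python/numpy just to show the math; the project itself is in Java, and the function and variable names below are purely illustrative, not part of the project code):

import numpy as np

def hits(adj, n_top=10, eps=1e-6, max_iter=1000):
    # adj[i][j] = 1 if page i in the base set links to page j (the matrix from Step 3)
    n = adj.shape[0]
    auth = np.ones(n)
    hub = np.ones(n)
    for _ in range(max_iter):
        new_auth = adj.T @ hub                 # a page is a good authority if good hubs point to it
        new_hub = adj @ new_auth               # a page is a good hub if it points to good authorities
        new_auth /= np.linalg.norm(new_auth)   # normalize after every iteration
        new_hub /= np.linalg.norm(new_hub)
        # convergence test: sum of squared differences below the chosen threshold
        if np.sum((new_auth - auth) ** 2) + np.sum((new_hub - hub) ** 2) < eps:
            auth, hub = new_auth, new_hub
            break
        auth, hub = new_auth, new_hub
    return np.argsort(-auth)[:n_top], np.argsort(-hub)[:n_top]   # top-N authorities, top-N hubs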
 
I hope this clears some of the confusion.
 
Thanks and Regards,
Sushovan De

Monday, October 17, 2011

*Please* do not spend time on the hidden slides..

I just had a couple of students who were confused because they tried to make sense of hidden slides in the ppt lecture notes.

As I mentioned multiple times, you should focus only on the non-hidden slides. Like junk DNA, the hidden slides may not always have a direct connection to what we discussed in class this time (and in some cases may even be erroneous).

If you are viewing the slides on the computer, use the slideshow mode, which automatically skips over hidden slides. If you are printing, remember to turn off the "hidden slides" option.

regards
Rao

ps: on a different topic--you might want to note that the page-rank vector doesn't have to be normalized if you start with [1/n ... 1/n]' as the initial vector. Multiplying a probability distribution with M*--a stochastic matrix--will give rise to another probability distribution (i.e., the entries will all be positive and will add up to 1).
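A quick way to convince yourself of this is the following tiny numpy sketch (the 3x3 M* below is made up; the only requirement is that it is column-stochastic, i.e., each column sums to 1, matching the convention where the update is p <- M* p):

import numpy as np

M_star = np.array([[0.05, 0.50, 0.05],
                   [0.90, 0.05, 0.05],
                   [0.05, 0.45, 0.90]])   # each column sums to 1 (stochastic matrix)

p = np.full(3, 1/3)                        # the [1/n ... 1/n]' starting vector
for _ in range(5):
    p = M_star @ p                         # one iteration
    print(p, p.sum())                      # entries stay positive and keep summing to 1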

Graded homework 2 is available at Sushovan's desk

Homework 2 is now graded. In case you want to look at your graded work before the exam, you can pick it up from Sushovan's desk (he sits 
in the lab across from my office).

Otherwise, I will bring them to class either tomorrow or Thu.

Rao

Homework 2 - Solution for Question 3

The solution for HW2 – Q3 was missing from the solutions. Here it is:

Answer 1

td =

0.2500 0.5300 0.7500
0.7300 0.5000 0.2300

 To find the first term of the SVD, we find the eigenvectors of td * td'.
td * td' =

0.9059 0.6200
0.6200 0.8358

 <Eigenvector calculation here>

 Eigenvectors of td * td':

-0.7268
-0.6869

and

 0.6869
-0.7268

The eigenvalues are 1.4918 and 0.2499.

[Note: the eigenvectors might also come out with their signs flipped; that is a valid solution too.]

Therefore, the first term of the SVD is

-0.7268 0.6869
-0.6869 -0.7268

The second term of the SVD is the diagonal matrix of the square roots of the eigenvalues found above:

1.2214 0 0
0 0.4999 0

To find the third term of the SVD, we find the eigenvectors of td' * td.

td' * td =
0.5954 0.4975 0.3554
0.4975 0.5309 0.5125
0.3554 0.5125 0.6154

<eigenvector computation>

The eigenvectors are:

First eigenvector:
 0.5593
 0.5965
 0.5756

Second eigenvector:
-0.7179
 0.0013
 0.6962

Third eigenvector:
-0.4146
 0.8026
-0.4290

Therefore, the third and last term in the SVD is

-0.5593  0.7179 -0.4146
-0.5965 -0.0013  0.8026
-0.5756 -0.6962 -0.4290

 <But we did not find the right signs for the matrix here, so we need to find this matrix in a different way (see the recent mail sent by Dr. Rao).>

 Answer 2

After removing the lesser of the two singular values, your s matrix becomes:

1.2214 0 0
0      0 0

 

Then to recreate the matrix, you multiply u * s * v'

0.4965 0.5296 0.5110
0.4692 0.5005 0.4829

Does it look close to the original? In my opinion, NO, it does not. We did eliminate a significant part of the variance.

(Also accept a YES answer if it claims that if you scale it properly, you would end up sort of where the original documents were.)

Answer 3

The query vector is q = [1, 0]

In the factor space, it would be qf = q * tf:

qf = -0.7268 -0.6869

The documents in the factor space are:

D1: -0.6831 0.3589
D2: -0.7286 -0.0006
D3: -0.7031 -0.3481

The similarities are:
sim(q,D1) = 0.3239
sim(q,D2) = 0.7274
sim(q,D3) = 0.9561

Answer 4

Before the transformation:

<plot of the documents in the original space>

After the transformation:

<plot of the documents in the transformed space>

Clearly, after the transformation, the documents are easily separable using only one axis (the y-axis). So we can get rid of the x-axis and still differentiate between them.

Answer 5

The newly added keyword is a linear combination of the previous keywords (0.5 * k1 + 0.5 * k2).

It will have absolutely no impact on the calculations.

This tells us that SVD manages to find the real dimensionality of the data.
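
If you want to sanity-check the numbers in Answers 1-3 yourself, here is a small numpy sketch (not part of the homework; the eigenvector signs can come out flipped relative to the hand computation, but the eigenvalues, the reconstruction, and the similarities are unaffected):

import numpy as np

td = np.array([[0.25, 0.53, 0.75],
               [0.73, 0.50, 0.23]])

# Answer 1: eigen-decomposition of td * td' gives the term-factor matrix and the eigenvalues
evals, tf = np.linalg.eigh(td @ td.T)
order = np.argsort(evals)[::-1]            # largest eigenvalue first
evals, tf = evals[order], tf[:, order]
print(evals)                               # ~ [1.4918, 0.2499]
print(np.sqrt(evals))                      # ~ [1.2214, 0.4999]  (the diagonal of the middle term)

# Answer 2: rank-1 reconstruction (keep only the larger singular value)
U, s, Vt = np.linalg.svd(td)
td_rank1 = U @ np.diag([s[0], 0.0]) @ Vt[:2, :]
print(td_rank1)                            # ~ [[0.4965 0.5296 0.5110], [0.4692 0.5005 0.4829]]

# Answer 3: query and documents in the (full, 2-d) factor space, then cosine similarities
q = np.array([1.0, 0.0])
qf = q @ tf                                # query in factor space
docs_f = td.T @ tf                         # rows are the documents in factor space (df * ff, up to signs)
for d in docs_f:
    print(qf @ d / (np.linalg.norm(qf) * np.linalg.norm(d)))   # ~ 0.3239, 0.7274, 0.9561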

 

Thanks and Regards,
Sushovan De

additional office hours for mid-term: 4-5:30pm today (Monday)

I will be available for questions between 4 and 5:30pm today.

Also, tomorrow, the office hours will be from 2:30 to 3:30.

rao

Thursday, October 13, 2011

Precision-Recall question in Homework 1

In the Precision-Recall question of homework 1, the expected answer is this:

<expected precision-recall curve>

In the slides, the precision-recall curve was plotted at specific recall values (0, 0.1, 0.2, etc.). Some of you misinterpreted that to mean "You should plot the precision-recall curve only at those points at which a relevant document is retrieved." That is not true in general.
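
To make this concrete, here is a small sketch (with made-up relevance judgments, not the homework's ranking) of how the precision/recall points are tabulated after every retrieved document, not only at the ranks where a relevant document appears:

# relevance of the results at ranks 1..10 (made-up example) and the total
# number of relevant documents in the collection
ranked = [True, False, True, True, False, False, True, False, False, False]
total_relevant = 5

hits = 0
for rank, rel in enumerate(ranked, start=1):
    hits += rel
    precision = hits / rank           # fraction of retrieved docs so far that are relevant
    recall = hits / total_relevant    # fraction of all relevant docs retrieved so far
    print(f"rank {rank}: recall={recall:.2f}, precision={precision:.2f}")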

 

Thanks and Regards,

Sushovan De

 

page synopses ==> page snippets..

Apparently some of you are confused about what "page synopsis" means--I meant the short description that accompanies each result.
They are also called snippets.

In the following, the synopses/snippets are the text in black.


  1. Writing a Synopsis

    The synopsis. I've placed some great synopsis how-to links and books below. Don't miss the sample synopsis page link. Some Books to Help. Using the links ...
  2. Tips About How To Write a Synopsis from Fiction Writer's Connection -

    Once you already have an agent and you are discussing future projects, you can present your ideas in this one-page synopsis format for your agent to look at ...
  3. How to Write a One-Page Synopsis or Treatment | eHow.com

    Feb 16, 2009 – How to Write a One-Page Synopsis or Treatment. If you are trying to sell a screenplay, you will often be asked to submit a "one-sheet" or one ...

Wednesday, October 12, 2011

On getting the eigenvector signs to work out while computing SVD by hand

Some of you have asked me how to get the signs of the eigenvectors in tf and df to correlate correctly so that when you multiply df*ff*tf' you get the original matrix back. One idea is to compute one of the matrices (either tf or df) by explicitly computing the eigenvectors, and compute the other using the SVD equation.

Specifically, suppose you computed tf--picking whatever signs for the eigenvectors.

Then, rather than computing df by looking at the eigenvectors of the dd matrix, you can compute df directly as

df = dt*tf*ff-inverse  (where ff-inverse is the inverse of a diagonal matrix, and so it will just be another diagonal matrix whose entries are the reciprocals of the entries of ff)

This way, you are guaranteed that dt is reconstructible by the SVD equation (and it also saves the trouble of computing eigenvectors twice).

(Here is the simple derivation of the formula above:

dt = df * ff * tf'  (SVD equation)


dt*tf = df * ff * tf' * tf  = df*ff  (the last product, tf' * tf, is I since tf has orthonormal columns)

dt*tf*ff-inverse = df * ff * ff-inverse = df)
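
Here is also a tiny numpy sketch of this procedure (the dt matrix below is made up purely for illustration; the point is only that the reconstruction works without any sign juggling):

import numpy as np

dt = np.array([[1.0, 0.0, 1.0],        # a made-up documents x terms matrix
               [0.0, 1.0, 1.0],
               [1.0, 1.0, 0.0],
               [1.0, 1.0, 1.0]])

# tf: eigenvectors of the term-term matrix dt' * dt, largest eigenvalue first
evals, tf = np.linalg.eigh(dt.T @ dt)
order = np.argsort(evals)[::-1]
evals, tf = evals[order], tf[:, order]

ff = np.diag(np.sqrt(evals))             # singular values on the diagonal
ff_inv = np.diag(1.0 / np.sqrt(evals))   # inverse of a diagonal matrix: reciprocals on the diagonal

df = dt @ tf @ ff_inv                    # df computed directly via df = dt * tf * ff-inverse

print(np.allclose(df @ ff @ tf.T, dt))   # True: df * ff * tf' reconstructs dt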

Here is a link which has a worked out example done this way


Rao

Tuesday, October 11, 2011

Midterm coverage

Search engine engineering issues, which we discussed in the last two classes, are included in the midterm.

Clustering, which we started today and will continue on Thursday, will not be included.

Rao

Saturday, October 8, 2011

practice exam with solutions

A practice midterm exam along with solutions can be found at http://rakaposhi.eas.asu.edu/cse494/exams.html

You are strongly encouraged to try to answer the questions yourselves before looking at the solutions..


Rao

Thursday, October 6, 2011

Paper Reading assignment: Due 10/13 (Updated the homework page too)

  • Homework 3 (or think of it as Homework 2 part b). Due 10/13 in class. (We might discuss your answers)
  • ("Reading Comprehension") Read the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", which contains a description of the Google search engine circa 1998 (i.e., before it became a company). Answer the following questions:
    1. What are Fancy hits?
    2. Why are there two types of barrels--the short and the long?
    3. How is indexing parallelized?
    4. How does Google show that it doesn't quite care about recall?
    5. How does Google avoid crawling the same URL multiple times?
    6. What are some of the memory saving things they do?
    7. Do they use TF/IDF directly or indirectly?
    8. Do they normalize the vectors? (why not?)
    9. Can they support proximity queries?
    10. How are "page synopses" made?
    11. List some of the parameters that need to be initialized for their search engine. Are the default values they pick reasonable?

Solutions for homework 2 are posted

http://rakaposhi.eas.asu.edu/cse494/hw2-f11-solns.pdf

(they are also linked from the homework page)

rao

Tuesday, October 4, 2011

Monday, October 3, 2011

You may skip part 2 of the pagerank/a&h question on the homework..

Part 2 of the A&H/Pagerank question requires understanding an idiosyncrasy of  A&H called tyranny of majority--I was supposed to have covered it
on Friday but didn't get around to it. So, you may skip that part (it won't be graded). 

rao

on the self-link issue.. (qn 5 in homework)

Question 5 in the homework has two nodes that have outgoing links, but those links point back to the pages themselves.

So the question is whether these pages are to be considered sink pages or not. 

One reasonable interpretation would be to consider them sink pages--since there is no way of leaving those pages and going anywhere else.

However, the specific "repair" we have for sink pages only works if we consider a page to be a sink page ONLY IF IT HAS NO OUTGOING LINKS
(not even links pointing back to itself). Otherwise, the repair will leave you with a probability mass of more than 1.0 on the outgoing links.

So, for this problem, you have to consider these self-looping pages as non-sink pages (despite the obvious fact that you can't leave them ;-).

========

On a related note, although the pagerank approach was developed with both the Z and K matrices, if you are planning to use a uniform reset matrix, then in a way the sink-page repair is superfluous (basically, after the K matrix is added, you will already have an irreducible matrix).
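
To see both points on a toy example, here is a small numpy sketch (the 4-page graph, the damping value c = 0.85, and the row-stochastic convention used here are illustrative assumptions, not the homework's setup): page 3 is a true sink and gets the uniform-row repair, while page 2 only links to itself and is treated as a non-sink.

import numpy as np

links = {0: [1, 2], 1: [2], 2: [2], 3: []}   # made-up graph; page 2 self-links, page 3 is a true sink
n = 4

M = np.zeros((n, n))
for i, outs in links.items():
    if outs:
        for j in outs:
            M[i, j] = 1.0 / len(outs)        # split the probability over the outgoing links
    else:
        M[i, :] = 1.0 / n                    # sink-page repair: only for pages with NO outgoing links

c = 0.85
K = np.full((n, n), 1.0 / n)                 # uniform reset matrix
M_star = c * M + (1 - c) * K                 # every row still sums to exactly 1

p = np.full(n, 1.0 / n)
for _ in range(100):
    p = M_star.T @ p                         # power iteration
print(p, p.sum())                            # the pagerank estimate; the entries still sum to 1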

Rao

Project part 2 released. Due 10/27.

Part 2 of the project has been officially released. It will be due Oct 27th.

regards
Rao