Friday, December 2, 2011

DUE: 12/6: Mandatory Interactive Review Blog qn (You should post your answer as a comment to this thread )

"List five nontrivial ideas you came to appreciate during the course of this semester"
(You cannot list generic statements like "I thought pagerank was cool".  Any cross-topic connections
you saw are particularly welcome. )

Note that you must post your response as a comment to this particular message on the class blog
All comments must be done by the beginning of Tuesday's class (12/6). 

If you have forgotten what we have done, see the class home page for a description. 



  1. SVD was one of the topics that really caught my attention. I had been exposed to svd and have done image compression using it, but I had no idea it could be so versatile for IR, tackling correlation, etc...

    The idea of breaking down documents into bags of terms is extremely powerful and allows for fairly trivial metrics for similarity like jaccard and vector similarity to be used.

    I had never heard of map reduce until this class, but the ability to parellelize tasks across server clusters is extremely powerful and is similar to the parallel structure of gpus architectures.

    Utilizing clustering is extremely useful for grouping objects such as documents based on similarities and can even be used to present a varied set of results in the top 10 results.

    The original google setup had brought some key ideas to the table for ir on the web. The creators solved many of the fundamental issues that plagued many search engines at the time from obtaining importance of pages with pagerank to improvements on indexing and similarity computation.

    -James Cotter

  2. The course is so structured that it is really tough to zero down to 5 particular areas of interest. Even so, couple of topics that I found relatively more interesting than the rest were:

    1. The concept and computation of tf-idf distance for finding the similarity between a document vector and the user query vector is straightforward and sounded trivial. I had not come across this concept before and the discussions in the class regarding this topic was pretty darn interesting. Not to mention, tf-idf based query results were the basis of every algorithm we implemented for our course project.

    2. I have taken a Machine Learning course at ASU, hence ML algorithms ( kmeans, knn) revisited in the class with in depth detailing is very much appreciated. Not to mention the third phase of the project required kmeans implementation and its variations. I guess as the course material and project demanded a lot of effort,extra reading it was a high learning curve for me.

    3. The main objective for taking up this course as many others was to find out how Google runs? But in addition to page rank algorithm I also got work on the HITS algorithm, precursor to Page Rank. Not just readup on it and have discussion but also to implement it.

    4. Most of the topics laid out in the class were not just supported by theoretical evidence. The instructor was able to relate it to real life scenarios which made more sense in understanding. Eg: To explain the latent feature in a data set(lsi lecture) there was an example of a fish and it's size parameter. In the same topic scope citing netflix prize for collaborative filtering algorithm to predict user ratings for movies.

    5. towards the end of the semester the focus of the course discussions shifted to structure of the web data and extracting information. this was also informative.

    I think the alignment of the course topics with project implementation phases was best part of the course.

  3. So top 5,

    1) LSI,
    One of the best ideas, of seeing things which are there seen by human mind but not observed by machines obviously. Latent semantic indexing which is used for capturing relationships between similar terms or in general reducing dimensions. Uses SVD (Singular value decomposition) as mathematical tool to capture relations. For example you write an article and you mention words beats and drums, and they will be related to Music.

    2) A/H and PR :
    Authority/Hubs and Page Rank concept, of finding the real authenticity of webpages, finding the ones which are really liked by others and the ones which are really important (Like Netscape brower webpage would have been captured). People who have read about IR could make webpages which would get extremely high ranking (Knowledge of LSI could be misused, to put terms in pages to make it more obvious to query user gives, but obviously A/H and PR would know the inlinks and outlinks and hence, the importance).

    3) Vickrey auction :
    Outright I thought it was the worst idea and how it could be an idea for noble price. Discussed it with professor relating it to real world scenario's and thought about scenario where too many people would misuse knowledge of Vickrey auction to bid at minimum price (Yes everyone knows bid price of other person unlike what Vickrey assumed where no one else knows what price you have put), then prof told about more real world problems such as economic satisfaction, where no one would miss an auction for like lets say 1 cent! (Obviously those people know that auction is valuable to them, and they dont want to miss opportunity to buy the item, paying the correct/optimum price would always be better idea in the end).

    4)Naive Bayes Classifier (NBC):
    Used mainly for classification, example: the best one which we are quite aware of is spam classification. Classifying document (here email), is based upon occurrences of terms in spam vs non-spam mails and hence corresponding probabilities. The important factor is that a non-spam mail should never be classified to spam mail.

    5) HMM :
    Introduced while doing information extraction, I was completely taken back by idea of hidden markov models, this was the first time I was introduced to HMM concept and then found it was used almost everywhere like weather prediction, stocks, music, language etc. As we don't know the complete state of system hence the word "hidden", which is hidden information in the system. Finding most likely state sequence and maximum likelihood training were quite interesting sub topics.

  4. 1. Knowing that turning a document into a vector of words was a new idea for me. It made lot more sense to convert into vectors than bag of words when we started learning about the vector operations that can be done on the same which was not possible with bag of words.

    2. Spelling correction using K-gram and edit distance was a good idea to be implemented for search engine with cold start problem. I thought there was no other way than to use the query logs to figure out the spelling mistakes. But using the inverted index terms itself through K-gram was an interesting way of solving it.

    3. Using Scalar Cluster for finding co-occurrences of terms to give suggestions on similar terms that can be added to the query. I doubted the effectiveness of the method until I tried it out in project 3 when I understood that it was giving very good results even on a small corpus which we used for the project.

    4. How computation of page rank could be altered to be made use of the grid computing by using map reduce architecture was very interesting. The sequential method of page rank would not have scaled to huge search engine level matrix as it was taking really long time to compute on relatively small matrix of 25k size. Hence being able to paralyze the same was a huge advantage.

    5. Clustering of documents using K Means was interesting concept. By just specifying the number of clusters we expect to see in the documents we get groups of documents. This concept gave me an understanding of how google news clusters news articles. It was also interesting to see the use of Buckshot which reduces the local minima problem of finding the better seeds to start with if not the best.

  5. I came to appreciate a lot of concepts in this course since I had never investigated information retrieval. Here is the list of the top 5 concepts that made the biggest impact on me.

    1) Vector Similarity. The idea of making documents a vector of terms and then comparing with other vectors of terms is simple enough, but it is the basis of the entire web search and this made its importance very clear.

    2) Inverted Indexing. Having a lexicon of terms listed alphabetically and being able to determine quickly what documents have the term is very useful. This idea allowed us to quickly access which document has the highest frequency of a given term and also the positions of that term.

    3) HITS algorithm. Views pages as authorities and hubs to determine a pages importance. Just like authors referencing other books and articles, we use this algorithm to put an actual value on how important someone's page is.

    4) Pagerank. Makes pages a markov chain to get the probability of landing on any given page. We create the probability of a random user landing on any given page by clicking a link while also applying the chance that that user will randomly leave the set of links altogether.

    5) LSI. This one just because it uses SVD and until this course I didn't know of any good use for SVD.

    - Elias Ewert

  6. Vector Similarity Model: Representing the Documents and Queries as bag of words and applying Vector space similarity measures for ranking the documents. Using tf/idf values to represent terms weights in a document that takes care of importance of term in the document corpus and normalization of term frequency in a document.

    Inverted Index: Constructing a Lexicon for efficient retrieval of documents without touching any irrelevant documents which is a pretty good idea. Storing the position and formatting information that is being used in snippet generation and term weight calculation.

    Tolerant Dictionaries: Constructing the k-gram index structure, which is used in suggesting the queries like we see in google i.e “Do you mean:” or correcting the words if he/she mistyped the query. Applying the Jaccard similarity measure to calculate the weights using the index that is similar to inverted index concept in Document-Document similarity measure where term in is similar to K-gram shingles.

    Page Rank and A/H Computation: Making use of link structure of the web and adding this to the earlier document-document similarity measures to come up with better ranking of the web pages(documents). Here we learned two pretty good algorithms, Page rank and HITS algorithm. Even though both the algorithm wants to find a stable state that is done with power iteration. Found that HITS algorithm is unstable to minimal changes if the Eigen gap is small where as page rank algorithm is stable in this case.

    K-means clustering: This is the algorithm to segregate the documents into group of related documents. We can use this technique to display the relevant documents and also first document is different from second document, which is one of the properties of search engine. We can apply the K-means on the documents returned by Vector similarity model and display the top-k from each cluster (Project phase 3).

    Venkata Achanta

  7. There were lot of new and interesting concepts learned in Information Retrieval course.

    The following are five nontrivial ideas which i found interesting:

    1. Considering document as bag of words and representing document as vector of terms. The concept of inverted index, TF-IDF and how IDF reduces the effect of common words across documents. Also normalizing the frequencies of the terms so that larger document don't have an edge over smaller document.

    2. The use of NBC for classification. The interesting fact was the assumption that attributes are independent to each other and still manages to predict the correct class. We still use the probability computed which is erroneous whose value doesn't matter and count on the relative order. what matters is not the exact probability but the relative order.

    3. How Vickrey auction is analogous to open ascending-bid auction. In open ascending-bid auction, the winning bid will be just little more than the highest losing bid.

    4. How LSI uses SVD to reduce the dimensions and finds the relationship between the terms and how it handles synonymy and polysemy. The entire concept of SVD was interesting.

    5. Page Rank: The rank of the document not only depends on number of input links but on the number of important input links to the document. The concept of reset matrix to get different variations of page rank was something new and interesting.

    - Dinu John

  8. 1) Vector-Space Model - This applies more broadly to many of the subjects taught in the course, but the whole concept of working with data in higher dimensions was totally foreign to me before taking this course. Also, it was interesting to see how things such as SVD worked in practice that weren't really fully fleshed out in linear algebra.

    2) TF && TF-IDF - While Page Rank and Authorities Hubs are both great algorithms, TF and TF-IDF are both simple to understand and required for the later algorithms to be used. They were also significantly important to a better understanding of the vector-space model.

    3) K-Means - Not the greatest algorithm ever, but a great way to better understand how clustering can be done, and without it I would have assumed much of clustering to be some kind of black magic.

    4) Markov Chains - This was a paticularly interesting subject for me as the company I work for has stuck with a third party grammar validator. I'm hoping to run a similar algorithm as we wen't over in class on a sample of the data we've accured to better understand how it works in practice.

    5) Naive Bayes Nets - I have used bayes nets before in CSE471 (Introduction to AI), and while we didn't go into great depth with them in this course, it was interesting to learn about naive bayes nets and their more practical applications.

    -Thomas Hayden

  9. 1. How many similarity measures are available, with vector space being applicable for most instances.
    2. LSI- the concept of reducing massive dimensions to a more meaningful subset. May not be practical on the web but is interesting & demonstrates need for practical means to view large scale data.
    3.While identifying the need for structure, realizing how the seemingly random Web itself has its own structure of links, page layout , etc.
    4. Ad Ranking and the Vickrey auction with the strategy of truthful bidding not being used by Google as its too hard. So the somewhat inferior method multi-item generalized second bid method is used.
    5. Welcomed XML/RDF Explanation- XML as a syntatic standard enables making structure. XML schema describes format of the XML document. RDF is a standard using/writing in XML syntax.

  10. 1. We can get the similarity of documents by tf/idf weight, which is very efficient in the large terms and documents corpus. Also, the inverted index data structure can help us locate the corresponding document in a quick way.

    2. Information retrieval class help us have better understanding the importance of linear algebra, from it, we know that SVD, LSI, etc, which is very useful to reduce the high dimentional vector space and make us very comfortable to deal with complex doc/term relations in low space.

    3. Page rank, Authority/Hub give me the basic knowledge to understand the princple behind search engine and how google works at the back stage.

    4. By knowing XML, I know the basic rule of web page organization and it is very useful to quickly get the contents or specific mark by it. It can also guarantee the universal standard to write the web pages.

    5. Generally, this course help me master the basic knowledge of information retrieval, data mining, which gives me great interest to further study in this area.

    -- Shu Wang

  11. 1. Explanation of RDF and OWL standards using the Ramayana story brought more clarity.
    Especially, the example to convert the n-ary predicates to RDF using Rescuee007.

    2. Usage of tag information - to associate different importance to term occurances in different tags.
    The example of usage of anchor text (worst teacher I ever had) for determining the importance was interesting.

    3. This course clearly answered my questions regarding Page Rank which I posted on the blog during the start of the semester,that why a newly created page doesn't get a Page Rank immediately?
    The analogy of actor-agent for authority hubs were interesting.

    4. Using Rocchio relevance feedback to improve the results based on the users feedback was interesting.
    This made sense when I implemented this in the final phase of the project. It provided better results and was working like a real search engine!!

    5. Precision and recall of IR systems with the analogy of "All truth,NOthing but truth" was very helpful in better understanding of the basic performance measures.

  12. Using a simple "Bag of words" technique with TF/IDF weighing for search results and by involving humans in loop seems to be so effective that we can neglect the impact of full NLP absence in most cases

    Capturing the co-relation between terms by computing t-t matrix & analyzing the transitive dependencies. Notion of positive and negative co-relationships is quite useful for query completions.

    How LSI as a technique captures co-relation between terms, synonymy polysemy, and also able to detect and remove any noisy content in the original input data is quite interesting. Though it took time to understand the concept its worth of investing that time and also be able to appreciate the application of complex mathematical techniques which are usually studied in isolation.

    Reasoning behind why A/H and page rank(Markov chain based) techniques would work and the importance of being robust to any adversarial changes in web structure(impact of bridges, page collusion)

    Fast and efficient way of computing the primary eigen value through power iterations.

    Assumption made by NBC classifier about the independence of attributes and still be able to classify the documents.


  13. Latent Semantic Index computation to reduce the dimensionality and to capture the correlation between terms. The project results of using LSI over the given inverted index of documents surprised me to notice only relevant documents listed in the order of their similarity. (Is this all about Google ? ;-)

    Using anchor text to estimate the relevance of the pointed document though the document might not contain any query keyword. Ex: keyword 'browser' doesn't appear on Netscape home page, but it is ranked high for the search "browser" because of anchor text present in many pages that is linking to home page of netscape. This idea can be exploited to tar the target web pages.

    Influence of hubness factors of the incoming links to iteratively calculate authoritativeness of the page answered my point posted ealier on blog [two new web-pages with same content would be ranked on the basis of which one first gets linked by other (major) web-sites, this would result in iteratively staying in high rank]. Auth-Hub computation really captures important pages that are linked by major websites, otherwise, it leads to exploitation by creating huge number of trivial pages (without hubness) and through which linking the desired page to increase its authoritativeness.

    Power iteration in place of eigen decomposition to find eigen values that corresponds to the order of documents' similarity. Our project proved the efficiency of power iteration by converging within 69th iteration; which took less than 2 hour for ~25000 documents.

    Using collaboration filtering to recommend products using similar users purchase history was interesting, which I relate the idea to relevance feedback for document search.

    K-gram spelling correction method is used to propose possible correct word for partially spelled or misspelled word, when the word is considered independent of each other. If the word is part of set of words (sentences), HMM (hidden morkov chain model) does better. HMM is used to find the probability of possible words for the misspelled word in sentence or some times the actual meaning of the literals based on the surrounding sentence.

    --Bhaskar Rangaswamy

  14. Inverted index:
    The idea of creating a global thesarus of words in the corpus which is then used to retrive only the relevant documents (documents that contain the query term atleast once) was neat without which we would be doing the retrieal using the naive (dumb) way. Engineering aspects of creating inverted indexe's and learning about map/reduce programming was new.

    Classifying documents depending on the terms they contain is the natural way. Naive Bayes Classification provides a mathematical way to do the same by just computing ratio of probabilities. "Napoleon Dynamite Problem" of the Netflix Prize provided interesting reading for the topic of collaborative filtering.

    K-means clustering:
    Simplest of all the clustering algorithms covered in class.Introduced the concept of clustering formally. Problems and improvements to k-means were also interesting.

    Introductinon of RDF and XML as the standards for data exchange/extraction/integration on web was new.

    Combined the concepts of learning (on test data), Naive Bayes Classification and maximizing likelihood to predict the probabilistic state sequence.

  15. 1. Tf-Idf - The preliminary idea of considering a document as bag of words and then using this vector to compute similarity with a query based on frequency and inverse document frequency using dot product took a while to sink in. But made a lot of sense when I understood the simplicity of its logic.
    2. Precision and recall - were the two terms which I try noticing for every google query response I get and for lot many practical issues like bugs found in my code at every step, students answering questions in my class.
    3. PageRank - I have a much better understanding of link structure of the web. How anchor text, tag lines could tarnish or enhance someone's reputation. How the link structure is studied to have two different approaches to compute the page importance as hub-authorities or page rank. The reset matrix modification to customise page rank for different users based on their taste is another wonderful and simple technique.
    4. Recommendation technique - Collaborative Filtering - A very useful technique very much in use in all websites was explained well as how other peoples opinion is used to predict my possible liking for a product. Neighbor selection and significance weights were easy to guess as concepts but hard to realize their use in mathematical formula.
    5. Social Network - small world phenomenon - First the fact that small world phenomenon fits in our social life was amazing then the explanation as to how it works with exponential and scale free networks. Followed by realization that our society is where 'rich get richer' follows power law which is a scale free network. Relating these ideas to heap's law which goes back to bag of words in the corpus vs vocabulary. These set of lectures were vast, crisp, short and very interesting! I would have loved to learn them in class.

    I would also liked the following topics
    *Google's paper - I would have never read this wonderful paper with google's basic approach if I had not taking this class. The Map-Reduce technique - the datastructures used - hits and forward index and inverted index - the algorithm. I enjoyed this paper.
    * Search Advertizements
    * LSI

  16. Understanding the link structure of the web and applying markov's chain to assign ranks to pages. Then assigning the overall rank by combining page rank with tf-idf rank is an appreciable idea.

    The concept of documents being vectors in high-dimensional space was at first visually challenging. But with time and better understanding, and with interesting topics like dimensionality reduction, the whole idea felt amazing.

    Relevance feedback is another noteable idea. Re-formulation of the entire query based on user's feedback is a very good idea to provide customized web search experience.

    On the same note, varying the reset matrix during page rank calculation based on the class the surfer belongs to, is also a good idea in providing a pleasant web search experience.

    Associative / scalar clustering provided a good insight on how terms are related to each other. The "bush saddam iraq" example will always stay in my mind.

    I particularly enjoyed some of Information extraction topics like wrapper generation / dom tree extraction and have found good use of it in my day to day life.

    -Arjun Mukherji

  17. 1)The understanding of how considering documents as vector of terms(Vector Similarity) wins over the consideration of documents of bags(Jaccard Similarity); How constructing an Inverted Index helps us in efficient retrieval; Correlation Analysis:The "Bush Saddam Iraq" example. I thought vector Algebra was taught only for the fancies of Math teachers but found one of the real uses here!!

    2)Latent Semantic Indexing using SVD;How the new axis(factor space) covers the maximum variance in the data by doing Eigen vector decomposition which gives us the new dimensions and their importance;The application of this idea made sense only when the example was given that document may contain lot of noise added through various sources and hence the true dimensionality of the matrix is its rank and the fish example.

    3)Link Analysis:
    Considering the pages as Authorities and Hubs; how power iteration is the fastest way of computing the primary Eigen vector; The markov Chain relation for the Page Rank calculation; How the Reset distribution matrix lets us do many other additions to page rank like the trust rank, User-Specific rank, recent pages can be given more importance etc.
    A glimpse of the Map-Reduce parallelism made me realize that for devising a good algorithm it is very important that we take into account the various Engineering issues involved with it.

    4)Feature selection using NBC and how LDA over LSI is important here. Recommendation systems using NBC how collaborative filtering works and how Content based filtering works and their respective pros and cons.

    5) The Economics and Cost evaluation part of the class which dealt with Search Advertising and the various complications involved; how single item Vickery auction helps truthful bidding and how the multi-item vickery Auction doesnot;

    6)How having extra structure helps in querying and also how hard it becomes to store this extra structure;Wrapper generation and pattern extraction in the information extraction;

    The best part of all these above points was after every concept is taught we were also shown some examples of how these concepts were gamed(misused or used to their own advantage) in the real world and a better idea was shown which could remove some drawbacks, it was like constantly re-learning what all we learnt!!

  18. •In Vector Simillarity I was more fascinated by the idea that idf came into picture i.e the fact that terms which are unique also are as important as the terms which are frequently present. This made searching more easy and meaningful.

    •Pagerank and A/H - By vector similarity only pages with good TFIDF to the query was given to the user. By the concept of Pagerank and Authorities and Hubs, web pages were efficiently retrieved not only by vector similarity but also by the concept of link analysis by getting to know by how many web pages the page has been pointed by and also how stable the link structure is and how it deals with subverting. This is very important since the user is only not sure for what exactly he is looking for and the search engine uses this concept to show him with the most relevant and important page which most of the time works.

    •Clustering is a very interesting idea since each query has many different meanings and this concept makes the search engine retrieve a set of results for each meaning of the query helping thus the user to select the cluster of results that he is really concerned about . The various algorithms like K-means , Buckshot and HAC to do clustering and how Buckshot made a big difference in selecting initial good seeds was a fascinating idea. This was understood well after executing the project.

    •Indexing and Crawling - Since web is made of billions of web pages , the idea of crawling and fetching those web pages is a brilliant idea. The indexing especially the inverted indexing made a big difference in easily searching for terms in all the billions of documents. I even appreciate the concept of stemming and Stop word elimination which saves a lot of time in searching.

    •Social Networks – I went through few videos and how the concept of small world was used in major social networking sites like Face book was great. There are so many other points but these were my favorite, even though they look simple, they were applied to build great applications by just rightly understanding how to do and that’s great.


  19. Vector Similarity
    The concept of using idf to get the importance of each word and considering the document as bag of words and considering the query itself as a document and computing the similarity with each document considering tf-idf weights was interesting.

    Association and Scalar Clustering
    Constructing t-t matrix and computing correlations between the terms and how transitive dependencies can be found doing vestor similarity on the t-t matrix was very interesting.

    Realized how effectively linear algebra can be used. Getting the d-t matrix and performing SVD to find the hidden latent dimensions in the document corpus was amazingly surprising.

    Link Analysis
    Getting the top documents using link structure of the web and power iterating to find the eigen vectors and using reset distribution to assign customized weights to specific dimensions was quite interesting.

    The concept of using XML as the intermediate for fully structured data from databases and text data extracted from the web using wrappers was interesting.

  20. An introduction of Latent Semantic Analysis and its applications in Natural Language Processing. The analysis of relationships between documents and terms allows extracting semantic meaning assuming that semantically close words will occur close together in documents.

    I found the relevance feedback technique very useful in steering a query into a proper direction when words with the multiple meaning are used as the query terms.

    The idea of MapReduce for parallel processing of computationally expensive, highly distributable tasks and merging the result sets. This approach could be applied to a problem of author disambiguation in the NIH datasets.

    The usage of RDF and OWL to describe conceptually and model information on the web. I found this idea to be extremely useful for automatic discovery and mapping of the existing data sources.

    Applications of k Nearest Neighbor Algorithm for classification of objects in special databases.

    -Tim Kourt

  21. I never thought that vectors could play such an important role in building a search engine. In this course we not only had the privilege to learn theoretical aspects but also got the chance to implement them in project phase I. The process of understanding inverted index, tf-Idf evaluation considering documents as bag of words theoretical wise and eventually implementing it was some learning experience.

    The fact that words in the documents are not actually independent i.e. they seem to have polysemy and synonymy brings us to the concept of “LSI” an amazing idea! When prof. taught the LSI concept for the first time I was completely clueless of what he was teaching, it took some time for me to sink in. The process of how using SVD, dimensionality reduction of documents could be done was very interesting (not to forget fish example explained in the class).

    HMM another interesting topic, an important aspect to note that not every system has a complete state; hence ability to find most likely state and maximum likelihood becomes important. These things could be found out using HMM.

    Clustering of documents was another interesting topic. Drawing trade-off between different possible clustering techniques (K-means, Bisecting K-means, HAC, Buckshot) was quite informative and interesting.

    I always used to wonder when you are in the process of typing a query more often than not Google manages to guess the query we want to type this brings us to the concept of “Scalar Clustering”. The concept explained in the class clarifies more or less everything on how exactly it is done.


  22. The five topics which I liked the most in this course are:

    1. SVD/LSI : I have used SVD/PCA in some other projects. This class helped me understand why-n-how it is done. It was really helpful to know how to apply LSI to documents and when a new query/document is introduced, how the LSI handles it without re-doing the whole process.

    2. Auth-Hub/PageRank/Markov Chains: This is an interesting concept of ranking the pages. The class as well as the project helped in learning power iteration to compute eigen values, the Markov Chains Concept and the actual engineering issues in doing the same.

    3. Map Reduce Architecture: This concept is again an interesting one. I appreciated it more and more after doing Project part 2, especially the usage of Map-Reduce in matrix multiplication. The reading list - Large Scale File System & Map Reduce (chapter-2 of Mining Massive Datasets) is a good source.

    4. Collaborative Filtering : I appreciated this topic because of the simplicity of its powerful concept. It was interesting to know how K-Nearest Neighbours and Naive based classification methods are used in recommendation systems.

    5. HMM : I have worked on speech recognition projects as a user of speech recognition libraries. This class helped me in learning how HMMs are employed in speech and other systems. It was most helpful and one of the most engaging topic.

    Rashmi Dubey

  23. (1st part of 2 ... 4000 char limit)

    Some of my comments or interesting things I learned in this class resulted from my thinking further on some of the limitations on what we were in fact taught. The two concepts in particular that I was thinking of are the similarity between the two pagerank algorithms, and the idea of extending binary classification to ternary being too complex.

    1 - PageRank
    The PageRank algorithm has both an iterative form and a matrix exponentiation form. The iterative form uses free pagerank per iteration and falloff of inherited pagerank to ensure that the values converge. The same idea is used in the matrix exponentiation version by adding a reset matrix. The matrix uses the same idea to give free pagerank to each page based on the random-surfer model, but it also gives out additional free pagerank divided up from sink nodes that would otherwise simply give their pagerank out to no other pages. The reset matrix is added to the original adjacency matrix with a weight to ensure that inherited pagerank falls off, similar to the iterative form.

    2 - LSI
    I learned that eigenvectors that determine the shape of the data given are more important than those that only tweak the data slightly, so much so that most information is preserved when the less dominant eigenvalues are simply ignored. The removal of these eigenvalues gives approximately the same data, but can combine human-labelled dimensions like "length", "width", or "height". When we perform LSI on data in these three dimensions, we can see that the correlation between the three dimensions is expressed in the single factor-dimension of the LSI data. Though these factor-dimensions doesn't always have a human-generated name like "size", they do express concepts such as correlation.

    (1st part of 2 ... 4000 char limit)
    -Kalin Jonas

  24. (2nd part of 2 ... 4000 char limit)

    3 - Filtering Methods
    Collaborative filtering is involves taking some descriptors of a list of objects, and the rating an entity gives this objects, to determine how the entity would rate another object. Content-based collaborative filtering uses this method to determine simply whether a user would enjoy a book based on the other books they did or did not enjoy, using the features of the books as descriptors of those books. User-based collaborative filtering uses the ratings on books themselves as descriptors for people, and compares people by these ratings to see if the user would enjoy another book they have not read. We can use content-based filtering to determine virtual scores for books to flesh out these ratings, since a new user might not have actually read many books, while older users may have read dozens. We then treat these virtual ratings as legitimate, and use them for user-based collaborative filtering to get an even better view of whether a user would like the book in question based on real and virtual rating correlations between users.

    4 - NLP vs Bag of Words
    While advanced AI could be designed to perform NLP on thousands of documents and determine their relationships using extensive knowledge of human culture and mind, we can dump entire files into bags of term-weights and get meaningful results far cheaper. The generalization that allows us to avoid designing complex algorithms for determining document similarity is that documents that are on the same topic usually contain that topic in word-form. We will miss documents that are on the same topic that do not explicitly state their topic, but basic writing practice dictates that the topic should be expressed and referred to as often as possible to keep a paper coherent. This assumption allows us to use simple vector-similarity based on the relative weights of each term in the document.

    5 - Binary classification
    A perfect binary classifier would obviously get the right category every time, for 100% accuracy. If we get around 50% accuracy, then even though we are getting some results right, we're wrong often enough that we might as well not bother, and flip a coin instead. However, if our classification never gets the right answer, we still have a perfect classifier, since we can just relabel the categories by swapping them. This could be extended to ternary classification, by rotating the categories, and checking if the likelihood of picking the right category rises, but the number of possible combinations rises to 6, and the number of times we'd have to calculate similarity is 3 times that. The complexity is so high there, that we don't even discuss attempting such a thing.

    (2nd part of 2 ... 4000 char limit)
    -Kalin Jonas

  25. The many concepts taught in this course have been overwhelming at times, as I was unaware of how much I'd learn.
    Here are a few I chose worth a mention.

    1. Evaluation strategy with Precision/Recall was taught with the analogy "Whole Truth" and "Nothing but the truth", which was what helped grasp it all.
    After implementing each phase it was interesting to determine the quality of the results we were generating.

    2. I had learnt Clustering, classification and NBC more on theoretical terms in the Data Mining Course I had enrolled for earlier. But, the problems solved here gave me a better understanding of the concept. A further help was implementing the same in the project as well.

    3. Auth/Hub and Page rank computation using the extensive calculations right from the link strcuture of the web to the final convergence using power iterations was one of the trickiest implementations. But, further there was topic specific page ranking and user based page ranking which made me see how on earth the sites keep a tab of what I had done earlier.

    4. Latent Smenantic Indexing using SVD to reduce dimensions and to capture correlation between terms is an artful techinique. The interesting part was how an ingenuous mathematical computation tackles areas like polysemy and synonymy.

    5. The relevance feedback depending on the user's choices was an added improvisation of the topics covered and to obtain better results.

    6. Recommendation Systems using Collaborative filtering seemed to brush some of the concepts we had learnt earlier but transformed itself into a different purpose altogether.

    Though, I mentioned that the topics were quite overwhelming, I believe I am able to join the dots now.

  26. Vector space model TF/IDF measurements:
    Representing documents as vectors having term frequencies was a really cool idea that I may not have thought about if I hadn’t taken this course. Due to this we can easily apply a lot of linear algebra concepts in the field of information retrieval which I didn’t expect! Having the term weights as normalized TF/IDF even bettered it because IDF handles the problem of some words appearing in a lot of documents. Normalized TF made sure that the weightage given to terms did not depend on the size of the document but was only based on the ratio of occurrence of terms in the document.

    Scalar clustering
    Finding correlation between terms using scalar clustering which is a step that involves association matrix and also normalized association matrix was another idea that uses linear algebraic concepts that impressed me. Through scalar clustering we can not only see direct correlation between terms that is correlation of terms occurring together in documents but also indirect correlation that is terms correlated because they co-occur with similar terms in documents.

    Page rank and Google paper
    Page rank was another idea I was fascinated about. When the web is a directed graph with nodes being the pages and the directed links between pages having the probability that a user will go from one page to another, it is really interesting to see that it is possible for these probability values to stabilize when they are interdependent on one another. The read up assignment on Google paper was also a very thrilling task which showed a much more primitive version of the page rank algorithm.

    Latent Semantic Indexing
    Again this uses another linear algebraic concept called Singular Value Decomposition. It is used to identify new dimensions for terms from the original given dimension that best represent the features in the document. It can not only be used for dimensionality reduction but also to reduce noise in the data.

    Collaborative filtering
    I was very much anxious in knowing how Netflix used to recommend movie for users based on their tastes (identified through their actions and ratings). Again LSI is also one of the techniques that are being used for this. The idea of collaborative filtering when learnt along with the discussion of application in Netflix really was very exciting and very good motivation to learn more regarding this concept.


  27. Vector similarity
    In a previous coursework, I used lucene indexer for indexing xml data. But, I never considered the inverse document frequency while ranking the pages. This one was altogether a new concept for me and I enjoyed a lot during phase 1 project and assignments.

    Google Paper
    Google paper assignment was really useful in the sense it exposed us to the underlying architecture design of the Google search engine and I got to learn many things like memory management, fancy hits, short and long barrels and most importantly on how Google avoids crawling the same URL multiple times.

    Page Ranking and Authority/Hub computation
    This topic was worth giving a note as it describes on how to use the link structure of the web to rank the Web Pages. It is also inspiring to see that PageRank can be calculated for documents of any size. Especially, I learned a lot of new data structures for effectively storing and computing authorities and hub values during project phase 2.

    Map Reducer
    I was fascinated about Map reducer concepts and its ability to parallelize the work in distributed environment. It was a framework that uses the concept of the map-reduce that we see in functional programming. It enables us to effectively divide and map a problem into sub problems and then combine the results to form the main result (reduce task).

    I thoroughly enjoyed the Clustering concepts as it was relatively a new topic to me unlike many others in the class. We can use this approach to segregate the documents into different clusters so that the documents within a cluster are similar to each other when compared to documents in other clusters.

    Finally, this course work has motivated me to learn different mining techniques and I would like to learn a lot about linear algebra concepts in the future.


  28. Pagerank. I learned how this particular number came to exist, how its calculation involved using the stochastic matrix, the damping factor and the existence of sink nodes. It is very interesting to see how Google used this particular ranking system to improve the quality of the results obtained when it’s being combined with a similarity algorithm.

    Authority and Hubs. This is where I first used the concept of convergence, which is when the results would stop changing after an uncertain number of iterations. I used to see the web as a network of inter-connected nodes, some with a lot of outgoing links (hubs) and some with a lot of incoming links (authorities) but never thought this would be helpful with results. Apparently, applying this after obtaining the vector space results can gives us clusters that might help better with identifying each separate stream of results.

    K-Means clustering. This was an easy lecture compared to the rest. Clustering based on centroids which change in every iteration, and reassigning the documents to different or same clusters based on their recomputed similarity to each of the centroids until convergence.

    Vector Similarity with TF-IDF. Learned how this basic concept is used in all the other algorithms we talked in class. It’s basic but essential.

    Collaborative Filtering. It’s interesting how the predicted values are formed based on other people’s preferences; seeing how much influence and how helpful this is. It is such a huge recommender system where the potential can be seen in websites such as Amazon and Netflix.

  29. 1) TF-IDF
    I never thought IDF makes a huge difference in the qualitiy of results. By doing the projects and the associated analysis, I was able to realize the importance of IDF in query matching.

    2) Google Paper
    I got the opportunity to study and analyse the research paper "The Anatomy of a Search Engine". It was a very good experience as I was able to get some idea as to how Google search engine works.

    3) Collobarative Filtering
    I was able to understand to some extent how collobarative filtering works and as a result I could see how recommendation systems for companies like Amazon could be designed.

    4) Scalar Clusters
    I was able to learn about query how query expansion could be done by means of scalar clusters and t-t corelation matrix.

    5) Page Rank
    I was able to understand how link structure plays a major role in ranking the pages. And apart from match to query, it is also good to take the link structure when listing the top ranked pages.

  30. 1) TF-IDF
    I never thought IDF makes a huge difference in the qualitiy of results. By doing the projects and the associated analysis, I was able to realize the importance of IDF in query matching.

    2) Google Paper
    I got the opportunity to study and analyse the research paper "The Anatomy of a Search Engine". It was a very good experience as I was able to get some idea as to how Google search engine works.

    3) Collobarative Filtering
    I was able to understand to some extent how collobarative filtering works and as a result I could see how recommendation systems for companies like Amazon could be designed.

    4) Scalar Clusters
    I was able to learn about query how query expansion could be done by means of scalar clusters and t-t corelation matrix.

    5) Page Rank
    I was able to understand how link structure plays a major role in ranking the pages. And apart from match to query, it is also good to take the link structure when listing the top ranked pages.

  31. 1. Learnt that is a cool site to visit and realized that i was a little of a "bozo" user myself.

    2. Improved my knowledge of new topics not necessarily related to Information retrieval, but with powerful analogies to the
    IR concepts. (the origin of spam, typhoid Mary, Gaetan Dugas, Buridan's donkey, Alan Greenspan, Luis von Ahn, Jon
    Kleinberg, Carl Jung, Watson)

    3. Finally understood what Eigen values and vectors actually were used for, instead of blindly finding the values.
    Also the engineering aspects of search results computation were an eye opener.

    4. The concept of bidding process in search advertisements and the way the problems associated with it were solved was
    interesting. Had never thought twice about search advertisements earlier and the complicated process associated with the
    display of relevant ads.

    5.The extra credit lectures on social networking was extremely informative. The assumption that the real world is a uniform random distribution being negated and the discovery that it actually follows the power law distribution was one of the most important and informative concept learnt during this course.

  32. Five Ideas which i liked.

    1] Latent Semantic Index.
    The concept that there are many irrelevant dimensions in the data which we can get rid
    of by mapping it to some latent indexes and then analysing data by not much loss of
    information and then using that we can tranform any data from the dataset to the new

    2] How search engine works
    The concept of vector similarity, jaccard similarity, TF, TF-IDF, Page Rank, Authority and

    hubs.Eigen values and eigen vectors.This is the basic foundation of todays search engine
    and we can use this concept almost every where in the web when it comes to search or
    relevancy and importance.

    3] Hidden Markov Model.
    The concept of "Hidden" in a document which can be modeled in different different
    markov models for understanding the information which not exactly NLP but close enough
    to it , how can we model the HMM using the training data.

    4] Distributions, uniform random, power law, poisson, small world pehnomen.
    Concept that "rich get richer" is very real world case and that is captured
    by power law distribution and that is the reason rich people (like google)
    are more probable in the power ditribution world , same concept in zephs and
    heaps law.

    5] Realization that there is a lot to do in search engines
    especially in the automated extraction lecture where you explained
    that there is a possibility of putting a query in a engine and getting
    directly the results based on the relevant pages(similar to what we
    do when we search for query in our brain using different results of pages)

  33. 1. Application of SVD: Through some previous courses I learned SVD pretty mechanically as the procedure. But I am happy that now I understood the use case and the right situation in which SVD can help. Especially the concept of latent indexes hidden in real data is beautiful

    2. Explanation of curse of dimensionality through example of N-dimensional Apple was fantastic. I think I will never forget this concept because of example.

    3. Hidden markov model used for speech recognition. This was a question in my mind to know how these interactive automated phone systems work and I was happily surprised to see it to be one other Markov model based solution.

    4. Analogy of putting a weak link in link structures to NOT adding a query document in corpus is amazing. Such connections are helping me in enduring these concepts in my head.

    5. Process of bidding and psychological as well as fairness study around all this process is fascinating.

  34. This is Jadiel de Armas.
    1. The main cause of the simplicity of the ranking model of search engines is because of computational time. The vector model is computationally fast. Other models that would return more relevant results are not. Hence, because of the scale of the Internet, implementations of better retrieval models will depend a lot on how well you can engineer a system so that it works well with that more computationally expensive model.

    2. It is very easy to extend an inverted index so that it supports more structured models. For example, suppose that at index time we can discriminate if the word Tiger refers to an animal, an operating system, or a sports player. Then we would just add three different words to the inverted index, saving ourselves from the pain of having to do LSI at query time to find those differences in that word.

    3. It was a good idea to compare clustering on the web by using A/H vs. using classical clustering algorithms. A/H will find the physical clusters of the web (the physical cluster is determined by the links between pages). Classical clustering algorithms will find the topical clusters between web pages. Then we can do correlation analysis in order to determine how much do this two clustering methods correspond to each other.

    4. Statistics is a central tool in Artificial Intelligence and Machine Learning. But I don't like it much. I believe that hybrid models that include statistics, but that also include some extra deterministic information resemble much better how our brain works. One example of those models is the Combinatorial Categorial Grammar (Dr. Chitta Baral uses them on his research).

    5. There is a lot to innovate and to create out there. If our mind can conceive it, it should be possible to be done. In other fields, who you know and what kind of relationships you have will determine in a good measure your success. But the technological field is the fairest of them all. If you can do it faster and better you will succeed.

  35. Dimensionality Deduction/noise removal can be either a feature or a bug. We do not want to look at each and every detail of something, we want to look at it in a way we can best understand it and easily distinguish it from others while considering fewest factors.

    Effect of the Prior should stand while considering small amount of training data but after considerably large training data, sticking to the prior is irrational. Flipping coin 100 times doesn’t shake your beliefs (of 0.5 probability of either heads or tail) but after 3 billion times, it is an unfair coin.

    On the web, the advertiser only pays if the advertisement has been effective (even though limited to bringing attention - not guaranteed sales) vs. traditional advertizing where you pay everything beforehand.

    Page's "popularity" plays a certain role along with page's similarity. A reputable page is a better match than a page containing just query words.

    Any good compression technique is actually a learning technique as it gives you a learned representation of the actual data.

    IR system actually should find "what you mean" and not "what you ask" learning is the best way to do this. But since it is hard, we do best to get most similar results.

  36. 1. Oversimplification of representative models
    Many models used in class consisted of reducing a complicated document to a simple, but much more workable model. The vector space model, for example, does not care about sentence structure or meaning, but still gives relatively good results.

    2. LSI and dimensionality reduction
    In particular, the notion that patterns can still be extracted after mangling with the data is surprising. Or, even with unmodified data, applying LSI may reveal new dimensions that show correlations unrelated to the original dimensions.

    3. Clustering
    The k-means algorithm is a nice go-to clustering to look for initial partitions, despite some of its limitations. Its greedy approach makes it obvious to understand and analyze the results of.

    4. RDF/OWL
    These set of technologies seem to point to the way to include meaning into our previous models that we have worked with, which have been glaringly lacking in that department. How this has been done is being addressed by the Semantic web, which is still in its infancy. It would be interesting to see what sorts of advantages/disadvantages it has as compared to competing technologies, such as topic maps.

    5. Recommendation systems
    Collaborative filtering by learning what the user has liked/bought in comparison with other users can bring in more information than just a single user's purchases. This brought together many concepts used previously, such as clustering and NBC, which shows how it can be and has been used in practice by companies.

  37. Before this class I had a vague idea what index was, after listening to the class I really understood how an Inverted index is built and used. During project we had to use the forward index and inverted Index, this made me understand how important an index is for a search engine.

    LSI was one interesting concept which really gave me hard time initially. Then I went through all my liner algebra notes to understand how singular value decomposition works, once I did that I really understood how LSI uses SVD to decompose term-document matrix.

    I never knew anything about ranking algorithms before this class. The concepts behind the Page-rank, Authority and hubs made me think how basic liner algebraic concepts are so effective for rank results.

    Map-reduce concepts were really interesting and the undergrad research example which professor used to explain the concept was really wonderful. This concept gave me a clear understanding how a distributed execution works.

    Collaborative filtering on recommendation systems was good. Using our reviews to help other people decide what they might find interesting was really was a unique method.


  38. Vector Similarity : This showed how linear algebra can be applied in solving computation problems. Because of this I read some of linear algebra concepts and understood very well compared to what I had read earlier.

    Page rank : Every page can be given a rank out of billions of pages is a interesting problem. And this is done by using the links that was more interesting to know.

    Indexing : The classes regarding crawling, indexing was also good to learn. A search application with just best retrieval algorithm is not sufficient but also should have efficient indexing, storing and efficient crawling.

    XML : I was surprised when professor taught XML in such a detail level. It shows the way Professor likes everything he teaches.
    I had taken a long time to understand ontology previously, and the way it was explained made it look very simpler. If I knew it would be lectured in IR class I could have saved lot of time.

    Advertisement : I was wondering how search engines make profit. The lecturer on advertisement made me understand this concept clearly.


  39. PCA/LSI :
    I really liked the PCA/LSI concept since it can detect polysemy and find the true dimensionality. The example involving a bunch of documents belonging to database and statistics really helped in understanding this concept.
    Also, the example with two new documents, one with term “Database” 50 times and the other document with term “SQL” 50 times when added to the corpus were very close to each other in the graph plotted.
    I also appreciated the fact that PCA/LSI can be used for clustering.

    Naïve Bayes Classifier :
    It is such a simple concept but very effective in text classification.
    – Peter Norvig, the director of Machine Learning at GOOGLE said, when asked about what sort of technology they use “Naïve bayes”
    – Naïve bayes classifier is darned easy to implement
    – Good learning speed, classification speed
    – Modest space storage
    – Supports incrementality
    Use of M-estimate to solve the problem with Probability of attributes = 0 was nice idea that I learned.

    K-Means :
    If we don’t find any clue how to cluster the data, then we can simply apply K-Means algorithm. The facts that K-Means does not always give the optimal clusters as it looks for local minima than global minima and the cluster optimality depends on the choice of initial clusters were informative.

    Map/Reduce :
    Though it is more of engineering related than Science, the concept of breaking large data into smaller chunks and then applying Map/Reduce programs running on independent machines to solve the problem was a concept I greatly appreciated. In today’s world the dataset with companies involved in Social Networking or even enterprise data is huge and is ever increasing. These companies are looking for ways to process the huge dataset and Map/Reduce is one of the techniques that can really help them.

    Collaborative Filtering :
    Use of Collaborative Filtering helped me understand how recommendation system of websites like work. The fact that you can find your soul-mates through collaborative filtering was exciting. There are some recent efforts in this direction. One of them is which is a community based search engine. People put up queries and the other members reply to it. The members associate themselves with groups and through so one can find soul-mate through it.

  40. Precision /Recall which are the fundemental concepts in understanding the relevance of the query results i.e “all truth and nothing but the truth” helped me understand the importance of combining the precision and recall in the query results

    The use of TF-IDF in identifying terms that are unique to the documents over the entire document corpus and thus improving the query results

    Using LSI to reduce dimensionality and to capture the polysemy and synonymy between terms using SVD was facscinating .

    The computation of A/H and pagerank using power iteration and the effect of eigen gap on the convergence and the uses of reset matrix for topic specific page rank, trust rank etc.

    The concept of Collaborative filtering used in recommendation systems like Netflix and also how LSI can again be used to improve the results.Effect of learning unlabelled examples in classification combining content based and collaborative based filtering

  41. 1) Bag of words model:
    A very simple concept both to understand and implement, and still gives very good results. How Vector model was developed from bag of words, how to find similarity between documents using vector similarity, how to handle text doubling (normalizing the tf), the importance and use of IDF to get dissimilarity among results.
    2) LSI: The need for LSI. How terms can not be considered independent. (synonymy and polysemy)
    The really good example of using two original words in the corpus u, v, and then how it got maliciously mixed up (the analogy of how signals get noise). Considering documents to be in a higher dimension. (peeling of high dimensional apple analogy was useful).
    The fish example to show how there are better dimensions to define the data for the fish (instead of the x y dimensions).Solving LSI using SVD. LSI converts the corpus into higher dimensions. SVD reduces these dimensions (choosing top K latent dimensions), while still preserving the rank of the original matrix.
    The concepts of LSI again came up during the Recommender system lectures (Netflix prize winning algorithm used LSI). It also came up in clustering lectures where LSI was used to reduce the number of dimensions.
    Once I understood LSI there were a lot of questions and ideas like:
    Why does not anyone use LSI? (HUGE computational overhead). Which makes you think, why not use it on a subset of results. There are some articles and papers which discuss these options.
    Where else can it be used? One of the interesting things I stumbled upon was the use of LSI for machine translation. Most of the "sophisticated" machine translation techniques used today use simple probability to train an algorithm and the use them on actual corpus. Since LSI takes care of synonymy, and words in one language can be considered a synonym in another language, LSI could be used to capture this correlation. Again there are some very interesting work being done in this area. (google for "machine translation LSI").
    3) Page Rank / Authority Hubs:
    I had studied this concept in an earlier course, but this course allowed an opportunity to better understand it. Specially during the projects. The various ways to improve efficiency of calculating pagerank, the comparison/relation between page rank and Auth-Hubs. (The projects showed us the differences as we discussed in class).

    4) Recommender Systems / search advertising. : Content Based filtering - hows its domain specific. Collaborative filtering, in which cases its better than Content based (not domain specific). The soul mate example, how two users who have rated more items (even if their ratings are not exactly matching) are more important than two users who have the exact ratings but for lesser number of products.
    Search advertising was another interesting lecture. The lecture went through a step by step build up of making such a advertising system (the various complications were shown as separate cases and how they might be solved). Vickery auction, which at first looks like it will not work very well in practical situations. (One of the questions in my mind was, what if a bidder quotes a ridiculously high bid, and the other bidders do "truthful" bidding, in which case the winner would get away with a lower price. But we have to keep in mind that there might be another bidder who thinks the same way and who bids slightly lower than this "ridiculous" bidder, in which case the first bidder still ends up paying a very high price. So truthful bidding is the best option for all the bidders. ).
    5) Information extraction:
    The lecture on Named entity extraction. How information can be extracted, when it is easy (templates, extracting information from pages like amazon products, since all the pages follow same template a wrapper can be written for it), when it is difficult (normal text which does not follow any templates).

  42. - Every technology aims to make its users believe that there is nothing better.But there is always is a room for improvement. One only needs to think outside of the box. If we remain satisfied with what we have we will never make progress.

    - "Finding a sweetspot" is a wonderful idea proven by the incredible success of the "Bag of words" model. It has been used successfully in numerous areas including web search, clustering and recommendation systems.

    - The idea that web is not random and that there is lot of information that can be extracted from the collective unconscious.

    - The incidence of George Dantzig where he takes up open problems because they were not labeled as open and solves them. This conveys the idea that open problems are not meant to be left open. Just because a problem has not been solved by others doesn't mean that it is unsolvable.

    - Linear algebra and probability theory have a lot of practical importance. The concepts of linear algebra and probability form the base of information retrieval , extraction and mining. It seems as if we can always look towards mathematics for finding solutions to problems in other fields.

  43. 1.Similarity measures
    The basic concepts behind the working of any search engine. I liked the concepts of using tf-idf and calculating cosine similarity and how these could be used to rank documents. The mathematical concepts are not that complex and run quite fast. But they produce highly relevant results.

    2. Scalar Cluster and Latent Semantic indexing
    I found the idea of scalar clusters capturing correlation amongst documents very interesting. The results were not only intuitively correct but also proven mathematically. Even though we know for a fact that that ”database” and ”sql” are related, scalar clusters proved it mathematically.
    The stepwise manner in which we were introduced to LSI allowed me to understand the otherwise complex looking concept of LSI very easily. I liked the idea of why and how we use dimensionality reductions to improve the quality of data. I thoroughly understood the idea of using of eigen values and eigen vectors. The whole topic gave very insightful details about what every single value at every step signifies.

    3. Authority and Hubs, PageRank analysis
    The idea of how mathematical we can incorporate relative importance of pages was not clear until I learnt the mathematical concept between Authority and Hubs. Further more, since we were asked to implement the same concept in our project, it made things easier to understand. I understood the practical importance of Authorities and Hubs.
    I had known of the page rank concept earlier but found it too complex to understand. Only after getting acquainted with the mathematics behind it, the concept became easier to grasp and very interesting to explore.
    The Google paper made concepts even clearer and helped me in understanding how exactly page rank works.

    4. Clustering methods
    I found the entire concept of clustering very interesting. The various clustering algorithms were easy to understand and interesting to apply to practical examples. I particular liked how we could combine the concepts of clustering and calculating similarity between documents using Euclidean measure or cosine similarity.It gave me a thorough understand of how clustering is done with actual documents and not just singular values. It made me understood one more practical application of mathematical concepts like cosine similarity or Euclidean distance apart from just ranking of documents.

    5.Recommendation systems:
    I found the concept of content based and collaborative filtering very interesting. Pseudo-rankings in particular were enlightening. These techniques helped me in understanding the importance of data approximation and how we can work with the data we have (labelled or unlabeled) to come up with better approximations.

  44. 1. The power of a bag of words model. I find it incredibly interesting that so much information can be thrown away by a model and still have the model provide useful results. I was especially surprised that, by running LSI on documents represented using a bag of words model, that so many associations between documents and between terms could be found-- this is information that I would have expected to be lost by treating documents as a bag of words.

    2. The method of treating language like it was produced by something other than an intelligent speaker in order to model it. When using LSI we treated language as correlated random variables, and when doing information extraction we treated language as being produced by a markov chain, but in both cases we were able to find semantic information that existed in the laguage but wasn't part of the model. I didn't expect these models to be so powerful in finding information that they didn't explicitly account for.

    3. The use of links across web pages to indicate trust or association of ideas. Normally I think of links as just an aid in navigation, or a way to get to tangential information. The usefulness of A/H and pageranks makes me reconsider how integral links are to how information is organized online.

    4. I found the discussion of everything that has to be done by the backend of a search engine mind-boggling, especially the idea that they have to prioritize which new pages to read and analyze next, because new pages are created at such a rate that a search engine can never reach the end of its queue of new pages. While I guess this shouldn't be surprising given the size of the internet, I still find the idea of that much information staggering.

    5. Possibly my favorite idea from the class, though the one that I understand the least concrete details of, is the idea (mentioned in the collective unconcious slides at the beginning of class) that it's possible to find real, semantic information by analyzing a massive volume of documents, even though that information doesn't exist explicitly in any of them. The correlations found by LSI, the judgements of A/H regarding what is authoritative, and the potential of markov chains to recognize patterns in written language (and the differences between these patterns in different documents or categories) all give information about how internet users think and communicate. This information is incredibly interesting and could be very useful if properly assimilated and presented, and can only be found by analyzing a great number of sources and documents, ignoring their explicit semantic meaning. I find it exciting that this information exists online and is accessible to anyone who knows how search for and process it, but is completely hidden without the proper tools.

  45. 1) Latent Semantic Indexing: How it can be used in dimensionality reduction and collaborative filtering.
    2) Random Surfer Model: Used for page rank (M*=c(M+Z)+(1-c)K). How K can be used personalize the search results.
    3) Relevance Feedback: How users are being considered while giving the results to an query.How Rocchio can be used in Relevance Feedback.
    4) Power Iteration method for finding the Eigen Vectors of Symmetric Matrix and how it can be used in A/H and Page Rank computation.
    5) XML: How XML structures can be either used by IR or by database people.

  46. - In my current job, I had an idea using collaborative filtering with our purchasing system. This way we can suggest other domain specific items for the users to include if they forget to purchase. For example, if someone buys wire they may also need cutters or connectors associated to that wire.

    - K-means is a useful clustering tool, and can also be used as an compression algorithm

    - Buckshot algorithm gives better starting centroids for K-means because buckshot agglomerates the document and uses those to build distinct cluster, where as K-means just chooses random documents

    - Vector space similar combined with a good inverted index gives a fast search result. For our project, the responses were around a couple mill-seconds

    - Authorities provide information and hubs provide links to the information(authorities)

  47. stupid blog just wiped all my data because i didn't copy it

    - competitive ratio between offline and online results, offline is easier than online and should always have superior (or equal) result, as such the efficiency of an online method can be bench-marked using the offline method as a reference

    - one can decrease the effects of training data that is incomplete or misrepresents the data (eg coin flips with a low number of flips) by using smoothing, this can be done by creating new samples which will pull the knowledge base closer to a pre-determined bias (such as 50-50 ratio on coin flips)

    - despite what may be intuitive, data which is unlabeled but non-random can be quite valuable for machine learning. Even for computers it is valuable to have problem sets, and then even if there is no known solution for those .

    - feature selection is similar to dimensionality reduction from LSI, it will reduce the amount of information that must be processed and may increase the chance of correct predictions. LSI is also beneficial in cases like the Netflix test where they determined that the top 2 dimensions did in fact have human relevant results (chick flick-ness of a movie and indepent-ness of a movie)

    - when using hubs/authorities a subgraph can be located defined as a set of nodes where none of the hubs point to authorities outside the subgraph, and none of the authorities are pointed to by hubs outside the subgraph. Using hubs/authorities only the largest subgraph will receive any hub or authority score at the steady state.

  48. The approach used for representation of documents and queries is completely novel to me. Many of the ideas presented by Prof gave a very clear understanding of basics of why cosine similarity was considered and how simple clean approaches solve huge problems. The concept of precision and recall was very interesting. I understood how important these 2 parameters are for construction of a smart IR engine. The gradual transition of finding similar documents for a query using tf IDF by considering more indicative words of a document was nice.
    Latent Semantic Indexing ,quoted as “The Most technical Lecture of the Course ”  was very interesting. It was very beautiful of how Mathematical concepts were applied for retrieving the documents effectively for a query posed.
    The concept of authorities and Hubs for link analysis on web was another interesting area. It never struck me that graph theory could be used so beautifully to analyse the importance of pages on web. Majority of tyranny and few ideas about how introducing a random link between disjoint clusters would make the analysis unstable was nice.
    Page Rank is one of the most powerful algorithms ever developed for the web. The intention of taking this class was also to understand this algorithm and see what everyone is talking about?! After taking this course, I got an opportunity to not only study the algorithm but also to implement it which Iis a huge take home for me.(I liked the way the offline computation was done for the entire document corpus )
    The clustering concept and different algorithms used for clustering was good. It was nice to understand the series of steps actually involved in construction of a tradition/ current IR engine. Initally we learnt about the representation of documents, finding similarities, building lexicon for words,shingles, different ways to rank the documents, and finally cluster this data. The way the content was organized and conveyed was good.
    Knowledge and entire data on web being represented using XML, RDf owl schema was interesting . I also very much liked the idea collaborative filtering and examples that were quoted using Netflix in real time these concepts are actually being used

  49. Rajasekhar Bayapu:

    1) LSI and SVD are the topics that caught my eyes. I have learnt how to make efficient dimentionality reduction.
    2) scalar clustering is my another topic which i am more interested in. The main idea which links with the correlation analysis to get term term correlation is interesting.
    3) knowledge of OWL, RDF format is more useful, as i have started using the RDF format in my job
    4) Dividing documents into bag of words, considering documents as vectors are more intresting topics. Finding the similarity of vectors to return top k documents helped me in designing algorithms for Bio-design database
    5) Page rank-Ranking of web pages using the inlinks and outlinks. Taking an example of research papers citations to justify the importance of papers is intresting.
    6) K-means clustering technique to cluster data and clearly explaining about supervized and un supervised techniques makes more clear to me about data mining

  50. 1)LSI
    An indexing and retrieval method that uses a mathematical technique called Singular value decomposition (SVD) involving some exiting Linear Algebra and how synonymy and polysemy is captured.

    2)Map Reduce
    The programming model which helps in implementing and processing large scale data sets and how it can be applied to variety of applications.

    Webmasters cannot control which sites link to their sites. PageRank is also Google's way of deciding a page's importance.

    4)Collaborative Filtering
    A very successful recommender system wherein personal tastes of random people are correlated.People in a Virtual Community influence each other as if they collaborated and interacted.

    5)Information Extraction
    System finds useful information about a domain and encodes that information in a structured form, suitable for populating databases.