Thursday, September 22, 2011

Re: CSE 494/598 Question

[I am answering this on the blog rather than in class because I want to use the class for more 
immediate issues in understanding LSI--which, for example, prevent you from doing the homework problems.
These questions consider the higher-order consequences of LSI.]

On Thu, Sep 22, 2011 at 6:25 AM, Shubhendra Singh wrote:
I have two questions

1. How does LSI tackle errors in text like spelling errors etc., or are they ignored?

LSI doesn't directly correct spelling. However, its analysis may be able to tell it that "computer" and "copmuter" are functional synonyms (like "computer" and "science") because they tend to occur in the same contexts.
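To make this concrete, here is a minimal sketch on a toy corpus. Everything in it is an assumption made for illustration: the six terms, the five documents, the choice to keep two factors, and the use of plain power iteration in place of a library SVD routine. The misspelling "copmuter" never shares a document with "computer", but it does share documents with "software" and "hardware".

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def top_eigenpairs(G, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration plus deflation."""
    G = [row[:] for row in G]
    n = len(G)
    pairs = []
    for _ in range(k):
        v = normalize([1.0 + 0.1 * i for i in range(n)])
        for _ in range(iters):
            v = normalize([dot(row, v) for row in G])
        lam = dot(v, [dot(row, v) for row in G])
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

# Toy term-document matrix (rows = terms, columns = documents d1..d5).
terms = ["computer", "copmuter", "software", "hardware", "recipe", "oven"]
A = [
    [1, 1, 0, 0, 0],  # computer
    [0, 0, 1, 0, 0],  # copmuter (misspelling; never co-occurs with "computer")
    [1, 1, 1, 0, 0],  # software
    [1, 0, 1, 0, 0],  # hardware
    [0, 0, 0, 1, 1],  # recipe
    [0, 0, 0, 1, 1],  # oven
]

# Eigenvectors of A * A^T are the term-side singular vectors of A;
# the eigenvalues are the squared singular values.
G = [[dot(ri, rj) for rj in A] for ri in A]
pairs = top_eigenpairs(G, k=2)

# Term coordinates in the 2-factor space, scaled by the singular values.
coords = {t: [math.sqrt(lam) * v[i] for lam, v in pairs]
          for i, t in enumerate(terms)}

raw_sim = cosine(A[0], A[1])
lsi_sim = cosine(coords["computer"], coords["copmuter"])
unrelated_sim = cosine(coords["computer"], coords["recipe"])
print(raw_sim, lsi_sim, unrelated_sim)
```

In the raw term-document space the two spellings have similarity 0, since they share no document. In the reduced space their similarity should come out close to 1, while "computer" and "recipe" stay near 0. The toy corpus is deliberately clean; on real text the effect is noisier and depends on how often the misspelling occurs.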

2. How does the scenario change when new phrases are added? For example, let's say "cloud computing" is the new buzzword. I think a few years back this phrase might not have been tokenized as a unit, but there must have been several pages on "cloud" and "computing" separately. So do things change in the same way as you showed in class for the two documents "D1" and "D2", which were added with 50 occurrences of "Database" and "SQL" respectively?

If the semantics of the usage have changed over time, then unless you do LSI on a corpus reflecting the new usage, you won't capture the new semantics. (Consider, for example, running the query "computer" on a corpus of pre-1900 documents--you will get a lot of information about human computers, since "computer" then meant a person whose job was to calculate/compute.)

If "cloud computing" was not used in the original document corpus with its modern meaning, and you are not recomputing the SVD (i.e., you are getting by with the q*TF approach to convert the query), then LSI can't capture the new meaning of "cloud computing".
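Here is a small sketch of why folding in a query cannot help. It assumes that "TF" above denotes the term-factor matrix from the old SVD and that the query is a plain bag of words. The matrix values, the two factor labels, and the `fold_in` helper are all made up for illustration; they do not come from any real corpus. (Some formulations additionally scale the result by the inverse singular values; that step is omitted here to match the q*TF description.)

```python
# Hypothetical term-factor (TF) matrix from an SVD computed on an OLD corpus.
# Column 0 loosely means "weather", column 1 loosely means "computers".
# The numbers are invented for illustration only.
TF = {
    "cloud":     [0.9, 0.0],
    "rain":      [0.8, 0.1],
    "computing": [0.0, 0.9],
    "server":    [0.1, 0.8],
}

def fold_in(query_terms, tf):
    """Project a bag-of-words query into factor space as q*TF (no SVD recomputation)."""
    k = len(next(iter(tf.values())))
    vec = [0.0] * k
    for term in query_terms:
        row = tf.get(term)
        if row is None:
            continue  # term unseen by the old SVD: it contributes nothing
        vec = [v + r for v, r in zip(vec, row)]
    return vec

q = fold_in(["cloud", "computing"], TF)
unseen = fold_in(["kubernetes"], TF)
print(q)       # half "weather", half "computers"
print(unseen)  # all zeros
```

The folded-in query can only be a combination of the old rows for "cloud" and "computing", so it lands between the weather factor and the computers factor. No factor for the modern sense exists until the SVD is recomputed on a corpus that uses the phrase that way, and a term the old corpus never saw is simply dropped.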


thank you,
