Why Does SVD Improve Similarity Measurement?

In response to my earlier post on Effects of High-Order Co-occurrences on Word Semantic Similarity, Tom Landauer sent me the following note:

You have given me an idea. Because I have just been asked again to review papers that say that the way LSA works is by indirect associations, it seems that few have seen my experiments and logic. It looks like maybe posting it on your blog would do a better job. Is that possible?

Here’s the kernel. Experiments in which all direct contextual association has been removed from the training corpus for randomly selected pairs show very little effect on word-word similarities, and that separate occurrences have almost as much. This makes it seem unlikely that indirect co-occurrence is the cause of similarity in LSA. The idea might be rescued by hypothesizing that many indirect add up to much, but each weak step would weaken the next if happened by spreading activation, and the idea surely needs successful simulation or math to be believed. (Of course, if one is interested only in a technology for discovering whether words are related (as I think maybe you are?) then it might not matter?)

The simultaneous equation constraint, passage meanings required to be the sum of word meanings, as enforced by SVD, seems a much better candidate for the job, and at the same time provides an explicit account of passage meanings, which, for the most part, are the medium in which linguistic information/meaning is actually conveyed.

I had not fully appreciated this point before Tom’s note. It matters to me because understanding why SVD helps may lead to better measures of similarity.
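One way to write out the simultaneous equation constraint, assuming the standard rank-k truncated SVD of the word-chunk matrix X (rows are words, columns are chunks). The notation is mine, not Tom's:

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
\Sigma_k v_j \;=\; U_k^{\top} x_j \;=\; \sum_i x_{ij}\, u_i
```

Here x_j is column j of X (the count of each word in chunk j), u_i is the k-dimensional vector for word i (row i of U_k), and v_j is the k-dimensional vector for chunk j (row j of V_k). The second identity follows from U_k^T X = Σ_k V_k^T. Read this way, each chunk vector, scaled by the singular values, is the count-weighted sum of the vectors of the words it contains, and every chunk in the corpus imposes one such equation on the same shared set of word vectors.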

It is clear that using the Singular Value Decomposition to reduce the dimensionality of the word-chunk (term-document) matrix results in a better measure of word-word similarity. It seems to me that there are at least three hypotheses about why SVD improves the similarity measure:

  1. High-order co-occurrence: Dimension reduction with SVD is sensitive to high-order co-occurrence information (indirect association) that is ignored by PMI-IR and cosine similarity measures without SVD. This high-order co-occurrence information improves the similarity measure.
  2. Latent meaning: Dimension reduction with SVD creates a (relatively) low-dimensional linear mapping between row space (words) and column space (chunks of text, such as sentences, paragraphs, or documents). This low-dimensional mapping captures the latent (hidden) meaning in the words and the chunks. Limiting the number of latent dimensions forces a greater correspondence between words and chunks. This forced correspondence between words and chunks (the simultaneous equation constraint) improves the similarity measurement.
  3. Noise reduction: Dimension reduction with SVD removes random noise from the matrix (it smooths the matrix). The raw matrix contains a mixture of signal and noise. A low-dimensional linear model captures the signal and removes the noise. (This is like fitting a messy scatterplot with a clean linear regression equation.)
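To make the first hypothesis concrete, here is a minimal sketch with NumPy on a made-up toy corpus of four chunks. The words, the matrix, and the choice of k = 2 are all assumptions for illustration. "doctor" and "physician" never appear in the same chunk, but both appear with "hospital", so their raw cosine is zero while their cosine after rank-2 SVD is close to one.

```python
import numpy as np

words = ["doctor", "physician", "hospital", "guitar", "piano", "music"]

# Hypothetical word-chunk matrix: rows are words, columns are chunks.
X = np.array([
    [1, 0, 0, 0],  # doctor
    [0, 1, 0, 0],  # physician
    [1, 1, 0, 0],  # hospital
    [0, 0, 1, 0],  # guitar
    [0, 0, 0, 1],  # piano
    [0, 0, 1, 1],  # music
], dtype=float)


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def reduced_word_vectors(X, k):
    """Word vectors from a rank-k truncated SVD (rows of U_k scaled by the singular values)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


W = reduced_word_vectors(X, k=2)
i, j, g = (words.index(w) for w in ("doctor", "physician", "guitar"))

raw_sim = cosine(X[i], X[j])        # no shared chunk, so this is zero
svd_sim = cosine(W[i], W[j])        # linked only through "hospital"
svd_unrelated = cosine(W[i], W[g])  # no link at any order

print(f"raw cosine(doctor, physician): {raw_sim:.3f}")
print(f"SVD cosine(doctor, physician): {svd_sim:.3f}")
print(f"SVD cosine(doctor, guitar):    {svd_unrelated:.3f}")
```

This only shows that truncated SVD can respond to indirect association. It does not show that indirect association is what improves the similarity measure on real corpora, which is the question in dispute.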

It seems possible for all three of these hypotheses to be true. Perhaps all three are even saying the same thing, in different ways.

Lemaire and Denhière’s paper shows that high-order co-occurrence has an impact on the similarity measure, but their paper does not demonstrate that high-order co-occurrence improves the similarity measure (maybe the impact is negative or neutral). Landauer and Dumais’ paper seems to show a benefit from indirect association, but Tom mentions (above) experiments that suggest indirect association is not important. It seems that the evidence for the first hypothesis (high-order co-occurrence) is ambiguous.

In Word Sense Discovery Based on Sense Descriptor Dissimilarity, Reinhard Rapp achieves 92.5% on the TOEFL questions, using LSA with very small chunks (strings of five words, excluding function words). This is considerably better than the 64.4% achieved by Landauer and Dumais with a much larger chunk size (151 words in the average chunk). If the second hypothesis (latent meaning) is correct, Rapp’s results suggest that the latent meaning that LSA can capture is not paragraph meaning or sentence meaning; it is the meaning of a very small number of words in a very localized context, without regard to syntax (no function words).
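For readers who want to experiment with chunk size, the following is a rough sketch of building a word-chunk matrix from five-word chunks after dropping function words. It is not Rapp's actual procedure: the function-word list is a tiny hypothetical stand-in, the chunks here are non-overlapping, and `word_chunk_matrix` is a name I made up.

```python
import numpy as np

# Hypothetical, deliberately tiny function-word list; a real one would be much longer.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was"}


def word_chunk_matrix(text, chunk_size=5):
    """Count matrix with one row per word and one column per chunk of content words."""
    tokens = [t for t in text.lower().split() if t not in FUNCTION_WORDS]
    # Assumption: consecutive, non-overlapping chunks of chunk_size content words.
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    vocab = sorted(set(tokens))
    row = {w: r for r, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(chunks)))
    for c, chunk in enumerate(chunks):
        for w in chunk:
            X[row[w], c] += 1
    return vocab, X


text = ("the doctor was in the hospital and the nurse was on the ward "
        "of the hospital to see a patient in the clinic")
vocab, X = word_chunk_matrix(text, chunk_size=5)
print(vocab)
print(X)
```

Changing `chunk_size` and rerunning the SVD on the resulting matrix is one way to probe how much the locality of the context matters.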

In The Computation of Word Associations, discussing a different experiment, Rapp writes, “We nevertheless tend to interpret the results of our comparison as evidence for the view that SVD is just another method for smoothing that has its greatest benefits for sparse data. However, we do not deny the technical value of the method.” In other words, he believes his experimental results support the third hypothesis, that SVD performs noise reduction.
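The smoothing interpretation can be illustrated with a small synthetic experiment. This is a sketch under strong assumptions: the "true" matrix is exactly rank 2, the noise is Gaussian, and the matrix sizes, noise level, and seed are arbitrary choices of mine. Real word-chunk matrices are sparse counts, so this shows the mechanism only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_chunks, true_rank = 50, 40, 2

# Synthetic low-rank "signal" plus Gaussian noise (both are assumptions).
signal = rng.normal(size=(n_words, true_rank)) @ rng.normal(size=(true_rank, n_chunks))
noisy = signal + 0.5 * rng.normal(size=signal.shape)


def truncate(X, k):
    """Best rank-k approximation of X in the least-squares sense, via SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


smoothed = truncate(noisy, k=true_rank)

# Frobenius-norm distance from the noise-free signal, before and after truncation.
err_noisy = np.linalg.norm(noisy - signal)
err_smoothed = np.linalg.norm(smoothed - signal)

print(f"error of noisy matrix:    {err_noisy:.2f}")
print(f"error after rank-{true_rank} SVD:   {err_smoothed:.2f}")
```

In this setup the truncated matrix lands closer to the signal than the noisy matrix does, which is the regression-line analogy from the third hypothesis carried over to matrices.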

Probabilistic Latent Semantic Analysis can produce results that are quite similar to LSA’s results, although there are many differences between the two algorithms. We might hope that this would help us to narrow down the range of hypotheses, but it seems to me that the same three hypotheses apply to PLSA.

In summary, I don’t know which of the three hypotheses is correct. They do not appear to be exclusive, so they might all be correct. They do not appear to be exhaustive, so a fourth hypothesis might be superior to all three. My statement of the three hypotheses (above) is informal. Depending on how the hypotheses are formalized, they might turn out to be mathematically equivalent.

2 Responses

  1. Hello,

    Nice post. I started writing a comment that turned out to be excessively long, so I made my own post in response to yours. Lots of interesting articles on your page, by the way.
