There is a steady trickle of visitors to my post on Why Does SVD Improve Similarity Measurement?, so I gave this question a bit more thought. In that post, I offered three hypotheses about why SVD helps — high-order co-occurrence, latent meaning, and noise reduction — and I said that I didn’t know which hypothesis was correct; perhaps they are all really saying the same thing. On further reflection, it seems to me that all three of these hypotheses imply that, in the limit, as the corpus size approaches infinity, SVD would not be necessary. That is, SVD is a way of coping with small corpora. Of course, a corpus is like a hard drive: no matter how big it is, you will eventually discover that it’s too small. But I think there are (at least) two aspects to smallness, and we may be able to separate them, to see which is most important.
Let’s suppose we use a small corpus (i.e., any corpus) to generate the usual term-document matrix. Let’s consider the raw frequency values in this matrix, before we do any tf-idf or SVD processing. Suppose we convert the frequencies to estimated probabilities, by dividing each cell in the matrix by the total of all of the cells in the matrix. Let’s hypothesize that the estimated probabilities will converge to true probabilities in the limit, as the corpus size grows to infinity. (This hypothesis is slightly strange. We would need to keep the number of terms and documents constant and accommodate the growing corpus by expanding the documents. Let’s ignore this point.) Ideally, an algorithm such as truncated SVD would take the estimated probabilities and convert them to true probabilities (although that’s not likely to happen, except in our dreams).
There are two (not entirely distinct) ways in which the estimated probabilities may differ from the true probabilities, variance and sparsity. The estimated probabilities may be very rough, noisy estimates of the true probability. As the corpus grows, we’ll see these estimates jumping up and down. This is variance. Another problem is that most of the estimates will be zero. The raw frequencies of most of the elements in the matrix will be zero, and hence the estimated probabilities will mostly be zero. This is sparsity.
In general, when we talk about smoothing data, we don’t distinguish variance and sparsity, because, at an abstract level, they are essentially the same problem: noise. SVD reduces sparsity (i.e., it increases density) and it reduces variance; SVD makes no distinction between sparsity and variance. However, there is a way for us to separate these two things. We can apply truncated SVD and then restore any zero values that were removed by SVD. (Or we can apply truncated SVD and then restore any values that were nonzero before SVD to their original values, so that density increases, but the original nonzero values do not change.) We can compare regular SVD with sparsity-preserving SVD, to see which is most important (in terms of performance in a particular application, such as information retrieval), increasing density or decreasing variance.
For a fair comparison, we need to independently adjust sparsity-smoothing and variance-smoothing. That is, if we are comparing regular SVD to sparsity-preserving SVD, to see whether sparsity-smoothing is important, then we need to make sure that regular SVD and sparsity-preserving SVD are doing the same amount of variance-smoothing (by adjusting k, the rank of the truncated SVD). It should be possible to estimate the amount of variance-smoothing by some kind of Monte Carlo experiment.
This is related to the original three hypotheses (high-order co-occurrence, latent meaning, and noise reduction), in that high-order co-occurrence is concerned with sparsity more than variance. If two terms have a high semantic similarity, yet they never co-occur in any document, then it seems that we have a problem of sparsity.
Filed under: Computational Linguistics, Semantics | Tagged: data analysis, SVD, text analysis
There’s a third kind of noise that is relevant here (in addition to sparsity and variance), aliasing:
http://en.wikipedia.org/wiki/Aliasing
Suppose the true probability for a given cell in the matrix is 3/200, but the total of the frequencies of all of the cells is 100. The best possible estimated probability is either 1/100 or 2/100, since it is impossible to have a frequency of 1.5. Unfortunately, these two estimates differ by a factor of 2.
Perhaps SVD improves similarity measurement by anti-aliasing:
http://en.wikipedia.org/wiki/Anti-aliasing
That is, perhaps SVD would smooth the hypothetical cell from 1/100 to 1.5/100. We can test this hypothesis by performing a truncated SVD and then re-aliasing the matrix. Will there be a difference in performance on a given task, comparing regular truncated SVD to re-aliased truncated SVD?
Aliasing is closely related to sparsity, since (it seems to me that) the worst kind of aliasing is when a small probability is aliased to zero. Given Zipf’s Law, we can expect this kind of aliasing (small to zero) to be very common in a term-document matrix:
http://en.wikipedia.org/wiki/Zipf%27s_law
I would abstract out this problem and ask, more generally, whether dimensionality reduction (of any kind) generally helps similarity measurements.
Indeed, there are various naive or fancier alternatives to the SVD…
(For fancier alternatives… project on a more general manifold than an hyperplane…)
I would abstract out this problem and ask, more generally, whether dimensionality reduction (of any kind) generally helps similarity measurements.
Yes, wherever I say “SVD” or “truncated SVD”, substitute your favourite smoothing algorithm (PLSA, NMF, KPCA, IS, LDA, etc.). Also, wherever I talk about a “term-document matrix”, substitute your favourite kind of matrix. I’m interested in the task of measuring semantic similarity, but this post applies to various other tasks, such as collaborative filtering.
There’s evidence that smoothing improves similarity measurements. That is, smoothing helps. But what kind of smoothing is best? Is it variance-smoothing, sparsity-smoothing, anti-aliasing, or something else that’s helping the most?
Two quick comments:
(1) I don’t quite understand what you mean by “converge to the true probabilities.” The input matrix (if you normalize) can be interpreted as a maximum likelihood estimate of a unigram (multinomial) distribution of words for a given document. But I don’t see how this converges. Do you mean that the author stopped writing the document prematurely and the “true probabilities” are the probabilities that would be recovered had he written ad infinitum? This makes some sense to me, but doesn’t jive with the “problem” of getting more and more documents: why should you care about getting more documents in this notion of a limit.
(2) regarding sparsity versus variance, I quite like the Mohri and Roark paper on sampling versus structural zeros: http://www.cs.nyu.edu/~mohri/postscript/zero.pdf
(1) I don’t quite understand what you mean by “converge to the true probabilities.”
Perhaps I should have talked about some kind of generative probability model, in which there are some underlying factors (or latent variables) that probabilistically generate some surface text. Then it would be clear what I mean by “true probabilities”. However, I wanted to avoid assuming a factor model. Part of the question here is whether a factor model is the right assumption.
Do you mean that the author stopped writing the document prematurely and the “true probabilities” are the probabilities that would be recovered had he written ad infinitum?
Yes, this is what I mean. Thanks for expressing it more clearly.
This makes some sense to me, but doesn’t jive with the “problem” of getting more and more documents: why should you care about getting more documents in this notion of a limit.
That’s why I said, “We would need to keep the number of terms and documents constant and accommodate the growing corpus by expanding the documents.”
(2) regarding sparsity versus variance
Thanks for the reference!
I’m posting through a migraine so forgive me if this is incoherent or redundant to what you’ve already posted:
I believe the general principle which may encompass all of these specifics (including the utility of other forms of dimensionality reduction besides svd) is the following:
SVD rotates the dataset so that the axes are aligned with, and in sorted order of, the most common correlations (or joint aspects).
When deciding what we can infer from a finite number of observations, we need to blend those observations with some priors; and the more observations we have, the less relevant the priors become. For SVD, we know the prior for all but the first vector is zero.
When we crop the SVD to the top-N dimensions, we’re doing a crude blending of priors, noting that the first few vectors overwhelm the priors while the last few are overwhelmed by their priors, and instead of properly weighing each vector accordingly we’re just picking a threshold where we think the net return vs. priors goes negative in the unblended case.
Said differently, as applies to the case of document classification for instance, by rotating the data through an SVD (a one-to-one mapping which by itself does nothing), we, for instance, replace the occasional co-occurance of the words “keyboard” and “html” with the much much more common co-occurance of computer-related words with other computer-related words. So, the rotation itself changes nothing by itself, but when we measure our signals in the rotated space we find some of them (first N) much stronger (and able to overcome priors) than in the unrotated version.
(For this reason it’s long been my opinion that some sort of intelligent and gradual diminishing of the singular value before reconstruction would result in better performance than merely clipping to zero past some point. In principle as the dataset grows so would the effective N as the zero-prior gets progressively overridden.)
So, I guess I’m saying it’s not about sparsity or variance or whatnot, per se, in the data matrix, but rather about blending in more-optimal off-axis priors rather than using any sort of naive priors aligned with the raw data.
Anyway, dunno if that’s at all relevant to the point of your q, but it’s what came to mind.