Comments on:
Benoît Lemaire and Guy Denhière
Effects of High-Order Co-occurrences on Word Semantic Similarity
PMI-IR estimates the semantic similarity between a pair of words by how frequently they co-occur within a certain window of text. This simple measure of similarity is surprisingly good at recognizing synonyms: it seems that synonyms often appear close together in text. LSA goes beyond this simple measure by comparing the contexts in which the two words appear: synonyms tend to appear in similar contexts. Technically, we may say that PMI-IR uses first-order co-occurrence, whereas LSA includes higher-order co-occurrence information. PMI-IR is easier to compute than LSA, especially for very large collections of text, but PMI-IR ignores higher-order information. How useful is this higher-order information?
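To make the first-order idea concrete, here is a minimal sketch of pointwise mutual information computed from paragraph-level co-occurrence counts. It is an illustration only: PMI-IR itself works from search-engine hit counts over a very large corpus, whereas the function name `pmi`, the whitespace tokenizer, and the four-paragraph toy corpus below are all assumptions made for this example.

```python
import math


def pmi(paragraphs, x, y):
    """Pointwise mutual information of words x and y.

    Co-occurrence is counted at the paragraph level (an assumption;
    any window size could be used instead).
    """
    n = len(paragraphs)
    tokenized = [set(p.lower().split()) for p in paragraphs]
    n_x = sum(1 for words in tokenized if x in words)
    n_y = sum(1 for words in tokenized if y in words)
    n_xy = sum(1 for words in tokenized if x in words and y in words)
    if n_xy == 0:
        # The words never co-occur, so the log is undefined; report -inf.
        return float("-inf")
    # log2( p(x, y) / (p(x) * p(y)) )
    return math.log2((n_xy / n) / ((n_x / n) * (n_y / n)))


# Hypothetical toy corpus, one string per paragraph.
toy_corpus = [
    "the car or automobile was parked",
    "an automobile is a car",
    "flowers bloom in spring",
    "the river runs north",
]

score_related = pmi(toy_corpus, "car", "automobile")
score_unrelated = pmi(toy_corpus, "car", "river")
```

Note that the score depends only on paragraphs that contain X or Y. Paragraphs that contain neither word change nothing but the total count, which is exactly the higher-order information that this kind of measure cannot see.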
Lemaire and Denhière have performed an elegant experiment that answers this question. Their idea is to incrementally feed text into LSA and watch how the similarity measure evolves with each new paragraph of text. For a given pair of words, X and Y, each paragraph is assigned to one of five categories:
- occurrence of X but not Y,
- occurrence of Y but not X,
- first-order co-occurrence (occurrence of both X and Y),
- second-order co-occurrence (occurrence of neither X nor Y, but occurrence of at least three words, Z1, Z2, Z3, where each Zi has first-order co-occurrences with X and with Y in previously seen paragraphs),
- three-or-more-order co-occurrence (anything not covered by the other four categories).
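The five categories above can be sketched as a small classifier. This is my reading of the description, not the authors' code: the function name `classify_paragraphs`, the label strings, the whitespace tokenizer, and the toy paragraphs are all assumptions.

```python
def classify_paragraphs(paragraphs, x, y, min_bridges=3):
    """Label each paragraph, in order, with one of the five categories.

    A paragraph containing neither x nor y counts as second-order when it
    holds at least `min_bridges` words that have each co-occurred with x
    and with y in previously seen paragraphs.
    """
    seen_with_x = set()  # words that have shared a paragraph with x so far
    seen_with_y = set()  # words that have shared a paragraph with y so far
    labels = []
    for text in paragraphs:
        words = set(text.lower().split())
        has_x, has_y = x in words, y in words
        if has_x and has_y:
            label = "first-order"
        elif has_x:
            label = "x-only"
        elif has_y:
            label = "y-only"
        else:
            bridges = words & seen_with_x & seen_with_y
            if len(bridges) >= min_bridges:
                label = "second-order"
            else:
                label = "three-or-more-order"
        labels.append(label)
        # Update the history only after labelling, so that only
        # *previously seen* paragraphs influence the current label.
        others = words - {x, y}
        if has_x:
            seen_with_x |= others
        if has_y:
            seen_with_y |= others
    return labels


# Hypothetical toy input, one string per paragraph.
toy_paragraphs = [
    "car road wheel engine",
    "automobile road wheel engine",
    "road wheel engine repair",
    "car automobile dealer",
    "flowers bloom in spring",
]
labels = classify_paragraphs(toy_paragraphs, "car", "automobile")
```

In the toy input, the third paragraph mentions neither word, but "road", "wheel" and "engine" have each already appeared with both "car" and "automobile", so it is labelled second-order.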
The similarity of X and Y is measured before and after the addition of each new paragraph, and the gain or loss in similarity is noted for each of the five categories. The gains and losses were averaged over 28 pairs of words as the number of paragraphs was incremented from 2,000 to 13,637. The average gain (loss) for each of the five categories follows:
- occurrence of X but not Y: -0.15
- occurrence of Y but not X: -0.19
- first-order co-occurrence: +0.34
- second-order co-occurrence: +0.05
- three-or-more-order co-occurrence: +0.09
The results suggest that first-order co-occurrence is the largest single contributor to similarity, but that higher-order co-occurrence (the last two categories, +0.14 combined) also makes a substantial contribution.
The authors conclude, “Our simulation shows that, although semantic similarity is largely associated to co-occurrence, which is coherent with the literature, the latter overestimates the former. High-order co-occurrences need to be taken into account. By only considering the frequency of co-occurrence as a measure of the semantic similarity, people probably introduce a bias.” This conclusion is slightly misleading. It is true that first-order co-occurrence (category 3) overestimates the semantic similarity (the sum of all five categories), but PMI-IR takes into account the cases where X appears without Y, and vice versa, so PMI-IR is best represented by the sum of the first three categories. Thus the results suggest that PMI-IR would tend to underestimate similarity as measured by LSA. Of course, the experiment is only concerned with LSA, not with PMI-IR, but it seems reasonable to assume that PMI-IR is approximated by the sum of the first three categories.
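The arithmetic behind this argument is easy to check. The sketch below uses the five average gains reported above. Treating the sum of the first three categories as a stand-in for PMI-IR is the assumption made in this post, not something the experiment measured, and the variable names are mine.

```python
# Average gain (or loss) in LSA similarity per category, as reported above.
gains = [
    ("occurrence of X but not Y", -0.15),
    ("occurrence of Y but not X", -0.19),
    ("first-order co-occurrence", 0.34),
    ("second-order co-occurrence", 0.05),
    ("three-or-more-order co-occurrence", 0.09),
]

# First-order co-occurrence alone (category 3).
first_order_only = gains[2][1]

# Assumed proxy for PMI-IR: the first three categories.
pmi_ir_proxy = round(sum(value for _, value in gains[:3]), 2)

# LSA similarity: all five categories.
lsa_total = round(sum(value for _, value in gains), 2)

# Contribution that the proxy misses (categories 4 and 5).
higher_order = round(lsa_total - pmi_ir_proxy, 2)
```

Category 3 alone (+0.34) overshoots the five-category total of +0.14, which is the authors' point. The three-category proxy nets out to zero, because the losses from X-only and Y-only paragraphs cancel the first-order gain, so it undershoots the total by the full +0.14 contributed by higher-order co-occurrence.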
This paper presents a very interesting analysis of the role of different types of co-occurrence in LSA, and it provides some new insight into how LSA works. The results suggest that LSA would outperform PMI-IR if LSA could be scaled up to the quantity of text that PMI-IR can handle.
Sounds interesting, although the post doesn’t make it clear where X and Y come from. Are they synonyms? Are they arbitrary pairs of words? And what is the gold standard of similarity? Is it taken to be LSA?
Note that Landauer and Dumais’ paper that mentions Plato’s problem also provides a quantification of the effect of indirect word co-occurrence. In particular, they show that performance in properly identifying the synonymy of X and Y increases with the addition of text that doesn’t even contain X and Y.
I believe that, given equivalent amounts of text, LSA makes better use of it than PMI-IR. But PMI-IR can process a whole lot more text than LSA. It would be interesting to quantify LSA’s advantage (if possible) in terms of gigabytes of training text: it is better to use PMI-IR if it lets you handle some factor N more training text.
“Sounds interesting, although the post doesn’t make it clear where X and Y come from.”
I left out many details, but interested readers can click on the title of the paper above to get the full story.
The 28 pairs of words were arbitrarily selected from French association norms. We selected 28 words and their first associates, to make sure that their semantic similarity would not be too low. Because this experiment was very time-consuming (3 weeks of computation!) we could not use all the words of the association norms.
“Because this experiment was very time-consuming (3 weeks of computation!) we could not use all the words of the association norms.”
A very fast version of SVD is described here:
Drineas, P., Frieze, A., Kannan, R., Vempala, S., and Vinay, V. (2004). Clustering Large Graphs via the Singular Value Decomposition. Machine Learning, 56(1): 9-33.
http://www-math.mit.edu/~vempala/papers/dfkvv.pdf
http://cs-www.cs.yale.edu/homes/kannan/Papers/5authorsjournal.pdf