Lexicons versus Corpora for Measures of Semantic Distance

Measures of semantic distance (or, inversely, semantic relatedness) have many applications in Computational Linguistics. There are three basic approaches to measuring semantic distance: lexicon-based algorithms, corpus-based algorithms, and hybrids. In an otherwise excellent paper on lexicon-based measures, Budanitsky and Hirst criticize corpus-based measures. I discuss their criticisms here.

Budanitsky and Hirst give three criticisms of corpus-based measures:

First, while semantic relatedness is inherently a relation on concepts, as we emphasized in section 1.1, distributional similarity is a (corpus-dependent) relation on words. In theory, of course, if one had a large-enough sense-tagged corpus, one could derive distributional similarities of word-senses. But in practice, apart from the lack of such corpora, distributional similarities are promoted exactly for applications such as various kinds of ambiguity resolution in which it is words rather than senses that are available.

The problem with this criticism is that Pantel and Lin have shown that an unsupervised clustering algorithm can automatically discover word senses from text, without the need for manual sense tags. That is, distributional (corpus-based) approaches can infer word senses (concepts) from patterns of word usage.

Second, whereas semantic relatedness is symmetric, distributional similarity is a potentially asymmetrical relationship. If distributional similarity is conceived of as substitutability, as Weeds and Weir (2005) and Lee (1999) emphasize, then asymmetries arise when one word appears in a subset of the contexts in which the other appears; for example, the adjectives that typically modify apple are a subset of those that modify fruit, so fruit substitutes for apple better than apple substitutes for fruit. While some distributional similarity measures, such as cosine, are symmetric, many, such as alpha-skew divergence and the co-occurrence retrieval models developed by Weeds and Weir, are not.

There are at least three problems with this criticism: (1) The popular cosine similarity measure escapes this criticism. (2) If a function f(x,y) is asymmetrical, we can easily define a new symmetrical function g(x,y) as the average of f(x,y) and f(y,x). (3) There is evidence from cognitive psychology that human judgments of similarity are asymmetrical.
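Points (1) and (2) can be illustrated with a minimal sketch. It assumes one common formulation of the alpha-skew divergence, KL(p || alpha*q + (1-alpha)*p), and uses made-up distributions over modifying adjectives for "apple" and "fruit". The function names and the toy numbers are my own illustration, not anything taken from Budanitsky and Hirst or from Lee.

```python
import math


def skew_divergence(p, q, alpha=0.99):
    """Alpha-skew divergence KL(p || alpha*q + (1-alpha)*p).

    p and q map contexts to probabilities. The result generally
    changes when p and q are swapped, so it is asymmetrical.
    """
    total = 0.0
    for context, p_c in p.items():
        if p_c == 0.0:
            continue  # zero-probability contexts contribute nothing
        mix = alpha * q.get(context, 0.0) + (1.0 - alpha) * p_c
        total += p_c * math.log(p_c / mix)
    return total


def symmetrize(f):
    """Build g(x, y) as the average of f(x, y) and f(y, x)."""
    def g(x, y):
        return 0.5 * (f(x, y) + f(y, x))
    return g


def cosine(p, q):
    """Cosine of two sparse vectors stored as dicts; symmetric by construction."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical toy data: the adjectives that modify "apple" are a
# subset of the adjectives that modify "fruit".
apple = {"red": 0.5, "ripe": 0.5}
fruit = {"red": 0.25, "ripe": 0.25, "tropical": 0.25, "dried": 0.25}

symmetric_skew = symmetrize(skew_divergence)

print(skew_divergence(apple, fruit), skew_divergence(fruit, apple))
print(symmetric_skew(apple, fruit), symmetric_skew(fruit, apple))
print(cosine(apple, fruit), cosine(fruit, apple))
```

The first line of output shows two different values, the second shows the same value twice, and the third shows that cosine gives the same answer in either direction. Averaging is only one way to symmetrize; taking the minimum or maximum of the two directions would work just as well for the purposes of this argument.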

Third, lexical semantic relatedness depends on a pre-defined lexicographic or other knowledge resource, whereas distributional similarity is relative to a corpus. In each case, matching the measures to the resource is a research problem in itself, as this paper and Weeds (2003) show, and anomalies can arise. But the knowledge source for semantic relatedness is created by humans and may be presumed to be (in a weak sense) “true, unbiased, and complete”. A corpus, on the other hand, is not.

Bias in a corpus can be minimized by gathering a large amount of text; for example, a large collection of web pages. A manually generated lexicon, such as WordNet, will never match the coverage of Google. There is much bias involved in choosing which words and which word senses should be included in WordNet. There is much less bias in the variety of text indexed by Google.

We now turn to the hypothesis that distributional similarity can usefully stand in for semantic relatedness in NLP applications such as malapropism detection. Weeds (2003) considered the hypothesis in detail. She carried out a number of experiments using data gathered from the British National Corpus on the distribution of a set of 2000 nouns with respect to the verbs of which they were direct objects, comparing a large number of proposed measures of distributional similarity. … She found performance to be “generally fairly poor” (page 162), and undermined by the effects of varying word frequencies.

The British National Corpus consists of about 100 million words. This is very small by today’s standards. In my recent work, I have used a corpus of about 50,000,000,000 words, 500 times the size of the BNC.

A popular test for evaluating measures of semantic distance is a collection of 80 multiple-choice synonym questions from the Test of English as a Foreign Language (TOEFL). The highest TOEFL score for a lexicon-based measure is 78.75% (Jarmasz and Szpakowicz, 2003). The highest score for a corpus-based measure is 92.50% (Rapp, 2003). The highest score for a hybrid measure is 97.50% (Turney et al., 2003). To me, this is much more persuasive than the arguments of Budanitsky and Hirst.
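For readers unfamiliar with the test, here is a rough sketch of how a measure of semantic distance answers a multiple-choice synonym question, and of what the percentages above mean in terms of questions answered correctly out of 80. The question, the word vectors, and the function names are hypothetical, chosen only for illustration. None of the cited systems is claimed to work exactly this way.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_synonym_question(stem, choices, vectors):
    """Pick the choice whose vector is most similar to the stem word's vector."""
    return max(choices, key=lambda choice: cosine(vectors[stem], vectors[choice]))


def percent_correct(num_correct, num_questions=80):
    """Convert a count of correct answers into a percentage score."""
    return 100.0 * num_correct / num_questions


# Hypothetical co-occurrence vectors; the numbers are invented.
vectors = {
    "levied": [4.0, 1.0, 0.0],
    "imposed": [3.0, 1.0, 0.0],
    "believed": [0.0, 2.0, 3.0],
    "requested": [1.0, 3.0, 1.0],
    "correlated": [0.0, 0.0, 4.0],
}

choices = ["imposed", "believed", "requested", "correlated"]
print(answer_synonym_question("levied", choices, vectors))

# With 80 questions, the reported scores correspond to these counts.
for num_correct in (63, 74, 78):
    print(num_correct, percent_correct(num_correct))
```

With 80 questions, each question is worth 1.25 percentage points, so 78.75%, 92.50%, and 97.50% correspond to 63, 74, and 78 correct answers respectively.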

7 Responses to “Lexicons versus Corpora for Measures of Semantic Distance”

  1. Ultimately though, isn’t a lexicon just a cleaned up statistical aggregate?

    I would expect the Computational Linguistics community to have settled on hybrid methods a long time ago. It seems difficult to believe that lexicon-based algorithms can scale up to the complexity of the task. Yet, I would expect lexicons can almost always help.

    Now, I am not part of this community, so maybe it is not so obvious.

  2. The Lexicon vs Corpus debate seems related to the Symbol Grounding Problem: should we design symbols by hand, consciously trying to capture what we think is relevant/useful (designing semantic categories such as Object or Event, for example) or should we rely on the corpus data to derive useful semantic categories?

    One other argument in favor of the Corpus view is that semantic categories derived from a corpus are cognitively real, i.e. have similarities with semantic categories in terms of which humans organize their vocabulary and in fact also their non-linguistic memory, such as gradual category membership.

  3. I don’t think Pantel and Lin’s algorithm could be called “unsupervised”. I’m not entirely sure what strictly constitutes supervised and unsupervised so I may be off the mark here. The parser Minipar that they use to parse the corpus consults a considerably large amount of grammatical data while it gets the context information that they calculate their PMI weightings from. To me that seems like “semi-supervised” and not purely corpus based.

    I agree with everything you’ve pointed out here otherwise though. Thanks for the good read, I’ve been enjoying some of your other posts too. Cheers

  4. “I don’t think Pantel and Lin’s algorithm could be called ‘unsupervised’.”

    Good point. Here is a paper by Rapp on fully unsupervised word sense discovery, including references to several other papers:

    http://www.amtaweb.org/summit/MTSummit/FinalPapers/19-Rapp-final.pdf

    By the way, here is more information about the TOEFL results:

    http://aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions

    “Thanks for the good read, I’ve been enjoying some of your other posts too.”

    Thanks!

  5. Hmm, I consider lemmatisation a form of semi-supervision too, simply because it consults another resource. I know lemmatisation is a small step away from stemming, but it still makes the considerable jump of consulting external resources. Once again, I’m open to corrections on this point.

    From the abstract of Karp’s paper:
    This paper presents a morphological lexicon for English that handle more than 317000 inflected forms derived from over 90000 stems.
    http://citeseer.ist.psu.edu/585081.html

    The reason I find this interesting, is that if we eliminate the external language specific resources for supervision, we will have an engine that should perform well on any symbolic natural language.

    Sorry to analyse things so closely, I’m actually writing a paper where this information is all very useful. So really… thanks for making me think :)

  6. “Hmm, I consider lemmatisation a form of semi-supervision too, simply because it consults another resource.”

    Another good point. Here are some thoughts:

    (1) In Rapp’s work (and in my own work), lemmatization and stemming are ways of compensating for limited data. The larger the corpus, the less need there is for morphological analysis. Lemmatization and stemming generalize across minor word variations, which extracts more information from the data, but can also introduce noise. For example, “water” (noun) and “waters” (noun) have different meanings. With a sufficiently large corpus, it would be better not to lemmatize.

    (2) Stemming can be as simple as replacing the last few characters in a word with a wildcard. See Table V in Corpus-based learning of analogies and semantic relations:

    http://arxiv.org/abs/cs.LG/0508103

    (3) There is work in unsupervised stemming:

    http://www.cis.hut.fi/morphochallenge2007/
    http://www.cis.hut.fi/morphochallenge2005/

  7. (1) Yep, this is the very reason I don’t use stemming when processing text. I don’t have any figures on what is sensible in terms of corpus size. One day I dream of having a system so good that stemming/lemmas doesn’t make a difference, because the introduced ambiguity can still be disambiguated effectively by higher order information.

    Interesting stuff about stemming there. Thanks for the links. (yet more reading to do) Still gotta find a sane way to quantify the results of what I’m working on. Confirmation date is approaching too :)

    Cheers
